Why would you say performance benchmarking (aka testing) doesn't bring real value? It's the foundation of the scientific process. I'm asking because, while I really appreciate the new approach, I'm inherently skeptical in science and engineering when someone just says "Trust me, bro!" ;).
At first glance, most of the training data for Claude/GPT4x/Gemini is normal written sentences before tokenization, so this seems to save tokens at the expense of moving further from the model's native language. In fact, @bmadcode had the same concern.
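To make the token-savings half of that trade-off concrete, here's a rough sketch using OpenAI's tiktoken tokenizer (Claude and Gemini use their own tokenizers, so the counts are only approximate); the symbolic/English rule pair below is made up for illustration and isn't taken from either file:

```python
# Rough token-count comparison: a symbolic rule vs. a plain-English equivalent.
# Uses tiktoken's cl100k_base encoding as a stand-in; Claude/Gemini tokenizers differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical rule pair for illustration only (not from either .md file).
symbolic_rule = "Ω₃ = PLAN ∧ ¬execute ∧ output(spec)"
english_rule = (
    "In PLAN mode you may only produce a specification; "
    "you must not execute or modify any code."
)

for label, text in [("symbolic", symbolic_rule), ("english", english_rule)]:
    tokens = enc.encode(text)
    print(f"{label:>8}: {len(tokens):3d} tokens -> {text}")
```

Interestingly, unicode-heavy symbols can themselves split into several tokens, so the savings may be smaller than they look on paper.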
For kicks, I asked Claude 3.7 to compare your symbolic rules (RIPER.Tof.1.0.0.md) with the RIPER-5 rules (RIPER.sigma.memory-bank.1.0.3.md) by @robotlovehuman / johnpeterman72 from this thread. Here's what I got:
> While RIPER.Tof.1.0.0.md offers elegant abstract principles, its mathematical notation and lack of concrete examples make it less suitable for direct LLM implementation. LLMs typically benefit from explicit examples, clear boundaries, and specific procedures - all of which are abundant in the more comprehensive sigma.memory-bank document.
>
> The ideal approach might be to use RIPER.Tof.1.0.0.md as the conceptual foundation, but implement its principles through the detailed mechanisms specified in RIPER.sigma.memory-bank.1.0.3.md for optimal LLM rule compliance.
Finally, all the leaked prompts to date, whether from OpenAI, Anthropic, Cursor, etc., are in plain English: simple, brief, to the point, but all in English.
Now, I understand the insanity of using Claude to reason about this. That's why I was hoping to find some concrete data (e.g. 5 runs of each variant) or some additional historical data/papers on symbolic prompting.
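Just to be clear about what I mean by concrete data, here's a minimal sketch of that kind of 5-runs-per-variant harness; `call_model` and `passes_check` are hypothetical placeholders to be wired to whatever API client and compliance check fit the actual setup, and it assumes both .md files sit in the working directory:

```python
# Minimal A/B harness sketch: N runs per prompt variant, tally rule compliance.
# `call_model` and `passes_check` are placeholders, not a real API or real check.
from collections import defaultdict

N_RUNS = 5
TASK = "Refactor the login handler without changing its behavior."

VARIANTS = {
    "symbolic": open("RIPER.Tof.1.0.0.md").read(),
    "english": open("RIPER.sigma.memory-bank.1.0.3.md").read(),
}

def call_model(system_prompt: str, task: str) -> str:
    # Placeholder: replace with a real API call (Anthropic, OpenAI, ...).
    # Returns a canned reply here just so the sketch runs end to end.
    return "[MODE: PLAN] Here is the specification..."

def passes_check(reply: str) -> bool:
    # Placeholder compliance check, e.g. did the reply declare the expected mode?
    return reply.startswith("[MODE: PLAN]")

results = defaultdict(int)
for name, system_prompt in VARIANTS.items():
    for _ in range(N_RUNS):
        results[name] += passes_check(call_model(system_prompt, TASK))

for name, passed in results.items():
    print(f"{name}: {passed}/{N_RUNS} compliant runs")
```

Even a crude compliance rate like this, over 5 runs per variant, would tell the community a lot more than a one-off comparison.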
Even for the Mermaid-chart-based ones, without data it's hard for the community to judge concretely. There are also several other factors at play, like just-in-time prompt/context injection and memory (which are great, BTW).
Again, thanks for the work. Just trying to ground this in the scientific process.
Cheers!
Sid