We lack evaluation tools

Hi guys!

We all know what happened with the 0.45 => 0.46 version upgrade: most of us noticed a drop in performance. Do we have any general way to evaluate whether one update, prompt, combination of prompts, or set of settings is better than another?

Also, is there a way to run benchmarking tests against my custom modes, settings, or any combination of them?

Of course, we can always release a change and see what the community says, but that feedback loop is slow and noisy.

I feel we lack a solid, reliable way to evaluate both Cursor AI releases and our own experiments.

Any ideas on how to test this, or how we could get better at it?
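
To make concrete what I'm imagining, here is a rough sketch of a minimal eval harness: a fixed set of task prompts, each with a simple pass/fail check, run against whatever produces the output (a release, a custom mode, a prompt variant) and scored as a pass rate. Everything here is hypothetical; `run_agent` is just a placeholder for however you get completions, since Cursor doesn't expose an API like this as far as I know, and the cases are toy examples.

```python
# Minimal sketch of a prompt/settings eval harness (hypothetical, not a Cursor API).
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def evaluate(run_agent: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through the agent and return the pass rate."""
    passed = 0
    for case in cases:
        output = run_agent(case.prompt)
        ok = case.check(output)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
    return passed / len(cases)


if __name__ == "__main__":
    # Toy cases; real ones would be tasks drawn from your own codebase.
    cases = [
        Case(
            "adds numbers",
            "Write a Python one-liner that sums 1..10",
            check=lambda out: "55" in out or "sum(range(1, 11))" in out,
        ),
        Case(
            "mentions tests",
            "Refactor foo() and keep behaviour",
            check=lambda out: "test" in out.lower(),
        ),
    ]

    # Stub agent so the script runs standalone; swap in the real thing.
    def stub_agent(prompt: str) -> str:
        return "sum(range(1, 11))  # = 55, add tests to be safe"

    rate = evaluate(stub_agent, cases)
    print(f"pass rate: {rate:.0%}")
```

The idea would be to run the same case set before and after an upgrade, or across two prompt/settings variants, and compare pass rates. It's crude, but at least it's repeatable, unlike waiting for forum feedback.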