I have been struggling to find good benchmarks for choosing an LLM to use for coding. We now have ~10 models to choose from, all with pros and cons. Does anyone know of a reliable benchmark that is up to date? (Meaning it covers the new o1 etc., maybe even Google Gemini 2.)
I only know SWE-bench, which is focused on coding, but it’s specifically about tools, not models. There’s also LMArena, which has a coding section where you can see a leaderboard of models.