Models comparison table

Yeah, LMArena is questionable. I view it as worthless hype-driven nonsense (both the overall and the coding parts). I remember the times when Sonnet 3.5 (and 3.6) was the best for me, yet the arena ranked 4o or even some Gemini Pro on top. No, from my testing, they were always a few levels under Sonnet, nowhere near the intuitive beast Sonnet was and still is. I believe people on LMArena are trying toy questions, not even the cleverer ones like those from LeetCode.

Though for front-end, https://web.lmarena.ai/leaderboard seems to capture the quality better. I see o3-mini-high recommended more for back-end and planning, so I find it within the realm of possibility for R1 to be better.

Edit: R1 was pretty good in my testing, but yeah, not on par with o3-mini-high or Sonnet 3.6 or 3.7. Had it been priced reasonably, in line with API pricing, it could be a useful middle model: below o3-mini-high, Sonnet, maybe even 4o, but also cheaper.

Yeah, Exercism is similar to LeetCode, but still better (different languages, and I believe also different operations on code, not just writing from scratch) than supposedly general programming benchmarks that are done virtually only in Python, creating an (unrealistically) well-defined function or script, mostly on ML tasks, and usually without types. And this is coming from a dev who likes Haskell and its brevity in well-defined functions, puzzles, and code golf.

A Cursor dev confirmed it is high: O3-mini is LIVE! What version are we getting? - #70 by danperks.

And I agree. It is very strange, but because of how Cursor's pricing works, o3-mini-high is the best value for its quality. In a normal world o3-mini-high would cost something like 2/3 or 1 credit, and R1 should cost something like 1/3 or 1/4. It doesn't even make sense that o3-mini-high costs the same in credits as Haiku, which is much worse, or that Sonnet costs the same as 4o, which is about 1/3 cheaper in the API and so should be 2/3 of a credit… I suspect the Cursor team has some special contract with OpenAI for o3-mini-high and failed to find a US-served R1 at a normal cost. Another thing I recently came across is that there are some issues with the context window in Cursor, some new restrictions, and discussions about it are being censored (here and on Reddit), so that could be another reason for the weird credit pricing.
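To make the "should be 2/3 of a credit" reasoning concrete, here is a rough sketch of what I mean by credits scaling with API price. The function name and the price units are made up for illustration; these are not real API prices and not how Cursor actually computes credits, just the proportional rule I would expect.

```python
from fractions import Fraction


def proportional_credits(model_price, baseline_price, baseline_credits=1):
    """Credits a model would cost if credits scaled linearly with API price.

    Hypothetical helper: baseline_price is the API price of the model
    that costs baseline_credits, in any consistent unit.
    """
    if baseline_price <= 0:
        raise ValueError("baseline_price must be positive")
    return Fraction(baseline_credits) * Fraction(model_price) / Fraction(baseline_price)


# Illustrative units only: the baseline model costs 3 units per request,
# and a model that is one third cheaper costs 2 units.
print(proportional_credits(2, 3))  # 2/3
```

Under that rule, a model priced the same as the baseline stays at 1 credit, and anything cheaper in the API drops below 1 in the same proportion.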
