Here are some coding benchmarks for o3 and o4-mini.
The cost of o3 is likely disqualifying for everyday usage, but o4-mini looks very promising.
Benchmark charts (images not shown): Aider, LiveBench, SWE, Codeforces, IMO.
Feel free to add any more, or discuss whether the benchmarks are consistent with the experience in Cursor. 
I think GPT-4.1 now has a free version.
Personally, for TypeScript/Node.js coding, Claude Sonnet and Gemini work much better on my project.
Both 4.1 and o4-mini are currently free.
The API prices for o3-mini and o4-mini are the same, so hopefully when the honeymoon ends it’s also priced at 1/3 of a request.
Interesting.
I agree o3 + Gemini 2.5 Pro could make a good combo. 2.5 Flash is out today, so it’ll be interesting to see where it lands.
I’d also like to see o3 + o4-mini (~half the cost of 4.1).
Back when Claude 3.7 (thinking) was king, I suggested switching the R1 + Claude 3.5 combo to R1 + 3.7 (no thinking), since 3.5 and 3.7 cost the same and R1 + 3.5 was only ~1% behind.
This isn’t a coding benchmark, but ‘g factor’ is highly correlated with most cognitive abilities, plus it’s fascinating.
o3 ranks in the 99th percentile on Mensa Norway, and 2.5 Pro in the 97th.
The gap with the offline test is pretty high, so there’s likely some Goodharting (especially by OAI).
The offline test could be a better relative comparison, but if the 115 IQ score for the best models is accurate, god help the rest of us, who likely need to re-calibrate our own IQ estimates by a standard deviation or two.
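For context on how those percentiles and scores relate: assuming the tests are normed like a standard IQ scale (mean 100, SD 15), percentile and score convert via the normal distribution. A quick sketch:

```python
from statistics import NormalDist

# Assumed standard IQ norm: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# A 115 score is one SD above the mean, i.e. roughly the 84th percentile.
print(round(iq.cdf(115) * 100))  # ≈ 84

# The 99th percentile (o3 on Mensa Norway) corresponds to a score of roughly 135.
print(round(iq.inv_cdf(0.99)))   # ≈ 135
```

So the gap the offline test implies (115 vs. a 99th-percentile ~135) is on the order of a full standard deviation, which is why the Goodharting suspicion matters.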
OpenAI keeps dropping new versions; I haven’t even tried 4.5, and they’re already rolling out GPT-5.