Coding benchmarks for o3 and o4-mini

Here are some coding benchmarks for o3 and o4-mini.
The cost of o3 is likely disqualifying for everyday usage, but o4-mini looks very promising.

Aider:

LiveBench:

SWE-bench:

Codeforces (:poop: IMO):

Feel free to add any more, or discuss whether the benchmarks are consistent with the experience in Cursor. :blush:

4 Likes

I think GPT-4.1 has been released in a free version.

Personally, for coding TypeScript/Node.js, Claude Sonnet and Gemini work much better on my project.

1 Like

Both 4.1 and o4-mini are currently free.

The API prices for o3-mini and o4-mini are the same, so hopefully when the honeymoon ends it’s also priced at 1/3 of a request.

1 Like


Interesting.
I agree o3 + Gemini 2.5 Pro could make a good combo. 2.5 flash is out today, so it’ll be interesting to see where it lands.

I’d also like to see o3 + o4-mini (~half the cost of 4.1).

Back when Claude 3.7 (thinking) was king, I suggested switching the R1 + Claude 3.5 combo to R1 + 3.7 (no thinking) since 3.5 and 3.7 cost the same, and R1 + 3.5 was only ~1% behind.

This isn’t a coding benchmark, but ‘g factor’ is highly correlated with most cognitive abilities, plus it’s fascinating.


o3 ranks in the 99th percentile, and 2.5 Pro in the 97th, on Mensa Norway.

The gap with the offline test is pretty high, so there’s likely some Goodharting (especially by OAI).

The offline test could be a better relative comparison, but if the 115 IQ score for the best models is accurate, god help the rest of us, who likely need to re-calibrate our own IQ estimates down by a standard deviation or two. :man_shrugging:
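For anyone who wants to sanity-check the percentile↔IQ conversion, here’s a quick sketch using Python’s stdlib, assuming the standard IQ scale (normal distribution, mean 100, SD 15); the percentile figures are the Mensa Norway numbers above:

```python
from statistics import NormalDist

# Standard IQ scale: normally distributed, mean 100, SD 15
iq = NormalDist(mu=100, sigma=15)

# 99th percentile (o3 on Mensa Norway) -> roughly IQ 135
print(round(iq.inv_cdf(0.99)))   # 135

# 97th percentile (2.5 Pro) -> roughly IQ 128
print(round(iq.inv_cdf(0.97)))   # 128

# The offline score of 115 is only one SD above the mean,
# i.e. about the 84th percentile -- hence the Goodharting suspicion
print(round(iq.cdf(115) * 100))  # 84
```

So the gap between the online and offline results is roughly 15+ percentile points, which is why it smells like training to the test.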

1 Like

OpenAI keeps dropping new versions—I haven’t even tried 4.5, and they’re already rolling out GPT-5 :sweat_smile: