Coding benchmarks for o3 and o4-mini

Here are some coding benchmarks for o3 and o4-mini.
The cost of o3 is likely disqualifying for everyday usage, but o4-mini looks very promising.

Aider:

LiveBench:

SWE-bench:

Codeforces (:poop: IMO):

Feel free to add any more, or discuss whether the benchmarks are consistent with the experience in Cursor. :blush:

4 Likes

I think GPT-4.1 has been released in a free version.

Personally, for coding TypeScript/Node.js, Claude Sonnet and Gemini work much better on my project.

1 Like

Both 4.1 and o4-mini are currently free.

The API prices for o3-mini and o4-mini are the same, so hopefully when the honeymoon ends it’s also priced at 1/3 of a request.

1 Like


Interesting.
I agree o3 + Gemini 2.5 Pro could make a good combo. 2.5 flash is out today, so it’ll be interesting to see where it lands.

I’d also like to see o3 + o4-mini (~half the cost of 4.1).

Back when Claude 3.7 (thinking) was king, I suggested switching the R1 + Claude 3.5 combo to R1 + 3.7 (no thinking) since 3.5 and 3.7 cost the same, and R1 + 3.5 was only ~1% behind.

This isn’t a coding benchmark, but ‘g factor’ is highly correlated with most cognitive abilities, plus it’s fascinating.


o3 ranks in the 99th percentile, and 2.5 Pro in the 97th, on Mensa Norway.

The gap with the offline test is pretty high, so there’s likely some Goodharting (especially by OAI).

The offline test could be a better relative comparison, but if the 115 IQ score for the best models is accurate, god help the rest of us, who likely need to re-calibrate our own IQ estimates down by a standard deviation or two. :man_shrugging:
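For anyone who wants to sanity-check the percentile↔IQ conversion, here’s a quick sketch using Python’s stdlib, assuming the standard IQ scale (normal distribution, mean 100, SD 15); the percentile figures are the Mensa Norway numbers above:

```python
from statistics import NormalDist

# Standard IQ scale: normally distributed, mean 100, SD 15
iq = NormalDist(mu=100, sigma=15)

# 99th percentile (o3 on Mensa Norway) -> roughly IQ 135
print(round(iq.inv_cdf(0.99)))   # 135

# 97th percentile (2.5 Pro) -> roughly IQ 128
print(round(iq.inv_cdf(0.97)))   # 128

# The offline score of 115 is only one SD above the mean,
# i.e. about the 84th percentile -- hence the Goodharting suspicion
print(round(iq.cdf(115) * 100))  # 84
```

So the gap between the online and offline results is roughly 15+ percentile points, which is why it smells like training to the test.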

1 Like

OpenAI keeps dropping new versions—I haven’t even tried 4.5, and they’re already rolling out GPT-5 :sweat_smile: