o3-mini vs claude-3.7-thinking

o3-mini is much cheaper.
claude-3.7-thinking is more expensive.

o3-mini needs multiple prompts to reach the same commit (sometimes it finishes without even a summary conclusion or reaction).
claude-3.7-thinking uses more tools, analyzes more things, and appears to go off-road several times, then comes up with a complete solution that also touches other features just because it felt like doing it in the same prompt.

o3-mini lags at first, then comes up with a compact (barely sufficient) code solution, then polishes the implementation based on tool use and prompting feedback.
claude-3.7-thinking shows a plan within the first three seconds, then uses tools, searches, and starts changing things all over the place with other steps in between; you can interrupt it to complement, clarify, or finalize an implementation or processing effort.

o3-mini writes 20 lines of code to fix the issue after at least 3-4 chained prompts, shows no creative flair in code layout, and its execution results are badly colorized.
claude-3.7-thinking writes 200 lines of code that fix the issue in one try, with elegant formatting and fancy execution results (sometimes it also fixes other issues in the same prompt).

In theory o3-mini should be cheaper to use; in practice, the cost-to-final-result ratio favors claude-3.7-thinking. Total human watch time looks similar for both.

Does anyone else feel the same?

For complex coding tasks I find claude-3.7-thinking outperforms o3-mini.

Often I find o3-mini tells me it's found the root cause of a bug when it has in fact completely misinterpreted my code.

I’ve basically stopped using o3-mini outside of niche cases where I want typographic apostrophes, because AFAIK Anthropic models are incapable of outputting them (i.e. ’).
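If the only blocker is the apostrophe character, one workaround is a post-processing pass that converts straight apostrophes to typographic ones. A minimal sketch (the helper name `curly_apostrophes` is hypothetical, and this naive replace would also affect single-quote string delimiters in code output):

```python
def curly_apostrophes(text: str) -> str:
    # Replace ASCII straight apostrophes (U+0027) with
    # typographic right single quotation marks (U+2019).
    return text.replace("'", "\u2019")

print(curly_apostrophes("it's done"))  # it’s done
```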
