Hello! Everywhere I look, I find claims that claude-4 is the best at coding. Here is my story.
For my app, I asked 3 models the same thing: to analyze my code and give me a plan to implement a new feature.
Tested models:
o3, gemini-2.5-pro and claude-4-sonnet (all thinking variants)
The prompt was the same for each model, and every chat was new. But I forgot to mention in the prompt one requirement that is crucial for the right architecture.
Out of the 3 models, only gemini gave me a good architectural plan, because it pointed out the possible requirement that I had not mentioned.
o3 and claude gave me plans with the wrong architecture: they did not foresee the potential problem like gemini did. I had to add the missing requirement for them to redraw their plans.
Then I gave all three plans to each model and asked it to compare them in terms of robustness, logic, good practices, etc.
As a result, all 3 models gave me practically the same response:
o3 - best, most robust and clever plan
gemini - almost the same, but with some minor flaws
claude-4 - the worst plan (and interestingly, claude itself gave the most negative review of its own plan).
I understand that this is only one case. But how come the “best” coding model, according to maaaany opinions out there, makes the worst coding decisions?
Second: when I presented the same plan to claude, but told it that it was its own plan and asked it to analyze it, claude answered that it is a very good and very smart plan!
What am I doing wrong?