You are right, my bad. I didn’t realise that Anthropic was charging the same for 3.5 and 4. That’s truly nuts. EDIT: I probably perceived 3.5 as “cheaper” because it makes fewer tool calls to check things. You have to spec your tasks properly, but it doesn’t do 50 checks and test runs. It also doesn’t call other models to do things like search for code content in your codebase. These tool and model calls are expensive. For my use case, therefore, it is functionally cheaper, because the extra value of some of the Claude 3.7 and Claude 4 models is not worth the cost for me.
But I think you are missing a few points. Firstly, the value you place on quality depends on your sensitivity to price and the value of your time. Personally, I’ve found that Claude is much more reliable than Gemini 2.5 Pro for coding work.
In terms of the metrics used in your link, most of them are not coding specific, and the ones that are coding specific are more about problem solving than building code, i.e. they tend to be optimization/algorithm problems, not “refactor this code” problems. Many of the tests have nothing to do with code; they cover reasoning and the like. Humanity’s Last Exam has literally nothing to do with coding.
Would love to hear your criticism of the SWE benchmarks, because in my research they looked like a pretty reasonable approximation of what developers are using these models for.
I don’t have a barrow to push for Anthropic/Claude, I’m not a shareholder or a shill. I don’t really care what model I use, but I want to get good results. I’ve done most of my serious testing with Gemini Pro, Claude 3-4 and DeepSeek R1.
Frankly, for coding work, rating Gemini 2.5 Pro or R1 higher than Claude does not match my real world experience using those models. There’s a reason Claude has a good reputation with developers: it tends to perform well in real world development environments. Its behaviour is pretty good. I’ve tried Opus, but I didn’t find it anywhere near worthwhile compared to Sonnet.
Mind you, I’m not vibe-coding, and I’m working with established code and interlinked systems. That’s the environment I’ve tested them in. Interested to hear about your experience.