I’ve noticed that many models perform exceptionally well on benchmarks like LM Arena and Artificial Analysis, but when it comes to coding, none seem to match the latest Claude models. Whenever Anthropic releases a new model, it consistently outperforms the others on coding tasks. I’ve tested Gemini 2.5 Pro, and while it performs adequately, it struggles with larger codebases compared to Claude Sonnet 4’s impressive performance.
I hope teams like Qwen or DeepSeek release new models that compete seriously with Anthropic’s offerings. That competition could drive prices down, making it more feasible for solo entrepreneurs without external funding to build products affordably.