I’ve been defaulting to Sonnet for basically everything, but after running some comparisons I’m starting to think I should be switching models more often depending on what I’m doing.
Like, Opus seems way more careful about security stuff (caught a JWT issue Sonnet glossed over in one of my tests). And GPT tends to over-engineer things, but sometimes that’s actually what I want when I’m setting up a new project from scratch.
But in practice I just… don’t switch. I pick one and stick with it because the dropdown is right there but I never think about it mid-task.
Anyone here actually have a workflow where they use different models for different things? Or do most people just find one that works and leave it?
Curious what the split looks like. I feel like Auto should handle this but I’m not convinced it picks the right model for the right job every time.
Hi Ned, for demanding tasks and Cursor plans, I use Opus 4.6 thinking, which is expensive but creates excellent code. For tasks of medium complexity, I use Sonnet 4.5, which does quite well. For simple and repetitive tasks, I use GPT-5.1 Codex Mini, which is inexpensive and does well. Sometimes I use Auto, but it doesn’t always work well.
I’m curious how other programmers handle this.
Thanks for the replies. @brales, that cost-tier approach makes sense; I hadn’t thought about it that way. Using a cheaper model for the boring repetitive stuff is pretty smart when you’re burning through requests fast.
@Artemonim OK, that subagent setup is way more detailed than I expected. Assigning different models to different agent roles, like having a separate one just for verification… that’s a cool idea. Kind of like having a code reviewer who’s a totally different person from whoever wrote the code.
@neverinfamous honestly, that’s tempting. Less stuff to think about.
I ran a quick test on the Auto thing after reading @brales’s comment, by the way: same branded-types rule with alwaysApply: true, same prompt, 3 runs on Auto vs. 3 on Sonnet 4.5. Both got 100% compliance, so for straightforward rules Auto didn’t do worse, at least in my test. I think the complaints people have might show up more with rules that go against what the model wants to do by default. Small sample, though.
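For anyone unfamiliar with the pattern that rule enforces, here is a minimal TypeScript sketch of branded types (the names are illustrative, not from my actual rule file):

```typescript
// Branded (nominal) types: tag a primitive at the type level so that
// a UserId can't be passed where an OrderId is expected, even though
// both are plain strings at runtime.
type Brand<T, B extends string> = T & { readonly __brand: B };

type UserId = Brand<string, "UserId">;
type OrderId = Brand<string, "OrderId">;

// Constructor functions are the one place a cast is allowed.
function makeUserId(raw: string): UserId {
  return raw as UserId;
}

function getUser(id: UserId): string {
  return `user:${id}`;
}

const uid = makeUserId("u-123");
console.log(getUser(uid)); // type-checks fine
// getUser("u-123");       // compile error: a plain string is not a UserId
```

The point of the rule is that the compiler, not the reviewer, catches mixed-up IDs; it’s exactly the kind of mechanical convention a model either follows or silently drops.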
It’s an interesting battle between Cursor and Antigravity. Cursor is extremely aggressive with updates, features, and so on. AG takes a very different approach: it strives to be quiet and seamless. I’ve always been a feature guy, so I admire Cursor’s approach, their balls. But they have a lot of bugs, and meanwhile AG simply works. Its context across threads is extremely impressive. It knows what you have been doing, how you work, how you think, etc. It adapts to you. It seems to me neither approach is wrong, and the ideal might be a more balanced approach. But, at least at this time, Cursor is just too damn expensive, so it loses, and I say that with all due respect. This is Google, after all. I’m not even sure Google is putting much effort into AG, but then it’s hard to tell.
AG does seem to be overlooked and underestimated, perhaps because it seems more simplistic on the surface. There are fewer settings than Cursor, but also fewer bugs. Maybe it has something to do with the time it takes for the knowledge-items system to build up enough for the agent to fully utilize it. Or maybe people are just group-thinkers and haven’t figured out its value yet. Anyway, I am not saying it is the best. It’s very good, but so is Cursor, and probably others. But it is a lot cheaper than Cursor for heavy users. What does it matter if one is better than another if you can’t afford it?
I use Claude Sonnet for non-trivial main development work. For small edits, simple development work, and question answering I switch to Grok Code Fast. If there is a hard-to-find bug, I use Claude Opus.
If I am adding a big feature, I create a plan with Claude Sonnet and start building with it. After Sonnet builds the initial version, I switch to Grok Code Fast for the remaining fixes and edits, as long as they don’t require considerable refactoring.
Yes, I mostly use Opus by default, but when I think the output is not good I do a second run with Codex to compare outputs. I also use Codex for code review rather than Opus. When I need highly structured code, I likewise prefer Codex over Opus.
I use Gemini for web search and research online.
I use Composer for searching codebase.
@Erkan_Arslan that’s a practical setup. Building the initial version with Sonnet and then switching to Grok for the smaller fixes makes sense; you don’t need the heavy model for work that doesn’t require a lot of reasoning.
@liquefy using Codex specifically for code review is interesting. Do you find it catches different things than Opus, or is it more about getting a different perspective? And using Composer just for codebase search is something I hadn’t thought of.
Seems like most people here do switch models, just not in a super formal way, more like picking the right tool by feel. And nobody trusts Auto to make the call for them.
Now with subagents I don’t switch models manually, since each subagent has its model predefined.
Codex is much better at code review; I use xhigh reasoning for this. Opus usually gets stuck in a loop and reports a lot of false positives. Generally I use the opposite model for reviews, so if I code with Codex, I use Opus for review.
That’s a really interesting pattern, using the opposite model for review: if Codex wrote it, Opus reviews, and vice versa. Kind of like how you wouldn’t have the same person write and review their own PR.
The false-positives thing with Opus is good to know too. I haven’t used it for code review specifically, but I can see how a model that overthinks things would flag stuff that isn’t actually wrong.
Frontier models are way too expensive to use them carelessly for any task.
For planning - GPT 5.2
For end-to-end implementation of complex plans or major features - Codex 5.3
For small features, documentation, or agentic tasks (like deploying to a machine and checking logs) - Auto. The current Auto budget in Cursor is very generous, and I’d run out of API usage long before I could use up all the Auto allowance on a Pro+ plan.
I also put a lot of effort into careful context crafting, to make sure agents get exactly what they need and aren’t bloated by hundreds of generic AGENTS.md lines.
Of course, I used to employ different models for different things when primarily working in Cursor. It has a heavy downside. If the model makes mistakes that a better model wouldn’t have made, you end up spending more tokens cleaning up the mess than if you just used the best model. It’s definitely less to think about in AG in that regard. But the advantage is not seeing that bill rolling up by the nanosecond. Talk about stress.
I’ve been programming with Cursor for 8 months, maybe 500 hours. I’ve tried many times to switch to Sonnet, Codex, GPT, or Gemini (also a good model), but I usually return to Opus, and even it makes some mediocre mistakes. All the other models (except Opus) usually just waste time/money/tokens. It also doesn’t make sense to split tasks by difficulty; they can make mistakes in any type of task: weird code patterns, a “first-eye solution” right in the middle of a complex task. They always need to be reviewed, every step of the plan. Yeah, I could take a break, but only long enough to fill a cup of tea; there’s no way I could leave an agent to work for more than 3-5 minutes. They always make mistakes (I read their thoughts).
@DjWarmonger the context-crafting point is underrated. From what I’ve tested, vague rules (“write clean code”) get ignored regardless of file size. Specificity is what matters, not length: a focused 20-line file with concrete instructions works better than 200 lines of generic best practices.
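As a sketch of what “focused and concrete” can look like, here’s a hypothetical rule file in Cursor’s `.mdc` format (the frontmatter fields are the standard ones like the alwaysApply flag mentioned earlier in the thread; the rule content itself is a made-up illustration, not a recommendation):

```markdown
---
description: Enforce branded ID types for entity identifiers
alwaysApply: true
---

- Never pass a raw string where an entity ID is expected.
- Define every ID as a branded type, e.g. `type UserId = string & { readonly __brand: "UserId" }`.
- Casts to a branded type are only allowed inside its constructor function (e.g. `makeUserId`).
```

Three checkable instructions in under ten lines; there is nothing for the model to interpret loosely.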
@Viktor_Ilkun 500 hours is a lot of data points. The “can’t leave it alone for more than 3-5 minutes” thing resonates; I keep hoping agents will get to the point where you can fire and forget, but we’re not there yet.
If the cheaper model doesn’t get it right the first time, you’re paying twice: once for the bad output and again to fix it. The math only works when accuracy stays high.
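To make that concrete, a quick expected-cost sketch (the prices and success rates are made-up illustration numbers, not real model pricing):

```typescript
// Expected cost of "cheap model first, escalate to the frontier model
// on failure" vs. paying for the frontier model up front.
function expectedCost(
  cheap: number,            // cost per task on the cheap model
  expensive: number,        // cost per task on the frontier model
  cheapSuccessRate: number  // probability the cheap model gets it right
): number {
  // On failure you pay for the cheap attempt AND the expensive retry.
  return cheapSuccessRate * cheap + (1 - cheapSuccessRate) * (cheap + expensive);
}

// Cheap-first beats frontier-only whenever
//   cheap + (1 - p) * expensive < expensive,  i.e.  p > cheap / expensive.
console.log(expectedCost(0.05, 0.5, 0.95)); // high accuracy: big savings
console.log(expectedCost(0.05, 0.5, 0.5));  // coin flip: savings shrink fast
```

This simple model ignores the cleanup cost of a half-wrong attempt polluting the codebase, which in practice pushes the break-even accuracy even higher.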