To put some positivity here - I can’t say for certain whether it’s where I’m at in this particular applet, or whether my interaction with LLMs, my prompting, or some higher-level practice on my part fuels this, but I find Auto to be much improved. I had really advised against it earlier in Cursor’s lifetime, but it really is starting to pick models better than I am. Has anyone else noticed an improvement specifically here? I used to shy away from it for larger edits, but it seems to pick models well now and execute the ACTUAL agentic edit correctly and precisely.
I’ve found:
o4-mini - if you struggle with other models, give this one a try; force it to web search, and it is strong for specific things
Gemini 2.5 - great for Ask on refactors, but hand that Ask output to Agent with another model like o4-mini. I find Claude is best for high-level architecture conversations via Ask, but many models fail at precise edits, and interaction with the linter can be brittle
Claude 4 Opus - literally godlike. I cannot afford this, but it is strong. Cost-insensitive users, I envy you and sometimes pretend to be you
Claude 4 Sonnet - good for conceptual Ask convos. Struggles with CSS, as do many of the more verbose models
The problem I find is that you trade one problem for another. I’ve gotten to the point where I can say it is very strong for most prototyping, and you can slog through dev work, which is difficult: as the node packages grow, so does the overlap and the burden of semantic and system adherence. Still, I remade Google Maps, interwove 4 APIs, and refactored, and we’ve stumbled at times (it’s hard to prompt every little thing out of a move, so I just revert often and try again - sometimes the second try is iteratively much closer). All in all, I thought I’d leave my experiences here. I’ve made some pretty complex apps to really run this thing through the hard stuff, and I feel Auto mode is getting close to feeling VERY fluid in my recent experience.
I haven’t tried Auto for a very long time. I’d like to know what the decision-making process is behind ‘auto’. Does it actually use the context of your request to decide which model is most capable? If so, then perhaps it plays to the strengths of each LLM… and in that case, it’s definitely worth trying again.
If it just chooses the cheapest one or the least busy one, then that’s kind of pointless - something like the second strategy in the sketch below.
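To make the distinction I’m asking about concrete, here’s a purely hypothetical sketch (TypeScript, with made-up model names, costs, and tags - I have no idea how Cursor actually routes requests) of a context-aware router versus a cheapest-model router:

```typescript
// Hypothetical sketch only - none of this reflects Cursor's real implementation.
interface ModelOption {
  name: string;
  costPerMTok: number;   // made-up cost per million tokens
  strengths: string[];   // made-up capability tags, e.g. "refactor", "ask"
}

const models: ModelOption[] = [
  { name: "o4-mini",       costPerMTok: 1, strengths: ["web-search", "precise-edit"] },
  { name: "gemini-2.5",    costPerMTok: 2, strengths: ["refactor", "ask"] },
  { name: "claude-sonnet", costPerMTok: 3, strengths: ["architecture", "ask"] },
];

// The version worth using: score each model against tags derived from the request.
function routeByContext(requestTags: string[]): ModelOption {
  const scored = models.map(m => ({
    model: m,
    score: m.strengths.filter(s => requestTags.includes(s)).length,
  }));
  scored.sort((a, b) => b.score - a.score);
  return scored[0].model;
}

// The "kind of pointless" version: ignore the request and pick the cheapest model.
function routeByCost(): ModelOption {
  return models.reduce((a, b) => (a.costPerMTok <= b.costPerMTok ? a : b));
}

console.log(routeByContext(["refactor"]).name); // gemini-2.5
console.log(routeByCost().name);                // o4-mini
```

If Auto is doing anything like the first function, it plays to each LLM’s strengths; if it’s doing the second, it doesn’t.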
Also, with regard to the models: if I need to do some type of SDK or API integration, I’ll generally just choose the model with the most recent training data. Pointing it at a URL with documentation is often useless, as it doesn’t crawl the site to get all of the info it needs to complete the integration.
Back here to say there is a massive blind spot in Auto.
When you switch discretely from model to model, there is massive architectural disregard and huge deviation from goal orientation. There seems to be poor model handoff or context switching, or perhaps they are not re-embedding or re-tokenizing the context upon a model switch. It’s so bad I am once again saying DO NOT USE!
EDIT: I have to wonder at this point when it’s fair to say this is no longer just AI fluctuation in output. I think this is pretty much (given the ups and downs) completely dependent upon the stewardship, i.e. how our inputs are handled and passed to the models. I still think it’s fair to say that if that’s your product, you have to be responsible for quality, especially in usage-based products where the switch after a rate limit creates this issue in the first place and actively suggests you jump right into a model switch - to AUTO.