I’m curious how you all choose which models to use. These days, I use GPT-4.1 for simple questions and Gemini for complex ones. For agent mode, I use o4-mini for straightforward tasks and Sonnet 3.7 for difficult problems. I used to run everything on Sonnet 3.5, but it’s become too slow, so now I’m switching models all the time. Personally, I wish I could preset which model to use for each mode to cut down on the exploration overhead.
I also tried the “automatic” mode, but it doesn’t always run faster, and it doesn’t produce better results than picking the model myself—so I stopped using it.
I always look at what the current top model is (going by the benchmarks on the Aider LLM Leaderboards). For every coding task I currently use Gemini through my API key, and I also tell it to use as many tokens as it wants to get the best performance. For general questions I use Sonnet 3.7 or o4-mini, i.e. the models that don’t cost or use anything of my own resources.
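In case anyone wonders what “tell it to use as many tokens as it wants” looks like when calling Gemini directly with an API key, here’s a minimal sketch with the google-generativeai Python package. The model name and the 65,536 output-token cap are assumptions on my side, so check the limits of whatever model currently tops the leaderboard:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # your own Gemini API key

model = genai.GenerativeModel(
    "gemini-2.5-pro-preview-03-25",  # example name: use whatever tops the leaderboard
    generation_config={
        "max_output_tokens": 65536,  # be generous so long answers aren't cut off
        "temperature": 0.2,          # lower temperature tends to suit coding tasks
    },
)

response = model.generate_content("Refactor this function to remove the duplication: ...")
print(response.text)
```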
I don’t know if it is a bug, but I get constant rate-limit issues with the Cursor models. I went to the Gemini page and added a model with the name ‘gemini-2.5-pro-preview-03-25’. Since I did that, I can prompt without a problem with my API key. From my understanding, this is the official model name from Gemini. Here is the API doc: Gemini models | Gemini API | Google AI for Developers
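If you want to double-check that this name is valid for your key before adding it in Cursor, a quick sketch like the one below (using the google-generativeai package, nothing Cursor-specific) lists every model your key can call:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# List every model this key can use for generation; the preview model should
# show up as "models/gemini-2.5-pro-preview-03-25" if the name is correct.
for m in genai.list_models():
    if "generateContent" in m.supported_generation_methods:
        print(m.name)
```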
Probably because Cursor still uses the experimental model, which is the free and severely rate-limited (5 RPM, 25 RPD) counterpart of the preview model. Can you use tool calls with the preview one? Can you enable thinking on it?
I haven’t tried tool calls. You can’t enable thinking on it, but I think it uses thinking by default and Cursor just doesn’t show the thinking process since it’s not one of its own models. That’s only an assumption on my part, though.
I’m literally doing the same thing. To be honest, I was shocked when I read your post, because it’s exactly my setup and it works perfectly. This is the perfect combination. Good job, mate! Cheers to us.
Yeah, this is correct. Agent mode will not work automatically. It might create a code block you can then add to a file (or create the file if it doesn’t exist), but you’ll have to trigger that manually. It can’t edit the files itself, it can’t search your files or access any part of them unless you include them in the request, and it can’t search the web either.
Basically you’re limited to the manual mode with such models.
I start with “automatic”, whatever Cursor chooses. As things become more complex, I switch to Gemini.
In my experience Gemini is faster and sometimes smarter than Sonnet, but it is not as solid and stable as Sonnet 3.7. I usually use Gemini until it becomes lazy or dumb, then switch to Sonnet 3.7, which almost never becomes lazy or dumb. It’s probably due to inference-budget manipulation on Google’s side.
I only use thinking models, as the non-thinking ones are much worse.
When/if I’m stuck, I use o3 to dig deeper. I might use o4-mini, but only the high version, as the standard one doesn’t show anything outstanding that would win over Gemini or Sonnet.
Hot take: It doesn’t matter beyond how much $$ you pay.
Yes, different models have different performance on benchmarks, and some “feel” better than others. However, we are talking about small percentage improvements on artificial benchmarks that are currently under scrutiny on account of being gamed by all the major LLM providers. “Feels better” may be a factor; it’s up to you how much you choose to rely on your feelings or the feelings of others in choosing models.
The main differentiators are broad categories that are not specific to any one model:
can the model reason? (RL-based fine-tuning aka “thinking” models)
how large is the context window
how are you paying for the model (free, premium, consumption-based)
The choice I make is as follows:
For “Ask” I use a free model (cursor-small, gpt-4o-mini). Reasoning models are fine-tuned on code and math problems (not “fuzzy” brainstorming), so unless you ask the model for napkin math (don’t, you risk hallucination), they won’t help you here and you’d be paying a premium to use them. The allotted input context window is about 20k tokens across all models (enforced by Cursor), so that factor is irrelevant here.
For “Agent” I currently use claude-3.7-sonnet. You can also use any other reasoning model (o3, grok-3, gemini-2.5, …). We want to generate code, and basically every paper and benchmark under the sun shows a clear and significant performance boost for “reasoning” models. I try to stick to premium models instead of models priced via consumption (I have infinite ambition, but a very finite wallet). Input context is, again, limited to 20k for chat and 10k for cmd+K completion, so no differentiator here (there’s a rough token-count sketch at the end of this post).
The piece I am currently experimenting with is “normal” reasoning vs MAX models. It’s significantly more expensive, but we get 2x the context in the model itself, and file references pass up to 750 lines instead of 250 lines. I think this changes how you can structure your codebase, but it’s too early for me to have a strong opinion here. Let’s see what the community discovers.
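As a footnote on those context limits: if you want a rough feel for whether a file even fits in a ~20k-token window before you reference it, a quick tokenizer check is enough. This is just a sketch using OpenAI’s tiktoken as an approximation (other vendors tokenize differently, and the file path is only an example):

```python
import tiktoken

# cl100k_base is an OpenAI tokenizer, so this is only an estimate for other
# vendors' models, but it's close enough for a fits/doesn't-fit sanity check.
enc = tiktoken.get_encoding("cl100k_base")

with open("src/big_module.py") as f:  # example path, substitute your own file
    text = f.read()

n_tokens = len(enc.encode(text))
print(f"{n_tokens} tokens -> {'fits in' if n_tokens < 20_000 else 'exceeds'} a 20k context window")
```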