When using o3-mini in Composer (agent mode), it will often say “I will make these changes” and then stop generating after that sentence.
From there, it can take multiple requests to get it to make the changes it proposed. I’ve found prompts like “Implement the changes” work about 50% of the time; other times it again returns a text response with no action taken.
This issue burns a TON of usage and dilutes the model’s context with unnecessary requests, making it hard to work with and less effective than it could be.
P.S. There’s also a formatting issue where o3-mini writes code as plain text instead of inside a formatted code block.
It’s not only o3-mini; the integration of other reasoning models isn’t great either (pretty much useless for development, in practice). I don’t think these are the models’ own issues. This needs to be addressed by Cursor and the other AI coding IDEs.
Yes, every model/provider has slightly different requirements and handling.
While the APIs are nowadays almost standard, the models are not.
As we know, the differences are not just in context length but also in how certain prompts (including Cursor’s internal system prompts) cause different behavior in different models. That is why the Cursor team often reports that they are testing new experimental models and have to adjust their internal pre-processing to get better results.
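To make that concrete: even against the same OpenAI-style chat completions endpoint, a client has to branch per model family. A minimal sketch, assuming the OpenAI Python SDK; the parameter rules (reasoning models rejecting `temperature` and taking `reasoning_effort` / `max_completion_tokens` instead) reflect my understanding of the current API, not anything Cursor has confirmed about its internals:

```python
# Sketch: the same "standard" chat-completions call still needs
# per-model branching. Parameter support shown here is my assumption
# about the OpenAI API at the time of writing.
from openai import OpenAI

client = OpenAI()

def complete(model: str, messages: list[dict]) -> str:
    kwargs = {"model": model, "messages": messages}
    if model.startswith(("o1", "o3")):
        # Reasoning models: no temperature/top_p; they accept
        # reasoning_effort and max_completion_tokens instead.
        kwargs["reasoning_effort"] = "medium"
        kwargs["max_completion_tokens"] = 1024
    else:
        # Conventional chat models take the usual sampling knobs.
        kwargs["temperature"] = 0.2
        kwargs["max_tokens"] = 1024
    resp = client.chat.completions.create(**kwargs)
    return resp.choices[0].message.content
```

Multiply that by differing context lengths and prompt sensitivities and you can see why every new model needs its own tuning pass.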
Luckily, some of those changes are done on the Cursor server side, which dispatches requests to the LLMs and does pre-processing such as indexed documentation selection. Other parts use Cursor-created LLM models that reduce the load on the heavier models by preparing the input.
You can see similar behavior in other AI tools like Perplexity, Claude Chat, or OpenAI’s ChatGPT when they search the web before the LLM processes the results.
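Here is a rough sketch of that two-stage pattern, purely hypothetical: the function names, prompts, and model choices are illustrative stand-ins for whatever Cursor actually runs server-side, not its real pipeline:

```python
# Hypothetical two-stage pipeline: a lightweight model prepares the
# input (context selection) so the heavy model gets a smaller,
# cleaner prompt. All names and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()

def select_context(query: str, indexed_docs: list[str]) -> str:
    """Cheap pre-processing pass: pick only the relevant docs."""
    listing = "\n".join(f"[{i}] {d[:200]}" for i, d in enumerate(indexed_docs))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # small, fast model for pre-processing
        messages=[{"role": "user",
                   "content": f"Query: {query}\nDocs:\n{listing}\n"
                              "Reply with only the numbers of the relevant docs."}],
    )
    picks = {t.strip() for t in
             resp.choices[0].message.content.replace(",", " ").split()}
    return "\n\n".join(d for i, d in enumerate(indexed_docs) if str(i) in picks)

def answer(query: str, indexed_docs: list[str]) -> str:
    """The heavy model only ever sees the pre-selected context."""
    context = select_context(query, indexed_docs)
    resp = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return resp.choices[0].message.content
```

The design point is simply that the expensive model never has to read the whole index, which is why this kind of pre-processing reduces load.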
Unfortunately, there is one part we do not see but which can cause issues: when AI providers adjust their model behavior, pre-prompts, or processing, for example monitoring or feedback loops, parameter optimization, dynamic routing to different server configurations, distillation/quantization/precision adjustments for faster inference, adjusting concurrency limits per server, …
Cursor has no influence on that part.
I have had great experiences with several of the reasoning models, both through Cursor and in tools unrelated to coding. But current reasoning models do not always reason well. DeepSeek-R1 often treats data passed in from RAG (search/index, …) as more correct than the user prompt, even when you tell it what the facts are. The ‘thinking’ process, if you ever read it, can be a mess.
That means its use as a coding tool is also affected by such issues. It simply handles prompts differently than a non-reasoning conversational LLM does.
I also hope the Cursor team can tweak the process and prompts for better output.
I noticed that as of yesterday, o3-mini in Cursor (Composer agent) has gotten a lot better at tool calling and can usually do a chain of fixes. I feel the number of times I have to say “proceed” and “use diff edit” has been reduced by 90%. This makes the model a lot more usable now; in fact, sometimes I prefer it over Sonnet when I notice that Claude is going in loops.