Cursor is the number one tool in my automated dev workflow. It outperforms every other tool in its domain that I have used for granular, informed engineering automation. In short, it is very nice. But is it too nice?
Cursor’s responses largely start with the phrase “You are absolutely right…” and end with an explanation of how the problem is completely fixed. While I recognize that user friendliness and positive communication are cornerstones of agent behavior, it would be nice for these communications to be balanced with more informed sentiment.
Perhaps the Cursor team could apply machine learning to the success/failure of completed operations, and do the same for the conversational prompt responses.
I personally would be delighted if the agent responded to a prompt with, “This update proposal is ambitious and restructures the core architecture. Maybe we should approach this update one system at a time,” or even refused to make the update, explained where the user’s ask was confused, and let them choose to continue or rework their prompt.
It would be helpful if the final output of a workflow stated the agent’s statistical confidence that the update will work as expected. A ballpark would be better than no feedback at all.
Omitting these key insights sacrifices scientific visibility for emotional user experience. We are engineers and need a ‘hard-nosed bot’ to tell us how it is. I imagine the Cursor team has the expertise to build an amazing prompt/result success prediction system. I say, do it up.
A skeptical engineering agent has higher utility. By recognizing this and integrating skepticism into their product, Cursor could become even more powerful and useful. Just something to think about.
The phrase “You are absolutely right” is a typical AI response; Sonnet 4, for example, does it.
This is not something Cursor does by itself; the AI models you use are trained to be helpful and friendly by the companies making them. Sonnet 4 is made by Anthropic,…
You can always ask the AI in your rules, for example, to assess the update and give a percentage confidence rating.
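A minimal sketch of such a rule (the wording here is just an illustration I am assuming, not an official Cursor feature):
Before finishing, state a rough percentage confidence that the change works as intended, and list anything you are unsure about.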
Sonnet is very friendly and agreeable. Sometimes to a fault.
If you want more of a “just get down to it and do the work” sort of model, I have found that Grok Code just does the work, then reports on the results at the end of a given prompt. It can still be quite polite and agreeable, but it’s usually hidden in the “Thought for Xs” blocks.
I agree with the OP. IMHO this is a huge problem in LLMs at large, because the companies who train them fine-tune for yes-man behaviour no matter what. There is some discussion about why this reduces the overall quality of answers, and some abliterated (= uncensored) models perform better than the base model in real-world scenarios, and not just through benchmark saturation (see e.g. the Josified models).
The workaround I have found so far is to add explicit Cursor rules, but it only slightly mitigates the issue. I can only hope that in the near future Cursor adds support for local LLMs (something like an MCP Ollama interface) so that we can use our abliterated/custom models…
Example of such a rule:
If the user’s approach seems flawed or suboptimal, respectfully challenge it. Offer better alternatives when appropriate, but always explain why. Assume the user may be a junior and may only know naive approaches.
Also be aware that not all models have the same rule-adherence strength. In my experience, Claude and Deepseek tend to treat rules very loosely, while GPT tries to adhere to them much more stringently. I don’t know about Grok; I haven’t tested it that much…
I do sometimes. Sometimes I catch the model “pondering” why my code is structured in an inherently flawed way, or “wondering” why the request asks for something instead of a more logical solution. A lot of this then gets summarized into “You’re absolutely right!”, lol, which is a real loss of insight.
One of the first rules I made was:
Don’t say “You’re absolutely right!” especially before analyzing the problem and context.
So it was funny reading the “thoughts”, where it would be like “remember not to say ‘you’re absolutely right!’”
Oh yes, absolutely. There needs to be SOME insight into what the model is doing and why. If it’s not a thinking model, there should be some kind of feedback periodically, even if just a little. Without ANY insight into what the model is doing or why, we would really be in trouble. When things go wrong, having ZERO feedback outside of the end summaries would make it tough to a) identify early on when the model is going off the rails, before it causes too much damage, b) understand the scope of what it is doing and why, and c) have material to research when troubleshooting why something was done the way it was done.
If you guys are thinking about removing the “Thinking” blocks from the agent, I ADAMANTLY IMPLORE you: DO NOT DO THAT! We cannot work in a vacuum, even with an agent. Not all of us are just “vibin” and ignoring everything that’s going on. Some of us are truly trying to use an agent like this to improve our productivity while still valuing the fundamentals. You need to make sure you don’t take away useful features that your users really do use, and IMO being able to investigate the thought cycles is CRITICAL to our ability to manage and control the agent and model.
In fact, I would say exploring the thinking cycles is one of, if not the, primary means by which I create and refine my rules. Building good rules requires understanding the model, and that requires either the model producing periodic feedback (e.g. non-thinking Sonnet, the way it produces textual output between bouts of searching, grepping, editing, etc.) or thinking-cycle details.
So please, do not remove this output. It is very important.
Reminds me that OpenAI has a custom GPT personality experiment called Monday that is snarky and sometimes downright rude. It would be fun if we could use it in Cursor.
This comment applies in general to basically every criticism I typically read about IDEs and the LLMs in them:
If something seems wrong to you - it is your mistake for not managing your context well enough.
If you’re spending less than 50% of your work time carefully crafting context and managing your rules and instructions - you’re using these tools wrongly.
These things predict the “next word” based on CONTEXT and training. It is 100% up to you to craft the precise context the model needs so that the output you get is what you want.
Whether that is the tone it addresses you in, the thoroughness with which it explores how likely its fixes are to be appropriate, or even the way it identifies which fixes to provide in the first place - that is all YOU, YOU, and YOU.
Every time you get anything less than what you want - you’re supposed to stop immediately, and work with the model to ensure that, next time, you do get what you want. Do that for a few months, and you’ll double your already 10x development speed, and save a lifetime’s pile of complaints along the way!
How many custom rules have you got in Cursor? How many did you write yourself? If either of those numbers is < 20 - you need to take a long pause now and better acquaint yourself with the best way to use these tools!
If I have to spend 50% of my time babysitting a chatbot into doing what I need, I’m just going to spend 50% more time writing code by hand and getting the better results of Actual Intelligence crafting my code.
Hi @anon35988380, feel free to start a new thread in Discussions with your current approach and tag me (@condor); I will give you feedback and best practices you can try out.