The good news is, Cursor’s UI, general functionality, and usability are great! I will keep checking on Cursor’s development, and will likely become a regular user once these changes are made.
Also, I recognize that these observations come from a free-trial user, so I’ll confine my recommendations to improvements that are needed for basic functionality. More importantly for y’all, these are the things preventing me from upgrading to a pro account.
These problems all appear to lie within the LLM agent implementation - keeping in mind that I’m writing without knowledge of your current setup.
Cursor Rules: This is a great concept, but my experience shows a remarkable lack of adherence to the rules by the LLM agent. It seems to treat the rules more as suggestions than actual rules - so much so that sometimes I can’t even tell whether it’s aware the rules exist. I don’t imagine you have much control over the actual weighting of the rules within the conversation, but perhaps the temperature could be lowered on the models (or at least an option to control the temperature could be added to the interface) to encourage stricter adherence to the rules.
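To make the temperature idea concrete, here’s a rough sketch of what an exposed temperature setting might look like on the API side - purely illustrative, since I don’t know how Cursor actually calls the models (the model name and the helper function are made up):

```python
# Hypothetical sketch only - not Cursor's actual implementation.
# Assumes an OpenAI-style chat completions API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_agent_turn(system_rules: str, user_request: str, temperature: float = 0.2) -> str:
    """A lower temperature (e.g. 0.1-0.3) should make the model stick more
    closely to the rules in the system prompt; the value used here could be
    surfaced as a user-facing setting instead of being hard-coded."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        temperature=temperature,
        messages=[
            {"role": "system", "content": system_rules},
            {"role": "user", "content": user_request},
        ],
    )
    return response.choices[0].message.content
```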
Unrequested Changes: In a similar vein, the LLMs seem to be extremely cavalier about making unrequested changes and guesses about the user’s wishes. They are constantly adding files that weren’t explicitly requested, changing code, making additions, and removing functionality in ways that end up breaking the codebase. For large codebases, this introduces a huge headache of verifying, reverting, or fixing everything - so much so that it’s often not worth having them do things at all, because fixing their mistakes takes more time than just doing it myself. And unfortunately, as mentioned earlier, rules don’t seem to change this behavior. This is a deal breaker for me, and I’d imagine for many other professionals. It also potentially adds further credence to there being a temperature issue with the models.
Uninformed Code/Decisions: When doing tasks that require multi-file context, models often do a half-assed job of checking the codebase. Even when I explicitly point them at the directories to familiarize themselves with, they will usually only check a couple of files, and sometimes only part of those files. Without complete context (or complete relevant context), the result is uninformed code that doesn’t properly implement or utilize the existing code. Examples of mistakes I’ve observed as a result:
- Assuming a function takes specific arguments when it does not.
- Not recognizing the difference between a property and a method in Python (see the short example after these lists).
- Hallucinating functionality.
- Incorrect/flawed usage of the code architecture.
Examples of tasks where I’ve noticed this frequently occurring:
- Documentation creation
- Documentation updates/correction
- Multi-file context code addition/changes
- Example/Tutorial creation
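For reference, here’s the kind of property-vs-method mix-up I mean - a toy Python example, not taken from a real session:

```python
# Toy example of the property-vs-method confusion described above.
class Account:
    def __init__(self, balance: float):
        self._balance = balance

    @property
    def balance(self) -> float:          # accessed as account.balance
        return self._balance

    def withdraw(self, amount: float) -> float:  # called as account.withdraw(x)
        self._balance -= amount
        return self._balance

account = Account(100.0)
print(account.balance)       # correct: property access, no parentheses
print(account.withdraw(25))  # correct: method call with parentheses
# print(account.balance())   # the kind of call the agent produces: TypeError, 'float' object is not callable
```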
UI/Control:
- Either decrease the default temperature of the models, add an option to control the temperature, or find another way of encouraging stricter adherence to rules.
- Implement a comprehensive default rule set that aims to control some of this behavior.
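I have no idea how rules are currently injected, but as a hand-wavy sketch, a default rule set could be as simple as a strict preamble combined with the user’s own Cursor Rules (the wording and the helper below are entirely made up):

```python
# Hypothetical default rule set - not anything Cursor actually ships.
DEFAULT_RULES = [
    "Only modify files the user explicitly asked you to modify.",
    "Do not create new files unless the user requests them.",
    "Do not remove or refactor existing functionality unless asked.",
    "Read every file the user points you at before writing code.",
    "If a requirement is ambiguous, ask instead of guessing.",
]

def build_system_prompt(user_rules: str) -> str:
    """Combine a strict default rule set with the user's project-specific rules."""
    defaults = "\n".join(f"- {rule}" for rule in DEFAULT_RULES)
    return (
        "You must follow ALL of these rules; any deviation is a failure:\n"
        f"{defaults}\n\nProject rules:\n{user_rules}"
    )
```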
LLM Setup & Process:
- Refine System Prompting & Instruction Following: Vigorously iterate on the underlying system prompts to heavily emphasize strict adherence to “Cursor Rules” and the user’s explicit request, framing deviations or unsolicited changes as critical failures. Explore using or fine-tuning models specifically optimized for instruction-following within technical/coding domains.
- Enhance Code Context via Advanced RAG: Improve the Retrieval-Augmented Generation (RAG) process for fetching codebase context. If you haven’t already, implement AST-based (Abstract Syntax Tree) chunking and indexing instead of simple text splitting to better capture code structure (a rough sketch follows after this list), and use graph-based retrieval leveraging code dependency information (imports, function calls, class inheritance) to automatically pull in relevant context files beyond simple semantic similarity. You may well have implemented some or all of this already, so take it for what it’s worth!
- Integrate Pre-computation & Post-Generation Validation (a rough sketch follows after this list):
  - Pre-computation: Before generating code, run relevant static analysis tools (e.g., linters like Flake8/ESLint, type checkers like MyPy/TypeScript) on the existing code context and feed summaries of potential issues or type definitions into the prompt for the LLM.
  - Post-Validation: After the LLM generates code, automatically run linters, type checkers, and potentially relevant unit tests (if feasible) on the changed code. Flag any new errors, regressions, or rule violations directly in the diff view, potentially even prompting the LLM for self-correction before showing the user.
- Implement Multi-Step Reasoning & Verification: Introduce intermediate steps in the AI’s process. For instance, require the agent to first generate a plan (e.g., “I will modify file X to add function Y, then update file Z to call Y”) and potentially perform a self-critique step (“Does this plan adhere to all rules? Does it only address the user’s request? Does it risk breaking existing functionality?”) before proceeding with code generation. This internal verification can catch flawed logic earlier (a rough sketch of such a loop follows after this list).
- Focus on Minimal Diff/Change: Explicitly instruct or maybe even fine-tune the model to prioritize making the minimal necessary changes to satisfy the user’s request, discouraging unrelated refactoring or additions unless specifically requested. Emphasize preserving existing code structure and functionality.
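To illustrate the AST-based chunking idea from the RAG bullet above - again, purely a sketch with no knowledge of how your indexer actually works - something along these lines using Python’s ast module:

```python
# Sketch of AST-based chunking: index a Python file at the function/class
# level and record its imports for dependency-aware retrieval.
import ast

def chunk_python_file(path: str):
    with open(path, encoding="utf-8") as f:
        source = f.read()
    tree = ast.parse(source)
    chunks, imports = [], []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "lineno": node.lineno,
                "code": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module or "")
    return chunks, imports  # chunks go into the index; imports become graph edges
```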
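For the pre-computation/post-validation item, a minimal sketch of shelling out to existing tools (Flake8 and MyPy here are just examples, and the helper is hypothetical):

```python
# Sketch of pre-/post-generation validation by running existing tools.
import subprocess

def run_checks(paths: list[str]) -> str:
    """Run linters/type checkers and return their combined output, which could
    be summarized into the prompt (pre-computation) or surfaced in the diff
    view (post-validation)."""
    reports = []
    for cmd in (["flake8", *paths], ["mypy", *paths]):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.stdout.strip():
            reports.append(f"$ {' '.join(cmd)}\n{result.stdout.strip()}")
    return "\n\n".join(reports) or "No issues reported."

# Pre-computation: feed run_checks(context_files) into the prompt.
# Post-validation: re-run run_checks(changed_files) and flag anything new.
```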
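And for the multi-step reasoning item, the loop could be as simple as plan, critique, then implement; the prompts and the ask_model helper are placeholders, not anything I know about Cursor’s internals:

```python
# Sketch of a plan -> self-critique -> implement loop.
# `ask_model` is any callable that sends a prompt to the LLM and returns text.
def plan_critique_implement(ask_model, request: str, rules: str) -> str:
    plan = ask_model(
        f"Rules:\n{rules}\n\nUser request:\n{request}\n\n"
        "List the exact files and changes you intend to make. Do not write code yet."
    )
    critique = ask_model(
        f"Plan:\n{plan}\n\nDoes this plan follow every rule, touch only what the "
        "user asked for, and avoid breaking existing functionality? "
        "Answer OK, or REVISE with corrections."
    )
    if critique.strip().upper().startswith("REVISE"):
        plan = ask_model(f"Revise the plan to address this critique:\n{critique}")
    return ask_model(f"Implement exactly this plan and nothing more:\n{plan}")
```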
Anyway, great job so far on the editor, and kudos to the devs. Aside from these issues, I was pleasantly impressed with Cursor.