Idea: Optimize LLM Usage with a Local Filter & Code Caching (reduce token costs)

Hi Cursor IDE team and community!

I’d like to share an idea to reduce token costs and improve efficiency when working with large language models (LLMs) in code editors. This approach could make interactions with expensive models (like Gemini 2.5 Pro or Claude 3.7) faster and more cost-effective.


The Problem

Today, even simple tasks (e.g., “Where is calculateTax used?”) often require sending massive chunks of code to an LLM. This leads to:

  • High token costs: Thousands of tokens wasted on irrelevant code.
  • Slow responses: Models waste time parsing unrelated files.
  • Noise overload: Important details get lost in bloated contexts.

Proposed Solution: Local Filter + Smart Caching

Use a lightweight local model as a “pre-filter” to identify relevant code snippets before querying the expensive model.

Workflow
  1. Code Indexing

    • On first run or file changes:
      • Parse all files and extract key elements via AST parsing and/or a local LLM (a minimal sketch of this flow follows the list):
        • Function/class names.
        • Line numbers and file paths.
        • Brief descriptions (comments/docs).
      • Store this data in a local cache (SQLite, vector DB, or even a JSON file).
  2. Query Processing

    • User asks: “Fix the bug in validateForm.”
    • Local model:
      • Scans the cache to find validateForm (e.g., line 120 in form.js).
      • Builds a minimal, enriched prompt for the LLM:
    In file form.js, there's a function validateForm (line 120):  
    function validateForm(data) { ... }  
    User says: "The email field isn't validating correctly."  
    
  3. Send to LLM

    • The expensive model gets a focused context, reducing token use and improving accuracy.
  4. Cache Updates

    • Dynamically refresh the cache when new code is needed or files change.
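
To make the workflow concrete, here is a minimal TypeScript sketch of steps 1–3. The regex stands in for real AST parsing or a local LLM, the cache is a plain JSON file, and `IndexEntry`, `indexFile`, and `buildPrompt` are hypothetical names, not an existing Cursor API.

    // Minimal sketch of the indexing + lookup flow (TypeScript / Node).
    // A regex stands in for a real per-language AST parser or a local LLM.
    import * as fs from "node:fs";

    interface IndexEntry {
      name: string;    // function name, e.g. "validateForm"
      file: string;    // file path, e.g. "form.js"
      line: number;    // 1-based line number of the declaration
      snippet: string; // the declaration line itself (or a short excerpt)
    }

    // 1. Code Indexing: scan a file (on first run or on change) and extract key elements.
    function indexFile(path: string): IndexEntry[] {
      const entries: IndexEntry[] = [];
      const decl = /function\s+([A-Za-z_$][\w$]*)\s*\(/;
      fs.readFileSync(path, "utf8").split("\n").forEach((text, i) => {
        const m = decl.exec(text);
        if (m) entries.push({ name: m[1], file: path, line: i + 1, snippet: text.trim() });
      });
      return entries;
    }

    // Store the index in a local cache (a JSON file here; SQLite or a vector DB also fit).
    function saveCache(entries: IndexEntry[], cachePath = ".code-cache.json"): void {
      fs.writeFileSync(cachePath, JSON.stringify(entries, null, 2));
    }

    // 2.-3. Query Processing: find the symbol locally and build a focused prompt for the LLM.
    function buildPrompt(entries: IndexEntry[], symbol: string, userMessage: string): string {
      const hit = entries.find(e => e.name === symbol);
      if (!hit) return userMessage; // fall back to the raw request
      return [
        `In file ${hit.file}, there's a function ${hit.name} (line ${hit.line}):`,
        hit.snippet,
        `User says: "${userMessage}"`,
      ].join("\n");
    }

    // const cache = indexFile("form.js");
    // saveCache(cache);
    // buildPrompt(cache, "validateForm", "The email field isn't validating correctly.");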

Advanced: Function Relationship Mapping

If a user asks, “Fix validateForm, which uses checkEmail,” the local model can record a dependency: validateForm → checkEmail.

  • Future queries about email checking will automatically include validateForm and checkEmail in the context.

The system builds a graph of function interactions, reducing the need for manual exploration.
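
A minimal sketch of that graph as an in-memory adjacency list (the same shape as the JSON example under “Update the Cache” below); `recordDependency` and `relatedFunctions` are hypothetical helpers, and the traversal depth is an arbitrary choice.

    // Sketch of the relationship graph as an in-memory adjacency list.
    type DependencyGraph = Map<string, Set<string>>;

    // Record a dependency such as validateForm -> checkEmail.
    function recordDependency(graph: DependencyGraph, from: string, to: string): void {
      if (!graph.has(from)) graph.set(from, new Set());
      graph.get(from)!.add(to);
    }

    // Collect every function reachable within `depth` hops, so a future query
    // about one of them can automatically pull the others into the context.
    function relatedFunctions(graph: DependencyGraph, start: string, depth = 2): Set<string> {
      const seen = new Set<string>([start]);
      let frontier = [start];
      for (let hop = 0; hop < depth; hop++) {
        const next: string[] = [];
        for (const node of frontier) {
          for (const dep of graph.get(node) ?? []) {
            if (!seen.has(dep)) { seen.add(dep); next.push(dep); }
          }
        }
        frontier = next;
      }
      return seen;
    }

    // const graph: DependencyGraph = new Map();
    // recordDependency(graph, "validateForm", "checkEmail");
    // recordDependency(graph, "checkEmail", "validateForm"); // reverse edge, so checkEmail queries pull in validateForm
    // relatedFunctions(graph, "checkEmail"); // Set { "checkEmail", "validateForm" }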

Implementation Steps

Post-Processing Hook:

  • After task completion, trigger a lightweight analysis of:
    • User queries (e.g., “Why does processOrder fail?”).
    • Code snippets sent to the LLM.
    • LLM responses (e.g., “Modify processOrder to call validatePayment”); a sketch of such a hook follows the cache example below.

Update the Cache:

  • Store relationships in a graph structure (e.g., Neo4j, Redis Graph, or a simple adjacency list):
    {  
      "validateForm": ["checkEmail", "checkPassword"],  
      "checkEmail": ["sanitizeInput"]  
    }  
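
A rough sketch of such a hook, under the assumption that a crude co-occurrence check over the query and response is acceptable; in the proposal a local LLM would instead infer the actual caller/callee direction. All names here are hypothetical.

    // Rough sketch of the post-processing hook: after a task completes, scan the
    // user query and the LLM response for names that already exist in the code
    // index and record co-occurrence links between them. Plain string matching is
    // used here; a local LLM could infer real caller/callee direction instead.
    type AdjacencyList = Record<string, string[]>;

    function updateRelationships(
      knownSymbols: string[],   // names from the code index, e.g. ["calculateTotal", "applyTax"]
      userQuery: string,
      llmResponse: string,
      graph: AdjacencyList,
    ): AdjacencyList {
      const text = `${userQuery}\n${llmResponse}`;
      const mentioned = knownSymbols.filter(name => text.includes(name));
      // Link every mentioned symbol to the others it was discussed with.
      for (const a of mentioned) {
        const links = new Set(graph[a] ?? []);
        for (const b of mentioned) if (b !== a) links.add(b);
        graph[a] = [...links];
      }
      return graph;
    }

    // updateRelationships(
    //   ["calculateTotal", "applyTax"],
    //   "Fix calculateTotal -- it's not summing tax correctly.",
    //   "Modified calculateTotal to call a new applyTax function.",
    //   {},
    // );
    // => { calculateTotal: ["applyTax"], applyTax: ["calculateTotal"] }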
    

Example Workflow

  1. User Query:
    “Fix calculateTotal — it’s not summing tax correctly.”
  2. LLM Response:
    • Modifies calculateTotal to call a new applyTax function.
  3. Post-Task Analysis:
    • Local model infers: calculateTotal → applyTax.
    • Updates cache to link these functions.
  4. Next Query:
    “Why is applyTax returning NaN?”
    • System automatically includes calculateTotal in the context.

P.S. We could enrich the cache with data from AST parsing alone, but that isn’t enough, because we can’t find connections between entities in an event-based system (e.g. EventEmitter). So it can be a hybrid approach (AST + local LLM) or a local LLM only.
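
A quick illustration of that blind spot (the function and event names are made up): the emitter and the handler never reference each other directly, so a purely AST-based call graph has no edge between them.

    // Illustration of the AST blind spot with event-based code (Node.js EventEmitter).
    // createOrder never calls sendReceipt by name, so a static call graph has no edge
    // between them; only the shared event name "orderCreated" connects them, which a
    // local LLM (or analysis of event-name strings) could pick up.
    import { EventEmitter } from "node:events";

    const bus = new EventEmitter();

    function sendReceipt(orderId: string): void {
      console.log(`receipt for ${orderId}`);
    }
    bus.on("orderCreated", sendReceipt);   // handler registered only via the event name

    function createOrder(orderId: string): void {
      bus.emit("orderCreated", orderId);   // no direct reference to sendReceipt
    }

    createOrder("A-42");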


Key Benefits

  • Token Savings: Only send critical code fragments, not entire projects.
  • Speed: Faster responses from both local filtering and smaller LLM contexts.
  • Accuracy: Less noise = fewer errors from context overload.
  • Scalability: Handles large codebases by focusing on relevant parts.

Call for Discussion

Does this align with Cursor’s roadmap? Are there technical challenges I’m missing? Thank you!

Using an AST can be an additional benefit; however, isn’t the indexing of code in Cursor already doing what you propose?

From what I see, the Agent in Cursor takes your request, analyses it for the required info, fetches the relevant code parts with RAG from your indexed codebase, then uses integrated tools to read those functions and make the adjustments.

The changed code already gets indexed again, and this is then used for the next steps.
Also, as long as you keep the chat open and continue in it, the Agent uses any gained info continuously.

From my experience, a separate ‘cache update’ is not ideal, since manual changes to the code would not get reflected in that cache; the currently available indexing of code does what you need, as it updates its index after manual or AI changes.


I’m currently testing Cursor and checking how it searches for data, and it really can find relevant code areas. Cool! I didn’t think Cursor’s vector-based indexing would be so interesting and accurate.

But let’s think about how we can make Cursor even better. What if Cursor maintained two vector stores?

  • Code Store: what Cursor already has (embeddings of every function, class, comment, etc.).
  • Chat Store: embeddings of each user message, agent reply, and code snippet you’ve discussed.

Then enrich Cursor’s request to the LLM with the original question plus a 2–3-sentence summary of the relevant chat history (e.g. “In earlier messages, we established that FooBar = analytics pipeline. Search code areas related to ‘analytics’.”). If that’s not enough, there could be an MCP server for chat history: when the LLM wants to read the history, we first search the chat store for related parts, and then the LLM can call the MCP tool with pointers to the related chat-history lines found via the embeddings.

In theory, if a user codes only through the LLM, the chat history becomes a robust database about the business model behind the code and can replace a memory bank. Moreover, over time it becomes a semantic map of your domain: you can query “What did we say about FooBar?” or “Show me our billing event flow discussion.” This chat memory could effectively replace or augment a separate product-requirements database, because it is instantly linked to your live coding session.
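
A rough TypeScript sketch of the Chat Store side. The `embed` function is a toy stand-in for whatever local embedding model would actually be used, and the sketch returns the raw top-k chat turns rather than the 2–3-sentence summary, which a local model could produce from them.

    // Rough sketch of the Chat Store and prompt enrichment.
    // Toy stand-in for a real local embedding model: fixed-size character hashing.
    async function embed(text: string): Promise<number[]> {
      const v = new Array(64).fill(0);
      for (const ch of text.toLowerCase()) v[ch.charCodeAt(0) % 64] += 1;
      return v;
    }

    interface ChatEntry { text: string; vector: number[]; }

    function cosine(a: number[], b: number[]): number {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
      return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
    }

    // Store each user message / agent reply as an embedded chat turn.
    async function addChatTurn(store: ChatEntry[], text: string): Promise<void> {
      store.push({ text, vector: await embed(text) });
    }

    // Pull the k most relevant past turns and prepend them to the new question
    // before it goes to code retrieval / the LLM. A local model could compress
    // these turns into the 2-3-sentence summary instead of quoting them verbatim.
    async function enrichRequest(store: ChatEntry[], question: string, k = 3): Promise<string> {
      const q = await embed(question);
      const history = store
        .map(entry => ({ entry, score: cosine(q, entry.vector) }))
        .sort((x, y) => y.score - x.score)
        .slice(0, k)
        .map(({ entry }) => entry.text);
      return `Relevant earlier discussion:\n${history.join("\n")}\n\nUser question: ${question}`;
    }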

Yes, a chat vector store could be practical; consider the following cases:

  • Not every user is well practiced in prompting, so requests vary from basic asks to perhaps being upset with a model’s performance. This is not about being a developer but rather about having more or less experience with AI prompting. => This would require detection of intent and outcome, plus an assessment of whether it succeeded.

  • Often the AI hallucinates; then the user directs it back on track where possible, or starts a new thread with a better prompt. => Here detection could check whether the task was completed or abandoned; manual flagging won’t be practical in the long term.

  • Code that has been changed over many threads, or even manually, differs from what was in the chat logs. Often even requirements or standards change, or new ones apply to the project.

Overall I agree there is definitely valid and important information in chat, but not all of it will be correct, relevant, and up to date.

How could we extract meaningful information and not take in errors, hallucinations, or stressed users’ actions?
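
One possible, admittedly crude, starting point along the lines of the cases above: only embed chat turns from threads whose outcome looks successful, and drop the rest. The fields and rules here are entirely hypothetical; a local model could replace them with a learned intent/outcome assessment.

    // A crude heuristic filter applied to chat turns before they are embedded.
    // The fields (accepted edits, abandonment) and rules are hypothetical.
    interface ChatTurn {
      text: string;
      threadId: string;
      editsAccepted: number;    // how many AI edits from this thread the user kept
      threadAbandoned: boolean; // the thread ended without the task being applied
    }

    function isWorthIndexing(turn: ChatTurn): boolean {
      if (turn.threadAbandoned) return false;     // likely a dead end or hallucination
      if (turn.editsAccepted === 0) return false; // nothing from the thread survived
      const frustrated = /this (is wrong|doesn't work)|try again|you broke/i.test(turn.text);
      return !frustrated;                         // keep calm, outcome-bearing turns
    }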

This is not much different from a codebase: just because a codebase has certain functionality does not mean it is all correct; parts of it may be incorrect or incomplete.