Quick Question About Cursor’s LLM Integration

Hi everyone,

I’ve been really impressed with Cursor’s ability to efficiently and accurately update code, and I’m curious about the strategies behind it. I’m working on a project where I need to integrate an LLM to help users find specific segments of text based on natural language queries.

The challenge I’m facing is that each word in my text has a unique ID, and I need the LLM’s response to include information that allows me to programmatically highlight the correct words in the editor. Does anyone have experience or insights on how to approach this?

Any advice would be greatly appreciated. Thanks!

I wanted to add some more detail to my original question so I can get the most relevant advice.

To clarify:

In my project, each word in a text is associated with a unique ID. For example, the text might be represented as:

[
  {"id": 21, "word": "My"},
  {"id": 22, "word": "favorite"},
  {"id": 23, "word": "fruits"},
  {"id": 24, "word": "are"},
  {"id": 25, "word": "bananas"},
  {"id": 26, "word": "and apples"}
]

The text can be very long, such as a transcript of an hour-long conversation with around 20,000 words. I want to integrate an LLM so that when a user makes a natural language query like "find me all fruits", the system can efficiently identify and return the IDs of all related words.
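
In case it helps, here is roughly what I have been imagining for the request and response, written as a rough Python sketch rather than anything I am committed to. The compact "id:word" serialization, the prompt wording, and the model name are all placeholders; I happen to use the OpenAI Python client here, but any chat-completion API would do. I am not at all sure this is the right structure, which is part of what I am asking below.

import json
from openai import OpenAI  # assumption: OpenAI Python client; any chat-completion API would work

client = OpenAI()

words = [
    {"id": 21, "word": "My"},
    {"id": 22, "word": "favorite"},
    {"id": 23, "word": "fruits"},
    {"id": 24, "word": "are"},
    {"id": 25, "word": "bananas"},
    {"id": 26, "word": "and"},
    {"id": 27, "word": "apples"},
]

def find_word_ids(query, words):
    # Serialize the word-ID pairs compactly, e.g. "21:My 22:favorite 23:fruits ..."
    numbered_text = " ".join(f'{w["id"]}:{w["word"]}' for w in words)
    prompt = (
        "Below is a text where every word is prefixed with its numeric ID.\n\n"
        f"{numbered_text}\n\n"
        f'Query: "{query}"\n'
        "Respond with ONLY a JSON array of the IDs of the words that match the query."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, not a recommendation
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Optimistically assumes the model really does return a bare JSON array.
    return json.loads(response.choices[0].message.content)

print(find_word_ids("find me all fruits", words))  # hoping for something like [23, 25, 27]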

What I need help with:

  1. How should I structure the input and output for the LLM to ensure it accurately identifies and returns the correct IDs based on natural language queries, especially in the context of very long texts?
  2. Are there best practices for keeping the LLM fast and accurate on very large texts, while ensuring the output is precise and easy to integrate with a system that highlights text segments by their IDs? (A rough sketch of my current idea is below.)
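
The only idea I have come up with so far for the second question is to split the word list into overlapping chunks, run the query against each chunk, and merge the returned IDs. The chunk size and overlap below are numbers I made up, and the code reuses find_word_ids from the sketch above. I am not sure this is a sensible way to handle 20,000 words, which is really what I am asking about.

def chunk_words(words, chunk_size=800, overlap=50):
    # Yield overlapping windows over the word list so matches near a boundary are not lost.
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        yield words[start:start + chunk_size]

def find_ids_in_long_text(query, words):
    # Query each chunk separately and merge the IDs; reuses find_word_ids from the sketch above.
    matched = set()
    for chunk in chunk_words(words):
        matched.update(find_word_ids(query, chunk))
    return sorted(matched)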

Example Scenario:

  • Input Text:

    [
      {"id": 21, "word": "My"},
      {"id": 22, "word": "favorite"},
      {"id": 23, "word": "fruits"},
      {"id": 24, "word": "are"},
      {"id": 25, "word": "bananas"},
      {"id": 26, "word": "and apples"}
    ]
    

    (Note: In practice, the text can be up to 20,000 words long, representing a full conversation.)

  • User Query: "find me all fruits"

  • Expected Output: [23, 25, 27] (assuming the query should match related fruit words like "bananas" and "apples", not just the literal word "fruits"; a sketch of how I would use these IDs on my side follows below)
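
For completeness, everything I plan to do with the model's output on my side is a simple lookup like the one below, so the hard part is really getting reliable IDs out of the model in the first place.

def ids_to_words(ids, words):
    # Map the returned IDs back to word entries; ignore any IDs the model invented.
    by_id = {w["id"]: w["word"] for w in words}
    return [(i, by_id[i]) for i in ids if i in by_id]

# e.g. ids_to_words([23, 25, 27], words) -> [(23, 'fruits'), (25, 'bananas'), (27, 'apples')]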

I hope this clarifies what I’m looking for. Any insights or suggestions on how to handle large texts efficiently with an LLM would be greatly appreciated!

Thanks again!