Understanding LLM Token Usage

Here is a detailed explanation of LLM (AI) token usage in Cursor:

Tokens are used every time you interact with AI models in Cursor. For transparency, token counts are shown exactly as processed and reported by the AI provider. You can check your usage in Dashboard > Usage and, if you have usage-based pricing enabled, also in Dashboard > Billing. The dashboard is being improved to show all AI usage for a single request on one line, even when multiple tool calls are combined.

Tokens are divided into four groups:

  • Input tokens: The original user request, including attached rules, files, MCPs, … Input tokens are further split into:

    • Input with cache write: Information that needs to be cached for efficient processing of future steps.
    • Input without cache write: Information used only for a single step and not cached.

  • Output tokens: The AI’s responses, such as code, chat replies, and the tool calls it makes.

  • Cache Write tokens: Messages and tool call results that are saved in the AI provider’s temporary cache to make future requests more efficient.

  • Cache Read tokens: Cached tokens (chat history and context) reused in later steps to generate new AI output. These are cheaper, usually costing about 10–25% of the input token price.
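
Conceptually, a request’s usage is just those four counters summed into the displayed total. Here is a minimal sketch in Python, assuming hypothetical field names and numbers (this is not Cursor’s actual data model):

```python
from dataclasses import dataclass

@dataclass
class TokenUsage:
    """Hypothetical per-request breakdown; field names are illustrative only."""
    input_no_cache: int     # input used only for this step, not cached
    input_cache_write: int  # input that is also written to the provider's cache
    output: int             # tokens the model generates
    cache_read: int         # previously cached tokens re-read on later steps

    @property
    def total(self) -> int:
        return (self.input_no_cache + self.input_cache_write
                + self.output + self.cache_read)

# Example: one agent step that re-reads earlier context from the cache
step = TokenUsage(input_no_cache=1_200, input_cache_write=3_800,
                  output=900, cache_read=10_000)
print(step.total)  # 15900
```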

Each token group is counted and priced at the provider’s API cost × 1.2 (a 20% margin), and the total token count is displayed.
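
As a rough illustration of that formula, here is a small sketch; the per-million-token prices are placeholders, not real provider rates:

```python
# Placeholder per-million-token prices (USD); real rates vary by provider and model.
PRICES_PER_MTOK = {
    "input": 3.00,
    "output": 15.00,
    "cache_write": 3.75,
    "cache_read": 0.30,   # ~10% of the input price in this made-up example
}
MARGIN = 1.2  # billed as provider API cost * 1.2

def request_cost_usd(tokens_by_group: dict[str, int]) -> float:
    """Sum the provider cost of each token group, then apply the 1.2 margin."""
    provider_cost = sum(
        count / 1_000_000 * PRICES_PER_MTOK[group]
        for group, count in tokens_by_group.items()
    )
    return provider_cost * MARGIN

print(round(request_cost_usd({"input": 5_000, "output": 900, "cache_read": 10_000}), 4))
# 0.0378 USD for this hypothetical request
```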

How a request flows:

  1. Your initial request, along with context, is sent as input tokens.
  2. AI processes the input and generates output (output tokens).
  3. Tool calls may be triggered, such as reading files or searching code. These calls use the current context and add more tokens to the context.
  4. Each tool call’s response is cached, increasing the cache size.
  5. If more steps are needed, the process repeats: new input is sent, more output is generated, and additional tool calls may occur.
  6. Follow-up requests in a chat work the same way. Each step adds to the cache, and cache read tokens accumulate as the context grows.
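
A simplified sketch of that loop (the model and tool API below are assumptions for illustration, not Cursor’s actual internals):

```python
def run_agent_turn(user_request, context, model, tools, max_steps=10):
    """Rough sketch of the multi-step request flow described above."""
    # Step 1: the request plus attached context is sent as input tokens
    messages = list(context) + [user_request]
    reply = None
    for _ in range(max_steps):
        # Step 2: the model produces output tokens (text and/or tool calls);
        # `model.generate` is a hypothetical API used only for this sketch
        reply = model.generate(messages)
        if not reply.tool_calls:
            break                              # nothing more to do
        # Steps 3-4: run each tool; its result is appended to the conversation
        # and cached, so later steps re-read it as cheaper cache read tokens
        for call in reply.tool_calls:
            result = tools[call.name](**call.arguments)
            messages.append(result)            # context (and cache) grow each step
        # Step 5: loop again with the enlarged context
    return reply
```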

Example:

  • A request starts with 5,000 input tokens, which are processed and cached.
  • A tool call uses the cached context and adds another 5,000 tokens, which are also cached.
  • After two steps, the cache holds 10,000 tokens. When these are used in the next API call to the AI provider, they are counted as cache read tokens at a reduced cost (10–25% of the input token price, depending on the provider).
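
In numbers, assuming the low end of that range (a 10% cache read rate and normalized prices):

```python
input_price = 1.0        # normalize the fresh-input price to 1.0 per token
cache_read_price = 0.10  # assume the low end of the 10-25% range

step_1 = 5_000                     # initial request, processed and cached
step_2 = 5_000                     # tool call result, also cached
cached = step_1 + step_2           # 10,000 tokens now sitting in the cache

fresh_cost = cached * input_price            # 10,000.0 if re-sent as new input
cache_read_cost = cached * cache_read_price  # 1,000.0 when read from cache
print(cache_read_cost / fresh_cost)          # 0.1 -> 90% cheaper than fresh input
```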

If you see high token counts:

  • Input tokens: Too much context (rules, files, etc.) is attached. Reduce by removing unnecessary parts.

  • Output tokens: The AI’s response is large. Limit output if possible.

  • Cache Write tokens: A lot of context is being processed and cached. Streamline your instructions and attachments.

  • Cache Read tokens: These increase with long or complex chats, as context grows with each step. This is normal for chats with multiple tool calls and follow-up requests, but can be reduced by starting new chats for new tasks.

Tips to reduce token usage:

  • Attach only what’s necessary; modern AI can fetch relevant code efficiently.
  • Only enable MCP tools you need.
  • Keep rules short and focused.
  • Make tasks specific and focused.
  • Start a new chat for a new task.

Model selection:

  • Use simpler models (Auto) for basic tasks like log checks or code analysis.
  • Use stronger models (e.g. Sonnet) only when needed for complex coding.
  • “Thinking” models use significantly more tokens than standard ones.
  • Use the strongest models (e.g. Opus) only when strong models are not sufficient.

Note: Switching AI providers or models in the middle of a chat can increase token usage, since the new provider doesn’t have the previous chat steps in cache and must process them as new input. For best results, switch models or providers in a new chat.
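
For a back-of-the-envelope feel (assuming a hypothetical 20,000-token chat history, a normalized input price of 1.0, and a 10% cache read rate):

```python
history = 20_000         # hypothetical tokens of chat history built up so far
input_price = 1.0        # normalized price for fresh input tokens
cache_read_price = 0.10  # assumed cache read rate (low end of 10-25%)

# Continuing with the same model: history is replayed from the provider's cache
same_model = history * cache_read_price   # 2,000 units

# Switching provider/model mid-chat: no cache hit, so the same history is
# re-sent (and re-cached) as fresh input on the next step
after_switch = history * input_price      # 20,000 units

print(after_switch / same_model)          # 10.0x more for that one step
```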
