How to reduce cache reads

Hi @tiantianaixuexi, there are a few ways to reduce cache reads.

As @Ra.in says, they are important for reducing your cost with no quality loss.

Why do cache reads happen?

  • When you submit a request to the Agent, it is processed by the AI provider into tokens and you receive a response from the AI. Those tokens are cached for the next replies, which reduces their cost by 90%.
  • Each AI tool call and user request is a separate API request to the AI provider. Since only the new part of the chat or tool call needs to be processed, the cached tokens are billed at the 90% discount. The AI still processes the whole context to create a response.
  • Not using the cache would mean the same amount of tokens is billed at the full input price, roughly 9x more than the cached token cost (rough numbers in the sketch below).
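
To make that concrete, here is a minimal cost sketch, assuming cache reads are billed at about 10% of the full input price. The dollar rate and the request_cost helper are placeholders for illustration, not actual provider pricing or API:

```python
# Rough cost comparison of cached vs. uncached input tokens.
# Prices are placeholders, not actual provider pricing.
INPUT_PRICE_PER_MTOK = 3.00                           # hypothetical full input price per 1M tokens
CACHED_PRICE_PER_MTOK = INPUT_PRICE_PER_MTOK * 0.10   # cache reads billed at ~10% of that

def request_cost(cached_tokens: int, new_tokens: int) -> float:
    """Cost of one request: cached context at the discounted rate, new text at full price."""
    return (cached_tokens / 1e6) * CACHED_PRICE_PER_MTOK + (new_tokens / 1e6) * INPUT_PRICE_PER_MTOK

# A follow-up message in a long chat: 150k tokens already cached, 2k tokens of new text.
print(request_cost(cached_tokens=150_000, new_tokens=2_000))   # with cache
print(request_cost(cached_tokens=0, new_tokens=152_000))       # same tokens without cache
```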

When is it an issue?

  • If a chat thread gets too long, it carries a large context, so each request consumes a lot of accumulated tokens (see the sketch after this list).
  • A very long chat can also confuse the AI, because the context ends up holding too much conflicting information.
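
As a toy illustration of how usage accumulates: every turn re-reads the whole prior context from the cache, so cache reads keep growing as the chat gets longer. The turn sizes below are made-up numbers:

```python
# Toy illustration: each turn re-reads the whole prior context from the cache,
# so cache reads accumulate as the chat grows. Turn sizes are made-up numbers.
turn_sizes = [4_000, 3_000, 6_000, 2_000, 5_000]   # new tokens added on each turn

context = 0
total_cache_reads = 0
for i, new_tokens in enumerate(turn_sizes, start=1):
    total_cache_reads += context    # existing context is read back from the cache
    context += new_tokens           # the new turn is appended to the context
    print(f"turn {i}: context={context:,} tokens, cumulative cache reads={total_cache_reads:,}")
```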

Solution:

  • Keep each chat focused on a single task.
  • Use simpler models for simpler tasks.
  • Use a large-context model like Sonnet 4 1M only if the regular Sonnet 4 model cannot fit the required context in 200k tokens. Note that tokens over 200k cost 2x as much as tokens within the regular 200k context (rough arithmetic below).
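
For the long-context pricing, here is a minimal sketch of the arithmetic, assuming tokens beyond 200k are billed at 2x the regular input rate. The rate and the helper name are hypothetical:

```python
# Rough input-cost sketch for a long-context request (e.g. Sonnet 4 1M).
# Tokens up to 200k are billed at the regular rate, tokens beyond 200k at 2x.
# The rate is a placeholder, not actual pricing.
REGULAR_RATE_PER_MTOK = 3.00    # hypothetical price per 1M input tokens

def long_context_cost(total_tokens: int) -> float:
    regular = min(total_tokens, 200_000)
    overflow = max(total_tokens - 200_000, 0)
    return (regular / 1e6) * REGULAR_RATE_PER_MTOK + (overflow / 1e6) * REGULAR_RATE_PER_MTOK * 2

print(long_context_cost(150_000))   # fits entirely in the regular 200k window
print(long_context_cost(500_000))   # the extra 300k tokens are billed at the doubled rate
```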

Additional details about token usage:
