Hey, thanks for the queries here.
While we’ve not yet posted the structure of our rate limiting system, I can explain how the models work!
The impact a request has on your rate limits correlates closely with what it would cost via an LLM provider’s API (there’s a rough sketch of this below):
- How long the specific message is
- How long the past conversation is
- How many files are attached
- How much the specific model costs (thinking = higher cost)
- Whether MAX mode (longer context cap) is enabled
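
To make that concrete, here’s a rough mental model of how those factors might combine. This is a sketch for intuition only, not our actual formula - every field name, weight, and multiplier in it is invented:

```typescript
// A rough, illustrative cost model - NOT the real implementation.
// All names and numbers here are invented for intuition only.
interface RequestInfo {
  messageTokens: number;      // length of the new message
  historyTokens: number;      // length of the prior conversation
  attachmentTokens: number;   // tokens contributed by attached files
  modelCostPerToken: number;  // provider price; thinking models are higher
  maxMode: boolean;           // MAX mode raises the context cap
}

// Estimated rate-limit impact: roughly (total input tokens) x (model price),
// with a hypothetical surcharge when MAX mode is enabled.
function estimateRateLimitImpact(req: RequestInfo): number {
  const totalTokens =
    req.messageTokens + req.historyTokens + req.attachmentTokens;
  const maxModeMultiplier = req.maxMode ? 2 : 1; // invented multiplier
  return totalTokens * req.modelCostPerToken * maxModeMultiplier;
}
```

So a short prompt to a cheap model barely registers, while a long conversation with files attached, on a thinking model with MAX mode on, adds up quickly.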
While the new system doesn’t match or reuse our old request-based pricing, the old request counts are still a good indicator of each model’s relative cost!
Claude 4 Sonnet Thinking is quite an expensive model with a decent context window, which can make it a higher-cost option - previously it used 2 requests under the old plan.
o3 is a slower but still quite intelligent model with a lower usage cost - previously 1 request - so it will have a lower impact on your usage.
GPT-4.1 and Gemini Flash are both unlimited, and don’t touch your rate limits at all.
Finally, I’d highly recommend using Auto, as it always provides a “premium” model (Claude 4, Gemini 2.5 Pro, etc.) at a significantly reduced impact on your rate limits - this is because we can intelligently route to models with lower usage costs and lower current utilisation!
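
If you’re curious what “intelligently route” could look like, here’s a toy sketch - again, the names, numbers, and scoring are invented, not our actual router. The idea is simply to prefer models that are both cheap and lightly loaded right now:

```typescript
// Toy router sketch - invented, not the real routing logic.
interface ModelOption {
  name: string;
  usageCost: number;   // relative rate-limit impact per request
  utilisation: number; // current load, 0 (idle) to 1 (saturated)
}

// Prefer models that are both cheap and lightly loaded: score each
// option by cost scaled up with its current utilisation, take the min.
function pickAutoModel(options: ModelOption[]): ModelOption {
  return options.reduce((best, m) =>
    m.usageCost * (1 + m.utilisation) < best.usageCost * (1 + best.utilisation)
      ? m
      : best
  );
}

const choice = pickAutoModel([
  { name: "Claude 4 Sonnet", usageCost: 2, utilisation: 0.9 },
  { name: "Gemini 2.5 Pro", usageCost: 2, utilisation: 0.3 },
]);
// -> Gemini 2.5 Pro: same tier, but less loaded right now.
```

That’s why Auto can hand you a premium model while costing you less against your limits than picking that same model by hand.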