There’s been a lot of discussion about whether Auto mode quietly picks cheaper models for complex tasks. I wanted actual data instead of vibes, so I ran a comparison.
What I did: Created 5 tasks at increasing difficulty (loading spinner → refactor → React debugging → architecture design → test suite debugging). Ran each one twice: once with Auto, once with Sonnet 4.5 manually selected. Same prompts, same project, same files.
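For anyone who wants to reproduce this, here's roughly how the per-task timing can be scripted in Node/TypeScript. Note `runTask`, the model names, and the prompt path are all placeholders — the post's actual CLI invocation isn't shown, so this is just the shape of the loop, not the real command:

```typescript
// Sketch of the per-task timing loop. runTask is a stub standing in for
// however you actually invoke the agent CLI (not shown in the post).
import { performance } from "node:perf_hooks";

async function runTask(model: string, promptFile: string): Promise<void> {
  // Placeholder: a real run would spawn the CLI with the given model and
  // prompt file, then await completion. Here we just simulate work.
  await new Promise((resolve) => setTimeout(resolve, 25));
}

async function main(): Promise<void> {
  for (const model of ["auto", "sonnet-4.5"]) {
    const t0 = performance.now();
    await runTask(model, "prompts/01-loading-spinner.md");
    const seconds = (performance.now() - t0) / 1000;
    console.log(`${model}: ${seconds.toFixed(1)}s`);
  }
}

main();
```

Same prompt file for both runs, wall-clock time from start of invocation to process exit.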
Results:
| Task | Auto | Sonnet 4.5 | Faster |
|---|---|---|---|
| Loading spinner (simple) | 21.6s | 21.5s | Tie |
| Refactor 200-line function (medium) | 41.4s | 36.9s | Sonnet |
| Debug stale React data (complex) | 26.0s | 28.8s | Auto |
| Architecture design | 66.1s | 83.4s | Auto |
| Shared state test bug (reasoning) | 44.6s | 39.9s | Sonnet |
Auto was actually faster on the two hardest tasks. Output quality looked the same across all 5, and both solved every task correctly, as far as I could tell from reviewing the generated code.
The one interesting difference was on the test debugging task: Auto moved the setup into a global beforeEach, while Sonnet called .clear() on the shared state inside the affected test. Both are valid fixes, just different approaches.
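To make the two fixes concrete, here's a hypothetical reconstruction — the post doesn't show the actual project code, so the names (`cache`, `recordHit`) and the Jest-style hooks in the comments are made up:

```typescript
// Hypothetical stand-in for the shared state that leaked between tests.
const cache = new Map<string, number>();

function recordHit(key: string): number {
  const next = (cache.get(key) ?? 0) + 1;
  cache.set(key, next);
  return next;
}

// Approach A (Auto): reset everything in a global beforeEach, e.g.
//   beforeEach(() => cache.clear());
// so every test starts from a clean slate regardless of ordering.
//
// Approach B (Sonnet): the affected test clears the shared state itself:
//   cache.clear();
//   expect(recordHit("user:1")).toBe(1);

// Demonstrated without a test runner:
recordHit("user:1");              // simulates an earlier test polluting state
cache.clear();                    // both fixes boil down to this reset
console.log(recordHit("user:1")); // → 1, counting starts fresh again
```

Approach A is more defensive (no test can forget to clean up), while Approach B keeps the fix local to the one test that cares — which is probably why both passed.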
Big caveat: I ran this through the CLI, which doesn’t show which model Auto actually selected. So I can see outcomes were equivalent but can’t tell you the routing. If anyone knows how to pull that metadata I’d love to re-run this with visibility into what Auto is actually picking.
Other caveats: Small sample (5 tasks), clean TypeScript/React project, one language. A messy real-world codebase might behave differently.
Based on this though, the “Auto is cheating you” concern didn’t hold up. At least not for these task types.
Anyone else tested this? Curious if your results match or if there are specific scenarios where Auto clearly picks wrong.
