Same refactor, 5 models, very different "extras"

I ran a quick test: one messy Express file (hardcoded creds, duplicated auth, no structure), same prompt to all 5 Pro models. “Refactor this into a clean structure.” Nothing else.

Every model handled the basics fine. What I didn’t expect was all the stuff they added that nobody asked for.

What each model added unprompted

Opus caught that my JWTs had no expiry and added expiresIn: '24h'. It also fixed a status code bug (original used 400 for invalid tokens, should be 401) and created a .gitignore and .env file.

Sonnet caught the same JWT issue but made it configurable (JWT_EXPIRES_IN || '24h'). It also threw in a /health endpoint and a README. If you’re about to deploy behind a load balancer, that health check saves you a step.
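A minimal sketch of Sonnet's two additions; `healthCheck` is a hypothetical name, and the Express wiring is shown only in a comment:

```javascript
// Sonnet's variant: the expiry is configurable via env var, with a
// sane default, instead of being hardcoded.
const JWT_EXPIRES_IN = process.env.JWT_EXPIRES_IN || '24h';

// The unprompted health check; with Express this would be mounted as
// app.get('/health', healthCheck) so a load balancer can probe it.
function healthCheck(req, res) {
  res.status(200).json({ status: 'ok', uptime: process.uptime() });
}
```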

GPT behaved in a way that felt architectural. Custom HttpError class, an asyncHandler wrapper (Express doesn’t catch async errors by default, so this one actually matters), and it split app.js from server.js for testability.
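Both patterns are small enough to sketch. This is the common shape, assuming Express 4 semantics, not GPT's exact output:

```javascript
// An error that carries its HTTP status, so one error-handling
// middleware can turn any thrown HttpError into the right response.
class HttpError extends Error {
  constructor(status, message) {
    super(message);
    this.status = status;
  }
}

// Express 4 does not forward rejected promises from async handlers to
// error middleware; this wrapper catches them and calls next(err).
const asyncHandler = (fn) => (req, res, next) =>
  Promise.resolve(fn(req, res, next)).catch(next);

// The app.js / server.js split: app.js exports the configured app,
// server.js calls app.listen(). Tests can then import the app
// without binding a port.
```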

Gemini just did the refactor. Nothing extra added. It was also the slowest at ~4m 21s, which is hard to justify when the others add more and finish faster.

Auto picked Opus.

Timing

| Model | Time |
|---|---|
| Sonnet | ~60s |
| GPT | ~1m 21s |
| Opus | ~2m |
| Gemini | ~4m 21s |

Anyway

I’ve been defaulting to Sonnet for most things. Fast, practical additions. Opus when I care about security stuff. GPT if I want it to over-engineer things (sometimes that’s what you want).

The JWT expiry fix and the asyncHandler are things I would’ve missed in a real project, honestly. Worth diffing the output even on simple refactors.

I tested this with Cursor CLI agent mode. Input and outputs committed to git before/after each run.

Try something different.

If you prefer speed, run the same task through ALL the models, IN ORDER of speed, fastest first. Each model amends the result with the missing parts only it catches. Since reading input tokens is much faster than generating output tokens, the later, slower models don't produce much but still contribute details.

If you prefer low cost, run the same task through ALL the models, IN ORDER of cost, cheapest first. Each later model only adds missing details, and since input tokens are cheaper than output tokens, most of the generated tokens come from the low-cost model.
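The speed ordering can be sketched as a plain shell loop. `run_model` here is a hypothetical stub, not a real Cursor CLI command; substitute whatever invocation drives each model, and commit between passes (as the original test did) so each model's contribution stays diffable:

```shell
# run_model is a hypothetical placeholder; replace its body with the
# real CLI call that runs the prompt through one named model.
run_model() { echo "pass: $1"; }

PROMPT="Refactor this into a clean structure. Only add what is missing."

# Speed ordering: fastest first, so the slower models mostly read the
# already-refactored code and add only what they alone catch.
for model in sonnet gpt opus gemini; do
  run_model "$model" "$PROMPT"
done
```

The cost ordering is the same loop with the model list sorted by price instead of speed.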


I hadn’t thought about chaining them in sequence like that. So the idea is each model just fills in what the previous one missed, and you’re optimizing for whether the cheap model or the expensive one does the heavy lifting? How many models deep do you actually run this before the improvements plateau?

Only one way to find out. But usually by about 4 levels deep, ending in Codex XH or Opus, there are no changes left, or it’s just semantic cleanup. An empty result is still a positive: it means Opus or Codex did a quality check and confirmed the functionality is implemented satisfactorily.
