Same refactor, 5 models, very different "extras"

I ran a quick test: one messy Express file (hardcoded creds, duplicated auth, no structure), same prompt to all 5 Pro models. “Refactor this into a clean structure.” Nothing else.

Every model handled the basics fine. What I didn’t expect was all the stuff they added that nobody asked for.

What each model added unprompted

Opus caught that my JWTs had no expiry and added expiresIn: '24h'. It also fixed a status code bug (original used 400 for invalid tokens, should be 401) and created a .gitignore and .env file.

Sonnet caught the same JWT issue but made it configurable (JWT_EXPIRES_IN || '24h'). It also threw in a /health endpoint and a README. If you’re about to deploy behind a load balancer, that health check saves you a step.
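A minimal sketch of Sonnet's two additions; `healthCheck` is a hypothetical name, and the Express wiring is shown only in a comment:

```javascript
// Sonnet's variant: the expiry is configurable via env var, with a
// sane default, instead of being hardcoded.
const JWT_EXPIRES_IN = process.env.JWT_EXPIRES_IN || '24h';

// The unprompted health check; with Express this would be mounted as
// app.get('/health', healthCheck) so a load balancer can probe it.
function healthCheck(req, res) {
  res.status(200).json({ status: 'ok', uptime: process.uptime() });
}
```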

GPT behaved in a way that felt architectural. Custom HttpError class, an asyncHandler wrapper (Express doesn’t catch async errors by default, so this one actually matters), and it split app.js from server.js for testability.
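Both patterns are small enough to sketch. This is the common shape, assuming Express 4 semantics, not GPT's exact output:

```javascript
// An error that carries its HTTP status, so one error-handling
// middleware can turn any thrown HttpError into the right response.
class HttpError extends Error {
  constructor(status, message) {
    super(message);
    this.status = status;
  }
}

// Express 4 does not forward rejected promises from async handlers to
// error middleware; this wrapper catches them and calls next(err).
const asyncHandler = (fn) => (req, res, next) =>
  Promise.resolve(fn(req, res, next)).catch(next);

// The app.js / server.js split: app.js exports the configured app,
// server.js calls app.listen(). Tests can then import the app
// without binding a port.
```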

Gemini just did the refactor. Nothing extra added. It was also the slowest at ~4m 21s, which is hard to justify when the others add more and finish faster.

Auto picked Opus.

Timing

| Model | Time |
|---|---|
| Sonnet | ~60s |
| GPT | ~1m 21s |
| Opus | ~2m |
| Gemini | ~4m 21s |

Anyway

I’ve been defaulting to Sonnet for most things. Fast, practical additions. Opus when I care about security stuff. GPT if I want it to over-engineer things (sometimes that’s what you want).

The JWT expiry fix and the asyncHandler are things I would’ve missed in a real project, honestly. Worth diffing the output even on simple refactors.

I tested this with Cursor CLI agent mode. Input and outputs committed to git before/after each run.

Try something different.

If you prefer speed, run the same task through ALL the models, IN ORDER of speed, fastest first. Each model amends the result with the missing parts only it catches. Since reading input tokens is much faster than generating output tokens, the later, slower models don't produce much but still contribute details.

If you prefer low cost, run the same task through ALL the models, IN ORDER of cost, cheapest first. Each later model only adds missing details, and since input tokens are cheaper than output tokens, most of the generated tokens come from the low-cost model.
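The speed ordering can be sketched as a plain shell loop. `run_model` here is a hypothetical stub, not a real Cursor CLI command; substitute whatever invocation drives each model, and commit between passes (as the original test did) so each model's contribution stays diffable:

```shell
# run_model is a hypothetical placeholder; replace its body with the
# real CLI call that runs the prompt through one named model.
run_model() { echo "pass: $1"; }

PROMPT="Refactor this into a clean structure. Only add what is missing."

# Speed ordering: fastest first, so the slower models mostly read the
# already-refactored code and add only what they alone catch.
for model in sonnet gpt opus gemini; do
  run_model "$model" "$PROMPT"
done
```

The cost ordering is the same loop with the model list sorted by price instead of speed.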


I hadn’t thought about chaining them in sequence like that. So the idea is each model just fills in what the previous one missed, and you’re optimizing for whether the cheap model or the expensive one does the heavy lifting? How many models deep do you actually run this before the improvements plateau?

Only one way to find out. But usually by about 4 levels deep, ending in Codex XH or Opus, there are no changes left, or it’s just semantic cleanup. An empty result is still a positive: it means Opus or Codex did a quality check and confirmed the functionality is implemented satisfactorily.
