ChatGPT 5.1 Codex High vs Gemini 3 Pro vs Claude Sonnet 4.5 for coding

What’s everyone’s experience so far with these three models? I have seen reports that ChatGPT 5.1 Codex High is testing highest for coding, and people also report good results with Gemini 3 Pro. But Sonnet is pretty reliable for most use cases, so I would like more information before trying 5.1 and 3 Pro for complex tasks. Also, how do the costs compare?

Article on subject:

How I work with these models:

  1. Task: Clarify the problem, context, and requirements; choose an approach
    Model: GPT 5.1 High Fast
    Why: Acts like a senior architect: defines constraints, risks, and the overall solution idea so you don’t waste effort going in the wrong direction.

  2. Task: High-level planning and architectural/technical design
    Model: GPT 5.1 High Fast
    Why: Best suited for thinking through architecture, module contracts, implementation options, and the pros/cons of each approach.

  3. Task: Detailed implementation plan broken down by files and steps (task breakdown)
    Model: Composer 1
    Why: Based on the given architecture, turns it into a clear step-by-step plan without excessive “creativity”.

  4. Task: Finding the right files, functions, code fragments, and explaining the current implementation
    Model: Composer 1
    Why: Quickly navigates the project and gives understandable explanations without heavy analytical overhead.

  5. Task: Draft code and local changes (1–2 files, without changing public interfaces)
    Model: Composer 1
    Why: Like a solid junior dev: writes prototypes and local changes reliably when the scope is clearly defined and it’s forbidden to touch anything extra.

  6. Task: Fast bulk edits following a simple pattern (renames, small mechanical changes)
    Model: Composer 1
    Why: Good for mechanical work within a limited file scope when you explicitly define the boundaries of changes.

  7. Task: Reviewing draft code from Composer 1 and making targeted improvements
    Model: Codex 5.1 High
    Why: Acts as a senior reviewer: finds non-obvious issues, improves readability, and checks consistency with the architectural plan.

  8. Task: Refactoring that touches multiple modules and public interfaces
    Model: Codex 5.1 High
    Why: Sees the big picture, can propose a refactoring plan, preserve/improve architecture, and minimize side effects.

  9. Task: Designing and writing tests (unit/integration), analyzing coverage
    Model: Codex 5.1 High
    Why: Better at designing test cases, edge-case scenarios, and updating existing tests to match new changes.

  10. Task: Diff analysis and producing a “what changed and why” report
    Model: Codex 5.1 High
    Why: Can describe changes concisely and in a structured way, which reduces the risk of hidden side effects and makes code review easier for you.

  11. Task: Final high-level check of architecture and risks (after major changes)
    Model: GPT 5.1 High Fast
    Why: Looks at the system from above: compares the final implementation with the original goals, architectural decisions, and long-term impact on the project.
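
If you want to keep this routing in a script or tool config instead of in your head, here is a minimal sketch of the list above as a typed lookup table. The stage names and the `stagesFor` helper are my own hypothetical labels, not anything from Cursor or the model vendors; only the model names come from the list.

```typescript
// Model labels copied from the workflow list above.
type Model = "GPT 5.1 High Fast" | "Composer 1" | "Codex 5.1 High";

// Hypothetical short names for the 11 tasks, in the same order as the list.
type Stage =
  | "clarify"
  | "architecture"
  | "task-breakdown"
  | "code-navigation"
  | "draft-code"
  | "bulk-edits"
  | "review"
  | "cross-module-refactor"
  | "tests"
  | "diff-report"
  | "final-check";

// Which model handles which stage, as described in the list.
const modelForStage: Record<Stage, Model> = {
  "clarify": "GPT 5.1 High Fast",
  "architecture": "GPT 5.1 High Fast",
  "task-breakdown": "Composer 1",
  "code-navigation": "Composer 1",
  "draft-code": "Composer 1",
  "bulk-edits": "Composer 1",
  "review": "Codex 5.1 High",
  "cross-module-refactor": "Codex 5.1 High",
  "tests": "Codex 5.1 High",
  "diff-report": "Codex 5.1 High",
  "final-check": "GPT 5.1 High Fast",
};

// Reverse lookup: all stages assigned to a given model.
function stagesFor(model: Model): Stage[] {
  return (Object.keys(modelForStage) as Stage[]).filter(
    (stage) => modelForStage[stage] === model,
  );
}
```

For example, `stagesFor("Composer 1")` lists the four mechanical stages, which makes it easy to see where the cheaper model carries the load.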

Yesterday I wanted to try these models on the same task:
GPT 5.1: didn’t get the task right.
GPT 5.1 Codex High: did the task, but with a lot of changes.
Sonnet 4.5: got the task right, with a much better result and far fewer changes than Codex.
Gemini 3 Pro: got the task right, with a slightly better result than Sonnet 4.5 and even fewer changes.

Unfortunately this used up quota unnecessarily, but I will experiment more and see which model does better across multiple scenarios.

Scenario:

I had a page built with React. The layout is split into two columns: on the left side there was a table displaying a list of items, with pagination at the end, and on the right side there were some cards displaying information.

This was the task:
Update the table so that it always fits the viewport and the pagination is always displayed at the bottom. When changing between pages with different numbers of items, the table size should not change, to prevent layout jumps. If the table has more items than fit in the viewport, the table should be scrollable.
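
For anyone who wants to reproduce the comparison, here is a rough sketch of one common way to meet these requirements with a flex column layout. It is not what any of the models produced, and all names (`page`, `leftColumn`, `tableScrollArea`, `paginationBar`, `pageSlice`) are hypothetical. The styles are plain objects so the snippet runs without React; in a real project they could be passed as `style` props.

```typescript
// Plain style objects; in a React project these could be typed as React.CSSProperties.
type Style = Record<string, string | number>;

// The page fills the viewport and never scrolls as a whole.
const page: Style = { display: "flex", height: "100vh", overflow: "hidden" };

// Left column: a vertical flex container holding the table area and the pagination.
// minHeight: 0 lets it shrink below its content height instead of overflowing.
const leftColumn: Style = {
  display: "flex",
  flexDirection: "column",
  flex: 1,
  minHeight: 0,
};

// The table wrapper takes all remaining height and scrolls internally.
// Its size depends on the viewport, not on the number of rows, so switching
// to a page with fewer items does not change the layout.
const tableScrollArea: Style = { flex: "1 1 0", minHeight: 0, overflowY: "auto" };

// Pagination keeps its natural height and sits below the scroll area.
const paginationBar: Style = { flex: "0 0 auto" };

// Rows for a 1-based page number; the last page may hold fewer than pageSize items.
function pageSlice<T>(items: readonly T[], pageNumber: number, pageSize: number): T[] {
  const start = (pageNumber - 1) * pageSize;
  return items.slice(start, start + pageSize);
}
```

The key detail is `minHeight: 0` on the flex children: without it, a flex item refuses to shrink below its content height, so a long table pushes the pagination off screen instead of scrolling.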

There seems to be a consensus that Gemini 3 Pro is significantly better at designing complex front ends than Claude Sonnet 4.5, but also more expensive. Other than that, my impression is that they are fairly equal. I haven’t seen a detailed comparison with ChatGPT 5.1 models yet.