Hi Abdelrahman,
A theoretical comparison can only get you so far. A highly practical approach is to run your own personal benchmark on a real project—that’s the definitive way to discover which AI model works best for your specific workflow.
My suggestion of using Vitest for instant feedback and structuring the process in a .md file could be a good fit for your case. Here’s a plan that builds on that idea so you get the most out of your test:
Focus on a “Full-Stack” Todo App
You can set up a .md file to guide your test.
1. The Project Setup:
A solid foundation is key. You could set up a monorepo with this stack:
- Backend: Node.js with TypeScript and Hono (a modern, fast web framework).
- Frontend: React (or Vue) with Vite and TypeScript.
- Testing: Vitest throughout the project for immediate feedback (a minimal shared config is sketched below).
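To keep the feedback loop identical across both packages, a single shared Vitest config at the repo root is usually enough. This is a minimal sketch, assuming a plain npm/pnpm workspace layout; adjust paths and environments to your actual structure:

```ts
// vitest.config.ts — minimal shared test setup (a sketch; adapt to your workspace layout)
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    environment: 'node',                        // backend default; the frontend package can override with 'jsdom'
    include: ['**/*.test.ts', '**/*.test.tsx'], // pick up tests in both packages
  },
});
```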
2. The Test Workflow (Repeat for Each Model):
For each model (GPT-4o, Claude 4 Sonnet, etc.), you’ll follow the same steps and evaluate how it performs.
Task A: Backend with Test-Driven Development (TDD)
Just as suggested, have the AI write the tests first.
- Prompt 1: “Write a Vitest test for a CRUD API for todos (Create, Read, Update, Delete). The tests should fail initially since the API doesn’t exist yet.”
- Prompt 2: “Now, implement the Hono API endpoints to make all the Vitest tests pass.”
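To make Prompt 1 concrete, here’s a rough sketch of what a failing-first test could look like. The `app` export and the `/todos` routes are assumptions about code the model has not written yet; your actual prompts and file names may differ:

```ts
// todos.test.ts — failing-first CRUD test (a sketch; the module it imports does not exist yet)
import { describe, it, expect } from 'vitest';
import { app } from './app'; // hypothetical Hono app the model is asked to implement in Prompt 2

describe('Todos API', () => {
  it('creates a todo and reads it back', async () => {
    const created = await app.request('/todos', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ title: 'Write benchmark notes' }),
    });
    expect(created.status).toBe(201);

    const list = await app.request('/todos');
    const todos = await list.json();
    expect(todos).toHaveLength(1);
  });
});
```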
Task B: Frontend Implementation
- Prompt 3: “Create a React component that fetches the todos from the backend and displays them.”
- Prompt 4: “Extend the component to allow users to add new todos, mark existing ones as complete, and delete them.”
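As a reference point for judging the answers to Prompt 3, a read-only baseline could look roughly like this (the `/todos` endpoint and the `Todo` shape are assumptions carried over from the backend task):

```tsx
// TodoList.tsx — read-only baseline for Prompt 3 (Prompt 4 would add create/complete/delete)
import { useEffect, useState } from 'react';

type Todo = { id: string; title: string; completed: boolean };

export function TodoList() {
  const [todos, setTodos] = useState<Todo[]>([]);

  useEffect(() => {
    // assumes the dev server proxies /todos to the Hono backend
    fetch('/todos')
      .then((res) => res.json())
      .then(setTodos)
      .catch(console.error);
  }, []);

  return (
    <ul>
      {todos.map((todo) => (
        <li key={todo.id}>{todo.title}</li>
      ))}
    </ul>
  );
}
```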
3. The Ultimate Challenge: The 1-Million-Token Test
This is where you can really challenge models like Claude Sonnet in MAX mode.
- Prompt 5 (with the entire codebase in context): “Analyze the entire project. Refactor the backend to replace the in-memory array with persistent storage using a simple db.json file. Update all API endpoints and tests accordingly.”
This final step will truly reveal which model can grasp the context of an entire project and apply consistent changes across both the frontend and backend.
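If you want a yardstick for grading the refactor, the persistent store the models produce will likely boil down to something like this hedged sketch (the file and helper names are my assumptions, not a required solution):

```ts
// store.ts — one possible db.json-backed store (a sketch for comparison, not the expected answer)
import { readFile, writeFile } from 'node:fs/promises';

type Todo = { id: string; title: string; completed: boolean };

const DB_FILE = 'db.json';

export async function loadTodos(): Promise<Todo[]> {
  try {
    return JSON.parse(await readFile(DB_FILE, 'utf8'));
  } catch {
    return []; // first run: db.json does not exist yet
  }
}

export async function saveTodos(todos: Todo[]): Promise<void> {
  await writeFile(DB_FILE, JSON.stringify(todos, null, 2));
}
```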
4. Your Evaluation Scorecard:
At the end, you can fill out a simple scorecard in your .md file for each model:
| Criterion | GPT-4o | Claude 4 Sonnet | Sonnet 1M | Grok Code |
| --- | --- | --- | --- | --- |
| Code Correctness | / | / | / | / |
| Efficiency (# Prompts) | | | | |
| Code Quality | | | | |
| Context Awareness | | | | |
Your current approach of using Auto mode for general questions and Sonnet (with and without MAX mode) for specific tasks is very smart. This benchmark will give you concrete data on when it’s truly worth switching to a model with a larger context window.
I personally don’t like Opus: it’s too expensive, and the results are poor in combination with my rules.
Good luck with your benchmark! This is the best way to make a well-informed decision.
FYI, this isn’t my exact setup, but it’s not far off:
- Real-time research/search → Gemini 2.5 Pro
- Planning & Reasoning → Gemini 2.5 Pro, then evaluate with another model
- Coding → Claude 4 Sonnet w/ Cursor
- Write Test Cases → Gemini 2.5 Pro
- Run Test Cases → Auto Mode
- Debug → o3 or Auto Mode
Important: Basic tasks like connecting Git, Vercel, Supabase, and so on can be done with Auto mode, from my point of view.
Have fun with Cursor!