(Continuously Updated) My Real-Time Review of Grok 4

I’ve been trying to develop a CAT tool for personal use for almost a week now. I wrote an incremental integration test with 48 tasks to track progress and follow a TDD-style development workflow.

I’ve hit a wall at Task 20 — none of the available LLMs have managed to pass it so far. I’ve tried everything, both individually and in relay races: Gemini 2.5 Pro, Gemini 2.5 Pro MAX, Claude 4, Claude 4 MAX, o4-mini, o3, o3-pro (Manual mode one-time situation analysis), and Auto. Auto and o4-mini break previously working logic right out of the gate. o3 messes up the terminal with commands I don’t even recognize.

The project is private, so I can’t share anything more detailed, but I’ll be over the moon if my magic wand gets released tomorrow night.

Well… Maybe it’s time to stop. For yet

2025-07-10_01-24-06

[MOD - Condor - removed due to Forum Guideline violation]

One more request and I just know I’ll fix it!

[MOD - Condor - removed due to Forum Guideline violation]

Well, I think I have Grok-ready :cat_with_wry_smile:

I’m sorry to tell you the bad news that he will output "Thinking… Thinking… Thinking… Thinking… Thinking… " and then end the conversation.

Noway! He’s tripping over edit_tool!

Bart Simpsons vibes

Well

  • In regular mode it stops because of edit_tool
  • Then I corrected the promt and It’s spend 1.24$ because of incorrect tabs editing

Trying in MAX mode with more context…

And after ~10 minutes and 0.96$ at Grok 4 MAX we have a regression


Same prompt and codebase.
Gemini 2.5 Pro MAX after 15 minutes and 0.8$ and he’s still working

UPD: surrender after 25 minutes and 1.25$
UPD2: after 25 minutes, the number of passed tests increased to 446

Interesting. And to be expected.

Cursor + Model creator usually need to collab on agentic behaviour.

Cursor is SO broken with Grok 4. So bad. The model just doesn’t think.

Hey, as with every major model release, we are working to improve Grok 4 currently, both on available capacity from xAI and on it’s stability within Cursor. Each model requires some custom tuning of it’s system prompt to ensure it behaves well inside of Cursor - they very rarely can be “dropped in” and work immediately.

By the way: when models can’t correctly apply changes via edit_tool or think they can’t, is this the problem of the applying model or the LLM itself, or both?

Grok has never been good, 4 is no better.

Gemini fails often.

Just over a week ago, I didn’t even know how to use tests.

Now I’m spending hours optimizing a four-layered parameterized integration test for idempotency, trying to get it to run in less than 11 minutes :zany_face:
(and that’s with only 47% of the test currently executing).

You complain and only complain about absolutely every LLM. Either your tasks are too complicated, or you have problems with prompt-engineering. Try using my Agent Compass. It will be interesting to see if it can help you.