(Continuously Updated) My Real-Time Review of Grok 4

I’ve been trying to develop a CAT tool for personal use for almost a week now. I wrote an incremental integration test with 48 tasks to track progress and follow a TDD-style development workflow.
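To give a rough idea of the setup (a simplified sketch, not my actual suite; `run_task` is just a stand-in here so it executes):

```python
import pytest

# 48 incremental tasks; test N re-runs tasks 1..N, so a change that
# breaks earlier logic fails immediately instead of hiding downstream.
TASKS = [f"task_{i:02d}" for i in range(1, 49)]

def run_task(name: str) -> bool:
    """Stand-in for the real task runner so this sketch executes."""
    return True

@pytest.mark.parametrize("index", range(len(TASKS)))
def test_incremental(index: int) -> None:
    for name in TASKS[: index + 1]:
        assert run_task(name), f"{name} regressed"
```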

I’ve hit a wall at Task 20: none of the available LLMs have managed to pass it so far. I’ve tried everything, both individually and in relay races: Gemini 2.5 Pro, Gemini 2.5 Pro MAX, Claude 4, Claude 4 MAX, o4-mini, o3, o3-pro (a one-time situation analysis in Manual mode), and Auto. Auto and o4-mini break previously working logic right out of the gate. o3 messes up the terminal with commands I don’t even recognize.

The project is private, so I can’t share anything more detailed, but I’ll be over the moon if my magic wand gets released tomorrow night.

2 Likes

Well… Maybe it’s time to stop. For now, at least.

[screenshot: 2025-07-10_01-24-06]

[MOD - Condor - removed due to Forum Guideline violation]

One more request and I just know I’ll fix it!

4 Likes

[MOD - Condor - removed due to Forum Guideline violation]

Well, I think I’ve got Grok ready :cat_with_wry_smile:

Sorry to break the bad news: it will output "Thinking… Thinking… Thinking… Thinking… Thinking… " and then end the conversation.

1 Like

No way! It’s tripping over edit_tool!

1 Like

Bart Simpson vibes

3 Likes

Well

  • In regular mode it stops because of edit_tool
  • Then I corrected the prompt, and it spent $1.24 on incorrect tab edits

Trying in MAX mode with more context…

And after ~10 minutes and $0.96 on Grok 4 MAX, we have a regression


1 Like

Same prompt and codebase.
Gemini 2.5 Pro MAX: 15 minutes and $0.80 in, and it’s still working

UPD: surrendered after 25 minutes and $1.25
UPD2: after 25 minutes, the number of passed tests increased to 446

1 Like

Interesting. And to be expected.

Cursor + Model creator usually need to collab on agentic behaviour.

Cursor is SO broken with Grok 4. So bad. The model just doesn’t think.

1 Like

Hey, as with every major model release, we are currently working to improve Grok 4, both on available capacity from xAI and on its stability within Cursor. Each model requires some custom tuning of its system prompt to ensure it behaves well inside of Cursor - they can very rarely be “dropped in” and work immediately.

3 Likes

By the way: when models can’t correctly apply changes via edit_tool, or think they can’t, is this a problem with the apply model, the LLM itself, or both?
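A toy example of the kind of failure I mean, based on my tab-editing experience above (this is my guess at how a search-and-replace apply step can go wrong, not Cursor’s actual implementation):

```python
# The model's "old" snippet uses spaces, but the file uses tabs, so the
# apply step never finds a match even though the intended edit is right.
file_text = "def f():\n\tx = 1\n\treturn x\n"

old = "def f():\n    x = 1\n"  # four spaces, as the model emitted it
new = "def f():\n    x = 2\n"

patched = file_text.replace(old, new) if old in file_text else None
print(patched)  # -> None: whitespace mismatch, not wrong logic
```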

Grok has never been good, 4 is no better.

1 Like

Gemini fails often.

1 Like

Just over a week ago, I didn’t even know how to use tests.

Now I’m spending hours optimizing a four-layered parameterized integration test for idempotency, trying to get it to run in less than 11 minutes :zany_face:
(and that’s with only 47% of the test currently executing).
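The idempotency check itself is simple in principle, something like this sketch (`apply_pipeline` is a made-up placeholder, not my real code):

```python
import pytest

def apply_pipeline(data: dict) -> dict:
    """Placeholder operation; the real one drives the CAT tool."""
    return {**data, "normalized": True}

@pytest.mark.parametrize("payload", [{}, {"a": 1}, {"a": 1, "b": [2, 3]}])
def test_idempotent(payload: dict) -> None:
    once = apply_pipeline(payload)
    twice = apply_pipeline(once)
    # Running the pipeline a second time must change nothing.
    assert twice == once
```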

You do nothing but complain about absolutely every LLM. Either your tasks are too complicated, or you have problems with prompt engineering. Try using my Agent Compass. It will be interesting to see whether it helps you.