(Continuously Updated) My Real-Time Review of Grok 4

I call a spade a spade. I have no issues with Claude or Gemini (bad agent support, though).

I am not a prompt engineer; I am a software engineer, so I push these gimmicks to their limits. That is the whole point; I mean, you called out Claude.

Interesting tool you have, but the issue isn’t code navigation, it’s the code it outputs :laughing:. Come up with one for Rust and I’ll give it a try for fun, though I already have memory graph databases and the like.

1 Like

Yes, I took the wrong path out of inexperience (if you’re talking about Agent Docstrings), and now it needs to be redesigned and tested even more thoroughly.

And if you’re talking about Agent Compass, then it’s universal for everything written through the Cursor IDE.

Well, at least it doesn’t give up like Gemini…

[Screenshot: 2025-07-11_18-12-50]
The screenshot is from another project, written in Kotlin and Python.

1 Like

AGI MOMENT :exploding_head:

Context: I’m porting an algorithm developed by Gemini from Python to Kotlin for use on a smartphone. Previously, I managed to hack Python directly into Kotlin, but it was unstable. A proper language-to-language port didn’t work out. This is probably my fourth attempt by now. Grok has been working on the task non-stop for over 45 minutes.

1 Like

Grok 4 has been running for 90 minutes now. The task entry on cursor.com/dashboard?tab=usage has been stuck for over an hour without any updates. I didn’t anticipate things going this way, so I hadn’t built a progress bar into its pipeline for myself, and now I’m trying to figure out what it’s doing via git diff and Gemini.

It’s not like I understand everything that’s written here, but it doesn’t seem like it’s worth interfering yet.

> The model is making excellent progress. It is no longer just implementing features but is engaged in a sophisticated, structured debugging process. These changes are not random; they are targeted, logical steps toward creating a verified and numerically correct filter, which is a non-trivial task.

By the way, switching between chat tabs is currently causing Cursor to freeze for a few seconds.

1 hour and 55 minutes of uninterrupted work. Still working…

NOOO!!!

Perhaps Cursor automatically interrupts tasks that go on for more than two hours…

Grok 4 report

Summary of What I Did During the Session

During this session, I focused on debugging and refining the Butterworth bandpass filter implementation in ButterworthBandpassFilter.kt to make its output match the golden test data generated from the Python sandbox (using SciPy’s sosfiltfilt). The goal was to resolve the consistent test failures in ButterworthBandpassFilterTest.kt, where the filtered signal values (e.g., the first element often around 0.508 or similar) didn’t match the expected values (e.g., around 0.013).
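
For context, the golden-data side of such a setup is usually only a few lines of SciPy. Here is a minimal, hypothetical sketch in the spirit of the generate_filter_test_data.py script the report mentions; the sample rate, band edges, filter order, and file name are my own illustrative assumptions, not the actual sandbox values:

```python
# Hypothetical golden-data generator (parameters are assumptions).
import json

import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 50.0                # assumed sample rate, Hz
LOW, HIGH = 0.5, 5.0     # assumed bandpass edges, Hz

# Design the filter as second-order sections (SOS).
sos = butter(4, [LOW, HIGH], btype="bandpass", fs=FS, output="sos")

# Deterministic input so the Kotlin test is reproducible.
rng = np.random.default_rng(42)
signal = rng.standard_normal(1024)

# Zero-phase reference output the Kotlin port must reproduce.
expected = sosfiltfilt(sos, signal)

with open("golden_filter_data.json", "w") as f:
    json.dump({"input": signal.tolist(), "expected": expected.tolist()}, f)
```

The Kotlin test then loads the JSON and asserts element-wise closeness of the port’s output against `expected` within a tolerance.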

Here’s a breakdown of key actions I took:

  • Gathered Information and Debugged:

    • Used web_search multiple times to research SciPy’s sosfiltfilt implementation details, including padding types (‘odd’ extension), initial state calculations (sosfilt_zi), cascade order of second-order sections (SOS), and source code snippets from GitHub (e.g., from SciPy’s signaltools.py and related pull requests like #6274 and #3717 in scipy/scipy).
    • Read and analyzed files like ButterworthBandpassFilter.kt, ButterworthBandpassFilterTest.kt, and Python scripts in the sandbox/ directory (e.g., get_coeffs.py, test_filter.py, generate_filter_test_data.py) to compare coefficients, initial states, and filtered outputs.
    • Ran Python scripts via run_terminal_cmd (e.g., python sandbox/get_coeffs.py) to generate and format SOS coefficients and initial states (zi) in Kotlin-compatible array syntax, and verified outputs like filtered signal values for test data.
    • Created and edited temporary Python scripts (e.g., temp.py) to extract specific values from golden JSON files, such as the first filtered value and max absolute value.
  • Made Code Changes:

    • Edited ButterworthBandpassFilter.kt iteratively (about 5-6 times) to:
      • Update SOS coefficients and base initial states (baseZi) to match Python’s output exactly, including reversing or reordering sections as needed.
      • Implement ‘odd’ padding extension logic for signal padding before filtering, including correct padlen calculation (e.g., based on number of taps derived from SOS size).
      • Adjust initial state computations (zi1 and zi2) using padded signal values (e.g., scaling by xpad[0] or yrev[0]).
      • Modify the filtering process: Switched between forward-reverse passes, single-pass options, cascade order (ascending/descending), and final array reversals to mimic SciPy’s zero-phase behavior (see the sketch after this list).
      • Fixed index calculations for padding (e.g., correcting the right padding loop to avoid off-by-one errors).
    • Edited Python scripts like get_coeffs.py to output data in Kotlin format (initially via print, then to temp.txt) and added sosfilt_zi calculations.
    • Used reapply once on get_coeffs.py to ensure changes stuck after an initial application issue.
  • Tested Changes:

    • Ran ./build.ps1 multiple times (via run_terminal_cmd) to execute full builds, linting, static analysis, and unit tests. Results showed persistent failures in the filter test due to value mismatches, but some improvements (e.g., from NaN/-21 to ~0.5). No compilation or linter errors after fixes.
    • Explored directory structures with list_dir and searched for files with file_search to confirm paths.
  • Other Activities:

    • Attempted to fetch SciPy pull requests/commits for deeper source insights, but some failed (e.g., invalid commit hashes).
    • Ensured changes aligned with workspace rules: Focused on TDD, prepared for self-testing, and noted potential updates to TODO.md (though not edited this session).
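
To make the padding and initial-state bullets above concrete: sosfiltfilt is more than two filter passes. Below is my own Python reconstruction of the sequence, using the SciPy primitives the report mentions (sosfilt, sosfilt_zi); treat it as a sketch of the steps a Kotlin port has to reproduce, not as Grok’s actual code:

```python
import numpy as np
from scipy.signal import butter, sosfilt, sosfilt_zi, sosfiltfilt

def manual_sosfiltfilt(sos: np.ndarray, x: np.ndarray, padlen: int) -> np.ndarray:
    """Zero-phase SOS filtering, spelled out step by step."""
    # 1. 'odd' extension: reflect padlen samples around each endpoint,
    #    mirrored through the endpoint value itself.
    left = 2 * x[0] - x[padlen:0:-1]
    right = 2 * x[-1] - x[-2:-(padlen + 2):-1]
    xpad = np.concatenate((left, x, right))

    # 2. Forward pass. Steady-state initial conditions are scaled by the
    #    first padded sample (the "zi1 = zi * xpad[0]" trick above).
    zi = sosfilt_zi(sos)                        # shape: (n_sections, 2)
    y, _ = sosfilt(sos, xpad, zi=zi * xpad[0])

    # 3. Backward pass over the reversed signal, state scaled by its
    #    first sample ("zi2 = zi * yrev[0]").
    yrev = y[::-1]
    y2, _ = sosfilt(sos, yrev, zi=zi * yrev[0])

    # 4. Reverse back and strip the padding.
    return y2[::-1][padlen:-padlen]

# Self-check against SciPy, passing the same explicit padlen to both sides
# (the filter parameters here are assumptions for illustration).
sos = butter(4, [0.5, 5.0], btype="bandpass", fs=50.0, output="sos")
x = np.random.default_rng(0).standard_normal(500)
padlen = 3 * (2 * len(sos) + 1)  # simplified choice, not SciPy's exact default
assert np.allclose(manual_sosfiltfilt(sos, x, padlen),
                   sosfiltfilt(sos, x, padlen=padlen))
```

If any one step (the ‘odd’ extension, the zi scaling by the first sample, the reversal, or the final strip of the padding) drifts from this sequence, you get the kind of subtle value mismatches described above.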

Overall, progress was made in aligning the Kotlin filter closer to SciPy’s behavior (e.g., better handling of padding and states), but the test still doesn’t pass fully due to subtle mismatches in the filtered output.

If working with the Agent used to feel like working with a developer, Grok feels like a real engineer.

It feels so odd to use Gemini 2.5 Pro as a lightweight, low-cost model for quick edits :eyes:

And here I panic if the model thinks for more than 1 minute lol.

Right now, Grok 4 is kind of broken: it almost always stops either right after the prompt or mid-execution. It does finish the job in one-shot style, but maybe I just got lucky. Still, I want a proper chat experience.

I tried Grok 4 in Trae IDE, and honestly, I didn’t like how the Agent works there at all. Partly because I’m too lazy to change my VPN location to give them $3 and skip the queue, but mostly because it’s missing at least one really awesome feature: Cursor’s semantic search tool is so good that I’m not even sure my Agent Docstrings project is worth it anymore :sweat_smile:

Although knowing that the Agent won’t stumble over an endless terminal, or over itself, is quite calming.

Successfully tested several times. It only stops unnecessarily when the chat is continued.

Task: passing a pytest suite in a Test-Driven Development workflow. The test cases were offered by Claude 4 while working as a QA Agent in another repository.

  1. Gemini 2.5 Pro failed to handle it for $0.75 (launched in an already ongoing long chat, and it only wrote the tests).
  2. Grok 4 one-shot it for $0.79.

The tests in the two runs are not identical, but they were written from a similar prompt and context.

I’m having this experience as well. With Claude 4, if I ask it to run the tests and try to fix the broken one, it’ll re-run the test each time it tries something new and debug it. With Grok 4, it’ll change some code, re-run the test, and then wait on me with no further output. After the other models, it feels weird that it just silently stops so often; I assume it’s more of an error state than working as intended.

Strange. I only have such problems with Grok 4 and Gemini 2.5 Pro.

Yeah. To be clear, I’m saying Claude does this correctly and Grok doesn’t.

While refactoring the current Agent Enforcer, I gave Grok 4 the rundown on the MCP methodology but totally forgot to mention FastMCP. So what did Grok do? That’s right: it just dove into the venv and read the FastMCP code on its own :smiling_face_with_sunglasses:
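
For those who haven’t touched it: FastMCP lets you expose plain Python functions as MCP tools with very little ceremony, which is presumably the pattern Grok reconstructed from the venv. A minimal sketch; the server name and tool below are hypothetical, not the actual Agent Enforcer code:

```python
# Minimal FastMCP server sketch; the tool is hypothetical, not Agent Enforcer.
from fastmcp import FastMCP

mcp = FastMCP("agent-enforcer")

@mcp.tool()
def check_style(path: str) -> str:
    """Hypothetical tool: pretend to run a style check on a file."""
    return f"checked {path}: OK"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```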

[Screenshot: 2025-07-16_09-17-05]

Just look at how much information it chewed through to add 190 lines of code.


The MCP server, however, no longer starts, but the approach is interesting :new_moon_face:

Let me explain what happened, for those learning: experienced users try to include all the required context at the start, and the agent fetches missing context only as a last resort. Grok, as a reasoning agent (currently the best), did that, but it also pulled in unneeded context, which is called context poisoning, and output errors follow, most likely due to context loss and hallucinations.

2 Likes

I stopped supplying a lot of context after Sonnet 4 came out. Now my rules are like 10 lines with core principles. Prompts are usually one sentence.

4 Likes