(Continuously Updated) My Real-Time Review of Grok 4

I call a spade a spade. I have no issues with Claude or Gemini (bad agent support, though).

I am not a prompt engineer; I am a software engineer, so I push these gimmicks to their limits. That is the whole point; I mean, you called out Claude.

Interesting tool you have, but the issue isn’t code navigation, it’s the code it outputs :laughing:. Come up with one for Rust and I’ll give it a try for fun, though I already have memory graph databases and the like.

1 Like

Yes, I took the wrong path out of inexperience (if you’re talking about Agent Docstrings), and now it needs to be redesigned and tested even more thoroughly.

And if you’re talking about Agent Compass, then it’s universal for everything written through the Cursor IDE.

Well, at least it doesn’t give up like Gemini…

[Screenshot: 2025-07-11_18-12-50]
The screenshot is from another project, written in Kotlin and Python.

1 Like

AGI MOMENT :exploding_head:

Context: I’m porting an algorithm developed by Gemini from Python to Kotlin for use on a smartphone. Previously, I managed to hack Python directly into Kotlin, but it was unstable. A proper language-to-language port didn’t work out. This is probably my fourth attempt by now. Grok has been working on the task non-stop for over 45 minutes.

1 Like

Grok 4 has been running for 90 minutes now. The task entry on cursor.com/dashboard?tab=usage has been stuck for over an hour without any updates. I didn’t anticipate things going this way, so I hadn’t built a progress bar into its pipeline for myself, and now I’m trying to figure out what it’s doing via git diff and Gemini.

It’s not like I understand everything that’s written here, but it doesn’t seem like it’s worth interfering yet.

> The model is making excellent progress. It is no longer just implementing features but is engaged in a sophisticated, structured debugging process. These changes are not random; they are targeted, logical steps toward creating a verified and numerically correct filter, which is a non-trivial task.

By the way, switching between chat tabs is currently causing Cursor to freeze for a few seconds.

1 hour and 55 minutes of uninterrupted work. Still working…

NOOO!!!

Perhaps Cursor automatically interrupts tasks that go on for more than two hours…

Grok 4 report

Summary of What I Did During the Session

During this session, I focused on debugging and refining the Butterworth bandpass filter implementation in ButterworthBandpassFilter.kt to make its output match the golden test data generated from the Python sandbox (using SciPy’s sosfiltfilt). The goal was to resolve the consistent test failures in ButterworthBandpassFilterTest.kt, where the filtered signal values (e.g., the first element often around 0.508 or similar) didn’t match the expected values (e.g., around 0.013).
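
For context, the golden-data side of such a setup is usually only a few lines of SciPy. Here is a minimal, hypothetical sketch in the spirit of the generate_filter_test_data.py script the report mentions; the sample rate, band edges, filter order, and file name are my own illustrative assumptions, not the actual sandbox values:

```python
# Hypothetical golden-data generator (parameters are assumptions).
import json

import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 50.0                # assumed sample rate, Hz
LOW, HIGH = 0.5, 5.0     # assumed bandpass edges, Hz

# Design the filter as second-order sections (SOS).
sos = butter(4, [LOW, HIGH], btype="bandpass", fs=FS, output="sos")

# Deterministic input so the Kotlin test is reproducible.
rng = np.random.default_rng(42)
signal = rng.standard_normal(1024)

# Zero-phase reference output the Kotlin port must reproduce.
expected = sosfiltfilt(sos, signal)

with open("golden_filter_data.json", "w") as f:
    json.dump({"input": signal.tolist(), "expected": expected.tolist()}, f)
```

The Kotlin test then loads the JSON and asserts element-wise closeness of the port’s output against `expected` within a tolerance.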

Here’s a breakdown of key actions I took:

  • Gathered Information and Debugged:

    • Used web_search multiple times to research SciPy’s sosfiltfilt implementation details, including padding types (‘odd’ extension), initial state calculations (sosfilt_zi), cascade order of second-order sections (SOS), and source code snippets from GitHub (e.g., from SciPy’s signaltools.py and related pull requests like #6274 and #3717 in scipy/scipy).
    • Read and analyzed files like ButterworthBandpassFilter.kt, ButterworthBandpassFilterTest.kt, and Python scripts in the sandbox/ directory (e.g., get_coeffs.py, test_filter.py, generate_filter_test_data.py) to compare coefficients, initial states, and filtered outputs.
    • Ran Python scripts via run_terminal_cmd (e.g., python sandbox/get_coeffs.py) to generate and format SOS coefficients and initial states (zi) in Kotlin-compatible array syntax, and verified outputs like filtered signal values for test data.
    • Created and edited temporary Python scripts (e.g., temp.py) to extract specific values from golden JSON files, such as the first filtered value and max absolute value.
  • Made Code Changes:

    • Edited ButterworthBandpassFilter.kt iteratively (about 5-6 times) to:
      • Update SOS coefficients and base initial states (baseZi) to match Python’s output exactly, including reversing or reordering sections as needed.
      • Implement ‘odd’ padding extension logic for signal padding before filtering, including correct padlen calculation (e.g., based on number of taps derived from SOS size).
      • Adjust initial state computations (zi1 and zi2) using padded signal values (e.g., scaling by xpad[0] or yrev[0]).
      • Modify the filtering process: Switched between forward-reverse passes, single-pass options, cascade order (ascending/descending), and final array reversals to mimic SciPy’s zero-phase behavior (see the sketch after this list).
      • Fixed index calculations for padding (e.g., correcting the right padding loop to avoid off-by-one errors).
    • Edited Python scripts like get_coeffs.py to output data in Kotlin format (initially via print, then to temp.txt) and added sosfilt_zi calculations.
    • Used reapply once on get_coeffs.py to ensure changes stuck after an initial application issue.
  • Tested Changes:

    • Ran ./build.ps1 multiple times (via run_terminal_cmd) to execute full builds, linting, static analysis, and unit tests. Results showed persistent failures in the filter test due to value mismatches, but some improvements (e.g., from NaN/-21 to ~0.5). No compilation or linter errors after fixes.
    • Explored directory structures with list_dir and searched for files with file_search to confirm paths.
  • Other Activities:

    • Attempted to fetch SciPy pull requests/commits for deeper source insights, but some failed (e.g., invalid commit hashes).
    • Ensured changes aligned with workspace rules: Focused on TDD, prepared for self-testing, and noted potential updates to TODO.md (though not edited this session).
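
To make the padding and initial-state bullets above concrete: sosfiltfilt is more than two filter passes. Below is my own Python reconstruction of the sequence, using the SciPy primitives the report mentions (sosfilt, sosfilt_zi); treat it as a sketch of the steps a Kotlin port has to reproduce, not as Grok’s actual code:

```python
import numpy as np
from scipy.signal import butter, sosfilt, sosfilt_zi, sosfiltfilt

def manual_sosfiltfilt(sos: np.ndarray, x: np.ndarray, padlen: int) -> np.ndarray:
    """Zero-phase SOS filtering, spelled out step by step."""
    # 1. 'odd' extension: reflect padlen samples around each endpoint,
    #    mirrored through the endpoint value itself.
    left = 2 * x[0] - x[padlen:0:-1]
    right = 2 * x[-1] - x[-2:-(padlen + 2):-1]
    xpad = np.concatenate((left, x, right))

    # 2. Forward pass. Steady-state initial conditions are scaled by the
    #    first padded sample (the "zi1 = zi * xpad[0]" trick above).
    zi = sosfilt_zi(sos)                        # shape: (n_sections, 2)
    y, _ = sosfilt(sos, xpad, zi=zi * xpad[0])

    # 3. Backward pass over the reversed signal, state scaled by its
    #    first sample ("zi2 = zi * yrev[0]").
    yrev = y[::-1]
    y2, _ = sosfilt(sos, yrev, zi=zi * yrev[0])

    # 4. Reverse back and strip the padding.
    return y2[::-1][padlen:-padlen]

# Self-check against SciPy, passing the same explicit padlen to both sides
# (the filter parameters here are assumptions for illustration).
sos = butter(4, [0.5, 5.0], btype="bandpass", fs=50.0, output="sos")
x = np.random.default_rng(0).standard_normal(500)
padlen = 3 * (2 * len(sos) + 1)  # simplified choice, not SciPy's exact default
assert np.allclose(manual_sosfiltfilt(sos, x, padlen),
                   sosfiltfilt(sos, x, padlen=padlen))
```

If any one step (the ‘odd’ extension, the zi scaling by the first sample, the reversal, or the final strip of the padding) drifts from this sequence, you get the kind of subtle value mismatches described above.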

Overall, progress was made in aligning the Kotlin filter closer to SciPy’s behavior (e.g., better handling of padding and states), but the test still doesn’t pass fully due to subtle mismatches in the filtered output.

If working with the Agent used to feel like working with a developer, Grok feels like a real engineer.

It feels so odd to use Gemini 2.5 Pro as a lightweight, low-cost model for quick edits :eyes:

And here I panic if the model thinks for more than 1 minute lol.

Right now, Grok 4 is kind of broken: it almost always stops either right after the prompt or mid-execution. It does finish the job in one-shot style, but maybe I just got lucky. Still, I want a proper chat experience.

I tried Grok 4 in Trae IDE, and honestly, I didn’t like how the Agent works there at all. Partly because I’m too lazy to change my VPN location to give them $3 and skip the queue, but mostly because it’s missing at least one really awesome feature: Cursor’s semantic search tool is so good that I’m not even sure my Agent Docstrings project is worth it anymore :sweat_smile:

Although knowing that the Agent won’t stumble over an endless terminal, or over itself, is quite calming.

Successfully tested several times. It only stops unnecessarily when the chat is continued.

Task: passing a pytest suite in a Test-Driven Development workflow. The test cases were offered by Claude 4 while working as a QA Agent in another repository.

  1. Gemini 2.5 Pro failed to handle it for $0.75 (launched in an already ongoing long chat, and it only wrote the tests).
  2. Grok 4 one-shot it for $0.79.

The tests in the two runs are not identical, but they were written from a similar prompt and context.

I’m having this experience as well. With Claude 4, if I ask it to run the tests and try to fix the broken one, it’ll re-run the test each time it tries something new and debug it. With Grok 4, it’ll change some code, re-run the test, and then wait on me with no further output. After the other models, it feels weird that it just silently stops so often; I assume it’s more of an error state than working as intended.

Strange. I only have such problems with Grok 4 and Gemini 2.5 Pro.

Yeah. To be clear, I’m saying Claude does this correctly and Grok doesn’t.

While refactoring the current Agent Enforcer, I gave Grok 4 the rundown on the MCP methodology but totally forgot to mention FastMCP. So what did Grok do? That’s right: it just dove into the venv and read the FastMCP code on its own :smiling_face_with_sunglasses:
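
For those who haven’t touched it: FastMCP lets you expose plain Python functions as MCP tools with very little ceremony, which is presumably the pattern Grok reconstructed from the venv. A minimal sketch; the server name and tool below are hypothetical, not the actual Agent Enforcer code:

```python
# Minimal FastMCP server sketch; the tool is hypothetical, not Agent Enforcer.
from fastmcp import FastMCP

mcp = FastMCP("agent-enforcer")

@mcp.tool()
def check_style(path: str) -> str:
    """Hypothetical tool: pretend to run a style check on a file."""
    return f"checked {path}: OK"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```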

[Screenshot: 2025-07-16_09-17-05]

Just look at how much information it chewed through to add 190 lines of code.


The MCP server, however, no longer starts, but the approach is interesting :new_moon_face:

Let me explain what happened, for those learning: experienced users try to include all the required context at the start, and the agent fetches missing context only as a last resort. Grok, as a reasoning agent (currently the best), did that, but it also pulled in unneeded context, which is called context poisoning, and output errors follow, most likely due to context loss and hallucinations.

2 Likes

I stopped supplying a lot of context after Sonnet 4 came out. Now my rules are like 10 lines with core principles. Prompts are usually one sentence.

4 Likes