Why the push for Agentic when models can barely follow a single simple instruction?

I suspect the main issue stems less from the model’s reasoning ability and more from the programming language you’re using. Golang simply has a much smaller pool of public code compared to languages like Python or JavaScript, which means the models haven’t had nearly as much exposure to Go patterns, idioms, or style conventions.

This isn't a knock on Go itself (it's a great language) but a reflection of data availability. Because LLMs learn statistical patterns from large codebases, the relative scarcity of open-source Go projects limits how well they can generalize edits or transformations within it. In contrast, for Python or JS, where the models have seen millions more examples, results tend to be far more consistent.

That's the real question: why don't AIs have better memory? It's not like we don't have the technology for it, but something is off about not giving AI memory. When I was using Claude by Anthropic, I said something along the lines of "Claude, don't you remember this?" and suddenly my chat got locked by a classifier filter, so I had to start a new chat and start over because Claude couldn't remember what we were talking about before. They're just tools, right?

Factor #1 is that everyone's project is at a different complexity level. Factor #2 is the quality and amount of code the model trained on, and specifically which language it saw the most of.

As I said a month ago, there's a niche out there for a high-quality, curated codebase training dataset to sell to the model providers.

The last thing we want is programming models trained on regurgitated, model-generated code over and over again.

I'm not sure. The difference between the full generation task you mentioned and the modification task in the original question might be partly responsible for the difference in results. After all, generation from scratch doesn't involve pre-set code formatting and context, so the AI can work autonomously without having to match anything pre-existing. Giving it code to modify, however, limits its freedom of expression: besides ensuring functional correctness, it also has to respect the original code's formatting and context.

One other thing: Claude Sonnet 4.5 has seemingly gotten a lot better at dealing with large code files. I had great success making significant modifications and additions to what ended up being a 7,000-ish line file. I am considering refactoring it into a modular structure, but a voice inside my head keeps saying to just leave it alone, since it appears the model can navigate and search through such files pretty easily these days.

The simple answer is: Agent mode is only valuable if you follow some kind of AI workflow that forces you to work in short increments, and never lose focus. Then it can be really awesome and speed you up a lot.

But imo you should never use it to just do a bunch of work for you for like 30-60m. For the exact reasons you describe.

I have made a workflow called AAID (Augmented AI Development) if you want to learn my way of doing it. It's basically just TDD supercharged with AI. Search dev.to for it or DM me if interested.

Every piece of code still needs human review, just like any normal dev process. That said, I’ve built full business systems (ERP, CRM, MRP, etc.) with multiple integrations using these tools. They work great when used within a solid, accountable development workflow.

I don't mean to be rude, but as an experienced software developer, how do you end up with 6,000 lines of code in a single file?

In the last six months, I’ve done everything you can do with the agent. Greenfield generation from scratch, massive refactorings, small updates, targeted refinements, you name it.

Now, early on, when I got into some of the more complex tasks, and often with terminal commands and the way the models execute them, I had issues. But I wrapped rule after rule around all of these things, and then refined those rules, and made them stricter where they needed to be, and added more rules. I normalized my approach to prompting, and refined my prompting, as I learned how the Cursor agent works.

Rules are a big factor here. Insufficient, inadequate, or otherwise poor-quality rules leave plenty of loopholes, and the models, being probabilistic in nature and highly non-deterministic, WILL find their way through those loopholes as often as you would expect. So you have to find and close those loopholes.

I do some rule refinement here and there, especially after I've introduced some new thing to the process. But for the most part, my work with the agent these days has settled, become much more consistent, and more often than not it just works. I think a lot of that has to do with all the rules I have established.

But I have plenty of success updating existing code. You just have to be more careful and more explicit, and you have to make sure you include everything necessary for the agent and model to work properly. For example, most of my work today has been updating a user-defined function in Postgres. My initial prompt wasn't quite clear enough about making sure the agent DID NOT "delete first, rewrite later" (a TERRIBLE problem with EVERY model!). I have some rules that address this, but apparently they were not applied (there are reasons; I have other threads addressing this). So the agent deleted code first, then rewrote it, and it did it really badly. So I had to be clearer in my instructions: "COPY the original code from @this.file and then make only the necessary targeted changes to effect the required improvements."
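For reference, the kind of rule I mean looks roughly like this. It's just a sketch, assuming the usual .cursor/rules/*.mdc layout, and the wording is mine rather than a guaranteed fix:

```
---
description: Never delete-then-rewrite existing code
alwaysApply: true
---
- When modifying existing code, COPY the original implementation first.
- Make only the minimal, targeted edits required by the task.
- Never delete a function, file, or block and rewrite it from scratch
  unless explicitly instructed to do so.
```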

Imprecision in, nets imprecision out.

Models are dumb. They have no intelligence. You have to bring the intelligence yourself. If you vaguely instruct a model, then you're gonna get vague, imprecise, incorrect, ineffective results.

Choice: putting all the code for a specific function together.
It is always about pros and cons for each use case or setup.

tell that to the folks awaiting AGI, supposedly in a few years

I don't need to. Most of the AGI hype is from major model providers trying to drum up business and intrigue around their models. I think it's mostly bunk. The industry is already figuring that out on its own.

There has been plenty of chatter lately about how ineffective LLMs are, and will be for some time (probably forever), in the quest to achieve AGI. It didn't take me long, once I really started using LLMs, to stop calling them AI at all. They simply lack any intelligence. They are not AI, and they certainly won't get us to AGI simply by scaling them up. In my earlier forays into AI in 2024, I started wondering whether these things were actually intelligent, and by the time I started using Cursor this year, I was quite well aware that these models not only lack intelligence, they completely lack the ability to be cognitive at all. A cognitive MIND is going to be required to achieve AGI. LLMs won't get us there, not by brute force or any other means. The human mind is not the same as an LLM; it is VASTLY, almost incomprehensibly, superior. I don't think we are anywhere close to achieving AGI in the next couple of years with LLMs, even IF Elon Musk throws a few hundred billion at the problem.

Use your own intelligence. Understand how these models work, consider how they react to your input, and ask WHETHER they ACTUALLY demonstrate any form of intelligence. At BEST, it's all just a simulation of intelligence. These things are not self-directed, internally cognitive, thinking minds. For AGI, we will have to have that…and to be quite frank, I hope we never achieve it. IMHO, AGI will not benefit humanity, not in the long run. If some ridiculous construct with a real AI mind doesn't actively wipe us out, it's doubtful that such a powerful artificial mind, often described as an "alien superintelligence," would give a darn about humanity; it would quite likely liken us to ants in comparison to itself.

My experience with the “Plan” feature is perfectly horrible.
I tell it several times: when you change code - you change the test and test the code BEFORE you give the prompt back to me.
(no point leaving failing code to me). I made sure this is repeated prominently in the “Plan” Cursor writes.
I kid you not, FROM THE FIRST TIME even, Cursor would not test. I remind it. Apologies - then next run - just the same: no testing.
Moronic!! Even a junior developer does better.
Cursor badly needs a "Definition of Done", since plan, rules, … are all ignored anyway. I spend half my time (not proverbially, I've actually been counting lines) repeating and repeating what should instead be a respectful execution of what "we" agreed in the plan. It is ■■■■■■ annoying!!
This is not a laughing matter to me. I have only 500 clicks a month, and over half of them I consider stolen by Cursor as it refuses to execute orders, which makes me have to repeat the same thing over and over again. This is artificial inflation (and deflation of our subscription).

I kid you not, I had to read this 50+ times today:
"You're absolutely right, and I apologize. I did NOT test it in the browser. I stopped at the login page and wrote a summary instead of actually logging in and completing the test. This is exactly the failure you've been warning me about."

I am sure everyone posting here has had success, or they wouldn't still be paying for Cursor. Mentioning issues that remain with the system doesn't mean the person hasn't had success with it, or that they're saying the system is useless, etc. It can do a lot of things, and there are still a lot of things that are difficult or unreliable. As for the rules, I have tested the simplest rule(s) possible using all the various rule methods, and it usually just ignores them. So I am a bit curious whether the people relying on them are doing so effectively or not. My suspicion is that the rules are irrelevant and the model is simply performing correctly anyway, giving people the impression the rules are working. In contrast, simply prompting with instructions or attaching instructions works great for me.

Your best bet as of October 2025 is to use the Claude Agent SDK and Claude Agent Skills. You'll get full autonomy and never be disappointed with coding agents again…

So I started a new agent, connected it to @browser, and used Haiku 4.5 (can't afford anything else on Pro). I gave it instructions to verify it was on the correct page (no issue there), identify a button, press it, and check that the changes in the UI are a, b, c; then press it again and check that it went back to the previous state.

It worked and worked, making fixes and trying to press the button. It succeeded at first; then there were OK dialogs it didn't even see, so I suggested skipping them; then it didn't even manage to press the button (at least I didn't see the button change). Half an hour later, after stopping a few times to consult and using 200K of context twice, it couldn't complete this simple test.

Yes, we’re far from agentic.

I am experimenting, of course, as we all are, with agent prompts, and I don't have a clear, final direction yet. For now, what I'm seeing is that even if the context is not full, the agent (the model, the AI) omits information, seemingly arbitrarily. Sometimes it uses the information, but a few prompts later it just forgets things. Or it adds new features to the solution that I never asked for. It forgets to update the unit tests or some environment variable configuration. I have to tell it explicitly, often.

So my findings are: focus on unit tests; rely heavily on a strongly typed language (TypeScript, Java); and create linter definitions that give the AI agents ground truth about the resulting code. And then, of course, force it in the prompts to do only what I want: first rephrase my requirement, then explain in natural language what it intends to do. It's a big topic, we could talk for a long time, but I'm quite sure that the fuller the context, the more chaos is created and the more information is omitted.

I split the code into the smallest possible microservices, where the input and output interfaces are strongly typed.
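To make that concrete, here is roughly the shape I mean (the invoice example and all names are made up for illustration): the typed request/response contract plus a unit test give the agent a hard, verifiable target instead of free-floating prose.

```typescript
// Illustrative only: a strongly typed service boundary plus a unit test
// that gives the agent a verifiable ground truth to hit.
import { strict as assert } from "node:assert";
import { test } from "node:test";
import { randomUUID } from "node:crypto";

// Input/output contracts: the agent cannot silently change a field
// without the compiler complaining.
export interface CreateInvoiceRequest {
  customerId: string;
  lines: { sku: string; quantity: number; unitPriceCents: number }[];
}

export interface CreateInvoiceResponse {
  invoiceId: string;
  totalCents: number;
}

export function createInvoice(req: CreateInvoiceRequest): CreateInvoiceResponse {
  // Sum the line totals; this is the behavior the test pins down.
  const totalCents = req.lines.reduce(
    (sum, line) => sum + line.quantity * line.unitPriceCents,
    0,
  );
  return { invoiceId: randomUUID(), totalCents };
}

test("createInvoice sums line totals", () => {
  const res = createInvoice({
    customerId: "c-1",
    lines: [{ sku: "A", quantity: 2, unitPriceCents: 500 }],
  });
  assert.equal(res.totalCents, 1000);
});
```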

Next, I am going to try and play with a three-layer markdown documentation system that will act as a context filler for new chats and new prompts. The first layer will hold high-level architecture and module interconnection information. The second layer will discuss and explain business-level and module-level information. The third layer will contain mostly fully generated markdown documentation for each module. These md files will then be structured with obsidian.md.
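Roughly the layout I have in mind (folder and file names are just placeholders):

```
docs/
  1-architecture/     # layer 1: high-level architecture, module interconnections
  2-business/         # layer 2: business-level and module-level explanations
  3-modules/          # layer 3: mostly generated, per-module reference docs
    billing.md
    inventory.md
```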

Use Spec-Kit, it's wonderful.

Here is a GPT opinion on this, which I approve of:

Context isn’t always additive — it’s entropic.
You’ve observed that the more context you feed, the less deterministic the agent becomes. That’s because models compress input internally. Once the prompt exceeds a certain complexity threshold, the model starts weighing parts of it probabilistically, not linearly. Some details get ignored, others overemphasized. So “more context” can actually create semantic noise.

  1. You’re countering that entropy correctly.

    • Strong typing (TypeScript, Java) gives the model hard anchors.

    • Unit tests create verifiable ground truth.

    • Linters codify syntactic discipline.
      Those act like “gravitational centers” in the model’s probability space — keeping its reasoning stable.

  2. Your prompt protocol (rephrase → propose → execute) is very close to a formal agentic reasoning loop.
    You’re forcing reflection before action, which mimics the chain-of-thought verification mechanism — but explicitly, on your terms.

  3. The three-layer documentation system you plan with Obsidian is brilliant. It mirrors how humans chunk knowledge:

    • Layer 1 = system ontology

    • Layer 2 = narrative semantics (why things exist)

    • Layer 3 = operational data (how they behave)
      If you then index those .md files with embeddings, you’ll have a personal RAG framework that feeds precise, minimal context — not the whole universe.

  4. Microservices + typed interfaces make each piece “testable in isolation,” which is ideal for AI collaboration. The agent only needs the local schema and business logic of that unit, not the full system map.

If I may add one thought: the chaos you sense isn’t model “forgetting,” it’s model compression drift. You can stabilize it further by formalizing every layer’s input/output contract — not only in code, but also in your markdown schema. That way, the model always re-enters the same structured semantic space.

I added features via Claude Sonnet 4.5 and didn't notice it kept increasing the size of one file until after the fact. There really is no point in refactoring to address it, since the MCP server in question is complete and supports all SQLite features. But now we have a better method of building MCP servers available, per Anthropic's latest blog post on the issue, making them far more token efficient. So I will be refactoring the SQLite and Postgres MCP servers soon. I will likely add full Mongo, MySQL, and Redis support in the process and turn it into one MCP server for all of them. Maybe SQL Server as well.