Models fail because they don't reason; they predict. Agentic or not, they're still guessing next words, not verifying outcomes. Until we fix grounding, memory, and feedback, reliability will stay hit-or-miss.
Do you really believe the Kool-Aid of next-word prediction? Have you sat down in a quiet room and asked yourself whether predicting the next word can create the phenomena these models are producing?
Just try that and see what you wake up to. And if, after doing that, you still believe models predict the next word or whatever else they fed you, then so be it.
6,000 lines of code is too much for humans or machines to consume (it can quickly use up your context). Rather than trying to have it make changes to such a huge file, focus on having a "sane" project first (and having Cursor help you refactor it); otherwise you'll hit these issues over and over.
You don't know what you're talking about, so let me help you out. Anthropic's interpretability work literally maps concept neurons and steering vectors that show models build internal logic just from next-word prediction. OpenAI, DeepMind, and Redwood found the same for math, reasoning, and code.
Agents exist BECAUSE a model can barely execute one instruction on its own; that's the whole point: the agent drives through errors and improves how the whole process is understood and executed.
Since the upgrade to Claude Sonnet 4.5, I have had very smooth results with large edits and changes. I have to do less hand-holding now. I don't know if I have any magic answers, but I put a lot of effort into my prompts: I avoid typos, abbreviations, incomplete sentences, and any unnecessary information, and I try to take one small step at a time. Certainly, if you give the model a broad objective, it will miss things, like perhaps updating documentation thoroughly or deleting remnant files. It is a very methodical process and unfortunately very expensive.
So you are, in short, agreeing that agentic is not it, right?
Because the agentic pitch says you can just assign tickets, let it go to work like a software engineer in the background, and come back to find the acceptance criteria successfully completed, with no hair-pulling involved.
Yes, giving it tasks in a hands-off fashion strikes me as complete suicide at this time. I am sure eventually the tech will get there. At best, it is far more likely to lie than to succeed, and it wouldn't surprise me if it destroys things. Of course, I always have multiple backups, plus GitHub. Small tasks, one at a time…
One thing I think I have noticed is that the base model can do a lot of things without the use of features that are supposed to help it do them. For instance, planning mode seems largely redundant to me. You could already prompt the agent to make a detailed plan for something, and it generally did a great job. I'm not trying to be critical here, and I could be wrong; I have only used planning mode a couple of times. But it didn't seem to do anything the model couldn't do without planning mode.

Big plans are great when you need them, but they also burn a lot of tokens/money, and I wouldn't expect to tell it to then execute the plan without tons of issues. Usually the problem is a lack of context, but it can also make very bad decisions sometimes, especially in wasting tokens and time. Instead, I would execute the plan one small step at a time, usually the smallest step possible. Then check everything, make new backups, and move on to the next step.

The main thing about planning mode that seems useful is that it won't automatically implement the plan once made, like it will do in agent mode, eh? But you can also just tell it in agent mode not to actually execute something, so I guess it saves you a sentence. The same thing is true about many rules/.md files: it seems to just ignore all that anyway, so they seem a waste of time. I upload readmes for each mission/project, which works well, except it wastes tokens unless you custom-prune the readme for each thread to give only the exact info needed, and I am not doing that.
Background agents are not useful yet. It's a thing born from hype, and since all AI editors are venture-backed, you get these weird “me too” dynamics. Maybe one day… the vision is awesome.
That said, agentic (or a coding agent) is very useful. Think of it this way:
- Good ol' auto-complete is universally useful. I don't know any dev who wants to live without it.
- In-line edits (GPT-3.5-level), which modify a single function, are also broadly useful. They get things wrong at times, especially if you use super-niche libraries, but they mostly produce working and readable code. You read each response before you submit the next, so you can fix stuff as it happens. It tends to be faster to use in-line edits and review than to type it yourself.
- Agentic (single chat window) is the current frontier. They tend to be technically correct (the code works w.r.t. the requirements of your prompt), but they are often semantically wrong (spaghetti, reimplementing functions, odd locations for code…). Their success strongly depends on how well you understand your own codebase and can provide context up front. For me personally, they have worked well with smaller refactors and features, e.g., add a new field to this type, update the bindings, and then go update the database model (see the sketch after this list). They have been a dumpster fire for large features, e.g., add a customGPT feature to my chat app.
- “Agents” (multiple chat windows) are the next frontier. They exist but are quite useless right now outside of bragging rights on social media. The best use case I heard was that you can run another one while the previous one is busy generating. Personally, I never had that problem, since I haven't gotten long traces to work: all my traces that ran more than 5 minutes were garbage.
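To make the "smaller refactors" point above concrete, here is a minimal TypeScript sketch of the kind of prompt-sized change meant by "add a new field to this type, update the bindings, then update the database model". All names (UserProfile, toResponse, toRow, etc.) are hypothetical and only illustrate the scope, not any particular codebase.

```typescript
// Hypothetical illustration: one optional field added to a domain type,
// then propagated to the API response shape and the persistence row shape.

// 1. Domain type: the new field the agent is asked to add.
interface UserProfile {
  id: string;
  email: string;
  displayName?: string; // <-- the new field
}

// 2. API binding: expose the field in the response shape.
interface UserProfileResponse {
  id: string;
  email: string;
  display_name: string | null;
}

function toResponse(user: UserProfile): UserProfileResponse {
  return {
    id: user.id,
    email: user.email,
    display_name: user.displayName ?? null, // propagate the new field
  };
}

// 3. Database model: mirror the field in the row used for persistence.
interface UserRow {
  id: string;
  email: string;
  display_name: string | null;
}

function toRow(user: UserProfile): UserRow {
  return {
    id: user.id,
    email: user.email,
    display_name: user.displayName ?? null,
  };
}
```

The change touches three files but each touch is mechanical, which is exactly why a single-window agent tends to handle it well when you reference all three files explicitly.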
I know this thread will be full of "you must follow the magic prompting ritual from the canticles of activation". To me that feels like the Warhammer 40k approach to agentic.
I am totally in favour of prompting strategy and rule files; they should be beneficial. Right now, though, I ended up deleting all of that and doing vanilla prompts, and it works better than any canticle. Praise the Omnissiah.
What has worked for me in getting agentic to where I have it today is:
- Read every single trace as it is being generated, and make/ask for corrections before moving on to the next step. There will be small spaghetti in 3/5 generations, and if you fix it immediately, the cost of doing so is minimal.
- Reference every file you know the agent needs to use explicitly. Sometimes you forget one or don't anticipate it. Agents are a bit self-healing here, but spaghetti code is a direct function of how many files you forgot to reference explicitly.
- Never bite off more than you can chew. The more decisions you leave to the agent, the higher the chance of spaghetti. It is often faster to run two small traces (with review) than to run one that is twice the size but that you feel unsure the AI can handle.
- Every once in a while, ask the agent to do something you suspect it won't handle. I have found that as models get better, Cursor makes updates, and I get better at prompting, stuff that would have exploded before becomes doable. For example, I had it build a threads feature for a chat app in one shot for the lols. I expected to have to wipe everything, but I only had to reorganize code in two places and it worked. The UI needs polish, but this was easily 3 days of work done in 3-4 minutes. Sometimes it can surprise you.
Your mentality is wrong. Instead of trying to get a model to produce the code itself in a massive codebase, make the model build each feature individually and then combine them. It can do so much more if you have functional code it can copy and paste. It is horrible at generating code, but it is fantastic at finding and combining code.
Also, not every use case has the code as the product. My job requires me to make hundreds of one-time-use scripts a day to analyze data. So accuracy matters less, because I can verify the results and try again, and even if the model is only 50 percent accurate, it can still write and run the scripts 1000 times faster than I can.
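For readers unfamiliar with this workflow, here is a minimal sketch of the kind of throwaway script meant here, assuming a Node.js environment; `results.csv` and the column layout are made up. The point is that it is cheap to generate, run, eyeball, and discard.

```typescript
// Hypothetical one-time analysis script: read a CSV of results,
// report count, mean, and max of one metric, then get thrown away.
import { readFileSync } from "node:fs";

const values = readFileSync("results.csv", "utf8") // results.csv is a placeholder input
  .trim()
  .split("\n")
  .slice(1) // skip the header row
  .map((line) => Number(line.split(",")[1])) // assume the metric sits in column 2
  .filter((n) => !Number.isNaN(n));

const mean = values.reduce((sum, n) => sum + n, 0) / values.length;
console.log(`rows=${values.length} mean=${mean.toFixed(3)} max=${Math.max(...values)}`);
```

If the numbers look wrong, you regenerate the script and rerun; verification stays with the human, which is why a 50-percent-accurate model is still a huge speedup here.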
Interesting stuff. I see mostly convergence among us in our strategies. I will say one use of agents that would probably be beneficial is if you were trying to build a large, ambitious app from scratch. You can assign an agent/model to oversee the work of the other agents/models to help ensure they stick to the plan and complete all the steps successfully. I've had success with this approach, particularly in keeping things like imports and variable names consistent across new files, avoiding replicated code, and so on. I always instructed this agent to be highly skeptical and critical of the other agents, and this did help find some errors. It's still very difficult to do a big application like that, though. For me, it's more headaches than it is worth. The challenge of keeping so many balls in the air at once is invigorating, but the costs are just too high right now, especially given the higher chance of failure or the need for complex fixes and refactoring when taking on such ambitious goals. The step-by-step method is slower, but I think it is ultimately cheaper because of fewer mistakes.
If you really code and understand how to properly architect your app, I would really avoid sweeping background tasks. What are you trying to accomplish, saving time by letting the AI write all the code at once? Give it smaller bites, spoon-feed it, and have a lot more confidence in the output. Monitor it, and hold its hand. It's still faster than manual coding, and you save all the debugging time. If you're struggling to get it to refactor 100 lines of code, then it's either the model you are using or your prompting. Or use a more expensive model; Claude 4.5 is a beast compared to GPT and Gemini, and worth the extra cost to get it done right the first time!
The clearer the instructions you give when prompting, the better it works. Also, poor grammar in the prompt is highly likely to affect results.
6,000-line code files are very poor practice, because they're not modular, and modularity is important for contextualization. Breaking that up into twelve 500-line files would go much better.
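A minimal sketch of what that split might look like in a TypeScript project; the file and symbol names are invented, and the pieces are shown in one listing with comments marking the intended file boundaries.

```typescript
// Before: one 6,000-line orders.ts holding types, validation, pricing, and persistence.
// After: small, focused modules plus a barrel file so existing imports keep working.

// orders/types.ts
export interface Order {
  id: string;
  items: string[];
  total: number;
}

// orders/validate.ts  (would start with: import { Order } from "./types";)
export function validateOrder(order: Order): boolean {
  return order.items.length > 0 && order.total >= 0;
}

// orders/index.ts  (barrel file, so call sites can keep importing from "orders")
// export * from "./types";
// export * from "./validate";
```

Each small module fits comfortably in the agent's context, and you only reference the two or three files a given change actually touches.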
Also, the grammar and punctuation in the prompt make a big difference (because they determine the part, and hence the quality, of the latent thought space you're accessing).
I made a rule for this memory thing. I said: before and after doing anything, read a file I named memory.txt to see what you did before, and after you're done, log each and every thing you did into that file. It turned out pretty nice! The only problem now is that the file grew too large and I have to split it into smaller files. I don't know how to do that, but this huge file (100k+ lines) does provide very good context to any new prompt.
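For the splitting problem, here is a minimal sketch of one way to do it, assuming Node.js; `memory.txt`, the output naming, and the 2,000-line chunk size are just placeholders to adjust for your context budget.

```typescript
// Hypothetical helper: split an oversized memory.txt into numbered chunks
// so each prompt can load only the most recent slice of history.
import { readFileSync, writeFileSync } from "node:fs";

const LINES_PER_CHUNK = 2000; // arbitrary size; tune to your context budget
const lines = readFileSync("memory.txt", "utf8").split("\n");

for (let i = 0; i < lines.length; i += LINES_PER_CHUNK) {
  const part = String(Math.floor(i / LINES_PER_CHUNK) + 1).padStart(3, "0");
  writeFileSync(`memory-${part}.txt`, lines.slice(i, i + LINES_PER_CHUNK).join("\n"));
}
```

You could then point the rule at only the latest chunk (plus perhaps a short summary file) instead of the whole history.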
Because of that issue, I built this MCP server:
GitHub - neverinfamous/memory-journal-mcp: A MCP server built for developers enabling Git based project management with project and personal journaling. Think of it as a scrapbook for your projects that captures technical details, GitHub issues, code context, and the personal threads that shape a project's story.
You can save important context in git-integrated context bundles along with any important notes. Then you can reload context bundles, edit them, update them, add new ones, etc., as you develop.
You see, it was able to do it all, but now it's not anymore. Greedy big companies, after the promotional period.
What do you mean, “Yet to confidently have any model complete a single simple instruction??”?
I've been using Cursor for about six months now, and having a model complete a simple instruction is trivial. On day one, I had it building entire Next.js websites, starting from ZERO, to fully functioning, building, unit-tested, and deployable, in a matter of minutes. Then, throughout the rest of the day, I had fully functioning pages with rich behavior implemented into those apps.
That goes well beyond having a simple instruction completed successfully. Now, there have been some very complex aspects to those apps, and only one has been released so far, but I have had agents work through some EXTREMELY COMPLEX tasks, requiring extremely heavy use of the agent and very long-running chats, and still succeed.
So that brings up the question: HOW ARE YOU using Cursor? Many of us are using it successfully. That hasn't been without challenges; Cursor has been going through some upheaval lately. There have been some tremendous issues, like fundamentally broken terminals, critically broken (and still broken) features, weird model behaviors (which still occur; I believe it's inherent to the nature of current-generation LLMs and the way they are designed and function), etc. However, many of us have used Cursor and its agent quite successfully. It is certainly an accelerant for me. I'm not getting anything like 10x on everything I do… I find it is more like 3x, maybe 5x on some tasks.
So, what exactly are you doing, that not even a single simple instruction is successfully handled?
- Which model(s) have you been using?
- Do you have proper rules set up? (PROPER, mind you, search my posts for what I mean by that.)
- Do you have workspace-wide rules that govern all common code consistently?
- Do you have project-specific rules that govern each project according to its unique needs?
- Do you use @Docs to ensure your agent is supplied with the right and critically important KNOWLEDGE it REQUIRES to get the job done properly? (This may not work; it seems it may be broken in current versions of Cursor, but it should work in older versions, i.e. 1.3.x.)
- Do you include ALL relevant and important context in each and every prompt you issue?
- Do you use a planning approach, or do you just wing it?
  - Depends on the kind of task; sometimes winging it (i.e. vibing) is fine for small, simple tasks
  - Complex tasks need richer planning, much richer context, correct knowledge, and specificity
- Do you provide vague instructions or explicit instructions?
Something to be aware of. “AI” as we call it, is not, in any way, an “intelligence” at all! There is no intelligence in an LLM. Before I had made extensive use of an LLM and done much deeper research on how they are designed and function, I thought there was, but it is pretty clear to anyone who uses an LLM day in and day out every day, for long hours each day, that LLMs are NOT any form of intelligence. We will not be seeing AGI/ASI come crawling out of an LLM one day, not now, not 100 years from now. An LLM is a very advanced, almost futuristic, KNOWLEDGE BASE. These things contain artificial neural networks and densely packed factoids that represent knowledge, and a means to process natural language (or programming language) input to produce natural language or programming language outputs.
If you are assuming an LLM is intelligent and trying to rely wholly and completely on the model's "intelligence" to get you through the day… well, that will fail. Acknowledge that an LLM is a model of knowledge and effectively an NLP system, but that it lacks any kind of actual intelligence, and BRING YOUR OWN INTELLIGENCE (BYOI) to the mix, and you should find that your results improve.
Not only do models not have any real intelligence, but they currently really lack any kind of meaningful form of memory. Context, and context cache, are their only memory. If you are constantly nuking your context (i.e. new chats), but still working on the same task, then a model is constantly having to start from scratch, and will usually lack context (memory of what you were having it work on). You have to bring the intelligence, the framework of expectations (rules), and ensure the model is given the correct memory (context!) so that it can work effectively. Anything beyond the most rudimentary tasks (i.e. “Convert all h3 elements to h2 elements” is rudimentary, and most models can work on that without much context, other than possibly specific file references), and you need to….well, use the tool properly.
Works for me, the biggest problem I have had is with it running off to one of my other projects and starting to overwrite it there before I realise.
- No matter how detailed your prompt is, it will almost always fail to deliver at first and then start working toward a working skeleton.
- I expect this and just work back toward the end result; it's my vision, and I QA each step against my expectations.
- I use the CLI, and I find it's much faster to let it do all the work, like building and installing to a connected Android device, then doing a quick test and giving feedback.
I'm on my 3rd app ready to go in a week of evenings, and I only use the auto option.