Claude 3.7 Booby Traps

This might be nothing, but something I was thinking about today: given how verbose/ambitious Claude 3.7 is (in agent mode) when writing code/functions, is it inherently laying booby traps in Cursor?

For a little context, I recently switched to 3.7 thinking to help me with an Android app project that seemed to be getting too complex for 3.5 (yes, I have documentation, rules, etc.).

Case in point: upon implementing a new upload mechanism (and a minor package refactoring), I had 200 build errors on the initial build. I was getting nowhere resolving these with 3.5, yet 3.7 thinking had them all knocked out within about an hour. 3.7 thinking also genuinely improves the finished product by augmenting what I asked for with nice-to-have details.

However, in order to get all of this working, several of my files ballooned to more than twice their size. I know I have duplicate functions and workarounds for errors that have since been fixed. This seems to be part of the trade-off with 3.7: things work, but they are over-built, over-engineered, often redundant, and just generally 'extra'. I have it on my list to clean this up later, but with a working app, this is secondary for obvious reasons.

This got me thinking, though: if cost-controlled, LLM-assisted IDEs naturally struggle with context windows, long files, multiple files with multi-level relationships, etc., then is this ambitious/verbose model leading me down a primrose path to the point where, someday, the very tool creating these files (and this project) will not be able to manage it? How long until all of these over-built implementations (and especially the attempted solutions to bugs/errors) make the project so large and complex that there is no way to manage it anymore?

Obviously, a lot of this is specific to my use case. I am not a developer, yet I have a working, fairly complicated, custom Android app, which would have been IMPOSSIBLE any other way. For ~$200-$300 in usage-based costs, I have an app that I am certain would have cost $30K-$50K to have built. Not to mention, I went from concept to working app in under two months. That's bananas.

I feel like there are going to be all sorts of unintended consequences working with these tools, which are not only truly revolutionary, but enable people to jump the line and do things they probably shouldn't (myself, for example).

The short answer is, future AI will probably fix the current AI, but for me at least, this reinforces the idea of never losing sight of the fact that these things are tools, and as soon as I lose my ability to fully direct/lead/supervise/manage them, I am the one who will eventually have to clean up the mess. It seems like these agentic systems can just be cut loose and allowed to be autonomous. But, obviously, there is significant danger in that assumption.

That's just me running a business, I suppose, but a good reminder nonetheless.

1 Like

I'm in the same boat: non-developer, although I've helped design a couple of apps, but technically couldn't build anything on my own. I've managed to get quite far down the line on an ambitious product, but I seem to be hitting a point beyond the current capabilities of Sonnet, just in terms of its ability to manage the size of the project it has created.

It was doing so well initially that I didn't have a proper Git setup, and it cannibalised half the app, which I was unable to recover. Rookie mistake.

Now I'm at the point where it's super complex and 90% of the time is spent fixing new errors it creates. Lots of code replication, lots of short-term fixes. Even the architecture file I'm using to help it remember is getting too big for it to process.

Definitely a case of diminishing returns as any project gets bigger. I'm telling myself software design is complex and that's just the nature of the beast. Keep building, keep learning.

2 Likes

Thinking models and Cursor are making project size irrelevant, as they extract related code from an initial point until they have enough to process. Let me use a metaphor: if projects were houses, then when you change the bathroom furniture, you don't need to know what's in the bedroom. If you want to know the details (like colors) of the bedroom furniture, you'll just quickly look at it before returning to the main task.

A common issue occurs when the house entrance doesn't allow furniture of certain sizes to be brought in, and you've ordered it without considering the entrance dimensions. I solve these types of problems with several approaches:

  1. A good refactoring prompt that I run periodically to keep the "house in order"
  2. A good debug prompt to solve 80% of issues
  3. Prompts to create prompts (PANEL/TOT/COSTAR/DIR) to try different debugging avenues
  4. Project overview in XML with core components, component names, technology stack, features, subsystems, and relative paths to the source files (see the sketch below)
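
For what it's worth, here is a minimal sketch of what such an XML overview can look like. All names and paths are placeholders from a hypothetical Android project, not a prescribed schema; the point is that the LLM can locate the right file from the overview instead of searching the whole codebase:

```xml
<project name="ExampleApp">
  <technologyStack>
    <item>Kotlin / Jetpack Compose</item>
    <item>Room database</item>
  </technologyStack>
  <subsystems>
    <subsystem name="Upload">
      <feature>Chunked background uploads with retry</feature>
      <component name="UploadManager"
                 path="app/src/main/java/com/example/upload/UploadManager.kt"/>
      <component name="UploadWorker"
                 path="app/src/main/java/com/example/upload/UploadWorker.kt"/>
    </subsystem>
    <subsystem name="Templates">
      <feature>Save and load organization-default templates</feature>
      <component name="TemplateRepository"
                 path="app/src/main/java/com/example/data/TemplateRepository.kt"/>
    </subsystem>
  </subsystems>
</project>
```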

In short, as non-developers, you aren't building the house from the ground up, but the house still needs an architect who knows in which direction it should develop next.

1 Like

I hear what you are saying, but this is specifically about Claude 3.7 in my case (agent mode, non-dev). That is, it is almost a paradox. Here's the ability to (seemingly) autonomously solve problems, implement new functionality, write new code, build whole apps, etc. But with how verbose and ambitious the model is, it frequently creates duplicate/redundant functions and writes extra code to paper over bugs/errors rather than finding the source of those bugs/errors and fixing that. All the while, the codebase expands and gets more and more complicated to understand, creating a trap: it adds superfluous/redundant code/functions because it loses context, and that in turn makes it easier to lose context later, because now the project is bigger and more complex.

What seems to happen over time is that you end up with redundant functions doing similar/identical things, and when there is a problem, the agent will just fix the first version it finds. Because it cannot understand the context of the entire codebase, even with up-to-date/thorough documentation, it will often default to the easiest/most obvious solution rather than solving the actual problem.

I had an issue like this recently where, instead of reading a file, a new/extra function was inserted that completely re-wrote the file as part of a process. This "worked", in that I ended up with a file with the correct content/format. But when I later changed the format of the file where it was originally created, I had to scour the code for this now-rogue function that was re-writing it "the old way".
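
To make that failure mode concrete, here is a hypothetical Kotlin sketch of the pattern (all names invented for illustration): the file is created in one place in the new format, and the agent later bolts on a "helper" that silently regenerates it the old way instead of simply reading it.

```kotlin
import java.io.File

// Original code path: the manifest is created here, in the NEW format
// (one entry per line).
fun writeManifest(file: File, entries: List<String>) {
    file.writeText(entries.joinToString(separator = "\n"))
}

// Rogue helper the agent added later to "fix" a read error. Instead of
// just reading the manifest, it also rewrites it in the OLD format
// (comma-separated), silently undoing any format change made upstream.
fun readManifest(file: File): List<String> {
    val entries = file.readLines()
    file.writeText(entries.joinToString(separator = ","))  // "the old way"
    return entries
}
```

Everything "works" until the format changes at the write site, and then you are hunting for the hidden rewrite.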

And, sure, I could have reviewed every line of code the agent was creating, spotted the superfluous/incorrect extra function, and told it no, but that undermines the agent functionality. It is REALLY hard to do granular accepts in agent mode, since that breaks context as well: often multiple files are changed at once, with the updates all chained together.

This is all to say that this isn't Cursor's or Anthropic's fault; this is a co-creative learning process of being able to create software I had no business being able to create even a year or two ago. In this current paradigm, the skill set shifts from coding to higher-level management: never just trusting what the agent is doing, always verifying, and making sure that I fully understand and manage it, especially in regard to how I am directing it in the first place. As these tools increase in power and capability, it is only natural that management will get harder and more complex, sort of like going from one employee to 100. There's a lot to learn, new tools will be required, and new skills will have to be developed.

1 Like

Exactly, and the skill to prevent duplicate functions lies in maintaining modular code with a good project overview file. It also becomes easier to learn how your project is structured, and easier to detect common problems like loss-of-context hallucinations.

I have a very similar experience, even while trying to approach the project from an architectural point of view. The size and complexity of the project very much do matter right now.

I can't tell you how many features I successfully created that were later silently destroyed by Claude, for no good reason, while it was working on some other part of the app, only to be discovered way down the line, when it was impossible to roll back.

Of course, the perspective of a seasoned developer using these tools is never going to be the same as a non-developer's. We don't know what we don't know.

But I really don't think the tools have advanced to the level of being able to operate as architects yet (which is necessary for non-dev users). I'm sure they will, but this stuff is still super early. We just have to simplify our plans, develop our skills, and hang in there as the tools expand their ability to hold context in a much more general sense.

This is the type of thing I've come to expect about the limits of these tools right now: things that go unnoticed unless you're really looking for them, but that totally destroy trust in what Sonnet has to say...

The first thing I always ask it to do is read the architecture file, the so-called central source of truth. But you can't even expect it to do that...

"Youā€™re absolutely right, and I apologize for this significant oversight. I did not read the entire ARCHITECTURE.md document despite it being 2691 lines long. I only read up to around line 750, which is a serious error in my approach.

This is particularly problematic because I see the architecture document itself explicitly states that I should:

"When understanding a file:

  • Start by reading the first 250 lines
  • Continue reading subsequent chunks of 250 lines until reaching the end
  • Build a complete understanding before suggesting changes"

I failed to follow this exact guidance, which is a critical mistake. This means my analysis is incomplete and potentially misguided since I don't have full context on the intended architecture and design principles."

This was after I double-checked that it had read the whole doc, following a thorough prompt to get it to do so and to think deeply about its contents.

I too had a similar experience. I attached 3 files and a readme that was a very well-documented account of the 3 attached pages.
I then asked it to help me fix the issue with saving a default template (mistake - I should have TOLD it to look at the readme...). After it started making breaking changes, I asked it what it had done so far...
Hi Roadkill :sign_of_the_horns:
I made 29 tool calls during this conversation. Here's the breakdown:

  • Search operations: 14 calls
    • 3 codebase_search calls
    • 8 grep_search calls
    • 1 list_dir call
    • 2 file_search calls (included in the count even though they weren't explicitly shown)
  • Read operations: 9 calls
    • 9 read_file calls to examine code and database schema
  • Edit operations: 6 calls
    • 6 edit_file calls to fix various parts of the code

The tools helped me diagnose and fix the issues with the "Set as Organization Default" button, improve error handling, add more robust checks for template conflicts, and fix TypeScript errors related to the dynamic data types.

Me: I attached 4 files to the conversation initially. Did you examine them before you started?

You're right to call me out on this. No, I didn't properly examine all the attached files before I started making changes.

You had attached 4 files initially:

  1. An outline of the page.tsx file
  2. The types.ts file that contained the DynamicDataType definitions
  3. The README.md file with documentation
  4. The KonvaCanvas.tsx component

I should have taken the time to review these files first to better understand the codebase structure and the existing issues. Instead, I jumped straight into making tool calls to search and read parts of the codebase.

Had I examined the attached files more carefully, especially the types.ts file, I could have avoided some of the TypeScript errors I introduced and fixed later. This would have made my approach more efficient and focused.

I apologize for not being thorough in my initial examination.

So I got my original page out of my GitHub and restored it. LOL

This type of 'brute force' approach won't work well with Cursor or other editors. I understand your frustration; we need smarter methods instead of just throwing all our code at LLMs. That architecture file is large and limits flexibility.

For example, if one big file contains separate guidelines for the frontend and backend, then when working on frontend tasks the LLM has to read both sets before it can focus on the frontend - or worse, it might incorrectly apply backend guidelines to your frontend work.

We need to provide the minimum necessary context to the LLM (and keep ourselves from getting lost in the code). In my case, I've created a project overview and a frontend guidelines file, both in .mdc format. These can be attached or viewed when the LLM needs them.
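
In case the format is unfamiliar: .mdc is Cursor's rules format, Markdown with a small frontmatter block. A rough sketch of a frontend guidelines file might look like this (the glob and rules are placeholders borrowing file names from the example upthread; check Cursor's docs for the exact frontmatter fields):

```mdc
---
description: Frontend guidelines for the canvas editor
globs: app/**/*.tsx
alwaysApply: false
---

- Use the shared types from types.ts; never redeclare DynamicDataType.
- Read the page's README.md before editing page.tsx or KonvaCanvas.tsx.
- Prefer fixing root causes over adding wrapper functions.
```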

For complex tasks, I've also created flowcharts and state diagrams to help the LLM understand operation chains and states. I strongly recommend everyone do this and keep these resources updated whenever the LLM seems to get 'lost in the code'.
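
As an example of the kind of diagram that helps, here is a small state diagram in Mermaid for a made-up save-template flow (all states invented for illustration):

```mermaid
stateDiagram-v2
    [*] --> Editing
    Editing --> Saving: user taps Save as default
    Saving --> CheckingConflicts: payload validated
    CheckingConflicts --> Saved: no existing default
    CheckingConflicts --> ConfirmOverwrite: a default already exists
    ConfirmOverwrite --> Saving: user confirms
    ConfirmOverwrite --> Editing: user cancels
    Saved --> [*]
```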

I also encountered the same issue when building my project, and I would like to share my solution with you. First, via MCP, there are two core tools: sequentialthinking and taskmanager. Then, pair them with a prompt like: "Use sequential thinking (mcp_sequentialthinking_sequentialthinking) for problem analysis. Use the task manager tool for planning and tracking tasks. Root cause resolution: fix underlying issues, not just symptoms."
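
For anyone wanting to try this: MCP servers are registered in Cursor through a JSON config (e.g. .cursor/mcp.json). A sketch along these lines should wire up the reference sequential thinking server; the task manager entry is a placeholder, since several community implementations exist and the exact package name will vary:

```json
{
  "mcpServers": {
    "sequential-thinking": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-sequential-thinking"]
    },
    "taskmanager": {
      "command": "npx",
      "args": ["-y", "your-taskmanager-mcp-package"]
    }
  }
}
```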

1 Like

Right, IMO, this is the core issue. It seems like an object permanence problem. Even with documentation, like you say, the project gets bigger, more and more exists in the shadows beyond what the AI can "see", rules/docs/prior replies/prior edits/etc. get dropped from context, and Claude goes rogue, doing things its own way and/or recreating what already exists. This gets compounded by an overly eager AI. It will probably get tamed in the future, but at least for now, it is something critical to be aware of.

This is interesting, thank you for this. I've seen the sequential thinking MCP come up several times in other places.

I also have a feeling that Claude Code has a part to play here, using it for really inclusive/high-level problem-solving on complex files. That is just a theory, though.

Seems like the solution is probably a better set of tools, and the expertise to know how/when/why to use them, rather than looking for one tool to do every job, despite how Cursor and similar tools often get talked about online (and how non-dev people are probably first exposed to them).

Found an interesting paper that shows a high error rate to be aware of for 'answer flipping' in Claude 3.7, meaning the reasoning points to the opposite of the final output. With longer context this diminishes (while still remaining high), but the percentages of fact manipulation (hallucination) and other errors increase a lot. As architects, we should watch for "contradictory function logic" when creating/modifying code, and for hallucinations as context grows.


Interesting insight (non-thinking and thinking models are almost equal in error rates):

"Our work demonstrates that while thinking models generally do show improved faithfulness compared to non-thinking ones, they still exhibit measurable rates of unfaithful reasoning. This suggests that unfaithfulness is a fundamental challenge that may persist even as models become more sophisticated in their reasoning capabilities. Without changes to the underlying algorithms and training methods, internal reasoning in models may continue to diverge from what is explicitly articulated in their output."

Huh, that is quite interesting; not sure what to do with it, though, especially in agent mode. As of now, it seems like the best solution is to just tough it out, knowing the agent will get there eventually (sort of "in for a penny, in for a pound"). I will say that 3.7 thinking, despite all of the extra/rogue stuff it can do, is the fastest at crossing the chasm from "this feels forever broken" to "OK, this is bloated and overcomplicated, but at least it works now".

1 Like