Agent/model issues understanding the proper precedence with unit test fixes

I’ve been having Cursor generate unit tests since I first started using it, about five months ago now. I’ve noticed a very common problem with the agent + model when it comes to solving problems with tests:

They will often change correct production-level code to accommodate BADLY DESIGNED TEST CASES!

Simple example case here. I have been working on the authn/authz APIs, guards, etc. for some soon-to-be-implemented new APIs in Nest.js. When all these units of code were first generated, I took the time to review each one, and I was satisfied with how they were implemented. Apparently, when I committed those changes, one particular test suite fell through the cracks: the JwtAuthGuard tests. They were written, but I guess the non-deterministic model did not have the agent include that suite when it ran the tests for the whole body of work.
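For context, a guard like this typically follows the standard Nest.js pattern, roughly the shape below (a simplified, generic sketch rather than my exact code; the `isPublic`/Reflector check is a common convention, not necessarily part of my implementation):

```typescript
// Generic sketch of a typical Nest.js JwtAuthGuard, not my actual code.
import { ExecutionContext, Injectable } from '@nestjs/common';
import { Reflector } from '@nestjs/core';
import { AuthGuard } from '@nestjs/passport';

@Injectable()
export class JwtAuthGuard extends AuthGuard('jwt') {
  constructor(private readonly reflector: Reflector) {
    super();
  }

  canActivate(context: ExecutionContext) {
    // Skip JWT validation for handlers/classes flagged as public.
    const isPublic = this.reflector.getAllAndOverride<boolean>('isPublic', [
      context.getHandler(),
      context.getClass(),
    ]);
    if (isPublic) {
      return true;
    }
    // Otherwise defer to the passport-jwt strategy.
    return super.canActivate(context);
  }
}
```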

Well, two more bodies of work later, I told the agent to run the unit tests. I did not scope it with @Recent Changes (a simple oversight, I guess), so it ran the full suite! Lo and behold, there was an issue with the JwtAuthGuard tests. The agent noticed that and proceeded to “fix” the test. Problem is, when I reviewed its changes, it had changed the production code in the guard, in a way I really did not like!

Digging deeper, the agent/model told me, straight up and rather explicitly, that the DESIGN OF THE MOCKS used for the test was the reason the production code was changed! Instant alarm bells for me!

Badly Designed Test Fakes → Changes to Previously Working Production Code

I’ve actually encountered this issue quite often, to my dismay. In a general sense, if bad test fakes lead to a problem with the tests, that should be an indication that the fakes need to be fixed, not that the production-level code needs to be changed.
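To make the distinction concrete, here is an illustrative Jest-style sketch of the failure mode (hypothetical code, not my actual suite), assuming a guard like the one sketched above that consults Reflector.getAllAndOverride():

```typescript
// Hypothetical example of a badly designed fake vs. a correct one (Jest).
import { Reflector } from '@nestjs/core';

// BAD FAKE: stubs reflector.get(), but the guard calls getAllAndOverride().
// The resulting failure tempts the agent to rewrite the guard to call get()
// instead, i.e. to change working production code to match the fake.
const badReflector = {
  get: jest.fn().mockReturnValue(true),
} as unknown as Reflector;

// GOOD FAKE: models the method the guard actually depends on, so the test
// exercises the real contract and the production code stays untouched.
const goodReflector = {
  getAllAndOverride: jest.fn().mockReturnValue(true),
} as unknown as Reflector;
```

When the fake’s shape disagrees with the real contract of the dependency it stands in for, the fake is the thing that is wrong.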

I can understand that this may seem tricky: if you change your tests, you might not catch the errors they were intended to catch, right? If we were talking about changing YOUR TESTS, your actual test cases, then yes, I would agree with you. However, in this case, and it’s an all-too-common case, the design of the test cases themselves was actually sound (I reviewed those too!). The problem was the test fakes, not the design of the test cases.

Changing the production-level code here was a really bad move, and one that I wonder could be headed off more often if the agent could somehow make the distinction between test fake issues and test case issues. If it’s a test fake issue, then I think at the very least the first course of action should be to try to fix the fakes, or at least investigate whether the issue can be fixed that way. I think that would head off the vast majority of these cases where the agent/model changes working, perfectly fine production code to paper over a badly designed test case or an insufficient test fake.

When it comes to resolving actual test case issues… I don’t know what the best course of action is. The test cases themselves are supposed to verify that your production-level implementation meets requirements (or a specification), which will usually come from a user story. So I don’t think it is quite as cut and dried here which side of the equation needs to be “fixed” in the event of a failing test: the test case or the unit being tested.

Even in this case, though… I wonder how we could better guide the agent/LLM towards making the best possible decision there. If you actually have access to the ticket (and with MCP servers that is becoming more likely by the day), maybe there is a way to build Cursor rules that help guide the situation and support more automated, correct resolution of such test failure issues.
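For instance, something roughly along these lines as a project rule (just a sketch; the exact rule file syntax and wording would need tuning, and the ticket lookup assumes an MCP integration is actually wired up):

```
---
description: Precedence for resolving failing unit tests
globs: **/*.spec.ts
alwaysApply: false
---

When a unit test fails:

1. First decide whether the failure stems from the test fakes/mocks or from the test case's assertions.
2. If a fake does not match the real contract of the dependency it stands in for, fix the fake. Do not modify production code to accommodate a fake.
3. Only change production code when the test case correctly reflects the requirement (check the referenced story/ticket if one is available) and the implementation demonstrably violates it.
4. If it is unclear which side is wrong, stop and ask before touching production code.
```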

More and more now, I start my work with a story number reference… so that might help, at least in some cases. I wonder, though, whether there are concrete cues, sourced from the code itself, that could help guide which side of the equation needs to be fixed to resolve a unit test failure, without needing the user to step in and figure that out themselves (and burn tokens/requests trying to get it right via more manual intervention!).