I ran a complicated request to build C# unit tests in ChatGPT with O3-mini-high after noticing that the quality of Sonnet's responses in Cursor's agent mode has gone down lately. To my surprise, O3-mini-high basically nailed a prompt that involved 16,000 characters of code context and a short, lazy instruction: “please understand this code. and understand the caching issue, and make some appropriate automated tests for it so ensure that the cache invalidation works for outdated ones.”
Among those 16,000 characters of code, it is not immediately obvious what the “caching issue” is, or what “ones” refers to in “ensure that the cache invalidation works for outdated ones.” O3-mini-high understood exactly what the problem was and what the “ones” referred to, and wrote the unit test exactly how I imagined it, using the Moq framework. The only changes I had to make were deleting some superfluous mocks and some models that already existed, and mocking AutoMapper. Other than that, it knew to use an in-memory database with Entity Framework, and knew how to mock all the interfaces to isolate the testing scenario precisely.
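To make that concrete, here is roughly the shape of the test it produced, as a minimal sketch: the names (CacheEntry, AppDbContext, IDataSource, CachingService) are made-up stand-ins because I can't share the real code, I'm assuming xUnit, and I've left out the mocked AutoMapper IMapper to keep it short. The ingredients are the same though: Moq for the interfaces, and the EF Core in-memory provider instead of a real database.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Moq;
using Xunit;

// Hypothetical stand-ins for my real classes (the actual names and logic differ).
public class CacheEntry
{
    public int Id { get; set; }
    public string Key { get; set; } = "";
    public string Value { get; set; } = "";
    public DateTime ExpiresAtUtc { get; set; }
}

public class AppDbContext : DbContext
{
    public AppDbContext(DbContextOptions<AppDbContext> options) : base(options) { }
    public DbSet<CacheEntry> CacheEntries => Set<CacheEntry>();
}

public interface IDataSource
{
    Task<string> FetchLatestAsync(string key);
}

// Simplified version of the service whose invalidation logic is under test.
public class CachingService
{
    private readonly AppDbContext _db;
    private readonly IDataSource _source;

    public CachingService(AppDbContext db, IDataSource source) => (_db, _source) = (db, source);

    public async Task<string> GetAsync(string key)
    {
        var entry = await _db.CacheEntries.SingleOrDefaultAsync(e => e.Key == key);
        if (entry is not null && entry.ExpiresAtUtc > DateTime.UtcNow)
            return entry.Value;                          // still fresh: serve from cache

        var fresh = await _source.FetchLatestAsync(key); // outdated or missing: refetch
        entry ??= _db.CacheEntries.Add(new CacheEntry { Key = key }).Entity;
        entry.Value = fresh;
        entry.ExpiresAtUtc = DateTime.UtcNow.AddMinutes(10);
        await _db.SaveChangesAsync();
        return fresh;
    }
}

public class CacheInvalidationTests
{
    [Fact]
    public async Task GetAsync_ReplacesOutdatedEntry_WithFreshValue()
    {
        // EF Core in-memory database instead of a real SQL server.
        var options = new DbContextOptionsBuilder<AppDbContext>()
            .UseInMemoryDatabase(Guid.NewGuid().ToString())
            .Options;
        await using var db = new AppDbContext(options);

        // Seed an entry that is already past its expiry.
        db.CacheEntries.Add(new CacheEntry
        {
            Key = "product:1",
            Value = "stale",
            ExpiresAtUtc = DateTime.UtcNow.AddMinutes(-5)
        });
        await db.SaveChangesAsync();

        // Mock the external dependency so only the invalidation logic is exercised.
        var source = new Mock<IDataSource>();
        source.Setup(s => s.FetchLatestAsync("product:1")).ReturnsAsync("fresh");

        var service = new CachingService(db, source.Object);

        var result = await service.GetAsync("product:1");

        Assert.Equal("fresh", result);
        var entry = await db.CacheEntries.SingleAsync(e => e.Key == "product:1");
        Assert.True(entry.ExpiresAtUtc > DateTime.UtcNow);
        source.Verify(s => s.FetchLatestAsync("product:1"), Times.Once());
    }
}
```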
Even though I’ve been using AI for coding for a while, I still jump out of my chair on a regular basis at what it can achieve.
So I decided to experiment with the same prompt in other models. I also wanted to see whether it mattered that I used Cursor or the “native app,” whether normal vs. agent mode made any difference, and whether there was a difference between pasting the classes entirely into the prompt or referencing them with @ in Cursor. Turns out there was a big difference!
My verdict: for this particular use case, with a lot of context, an ambiguous instruction, and no need to take existing code conventions into account (there was no need for an agent to find more context than provided), it was better to paste the relevant context into the native app than to use Cursor.
O3-mini-high on ChatGPT gave a very satisfactory response. It made a few easily correctable mistakes, but overall it was very impressive. From the prompt, it understood that my intention was to create ONE test, and that’s what it did.
O3-mini-high on Cursor in normal mode with pasted context also gave a really good response. It understood the problem and provided unit tests with mocked interfaces and an in-memory database. It was slightly more hallucinatory and made superfluous tests and changes.
O3-mini-high on Cursor with @file references and normal mode gave even wobblier output: overcomplicated mocking and slightly weird logic, but it was still onto something.
O3-mini-high with @file references and agent mode was really awful. It left all sorts of weird assumptive comments, said it omitted details for brevity, and even re-implemented the class I wanted to test inside the test file.
Sonnet 3.5 on the Claude app also gave a good response. It provided a simpler structure than O3-mini-high on ChatGPT. It spotted potential issues with the code I gave it and added three extra tests for race conditions, which I hadn’t thought of. While O3-mini-high adhered more closely to my request, Sonnet’s response could actually be more useful, were it not for the bunch of compile errors, the missing instantiation of used fields, and the assumptions about non-existent classes.
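To give an idea of what those race-condition tests looked like, here’s a self-contained sketch in the same spirit (SimpleCache is a made-up, stripped-down cache, not my code or Sonnet’s output): concurrent requests for the same uncached key should hit the backing source only once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Xunit;

// Deliberately simplified, hypothetical cache used only to illustrate the test shape.
public class SimpleCache
{
    private readonly Func<string, Task<string>> _fetch;
    private readonly ConcurrentDictionary<string, string> _values = new();
    private readonly SemaphoreSlim _refreshLock = new(1, 1);

    public SimpleCache(Func<string, Task<string>> fetch) => _fetch = fetch;

    public async Task<string> GetAsync(string key)
    {
        if (_values.TryGetValue(key, out var cached))
            return cached;

        await _refreshLock.WaitAsync();              // serialize refreshes to avoid a stampede
        try
        {
            if (_values.TryGetValue(key, out cached)) // someone else refreshed while we waited
                return cached;
            var fresh = await _fetch(key);
            _values[key] = fresh;
            return fresh;
        }
        finally
        {
            _refreshLock.Release();
        }
    }
}

public class RaceConditionTests
{
    [Fact]
    public async Task ConcurrentRequests_ForSameMissingKey_FetchOnlyOnce()
    {
        var fetchCount = 0;
        var cache = new SimpleCache(async _ =>
        {
            Interlocked.Increment(ref fetchCount);
            await Task.Delay(50);                    // widen the race window
            return "fresh";
        });

        var results = await Task.WhenAll(
            Enumerable.Range(0, 20).Select(_ => cache.GetAsync("product:1")));

        Assert.All(results, r => Assert.Equal("fresh", r));
        Assert.Equal(1, fetchCount);                 // the refresh was not duplicated
    }
}
```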
Sonnet 3.5 on Cursor was a bit more unstable and showed a similar downwards trajectory from plain context → file references → agent mode. Moreover, the responses seemed to vary a lot in quality. The first time I ran the request with Sonnet 3.5 in agent mode with file references, it misunderstood the task completely and decided to make changes to the original files. A second attempt produced some trainwrecks of unit tests; same with DeepSeek with file references. Pure garbage. Strangely enough, after I had run the prompt with other models a lot of times, Sonnet in agent mode started getting its act together. Still, agent mode seems to have an affinity for editing existing files and not listening closely enough to the request.
DeepSeek R1 in chat was good, on par with Sonnet 3.5, but it made a lot of rookie mistakes, such as mocking the DbContext (instead of using the in-memory EF database) and leaving comments like “simplifying for example.”
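For anyone wondering why mocking the DbContext counts as a rookie mistake: hand-wiring a DbSet mock looks roughly like the sketch below (MockableDbContext and Widget are made-up types). It’s verbose, it only covers synchronous LINQ (async operators like ToListAsync will throw), and it requires the context to be designed for mocking in the first place, whereas the in-memory provider just works.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.EntityFrameworkCore;
using Moq;

// Hypothetical context shaped so it *can* be mocked: virtual DbSet, parameterless ctor.
public class Widget { public int Id { get; set; } }

public class MockableDbContext : DbContext
{
    public virtual DbSet<Widget> Widgets { get; set; } = null!;
}

public static class DbContextMockingExample
{
    public static Mock<MockableDbContext> Build()
    {
        // Wiring a DbSet mock up to a LINQ-queryable list: lots of ceremony,
        // and it only supports synchronous queries.
        var data = new List<Widget> { new() { Id = 1 } }.AsQueryable();
        var set = new Mock<DbSet<Widget>>();
        set.As<IQueryable<Widget>>().Setup(m => m.Provider).Returns(data.Provider);
        set.As<IQueryable<Widget>>().Setup(m => m.Expression).Returns(data.Expression);
        set.As<IQueryable<Widget>>().Setup(m => m.ElementType).Returns(data.ElementType);
        set.As<IQueryable<Widget>>().Setup(m => m.GetEnumerator()).Returns(() => data.GetEnumerator());

        var db = new Mock<MockableDbContext>();
        db.Setup(c => c.Widgets).Returns(set.Object);
        return db;
    }
}
```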
It turns out that O3-mini-high is the best model right now, especially when you have an ambiguous, “lazy” instruction and are just hoping the model will “get” it.