OK I have no reason to make this post other than I am dumbfounded by what I’ve just witnessed.
I had a service provider who made a significant, complicated change to their API that I needed to accommodate in an existing Svelte app. Not only were the changes complicated, but I needed to maintain backwards compatibility with the prior API version for specific data. It took me 40 mins just to formulate the prompt explaining everything, including key examples of the previous API and the new format. Used o3-mini w/ composer agent mode.
In one shot it made the changes 100% perfectly, and caught an edge case I didn’t even think of. I actually thought I was looking at the wrong environment at first because I couldn’t find a single thing broken.
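(For anyone curious what the backwards-compat part looks like in practice, the general shape is just a normalization layer that accepts both the old and the new response formats. The sketch below is hypothetical and in Python for brevity; the real app is Svelte and the real field names are different.)

```python
# Hypothetical sketch of a backwards-compatible normalizer.
# Field names (customer_name, customer, line_items, items) are invented for
# illustration; they are not the real API's fields.
from typing import Any


def normalize_order(payload: dict[str, Any]) -> dict[str, Any]:
    """Map either the old (v1) or the new (v2) response shape onto one internal shape."""
    if "line_items" in payload:
        # Assume only the new format carries "line_items" and a nested customer object.
        return {
            "customer": payload["customer"]["name"],
            "items": payload["line_items"],
        }
    # Otherwise fall back to the old, flat format.
    return {
        "customer": payload["customer_name"],
        "items": payload.get("items", []),
    }
```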
The agent mode with o3-mini in Cursor is working? Just yesterday it was giving worse code than 4o or medium-sized local models… O3-mini not agentic? - #23 by Kirai
Watching progress here with interest. o3-mini is an amazing model, and I've seen it sing on chatgpt.com. I thought maybe the model just didn't support tools well, but having integrated it into my own systems (which use 20+ tools), I now know that the model supports tools brilliantly. I am assuming that the Cursor team will iron out the problems with it, and it has the potential to become the best model for coding.
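To be concrete about what "supports tools" means here: it's just the standard function-calling interface. A minimal sketch of the kind of wiring I mean, using the OpenAI Python SDK (the read_file tool is a made-up example, not one of my actual 20+ tools):

```python
# Minimal sketch: exposing one tool to o3-mini via the OpenAI Python SDK.
# The read_file tool is purely illustrative.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "Open src/app.py and summarize it."}],
    tools=tools,
)

# The model either answers directly or requests a tool call:
print(response.choices[0].message.tool_calls)
```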
Currently, after a few messages, this model does not want to perform file-change actions. It has to be asked several times to do so. Other than that, it does a good job. I am waiting for a fix from the Cursor team.
I am seeing the same. Sometimes it says “I’ll now apply the diff” and then does nothing. I’m also getting errors where the code snippet never gets formed into an edit and is instead printed directly in the chat output.
Yeah, it’s strange; like anything else, I’m finding it’s better at certain things than others. For instance, while it nailed this particular refactoring problem, in a completely different scenario o3-mini bizarrely missed that April 28, 2025 is a Monday and thought it was a Tuesday.
As of now, it is not usable; it behaves unpredictably, does not apply code changes, and lacks consistency. This is very weird, because it works quite well in chat.
I’ve found it sometimes has the same bug in ChatGPT when performing Deep Research. If you attempt multiple Deep Research runs within a chat, it writes out what it intends to do, says it is going to do it, and then does nothing.
With all due respect, o3-mini is not some “insane” model. These are the results from the Aider leaderboard (where models solve hard exercises from https://exercism.org/), and o3-mini is not some “super-model”:
Yeah, well, my experience with o3-mini in Cursor was weird: the response couldn’t handle proper formatting of an .Rmd, and the font sizes were all wonky…
I think the issue is switching models mid-context and then feeding the new bot (claude → o3-mini → r1); lemme know if anyone else has tested this.
Basically I was in a scratchpad and I thought, “huh, I read that o3 was better at this next task I want, so lemme just change model.” I referenced the prompt and context file I was asking about, and it gave me a Fn weird response.
Then I attempted the same thing from that o3 point, switching to R1, and R1 bonkers-refactored a bunch of ■■■■.
I’ll have to try again and see which happens first:
Context-lost loop, bonky refactors+++, or fully deleting all the ■■■■ in the project folder while deep into a project context session…
===
(EDIT/UPDATE: I still lost ALL the files, but SpecStory did save all my Composer history for the project. I had only just started the scratchpad, so it wasn’t that long, and I started from scratch… I am going to attempt to feed the entire Composer spec to an agent at once and see how it behaves. I’ll also feed it to my local R1 1.5 and see what weirdness I can pull from that.
I want to be able to have my composer agent talk to my R1 as well
I wonder if I can get my agent to be the product manager, R1 the dev, and me the Customer)
OK so, I just did a much more in-depth comparison between o3-mini, deepseek-r1, and claude-3.5-sonnet with a different project, in this case an existing Python urwid application. I tried to be as scientific with this as possible: I created 3 separate branches in a repo with the exact same codebase and the exact same prompts for each (same initial conditions). Before each run of the prompt, I reset composer so they all had the exact same starting point and initial context.
These are the results (listed in the order I ran the tests):
o3-mini: Executed all the listed specs in one shot without any further interaction. Technically, of the 3, it produced the ugliest chart (necessitating iteration), but functionally it is correct. The calculation was accurate. No other functional regressions noted.
deepseek-r1: After the initial prompt, the application failed to load entirely. After a 2nd prompt detailing the error message ValueError: time data 'out' does not match format '%Y-%m-%d', deepseek-r1 fixed it and the application loaded (a minimal repro of that error is sketched after these results). The calculation was accurate. The chart it produced looked better visually than o3-mini’s, but it formatted it very strangely and randomly added the word “left” with no spacing between words. Very bizarre. Regression: it inexplicably removed the existing summary card entirely, which I didn’t ask for.
claude-3.5-sonnet: Executed in one shot (the application loaded immediately). Visually, the chart looked the best of the three and maintained the color scheme I had in the application, which all the other models completely ignored. One tiny thing: the client name was technically not listed inline with the progress bar (which the other models got right), but overall it looks so much better than the others that I consider this acceptable and fine to iterate on further. The calculation isn’t right (I think due to a rounding error that the other models accounted for). No other functional regressions noted. The app still works.
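As promised above, here is a minimal repro of the ValueError that deepseek-r1 introduced, plus one defensive way to handle it (illustrative only; this is not the actual fix r1 applied):

```python
# Minimal repro of: ValueError: time data 'out' does not match format '%Y-%m-%d'
from datetime import datetime

# datetime.strptime("out", "%Y-%m-%d")  # uncommenting this raises the ValueError above


def parse_date(value: str) -> datetime | None:
    """Parse 'YYYY-MM-DD' strings, returning None for junk like 'out' instead of raising."""
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return None


print(parse_date("2025-04-28"))  # 2025-04-28 00:00:00
print(parse_date("out"))         # None
```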
I actually anticipated DeepSeek doing better than Claude did; I’m just laying out exactly what happened.
As I stated, I am going to use my SpecStory history for the lost project, but I’ll try your workflow while trying to incorporate:
A huge sprinkle of this, as a first-glance test of this structure.
–
However, I am going to feed it a Composer context history instead of a file and see how it works. But I have been lax in keeping up with /rules (because I, like most, have already spent many a token on rules)…