OK I have no reason to make this post other than I am dumbfounded by what I’ve just witnessed.
I had a service provider who made a significant, complicated change to their API that I needed to accommodate in an existing Svelte app. Not only were the changes complicated, but I needed to maintain backwards compatibility with the prior API version for specific data. It took me 40 mins just to formulate the prompt explaining everything, including key examples of the previous API and the new format. Used o3-mini w/ composer agent mode.
In one shot it made the changes 100% perfectly, and caught an edge case I didn’t even think of. I actually thought I was looking at the wrong environment at first because I couldn’t find a single thing broken.
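(For anyone curious what the backwards-compat part looks like in practice, the general shape is just a normalization layer that accepts both the old and the new response formats. The sketch below is hypothetical and in Python for brevity; the real app is Svelte and the real field names are different.)

```python
# Hypothetical sketch of a backwards-compatible normalizer.
# Field names (customer_name, customer, line_items, items) are invented for
# illustration; they are not the real API's fields.
from typing import Any


def normalize_order(payload: dict[str, Any]) -> dict[str, Any]:
    """Map either the old (v1) or the new (v2) response shape onto one internal shape."""
    if "line_items" in payload:
        # Assume only the new format carries "line_items" and a nested customer object.
        return {
            "customer": payload["customer"]["name"],
            "items": payload["line_items"],
        }
    # Otherwise fall back to the old, flat format.
    return {
        "customer": payload["customer_name"],
        "items": payload.get("items", []),
    }
```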
The agent mode with o3-mini in Cursor is working? Just yesterday it was giving worse code than 4o or medium-sized local models… O3-mini not agentic? - #23 by Kirai
Watching progress here with interest. o3-mini is an amazing model, and I've seen it sing on chatgpt.com. I thought maybe the model just didn't support tools well, but having integrated it into my own systems (which use 20+ tools), I now know that the model supports tools brilliantly. I am assuming that the Cursor team will iron out the problems with it, and it has the potential to become the best model for coding.
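To be concrete about what "supports tools" means here: it's just the standard function-calling interface. A minimal sketch of the kind of wiring I mean, using the OpenAI Python SDK (the read_file tool is a made-up example, not one of my actual 20+ tools):

```python
# Minimal sketch: exposing one tool to o3-mini via the OpenAI Python SDK.
# The read_file tool is purely illustrative.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "Open src/app.py and summarize it."}],
    tools=tools,
)

# The model either answers directly or requests a tool call:
print(response.choices[0].message.tool_calls)
```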
Currently, after a few messages, this model does not want to perform file-change actions. It has to be asked several times to do so. Other than that, it does a good job. I am waiting for a fix from the Cursor team.
I am seeing the same. Sometimes it says “I’ll now apply the diff” and then does nothing. I’m also getting errors where the code snippet never gets formed into an edit and is instead printed directly in the chat output.
Yeah, it’s strange; like anything else, I’m finding it’s better at certain things than others. For instance, while it nailed this particular refactoring problem, in a completely different scenario o3-mini bizarrely missed that April 28, 2025 is a Monday and thought it was a Tuesday.
As of now, it is not usable; it behaves unpredictably, does not apply code changes, and lacks consistency. This is very weird, because it works quite well in chat.
I’ve found it sometimes has the same bug in ChatGPT when performing Deep Research. If you attempt multiple Deep Research runs within a chat, it writes out what it intends to do, says it is going to do it, and then does nothing.
With all due respect, o3-mini is not some “insane” model. These are the results from the Aider leaderboard (where models solve hard exercises from https://exercism.org/), and o3-mini is not some “super-model”:
Yeah, well, my experience with o3-mini in Cursor was weird: the response couldn’t handle proper formatting of an .Rmd, and the font sizes were all wonky…
I think the issue is switching models mid-context and then feeding the new bot (claude → o3-mini → r1); lemme know if anyone else has tested this.
Basically I was in a scratchpad and I thought, “huh, I read that o3 was better at this next task I want, so lemme just change model.” I referenced the prompt and context file I was asking about, and it gave me a Fn weird response.
Then I attempted the same thing from that o3 point, switching to R1, and R1 bonkers-refactored a bunch of ■■■■.
I’ll have to try again and see which happens first:
Context-lost loop, bonky refactors+++, or fully deleting all the ■■■■ in the project folder while deep into a project context session…
===
(EDIT/UPDATE: I still lost ALL the files, but SpecStory did save all my Composer history for the project. I had only just started the scratchpad, so it wasn’t that long, and I started from scratch… I am going to attempt to feed the entire Composer spec to an agent at once and see how it behaves. I’ll also feed it to my local R1 1.5 and see what weirdness I can pull from that.
I want to be able to have my composer agent talk to my R1 as well
I wonder if I can get my agent to be the product manager, R1 the dev, and me the Customer)
OK so, I just did a much more in-depth comparison between o3-mini, deepseek-r1, and claude-3.5-sonnet with a different project, in this case an existing Python urwid application. I tried to be as scientific with this as possible: I created 3 separate branches in a repo with the exact same codebase and the exact same prompts for each (same initial conditions). Before each run of the prompt, I reset composer so they all had the exact same starting point and initial context.
These are the results (listed in the order I ran the tests):
o3-mini: Executed all the listed specs in one shot without any further interaction. Technically, of the 3, it produced the ugliest chart (necessitating iteration), but functionally it is correct. The calculation was accurate. No other functional regressions noted.
deepseek-r1: After the initial prompt, the application failed to load entirely. After a 2nd prompt detailing the error message ValueError: time data 'out' does not match format '%Y-%m-%d', deepseek-r1 fixed it and the application loaded (a minimal repro of that error is sketched after these results). The calculation was accurate. The chart it produced looked better visually than o3-mini’s, but it formatted it very strangely and randomly added the word “left” with no spacing between words. Very bizarre. Regression: it inexplicably removed the existing summary card entirely, which I didn’t ask for.
claude-3.5-sonnet: Executed in one shot (the application loaded immediately). Visually, the chart looked the best of the three and maintained the color scheme I had in the application, which all the other models completely ignored. One tiny thing: the client name was technically not listed inline with the progress bar (which the other models got right), but overall it looks so much better than the others that I consider this acceptable and fine to iterate on further. The calculation isn’t right (I think due to a rounding error that the other models accounted for). No other functional regressions noted. The app still works.
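As promised above, here is a minimal repro of the ValueError that deepseek-r1 introduced, plus one defensive way to handle it (illustrative only; this is not the actual fix r1 applied):

```python
# Minimal repro of: ValueError: time data 'out' does not match format '%Y-%m-%d'
from datetime import datetime

# datetime.strptime("out", "%Y-%m-%d")  # uncommenting this raises the ValueError above


def parse_date(value: str) -> datetime | None:
    """Parse 'YYYY-MM-DD' strings, returning None for junk like 'out' instead of raising."""
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return None


print(parse_date("2025-04-28"))  # 2025-04-28 00:00:00
print(parse_date("out"))         # None
```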
I actually anticipated DeepSeek doing better than Claude did; I’m just laying out exactly what happened.
As I stated, I am going to use my SpecStory history for the lost project, but I’ll try your workflow while trying to incorporate:
A huge sprinkle of this, as a first-glance test of this structure.
–
However, I am going to feed it a Composer context history instead of a file and see how it works. But I have been lax in keeping up with /rules (because I, like most, have already spent many a token on rules)…