Is o3-Pro good as Agent?

Has anyone tried writing a lot of code through it?

What are the results, the impressions, and the size of the hole in your pocket?

I got some useful answers in a similar post about Opus, but not one here yet. :thinking:

I don’t think I’m rich enough for it :skull_and_crossbones:

o3 failed a lot with my code, and I mean a lot.

o3 or o3-Pro?

Both.

I honestly don’t know what the rage for o3 is about; for complex programming tasks it performs horribly.

I must agree: o3 is not usable based on my experience, and I never had any luck with it. It takes a long time before it starts making any changes, and they are nearly always completely incorrect. Also, since the pricing changed, each o3 prompt costs me around $0.70, so I don’t even want to see the bill for o3-Pro.

o3 didn’t work well when I tried it out :((

It’s pretty bad for me; I can say it’s completely useless. Also, the regular version is better than the pro one, which is strange.

Not sure how representative this is, but here’s a case:
https://github.com/Artemonim/AgentDocstrings/pull/14

  • The first commit was an attempt to modify the code using Claude 4 Thinking. Claude ended up changing some critical functionality, and attempts to modify those changes with Auto or o4-mini only introduced new errors.
  • The second commit, I believe, was my git reset --soft. I then tried to resolve the issues using Gemini 2.5 Pro, but it failed to fix them.
  • The third commit was a fix implemented with o3-Pro (see screenshot above), which resolved 8 out of 13 failing tests. I then refined the solution using Claude 4 Thinking. Gemini 2.5 Pro failed at this stage too, but I did use it post-factum to check that Claude hadn’t altered any core functionality.
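For anyone unfamiliar with the recovery trick mentioned in the second bullet: `git reset --soft` un-commits a bad change without discarding it, leaving everything staged so you can hand it to a different model or drop it selectively. A minimal sketch in a throwaway repo (file names here are made up for illustration):

```shell
set -e
dir=$(mktemp -d)               # throwaway repo so nothing real is touched
cd "$dir"
git init -q
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "good baseline"
echo "bad change" > file.py    # simulate the agent's unwanted edit
git add file.py
git -c user.email=a@b -c user.name=t commit -q -m "bad AI commit"
git reset --soft HEAD~1        # undo the commit; file.py stays staged
git rev-list --count HEAD      # back to one commit
git diff --cached --name-only  # the bad change is still in the index
```

Unlike `git reset --hard`, this keeps the work available for inspection, which is what makes it useful when a model’s commit is partly salvageable.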
o3 is good for planning, inspecting bugs, and creating docs (good tool calls), but execution is meh. I use it to write the to-dos and Sonnet 4 to implement them.

Well, even 2.5 Flash is good at that, and it’s free…

My take: o3 is solid for really complex problems when other agents get stuck - it’s great at identifying root causes and giving detailed analysis. But man, it’s expensive as hell.

I’ve found a better workflow using Opus for debugging and Sonnet 4 for implementing, but honestly it depends on your experience and prompting skills. I’m still learning a lot of this stuff and not great at prompting yet, so o3 actually helps way more since it can work with my mediocre prompts better than smaller agents.

The one place o3 really shines is app architecture reviews - minimal prompting needed for solid feedback and reasoning. For building small/medium apps I rarely use o3 unless I wanna do something fancy or need to compare and contrast approaches.

When I do need architecture brainstorming/feedback, I prefer using it in ChatGPT over Cursor. In Cursor it drains my usage fast and the refresh wait is brutal, but in ChatGPT I can manage my usage better and know exactly how much I’m spending.

Haha okay confession time - this was totally a joke! I saw you ask the exact same question about Opus earlier so I decided to be cheeky and just swap my entire response. Literally did find-and-replace from ‘Opus’ to ‘o3’ and ‘Claude’ to ‘ChatGPT’ :joy: Couldn’t resist when I noticed the identical phrasing.

Realistically speaking, my experience with o3 is that it’s great at troubleshooting or finding errors that Sonnet 4 or GPT 4.1 spend forever on or just can’t solve. Of course understanding structure, tools, and prompting matters - again comes down to your skills and prompt engineering.

But I only use it for finding complex logic errors and fixing those specifically. Outside of that, I don’t find it more useful for regular tasks - I prefer Sonnet 4 or GPT 4.1 in those cases. In my experience with continuous tasks, o3 fails more often or can’t stick to what I’m asking, although the recent updates with To-dos made it better at staying on topic.

An even better workflow is using o3 to find all the errors, identify and analyze the root problems, and build a plan for fixing them, then letting regular agents implement it - basically not wasting too many credits. With proper prompts I’ve managed to use Auto sometimes to complete fixes based on o3’s feedback on the issue.

Still, for continuous work or regular daily use I stick with Sonnet 4 or GPT 4.1, and only pull out o3 when I’m stuck or need feedback on logical errors that normal agents can’t handle.

Regarding o3-Pro, I haven’t found any use cases where I need it in my experience. Regular o3 was more than enough and o3-Pro pretty much gave similar results but way more costly. But I also don’t develop complex apps.

Apart from this being an LLM thread, o3 is good for planning, and better for revising plans from Gemini Pro / Sonnet 4 Thinking. I’ll give one-pass planning a go.

Compared to what models? And for what programming tasks? I think o3 does best for planning and backend, whereas the Sonic models do way better for frontend.