Sonnet 3.5 vs o3-mini

o3-mini looks promising on the benchmarks, but is it good in practice?

To those who have tested both: which is better in your opinion, o3-mini, or is Sonnet 3.5 still the way to go?


lol, reading the forum gives mixed vibes. I would also be interested in specific prompts where o3-mini is preferred, and why, versus Claude.

Some say o3-mini is sooo good, while others struggle with it completely.
As far as I understand, o3-mini isn't yet optimally integrated with the Composer agent, but the Cursor team should update us when they make changes.

I think it depends on what you ask. My guess is that it's a smaller model, so if you code in a less popular language, it performs worse. If you are coding a snake game in Python, it's marvellous. Hence the all-over-the-place comments.

For me it hallucinates more because it knows less about Elixir than Sonnet 3.5. Sonnet or even 4o is better for me; R1 if you need some thinking, o3-mini if you need some fast thinking.


Have you tried adding the Elixir docs to @Docs in settings? That should improve its knowledge.

Thanks for the feedback about R1 and o3-mini.

It can, but in the end Sonnet 3.5 already performs very well, and it's annoying to call @Docs on each request.


o3-mini is very fast and responds well to simple requests, but for complicated things Claude is still better at contextualizing and building…


Why not both? They are different tools for different jobs.

Claude 3.5 Sonnet: Chat
o3-mini: Thinking

For chat you want low latency, but sometimes you have a problem that is worth spending the extra time getting the model to think about it.

Claude: Gather requirements, context and build up a plan (chat)
o3-mini: Review the plan, then go write this complicated code or fix this non-trivial issue.
Claude: Make these minor adjustments to the code (chat).

Wouldn't it be more complicated to use two different models in sequence in the same chat? I don't run two models in one chat because I assume it would be. Have you done something like this before?

Just change the model in the dropdown; it takes a second.

If I want to do something really simple I move from Sonnet to Haiku.

For example, committing and pushing the code: I don't need to pay 4 cents to do that when Haiku can handle it for 1 cent.

I understand that, brother; my question is this:

When you switch between different models in the same chat window, isn't the model more likely to get it wrong? You started the chat window with Sonnet, but when you choose Haiku instead of Sonnet to solve a simple problem, isn't Haiku more likely to get it wrong? Or, after Haiku, isn't Sonnet more likely to get it wrong? Whichever model you started a chat window with has more of the context history.

Sonnet seems more talkative in agent mode, which I actually like.


My understanding, and my experience, is that every request is independent: the context is stored locally and sent to the server with every request.

Effectively, when you make a second request it sends up the current context/request plus the past chat history.

It's the same whether you do Sonnet → Sonnet or Sonnet → Haiku: each request is fully independent.

In other words, it's serverless: no state is stored on the server except caches.

There would be a performance and cost impact (for Cursor) when switching from Sonnet → o3-mini, but I have not noticed an issue.
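The stateless pattern described above can be sketched in a few lines. This is a hypothetical illustration, not Cursor's actual implementation: `build_request` and the model names are assumptions, and the point is only that the client resends the full message history with each request, so swapping the model between turns just changes one field.

```python
def build_request(model, history, user_message):
    """Assemble one self-contained request from locally stored state.

    Because the server keeps no conversation state, every request must
    carry the entire prior history plus the new user message.
    """
    messages = history + [{"role": "user", "content": user_message}]
    return {"model": model, "messages": messages}

history = []

# Turn 1: plan with Sonnet.
req1 = build_request("claude-3.5-sonnet", history, "Plan the refactor.")

# The client appends the assistant's reply to its local history.
history = req1["messages"] + [{"role": "assistant", "content": "Plan: ..."}]

# Turn 2: switch models. The full history travels with the request,
# so the second model sees everything the first one saw.
req2 = build_request("o3-mini", history, "Implement step 1 of the plan.")
```

Since each request is self-contained, the second model loses nothing by being swapped in mid-conversation, which is why Sonnet → Haiku works as well as Sonnet → Sonnet.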


Sonnet talks a lot but makes far more mistakes and is atrocious at following basic commands. That said, o3-mini is buggy, and you have to tell it several times to do one thing. It will say "doing that now" and then nothing happens. However, it does things with more "thought".

The output token limit of o3-mini is much larger than 3.5 Sonnet's, so I've noticed I'm less likely to get frustrated by it omitting context from its responses.

Based on the Aider leaderboard, o3-mini wins against Claude 3.5 Sonnet.

People on the internet tend to like Sonnet better. On the other hand, I do primarily scientific computing and find that o3-mini does better not just at "architecture"-type tasks but even at single-line code requests.

Your findings align with benchmark reports, especially concerning math and instruction following. Sonnet has the best "tooling" support, so these "reasoning" models require a stronger effort from the Cursor team; that's why R1 still doesn't have an agent mode and why o1 only became usable in the last few days.
Benchmarks:
LiveBench
OpenLM