o3-mini looks promising on the benchmarks, but is it good in practice?
To those who have tested both: which is better in your opinion, o3-mini, or is Sonnet 3.5 still the way to go?
lol, reading the forum gives mixed vibes. I'd also be interested in specific prompts where o3-mini is preferred over Claude, and why.
Some say o3-mini is sooo good, while others struggle with it completely.
As far as I understand, o3-mini isn't yet optimally integrated with the composer agent, but the Cursor team should update us when they've made changes.
I think it depends on what you ask. My guess is that it's a smaller model, so if you code in a less popular language, it performs worse. If you're coding a snake game in Python, it's marvellous. Hence the all-over-the-place comments.
For me it hallucinates more because it knows less Elixir than Sonnet 3.5. Sonnet or even 4o is better for me. R1 if you need some thinking; o3-mini if you need some fast thinking.
Have you tried adding the Elixir docs to @Docs in settings? That should improve its knowledge.
Thanks for the feedback about R1 and o3-mini.
It can, but in the end Sonnet 3.5 already performs very well, and it's annoying to call the docs on each request.
o3-mini is very fast and responds well to simple requests, but for complicated things Claude is still better at contextualizing and building…
Why not both, they are different tools for different jobs.
Claude 3.5 Sonnet: Chat
o3-mini: Thinking
For chat you want low latency, but sometimes you have a problem that is worth spending the extra time getting the model to think about it.
Claude: Gather requirements, context and build up a plan (chat)
o3-mini: Review plan then go write this complicated code or fix this non trivial issue.
Claude: Make these minor adjustments to the code (chat).
Wouldn't it be more complicated to use two different models in sequence within the same chat? I don't run two models in one chat because I assumed that would be a problem. Have you done something like this before?
Just change the model in the dropdown; it takes a second.
If I want to do something really simple I move from Sonnet to Haiku.
For example committing and pushing the code, I don't need to pay 4 cents to do that, Haiku can handle that for 1 cent.
I understand that, brother; my question is this:
When you switch between different models in the same chat window, isn't the model more likely to get it wrong? You started the chat window with Sonnet, but when you pick Haiku to solve a simple problem, isn't Haiku more likely to get it wrong? Or after Haiku, isn't Sonnet more likely to get it wrong? After all, whichever model you started with in a chat window has more of the context history.
Sonnet seems more talkative in the agent mode which I actually like.
My understanding, and my experience, is that every request is independent: the context is stored locally and sent to the server with every request.
Effectively when you make a second request it sends up the current context/request and the past chat history.
It's the same whether you do Sonnet → Sonnet or Sonnet → Haiku; each request is fully independent.
In other words it's serverless, so no state is stored on the server except caches.
There would be a performance and cost hit (for Cursor) switching from Sonnet → o3-mini, but I have not noticed an issue.
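To make the stateless model concrete, here is a minimal sketch (a hypothetical request shape for illustration, not Cursor's actual protocol): the client keeps the whole history and resends it with every call, so the model name can change freely between turns.

```python
# Hypothetical stateless chat client: the server remembers nothing,
# so each request carries the full conversation plus the new turn.

def build_request(model, history, new_message):
    """Assemble one self-contained request payload."""
    return {
        "model": model,  # can differ on every request
        "messages": history + [{"role": "user", "content": new_message}],
    }

history = []

# First turn goes to Sonnet.
req1 = build_request("claude-3.5-sonnet", history, "Plan the refactor.")
history = req1["messages"] + [
    {"role": "assistant", "content": "Here is a plan..."}
]

# Second turn switches models; o3-mini still sees the whole conversation
# because the client sent it along, not because the server remembered it.
req2 = build_request("o3-mini", history, "Now implement step 1.")

assert req2["model"] == "o3-mini"
assert len(req2["messages"]) == 3  # prior turns travel with the request
```

This is why switching Sonnet → Haiku mid-chat doesn't lose context: the later model receives exactly the same history the earlier one would have.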
Sonnet talks a lot but makes far more mistakes and is atrocious at following basic commands. That said, o3 is buggy and you have to tell it several times to do one thing. It will say "doing that now" and then nothing happens. However, it does things with more "thought".
The output token limit of o3-mini is way bigger than 3.5 Sonnet's, so I've noticed I'm less likely to get frustrated by context being omitted from its responses.
Based on the Aider leaderboard, o3-mini wins against Claude 3.5 Sonnet.
People on the internet tend to like Sonnet better. On the other hand, I do primarily scientific computing and find that o3 does better not just at "architecture"-type tasks but even at single-line code requests.
Your findings align with the benchmark reports, especially concerning math and instruction following. Sonnet has the best "tooling" support, so these "reasoning" models require a stronger effort from the Cursor team; that's why R1 still doesn't have an agent mode and why o1 has only become usable in the last few days.
Benchmarks:
LiveBench
OpenLM