The performance was unmatched in my entire experience with Vibe Coding.
It refactored a complicated piece of code, modularized it, raised the abstraction level, and added 44 tests—all of which passed.
It took about 3 hours of work. I didn’t check the app in the browser even once, I promise.
Today, when I opened the app, it was working flawlessly.
The only downside is the cost: 178 o3 requests × 30 cents per request = $53.40.
If I wanted to use this every day, I’d end up with a bill of almost $2,000 a month, which means I’d have to raise my fees. It shows that quality and speed come at a price.
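As a quick sanity check on that math (the per-request price and request count are from the comment; the one-session-per-day monthly extrapolation is my own assumption):

```python
# Back-of-envelope cost check for the session described above.
REQUESTS = 178
PRICE_PER_REQUEST = 0.30  # dollars per o3 request

session_cost = REQUESTS * PRICE_PER_REQUEST
monthly_cost = session_cost * 30  # assumption: one such session per day

print(f"Session: ${session_cost:.2f}")   # Session: $53.40
print(f"Monthly: ${monthly_cost:.2f}")   # Monthly: $1602.00
```

At exactly one session like this per day it comes out around $1,600; a few heavier days push it toward the $2,000 figure.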
Final note: I supervised its every move, every single step of the way, and didn’t let it just code away. I forced it to adapt to my pace and follow my requirements. It wasn’t always willing to adhere to the rules in .cursor/rules.
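For anyone unfamiliar: Cursor reads rules from files under `.cursor/rules/`. A minimal sketch of the kind of pacing rule being described (the contents here are invented for illustration, not the commenter’s actual rules):

```markdown
---
description: Pace and supervision rules for agent sessions
alwaysApply: true
---

- Propose a plan and wait for my approval before editing any file.
- Touch one module per step; no sweeping multi-file refactors.
- Run the test suite after every change and report the results.
```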
My experience hasn’t been so great, frankly! It’s hit or miss.
It did build and refactor a whole new feature, but it also struggles with things that Sonnet 3.7 does better.
How did you make it so autonomous? When I use Gemini 2.5 Pro, it makes a lot of assumptions, sometimes it just tells me what’s going on without carrying out the task, it can get sidetracked, and it can also ignore existing files and recreate solutions instead of looking them up… maybe I’m prompting it wrong.
Thanks for that video tip. The prompt injection problem is not solved; curious what the latest is. A bit bummed that they switched off the comments on that video on YouTube.
Yeah, well, the oldest and most persistent rule still applies:
Price, Speed, Quality
But you can only choose two
Joking aside, my experience with the new OpenAI models has been exactly the opposite.
I couldn’t get it to clean up the comments in a single file without intervening like 10 times. No exaggeration.
However, Gemini 2.5 Flash is crushing it.
It did a similarly complex task to the one you mentioned, in one sitting.
For me, o3 is extremely slow (it does a lot of code reading and tool calling all around before it starts coding, and that takes a lot of time), but the final outputs are pretty good. The price is not nice either: the other models are not so much worse for me that it’s worth paying that and waiting on top of it, I guess… At least for now.
Yes, agreed: Gemini very often does not do the task, it just tells me what should be done. I have to ask it to actually do the task, but the outputs are very good in some areas.
I was prompting the tests, over 50 of them for one script, and I realized that automated tests are much more trustworthy than manual testing. I mean, when vibe coding, one needs to adopt TDD.
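A minimal sketch of that test-first workflow. The `slugify` function and its behavior are invented for illustration, not taken from the commenter’s script:

```python
# TDD sketch: the tests are written first and act as the contract;
# in a vibe-coding session you would commit them and ask the model
# to write code until they pass.

def slugify(title: str) -> str:
    """Lowercase a title and join its words with hyphens."""
    return "-".join(title.lower().split())

def test_slugify_basic():
    assert slugify("Hello World") == "hello-world"

def test_slugify_extra_spaces():
    assert slugify("  Vibe   Coding  ") == "vibe-coding"

if __name__ == "__main__":
    test_slugify_basic()
    test_slugify_extra_spaces()
    print("all tests passed")
```

The point of test-first here is that the assertions survive the model’s rewrites: you rerun them after every agent step instead of eyeballing the app in the browser.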
Hmm, if you had it on Auto mode and it cost $53, it would still be quite pricey.
When I’m doing similarly complex stuff with Claude 3.5 in auto mode, with good planning steps, implementation steps, clear rules, etc., it’s a tenth of the price, with the occasional step-in to give more info or adjust direction.
Assuming the old code is not removed, any reasonably capable model could create a new module and transfer features over. The complexity of doing that depends on the programming language, the framework, and naturally the actual complexity of the original code.
It does somewhat feel like there’s no need to use o3 then. Or could you explain what, for example, Claude 3.5 or similar-level models can’t do that o3 achieved, and why? Sincerely curious about the difference in your usage.