Any chance we will see this in Cursor?
Cool, I actually had a great experience with grok-code-fast, much better than I first thought I would, since in initial testing it didn’t compare well to GPT-5 and Claude on coding intelligence. But it’s quite capable with smaller tasks, especially if you give it good instructions; not much worse than GPT-5. Except it costs 5-10x less and it’s lightning fast, almost too fast, like it should be making more mistakes given how quickly it grasps a concept and executes it. It depends, of course, on the kind of tasks and scope you give it.
What has your experience been? Any tips on how to best use it?
Edit:
Hmm, has anyone tried this? Is this maybe it?
I have had good experiences mostly.
If I am being wholly honest, I think Claude Sonnet and Opus are the best coding models around. At least with Cursor; I can’t say whether that holds in other agentic IDEs. With Cursor, the two companies worked closely to produce a tighter integration, something that is very clearly lacking with other models, including GPT-5 and Grok Code.
In fact, I just learned today that Grok Code explicitly has no support for web searching, which I find rather disappointing. I figured that was the case, as whenever I used @Web or pasted a link, it never SEEMED like it was actually doing a web search. It turns out Musk explicitly designed Grok Code without web search capabilities, and honestly I’m BAFFLED as to why, because it’s a fundamental part of the job. Additionally, it does not appear that Cursor has integrated @Docs support into Grok Code’s integration, and that is another area where Sonnet is superior.
If I were to compare the top four models I’ve used (Sonnet, Gemini, GPT-5, Grok Code Fast), I would rank them this way:
- Sonnet
- Grok Code
- GPT-5
- Gemini
Gemini is a hateful beast! Every time I’ve used it, it has run roughshod over my codebase and bulldozed massive swaths of it. Both Gemini and GPT-5 seem to excessively “pack” the thinking cycles as much as they can, and honestly it feels intentionally designed to STACK and PACK the output token count (which is what they get paid on), since the bigger the output, the higher the cost. I don’t like either, as their long thinking cycles (Gemini usually averages around 30s or so; GPT-5 ranges from 20-90 (!!!) seconds per thinking cycle!) waste a TON of time that I think Claude and Grok Code have proven is NOT required.

Sonnet tends to think for an average of around 5-7 seconds, and Grok Code AVERAGES a bit over 1 second! Grok Code tends to do a bit more thinking than Sonnet, but at the same time it has ZERO inter-tool-call chattiness; it really only produces human-targeted output in its summaries at the end of a prompt session, whereas Sonnet both thinks and chats. I would take Sonnet over the ridiculously insane thinking cycles of Gemini and GPT-5 any day. Its results are generally far superior, it is more consistent, cleverer, smarter, and while it is slow…if it’s GOING to be slow, I’d rather use Sonnet than the other two.
With Grok Code, it’s above GPT-5 and Gemini for me. It’s not quite as good as Sonnet, but it’s not quite that cut and dried either. Sometimes it’s actually superior, sometimes it’s a fair bit worse; however, I think I’ve identified reasons why it can be worse. The main thing I LOVE about Grok Code is its speed and cost. It’s a BUCK FIFTY per MTok of output!! That is WAY cheaper than the rest…it’s like $10/MTok for GPT-5 IIRC, more than that for Gemini, and something like $24/MTok for Sonnet!! (IIRC, it’s a whopping $75/MTok for Opus!!!) So, even though it’s generally not as SOLIDLY or as consistently good as Sonnet…Grok Code is very good, BLAZING FAST, and LUDICROUSLY CHEAP! It’s hard to beat that combination, IMO.
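Just to make that math concrete, here’s a rough back-of-the-napkin sketch using the output prices I quoted above (all from memory, so treat every number as an assumption and check the providers’ pricing pages before relying on them):

```python
# Rough per-task cost comparison using the output prices quoted above.
# Every price here is from memory (IIRC), NOT an official number.
output_price_per_mtok = {
    "grok-code-fast": 1.50,  # the "buck fifty" per MTok of output
    "gpt-5": 10.00,          # IIRC
    "sonnet": 24.00,         # IIRC
    "opus": 75.00,           # IIRC
}

task_output_mtok = 0.05  # assume a task emits ~50k output tokens

base = output_price_per_mtok["grok-code-fast"]
for model, price in output_price_per_mtok.items():
    cost = price * task_output_mtok
    print(f"{model:15s} ${cost:5.2f} per task  ({price / base:4.1f}x Grok Code)")
```

With those (remembered) numbers, Sonnet comes out at 16x Grok Code per output token, which is roughly where the 15-20x figure I use below comes from.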
Grok Code does a very good job on small, medium, or properly “corralled” tasks. It can be hit or miss for big tasks; however, that is honestly not much different from most models. I have noticed times when Grok Code just seems to go off the rails, and the first couple times it did that, I wasn’t really aware of why…it just suddenly seemed to go from GREAT to WTF! overnight (literally: great the evening before, wacko the next morning). Most of the time it does a fine job and writes good code, and it understands problems deeply without wasting monstrous amounts of time “thinking” about them (studies in fact show that past a point, more reasoning effort actually diminishes the quality of the results, which is another reason I loathe GPT-5 and Gemini…the science shows that long thinking cycles actually produce worse results, and yet they are PACKING the “thinking” in as much as they can anyway…it seriously SMELLS!)
What I think I’ve discovered more recently is that as context grows, Grok Code really does have a harder and harder time dealing with it, and as such, you have to task it more explicitly, more specifically, and on small to medium-smallish tasks. With the more recent versions of Cursor, it “summarizes” the chats. I have HAD to keep some chats around recently, literally for DAYS. The summarization, in fact, ALLOWED me to, and it ended up being very critical. I tried to start new chats, then supply context as needed…the problem was, there was such a DEEP understanding of the problems I was building software for in this one particular chat that there was no way I could rebuild it all in other chats. The original chat had all the input context I supplied, PLUS all of its own analysis, PLUS its implementation efforts, testing efforts, reanalysis, refinement planning, refinement efforts, etc. There was just TOO MUCH critical knowledge in that one chat; I couldn’t start over, even though I tried. So this one single chat literally spanned at least four days of work, and I actually still have it and go back to it for work specific to a few particular pieces of code that depend on having that context.
Thing is, the summarization of this really LONG chat has become extreme. I have noticed this on other chats now that go beyond a couple cycles or so of summarization: Grok Code does seem to start to “lose it” the more the chat is summarized. A chat that lasts a day will, by the afternoon, get a bit wonky, and the model has a harder time understanding and following instructions. So I end up having to be more and more explicit, which uses up more context, and I start seeing summarization more and more often. Which just seems to exacerbate the issue, at least for a while.
What I have also found is that when this “wonkiness” starts occurring, I can sort of re-task the model on something a bit different but still related, or have it approach problems from a different angle…and it suddenly seems to do fine again. It is still important that I be explicit, clear, and think about and note edge cases (I’ve learned that if I’m not explicit about certain things, the model will often misinterpret or assume poorly, and do things I don’t want).
NOW, to be perfectly honest, I don’t know that Sonnet would NOT have this problem… My guess is it very likely would. I think the difference with Grok Code is that it is SO FREAKIN FAST. You can do WAY more work in a given amount of time with Grok Code than you can with Sonnet. Not only does Sonnet cost something like 15-20x more, it’s maybe 1/5th the speed. So what I do in a day with Grok Code would very likely take me more like several days to a full week, just given Sonnet’s very slow output rate. If I actually spent five solid days using the same Sonnet chat, I suspect it would encounter many of the same problems.
So I can’t honestly say that this is really a Grok Code problem…I think it’s a context and excessive-summarization problem. You can only pack so much into a 256k-token context window, and if you summarize over and over and over, there is effectively a COMPRESSION of your chat going on, which NECESSARILY means a loss of information, and of information specificity, in this context. I believe ALL the models would have issues if they were actually able to reach the same point. It just happens so much faster with Grok Code, because IT is so much faster, IMHO.
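A toy way to see why repeated summarization compounds: if each summarization pass retains, say, 80% of the distinct facts in the chat, the surviving fraction decays geometrically. The 80% retention rate is a number I made up purely for illustration:

```python
# Toy model of repeated chat summarization as lossy compression.
# retention is the fraction of distinct facts that survive each
# summarization pass; 0.8 is invented purely for illustration.
retention = 0.8
facts_surviving = 1.0
for pass_num in range(1, 7):
    facts_surviving *= retention
    print(f"after pass {pass_num}: {facts_surviving:.0%} of original facts remain")
```

After six passes you’re down to roughly a quarter of the original detail, and a faster model simply hits the context ceiling (and thus each summarization pass) sooner.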
So overall, I think Grok Code is one of the best models, second only to Sonnet. However, it’s not quite so cut and dried. Both models have a good understanding of code, software, software design, architecture, software principles, patterns and practices, libraries, code styles, code idioms, etc. I use both to plan; however, I lean toward Sonnet whenever the planning is going to REQUIRE web search, @Docs, image analysis, etc., because Sonnet CAN do those things…while Grok Code, at least currently, explicitly does not support web search, is not tied into the Cursor @Docs feature, and even though it supports image analysis, Cursor does not seem to have that tied in either. I would probably use Grok Code for the majority of things, because of its cost and speed, if it were not for these shortcomings. Sonnet, as much as I like it, actually costs significantly more than just its high per-MTok output price; there are additional time costs there…I spend more time waiting on Sonnet, and I am not cheap either, so the overall cost is honestly well beyond just the 15-20x output MTok cost. I would love to see Anthropic get some investment money and put it into expanding their data centers and hosting on higher-performing hardware, as their model is truly phenomenal; it’s just excruciatingly slow.
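One way to frame that “cost beyond the MTok price” point: fold your own waiting time into the bill. A crude sketch, where the hourly rate, wait times, and token counts are all invented assumptions, not measurements:

```python
# Crude total-cost sketch: model tokens + developer time spent waiting.
# Every number below is an assumption for illustration only.
dev_rate_per_hour = 100.0   # assumed cost of the developer's time
task_output_mtok = 0.05     # assume ~50k output tokens per task

models = {
    # (output $/MTok, minutes of wall-clock waiting per task)
    "grok-code-fast": (1.50, 2),
    "sonnet": (24.00, 10),  # ~1/5th the speed, per my experience
}

for name, (price, wait_min) in models.items():
    token_cost = price * task_output_mtok
    wait_cost = dev_rate_per_hour * wait_min / 60
    total = token_cost + wait_cost
    print(f"{name:15s} tokens ${token_cost:5.2f} + waiting ${wait_cost:6.2f} = ${total:6.2f}/task")
```

Under those assumptions, the waiting time dominates the token cost for both models, which is exactly why Sonnet’s slowness costs far more than its price sheet suggests.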
I’m surprised you put GPT-5 that much lower than Grok, though. For me, GPT-5 handles somewhat bigger tasks, with more room for interpretation, better. It requires less specification. For example, if there is an existing system and I want it to implement another child of that system, and I supply a rules file on how to do that, but the rules file doesn’t cover weird special cases, GPT-5 will correctly interpret the requirements/wiring for those special cases, while Grok just won’t, most of the time. So for Grok I’d have to give it way clearer instructions, which take time to conceptualize/write, and do more back-and-forths. However, GPT-5 will take like 10x longer to do it, even 30-50x longer for the high-reasoning version, so in the end it can come out about even, except one costs 10x less of course. Depends what other tasks I can be doing in the meanwhile lol.
I hope the new Grok Code (Code Supernova?) will be a great update, especially once adjusted for Cursor, to really make it competitive alongside Claude.
Same here, GPT-5 seems to be more reliable than Grok, at least for now.
I have had wild variation in how well GPT-5 does. I don’t find that any model does well on large tasks. The research I have found indicates that long reasoning cycles actually limit and degrade the quality of the output, and GPT-5 has the longest reasoning cycles by far. That, at the very least, means it takes significantly longer to get things done.
Now, maybe it might do a little bit better when it finally does something. Thing is, even though Grok might not be perfect the first go-around, it’s so fast that I MYSELF can iterate on any issues and get them resolved before GPT-5 would have even completed in the first place… And then some!
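That “iterate before GPT-5 even finishes” claim is really just arithmetic. With made-up timings (a couple minutes per Grok Code round trip vs. one long GPT-5 pass):

```python
# If a fast model needs several correction rounds and a slow model
# needs only one, the fast model can still finish first.
# All timings below are invented for illustration.
fast_pass_min = 2    # assumed minutes per Grok Code attempt
slow_pass_min = 15   # assumed minutes per GPT-5 attempt
fast_rounds = 4      # attempts needed, counting my fix-up prompts

print(f"fast model, {fast_rounds} rounds: {fast_pass_min * fast_rounds} min")
print(f"slow model, 1 round: {slow_pass_min} min")
```

Even giving the fast model four attempts to the slow model’s one, it finishes first under these assumptions, and cheaper too.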
There are times when Grok Code just can’t seem to resolve an issue. I then have to step in and guide it more. GPT-5 can get stuck the same way, though GPT-5 is a bit different…Grok Code will try, and try, and try, and sometimes it will break out of an iterative cycle on its own and resolve the issue. It burns a few more tokens, but they are CHEAP! GPT-5, though…has this “I can’t fix it! I’m gonna just bail!” approach, where it might spend 5, 8, 10 minutes just thinking, DO ABSOLUTELY NOTHING, and then just bail and say “I’m too stupid to figure this out. Whaa.” and leave you hanging, but you paid for a ton of tokens…
That latter bit I ran into all too often, and it is one of the main reasons I bailed on GPT-5. I did not use GPT-5 high much, as at the time most of my problems were fairly normal coding problems. I have more recently delved into much more mathematically intense problems: AI, embedding, vector-math type problems. It might be that GPT-5 could do a better job there; I can’t say. Thing is, Grok Code does a fairly good job, and I’ve thrown some theories at it that it seemed to crunch through, understand well, and then implement accordingly, and it did so very quickly. The results were not perfect, and I did have to have it go through and verify its work and find potential issues, which it did, and then I had it resolve them. The sheer speed of Grok Code allows me to just keep on moving, while I would have still been waiting for GPT-5 to finish reasoning and spit out its first result…
Maybe GPT-5 would have nailed it on the first effort. But I hate sitting there, watching the dumb thing think…and think, and think, and think think think think think…and DO NOTHING, for minutes at a time… :\
I guess also…I am not generally one to expect awesome results from three hastily written lines of prompt. I think about my prompts, craft them, consider the necessary context attachments, review, then go. So maybe that is helping Grok Code do a better job, vs. if I just pounded out two or three half-sentences of prompt and let it rip.
Doesn’t MAX mode unlock the maximum context limit of each LLM, including GCF?