Be realistic with Grok-4

Yes, Grok-4 is powerful. But do not lose your mind over Elon’s hype machine.

For anyone who doesn’t know where to look for reliable information on AI models, AI Explained on YouTube is the only AI channel I actually trust to be decently objective.

He goes through the ups and downs. I won’t be using the model myself due to Elon’s character, but for the rest of you, I’m sure this will be useful for keeping your head on your shoulders.

Link: https://www.youtube.com/watch?v=dbgL00a7_xs

NOTE: I forgot what I was doing. Here is a summary of the video, provided by Gemini 2.5 Pro.

1. Potentially the Smartest AI (on some tests)
According to xAI’s own benchmarks, Grok 4 outperforms its rivals (like GPT-4o, Gemini 2.5 Pro, and Claude 4) on several key academic tests, including high-school math, coding, and the “Humanity’s Last Exam” (HLE) benchmark. Elon Musk claims it’s smarter than most graduate students across all disciplines simultaneously, but he later clarified this was specifically for academic questions.

2. Take Benchmarks with a Grain of Salt

  • Cherry-Picking: Like all AI companies, xAI selectively presents the benchmarks where it performs best. In some tests not shown in their main charts (like the LiveCodeBench coding test), other models like Google’s Gemini Deep Think actually beat Grok 4 Heavy.

  • Exaggerated Charts: Many of the charts don’t start their Y-axis at zero, making small performance differences look much larger than they are.

3. Strong on Abstract Reasoning (ARC-AGI)
Grok 4 shows impressive performance on the ARC-AGI-2 benchmark, a test for “fluid intelligence” or abstract reasoning. It nearly doubles the score of previous models, suggesting it’s genuinely good at identifying and applying latent patterns, which is a big deal.

4. Struggles with “Felt” Intelligence & Spatial Reasoning

  • In custom “SimpleBench” tests (designed to measure how smart a model feels), Grok 4 still falls for trick questions and struggles with complex spatial reasoning, similar to other models.

  • It can be very slow to respond, sometimes taking over 200 seconds for a single answer.

5. “Grok 4 Heavy” is an Ensemble Model
The top-performing “Grok 4 Heavy” isn’t a single model. It’s an ensemble of multiple Grok 4 agents that work on a problem independently, compare notes, and then decide on the best answer—like a digital study group.
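For intuition, here is a toy sketch of that “digital study group” pattern, reduced to simple majority voting. This is purely illustrative: xAI hasn’t published how Grok 4 Heavy’s agents actually coordinate (the video suggests they compare work rather than just vote), and the function and stand-in agents below are made up.

```python
from collections import Counter

def study_group_answer(agents, problem):
    # Each agent attempts the problem independently...
    answers = [agent(problem) for agent in agents]
    # ...then the group settles on the most common answer.
    best_answer, votes = Counter(answers).most_common(1)[0]
    return best_answer

# Hypothetical stand-in "agents" that each return an answer string
agents = [lambda p: "42", lambda p: "42", lambda p: "41"]
print(study_group_answer(agents, "some hard problem"))  # -> "42"
```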

6. The Price is STEEP

  • SuperGrok (Grok 4): $300/year

  • SuperGrok Heavy (Grok 4 Heavy): A whopping $3,000/year or $300/month.

  • For comparison, Gemini Advanced is around $20/month. xAI is promising more features like video generation for the heavy tier, but it’s a huge price difference for now.

7. API Pricing is Competitive
For developers, the Grok 4 API is priced similarly to Claude 4 Sonnet ($3 input / $15 output per 1M tokens), making it a viable, if not cheap, option for a frontier model.
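As a quick sanity check on what those rates mean per request, here is the arithmetic (the token counts are invented examples, not measurements):

```python
# Quoted Grok 4 API rates: $3 per 1M input tokens, $15 per 1M output tokens.
INPUT_PRICE = 3.00 / 1_000_000    # USD per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single API call at the rates above."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Example: a 10,000-token prompt producing a 2,000-token reply
print(f"${request_cost(10_000, 2_000):.4f}")  # -> $0.0600
```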

8. Safety & Alignment Concerns

  • Grok is designed to “not shy away from making claims which are politically incorrect,” which has led to some bizarre and problematic outputs, a trend also seen with Grok 3.

  • Musk’s very casual approach to AI safety (“I’d at least like to be alive to see it happen”) suggests a ‘move fast and break things’ attitude, which might not be ideal for AGI development.

8 Likes

TL;DR: Grok 4 is on par with Opus but hallucinates more often and is not as well optimized for coding.

Just to add:
the Heavy version is not available via the API
a coding model is coming later this year

Yeah, and the Heavy model is not affordable for most people.

Grok is the best model right now. If you’re not going to use it, you’re missing out. Sure, Elon’s character isn’t the best, but we should take advantage of the very powerful model he gave us.

1 Like

He isn’t giving it though, he’s selling it. The money goes to him, that’s all. I won’t argue with anyone here not to use it; that’s not worth my energy. But I figure it is worth my energy to temper the hype with actual realistic expectations. That’s all.

5 Likes


I agree it’s not perfect XD

1 Like

It uses about 3x the thinking tokens, so it is much more expensive in practice; only Opus 4 is a bit more expensive. Don’t go by the per-token price alone: the real cost difference can be 10x depending on the number of thinking tokens used.
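To make that concrete, here is a toy calculation. The token counts are invented for illustration; the point is that thinking tokens are billed as output, so they can dominate the bill even when two models share the same per-token price:

```python
OUTPUT_PRICE = 15.00 / 1_000_000  # USD per output token (the quoted Grok 4 rate)

visible_answer = 1_000    # tokens the user actually sees

# Two hypothetical models at the SAME per-token price:
frugal_thinking = 2_000   # a model that reasons briefly
heavy_thinking = 20_000   # a model that reasons at length

cost_frugal = (visible_answer + frugal_thinking) * OUTPUT_PRICE  # $0.045
cost_heavy = (visible_answer + heavy_thinking) * OUTPUT_PRICE    # $0.315
print(round(cost_heavy / cost_frugal, 1))  # -> 7.0x for the same visible answer
```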

That’s a profound ethic that I respect. To counterbalance other messages in this thread, I totally align with your vision. The future world we will live in is what we shape today with our choices and spending.

9 Likes

I have an app requirements document I use to check out all the models. I did manage to get Grok 4 to produce a solution proposal before the performance dropped off a cliff, but it wasn’t as good as Gemini 2.5’s and wasn’t even on the same planet as Claude 4’s.

I sense Elon has overhyped his model, which may be great at some tasks, but so far it isn’t impressing me when used with Cursor. Perhaps that will improve as Cursor refines the interface.

But I think Elon doesn’t understand the code development space. He actually posted that Grok 4 was better than Cursor, even though by the time he posted, Cursor already had Grok 4 as a selectable model. And chances are most developers will get their first experience of Grok 4 through Cursor.

Model creators need to be partnering with Cursor during the beta period or they’ll lose potential customers. My experience with Grok 4 under Cursor hasn’t been ideal.

But my guess is xAI had no idea millions of developers might switch to Grok 4 to see if his hyperbole had any truth. Ironically, all they needed to do was ask Grok.

3 Likes

Fully agree. I have been switching between Gemini 2.5 Pro and Grok 4 a lot today (because I hit the rate limit for Sonnet 4.0), and let’s ignore that it is slow as fck - I don’t see that great results from it, tbh.

1 Like

My one-shot post has become a constantly updated review. You might also be interested.

Looks awesome. I’ll definitely be trying and using Grok-4 as a model in Cursor :slight_smile:

It’s good, but in Cursor it has been really lacking: the model starts something, then abruptly stops mid-task.

1 Like

I am happy to hear this, someone on the same wavelength. <3

1 Like

I think they really overhyped it. Too much. It’s a good model, but wayyyyyyy too much hype, and probably contaminated on benchmarks too. I’d remove 3 points from Grok 4 and Grok 3 Mini to get a good estimate.

It’s not about the model, but about how much they nerf it trying to make it sustainable.

Tbh not really. Yes, Cursor really hobbles Grok 4, BUT with the right prompts it is really good at deep research and finding bugs/race conditions. Now, in terms of implementing stuff, yeah, it’s bad, but for research/tool use it’s really good. I’m replacing it with o3-pro right now for a week to see what happens.

mind sharing the prompt for finding bugs and race conditions?

here is a distilled version of it:

Currently, this interactive globe feature I’ve implemented is very unoptimized and resource-heavy. Can you take a look at all the code for it?

I want you to explore the codebase, follow the logic, and try imagining a scenario or edge case, then figure out if the app or code breaks when that scenario or edge case happens. Create a report on what needs to be fixed to handle it better and what we can implement for industry standards and best practices.

Now find the most complex piece of code for this interactive globe feature and propose a plan to optimize it.

Please do not change any code yet.

Hello, Grok 4 doesn’t work in my Cursor. What do I have to do?
Is it not included in the Pro plan with Cursor?