Yes, Grok-4 is powerful. But do not lose your mind over Elon’s hype machine.
If you don’t know where to look for reliable information on AI models, AI Explained on YouTube is the only AI channel I actually trust to be decently objective.
He goes through the ups and downs. I won’t be using the model myself due to Elon’s character, but for everyone else I’m sure this will help you keep your head on your shoulders.
Link: https://www.youtube.com/watch?v=dbgL00a7_xs
NOTE: I forgot what I was doing. Here is a summary provided by Gemini 2.5 Pro from analyzing the video.
1. Potentially the Smartest AI (on some tests)
According to xAI’s own benchmarks, Grok 4 outperforms its rivals (like GPT-4o, Gemini 2.5 Pro, and Claude 4) on several key academic tests, including high-school math, coding, and the “Humanity’s Last Exam” (HLE) benchmark. Elon Musk claims it’s smarter than most graduate students across all disciplines simultaneously, but he later clarified this was specifically for academic questions.
2. Take Benchmarks with a Grain of Salt
- Cherry-Picking: Like all AI companies, xAI selectively presents benchmarks where it performs best. In some tests not shown in their main charts (like the “LiveCodeBench” coding test), other models like Google’s Gemini Deep Think actually beat Grok 4 Heavy.
- Exaggerated Charts: Many of the charts don’t start their Y-axis at zero, making small performance differences look much larger than they are.
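To make the Y-axis point concrete, here is a small sketch of how much taller one bar looks than another depending on where the axis starts. The scores and the `visual_ratio` helper are made up for illustration; they are not taken from any xAI chart.

```python
def visual_ratio(a: float, b: float, axis_start: float = 0.0) -> float:
    """Ratio of the drawn bar heights when the Y-axis starts at axis_start."""
    if min(a, b) <= axis_start:
        raise ValueError("axis_start must be below both values")
    return (a - axis_start) / (b - axis_start)


# Hypothetical benchmark scores, chosen only to illustrate the effect.
score_a, score_b = 88.0, 84.0

honest = visual_ratio(score_a, score_b)            # axis starts at 0
truncated = visual_ratio(score_a, score_b, 80.0)   # axis starts at 80

print(f"axis at 0:  bar A is {honest:.2f}x the height of bar B")
print(f"axis at 80: bar A is {truncated:.2f}x the height of bar B")
```

With these numbers a 4-point gap is about a 5% difference on an honest axis, but bar A is drawn twice as tall once the axis starts at 80.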
3. Strong on Abstract Reasoning (ARC-AGI)
Grok 4 shows impressive performance on the ARC-AGI-2 benchmark, a test for “fluid intelligence” or abstract reasoning. It nearly doubles the score of previous models, suggesting it’s genuinely good at identifying and applying latent patterns, which is a big deal.
4. Struggles with “Felt” Intelligence & Spatial Reasoning
- In custom “SimpleBench” tests (designed to measure how smart a model feels), Grok 4 still falls for trick questions and struggles with complex spatial reasoning, similar to other models.
- It can be very slow to respond, sometimes taking over 200 seconds for a single answer.
5. “Grok 4 Heavy” is an Ensemble Model
The top-performing “Grok 4 Heavy” isn’t a single model. It’s an ensemble of multiple Grok 4 agents that work on a problem independently, compare notes, and then decide on the best answer—like a digital study group.
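For anyone curious what the "study group" idea looks like in code, here is a minimal sketch. xAI has not published how Grok 4 Heavy actually combines its agents, so this uses plain majority voting as a stand-in, and the `Agent` type and `ensemble_answer` function are hypothetical names, not part of any xAI API.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical agent type: anything that maps a question to an answer string.
Agent = Callable[[str], str]


def ensemble_answer(agents: Sequence[Agent], question: str) -> str:
    """Ask every agent independently, then return the most common answer."""
    if not agents:
        raise ValueError("need at least one agent")
    answers = [agent(question) for agent in agents]
    # Majority vote is an assumption here, not xAI's documented method.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Stub agents standing in for separate model runs.
agents = [lambda q: "42", lambda q: "41", lambda q: "42"]
print(ensemble_answer(agents, "What is 6 * 7?"))
```

Ties go to whichever answer was seen first. A real system would more likely have the agents critique each other's reasoning than just count votes.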
6. The Price is STEEP
- SuperGrok (Grok 4): $300/year
- SuperGrok Heavy (Grok 4 Heavy): A whopping $3,000/year or $300/month.
- For comparison, Gemini Advanced is around $20/month. xAI is promising more features like video generation for the heavy tier, but it’s a huge price difference for now.
7. API Pricing is Competitive
For developers, the Grok 4 API is priced similarly to Claude 4 Sonnet ($3 input / $15 output per 1M tokens), making it a viable, if not cheap, option for a frontier model.
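If you want a feel for what those rates mean per request, here is a quick back-of-the-envelope calculator. It only uses the $3 / $15 per 1M token figures quoted above; the token counts in the example are invented, and real bills may differ (caching, reasoning tokens, or tiered pricing are not modeled).

```python
# Rates as quoted in the summary above, in USD per 1M tokens.
INPUT_USD_PER_MTOK = 3.0
OUTPUT_USD_PER_MTOK = 15.0


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of one request at the quoted rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical request: 10k tokens of prompt, 2k tokens of answer.
cost = request_cost(10_000, 2_000)
print(f"${cost:.2f} per request")
```

Note that output tokens cost five times as much as input tokens, so long answers dominate the bill even when the prompt is much bigger.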
8. Safety & Alignment Concerns
- Grok is designed to “not shy away from making claims which are politically incorrect,” which has led to some bizarre and problematic outputs, a trend also seen with Grok 3.
- Musk’s very casual approach to AI safety (“I’d at least like to be alive to see it happen”) suggests a ‘move fast and break things’ attitude, which might not be ideal for AGI development.
