Midjourney/Every illustration.

Vibe Check: Grok 4 Aced Its Exams. The Real World Is a Different Story.

The smartest model isn’t always the most useful one




Grok 4 is topping some big AI benchmarks. So why have the responses to it been so mixed? And how come Every’s engineers aren’t using it much?

xAI’s latest launch, Grok 4, is positioned as an LLM with advanced reasoning capabilities. The model debuted last week in a livestream featuring Elon Musk and other members of the xAI team seated on black sofas and pointing at graphs that seemed to indicate Grok 4’s superior performance on prominent benchmarks like Humanity’s Last Exam and ARC-AGI.

But the TL;DR from our Studio team is this: Grok 4 is smart, but it seems overtrained on benchmarks and isn't useful enough to be a go-to for everyday tasks. It should be good at coding, but without its own built-in command-line interface (CLI), the barrier to trying it is high. (A CLI is a text-based interface where developers type instructions directly to the model, without needing to switch between apps or windows.)

“There are new competitive dynamics here—Claude Code [which has its own CLI] is sticky,” Every CEO and cofounder Dan Shipper says.
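For readers who haven't used one, here is a rough sketch of what that terminal workflow boils down to: a loop that takes typed instructions and sends them straight to the model. The OpenAI-compatible endpoint and the "grok-4" model name are assumptions for illustration, not a description of xAI's own tooling, and that is part of the point: without a built-in CLI, you have to wire something like this up yourself or lean on a third-party tool.

```python
# Minimal sketch of a model-in-the-terminal loop, the kind of workflow a
# built-in CLI like Claude Code provides out of the box.
# Assumptions: an OpenAI-compatible endpoint and the model name "grok-4"
# are illustrative placeholders, not confirmed specifics of xAI's API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",  # assumed endpoint
    api_key="YOUR_XAI_API_KEY",
)

history = []
while True:
    prompt = input("you> ")
    if prompt.strip() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model="grok-4", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(answer)
```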

Here’s what’s new, what the team thinks, and what everyone else thinks.

The nuts and bolts of Grok 4

Grok 4 is a reasoning model where you can’t see the reasoning tokens or turn the reasoning mode off. In other words, it always thinks deeply before answering, but won’t show you how it got there or let you stop it from thinking so deeply.

xAI trained the model through reinforcement learning tailored to increase its reasoning capabilities—and as a result, Grok 4 is touted to excel in technical domains like math and physics. It accepts both images and text prompts and has a context window of 256,000 tokens, double that of its predecessor, Grok 3, and more than both OpenAI’s o3 and Claude Opus 4, which are currently capped at 200,000 tokens.

The launch also included Grok 4 Heavy, described as Grok 4’s more powerful version. While explaining how it worked, Musk said it “spawns multiple [Grok 4] agents in parallel,” and then they compare their work “like a study group” to find the best answer.
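That "study group" description maps onto a familiar pattern: fan the same question out to several model instances in parallel, then have one more call compare the candidates and pick a winner. Here is a minimal sketch of that pattern; it reuses the assumed client setup from the snippet above and illustrates the general idea, not xAI's actual implementation of Grok 4 Heavy.

```python
# Sketch of the "study group" pattern Musk described: several agents answer
# in parallel, then a judging call picks the strongest answer.
# Illustrative only; not xAI's implementation of Grok 4 Heavy.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_XAI_API_KEY")  # assumed endpoint

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="grok-4",  # assumed model name
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def study_group(question: str, n_agents: int = 4) -> str:
    # Fan the same question out to n_agents parallel calls.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        candidates = list(pool.map(ask, [question] * n_agents))
    # One more call compares the candidates "like a study group."
    numbered = "\n\n".join(f"Answer {i + 1}:\n{c}" for i, c in enumerate(candidates))
    judge_prompt = (
        f"Question: {question}\n\n{numbered}\n\n"
        "Compare these answers and reply with the single best one."
    )
    return ask(judge_prompt)
```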

The models are available to consumers through two subscription plans: the “SuperGrok” plan at $30 per month, or the “SuperGrok Heavy” plan at $300 per month, which includes access to Grok 4 Heavy. For developers, Grok 4 matches the cost of Anthropic's Claude Sonnet 4: $3 per million input tokens and $15 per million output tokens.
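To put those per-token rates in concrete terms, here is a quick back-of-the-envelope calculation at Grok 4's published API pricing; the request sizes are invented purely for illustration.

```python
# Back-of-the-envelope API cost at Grok 4's published rates:
# $3 per million input tokens, $15 per million output tokens
# (the same rates as Claude Sonnet 4). Request sizes are made up.
INPUT_RATE = 3 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 50,000-token prompt with a 2,000-token reply
print(f"${request_cost(50_000, 2_000):.2f}")   # $0.18

# Example: filling most of the 256,000-token context window
print(f"${request_cost(250_000, 8_000):.2f}")  # $0.87
```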

When a model gets the answer right but misses the point

Grok 4 should, in theory, excel at coding tasks thanks to its reasoning-first training. But early signals suggest that it’s been overfitted to do well on benchmarks—or to correctly answer what writer Zvi Mowshowitz calls “exam-shaped questions.” Physicist Casey Handmer tested Grok 4 on questions where the process of answering mattered more than the result, and found that the model did not perform well. “Grok 4 is routinely nailing Physics Olympiad style problems,” Handmer tweeted, “and yet it seems to still be missing the core of insight which is so critical to physics.”

The result is a model that looks more useful on benchmarks than it is in the real world. Its lack of tooling adds to the friction: Grok 4 doesn’t come with a built-in CLI, so using it takes more setup—unless you go through a third-party tool like the AI code editor Cursor. (Most of the Every team has moved away from Cursor, since Claude Code is now more tightly integrated into their day-to-day workflows.)

And then there are the safety issues associated with xAI’s models. Writer Simon Willison found that Grok 4 appeared to reference Elon Musk’s views when responding to controversial questions. An Anthropic researcher openly criticized xAI for not releasing the safety-testing documentation that is standard practice in the industry. Grok 3 also showered praise on Adolf Hitler earlier this month, and though xAI issued a statement of apology, this kind of behavior from a frontier model does little to build trust.

What everyone at Every is thinking…

Here’s how Grok 4 performs on some of the benchmarks the Every team uses internally to evaluate new models:

Diplomacy: Top-level performance
Comments

@federicoescobarcordoba 4 months ago

I jumped on the Grok 4 ship by subscribing as soon as it launched—and then canceled within two days. It was obvious the writing was off (try asking it to complete a paragraph, and it’ll produce something that sounds like it’s from another century). Its analysis wasn’t as sharp as Grok 3’s, either. I’ve seen some improvement over the past few days—at least in analysis, which now feels on par with Grok 3. But for my use cases, Gemini 2.5 Pro is still leagues ahead of the pack. Thanks for the insights from this post. Ethan Mollick posted about the McNamara Fallacy recently, and my mind went to Grok 4 right away.

Tintin 4 months ago

@federicoescobarcordoba Gemini 2.5 Pro for me as well. I don't code at all; I use it for technical writing and health/longevity-related advice.