Midjourney/Every illustration.

Vibe Check: Grok 4 Aced Its Exams. The Real World Is a Different Story.

The smartest model isn’t always the most useful one




Grok 4 is topping some big AI benchmarks. So why have the responses to it been so mixed? And how come Every’s engineers aren’t using it much?

xAI’s latest launch, Grok 4, is positioned as an LLM with advanced reasoning capabilities. The model debuted last week in a livestream featuring Elon Musk and other members of the xAI team seated on black sofas and pointing at graphs that seemed to indicate Grok 4’s superior performance on prominent benchmarks like Humanity’s Last Exam and ARC-AGI.

But the TL;DR from our Studio team is this: Grok 4 is smart, but it seems overtrained on benchmarks and isn't useful enough to be a go-to for everyday tasks. It should be good at coding, but without its own built-in command-line interface (CLI), the barrier to trying it is high. (A CLI is a text-based interface where developers type instructions directly to the model, without needing to switch between apps or windows.)

“There are new competitive dynamics here—Claude Code [which has its own CLI] is sticky,” Every CEO and cofounder Dan Shipper says.
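For readers who haven't used one, here is a rough sketch of what that terminal workflow boils down to: a loop that takes typed instructions and sends them straight to the model. The OpenAI-compatible endpoint and the "grok-4" model name are assumptions for illustration, not a description of xAI's own tooling, and that is part of the point: without a built-in CLI, you have to wire something like this up yourself or lean on a third-party tool.

```python
# Minimal sketch of a model-in-the-terminal loop, the kind of workflow a
# built-in CLI like Claude Code provides out of the box.
# Assumptions: an OpenAI-compatible endpoint and the model name "grok-4"
# are illustrative placeholders, not confirmed specifics of xAI's API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",  # assumed endpoint
    api_key="YOUR_XAI_API_KEY",
)

history = []
while True:
    prompt = input("you> ")
    if prompt.strip() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model="grok-4", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(answer)
```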

Here’s what’s new, what the team thinks, and what everyone else thinks.

The nuts and bolts of Grok 4

Grok 4 is a reasoning model where you can’t see the reasoning tokens or turn the reasoning mode off. In other words, it always thinks deeply before answering, but won’t show you how it got there or let you stop it from thinking so deeply.

xAI trained the model through reinforcement learning tailored to increase its reasoning capabilities—and as a result, Grok 4 is touted to excel in technical domains like math and physics. It accepts both images and text prompts and has a context window of 256,000 tokens, double that of its predecessor, Grok 3, and more than both OpenAI’s o3 and Claude Opus 4, which are currently capped at 200,000 tokens.

The launch also included Grok 4 Heavy, described as Grok 4’s more powerful version. While explaining how it worked, Musk said it “spawns multiple [Grok 4] agents in parallel,” and then they compare their work “like a study group” to find the best answer.
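That "study group" description maps onto a familiar pattern: fan the same question out to several model instances in parallel, then have one more call compare the candidates and pick a winner. Here is a minimal sketch of that pattern; it reuses the assumed client setup from the snippet above and illustrates the general idea, not xAI's actual implementation of Grok 4 Heavy.

```python
# Sketch of the "study group" pattern Musk described: several agents answer
# in parallel, then a judging call picks the strongest answer.
# Illustrative only; not xAI's implementation of Grok 4 Heavy.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_XAI_API_KEY")  # assumed endpoint

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="grok-4",  # assumed model name
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def study_group(question: str, n_agents: int = 4) -> str:
    # Fan the same question out to n_agents parallel calls.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        candidates = list(pool.map(ask, [question] * n_agents))
    # One more call compares the candidates "like a study group."
    numbered = "\n\n".join(f"Answer {i + 1}:\n{c}" for i, c in enumerate(candidates))
    judge_prompt = (
        f"Question: {question}\n\n{numbered}\n\n"
        "Compare these answers and reply with the single best one."
    )
    return ask(judge_prompt)
```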

The models are available to consumers through two subscription plans: the “SuperGrok” plan at $30 per month, or the “SuperGrok Heavy” plan at $300 per month, which includes access to Grok 4 Heavy. For developers, Grok 4 matches the cost of Anthropic's Claude Sonnet 4: $3 per million input tokens and $15 per million output tokens.
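To put those per-token rates in concrete terms, here is a quick back-of-the-envelope calculation at Grok 4's published API pricing; the request sizes are invented purely for illustration.

```python
# Back-of-the-envelope API cost at Grok 4's published rates:
# $3 per million input tokens, $15 per million output tokens
# (the same rates as Claude Sonnet 4). Request sizes are made up.
INPUT_RATE = 3 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 50,000-token prompt with a 2,000-token reply
print(f"${request_cost(50_000, 2_000):.2f}")   # $0.18

# Example: filling most of the 256,000-token context window
print(f"${request_cost(250_000, 8_000):.2f}")  # $0.87
```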

When a model gets the answer right but misses the point

Grok 4 should, in theory, excel at coding tasks thanks to its reasoning-first training. But early signals suggest that it’s been overfitted to do well on benchmarks—or to correctly answer what writer Zvi Mowshowitz calls “exam-shaped questions.” Physicist Casey Handmer tested Grok 4 on questions where the process of answering mattered more than the result, and found that the model did not perform well. “Grok 4 is routinely nailing Physics Olympiad style problems,” Handmer tweeted, “and yet it seems to still be missing the core of insight which is so critical to physics.”

The result is a model that looks more useful on benchmarks than it is in the real world. Its lack of tooling adds to the friction: Grok 4 doesn’t come with a built-in CLI, so using it takes more setup—unless you go through a third-party tool like the AI code editor Cursor. (Most of the Every team has moved away from Cursor, since Claude Code is now more tightly integrated into their day-to-day workflows.)

And then there are the safety issues associated with xAI’s models. Writer Simon Willison found that Grok 4 appeared to reference Elon Musk’s views when responding to controversial questions. An Anthropic researcher openly criticized xAI for not releasing the safety-testing documentation that is standard practice in the industry. Grok 3 also showered praise on Adolf Hitler earlier this month, and though xAI issued a statement of apology, this kind of behavior from a frontier model does little to build trust.

What everyone at Every is thinking…

Here’s how Grok 4 performs on some of the benchmarks the Every team uses internally to evaluate new models:

Diplomacy: Top-level performance
Comments

@federicoescobarcordoba 4 months ago

I jumped on the Grok 4 ship by subscribing as soon as it launched—and then canceled within two days. It was obvious the writing was off (try asking it to complete a paragraph, and it’ll produce something that sounds like it’s from another century). Its analysis wasn’t as sharp as Grok 3’s, either. I’ve seen some improvement over the past few days—at least in analysis, which now feels on par with Grok 3. But for my use cases, Gemini 2.5 Pro is still leagues ahead of the pack. Thanks for the insights from this post. Ethan Mollick posted about the McNamara Fallacy recently, and my mind went to Grok 4 right away.

Tintin 4 months ago

@federicoescobarcordoba Gemini 2.5 Pro for me as well. I don't code at all; I use it for technical writing and health/longevity-related advice.