ChatGPT/Every illustration.

o3-pro Vibe Check—A Slow, Steady Last Resort

OpenAI’s latest model trades speed for occasional brilliance—when nothing else works, it might



Sometimes the only way to solve a hard problem is to surrender. It’s 9 p.m., and you've worn down your whiteboard, emptied your coffee pot, and thrown your Oblique Strategy cards across the room. So you pop a melatonin, crawl into bed, mutter “fuck it” into your pillow, and conk out. If you’re lucky, you jump out of bed at 4:15 a.m. with the answer fully formed in your head.

That’s what using o3-pro—the more powerful version of o3—is like: You take a quick pass with every other model first, and when you’re stuck, you head to o3-pro, type your prompt, hit “return,” and surrender. It’s very slow and doesn’t work every time, but sometimes it’s smart enough to one-shot an answer you wouldn’t have gotten with any other model.

It’s been out for about two weeks, so this vibe check violates our day-zero promise—sorry about that! I have a good excuse: I went on a week-long meditation retreat and OpenAI dropped the model while I was, presumably, deep in a Jhana. Rude.

But, as they say about both o3-pro’s responses and tardy reviews: Better late than never. So let’s get into it. As always, we’ll start with the Reach Test.

The Reach Test: Do we reach for o3-pro over other models? No.

Every’s Alex Duffy summed up o3-pro well when he told me that it’s the last model he tries with a basic prompt before he invests time in more complex, detailed prompt engineering. “It does a good chunk of prompt engineering for me, so it raises the floor of the responses I get without putting in effort,” he said.

That’s a common o3-pro pattern after two weeks: No one is using it all of the time, but a lot of us are using it every once in a while.

For day-to-day tasks: no

It takes 5-20 minutes to get a response, so it’s way too slow to be usable for day-to-day tasks, like quick searches or basic document analysis.

For coding: no

Claude Code is by far my most reached-for coding tool, and it (obviously) doesn’t include o3-pro. It’s the same thing for everyone else at Every—the whole company has been Claude Code-pilled for the last few weeks, so o3-pro hasn’t been incorporated into our development workflows.

Part of the problem is that in order to use o3-pro with Claude Code, you need to copy and paste its responses into your editor. Cora general manager Kieran Klaassen told me, “It’s too hard to use. Copying and pasting code out of ChatGPT feels so 2023.”

o3-pro also doesn’t yet natively support Canvas (ChatGPT’s version of Anthropic’s Artifacts, which renders code and documents), which makes it even harder to use for quick coding tasks.

For writing and editing: no

Its writing and editing aren’t noticeably better than o3’s, but it takes noticeably longer to get results, so it fails here. I tested it on one of the prompts we use inside our content automation tool Spiral to judge whether writing is engaging, and it failed.

So far, only Claude Opus 4 passes this one.

For research tasks: yes

This is where o3-pro shines. If you have a ton of context that you want a model to sift through, it will give you a well-reasoned, concise answer. (This makes it a better first option than deep research, which tends to write dissertations.) o3-pro seems to be able to use more of its context window and reason more effectively than other models.

For example, when I asked it to predict my future career trajectory, it returned some interesting ideas:

“Dan Shipper has repeatedly fused clear writing with hands‑on product building. If he sustains operational focus, Every could mature into a small‑cap ‘AI Bloomberg for operators,’ with Shipper evolving into a public intellectual‑founder bridging journalism, product design and responsible AI. Failure to execute, however, could see him back in the role of prolific essayist/EIR—still influential, but sans scaled platform.”

o3-pro’s response stood out because in its full response (not shown above), it considered three cases: upside, average, and downside. It felt like it was reasoning through likely options rather than just giving me what I wanted to hear (which is what Claude Opus 4’s response felt like).

It’s possible I liked this response because it was complimentary (actually, it’s probable), but it also predicted a few things that we’re doing but haven’t announced yet.

I also fed it a draft of the book I’ve been writing—about 45,000 words—and asked it to tell me what it thought the book was about. I typically say that it’s about how the history of AI is speed-running the history of philosophy, and the same changes that AI and philosophy went through are coming for the rest of culture. o3-pro told me it was about my own internal journey:
