Why o3 Is the Best Model Yet for Real-world Learning


When OpenAI’s new reasoning model o3 came out, Every’s CEO Dan Shipper and OpenAI’s Sam Altman agreed that AI is changing the future of learning: If you aren’t using it to learn every day, they said, you’re “not going to make it.”

OK, I thought, I’ve got a challenge for o3: Make me physically stronger. Ten times stronger, in fact.

It’s been a life goal of mine to improve my chin-ups. I started 2024 unable to do even one, and months of working out alone got me nowhere. It wasn’t until I started working with a calisthenics trainer, Silvia, that I finally, after half a dozen focused sessions, got my first shaky repetition.

Now I want to do 10.

What better way to test AI’s capacity for teaching people in the real world than to ask it to help me achieve a goal I’ve never even come close to?

The more I thought about it, the more I liked this plan. I’d pit GPT-4o against o3 and see which model gave me a better chance of progressing from one to 10 unassisted chin-ups. I wanted to know which one would be a better teacher: 4o, the fast and reliable model I’ve been using as my daily driver, or o3, the more advanced reasoning model. Would either be up to the task? Would one emerge victorious? Let’s find out.

What I’m going to judge GPT-4o and o3 on

I would have GPT-4o, OpenAI’s older standard model, and o3 each generate a training plan separately. To evaluate them, I created a rubric based on what I think matters when you’re trying to learn something in the real world: quick feedback so you don’t make the same mistake over and over again, advice that’s tailored to your specific situation, incremental progress, and the motivation to keep going.

Responsiveness: How quickly do I get feedback?
Personalization: Is the advice tailored to me?
Progress: Does it help me get closer to my goal?
Motivation: How excited am I to keep showing up and putting in the work?

To judge the LLMs’ training plans, I also needed to define what “good” looks like. I trust my trainer, and she’s already delivered real results—so her guidance and the techniques she uses with me will serve as my baseline, the standard against which I’ll measure everything else.

GPT-4o’s training plan

Alright, first up: GPT-4o. Language models are only as good as the context you give them, so I made sure to be specific. In my prompt, I included my age, height, weight, the number of chin-ups I can currently do, available equipment, and training schedule. I also attached videos of me doing one unassisted and one assisted chin-up.

What worked

It set an achievable target

4o started by telling me that going from one to 10 unassisted chin-ups in just one month was an unrealistic goal (I did this exercise a few days before OpenAI pushed the infamous update that made 4o disingenuously agreeable toward users). This tracks with what Silvia told me, and just like her, 4o gave me an interim goal of four to six reps to keep me motivated. OpenAI’s flagship model was off to a great start.

It picked up on key details in the video

The video I uploaded shows my clenched face and pursed lips as I struggle to get my chin the last few inches over the bar. If you saw it, you’d definitely notice how hard I was trying, and to my surprise, GPT-4o did too. It said that I was getting over the bar “with solid effort” and even called out the slight tension in my shoulders before I started the chin-up (ideal form would have me start from a dead hang; in other words, with no tension in my shoulders). Props to the model for picking up on such granular detail, on par with the feedback of a personal trainer.

It structured the training well

The model split my training into two parts: strength and control on one day, volume and endurance on the other. Every week in the plan followed this structure, which lined up closely with how Silvia designs my workouts—a good sign.