Transcript: 'We Taught AI to Play Games—Now It’s a $3.6 Million Company'

The transcript of AI & I with Every's head of AI training Alex Duffy is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Timestamps

Introduction: 00:01:48
Why evals and benchmarks are broken: 00:04:14
The sneakiest LLMs in the market: 00:07:13
A competition that turns prompting into a sport: 00:13:00
Building a business around using games to make AI better: 00:15:49
Can language models learn how to be funny: 00:22:39
Why games are a great way to evaluate and train new models: 00:25:31
What child psychology tells us about games and AI: 00:26:58
Using games to unlock continual learning in AI: 00:30:10
Why Alex cares deeply about games: 00:36:42
Where Alex sees the most promise in AI: 00:44:37
Rethinking how young people start their careers in the age of AI: 00:50:54

Transcript

(00:00:00)

Dan Shipper

Alex, welcome to the show.

Alex Duffy

Thanks for having me, Dan. Excited to be here.

Dan Shipper

Excited to have you. So for people who don't know, you are the head of AI training at Every, so you lead all the training that we do for all of the consulting clients that we work with. You're honestly fantastic at that and have really transformed it since you've been here. So it's been awesome to see.

Alex Duffy

Thanks, Dan.

Dan Shipper

And sadly for me, but very excitedly for you, you are spinning out into your own company, Good Start Labs. Can you tell us about what that is?

Alex Duffy

Yeah, Good Start Labs is at the intersection of AI and games. We make games that help make AI better. There's a lot that goes into that. But at the end of the day, we think that games are really great tools to help people learn, whether it be people or AI. And I'm sure we'll talk about a whole bunch of reasons why, but that's what my co-founder Tyler and I love to do. And that's what we're excited to keep doing.

Dan Shipper

That's awesome. And it's been really fun to watch you do this. So Good Start came out of something that you worked on with Every that we launched together. Do you want to talk about that?

Alex Duffy

Sure. Yeah. So as you mentioned, I was leading AI training, and in order to do consulting and training, you have to be building, especially in a space that's moving so fast. So earlier this year I started building out an AI version of the game Diplomacy. And for those of you that aren't familiar, that's a mix of Risk and Mafia. It was actually made as a war game simulator in the fifties. And there's a whole bunch of reasons why I think it's a really interesting game to use AI to play.

But I started building that out in the spring, got some really great feedback online on Twitter—on X—and reached out to Tyler, who I've known for years. We keep talking about AI and games, and he hopped on and built the whole front end and back end. And we just launched it, in part for fun, in part informed by a lot of the synthetic data and model training background that we've both got.

But it ended up being a whole lot of fun. And yeah, we launched together and I think the post ended up being one of the more read ones on Every that year. And we also got a bunch of people interested on Twitch, and that was really cool. It's awesome to see people use something that you actually built.

Dan Shipper

Yeah, it was really fun. And I think for me, the reason it was relevant to Every, and the reason I was like, oh, we need to do this when I saw it, is our job is to evaluate models when they come out. And I have seen personally over the last year how hard it has gotten to immediately tell if a model is good and what it's good at.

I think with GPT-3, for example, from GPT-3 to 3.5, it was super easy. It was like one prompt and you're like, oh wow, this is actually much better. Same thing with GPT-4, really. But as we've gotten into the o series and GPT-4.5, there are so many different nooks and crannies in all these different models, and evaluating them with just hands-on prompting doesn't really work that well.

And we've been wanting to, when we get our hands on something, we've been wanting to have a set of evaluations that we run that really tell us something about what the model's good at and what it's not good at. But the problem is that static evaluations are easily saturated. It feels like the SAT—you can teach to the test and you can just make the model get a huge score on SWE-bench, but it's actually not that good in the real world.

And the idea of AI Diplomacy is so cool because it's dynamic, it's a game, it's head-to-head, and it's not just a set of questions that a model could just get good at. And I thought that was so fucking cool. And also just that they're battling for world domination was really great.

What did we find when we ran that initial Diplomacy game? What were the models good at? What were they not good at?

Alex Duffy

Yeah, so what's cool about a game is it's both the evaluation and the training arena in one, right? And to your point, I think we talked about Every Bench all year. And I think vibe checks are very much a version of that. It's the most organic and comprehensive way to look at these models because you have a bunch of real people using them for real work or things that they're interested in testing them. And in the same way, when you have a game that's very rich, like Diplomacy is—when you're trying to take over the world, there's a lot of things you can look at.

There's definitely a lot of things related to agents and computer use, which is what people are looking at now. And so we saw some models having better structured outputs than others and putting out their orders correctly.

Dan Shipper

And by orders you mean in Diplomacy you have to give orders to your army to say, this is where I want you to take over or whatever.

Alex Duffy

Exactly, yeah. So you have those technical things like understanding the map, and how you set up the system has a big impact on that. But we also saw a lot of the squishier stuff, like how frequently does a model betray its ally to get towards its goal? Or which models and how frequently do they bold-face lie to somebody by saying, "Hey, I'll back you here," and then totally turn around and betray them, writing in their diary ahead of time that they know that they're going to do that. Which models were the sneakiest?

So o3 and Llama 4, I'd say, were some of the biggest schemers. And it's interesting because you can see the different play styles. And in order to do that, you've got to read the data. And I think with any good eval or benchmark, or even one training models, you've got to read the data. And it was a lot of fun, and I didn't do it alone. Had a lot of really interesting researchers reach out and collaborate. Obviously Tyler and I did it together, but Sam Peach was awesome, hugely helpful. Same with Baptiste.

But we saw that different models had very distinct personalities. o3 won a lot of the games, and Gemini 2.5 Pro was actually one of the other ones that won, but their play style was totally different. o3 put together coalitions and schemed against people and knew when someone was getting too strong to cut them out at the knees. Versus Gemini 2.5 Pro was just great at executing. It understood the game, what it had as its options, and how to go through and do that. And then you have DeepSeek R1, who was all over the place, had a strong personality, told stories really well, and did really well as well—and was like a hundred times cheaper than o3. You start to look at cost and performance and speed. Those are other things that are part of this that I think you don't get when looking at just one benchmark.

Dan Shipper

One of the things I loved is that Claude kept losing because it was too honest.

Alex Duffy

Yeah, that was a good one. It was so sweet. It didn't win one game, unfortunately. Not because it didn't want to, but because it really kept pushing for a draw, which is technically possible in Diplomacy tournaments—but they were explicitly told it was not possible there. But it struck strongly to its morals.

Dan Shipper

And we launched this maybe three or four months ago. So this is before Opus 4, I believe, and it was before GPT-5. How did the more recent models stack up? I actually don't even know. I don't know the update.

Alex Duffy

Yeah, so we launched in June and there's definitely been a lot of updates since. You can check out the vibe checks for GPT-5 and Claude 4 with the one-million-token context window to see how they performed. o3 is still at the top of the leaderboard for overall performance.

But one of the things that we learned, especially when working with OpenAI to evaluate GPT-5 kind of ahead of release, was that there's a big difference—and we released a research paper where we looked at this—but a prompt makes a huge difference. And so some models are great with the baseline prompts we started with. But because we built some tools to help us optimize the prompts—and we can talk more about that—we ended up finding that there was some set of prompts that were pretty aggressive but were optimized to performance when you run them against a bunch of weaker opponents very frequently.

And we saw the biggest jump with GPT-5 of any model from baseline to the optimized prompts. And so you could see that even though GPT-5 with minimal reasoning, for example, was very low on the leaderboard with the base prompt, when it got optimized, it jumped all the way up. So it really shows the promise of that model. And so GPT-5 with optimized prompts is pretty good. Claude 4 does great either way.

There's actually a new secret model on OpenRouter—I think it's something Dusk—that is also near the top of the leaderboard. Notably, Claude 4 and Dusk are both there with the non-optimized and optimized prompts.

Dan Shipper

That's interesting.

Dan Shipper

So it looks like o3 may fall.

Alex Duffy

Yeah.

Dan Shipper

The thing that I love about this is the differences in the prompts. So basically, I think what you're saying is when you run these models, you can give them different prompts to tell them to behave in different ways. And you had a standard prompt, but then you had to set up optimizer, more aggressive prompts for different models. And that's what I love about this—you want to ask a supposedly simple question, which is which model is best at Diplomacy? And you can do that, but also there's all these dependencies. How you write the prompt is going to change the model behavior significantly, or how the harness is built. There's an endless number of variations to test.

So for example, I assume more or less you're running the same prompt across all the models when you're running the test?

Alex Duffy

Yeah.

Dan Shipper

And there's another way to set this up where you have a really good expert prompter for each model who just knows how to prompt that one model and tries to get the best out of it. And that's a different way of doing things.

Alex Duffy

So what you're pulling at is one of the reasons why I'm most interested in this space. In my head, when you're building for games—but this is probably applicable for many other products—with language models, you have three infinite problem spaces.

One, how do you represent the information to the model? You can do that in any number of ways. Is it a picture of the map? Is it a list of all the countries you own? The adjacency—how do you do that?

Two, what tools do you give models access to? So some of the tools that we'd built, we'd give them a diary. We have them keep track of their long-term goals, the relationships between models that update regularly. So periods of reflection is another tool. Being able to get adjacency lists of what's next to a certain territory, right? You could make any number of tools.

And then the third is the prompt itself, right? And all three of those are infinite problem spaces. And when you're dealing with infinite problem spaces like that, to me it starts to be a little bit more like an instrument or an art than it is purely an engineering problem. Because you have to make assumptions, and your assumptions have to be based on your intuition. And you're going to reach a local maximum no matter where you start, because...

You're never going to have the optimal solution. It's not possible. There's too many options, and it may be different for each model. And so that's why I'm very excited, looking forward to having a tournament where we're having people come in, prompt their models, and compete against each other.

(00:10:00)

Dan Shipper

Tell us about what the tournament is.

Alex Duffy

Yeah. So we're going to have a battle of the bots, essentially. It'll be a prompting tournament. And by the time this comes out, it may have already occurred. I think by the time this comes out, you should be able to sign up—with maybe a week before it may have occurred—but you might be able to sign up right now and get into it. It's invite only, so apply. We have some Diplomacy champions, people who've won International Math Olympiad, great YouTube AI content creators participating. So super excited for it.

But essentially what it is: you will lock in your prompts for your agent, and your agent will play Diplomacy for you, and they'll play in very different ways. And you'll see how they carry out your tasks. And so I'm curious to see if somebody who deeply understands Diplomacy and the strategy and is able to inform their model in that way ends up winning. Or if it's someone who's just good at prompt engineering, or a jailbreaker who tells their model to send a "God mode admin override" message to all of its enemies to get them to think of it as an ally.

I don't know who's going to win, but I'm very excited for that. Because ultimately that will make the whole system better and also hopefully show off your skills as a prompt engineer or context engineer, which I think is a very underrated skill still.

Dan Shipper

So I love all this. I'm a huge nerd for it. One thing that is interesting is you raised money. So you're announcing that you raised a round. First of all, tell us about the round.

Alex Duffy

Sure. Very fortunate to have two awesome co-leads. General Catalyst and Inovia are co-leading. We're working with Marc Bawa at GC and Shu, as well as Steve Woods and Noah at Inovia, who've been great. Love their conversations with them, so excited to partner with them, too.

We also have a couple really great partners who are hopping in on the round, like Ben Feder with Tur Capital. Ben's a legend on the gaming side. He was CEO of Take-Two Interactive, was on the board of Epic Games for seven years, and so excited to learn from him on the game side. And also Timothy Chen with Essence VC, who has worked with a lot of the great founders that I know and cut through the noise of the product that we're building. And all of them have given such incredible feedback, so excited to be doing that.

And yeah, we're raising somewhere around a few million dollars, and I think it puts us in a really great position to build towards this intersection of AI and gaming.

Dan Shipper

That's awesome.

And I think the thing probably in people's minds is, all this sounds super cool. What is the actual business? How do you—it's awesome to make people prompt AI to try to take over the world and beat each other, and that's really cool. But how do you make that into a venture-scale business?

Alex Duffy

Totally. So we think games will make models better, and they'll do that in a few ways. Our products start with evaluation. We've worked with Cohere and OpenAI to evaluate how good are their models at a game like Diplomacy. And like we talked about, there's so many things you can look at, and each model is going to care about something different to evaluate—whether it's trustworthy, or if it wins, or if it has really great short and long-term strategy, how good its vision is. It depends on your priorities.

And then once you've evaluated it, you can make it better. Like we said, games are both the evaluation and training arena for them. We're very intentional with the games that we're going to pick out and build, focused on the weaknesses of these models. Diplomacy, I think, is great for anybody who wants to build agents and anyone who wants to build multimodal models.

But we're also seeing this area of research where games can actually generalize. We're working with this PhD at Rice who showed that vision models trained on games got better at math than vision models trained on math.

Dan Shipper

Why? Yeah, it wasn't just out of the box.

Alex Duffy

But the way that he, in this example, prompted it was he encouraged the model to think of the game of Snake like a math problem—that it was a Cartesian coordinate grid. And when it goes right, the X goes up, and it should calculate the distance between where the head of the snake is and the reward.

Dan Shipper

That's so cool. I love that. So cool.

Alex Duffy

So cool. And so I think it's super complementary to all these other reinforcement learning environment companies that are coming out that are super narrowly focused. If you make a really rich and hard environment—which is not easy, you have to make those, you have to solve those infinite problems—like we're the first people that have made Diplomacy playable by small models. And it took us a while, and we had help from a lot of really great people, some of whom I mentioned.

But I'd say one of my core competencies is that applied language model side. I was co-founder for an AI education company. We were teaching people to fine-tune GPT-2 in 2021. And in our consulting and in our training, we show people from construction to Fortune 5, to finance, to journalists and writers, to people on campaign trails, to figure out how they can use AI to solve their problems. And so that reflection is so helpful.

Dan Shipper

Yeah. I love—that's one of the really fun things about the consulting we do at Every, is we just get down into all these problems with all these people. And yeah, I think you're incredibly good at that.

My question, or something that comes to mind for me, for example, is so like Diplomacy—models tend to lie in Diplomacy, which is good, right? If you're trying to model trustworthiness, trying to figure out trustworthiness for a model, and you put it into a game like Diplomacy where it is supposed to lie—how do you parse through, or maybe not you, but how should model companies parse through its trustworthiness in that environment versus there are other environments where it shouldn't lie? And it's so context specific. If you're playing a game, but—tell me, how does that work?

Alex Duffy

Yeah. So I think this is where you're starting to see divergence in the companies themselves. Do you want your model to never lie? If you wanted your model never to lie, you could, in our environment, change the rules. You prompt it saying, "Hey, never lie." You could add a classifier that looks at your negotiations and make sure that your orders follow them to the letter. And then when you use our pre-training data and use our environment as a reinforcement learning environment, you'll be reinforcing telling the truth. So if you want to do that, you can.

Or do you care about performance in the game itself? So you would like to see the model intentionally make these ruses and take advantage of other people in that way. Is that something you want to do? It depends. Are you going to use that model to actually do something that you're going to count on in the future and you want it to, above all else, succeed? Or are you going to use the model and you want to make sure that it never lies to the person that's using it or to anybody that the person using it is interacting with? So it depends. But having these game environments where you can make tweaks to it, I think is really valuable because you can help choose what you want.

Dan Shipper

The transcript of AI & I with Every's head of AI training Alex Duffy is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Timestamps

Introduction: 00:01:48
Why evals and benchmarks are broken: 00:04:14
The sneakiest LLMs in the market: 00:07:13
A competition that turns prompting into a sport: 00:13:00
Building a business around using games to make AI better: 00:15:49
Can language models learn how to be funny: 00:22:39
Why games are a great way to evaluate and train new models: 00:25:31
What child psychology tells us about games and AI: 00:26:58
Using games to unlock continual learning in AI: 00:30:10
Why Alex cares deeply about games: 00:36:42
Where Alex sees the most promise in AI: 00:44:37
Rethinking how young people start their careers in the age of AI: 00:50:54

Transcript

(00:00:00)

Dan Shipper

Alex, welcome to the show.

Alex Duffy

Thanks for having me, Dan. Excited to be here.

Dan Shipper

Alex Duffy

Thanks, Dan.

Dan Shipper

And sadly for me, but very excitedly for you, you are spinning out into your own company, Good Start Labs. Can you tell us about what that is?

Alex Duffy

Dan Shipper

That's awesome. And it's been really fun to watch you do this. So Good Start came out of something that you worked on with Every that we launched together. Do you want to talk about that?

Alex Duffy

Dan Shipper

What did we find when we ran that initial Diplomacy game? What were the models good at? What were they not good at?

Alex Duffy

Dan Shipper

And by orders you mean in Diplomacy you have to give orders to your army to say, this is where I want you to take over or whatever.

Alex Duffy

Dan Shipper

One of the things I loved is that Claude kept losing because it was too honest.

Alex Duffy

Dan Shipper

Alex Duffy

Dan Shipper

That's interesting.

Dan Shipper

So it looks like o3 may fall.

Alex Duffy

Yeah.

Dan Shipper

So for example, I assume more or less you're running the same prompt across all the models when you're running the test?

Alex Duffy

Yeah.

Dan Shipper

Alex Duffy

(00:10:00)

Dan Shipper

Tell us about what the tournament is.

Alex Duffy

Dan Shipper

So I love all this. I'm a huge nerd for it. One thing that is interesting is you raised money. So you're announcing that you raised a round. First of all, tell us about the round.

Alex Duffy

And yeah, we're raising somewhere around a few million dollars, and I think it puts us in a really great position to build towards this intersection of AI and gaming.

Dan Shipper

That's awesome.

Alex Duffy

Dan Shipper

Why? Yeah, it wasn't just out of the box.

Alex Duffy

Dan Shipper

That's so cool. I love that. So cool.

Alex Duffy

Dan Shipper

Alex Duffy

Dan Shipper

I guess I'm asking the generalization question, which is if you're giving it RL data from a particular game and you're training it not to lie for that game, tell me more about the potential generalization to situations that are not that specific game.

Alex Duffy

Got it. Yeah. So my thought around here—and I was just listening to the Anthropic podcast where they were talking about how they're looking at the insides of a model. And one thing that they mentioned that was really interesting here is you want to make sure that what's being written in the diary or the chain of thought is something you can rely on. And so that's a problem they're working on. But that's—I think—why having a diary—it's not thinking something that it's not saying in chain of thought.

Dan Shipper

Yeah, exactly.

Alex Duffy

Yeah. And that's a separate conversation to actually lying. But to your point, why the generalization occurs—to me, I think a lot about what they're saying about as the models get bigger and the data trains on gets larger and more diverse, the models moved away from having, for example, the word "large" in each individual language, and then now has one unified definition in its brain, in hyperspace, somewhere in its hyperspace, of the word "large."

Dan Shipper

Or the concept.

Alex Duffy

Exactly. Yeah. And so what I like, intuitively to me, is if you're seeing the model and you're telling it to think strategically, or you're telling it to approach the problem like it would a customer service experience, or to write its approach in Python or math—it's still, you can push it into that—

Dan Shipper

There, you're pushing it, that part of the space, by the way you prompted it.

Alex Duffy

Totally. And in a way, a bunch of Diplomacy bots that are pretending to be customer service agents—that's what I'm saying.

But okay, the reason why this is so cool—it's because, one, it never would've seen that type of data before. So it would help it generalize to something new. But that environment still has an objective goal. It's a game. It still has something that is good that you need to push towards and actually complete. So that's why games are the perfect environment for this, in my opinion.

Dan Shipper

That's very cool. What's the next game?

Alex Duffy

I think it'll probably—so we have Diplomacy, which is a game with an objective outcome. I think the next game's going to be like a Cards Against Humanity style, what-the-meme kind of subjective game, where you'll be able to have either—and by the time we release this, maybe we'll have it—we're in talks with an initial partnership around something like that. Would love to have—the whole point of this game, like we said, is to target weaknesses of models. Models today aren't that funny. So being able to have a game that can target that is important.

And I don't know if it's going to look like people playing with models or against them, or if it's going to be them prompting models to act and then vote on what's funny. I'm not sure yet.

Dan Shipper

I would love it to be like funny people have to get the model to say something funny.

(00:20:00)

Alex Duffy

And I think that's hopefully where we're going. If you have this idea, you can prompt the model because there's translation happening there. It looks like you're writing English and it's reading and running back English, but it's translating into the latent space and then coming back. So learning how to do that is a skill. And if you can make it—but presumably it can make any input and take any input and make any output.

Dan Shipper

Yeah.

Alex Duffy

In theory, there is a prompt that you can put in there that could solve a disease or make it funny, right? And so can you do that? And it would require reflection and somebody who's a subject matter expert. And that's why I talk so much about AI being leveraged for subject matter experts instead of it being a product in and of itself.

Dan Shipper

Yeah. I think one of the reasons I'm excited about this is I think about my nephew who's turning three tomorrow. I was hanging out with him yesterday and we were playing around, and now he's like old enough that he can play pretend, which is pretty fun. And he had this balloon and we were hitting the balloon back and forth, and I was doing the classic "the floor is lava"—we can't let it touch the floor or whatever. And he's just old enough where he can get that and know that lava's bad and we want to keep it up, which is funny.

But then I took it and I put the balloon on top of the air conditioner and it started floating. And then I showed him, if you press a button, if you press the button, it turns it off, it stops floating. And then if you turn it on—and he was fascinated. He's running back and forth to press the button and watch the balloon float and whatever. And I was just thinking about how that functions for him beyond just being super fun and that he just gets to mess around with stuff like that and then be like, what if I do this? And I feel like models are not allowed to do that because they're always just taking tests.

Alex Duffy

Yeah. We talk a lot about this. The book I'm reading right now, which was recommended to me by some of the team members at Lux Capital who put on these risk gaming events that are like mafia, but fancier—it's called Playing with Reality, and it talks about the reason why games are so helpful. You can explore with low stakes. You can try new things and then see what works. And it may be that every game is not a perfect representation or model of the world, but there are games that are pretty good ones. And there are games that you can learn a lot from.

I think I personally learned a ton from the game RuneScape. You learn how markets work, you learn how not to get scammed. You learn how to type pretty fast because you need to sell your trout. There are things that you learn, and it might not be obvious. And I'd love to, at some point later in my life, make a game that's a little more intentional with what it teaches you as you learn. But for now, I think if you look at what a game is, it's really just a system with a goal. And I think we've already seen people demonstrate that this works. And maybe you stretch the definition, just how DeepMind stretched from AlphaGo to AlphaFold. It's still a game of folding a protein, but now it solved the problem that took a PhD student all six years in 30 minutes.

Dan Shipper

Yeah. All the things you're talking about remind me of—when you said that you were reading Playing with Reality, I thought you meant you were reading another book called Playing and Reality, which is by a different guy who I love. His name is D.W. Winnicott. And a lot of the stuff you're saying reminds me of him and also Wittgenstein.

So for Winnicott, his whole shtick is being in a state of play means that you're in this mode of spontaneous, self-actualized behavior with reality. Where instead of scanning for threats and trying to figure out, how do I avoid—how do I do things in the right way—you're just being your authentic self. And he has this whole theory of what he calls transitional objects, which are basically like when a child is really little, they feel cared for, they feel safe when they're with their caregiver. And at a certain point in their development, maybe a little bit younger than my nephew is now, they develop attachments to transitional objects, which are like teddy bears.

And what they do is they project the feeling of care that they normally get from their mother or their father onto this object, and it comes to represent that feeling of care for them, and that's why they bring it around everywhere. His whole idea is that our ability to do that with transitional objects is the budding thing that allows us to be spiritual or be religious or all these ways in which we make things out in the world feel significant in this larger way—beyond just what they are. It's beyond just being a teddy bear. And yeah, I just think that's really interesting.

Alex Duffy

I think it makes me think of two things. One, just in the context of a kid and a child's mind. One of the things that is talked about in the book is this isn't a new idea of AI in games. They've been around for a long time. I think it's new in the context of language models and vision models and what we're doing right now and how we're thinking about it. But a passage talks about Alan Turing saying games are the perfect environment for it. And the reason being—and but with that said, because we want the models to learn, we should put them in a child's mind instead of an adult's mind. And so just like that, keeping the wonder of the world and the curiosity and the ability to be wrong is pretty interesting.

And then the second part of that was—and it's also mentioned that games can teach really long-horizon thinking. That you can take many different actions and then find a reward at the very end of the road.

Dan Shipper

Yeah.

Alex Duffy

And it's interesting that you mention religion and some of these other ideas where humans are very special in that they're one of the only species that can work for something that they won't see in their lifetime. Which is pretty incredible. And games aren't that, right. But I think they're practice for something like that.

Dan Shipper

Yeah. Yeah. Another thing that it seems like games might be interesting for is I think I and a lot of other AI people are starting to feel as though the lack of continual learning is a big problem for progress. And having AI need to be able to figure out and get good at a game with very few tries is a really interesting thing, too. Have you explored that at all?

Alex Duffy

Yeah. So we're actually working with a very super smart PhD from Rice and some researchers from Princeton right now who are looking at optimizing prompts based on results to learn from them. And the initial results aren't great, but then quickly there's progress. And I think that's—the initial results of the first attempt of doing that aren't great, but then you quickly see progress. They're already seeing it with current models. Yeah. And you quickly see that, but it goes to that problem space we were talking about, right? The concept is there.

And one of the things that I've learned in training, in the consulting that we've done, is if you can shift your mindset to be, no, not "oh, this model gave me a wrong answer," but "it had the wrong context"—you take so much more power back. And I think that's the right way to think about them. These models can do such incredible things that if they're doing something wrong, it might not be your fault. It might be that you need to prompt them in a weird way to get them to do that thing, but it's very likely that they can do it.

Dan Shipper

That's interesting. So are you saying that you believe this or they believe this—that we're actually, we actually might be closer to continual learning than we think because we can start at the layer of optimizing their own prompts and they're not bad at it?

Alex Duffy

I think we're both closer and further away. I'm not sure. What I'm saying, I guess, is it's a tractable problem. And I'm saying it's a tractable problem, but it requires a different skillset. Because, and I don't know, right? I'm not somebody who's doing the reinforcement learning and doing these model training runs for these biggest models. But it would seem to me that if you're able to get a model to reflect and to think about its learning, and then train on that, you'll get more of that.

Dan Shipper

Yeah.

Alex Duffy

And you should be able to prompt the model and think of tools and think of ways to get it to do that and be opinionated, be prescriptive with it, to get it to do that in a way. And maybe that has some downsides where it's going to get more narrow and do that more frequently, but then maybe you can think about another one and then you can build on top of it. And so that's why it might take longer because work needs to be done. And that's the kind of work that we're looking at doing. But so I don't know. I don't know if it's—you can—there's clear research that shows that AI can help prompt itself to get better. I think DSPy is a really cool example of something like that.

Dan Shipper

Which—DSPy—I've literally never heard anyone say it out loud. I've always pronounced it DSPy in my head. You think it's DSPy?

Alex Duffy

I think DSPy. I've heard DSPy. Oh yeah. One of those made up between the letters D, S, P, and Y.

Dan Shipper

Yeah.

Alex Duffy

It’s a cool example. And I think some of this became very clear to me in Diplomacy where when we started, the models couldn't play the game. Then we made some iterations and then you got large models to play the game. And there's some existing research that showed that large models could play. And there's cool research that self-optimized the prompt so that GPT-4 could barely play. But we put more and more work to it and we built tools to help us iterate quickly. And then we got to the point where DeepSeek, still small, can play. And that it was hard, but you learned a ton. And I just don't think—we're in this weird time where there's an opportunity cost where if you spend a lot of time to solve one problem, it better be worth it.

Because you can solve so many other things. And is it—to make it worth it in the economy is maybe tough. But I think if you make it worth it to yourself, then it's definitely worth it. And then it can have value economically. So that's how we're approaching it.

Dan Shipper

Having it be worth it to yourself, I think, is an under-explored path for entrepreneurship that is very helpful because it often takes a really long time to figure out if it's working or not. And you'd probably rather just—if it's purely an economic calculus, you'll probably give up a little too early.

(00:30:00)

Alex Duffy

I think that's a big reason why we're building this. Tyler and I both worked in startups for a while. He's been running his own consulting company for four years. And I was co-founder of AI Camp in 2021. I've been here, worked at a company called Salt that had three pivots and found product-market fit in drug discovery. But this is the first time where we're making a company from scratch. And the reason why is because we both love and think that there's a lot of value in the intersection of AI and gaming.

And it's—and not only because I truly believe that our environments are going to make models better, but also it will make people care and also less fearful. One of the things that we see in consulting is there's this knowledge gap growing. Simon Willison's written about it and so has Andrew Ng. Where people who are using these tools are the least fearful about them because they get it. They see where it falters, they see how they can use it to get better. But as people don't adopt them, whether because they're busy or they've had bad experiences with a really initial early version, or for any number of reasons, many of which are justified, then you can get fearful and angry. And so you get this gap that starts to occur.

But with games, it was so cool when we launched Diplomacy. We had almost 50,000 unique viewers hop on for a week and watch what admittedly was not a super entertaining interface. You could see them chatting and it was just panning back and forth. And I think we had a good soundtrack on it. But many of them were not AI people. They were people who came from the gaming side. And it became less scary. You could see it make mistakes. You could see it take a different strategy that you know isn't the optimal one. Or every once in a while you see it do something good. And so it becomes much more relatable. And I think games are very powerful in that way. And so that's another reason why I think that what we're doing is important.

Dan Shipper

How'd you get into games? Why do you care about it?

Alex Duffy

I think I've always learned a little bit differently than other people, and games have been one of the ways I think I've learned the most. When I was really young, one of my friends taught me multiplication in kindergarten with the beads on an abacus.

Dan Shipper

Oh really?

Alex Duffy

And so then I was in advanced math. And then at one point in elementary school, in advanced math, they put you in front of the 24 game. Have you ever played the 24 game? You've got four numbers.

Dan Shipper

I never even sniffed advanced math. I don't think that they would've taught me that.

Alex Duffy

Sure. They give you, they have this little card with four numbers around it, and you need to find some way to make those four numbers make 24, then you tap the card.

Dan Shipper

Okay. So it's like Sudoku.

Alex Duffy

Yeah. It could be like 2, 4, 6, and 12. And it's 6 times 2 is 12, and 4 is... yeah, and you figure it out. And thinking through that was totally—well, anyway. Yeah. And then, we mentioned RuneScape and a lot of these other games, and I learned by building mods. And many people who I've talked to, especially in these conversations raising money, but also at Every, at a lot of other places, some of the smartest people that I've met have had similar experiences where they played some game and got something really good out of it, where they were modding a game and then brought them into their journey.

I was on one of the first Minecraft servers ever, and some guy that I didn't know hopped on Skype for four hours to help me build a computer from scratch. Got sick. It just—you have this weird connection and I think that there's a lot of value there. I do think that there are downsides, right? Games are not real life. They can be practice, you can learn. But if you get stuck there forever, that's not good. That's why in my head, games are a good start. That was a big part behind the name.

But I do think that there's also a world where—I think there's a world where you can make a game that brings people back to reality, to a degree. I think Pokémon Go was a really cool experiment. I think if they had more of a fleshed-out game, they could have had something way bigger. And I don't know if you remember that moment in time, but it was crazy. Seeing everybody out at monuments. I was in Boston at the time, seeing massive crowds along the reflection pond where everyone was just around and doing the same thing. And that sense of connection was really special at the time. And so I think that there's just a lot—it's something special about games.

Dan Shipper

You're making me think of—I used to love games, like video games, growing up. And during the pandemic I bought an Xbox because I was like, oh, it would be cool to play Call of Duty and it's social and I'll have something to do. It was just when I first started Every, and I was stuck in my apartment, so I was lonely. And I logged in and immediately just got merc'd by an 11-year-old in Call of Duty. I just never played. But I do, I actually do really like video games and I miss playing them. I spent so much time playing Madden with my best friend growing up.

Alex Duffy

What was your top sports game?

Dan Shipper

I'd say in college there was just this constant cycle of FIFA. FIFA and NHL. And playing that a bunch. And it's funny because a lot of it's social, right? There are single-player games and a lot of people play them. But I do think a big part of it is social. Because even if it's not—even if you're playing alone, the community of other people who play that game is a big part of it. And seeing how you can do something that others haven't yet, or that you tried something new, and that you're comparing notes.

That's why I'm a little bit of—some people think that you're going to have games that are tailor-made for you, or movies that are tailor-made for you, exclusively. I'm as bullish on AI as the next guy. But I think that shared experience is so important. So if it's something that could not be experienced by somebody else, I think that's actually bad.

Dan Shipper

Yeah. Interesting. Yeah. I also, I played so much Halo growing up. Were you a Halo guy?

Alex Duffy

I got hand-me-down PlayStations from my cousin.

Dan Shipper

Okay. So you were not an Xbox guy?

Alex Duffy

I was never. Then you grow—other people have it. And Halo was such an iconic franchise. What was your top—what's your top shooter game, like first-person shooter?

Dan Shipper

Modern Warfare 2.

Alex Duffy

Okay. Yeah, that was good. That was one of those back-in-the-day, one of the eras. Yeah. And actually, in a similar way during the pandemic, I started getting a little bit back into video games. I had played some Fortnite. Great. Then it started getting real sweaty and I can't hang. But yeah, at some point you want to be able to play with friends. And then most recently, though, my headset's broken now, I started playing VR.

And it was not something I had really expected to do a whole lot of. But in the same way, around the social component, I started playing Population One, which is essentially Fortnite in VR. So you're physically ducking. You are physically reloading. You're physically moving. And two of my buddies from college were playing and you play on teams of three. So we had the perfect number of people to do it. And it became something where you come on, there's a headset built in, there's a microphone built into your headset, so you're immediately talking, you're talking to each other.

Dan Shipper

That's cool.

Alex Duffy

And it's one of the most fun gaming experiences I've ever had. You're physically in a game of Fortnite. You can only play for an hour and a half, and if you don't play for a little bit, then you start to get vertigo when you get back. It's almost like you have to get used to it. But it's pretty incredible. I'm still—jury's out on if I think VR is going to be a huge platform in and of itself, because I don't know how many people want to be fully disconnected from the real world, but it was a whole lot of fun.

Dan Shipper

Yeah. I miss gaming. I miss getting home after school and logging on to matchmaking and Halo or whatever. Or Counterstrike, or all those games. I never—I also, yeah, I never, I tried VR a little bit, but I never really got into it. I think probably because I have glasses, it's just harder.

Alex Duffy

I've heard that a lot. Yeah. I do think glasses will become—I'm wearing the Meta Ray-Bans right now.

Dan Shipper

Yeah. I use them as my AirPods. I see you all the time walking out of the office and you're talking to yourself. And I'm like, what is—why are you talking to yourself? And it's—you're on the phone on your Ray-Bans. And I'm like, what the fuck?

Alex Duffy

It's really cool. I'm a big fan and I think that more and more people will use glasses as a form factor for computing. I don't think that they're going to replace computers or cell phones. I think that there's room for both. For all three. They're very different. I personally like that they don't have a screen. I imagine that's not long for the world. I imagine they'll start—

Dan Shipper

Yeah. I thought that they were, I thought that the new one is like projecting—

Alex Duffy

They're not doing that. I imagine they'll get there. I like that it doesn't have a screen. I like that it's just—I can talk to it. I think that it would be a pretty bad experience if I started talking to someone and then they're like, "Oh, sorry. What?" because they were looking at something on a glass, on glasses. Though I expect the incentives to push it that way.

But I do think right now it's a more human piece of technology, and I think a lot about people taking pictures of their kids and their kids are imprinted on this box that's between you and them. And they're looking at it and they see you getting joy out of it, so they're imprinting there. vs. you turn this on, your hands are free, you're good. You're in the moment at a concert. You're not like this. You're just in there. I think those are—I love the concept of technology that makes us more human.

Dan Shipper

Yeah. What are you—you're the guy that, before you got really busy fundraising, all that kind of stuff, in addition to doing consulting, you were writing amazing stuff on Every, and you were the guy that if I wanted to know what was interesting or cool, you would have a really good read on what got released and whether it's bullshit or not. What are you excited about right now? I know you've been busy fundraising, so you may not have your finger on the pulse as much as normal, but yeah, I'm curious if there's anything that's exciting you, that's on your mind. All I can think about recently is games. But anything else in games that are not specifically that you're working on, but just generally is going on that you think is exciting?

(00:40:00)

Alex Duffy

GTA 6 got delayed and is coming out next year. And so it's the most expensive game that's ever been made. I think the last one came out when I was in high school or something.

Dan Shipper

Yeah. It's been over a decade.

Alex Duffy

Yeah. And these are like—this is a billion-dollar game. That's crazy. Just the cultural moment of that I think is going to be interesting. And then on the AI side, I think that maybe it's less about what's happening right now, though I would say a lot of the stuff Google's doing is really cool.

Dan Shipper

Yeah. Gemini 2.5, you're a big Google stan.

Alex Duffy

Yeah. I am a big Google stan. Demis, if you ever want to cut an angel check. But the connection of a lot of what they're doing is so interesting. And the constraints that they have, I think, are so cool because it's almost existential for their business on search. And so they want to be able to use it there, which means it has to be fast, it has to be reliable, it has to be able to go check other sources, it has to be good. So having constraints when you could do anything I think is actually helpful.

But then they also have Genie, which renders anything quickly that's experienceable. I don't know how that will be interacted with gaming. Maybe it makes up some of the most renderable—expensive—I have thought—

Dan Shipper

I actually have a thought about that, which I think you'd be into. So I had this guy—he's the CEO of Decart. Which we talked to a while ago. And they have this really cool video-to-video model where it takes any frame of video and then turns it into something that looks like a video game. And they have this thing where, for example, if you pick up a tissue box and you go like this, it turns it into a gun and shoots it. And I think that's an interesting future for gaming.

Alex Duffy

I agree.

Dan Shipper

Because right now to make GTA, you have to hand-code all of the interactions. And that's why it takes a billion dollars and many programmers to do it. And with video-to-video generative models, one, you could just generate it from live video, but two, you can vibe-code a really simple game and then you can re-skin it with generative AI to look like a AAA game. And I think it lowers the barrier to making awesome games to almost anyone now, which is really cool.

Alex Duffy

I think that not only does it do that, but it also lets you do things that were otherwise super computationally intensive. Like ripples on water or a reflection of light without having to run it at all. So I could definitely see that. That's cool. And that's related—I think generally, they're doing that, but they're also doing a lot of different things that are cool in AI. And I think one of them, life sciences, I think it's a really underappreciated world and it's one that I was fortunate to be deeply involved in, getting to work with the Ellison Medical Institute and others in my last startup.

Like I mentioned earlier, but AlphaFold literally took something that we had PhDs taking six years to do and turned it into something that takes 20 minutes. And as far as where I think AI is having near-term impacts, I think it's software, life sciences, and education. Those are three—today, having massive impacts. And I love robotics. I used to work at Amazon Robotics. I'd love for that to get there. Self-driving seems to be at a precipice. I'm a huge Waymo fan. Take them all the time. But those three are right now seeing huge impacts because they're perfectly—their problem's perfectly suited for AI.

Software—we have compilers. We wrote the code. We know what would render or not. It's a solvable problem. It's great for language models. It's great to just do reinforcement learning on just code and then also on Diplomacy as code. So when there's new things, it can generalize. Life sciences—there's a ton of information out there. We just need people with subject matter expertise to combine them and look at these different interactions and to find ways to simulate these processes. There's a real chance that in the near term people find a way to turn dollars into longevity.

And then on the third side, education. You talked about being excited about your nephew learning about these, starting to enter that world of learning. And I think it's going to be really interesting to see. I don't know for sure. I got—I love that I got to interact with so many high school and college students at AI Camp when they were going through that learning journey. I think it's tough to be a high schooler right now in the world. And the education system hasn't caught up yet. And there's this huge incentive to use AI to do your work, but then do you really learn about it? So then what do you care about? It's tough.

The generation afterwards, the generation who's going into their "why" phase that can have AI to answer those questions—they're going to be so smart. And what that intelligence looks like, I don't know. But to be able to constantly be exploring and to get answers—and maybe there's probably negative externalities from that. Yeah, definitely. But there's also probably a lot of positives.

Dan Shipper

It's really good, I think. I just remember being in fourth or fifth grade and being like, I want to write a novel. And people were like, what are you talking about? And why has it been so hard? It seems easy.

Alex Duffy

I don't know if you had that experience, but I was like, oh, you just write it and it's good.

Dan Shipper

What—so having AI to answer all my questions and stuff, I think would've been just fantastic. Okay. So that's the stuff you're excited about. What's the most overrated thing or what pisses you off in AI right now?

Alex Duffy

I don't think a lot pisses me off because it is pretty important. So people talking about it I think is probably good. There's definitely people shilling things that aren't really going to solve your problems. I think the thing that maybe I worry about the most is in the same vein of education. We now have this leverage that makes somebody who's an expert in something 10 or 100 times more powerful. But it also does the work of someone who was a junior in that field. So how do you bridge that gap? How do you financially incentivize somebody to learn and make mistakes and get better, knowing that this tool is here to keep pursuing there?

The maybe overblown, but it seems like the job numbers of people graduating college are getting crushed right now. And I imagine a part of that is because you don't need people to do that blocking and tackling that you needed before. And paying them to do so is a big cost. And so what does that look like?

Dan Shipper

I am so not worried about that, which is interesting. And I think actually you're one of the people that made me not worry about it. I use this anecdote a lot and I don't think I've ever used it to your face. I'm curious to tell you how you've impacted how I think about this.

When you joined Every and you said you wanted to write and you wrote your first piece, it wasn't good. And it was not good to the degree that we could not have worked with you without AI. And what was really interesting is—and I've worked with a lot of young writers and so I can tell pretty quick—your rate of progress, like every time we talked, you recorded it, you made prompts, and you never made the same mistake twice. And so your rate of progress was—within three or four months, you had made a year or two years' worth of progress. And that just kept happening.

And so let's assume that the job numbers are down for young people because, actually because of AI, because people are not hiring them—that is a gigantic management mistake that companies will begin to correct as soon as they realize that a 23-year-old with ChatGPT is fucking cracked. And if you give them any amount of mentorship, they're going to do amazing stuff that they never could have done before.

And I think there's the question about they're not actually learning the underlying skill if they just have the AI do it. They are, because they have to. Because if the AI messes up and they care about it messing up—and they should, because that's the way to do a good job—they're going to go in and learn this stuff, and they have a great tutor to help them figure it out.

So I feel extremely excited for young people, and I think to the extent that managers are not hiring them, that's on them. And they will figure that out pretty soon because they'll be like, oh my God, I hired this 23-year-old and it totally changed my whole business. My dad is like this. He's downstairs right now. He owns a few cemeteries in Indiana, and he has this 23-year-old who just completely changed his entire business. And so I think it's going to flip from there. Maybe it's there right now, but I think it'll flip from there to mid-career folks pretty soon. Because I think the kids are going to be all right.

Alex Duffy

Yeah. Yeah. The thing I agree with is that I think that probably the solution to this is some form of an apprenticeship. Where you're able to quickly learn about something and you're doing something that you care about, therefore you will spend the time on it. And therefore you will care about what it is that makes it good or not. If you don't care about it, that's going to be hard.

The counterpoint is I don't think that without my experience on the training and the AI and that side of things that I brought into the Every side, I would've had the chance to do it. So what is that skill that they're being brought in to do, besides the raw material, besides the ambition, the ability, and just doing that when you're comparing them on the market with people who do have that? And maybe it becomes the people who are on the market don't have that hunger. It's literally just—

Dan Shipper

I'm hungry and I'm willing to try new things instead of the things that are currently being done. And I think that's the—

Alex Duffy

But if all things equal, if you have somebody who's hungry and doesn't have experience versus someone who's hungry and does—

Dan Shipper

I would take the one that doesn't have experience because the person who has experience, their experience is wrong. Because the whole landscape just changed and it's really hard to get someone who's already in their career and knows how they do things. You know this because this is what we do—we take people who are mid-career and we train them how to do something else. And it works, but it's hard. What's easier is someone who's hungry and hasn't done it before and doesn't have a whole set of things they have to unlearn and is just going to figure it out.

Alex Duffy

Maybe.

(00:50:00)

Dan Shipper

Yeah. We'll see. I hope so. We'll see. I hope so. But that's something I spend a lot of time thinking about. Yeah. I feel you. I think it's very important that it's not all going to be rosy and there are—anytime there is technology shifts, there are downsides and trade-offs.

Alex Duffy

Yeah. No, and yeah, I think another example of that, right, is big companies who are going to be able to do more with less. And you may see them—and are seeing some in some industries already—cut headcount. But to your point, if you cut too much headcount and then you realize, oh man, we could have just done way more with those people, then that's a mistake. And then you start seeing groups of those people be able to do way more on their own and now compete on these niches and then take away parts and create many more startups that have ever existed. So I'm excited. It's going to be a little rocky, but I'm excited.

Dan Shipper

Reality is typically rocky, so indeed.

Alex Duffy

Much rockier than games.

Dan Shipper

All right. This is awesome. So if people want to find you, want to participate in your tournament, want to just generally follow along with what you're doing, where can they find you?

Alex Duffy

GoodStartLabs.com. @GoodStartLabs on Twitter. I'm at @AlxAI on Twitter. And you can read my writing on Every.

Dan Shipper

Amazing, Alex. Thank you.

Alex Duffy

Thanks, Dan.

Thanks to Scott Nover for editorial support.

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn, and Every on X at @every and on LinkedIn.

We build AI tools for readers you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Get paid for sharing Every with your friends. Join our referral program.