Transcript: ‘How to Win With Prompt Engineering’

‘AI & I’ with PromptLayer founder Jared Zoneraich

The transcript of AI & I with Jared Zoneraich is below.

Timestamps

  1. Introduction: 00:01:08
  2. Jared’s hot AGI take: 00:09:54
  3. An inside look at how PromptLayer works: 00:11:49 
  4. How AI startups can build defensibility by working with domain experts: 00:15:44
  5. Everything Jared has learned about prompt engineering: 00:25:39
  6. Best practices for evals: 00:29:46
  7. Jared’s take on o1: 00:32:42
  8. How AI is enabling custom software just for you: 00:39:07
  9. The gnarliest prompt Jared has ever run into: 00:42:02
  10. Who the next generation of non-technical prompt engineers are: 00:46:39

Transcript

Dan Shipper (00:01:08)

Jared, welcome to the show.

Jared Zoneraich (00:01:09)

Thanks for having me.

Dan Shipper (00:01:10)

So for people who don't know, you are the cofounder and CEO of PromptLayer. And how do you describe PromptLayer these days?

Jared Zoneraich (00:01:16)

Yeah, we are a prompt engineering platform. So, we came out about a year ago—a year-and-a-half ago, actually. Right around when ChatGPT came out—not too long ago.

Dan Shipper (00:01:24)

I remember that.

Jared Zoneraich (00:01:26)

It was an internal tool, originally. We just put it on Twitter and people liked it. And we thought, maybe we'll do this.

Dan Shipper (00:01:33)

Kind of early Slack vibes a little bit.

Jared Zoneraich (00:01:36)

Yeah, exactly.

Dan Shipper (00:01:39)

So I guess you've been around for a year-and-a-half. So the prompt engineer is not dead yet? Have reports of the death of the prompt engineer been exaggerated? Or where are we in the timeline?

Jared Zoneraich (00:01:48)

I think so. I think so. This is a common question we hear a lot: Is prompt engineering going to be automated? What's going to happen here? I'll tell you my take on it. So, we're focused on building the prompt engineering workflow. How do you iterate on these things? How do you make the source code of the LLM application better? There are three primitives we see here: the prompt, the eval, and the dataset. You can automate the prompt, but you still have to build the eval and dataset. You can automate the dataset; you still need the eval and prompt. You could take out one of these elements from the triangle, but at the end of the day, our core thesis, our core theory, is that prompt engineering is just about putting domain knowledge into your LLM system. Whether you have to say please and thank you to the AI—that'll probably go away. But you still need to iterate on this core source code, as we could call it.
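The prompt/eval/dataset triangle Jared describes can be sketched as plain data plus a scoring function. Everything below is illustrative—these are not PromptLayer's API or the speaker's actual code, and the model call is a canned stand-in so the sketch runs offline:

```python
# A minimal sketch of the three primitives: the prompt, the dataset, and the eval.
# All names here are made up for illustration.

# Primitive 1: the prompt (a template with a slot for user input).
PROMPT = "You are a helpful math tutor. Answer concisely.\n\nQuestion: {question}"

# Primitive 2: the dataset (sample inputs paired with expected behavior).
DATASET = [
    {"question": "What is 7 * 8?", "must_contain": "56"},
    {"question": "What is 12 / 4?", "must_contain": "3"},
]

def call_model(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers so the sketch runs."""
    if "7 * 8" in prompt:
        return "7 * 8 = 56."
    if "12 / 4" in prompt:
        return "12 / 4 = 3."
    return "I'm not sure."

# Primitive 3: the eval (a score over the dataset for a given prompt).
def run_eval(prompt_template: str, dataset: list) -> float:
    """Fraction of dataset rows whose output contains the expected text."""
    passed = 0
    for row in dataset:
        output = call_model(prompt_template.format(question=row["question"]))
        if row["must_contain"] in output:
            passed += 1
    return passed / len(dataset)

score = run_eval(PROMPT, DATASET)
print(score)  # 1.0 with the canned model above
```

The point of the triangle: automating any one corner (say, generating the prompt) still leaves you building the other two, because they encode the domain knowledge.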

Dan Shipper (00:02:44)

Okay. So probably you're not going to have to be like, hey, like I'll tip you $5,000 in a couple of years, that thing is going to go away. But the sort of core primitives that you're talking about—so the prompt, the eval— Can you make that concrete for me? What do you mean by those things? Give me a specific example.

Jared Zoneraich (00:03:46)

Yeah, let’s talk about it. So let's say I'm building an AI math tutor. Pre-AI, there were a lot of math tutors you could go to—some good, some bad. There are almost infinite solutions to this problem. So there's never going to be one prompt that rules them all, where you say, build an AI math tutor, and it makes you the only solution. You can imagine 30 companies competing for this. So, when I was talking about those primitives: when you're building this, you have the prompt, or multiple prompts, for how you respond to each question the student asks. You have the ways to test that and the sample data you're testing it on. But all of those are just a more technical way of saying you have the actual knowledge the math tutor has.

Dan Shipper (00:03:57)

That's interesting. You're making me think about— I love that there's no one prompt to rule them all. There's no one dataset to rule them all—even in a particular domain. And the thing it made me think of is— Do you know the philosopher Leibniz? I'm sorry to take it to Leibniz.

Jared Zoneraich (00:04:19)

No.

Dan Shipper (00:04:20)

Okay. So he was around in Newton's time. He also invented calculus—they fought over that. But one of the things Leibniz was really into was the idea of creating a language where you could only say true things. The syntax of the language made it so that you can only say factually true things. And so if people had a difference of opinion, they would just say, write it in the language and then calculate, and get the answer. Just like if you have a question about how far a cannonball is going to travel, you can use calculus to figure that out—same thing for statements of fact or statements about the world with this hypothetical language. And obviously he never figured it out.

And the reason is because when you try to produce something that's so all-encompassing or totalizing of truth, it gets really brittle, it gets really hard. A lot of early AI attempts were sort of like that, and I think there's a thinking error in a lot of what people think of when they think, okay, what's the future of AI? Oh, I'm just going to tell it to be a math tutor, and it's going to be the best possible math tutor. It's the same kind of thing. What that means is incredibly different in many different contexts. And so there's a lot of room for many different attempts at finding the truth, or representing the truth of what is best, in any given scenario.

And I think you're kind of, you're getting at that with your answer of, yeah, we're going to have to do prompt engineering because there's going to be 30 different ways to make a good math tutor and who knows which one's going to be the right one for which person.

Jared Zoneraich (00:06:06)

Yeah, no, I mean, forget about the prompt. In this world of, we have the best AGI possible, it’s— I like that reference a lot because— How do you even define the problem to solve? That's the hard part. Let's assume you had millions of dollars. You just got a big VC round to start a new company. What would you build knowing AGI is going to come? The hard part is the problem. You would start working on what problem you are going to solve, because even with the best tool, you need to define the exact scope of the problem. And that's kind of the irreducible part.

Dan Shipper (00:06:40)

That's really interesting. And I think we're so used to building things being expensive that finding the problem is this invisible thing that everyone has to do. I don't know if this is your first company. I assume it's not. Every is not my first company. You spend freaking years trying to figure out what to solve. And the building is expensive and hard, but finding the right problem is quite a bit harder. And I guess you could sort of posit, okay, the AGI is going to figure out the problem for you to solve. But, I don't know. That requires a lot of real-world interaction and creativity and a viewpoint. I don't know. That feels quite a bit more complicated than building software.

Jared Zoneraich (00:07:25)

Well, this is my first company, actually.

Dan Shipper (00:07:30)

Oh, it is? Oh, congratulations. You're doing great.

Jared Zoneraich (00:07:32)

We had a small product before this that we ended up pivoting from. But yeah, there's a simple example I like to think about. If I wanted to build an AI secretary that booked flights, and I had a conference in Japan, there are a lot of different possible solutions. Do I want an aisle seat? Do I want a window seat? Do I prefer a nonstop over business class with a stop? And these are the differences that mean there's no one solution. I love the word irreducible here. And I stole it from— Stephen Wolfram wrote a few articles on ChatGPT.

Dan Shipper (00:08:09)

Like, computational irreducibility?

Jared Zoneraich (00:08:11)

Yeah. Yeah. I love that. I love that theory. And I think the irreducible part—

Dan Shipper (00:08:14)

Can you define that for people who are listening and haven't heard that before?

Jared Zoneraich (00:08:19)

Totally. I'm not going to do as good of a job at defining it, so I encourage everyone to look it up and read the original.

Dan Shipper (00:08:25)

Use ChatGPT to look it up, if you're listening to this or watching this.

Jared Zoneraich (00:08:29)

I would define computational irreducibility as: When you're solving a problem, you can collapse a lot of parts and speed up a lot of parts of solving the problem, but there's one part that you'll never be able to collapse. I like to think of when in school in math class, you're— What was it? Factoring, right? When you're kind of taking things out into the parentheses. But there's always that irreducible part that you can't factorize or you can't simplify. And in this case, we’re saying if you have an amazing AGI that can solve any problem, the hard part is: What do you even tell it to solve?

Dan Shipper (00:09:11)

Yeah, I think another way to think about computational irreducibility is if you imagine the world to be like a big computer—or the universe to be a big computer—there are certain operations that you actually have to run the computer to find out what happens. You have to run out the reality of the system in order to know. You don't have calculus to be like, okay, I know where it's going to be in 10 steps because the universe doesn't know yet. And I think that's super cool.

Jared Zoneraich (00:09:35)

Yeah. Halting problem-esque.

Dan Shipper (00:09:37)

Yeah. Okay. So you kind of have this pluralistic framework of the future of AI. How did you come to that?

Jared Zoneraich (00:09:51)

Well, it depends, I guess. What are you referring to as pluralistic—multiple models or multiple stakeholders?

Dan Shipper (00:09:52)

Multiple approaches to any given problem.

Jared Zoneraich (00:09:54)

Yeah, I think it's just observation. Honestly, I'm a hacker. That's my background. I'm not necessarily a researcher and not necessarily— To be honest, and this might be a controversial thing, AGI kind of doesn't interest me—whether AI is going to take over.

Dan Shipper (00:10:15)

I love that. You're the first person to say that on this show. Hot take alert.

Jared Zoneraich (00:10:20)

It is. And I get into it. We have a lot of debates on it. When we interview someone, we usually take them to lunch and ask them what they think about AGI. Just curiously, just to get their takes.

Dan Shipper (00:10:28)

What's the bad response?

Jared Zoneraich (00:10:30)

There is no bad response.

Dan Shipper (00:10:31)

There is no bad response. So why do you ask?

Jared Zoneraich (00:10:35)

I’m curious what they think. We have a lot of different ideas. But what I'm most interested in is not so much: Is AI going to take over the world? Is it going to kill us all? I'm interested in: How do you build with it? I think it's just a really cool technology. And, as a hacker, I am like, my mental model of AI is not the reasoning part as much as language-to-data and data-to-language. And how much does that open for us? So that's how I get here, it's just a tool in the toolbox. We have a lot of tools in the toolbox and what can you solve? In the same way that my non-tech friends will ask me, why does Facebook have engineers? The site works for me, right? Why do we have to hire more engineers? I think it's the same thing. I look at AI the same way. You're going to always be iterating. There's so many degrees of flexibility in these things.

Dan Shipper (00:11:29)

You said earlier, before we started recording, that one of the things you're kind of excited about or thinking about a lot is the non-technical prompt engineer. Can you just open that up for me? What does that mean? And how did you start knowing that was a thing you wanted to pay attention to?

Jared Zoneraich (00:11:49)

Yeah, so when we launched PromptLayer, as I was saying, it was kind of an internal tool we built for ourselves, and people started liking it. So, we didn't have this idea when we started, but I'll tell you, there was kind of a light-bulb moment I look back on. It's this team, Parent Lab. They're building a very cool app—basically an AI parenting coach to help you be a better parent. We got on a call with them for feedback on our product about a year ago and asked them, how do you like PromptLayer? What can we improve? But the most interesting thing is that someone on the call, one of their prompt engineers, was a teacher—15 years as a teacher, not technical at all. And we're like, what are you doing? Why are you using PromptLayer? What's going on? And she explained to us that the engineers had set it up for her. She would go onto PromptLayer, edit the prompts, and then pull up the app on her phone, and it would pull down the latest version of the prompt. And from there, the gears started turning. And where we're at now—the core thesis of what we're building with PromptLayer itself—is that you're not going to win in the age of GenAI by hiring the best engineers. You're not going to build defensibility through machine learning, for most companies. You're going to win by working with the domain experts who can write down that problem, as we were talking about earlier—who can define the specs of what you're solving. And in cases of building AI teachers, AI therapists, AI doctors, lawyers— I'm an engineer. I don't know how a therapist should respond to someone who's depressed, right? I'm not the right person to be in the driver's seat of building that. Does that make sense?
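The decoupling Jared describes—a non-technical editor updating prompts without an engineer redeploying—can be sketched as a versioned prompt registry the app queries at runtime. To be clear, this is not PromptLayer's actual API; every name below is invented for illustration:

```python
# Illustrative sketch only: prompts live in a registry the app fetches at
# runtime, so an editor can publish a new version without any code deploy.

class PromptRegistry:
    """In-memory stand-in for a hosted prompt store with version history."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of template strings

    def publish(self, name: str, template: str) -> int:
        """Editor saves a new prompt version; returns the version number."""
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name])

    def latest(self, name: str) -> str:
        """App pulls the newest version on each request."""
        return self._versions[name][-1]

registry = PromptRegistry()
registry.publish("parenting-coach", "You are a parenting coach. Question: {q}")

# The teacher edits the prompt in the registry; the engineers never touch it.
registry.publish(
    "parenting-coach",
    "You are a warm, supportive parenting coach. Suggest non-violent, "
    "age-appropriate strategies. Question: {q}",
)

# The mobile app simply asks for the latest version at runtime.
prompt = registry.latest("parenting-coach")
```

The design choice is that the prompt is treated as content, not code: version history gives you rollback, and the app always renders whatever the domain expert last published.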

Dan Shipper (00:13:27)

Yeah, I think it makes sense. Can you give me a concrete example of what this teacher was prompting? I don't really understand what she's going in to change, and then what changes she's able to see in the app.

Jared Zoneraich (00:13:36)

Yeah, so when I say she's prompting, she's editing the prompt that powers the app. So maybe they're opening the app, and a user could be saying, how do I discipline my child? I need to respond very specifically to that. And maybe it was saying, I'm a language model trained by OpenAI, I don't support any discipline—and then she needs to go into it and say, okay, in that use case, let me make sure that I respond— I don't have children, so maybe I'm not the best one to say how it should respond, but she makes it respond in the right way.

Dan Shipper (00:14:14)

Smack them.

Jared Zoneraich (00:14:15)

Yeah. Hit them with a belt. 

And you don't want to ruin all the other cases, though, where they're asking about what to feed them. You shouldn’t also hit them with the belt. So it's kind of that systematic process. And it's just an iteration. We really believe prompt engineering is about closing the feedback loop: How do you iterate as quickly as possible?
