
When humans make requests of their AI assistants, what matters isn’t merely what they ask but often how. That’s the central premise behind chain of thought prompting, a method for getting the most out of ChatGPT or another chatbot. In the latest installment of Also True for Humans, Michael Taylor’s column on working with AI tools like you would work with humans, he dives into how and why this method works, why we’re not all that different from our machine counterparts—and what the number of piano tuners in New York City has to do with any of this.—Kate Lee
ChatGPT writes faster than we can read—but is the output worth reading?
When I was writing this article, I worked with my editor to plan the outline and make sure I had a compelling pitch. So why do most people expect ChatGPT to “write a blog post on X” without taking the time to think?
AI does a better job when it’s prompted to make a plan first—just like humans do. Most AI applications have one or more research and planning steps, a technique called chain of thought (CoT). It’s an order of operations for the model to reason through a problem before answering.
When you’re getting mediocre results from AI, it’s often because you haven’t allowed the AI to plan sufficiently. Applying the chain of thought technique can result in an immediate boost in performance.
Let’s look into the science behind chain of thought prompting and how to get AIs to think through their answers before responding. It’s one of the easiest ways you can improve your prompts to get more sophisticated results.
Giving the AI time to ‘think’
If you asked me what my favorite band was, I’d immediately answer with Rage Against the Machine (RATM). I’d be responding instinctively and emotionally, recalling how their unique blend of anti-establishment rap rock made me feel as an angsty teenager, and all the good times I had playing bass guitar in a cover band called The Machine Rages On.
If you asked me a more complicated question, “What year did RATM’s song ‘Killing in the Name’ hit number one on the music charts over Christmas?” I’d hazard a guess that it was 2014, because it feels like it was about a decade ago. The song reappeared on the charts only after a public campaign to block the winner of The X Factor from taking the top spot, but I couldn’t remember when exactly that was.
However, if I took the time to think, I could work it out:
- It was after I left college in 2008, but it couldn’t be long after.
- I remember stopping at a gas station to buy the CD, so I wasn’t living in London yet and still had my car.
- Using that information, I can narrow down the year to sometime between 2008 and 2010.
By thinking step-by-step through what I remember and applying some reasoning, I could identify the year as 2009. This thinking process doesn’t give me any new information, but it helps me work through the information I already have to reach a better answer than I would by guessing. In retrospect, my initial guess was pretty far off. A brief pause made a world of difference.
Source: Know Your Meme.
These two modes of thinking are referred to as “System 1” and “System 2” thinking in Thinking, Fast and Slow, the late psychologist Daniel Kahneman’s book about how people make decisions. In the book, he differentiated between the fast, automatic, and intuitive System 1 thinking we use for most everyday decisions, and the slow, deliberate, conscious System 2 thinking we reserve for complex problem-solving and analytical tasks.
Source: The Decision Lab.
As they’re built today, AIs are System 1 thinkers. They respond in an instant, using patterns in their training data, and frequently get things wrong. By default, they aren’t thinking things through like you would when tasked with a problem that needed careful thought. They’re impulsive: When you give them a task, they jump right in instead of planning.
Human brains are flawed in many ways, but one recurring theme in AI research is that we’ve tried to replicate how our brains work. Artificial neural networks were modeled on how our biological neural networks behave, and the attention mechanism of the transformer model that powers ChatGPT was inspired by how humans pay attention to the most important parts of a sentence when processing its meaning. It stands to reason that emulating the division of labor between System 1 and System 2 would pay off for AI as well.
Chain of thought is simply a way of implementing that slow, deliberate, not-yet-conscious System 2 thinking in AI models. For example, an AI content writing startup I advise called Adaptify works in a similar fashion to my own AI writing system, in which I first prompt ChatGPT to search the web to research the topic and plan an outline before any writing takes place. This process is more expensive because it uses more tokens (a token is roughly three-fourths of a word) in API calls, and it takes longer because the model writes out its plan before the final answer, but it leads to significantly better results.
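As a rough illustration, here is a minimal sketch of that plan-then-write pattern using the OpenAI Python SDK. The model name, the prompts, and the two-step split are illustrative assumptions of mine (and the sketch skips the web-research step), not Adaptify’s actual pipeline.

```python
# Minimal plan-then-write sketch (illustrative assumptions, not Adaptify's pipeline).
# Requires the `openai` package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send one prompt to the model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

topic = "how chain of thought prompting improves AI writing"

# Step 1: research and plan an outline before any writing happens.
outline = ask(
    f"You are planning an article about {topic}. "
    "List the target reader, the key points to cover, and a section-by-section outline."
)

# Step 2: only then write the draft, grounded in the plan.
draft = ask(f"Write a first draft of the article following this outline:\n\n{outline}")
print(draft)
```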
CoT is such an effective and commonly used technique that OpenAI has a section in its prompt engineering guide called “Give the model time to ‘think,’” advising users to guide its deliberations. It can be as simple as appending, “Let’s think step by step,” to the end of a prompt, or you can provide examples of the reasoning steps you want it to take when working out the solution.
Source: OpenAI.
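The zero-shot version of this is nothing more than a suffix on whatever prompt you already have. Reusing the hypothetical `ask` helper from the sketch above:

```python
def ask_with_cot(question: str) -> str:
    """Zero-shot chain of thought: nudge the model to reason before answering."""
    return ask(f"{question}\n\nLet's think step by step.")

print(ask_with_cot(
    "What year did 'Killing in the Name' hit number one on the UK charts at Christmas?"
))
```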
Why chain of thought works
Rohit Krishnan, author of Building God: Demystifying AI for Decision Makers, deftly demonstrated how CoT works using the common problem of getting ChatGPT to count letters in a word. Because it doesn’t see letters (only tokens, which are, again, approximately three-fourths of a word), it gets it wrong unless prompted to “count step by step.”
Source: X/Rohit Krishnan.
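You can see the root cause directly with OpenAI’s tiktoken library: the model receives integer token IDs rather than individual characters, so letter-counting is genuinely hard for it unless it writes out the reasoning. A quick sketch (the exact token split varies by model):

```python
# Inspect how a word is tokenized (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-3.5/GPT-4
tokens = enc.encode("strawberry")

print(tokens)  # a short list of integer token IDs, not ten letters
for t in tokens:
    # show the text fragment each token ID stands for
    print(t, repr(enc.decode([t])))
```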
The technique was introduced in a 2022 paper, which showed a remarkable four-times-higher accuracy over standard prompting on the GSM8K benchmark, a set of 8,500 grade school math problems. Researchers found that CoT reasoning is an emergent property of larger language models. It doesn’t work below a certain “intelligence” level but becomes significantly more consequential as you scale the model from 8 billion to 175 billion parameters—about the level of OpenAI’s GPT-3.5 model.
Source: arXiv.
Clocking such a huge win on existing models without having to buy more Nvidia GPUs was a boon for builders of AI applications, and many papers have since replicated the results. In another paper, researchers found that adding, “Let’s think step by step,” before each answer improved accuracy across a wide range of domains, from arithmetic to symbolic reasoning and other logical reasoning tasks, even without providing examples of the task in the prompt. My personal favorite is a study that uses an LLM to write its own prompts in order to optimize performance on a task. It found that adding, “Take a deep breath and work on this problem step by step,” increased accuracy on established evaluation benchmarks by a further 11.6 percent. LLMs are weird, but sometimes they need little tips and assurances, just as we humans do.
Source: arXiv.
What’s the blast radius of a secret nuclear test?
Early in my career, I was interviewing for a job when the interviewer asked me, “How many piano tuners are in New York City?” Nobody actually knows the answer without looking it up, and even Googling won’t produce a precise result. What they really wanted to see was my ability to think, break the problem down, make assumptions, and then apply logic to arrive at a reasonable conclusion.
Google famously banned the use of these questions in its hiring process for being “too tough and unfair,” but I think it’s because the company failed to explain to applicants why questions like, “How many manhole covers are there in San Francisco?” or, “What would you charge to wash all of the windows in Seattle?” or even, “How many golf balls can fit in a school bus?” were relevant to working at a search engine.
Most people don’t know that these questions, called “Fermi problems,” date back to the Manhattan Project, where renowned physicist Enrico Fermi estimated the strength of the first atomic bomb test in New Mexico in 1945. He dropped small pieces of paper when he felt the shockwave of the explosion. By observing how far they were blown away, he did a back-of-the-envelope calculation and reasoned that the bomb’s yield was around 10 kilotons of TNT—highly classified information at the time.
The counterintuitive insight is that by breaking down a complex problem into smaller, more manageable parts, making educated guesses about the values of key variables, and combining these estimates, you can arrive at a reasonable approximation of the correct answer. You can think of Fermi estimations as CoT for humans.
Let’s use CoT to solve one of these Fermi problems with AI—one of the questions Fermi famously would ask: “How many piano tuners are in New York City?”
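Before handing the question to a model, it’s worth seeing what the decomposition looks like by hand. Every number below is a rough guess of mine, chosen only to show the structure of the estimate:

```python
# Fermi estimate: piano tuners in New York City.
# Every input is a rough assumption for illustration, not real data.
population = 8_500_000             # people in NYC
people_per_household = 2.5
households = population / people_per_household

share_with_piano = 0.05            # guess: 1 in 20 households owns a piano
pianos = households * share_with_piano

tunings_per_piano_per_year = 0.5   # guess: tuned every other year on average
tunings_needed = pianos * tunings_per_piano_per_year

tunings_per_tuner_per_year = 3 * 250  # guess: 3 tunings a day, 250 working days
tuners = tunings_needed / tunings_per_tuner_per_year

print(round(tuners))  # ~113 with these guesses, i.e., the right order of magnitude
```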
OpenAI knows all about CoT and Fermi estimations, so when you ask ChatGPT this question, it jumps to the Fermi estimation and provides a reasonable result of 240 piano tuners. Note that I’m using ChatGPT Classic, so it doesn’t have the ability to search the web and look up the answer on Yelp or elsewhere. For reference, I counted about 40 business listings on Yelp, so the model’s estimate isn’t far off if you assume multiple tuners per shop, as well as some not being listed.
Source: Chatbot screenshots courtesy of the author.
To try out CoT, we need a different model that hasn’t been told to use CoT to solve this problem, so I asked Meta’s open-source Llama 3. This is an 8 billion parameter model, so it’s right around the size that reasoning begins to emerge in a model and CoT becomes a useful strategy.
Not only does it fail to make an estimate, but it makes up an answer pretending it can run a search query—which we know it can’t do, as it doesn’t have access to the internet.
Source: Author’s screenshot/Llama 3 on Groq.
We can make a dumb model a lot smarter by using CoT. If all we do is append, “Let’s think step by step,” to the end of the prompt, we get a radically better answer without the hallucination you might expect. It’s still quite far off in its assumptions compared to ChatGPT—I doubt there are 2-4 million pianos in New York City. But now that we can see its thinking, we can edit those assumptions to make a better calculation. If we decrease the assumption in step 3 from 3 million to 180,000 pianos, applying the same ratio, we’d arrive at an estimate of 100 piano tuners—very close to my own estimate.
Source: Author’s screenshot/Llama 3 on Groq.
There is a risk, though: This question is a commonly used example of a Fermi problem, so perhaps Llama is parroting back to us what it saw training on internet data (including in an old blog post of mine). It’s a common problem in LLM evaluation, where the question and answer have leaked into the training data, making it more like an open-book exam than a real test of reasoning ability.
Let’s try a new Fermi problem that we’ve made up: “How many bananas can fit in a Toyota Prius?”
Source: Author’s screenshot/Llama 3 on Groq.
While I don’t know how many bananas can fit in a Toyota Prius, Llama 3 is exhibiting use of the CoT technique as we had hoped. We can check the assumptions to ensure the estimate is reasonable: For example, a Prius has 91 cubic feet of interior space, which is 2.58 cubic meters—close to Llama 3’s estimate of 2.75. Whether we agree with the assumptions or not, we can at least debug the model’s thinking and make corrections if needed.
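That kind of sanity check is easy to reproduce. Here is the arithmetic spelled out; the interior volume is the figure cited above, while the banana size and packing efficiency are guesses of mine:

```python
# Sanity-check the banana estimate. Banana size and packing are rough guesses.
CUBIC_FEET_TO_M3 = 0.0283168

prius_interior_ft3 = 91                          # interior volume cited above
prius_interior_m3 = prius_interior_ft3 * CUBIC_FEET_TO_M3
print(round(prius_interior_m3, 2))               # ~2.58 m^3, close to Llama 3's 2.75

banana_m3 = 0.00035       # guess: bounding box of an average banana
packing_efficiency = 0.6  # guess: bananas don't stack neatly
bananas = prius_interior_m3 * packing_efficiency / banana_m3
print(round(bananas))     # roughly 4,400 with these guesses
```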
Intelligence too cheap to meter
If CoT is so effective, why don’t we implement it everywhere? For the same reason we humans make most of our decisions with System 1—thinking time is expensive. We don’t need to overthink everything we do because most of the time, decisions are low-stakes and intuitive answers work well enough. Not only does CoT cost you more tokens (you pay per token when using the GPT-4 API), but it also slows down the completion of the task while you wait for it to output its thinking first.
Fortunately, as models have gotten bigger and CoT has become more effective, these large models have also gotten orders of magnitude cheaper. The price of the best model available seems to halve every six months—GPT-4o is almost 50 percent cheaper than the previous model, GPT-4 Turbo, which in turn was about half the cost of the original GPT-4. Given competition from Google, Anthropic, and open-source models, it’s reasonable to expect the price to keep decreasing. As former OpenAI researcher Leopold Aschenbrenner notes in his Situational Awareness series of essays, it may soon become practical for LLMs to spend the equivalent of multiple months of human thinking time on each query, if models continue to scale by orders of magnitude (OOMs) in size.
Assuming a human thinking at about 100 tokens per minute and working 40 hours per week, translating “how long a model thinks” in tokens to human-time on a given problem or project. Source: Situational Awareness.
Expect further advancements in this area, such as new prompting techniques and training methods like Quiet-STaR (Quiet Self-Taught Reasoner). The latter is a training technique in which LLMs learn to generate rationales at each token to explain future text, improving their predictions. There are rumors that OpenAI’s GPT-5 will demonstrate a huge gain in reasoning ability using a similar technique. It’s not hard to believe that with the billions of dollars of funding flowing into the AI market, progress will continue to be made on improving reasoning ability, which will unlock more advanced use cases.
Better reasoning is the difference between your prompting ChatGPT as an assistant and ChatGPT prompting itself as an independent agent that you collaborate with in Slack. As the cost of intelligence goes down, throwing hundreds or thousands of hours of thinking time at even mundane problems will become cost effective. We’ll spin up multiple AI agents to solve a problem, each with its own speciality and ability.
This is the same kind of leverage we get as humans when we build a company and hire hundreds or thousands of people to work on solving a problem together—except with AI, you can potentially achieve that scale and impact as a one-person company.
Michael Taylor is a freelance prompt engineer, the creator of the top prompt engineering course on Udemy, and the coauthor of Prompt Engineering for Generative AI. He previously built Ladder, a 50-person marketing agency based out of New York and London.
To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.