
Editor's note: For many years, AI developments crept along at a snail’s pace. It sometimes felt like we’d never move beyond the era of the AOL SmarterChild chatbot. And then, everything changed. In a little over half a decade, we’ve undergone a century’s worth of innovation.
In this post, Anna-Sofia Lesiv explores the major turning points that led us to this moment. Regardless of whether you’re an AI super fan watching ChatGPT’s every move, or a reluctant luddite wondering what the hell a “transformer” is, this essay is worth a read.
With recent advances in machine learning, we could be entering a period of technological progression more impactful than the Scientific Revolution and the Industrial Revolution combined. The development of transformer architectures and massive deep learning models trained on thousands of GPUs has led to the emergence of smart, complex programs capable of understanding and producing language in a manner indistinguishable from humans.
The notion that text is a universal interface that can encode all human knowledge and be manipulated by machines has captivated mathematicians and computer scientists for decades. The massive language models that have emerged in the previous handful of years alone are now proof that these thinkers were onto something deep. Models like GPT-4 are already capable of not only writing creatively but coding, playing chess, and answering complex queries.
The success of these models, and their rapid improvement through incremental scaling and training, suggests that the learning architectures available today may soon be sufficient to bring about artificial general intelligence (AGI). It could be that new models will be required to produce AGI, but if existing models are on the right track, the path to AGI could instead amount to an economics problem: what does it take to get the money and energy necessary to train a sufficiently large model?
At a time of such mind-blowing advancement, it’s important to scrutinize the foundations underpinning technologies that are sure to change the world as we know it.
Beyond the Turing test with LLMs
Large language models (LLMs) like GPT-4 or GPT-3 are the most powerful and complex computational systems ever built. Though very little is known about the size of OpenAI’s GPT-4 model, we do know that GPT-3 is structured as a deep neural network composed of 96 layers and over 175 billion parameters. This means that just running this model to answer an innocent query via ChatGPT requires trillions of individual computer operations.
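To get a feel for where a figure like "trillions" comes from, here is a back-of-the-envelope sketch. It assumes a common rule of thumb, not anything published by OpenAI about ChatGPT: a forward pass through a dense transformer costs roughly two floating-point operations per parameter for each token generated. The response length is a hypothetical value chosen for illustration.

```python
# Rough estimate of the compute needed for one GPT-3 response.
# Assumption: ~2 floating-point operations per parameter per token,
# a rule of thumb for dense transformer forward passes.

GPT3_PARAMETERS = 175_000_000_000  # parameter count cited in the text
FLOPS_PER_PARAM_PER_TOKEN = 2      # rule-of-thumb assumption


def forward_flops(num_tokens: int, num_params: int = GPT3_PARAMETERS) -> int:
    """Approximate number of operations to generate `num_tokens` tokens."""
    return FLOPS_PER_PARAM_PER_TOKEN * num_params * num_tokens


if __name__ == "__main__":
    response_tokens = 100  # hypothetical length of a short answer
    total = forward_flops(response_tokens)
    print(f"~{total / 1e12:.0f} trillion operations for {response_tokens} tokens")
```

Under these assumptions, even a short answer lands in the tens of trillions of operations, which is consistent with the order of magnitude described above.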
After it was released in June 2020, GPT-3 quickly demonstrated it was formidable. It proved sophisticated enough to write bills, pass an MBA exam at Wharton, and be hired as a top software engineer at Google (eligible to earn a salary of $185,000). Also, it could score a 147 on a verbal IQ test, putting it in the 99th percentile of human intelligence.
However, those accomplishments pale in comparison to what GPT-4 can do. OpenAI remained particularly tight-lipped about the size and structure of the model, saying only: “Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload.” Even so, it shocked the world when it revealed just what this entirely redesigned model could do.
At one point, the commonly accepted way of detecting human-level computer intelligence was the Turing test. If a person could not tell whether they were conversing with a human or a computer through conversation alone, then it could be concluded the computer was intelligent. It’s now clear that this benchmark has outlived its relevance. Another test is needed to pinpoint just how intelligent GPT-4 is.
As rated by a variety of professional and academic benchmarks, GPT-4 is essentially in the 90th percentile of human intelligence. It scored above 700 in SAT Reading & Writing and SAT Math, which is sufficient for admission to many Ivy League universities. It also scored 5s (the top score possible on a scale of 1 to 5) in AP subjects including Art History, Biology, Statistics, Macroeconomics, and Psychology, among others. Remarkably, it can also remember and refer to information from a prompt spanning up to 25,000 words.
In fact, calling GPT-4 a language model isn’t quite right. Text isn’t all that it can do. GPT-4 is a multimodal model, meaning it deciphers both text and images. In other words, it can understand and summarize the content of a physics paper just as easily as a screenshot of a physics paper. Outside of that, it can also code, school you in the Socratic method, and compose anything from screenplays to songs.
The magic of the transformer model
The secret behind the success of large language models is their unique architecture. This architecture emerged just six years ago and has since gone on to rule the world of artificial intelligence.
When the field first emerged, the operating logic was that every neural network should have a unique architecture geared toward the particular task it needed to achieve. The assumption was that deciphering images would require one type of neural network structure, while reading text would need another. However, there remained those who believed that there might exist a neural network structure capable of performing any task you asked of it, in the same way a chip architecture can be generalized to execute any program. As OpenAI CEO Sam Altman wrote in 2014:
"Andrew Ng, who ... works on Google’s AI, has said that he believes learning comes from a single algorithm—the part of your brain that processes input from your ears is also capable of learning to process input from your eyes. If we can just figure out this one general-purpose algorithm, programs may be able to learn general-purpose things."
Between 1970 and 2010, the only real success in the field of artificial intelligence was in computer vision. Creating neural networks that could break a pixelated image into elements like corners, rounded edges, and so on eventually made it possible for AI programs to recognize objects. However, these same models didn't work as well when given the task of parsing the nuance and complexities of language. Early natural language processing systems kept messing up the order of words, suggesting that these systems were unable to correctly parse syntax and understand context.
Source: Our World in Data
It was not until a group of Google researchers in 2017 introduced a new neural network architecture specifically catering to language and translation that all this changed. The researchers wanted to solve the problem of translating text, a process that required decoding meaning from a certain grammar and vocabulary and mapping this meaning onto an entirely separate grammar and vocabulary. This system would need to be incredibly sensitive to word order and nuance, all while being cognizant of computational efficiency. The solution to this problem was the transformer model, which was described in detail in a paper called "Attention Is All You Need".
Rather than parsing information one piece after another like previous models did, the transformer model allowed a network to retain a holistic perspective of a document. This allowed it to make decisions about relevance, retain flexibility with things like word order, and, more importantly, understand the entire context of a document at all times.
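The mechanism behind that holistic view is called attention. The following is a minimal sketch of scaled dot-product self-attention in plain Python, not the full architecture from the paper. It assumes the queries, keys, and values are simply the input vectors themselves, whereas real transformers first pass the inputs through learned projection matrices. The function names and the toy two-dimensional "token" vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs.

    Simplifying assumption: the learned projection matrices used by real
    transformers are omitted.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this token against every token in the sequence at once.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Blend all token vectors according to their relevance weights.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical embeddings
    for row in self_attention(tokens):
        print([round(x, 3) for x in row])
```

Each output vector is a weighted mix of every input vector, so every position "sees" the whole sequence in a single step instead of reading it one element at a time.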
A neural network that could develop a sense of a total document's context was an important breakthrough. In addition, transformer models were faster and more flexible than any prior models. Their ability to cleverly translate from one format to another also suggested that they would be able to reason about a number of different types of tasks.
Today, it's clear that this was indeed the case. With a few tweaks, this same model could be trained to translate text into images as easily as it could translate English into French. Researchers from every AI subdomain were galvanized by this model, and quickly replaced whatever they were using before with transformers.
This model's uncanny ability to understand any text in any context essentially meant that any knowledge that could be encoded into text could be understood by the transformer model. As a result, large language models like GPT-3 and GPT-4 can write as easily as they can code or play chess—because the logic of those activities can be encoded into text.
Over the past few years, we’ve seen a series of tests of the limits of transformer models, and so far none have been found. Transformer models are already being trained to understand protein structure, design artificial enzymes that work just as well as natural enzymes, and much more. It's looking increasingly like the transformer model might be the much sought-after generalizable model. To drive the point home, Andrej Karpathy, a deep learning pioneer who contributed massively to the AI programs at OpenAI and Tesla, recently described the transformer architecture as "a general purpose computer that is trainable and also very efficient to run on hardware."
Neural networks through the ages