Central Question: Where are we now, and what comes next?
Part IV: The Age of Transformers (2017-Present)
We have witnessed remarkable things in these pages. We watched transformers emerge from the bottleneck problem of sequence-to-sequence models, saw attention mechanisms liberate neural networks from the tyranny of sequential processing, and traced the scaling of language models from millions to hundreds of billions of parameters. By 2020, GPT-3 could write essays, compose poetry, answer questions, and generate code. It could do all of this with no task-specific training—a few examples in the prompt were enough to steer its behavior toward almost any linguistic task.
And yet, something was wrong.
GPT-3 was impressive but not helpful. It could continue any text you started, but it did not understand what you actually wanted. Ask it a question, and it might answer—or it might generate five more questions in the same style. Ask it to write an email, and it might produce an email—or it might produce an article about email-writing tips, or a dialogue between two people discussing emails. The model had learned to predict text with extraordinary skill, but prediction is not assistance. Completion is not conversation.
The transformer architecture had given us unprecedented capability. Now a new question emerged: how do we make these models actually useful? How do we align them with human intentions?
14.1 Instruction Tuning and RLHF
The problem is subtle but profound. Language models are trained to predict the next token—to maximize the probability of whatever text comes next in their training data. The internet, from which that training data is drawn, contains multitudes: academic papers and conspiracy theories, helpful tutorials and toxic rants, sincere questions and sarcastic mockery. A model that perfectly predicts internet text will reflect all of this, including the parts we would rather it did not.
Moreover, prediction and helpfulness are different objectives. Consider this prompt: “Explain quantum entanglement to a five-year-old.” A model optimizing for prediction might generate a physics textbook passage (common in training data), a Wikipedia-style article (also common), or perhaps a forum discussion where someone asks this question and several people respond with varying quality. What we actually want—a clear, simple, child-appropriate explanation—is a tiny fraction of what the model has seen. The helpful response is drowned in a sea of merely probable ones.
Researchers at OpenAI began grappling with this problem in earnest around 2020. The key insight was simple in retrospect: if we want models to follow instructions, we should train them on examples of instruction-following. If we want them to be helpful, we should show them what helpful looks like.
This approach, which became known as instruction tuning or supervised fine-tuning, starts with a base language model and further trains it on a carefully curated dataset of instructions paired with ideal responses. Human contractors write both the prompts (“Write a haiku about machine learning”) and the completions (a genuinely good haiku). The model learns to associate the instruction format with the expected response format.
The improvement is immediate and striking. An instruction-tuned model asked to “explain quantum entanglement to a five-year-old” will actually attempt to do so, rather than generating tangentially related text. It follows the form of the request. It tries to be helpful.
But instruction tuning has limits. We can only write so many demonstrations. The model may follow instructions mechanically without understanding the deeper intent. And when faced with ambiguous requests or novel situations, it has no way to know which of many possible responses a human would actually prefer.
This is where reinforcement learning enters the picture.
The idea of using human feedback to train AI systems has a long history, but applying it to language models at scale required solving several technical challenges. The breakthrough came in 2022 with a paper from OpenAI titled “Training language models to follow instructions with human feedback”—the InstructGPT paper. It introduced a three-stage process that would reshape the field.
Stage one is the supervised fine-tuning we have already described. Take a base model like GPT-3, train it on demonstrations of instruction-following, and create a model that at least tries to do what you ask.
Stage two is more subtle. We cannot have humans rate every possible response the model might generate—there are too many. Instead, we train a separate neural network to predict human preferences. This reward model takes a prompt and a response as input and outputs a score representing how good a human would judge that response to be. To train it, we generate multiple responses to the same prompt, have humans rank them from best to worst, and train the model to assign higher scores to preferred responses.
Stage three uses reinforcement learning to optimize the language model against this reward model. The language model generates responses; the reward model scores them; and the language model is updated to generate responses that score higher. The technique used is Proximal Policy Optimization (PPO), an algorithm from the reinforcement learning literature that has proven stable and effective for this purpose.
Technical Box: The RLHF Pipeline
Reinforcement Learning from Human Feedback (RLHF) trains language models to generate outputs that humans prefer. The process has three stages:
Stage 1: Supervised Fine-Tuning (SFT)
Collect a dataset of (prompt, ideal_response) pairs written by human demonstrators. Fine-tune the base language model on this data using standard supervised learning:
\[\mathcal{L}_{SFT} = -\mathbb{E}_{(x,y) \sim D} \left[ \log \pi_\theta(y|x) \right]\]
This produces an initial policy \(\pi^{SFT}\) that follows instructions.
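In code, the SFT objective is simply a masked next-token cross-entropy. Here is a minimal sketch, assuming the model's next-token logits and a 0/1 mask marking which positions belong to the human-written response (the tensor and function names are illustrative, not any particular library's API):

```python
import torch.nn.functional as F

def sft_loss(logits, target_ids, response_mask):
    """Negative log-likelihood of the demonstration response given the prompt.

    logits:        (batch, seq_len, vocab) next-token predictions from the model
    target_ids:    (batch, seq_len) the tokens the model should have predicted
    response_mask: (batch, seq_len) 1 for response tokens, 0 for prompt tokens,
                   so the loss covers only the human-written completion
    """
    per_token_nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time dimensions
        target_ids.reshape(-1),
        reduction="none",
    ).view(target_ids.shape)
    mask = response_mask.float()
    # Average -log pi_theta(y | x) over response positions only.
    return (per_token_nll * mask).sum() / mask.sum()
```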
Stage 2: Reward Model Training
For a given prompt \(x\), generate multiple candidate responses \(\{y_1, y_2, ..., y_k\}\) from the SFT model. Have humans rank these responses from best to worst. Train a reward model \(r_\phi(x, y)\) to predict these preferences.
Given a preferred response \(y_w\) and a dispreferred response \(y_l\), the Bradley-Terry loss is:
\[\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \right]\]
The reward model learns to assign higher scores to human-preferred responses.
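The loss itself is nearly a one-liner. The sketch below assumes the reward model has already produced a scalar score for each response in a preference pair (names are illustrative):

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Bradley-Terry pairwise loss for training the reward model.

    chosen_scores:   (batch,) scalar scores r_phi(x, y_w) for preferred responses
    rejected_scores: (batch,) scalar scores r_phi(x, y_l) for dispreferred responses
    """
    # -log sigmoid(r_w - r_l); logsigmoid is the numerically stable form
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```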
Stage 3: RL Fine-Tuning with PPO
Optimize the policy to maximize the reward model’s scores while staying close to the original SFT model. The objective is:
\[\mathcal{J}(\theta) = \mathbb{E}_{x \sim D, y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta \cdot D_{KL}(\pi_\theta(y|x) \| \pi^{SFT}(y|x)) \right]\]
The KL divergence penalty, controlled by \(\beta\), prevents the policy from drifting too far from the SFT model. Without it, the model might find adversarial responses that score high on the reward model but are nonsensical—exploiting flaws in the reward model rather than genuinely improving.
Why the KL Penalty Matters
The reward model is imperfect—it is trained on limited human feedback and may have blind spots or biases. If we optimize too aggressively against it, the policy learns to exploit these imperfections. The KL penalty acts as a regularizer, keeping the model close to the distribution of natural language it learned during pretraining and SFT.
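In practice the penalty is often estimated from the log-probabilities both models assign to the sampled response tokens. A minimal sketch of the penalized reward, assuming those log-probabilities are available and leaving the full PPO update aside (the value of \(\beta\) here is illustrative):

```python
def kl_penalized_reward(reward, logprobs_policy, logprobs_sft, beta=0.1):
    """Sequence-level reward used in the RL stage: reward model score minus
    a KL penalty that keeps the policy close to the SFT model.

    reward:          (batch,) scores r_phi(x, y) from the reward model
    logprobs_policy: (batch, resp_len) log pi_theta(y_t | x, y_<t) for sampled tokens
    logprobs_sft:    (batch, resp_len) log pi_SFT(y_t | x, y_<t) for the same tokens
    beta:            strength of the KL penalty (value here is illustrative)
    """
    # Sampled-token estimate of KL(pi_theta || pi_SFT), summed over the response.
    kl_estimate = (logprobs_policy - logprobs_sft).sum(dim=-1)
    return reward - beta * kl_estimate
```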
The results were compelling. InstructGPT, despite having only 1.3 billion parameters, was preferred by human raters over the 175-billion-parameter GPT-3 on most tasks. Smaller but better-aligned beat larger but less helpful.
Then came November 30, 2022.
OpenAI released ChatGPT—essentially the RLHF methodology applied to GPT-3.5—as a free research preview. Within five days, it had one million users. Within two months, it had one hundred million. No consumer application in history had reached that milestone so quickly. Instagram took two and a half years. TikTok took nine months. ChatGPT did it in two months.
The chatbot moment had arrived. Suddenly, millions of people were interacting with AI systems that could converse, explain, assist, and create. The technology that researchers had been developing in relative obscurity was now dinner table conversation. Students used it for homework. Programmers used it for debugging. Writers used it for brainstorming. Lawyers used it for drafting. The world discovered, almost overnight, that machines could do something that looked remarkably like thinking.
The era of large language models as consumer products had begun.
14.2 Capabilities and Limitations
What can these systems actually do? The answer is both more and less than the hype suggests.
The capabilities are genuine and wide-ranging. Modern LLMs can summarize documents, extracting key points from lengthy texts with reasonable accuracy. They can translate between languages, not with the clinical precision of professional translators but with surprising fluency for casual use. They can write code in dozens of programming languages, often producing working solutions on the first try for standard problems. They can explain complex concepts, adjusting their explanations based on the indicated audience level. They can draft emails, essays, stories, and poems. They can answer questions about a vast range of topics, drawing on the compressed knowledge of their training data.
Perhaps most remarkably, they can reason—sometimes. Given a well-designed prompt, an LLM can work through multi-step logical problems, show its work, and arrive at correct conclusions. The chain-of-thought technique, where the model is encouraged to “think step by step,” can dramatically improve performance on reasoning tasks. Problems that seem to require genuine inference, not just pattern matching, often yield to this approach.
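As a concrete illustration, here is what a chain-of-thought prompt might look like in code. The worked example nudges the model to lay out its reasoning before committing to an answer; the `generate` call is a hypothetical stand-in for whatever completion API is available:

```python
# A minimal chain-of-thought prompt: one worked example, then a new question.
prompt = (
    "Q: A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
    "A: Let's think step by step. 12 pens is 4 groups of 3 pens. "
    "Each group costs $2, so 4 * 2 = $8. The answer is $8.\n\n"
    "Q: A train travels 60 km in 45 minutes. What is its speed in km/h?\n"
    "A: Let's think step by step."
)
# answer = generate(prompt)   # hypothetical call to a language model
```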
But the limitations are equally real, and in some ways more troubling precisely because they are less visible.
Mathematical reasoning remains fragile. An LLM might correctly solve a calculus problem that resembles many examples in its training data, then fail catastrophically on a slight variation. The underlying mathematical intuition that humans develop is not clearly present. The model recognizes and replicates patterns of mathematical reasoning without necessarily understanding the mathematical structures themselves.
Factual accuracy is unreliable in ways that are difficult to predict. The model might correctly state that Paris is the capital of France and incorrectly state that the Eiffel Tower was completed in 1887 when it was actually finished in 1889. Both statements emerge from the same statistical process. The model has no internal sense of which facts it knows well and which are uncertain. It generates text that sounds equally confident regardless of whether the content is correct.
This leads to what researchers call hallucination—the confident generation of false information. Ask an LLM about an obscure topic, and it may fabricate plausible-sounding details: citations to papers that do not exist, historical events that never happened, biographical facts that are simply wrong. The model is not lying in any meaningful sense; it is doing what it was trained to do, generating text that fits the context. It has no way to distinguish between patterns from factual sources and patterns from fiction, between well-documented claims and speculation.
The term “hallucination” is perhaps unfortunate, anthropomorphizing a process that is fundamentally different from human confabulation. When a person hallucinates, there is a distinction between their hallucination and reality that they have lost access to. The LLM has no such distinction to lose. It never had access to reality in the first place—only to text about reality, in which truths and falsehoods are mingled inextricably.
Long-term coherence presents another challenge. LLMs can maintain consistency over a few paragraphs but tend to drift over longer texts. In a lengthy story, characters may change personality, settings may shift without explanation, plot threads may be abandoned or contradicted. The model has no persistent model of the world it is describing; it generates each token based on the context window, which eventually pushes earlier content out of scope.
The interpretability challenge looms over all of this. We do not really know how these systems work inside. We know the architecture—attention layers, feed-forward networks, residual connections—and we can trace the flow of information through the network. But we cannot explain why a particular prompt produces a particular response. We cannot predict when the model will fail. We cannot look inside and see knowledge or reasoning or understanding; we see only matrices of floating-point numbers, vast and inscrutable.
This opacity is troubling not just scientifically but practically. If we cannot understand how the model arrives at its outputs, we cannot reliably predict when it will produce harmful ones. We cannot audit its reasoning. We cannot explain its decisions. We are deploying systems whose capabilities increasingly exceed our comprehension of them.
Is what these models do really reasoning, or is it just sophisticated pattern matching? The debate is ongoing and perhaps ill-posed. Human reasoning itself may be a form of pattern matching, refined by evolution and experience. The question may be less whether LLMs “truly” reason than whether the distinction between true reasoning and its imitation has any clear meaning.
What we can say is that LLMs exhibit emergent capabilities—abilities that appear suddenly as models scale up, without being explicitly trained. Few-shot learning, where the model learns new tasks from a handful of examples in the prompt, emerged around the 100-billion-parameter scale. Chain-of-thought reasoning becomes effective only at certain scales. These discontinuities are not well understood; we cannot predict what capabilities the next scale-up will unlock.
And this unpredictability feeds into the deepest concern of all: the alignment problem. As these systems become more capable, how do we ensure they remain beneficial? How do we make them do what we actually want, not just what we managed to specify? The RLHF techniques we described earlier are a start, but they are band-aids on a fundamental difficulty. We are training systems to optimize proxy objectives—reward model scores that approximate human preferences—and hoping the proxies do not diverge from our true intentions at some critical moment.
The models we are building are not intelligent in the way humans are intelligent. But they are also not merely sophisticated autocomplete. They occupy a strange new space in the landscape of cognition, and we are still learning to navigate it.
14.3 The Road Ahead
Where does this journey take us next?
The most visible trend is the unification of modalities. Language models are becoming vision models, speech models, video models. GPT-4V can describe images, answer questions about photographs, read handwritten text. Google’s Gemini processes text, images, and audio in a single architecture. The dream of unified AI—systems that can perceive and reason about the world in all its sensory richness—is taking shape.
This multimodal expansion follows naturally from the transformer architecture, which makes minimal assumptions about its input. If you can tokenize it, you can process it. Images become patches, audio becomes spectrograms, video becomes sequences of frames. The same attention mechanisms that learned relationships between words can learn relationships between visual regions, between speech segments, between any discrete representations we choose to provide.
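To make the “images become patches” step concrete, here is a small sketch in the spirit of Vision Transformers, with illustrative shapes (a 224×224 RGB image cut into 16×16 patches, each flattened into one “token”):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an image into non-overlapping patches so it can be fed to a
    transformer as a sequence of flattened patch vectors.

    image: (height, width, channels) array; height and width are assumed
           to be multiples of patch_size in this simple sketch.
    """
    h, w, c = image.shape
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)                # (rows, cols, p, p, c)
    return patches.reshape(-1, patch_size * patch_size * c)   # (num_patches, dim)

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)   # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```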
At the same time, a countervailing force pushes toward efficiency. Not every application needs GPT-4. Not every device can run a hundred-billion-parameter model. Techniques like distillation (training smaller models to mimic larger ones), quantization (reducing numerical precision to shrink memory footprint), and sparse mixture-of-experts architectures (activating only a fraction of parameters for each input) are making capable models smaller and faster.
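As one example, the core of distillation is a single loss term that pushes the small student model's output distribution toward the large teacher's. A minimal sketch, assuming access to both models' logits on the same batch and using the standard temperature-softened formulation:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Match the small student's output distribution to the large teacher's.

    student_logits, teacher_logits: (batch, vocab) logits for the same inputs
    temperature: values > 1 soften both distributions so the student also
                 learns from how the teacher ranks the non-top tokens
    """
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as in the standard recipe
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * temperature**2
```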
The open-source movement has democratized access. Meta’s LLaMA models, released in 2023, provided weights that researchers and developers could build upon. Mistral’s efficient models punched above their parameter count. A vibrant ecosystem emerged of fine-tuned variants, specialized adaptations, local deployments. The technology that was once locked in corporate research labs now runs on laptops.
This democratization raises its own questions. When anyone can fine-tune a language model, what happens when someone fine-tunes it for harm? The same techniques that align models with helpful behavior can align them with harmful behavior. The defenses are imperfect, the attacks are evolving, and the implications are being worked out in real time.
Perhaps the most significant development is the emergence of agents—LLMs that do not just generate text but take actions in the world. A model that can write code can run that code. A model that can describe web navigation can drive a browser. A model that can reason about tasks can decompose them into steps, execute each step, observe the results, and adapt. We are moving from language models as oracles (you ask, they answer) to language models as actors (you specify a goal, they pursue it).
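A bare-bones agent loop can be sketched in a few lines. Everything here (the `llm` callable, the tool registry, the naive colon-separated action format) is a hypothetical stand-in, not any particular framework's API:

```python
def parse_action(reply):
    """Very naive parser: expects a line like 'search: weather in Paris' or
    'finish: <final answer>'. A real agent needs a far more robust format."""
    name, _, argument = reply.partition(":")
    return name.strip().lower(), argument.strip()

def run_agent(goal, llm, tools, max_steps=10):
    """Loop: the model proposes an action, a tool executes it, and the
    observation is appended to the transcript the model sees next turn.

    llm:   any prompt -> text callable (hypothetical stand-in)
    tools: dict mapping action names to functions, e.g. {"search": web_search}
    """
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        reply = llm(transcript + "Next action:")
        action, argument = parse_action(reply)
        if action == "finish":
            return argument                      # the model's final answer
        if action not in tools:
            observation = f"unknown tool: {action}"
        else:
            observation = tools[action](argument)
        transcript += f"Action: {action}: {argument}\nObservation: {observation}\n"
    return "Stopped: step limit reached."
```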
This agentic capability transforms the risk profile. A model that only generates text is bounded in its impact—a human must read the text and decide whether to act on it. A model that executes code, makes API calls, and takes real-world actions can have consequences far beyond what its operators anticipated. The alignment problem becomes more urgent: not just “does it say what we want?” but “does it do what we want?”
The biggest questions remain open.
Will scaling continue to work? The scaling laws that predicted GPT-3’s capabilities from GPT-2’s have held remarkably well. But there are reasons to wonder if they will hold indefinitely. We may exhaust high-quality training data. We may hit fundamental limits on what can be learned from text prediction alone. The next factor-of-ten improvement may require not just more compute but new ideas.
What are the limits of the paradigm? The transformer architecture is general and powerful, but it is not infinitely flexible. Its quadratic attention complexity constrains context length. Its lack of persistent memory limits coherence over long interactions. Its fundamentally statistical nature may be unable to capture certain kinds of formal reasoning. Whether these are engineering challenges or fundamental barriers is not yet clear.
Is this a path to AGI? Artificial general intelligence—systems that match or exceed human cognitive abilities across all domains—remains speculative. Some researchers believe we are on an inexorable path toward it, that scaling up current architectures will eventually produce general intelligence. Others believe we are building very sophisticated narrow tools that will hit a ceiling short of genuine understanding. The honest answer is that we do not know. We are running an experiment on ourselves, and the results are not yet in.
We began this book with Charles Babbage, frustrated by errors in mathematical tables, dreaming of engines that could calculate without human fallibility. We traced the long arc from his brass gears through Boole’s algebra, Turing’s machine, Shannon’s bits, the perceptron and its discontents, the AI winters and the connectionist revival, the rise of deep learning and the attention revolution, to the strange new world of large language models.
The question Babbage implicitly asked—can thought be mechanized?—remains contested. We can now say with certainty that many things that looked like thought can be mechanized: translation, summarization, conversation, creative writing, mathematical problem-solving (sometimes), logical reasoning (sometimes), code generation. Whether what machines do is “really” thinking, or whether that question even has a meaningful answer, is left to philosophers to debate.
What we can say is this: we have built systems that surprise their creators. Systems whose capabilities exceed our ability to explain or predict them. Systems that pass tests—the Turing test, medical licensing exams, bar exams, coding interviews—that were once thought to require human-level intelligence. The boundary between what machines can and cannot do has moved so dramatically that we must constantly revise our intuitions.
We have been here before, in a sense. The history of AI is a history of hype cycles—extravagant promises followed by sobering winters. The perceptron would unlock machine intelligence; then it could not solve XOR. Expert systems would capture human expertise; then they could not handle the complexity of real-world knowledge. Every breakthrough has been followed by the discovery of new limitations, every triumph by a humbling correction.
But something feels different this time. The systems we have built are not narrow tools that succeed in one domain and fail in all others. They are general-purpose reasoning engines that improve continuously across a remarkable range of tasks as we scale them up. The limitations are real, but they are not obviously fundamental. The trajectory points upward even as we struggle to understand where it leads.
From Babbage’s gears to GPT, we have traced an extraordinary journey. The story of artificial intelligence is a story of human aspiration, human ingenuity, and human blindness—we rarely knew where our ideas would lead. Ada Lovelace could not have imagined ChatGPT. Alan Turing could not have imagined transformers. The researchers who proved the perceptron convergence theorem could not have imagined that neural networks would one day beat grandmasters at Go and draft legal briefs.
The story continues. Where it leads, we cannot say. But we are all participants now, whether we choose to be or not. The machines that we have built are changing what it means to know, to create, to work, to think. The questions they raise—about intelligence, about consciousness, about the nature of mind—are no longer merely philosophical. They are practical, urgent, and ours to answer.
We began with a question: Can machines think? We end with a world in which machines do things that look remarkably like thinking, and in which the answer to the question matters more than ever.
The gears have become algorithms. The algorithms have become models. The models have become something we are still learning to name. The journey from mechanical calculators to large language models is not a straight line but a winding path, full of detours and dead ends and unexpected leaps. And yet, looking back, we can see a thread connecting Babbage’s brass wheels to the silicon attention layers of today: the faith that thought can be formalized, that reasoning can be captured in rules, that the mind—whatever it is—has a structure we can understand and perhaps replicate.
Whether that faith is justified, we will discover together.
Chapter Notes
Primary Sources to Reference
- Ouyang, L., et al. (2022). “Training language models to follow instructions with human feedback.” arXiv preprint arXiv:2203.02155.
- Bai, Y., et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” arXiv preprint arXiv:2212.08073.
- OpenAI. (2023). “GPT-4 Technical Report.” arXiv preprint arXiv:2303.08774.
- Schulman, J., et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv preprint arXiv:1707.06347.
- Wei, J., et al. (2022). “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” arXiv preprint arXiv:2201.11903.
- Bubeck, S., et al. (2023). “Sparks of Artificial General Intelligence: Early experiments with GPT-4.” arXiv preprint arXiv:2303.12712.
Further Reading
- “Constitutional AI” by Anthropic - an alternative to RLHF using AI feedback
- “The Bitter Lesson” by Rich Sutton - foundational essay on the importance of computation
- “Attention Is All You Need” revisited with the context of scaling laws
- “Superintelligence” by Nick Bostrom - philosophical treatment of AI alignment challenges
- “The Alignment Problem” by Brian Christian - accessible introduction to alignment research