Central Question: What happens when we scale transformers on text?
Part IV: The Age of Transformers (2017-Present)
The transformer architecture has been unveiled. In the previous chapter, we watched a team at Google throw away recurrence entirely, replacing the sequential processing of RNNs with the parallel elegance of self-attention. We saw how this seemingly radical simplification - “Attention Is All You Need” - enabled unprecedented parallelization and scaling. But a translation model, however elegant, is just a translation model. The transformer’s true potential is about to be revealed.
In this chapter, we witness the birth of the large language model era. We begin in October 2018, when Google releases a model called BERT that crushes every NLP benchmark in sight. We then follow OpenAI’s parallel journey with GPT, watching as models grow from millions to billions to hundreds of billions of parameters, developing capabilities that surprise even their creators. Finally, we examine the scaling laws that explain why bigger models keep getting better - and what this means for the future of AI.
The question is no longer “Can we do better than recurrence?” It is now: “How far can we push this?”
13.1 BERT and Bidirectional Understanding
It is October 2018. Jacob Devlin and his colleagues at Google AI Language are about to publish a paper that will reshape natural language processing. The paper introduces BERT - Bidirectional Encoder Representations from Transformers - and within months, it will become the most influential NLP paper since the transformer itself.
To understand why BERT matters, we must first understand the limitation it addresses. Consider the sentence: “The bank was steep, so we couldn’t climb it.” What does “bank” mean here? A human reader instantly recognizes this is a riverbank, not a financial institution. The words “steep” and “climb”, both appearing to the right of “bank”, make this clear. But what if our language model only looks at words to the left of “bank”? It sees “The” - not terribly informative. It has no access to the crucial disambiguating context that appears later in the sentence.
This is the problem with left-to-right language models. They predict each word based only on preceding words, missing the context that follows. Humans do not read this way. We take in entire sentences, paragraphs, even documents, integrating information from all directions. Why should our models be so constrained?
Previous attempts at bidirectional context existed, but they felt like workarounds. Some models trained separate left-to-right and right-to-left models and concatenated their outputs. Others used shallow bidirectional layers on top of unidirectional bases. BERT proposes something more elegant: train a single, deeply bidirectional model from the start.
But there is an immediate problem. How do you train a bidirectional model? The standard language modeling objective - predict the next word - is inherently left-to-right. If the model can see all words simultaneously, it could trivially “cheat” by just copying the word it is supposed to predict from its input.
Devlin’s insight is beautifully simple: mask some words and ask the model to predict them. This is masked language modeling (MLM). Take a sentence, randomly mask 15% of the tokens (replacing them with a special [MASK] token), and train the model to predict the original words. Now the model must use both left and right context to make its predictions. To guess what word belongs in “The [MASK] was steep,” it must consider both what comes before and what comes after.
This masking approach has an elegant property: it forces the model to build rich, contextualized representations of every word. The model cannot rely on shallow pattern matching; it must develop deep understanding of how words relate to their contexts in all directions. The representation for “bank” becomes different depending on whether it appears near “river” or “money” - the model learns context-dependent word meanings automatically.
BERT adds a second pre-training objective: next sentence prediction (NSP). Given two sentences, the model must predict whether the second sentence actually follows the first in the original text, or whether it was randomly sampled from elsewhere. This teaches the model something about discourse coherence - how sentences relate to each other across longer spans.
The pre-training data is massive: the entire English Wikipedia (2.5 billion words) plus a corpus of books (800 million words). Training takes about four days on Google’s custom TPU chips - 16 chips for BERT-Base, 64 for BERT-Large - enormous resources that were simply unavailable a few years earlier. But when training completes, something remarkable emerges: a model that understands language in a deep, transferable way.
The magic happens in fine-tuning. Take this pre-trained BERT model - which has never seen a single labeled example of any specific task - and adapt it to your task with minimal additional training. Want to classify sentiment? Add a classification head, fine-tune on a few thousand labeled examples, done. Question answering? Same process. Named entity recognition? Same process. Each task requires only a small task-specific output layer; the massive pre-trained BERT encoder provides the foundation.
The results are stunning. BERT crushes the GLUE benchmark - a suite of nine natural language understanding tasks - improving the state of the art by 7.7% absolute. On SQuAD, a reading comprehension benchmark, BERT surpasses human-level performance. Task after task falls. The NLP community scrambles to understand what is happening.
What BERT demonstrates is the power of pre-training and fine-tuning. Instead of training a separate model for each task from scratch, we train one massive model on a general objective (predicting masked words), then adapt it to specific tasks. The pre-trained model has already learned syntax, semantics, world knowledge, reasoning patterns - all from simply predicting missing words in text. Fine-tuning leverages this knowledge for new tasks.
This paradigm shift echoes what happened in computer vision with ImageNet pre-training. Train a CNN on ImageNet, then fine-tune for your specific task. The features learned for ImageNet - edges, textures, shapes, objects - transfer to almost any visual task. BERT does the same for language. The representations learned from masked language modeling transfer to almost any language task.
Technical Box: BERT Architecture
BERT uses the encoder portion of the transformer architecture - self-attention without the causal masking that decoders use. Each token can attend to every other token in both directions.
Input Representation:
Every input sequence begins with a special [CLS] token, whose final hidden state serves as a summary representation for classification tasks. For tasks involving pairs of sentences, a [SEP] token separates them:
[CLS] The cat sat on the mat [SEP] It was comfortable [SEP]
Each token’s input embedding combines three components:
- Token embedding (what word is this?)
- Segment embedding (is this sentence A or B?)
- Position embedding (where in the sequence?)
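As a concrete illustration, here is a minimal sketch of how these three embeddings might be combined in code. The class and variable names, and the layer normalization applied to the sum, are illustrative choices for this example rather than a reproduction of Google's implementation; the dimensions match BERT-Base.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (BERT-Base: hidden size 768, max sequence length 512,
# a WordPiece vocabulary of roughly 30,000 tokens).
VOCAB_SIZE, MAX_LEN, HIDDEN = 30_000, 512, 768

class BertInputEmbedding(nn.Module):
    """Sum of token, segment, and position embeddings, as described above."""
    def __init__(self):
        super().__init__()
        self.token = nn.Embedding(VOCAB_SIZE, HIDDEN)    # what word is this?
        self.segment = nn.Embedding(2, HIDDEN)           # sentence A or B?
        self.position = nn.Embedding(MAX_LEN, HIDDEN)    # where in the sequence?
        self.norm = nn.LayerNorm(HIDDEN)                 # illustrative normalization of the sum

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        summed = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.norm(summed)
```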
Model Specifications:
| Model | Layers | Hidden size | Attention heads | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
Pre-training Objectives:
- Masked Language Model (MLM):
- Randomly select 15% of input tokens
- Of these: 80% replaced with [MASK], 10% replaced with random token, 10% unchanged
- Train to predict original tokens
- Loss: cross-entropy over masked positions only (see the masking sketch after this list)
- Next Sentence Prediction (NSP):
- 50% of training examples: sentence B follows sentence A (label: IsNext)
- 50%: sentence B is random (label: NotNext)
- Train classifier on [CLS] token output
- Later research found NSP contributes little; some variants drop it
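A minimal sketch of the MLM masking procedure above, operating on a plain list of token ids. The [MASK] id and vocabulary size are placeholders, and the -100 label follows the common convention of marking positions to ignore in the loss (it is the default ignore index of PyTorch’s cross-entropy).

```python
import random

MASK_ID = 103        # placeholder id for [MASK]; the real value depends on the vocabulary
VOCAB_SIZE = 30_000  # placeholder vocabulary size

def mask_for_mlm(token_ids, mask_prob=0.15):
    """Return (corrupted inputs, labels). Labels are -100 at unselected positions,
    so cross-entropy is computed over masked positions only."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, original in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue                              # ~85% of tokens are left untouched
        labels[i] = original                      # the model must predict the original token here
        roll = random.random()
        if roll < 0.8:                            # 80% of selected tokens -> [MASK]
            inputs[i] = MASK_ID
        elif roll < 0.9:                          # 10% -> a random token
            inputs[i] = random.randrange(VOCAB_SIZE)
        # remaining 10% -> keep the original token, but still predict it
    return inputs, labels
```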
Fine-tuning Paradigm:
For each downstream task, add a task-specific head on top of BERT:
- Classification: Linear layer on [CLS] output
- Token-level tasks (NER): Linear layer on each token output
- Question answering: Linear layers predicting start and end positions of answer span
During fine-tuning, update ALL parameters - both the task head and the entire BERT encoder. This allows BERT’s representations to adapt slightly to each task while preserving the rich pre-trained knowledge.
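A minimal PyTorch-style sketch of this recipe for a classification task. `PretrainedEncoder` below is a stand-in for a pre-trained BERT encoder that returns one hidden vector per token; it is not a real library class, and the hyperparameters in the comments are illustrative.

```python
import torch
import torch.nn as nn

class BertForClassification(nn.Module):
    """A pre-trained encoder plus a small task-specific head on the [CLS] output."""
    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                    # stand-in for a pre-trained BERT encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, token_ids, segment_ids):
        hidden = self.encoder(token_ids, segment_ids)   # (batch, seq_len, hidden)
        cls_vector = hidden[:, 0]                       # [CLS] is always the first token
        return self.head(cls_vector)                    # task logits

# Fine-tuning updates ALL parameters - the new head and the entire encoder:
# model = BertForClassification(pretrained_encoder)
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small learning rate, few epochs
```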
The impact of BERT ripples through the field. Variants proliferate: RoBERTa (more data, longer training, no NSP), ALBERT (parameter sharing for efficiency), DistilBERT (smaller, faster), XLNet (permutation language modeling), ELECTRA (replaced token detection). The core insight - deep bidirectional pre-training, then task-specific fine-tuning - becomes the default paradigm for NLP.
But BERT has a limitation that becomes increasingly apparent. It is an encoder; it produces representations but does not generate text. Ask BERT to complete a sentence and it cannot comply - that is not what it was trained for. For generation, we need a different approach. In San Francisco, a team at OpenAI is betting on a different architecture entirely.
13.2 GPT: Generative Pre-training
While Google bets on bidirectional encoders, OpenAI makes a different wager. In June 2018 - four months before BERT - Alec Radford and colleagues release GPT: Generative Pre-Training. The “G” in GPT stands for generative, and this choice of direction will prove consequential.
GPT uses the decoder portion of the transformer, not the encoder. It is autoregressive: it generates text one token at a time, each token conditioned on all previous tokens. This is the same left-to-right language modeling that BERT explicitly abandoned. But OpenAI believes this limitation is actually a strength.
The training objective is simple: predict the next word. Given “The cat sat on the”, predict “mat”. Given “The cat sat on the mat”, predict the period. This is the most natural way to model text - it is how text is written, one word after another. And crucially, it allows the model to generate new text by sampling from its own predictions.
GPT-1 has 117 million parameters, trained on a corpus of books. The results are respectable but not revolutionary. It improves on several benchmarks but does not shake the field like BERT will. OpenAI presses on.
In February 2019, GPT-2 arrives. It has 1.5 billion parameters - ten times larger than GPT-1 - trained on a much larger corpus scraped from the web. And something unexpected happens.
GPT-2 does not just get incrementally better. It develops capabilities that seem qualitatively different. Given a prompt, it generates coherent paragraphs - sometimes pages - of text. It writes stories, articles, code. It answers questions, summarizes documents, translates languages. None of these tasks were in its training objective. It was trained only to predict the next word. Yet it learned to do all these things.
OpenAI makes an unusual decision: they initially withhold the full model, citing concerns about misuse. “Due to our concerns about malicious applications of the technology,” they write, “we are not releasing the trained model.” This is the first time a major AI lab has declined to release a model because it worked too well. The research community is skeptical - surely this is a publicity stunt? - but the released samples are undeniably impressive.
What GPT-2 demonstrates is that pure scale can unlock new capabilities. Its training objective was no different from GPT-1’s - predict the next word; it was just bigger. More parameters, more data, more compute. And with that scale came abilities that seemed to emerge from nowhere.
But GPT-2 is merely a prelude. In May 2020, OpenAI releases a paper that will define the field for years to come: “Language Models are Few-Shot Learners.” The paper introduces GPT-3.
GPT-3 has 175 billion parameters. This is not a typo. It is more than 100 times larger than GPT-2, which was itself 10 times larger than GPT-1. Training costs are estimated in the millions of dollars. The model is trained on a filtered version of the entire publicly accessible internet, plus books, plus Wikipedia.
But the paper’s central contribution is not the model’s size - it is a new paradigm for using language models. Instead of fine-tuning, GPT-3 is evaluated through few-shot learning: show the model a few examples of a task in its prompt, then ask it to perform the task on a new input. No gradient updates. No task-specific training. Just examples in the prompt.
Consider a translation task. The prompt might be:
English: I love you.
French: Je t'aime.
English: Where is the library?
French: Où est la bibliothèque?
English: The weather is beautiful today.
French:
GPT-3 completes this with “Le temps est beau aujourd’hui.” It learned to translate from two examples. Not from fine-tuning on millions of parallel sentences. Not from any explicit translation training. Just from seeing the pattern in its prompt.
This works for task after task. Arithmetic: show a few examples of addition, and GPT-3 can add. Grammar correction: show a few examples of corrections, and GPT-3 corrects. Question answering, summarization, code generation - all accessible through few-shot prompting.
Even more surprising: some tasks work with zero-shot prompting. Just describe the task in natural language, no examples at all. “Translate the following English text to French:” - and GPT-3 complies. It understands task descriptions, not just task examples.
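A small sketch of what few-shot prompting looks like mechanically: the “training examples” live entirely inside the prompt string, and the model’s weights are never touched. The `generate` call at the end is a hypothetical placeholder for whatever completion interface is available.

```python
def few_shot_prompt(examples, query):
    """Assemble translation examples and a new query into a single prompt string."""
    lines = []
    for english, french in examples:
        lines.append(f"English: {english}")
        lines.append(f"French: {french}")
    lines.append(f"English: {query}")
    lines.append("French:")                 # the model continues from here
    return "\n".join(lines)

examples = [
    ("I love you.", "Je t'aime."),
    ("Where is the library?", "Où est la bibliothèque?"),
]
prompt = few_shot_prompt(examples, "The weather is beautiful today.")
# completion = generate(model, prompt)      # hypothetical call; no gradient updates anywhere
```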
What is happening here? How can a model trained only to predict the next word develop these abilities?
The answer lies in what “predicting the next word” actually requires. To predict text well across the vast diversity of the internet, a model must learn an enormous amount about language and the world. It must learn syntax and semantics, facts and reasoning, styles and conventions. When we ask GPT-3 to translate, we are not teaching it translation - we are activating translation abilities it already learned during pre-training, from the many examples of multilingual text in its training data.
These are emergent capabilities - abilities that appear at scale but are absent in smaller models. GPT-2 could barely do arithmetic; GPT-3 can solve word problems. This is not because GPT-3 received more arithmetic training; it is because scale unlocks capabilities that do not exist at smaller scales. The field is beginning to understand that larger language models are not just better at the same things - they can do fundamentally different things.
Technical Box: Autoregressive Generation
GPT models generate text by factoring the probability of a sequence into a product of conditional probabilities:
\[P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | x_1, x_2, ..., x_{i-1})\]
During training, we maximize the log probability of the training data:
\[L(\theta) = \sum_{i=1}^{n} \log P(x_i | x_1, ..., x_{i-1}; \theta)\]
This is computed efficiently in parallel using causal masking - each position can only attend to earlier positions.
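A minimal sketch of both pieces - the causal mask and the shifted next-token loss - assuming a `model` that maps a batch of token ids to a tensor of logits; names and shapes here are illustrative.

```python
import torch
import torch.nn.functional as F

def causal_mask(seq_len):
    """Lower-triangular attention mask: position i may attend only to positions <= i.
    Inside the model, attention scores at masked-out positions are set to -inf."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def next_token_loss(model, token_ids):
    """Cross-entropy of predicting token t+1 from tokens <= t, for all positions at once."""
    logits = model(token_ids)                 # (batch, seq_len, vocab)
    targets = token_ids[:, 1:]                # shift left: each position predicts its successor
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # drop the final position's prediction
        targets.reshape(-1),
    )
```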
Generation Strategies:
Once trained, how do we generate text? Several strategies exist, each with different tradeoffs:
Greedy Decoding: Always pick the highest-probability token.
- Fast and deterministic
- Often produces repetitive, dull text
- Gets stuck in loops

Beam Search: Maintain the top-k candidate sequences, extend each, and keep the best.
- Better than greedy for some tasks (translation)
- Still tends toward high-probability, generic text
- Computationally expensive
Temperature Sampling: Sample from the probability distribution, scaled by temperature T:
\[P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\]
Where \(z_i\) is the logit (pre-softmax score) for token \(i\).
- T = 1: Sample from the model’s actual distribution
- T < 1: Sharper distribution, more deterministic
- T > 1: Flatter distribution, more random/creative
Top-k Sampling: Sample only from the k highest-probability tokens.
- Prevents very low-probability tokens from being selected
- k = 1 is equivalent to greedy decoding

Top-p (Nucleus) Sampling: Sample from the smallest set of tokens whose cumulative probability exceeds p.
- Dynamically adjusts the candidate set based on the distribution
- If the model is confident, considers fewer options
- If uncertain, considers more options
- Typically p = 0.9 or 0.95
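A minimal sketch of how temperature, top-k, and top-p can be combined in a single sampling step, assuming a one-dimensional tensor of logits from the model for the next position; the defaults are illustrative.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick one token id from a 1-D tensor of logits (one entry per vocabulary item)."""
    logits = logits / temperature                        # T < 1 sharpens, T > 1 flattens

    if top_k is not None:                                # keep only the k most likely tokens
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    probs = torch.softmax(logits, dim=-1)

    if top_p is not None:                                # nucleus sampling
        sorted_probs, sorted_ids = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # keep the smallest prefix of tokens whose cumulative probability exceeds p
        keep = cumulative - sorted_probs < top_p
        filtered = torch.zeros_like(probs)
        filtered[sorted_ids[keep]] = sorted_probs[keep]
        probs = filtered / filtered.sum()                # renormalize over the nucleus

    return torch.multinomial(probs, num_samples=1).item()
```

Greedy decoding corresponds to simply taking `logits.argmax()`; beam search, which tracks several candidate sequences in parallel, is not shown.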
The Creativity-Coherence Tradeoff:
Lower temperatures and smaller k/p values produce more coherent but predictable text. Higher values produce more creative but potentially nonsensical output. The optimal setting depends on the application: summarization wants coherence, creative writing wants variety.
GPT Model Scaling:
| Model | Parameters | Training data | Training compute |
|---|---|---|---|
| GPT-1 | 117M | BookCorpus (5GB) | ~1 petaflop-day |
| GPT-2 | 1.5B | WebText (40GB) | ~10 petaflop-days |
| GPT-3 | 175B | Filtered internet (570GB) | ~3,640 petaflop-days |
The jump from GPT-2 to GPT-3 represents roughly 350x more compute.
The GPT-3 paper ends with a reflection on what few-shot learning means. If models can learn tasks from a handful of examples in their context, the traditional paradigm of collecting large labeled datasets for each task may become obsolete. One model, many tasks, no task-specific training. This vision - which seemed speculative in 2020 - would prove prescient.
But a deeper question emerges: why does scale work? Why do bigger models keep getting better? Is this a temporary phenomenon, or something more fundamental? A group of researchers at OpenAI is about to discover something remarkable: the relationship between scale and performance is not random. It follows precise mathematical laws.
13.3 Scaling Laws
In January 2020, a team at OpenAI publishes a paper that will reshape how we think about language models. Jared Kaplan, Sam McCandlish, Tom Henighan, and colleagues train dozens of language models of varying sizes on varying amounts of data with varying compute budgets. Their goal: understand how performance scales with resources.
What they discover is striking. The relationship between model size and performance is not erratic or model-specific. It follows a smooth power law:
\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}\]
Where \(L\) is the loss (how wrong the model’s predictions are), \(N\) is the number of parameters, and \(N_c\) and \(\alpha_N\) are constants fit from data. The exponent \(\alpha_N\) is approximately 0.076 - small but relentless. Double the parameters, and the loss decreases by about 5%. Double again, another 5%. The improvement never stops; it just requires exponentially more parameters for linear gains.
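To make the 5% figure concrete, compare the predicted loss at \(2N\) parameters with the loss at \(N\); the constant \(N_c\) cancels:

\[\frac{L(2N)}{L(N)} = \frac{(N_c / 2N)^{\alpha_N}}{(N_c / N)^{\alpha_N}} = 2^{-\alpha_N} = 2^{-0.076} \approx 0.949\]

Doubling the parameter count multiplies the loss by roughly 0.95 - a 5% reduction - regardless of where on the curve you start.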
The same power law relationship holds for data:
\[L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}\]
More training data yields lower loss, predictably, following another power law with \(\alpha_D \approx 0.095\).
And for compute:
\[L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}\]
Total training compute - parameters times data times iterations - also predicts loss via a power law with \(\alpha_C \approx 0.050\).
These are not loose correlations. The fits are remarkably tight across seven orders of magnitude of compute. Models ranging from 768 to 1.5 billion non-embedding parameters, trained for varying durations, all fall on the same smooth curves. This is not a pattern; it is a law.
The implications are profound. If you want a model that achieves a certain loss, you can calculate how much compute you need. If you have a fixed compute budget, you can predict how well your model will perform. The mysticism around deep learning - “we don’t know why it works” - gives way to precise empirical regularities. We may not understand why the laws hold, but we can use them to plan.
The Kaplan paper also makes a recommendation: if compute is limited, prioritize model size. Train the largest model you can afford and stop well before convergence, rather than training a smaller model to convergence on more data. Bigger is better. The gains from more parameters outweigh the gains from more training steps. This advice is taken to heart. Models grow rapidly: 1.5 billion, 11 billion, 175 billion parameters.
Then in March 2022, a team at DeepMind challenges this orthodoxy. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, and colleagues publish “Training Compute-Optimal Large Language Models” - the paper that introduces Chinchilla.
Hoffmann’s team reexamines the scaling laws with a crucial question: given a fixed compute budget, what is the optimal allocation between model size and training data? Should you train a huge model for a few epochs, or a smaller model for many epochs?
Their answer surprises the field: previous models, including GPT-3, had been trained wrong. They were too large and undertrained. For any given compute budget, there is an optimal balance between parameters and data. Both should scale together:
\[N_{opt} \propto C^{0.5}\] \[D_{opt} \propto C^{0.5}\]
Parameters and data should grow at the same rate as the square root of compute. GPT-3’s 175 billion parameters, trained on 300 billion tokens, was far from optimal. It should have been closer to 70 billion parameters trained on far more data.
Chinchilla proves the point. It has “only” 70 billion parameters - less than half of GPT-3 - but is trained on 1.4 trillion tokens, far more than GPT-3’s 300 billion. Despite being smaller, Chinchilla outperforms GPT-3 on nearly every benchmark. The same compute, better results, just by allocating it differently.
Technical Box: Scaling Law Formulas
Kaplan et al. (2020) Scaling Laws:
Loss as a function of parameters (N), data (D), and compute (C):
\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} \quad \text{where } \alpha_N \approx 0.076, N_c \approx 8.8 \times 10^{13}\]
\[L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \quad \text{where } \alpha_D \approx 0.095, D_c \approx 5.4 \times 10^{13}\]
\[L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C} \quad \text{where } \alpha_C \approx 0.050, C_c \approx 3.1 \times 10^8\]
Combined (when both N and D are bottlenecks):
\[L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \left(\frac{D_c}{D}\right)\right]^{\alpha_D}\]
Hoffmann et al. (2022) Chinchilla Optimal Scaling:
Given compute budget C (in FLOPs), the optimal model size and data are:
\[N_{opt} = A \cdot C^a \quad \text{where } a \approx 0.5\] \[D_{opt} = B \cdot C^b \quad \text{where } b \approx 0.5\]
This implies: parameters and data tokens should be scaled equally.
A rule of thumb from Chinchilla: train on approximately 20 tokens per parameter.
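A minimal sketch of how these relations become a planning tool. It assumes the standard approximation from the scaling-law literature that training cost is roughly C ≈ 6·N·D FLOPs (about six floating-point operations per parameter per token); combined with the 20-tokens-per-parameter rule above, a compute budget pins down both N and D.

```python
import math

def chinchilla_allocation(compute_flops):
    """Split a compute budget between parameters and training tokens using
    C ~ 6 * N * D and the rule of thumb D ~ 20 * N."""
    n_params = math.sqrt(compute_flops / (6 * 20))   # solve C = 6 * N * (20 * N) for N
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: a budget of roughly 5.8e23 FLOPs (about Chinchilla's)
n, d = chinchilla_allocation(5.8e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")   # ~70B parameters, ~1.4T tokens
```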
| Model | Parameters | Training tokens | Tokens per parameter | Verdict |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | Undertrained |
| Gopher | 280B | 300B | 1.1 | Undertrained |
| Chinchilla | 70B | 1.4T | 20 | Optimal |
Why Scaling Laws Matter:
- Predictability: Can forecast model performance before training
- Resource allocation: Know how to spend compute budget
- Research planning: Estimate what’s achievable with future resources
- No magic: Performance improvements are systematic, not lucky
Caveats:
- Laws describe pre-training loss, not downstream task performance
- Emergent capabilities may not follow smooth scaling
- Laws may change for different architectures or data distributions
- Extrapolation beyond observed scale is uncertain
The scaling laws have broader intellectual significance. They vindicate what Rich Sutton called “the bitter lesson”: general methods that leverage computation beat specialized methods that leverage human knowledge. The transformer has almost no built-in linguistic knowledge - no grammar rules, no semantic theories, no discourse models. It learns everything from data. And with enough data and compute, it learns extraordinarily well.
This is bittersweet for many researchers. Years of work on clever architectures, specialized inductive biases, and domain-specific adaptations seem less important when raw scale solves problems more effectively. The lesson is bitter because it suggests our clever ideas were less valuable than simply waiting for more compute.
But there is another way to view the situation. The scaling laws give us a map. They tell us that if we can muster enough compute, we can build models of unprecedented capability. They demystify progress: we are not waiting for conceptual breakthroughs or lucky discoveries. We are on a predictable trajectory. The question is not whether more capable models are possible, but whether we can gather the resources to train them.
And gather them we do. Following Chinchilla, the field recalibrates. Models still grow, but now they are trained on correspondingly more data. LLaMA from Meta trains a 65 billion parameter model on 1.4 trillion tokens. GPT-4 - whose details remain undisclosed - is rumored to be even larger, trained on even more data. The scaling laws predicted this arms race, and the arms race validates the laws.
But scaling laws describe pre-training loss, not usefulness. A model with lower loss is better at predicting text, but that does not make it helpful, harmless, or honest. GPT-3 can generate remarkably fluent text, but it can also generate toxic content, confidently state falsehoods, or produce harmful instructions. It has no values, no judgment, no understanding of what humans actually want from it.
We have built powerful engines of language. They can generate, summarize, translate, answer questions, write code. But they are, at their core, next-word predictors. They optimize for plausibility, not truth. For fluency, not helpfulness. For predicting what a human might say, not what a human should hear.
As we close this chapter, we stand at an inflection point. The transformer architecture, combined with massive scale, has produced language models of remarkable capability. BERT demonstrated that bidirectional pre-training creates representations that transfer to virtually any language task. GPT showed that autoregressive models, scaled up dramatically, develop emergent abilities to learn from examples in their context. The scaling laws revealed that this progress is not random - it follows predictable power laws that suggest further improvement with more resources.
But capability is not alignment. Power is not wisdom. These models predict what text comes next, but they do not understand what humans need. They can write convincingly about anything, including things that are false, harmful, or misleading. They have no intrinsic motivation to be helpful rather than harmful, truthful rather than plausible.
The next chapter takes on this challenge directly. How do we take a powerful but indifferent text predictor and shape it into something genuinely useful? The answer will involve a technique that seems almost paradoxically simple: asking humans what they want, and training the model to provide it. This is reinforcement learning from human feedback - RLHF - and it will transform language models from impressive curiosities into tools that hundreds of millions of people use every day.
Chapter Notes
Primary Sources
- Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). “Improving Language Understanding by Generative Pre-Training.” OpenAI Technical Report.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). “Language Models are Unsupervised Multitask Learners.” OpenAI Technical Report.
- Brown, T., et al. (2020). “Language Models are Few-Shot Learners.” Advances in Neural Information Processing Systems 33.
- Kaplan, J., et al. (2020). “Scaling Laws for Neural Language Models.” arXiv preprint arXiv:2001.08361.
- Hoffmann, J., et al. (2022). “Training Compute-Optimal Large Language Models.” arXiv preprint arXiv:2203.15556.
Further Reading
- “The Illustrated BERT” by Jay Alammar - visual walkthrough of BERT’s architecture and pre-training
- “GPT-3: Language Models are Few-Shot Learners” blog post by OpenAI - accessible summary of GPT-3 capabilities
- “Scaling Laws for Neural Language Models” - the original Kaplan et al. paper, readable and well-illustrated
- “An Empirical Analysis of Compute-Optimal Large Language Model Training” - DeepMind’s Chinchilla paper
- “The Bitter Lesson” by Rich Sutton - the philosophical context for why scale beats cleverness