Part IV: The Age of Transformers (2017-Present)


Chapter 12: Attention Is All You Need

Central Question: Can we do better than recurrence?


We have arrived at a watershed moment. In the previous chapter, we traced the rise of recurrent neural networks and their gated variants - LSTMs and GRUs - watching them conquer one sequence task after another. We celebrated the sequence-to-sequence paradigm that enabled machines to translate languages, summarize documents, and answer questions. But we also confronted an uncomfortable truth: the bottleneck problem. No matter how sophisticated our encoder became, it had to compress an entire input sequence - a sentence, a paragraph, an entire document - into a single fixed-length vector. This felt fundamentally wrong, like asking someone to remember a novel by memorizing a single sentence.

In this chapter, we witness the birth of an architecture that will reshape the entire field of artificial intelligence. We begin with a seemingly modest proposal from 2014: what if the decoder could look back at the input? And we end with a bold declaration from 2017: attention is all you need. Between these two moments lies one of the most consequential intellectual journeys in the history of computing.


12.1 The Attention Mechanism

The year is 2014, and the sequence-to-sequence revolution is in full swing. Ilya Sutskever and his colleagues at Google have just demonstrated that neural networks can translate between languages with surprising fluency. But a nagging problem remains. Consider translating a long German sentence into English. The encoder processes each German word, updating its hidden state, until it reaches the final word. At this moment, the entire meaning of the sentence must somehow be captured in a single vector of perhaps 256 or 512 dimensions. This vector is then handed to the decoder, which must reconstruct the full English translation from this compressed representation alone.

For short sentences, this works reasonably well. But as sentences grow longer, performance degrades. The bottleneck is simply too tight. Important information gets lost in compression. The model struggles particularly with rare words and complex syntactic structures that appear early in long sentences - by the time the encoder finishes, their traces in the hidden state have been overwritten by later words.

At the University of Montreal, a PhD student named Dzmitry Bahdanau is thinking about this problem. Working with Kyunghyun Cho and Yoshua Bengio, he asks a simple but profound question: why does the decoder have to rely on just one vector? What if, at each step of generating the output, the decoder could “look back” at all the hidden states the encoder produced?

This is the key insight behind attention. Instead of forcing all information through a single bottleneck, we allow the decoder to access the full sequence of encoder hidden states. But we do not want it to simply concatenate all these states - that would create an unwieldy input. Instead, we want the decoder to learn which parts of the input are relevant for each part of the output.

Consider translating “The black cat sat on the mat” into French. When the decoder is about to generate “noir” (black), it should focus on “black” in the input. When generating “chat” (cat), it should attend to “cat.” When generating “tapis” (mat), it should look at “mat.” The word “sat” might be relevant throughout, helping determine verb tense and sentence structure. Attention provides a mechanism for exactly this kind of selective focus.

Let us build an intuition for how attention works. Imagine you are looking up a word in a dictionary. You have a query - the word you want to find - and the dictionary contains many keys (the entries) with associated values (the definitions). You scan through the keys until you find one that matches your query, then retrieve the corresponding value. Standard dictionary lookup is “hard” - you either find an exact match or you do not.

Attention is like a “soft” dictionary lookup. Instead of finding a single exact match, we compare our query against all keys simultaneously, computing a similarity score for each. These scores tell us how relevant each key is to our query. We then take a weighted average of all the values, where the weights come from these similarity scores. Keys that closely match our query contribute more to the output; keys that do not match contribute less.

In the context of seq2seq translation, the query is the current decoder state (what we are trying to generate), the keys are the encoder hidden states (the input representations), and the values are also the encoder hidden states. We compare the decoder state against each encoder state, compute attention weights, and produce a weighted combination of encoder states - a “context vector” - that captures the relevant parts of the input for the current decoding step.
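
To make this concrete, here is a minimal NumPy sketch of a single decoding step. It uses a plain dot product as the similarity score for brevity; Bahdanau's original formulation scores each query-key pair with a small feed-forward network, but the overall flow - score, softmax, weighted sum - is the same. All names and sizes are illustrative.

    import numpy as np

    def softmax(z):
        z = z - z.max()              # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # Toy setup: 5 source tokens, encoder hidden size 8
    rng = np.random.default_rng(0)
    encoder_states = rng.normal(size=(5, 8))   # keys and values: one vector per source token
    decoder_state = rng.normal(size=(8,))      # query: the current decoder hidden state

    scores = encoder_states @ decoder_state    # one similarity score per source position
    weights = softmax(scores)                  # attention weights; they sum to 1
    context = weights @ encoder_states         # weighted average of the encoder states

    print(weights.round(3))   # how much each source position contributes
    print(context.shape)      # (8,) - same size as a single encoder state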


Technical Box: Attention Computation

Attention is most compactly expressed in the scaled dot-product form that the transformer would later adopt. (Bahdanau's original formulation scores each query-key pair with a small feed-forward network instead, but the query-key-value structure is identical.)

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

Let us unpack this formula step by step.

Step 1: Compute Similarity Scores

We first compute the dot product between each query and each key: \(QK^T\). If \(Q\) has shape \((n, d_k)\) (n queries, each of dimension \(d_k\)) and \(K\) has shape \((m, d_k)\) (m keys, same dimension), then \(QK^T\) has shape \((n, m)\) - a similarity score for each query-key pair. Higher scores indicate greater similarity.

Step 2: Scale by \(\sqrt{d_k}\)

We divide by \(\sqrt{d_k}\). Why? When \(d_k\) is large, dot products tend to become large in magnitude. Pushed through softmax, large scores produce an extremely peaked distribution - one weight close to 1, the rest close to 0. This creates “hard” attention with vanishingly small gradients for the non-maximum positions. Dividing by \(\sqrt{d_k}\) keeps the scores in a range where softmax produces a smoother distribution and gradients flow more evenly.

To see why: if the components of \(Q\) and \(K\) are independent random variables with mean 0 and variance 1, their dot product has mean 0 and variance \(d_k\). Dividing by \(\sqrt{d_k}\) normalizes the variance to 1.
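
A quick numerical check of this claim, as a throwaway NumPy experiment (the sample size is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    for d_k in (16, 64, 256, 1024):
        q = rng.normal(size=(10_000, d_k))     # random queries: mean 0, variance 1 per component
        k = rng.normal(size=(10_000, d_k))     # random keys
        dots = (q * k).sum(axis=1)             # one dot product per row
        print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
        # the variance of raw dot products grows with d_k; after scaling it stays near 1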

Step 3: Apply Softmax

The softmax function converts raw scores into a probability distribution:

\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}\]

After softmax, each row of our attention matrix sums to 1. These are our attention weights - they tell us, for each query, how much to attend to each key.

Step 4: Weighted Sum of Values

Finally, we multiply the attention weights by the values \(V\). This produces a weighted combination of value vectors, where the weights are the attention scores. If query \(i\) attends strongly to key \(j\), then value \(j\) contributes heavily to output \(i\).

Intuition Summary

  • Query (Q): What we are looking for - “I need information relevant to generating this output”
  • Key (K): What information is available - “Here is what each input position offers”
  • Value (V): The actual content to retrieve - often the same as keys in basic attention

The entire operation is differentiable, so the network can learn what to attend to through standard backpropagation.
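
To make the computation concrete, here is a minimal NumPy implementation of the formula above. The function and variable names are illustrative; production implementations add batching, masking, and dropout.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        """Q: (n, d_k), K: (m, d_k), V: (m, d_v)  ->  output: (n, d_v)."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)       # Steps 1 and 2: similarity scores, scaled
        weights = softmax(scores, axis=-1)    # Step 3: each row sums to 1
        return weights @ V, weights           # Step 4: weighted sum of the values

    # Tiny example: 3 queries attending over 5 key/value pairs
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 16))
    K = rng.normal(size=(5, 16))
    V = rng.normal(size=(5, 32))
    out, w = scaled_dot_product_attention(Q, K, V)
    print(out.shape, w.shape)    # (3, 32) (3, 5)
    print(w.sum(axis=1))         # each row of attention weights sums to 1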


Bahdanau attention, as it came to be known, was a revelation. Applied to machine translation, it improved BLEU scores significantly, especially for longer sentences. But perhaps more importantly, the attention weights themselves became interpretable. Visualizing them revealed soft alignment patterns - the model learned which source words corresponded to which target words without ever being explicitly taught alignment. This was not programmed; it emerged from learning to translate well.

The attention mechanism spread rapidly through the NLP community. It improved not just translation but summarization, question answering, and dialogue systems. Researchers began to realize that attention was not just a patch for the bottleneck problem - it was a fundamental computational primitive for relating pieces of information, regardless of their position in a sequence.

But even with attention, we were still running recurrent networks. The encoder and decoder remained LSTMs or GRUs, processing tokens one at a time. Attention helped the decoder access the full input, but it did not change the fundamental sequential nature of the computation. Training was still slow because of this sequential dependency. A team at Google was about to ask a radical question: what if we could eliminate recurrence entirely?


12.2 The Transformer Architecture

It is June 2017, and a paper appears on arXiv with a title that reads like a manifesto: “Attention Is All You Need.” The authors are a team of eight researchers at Google Brain and Google Research: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. Their proposal is audacious: throw away recurrence entirely. No LSTMs. No GRUs. No sequential processing at all. Build the entire model from attention mechanisms.

The skepticism is understandable. Recurrence seemed essential for processing sequences. The hidden state of an RNN accumulates information as it moves through time, maintaining a kind of running memory. How can you process a sequence without… processing it sequentially?

The answer lies in a new kind of attention: self-attention. In Bahdanau attention, queries come from the decoder and keys/values come from the encoder - two different sequences interacting. Self-attention is attention within a single sequence. Each position in the sequence attends to all positions in the same sequence, including itself. Queries, keys, and values all come from the same source.

Consider the sentence: “The animal didn’t cross the street because it was too tired.” What does “it” refer to? A human reader immediately understands “it” refers to “the animal,” not “the street.” To make this inference, we must relate “it” to other words in the sentence. Self-attention provides a mechanism for every word to directly consider every other word, regardless of distance. The word “it” can attend to “animal” even though they are separated by several positions.

But wait - if we process all positions simultaneously, how does the model know the order of words? “Dog bites man” and “Man bites dog” contain the same words but mean very different things. The transformer solves this with positional encoding: before self-attention, we add information about each token’s position in the sequence directly to its embedding.

Let us walk through the complete transformer architecture, building up from its components.

Input Embeddings and Positional Encoding

Each input token is first converted to a dense vector through an embedding layer - the same technique used in earlier neural language models. But unlike RNNs, which naturally incorporate position through sequential processing, transformers process all positions in parallel. We must explicitly inject positional information.

The original transformer uses sinusoidal positional encodings. For position \(pos\) and dimension \(i\):

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
\[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]

These encodings have elegant properties. Each position has a unique encoding. The encoding for position \(pos + k\) can be expressed as a linear function of the encoding for position \(pos\), allowing the model to learn relative position patterns. And the sinusoidal waves at different frequencies give the model multiple “resolution levels” for positional information.

The positional encoding is simply added to the token embedding, creating a combined representation that carries both semantic and positional information.
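
A direct NumPy translation of these formulas is short enough to show in full. The function name is ours, and it assumes an even \(d_{model}\):

    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
        positions = np.arange(max_len)[:, None]                   # (max_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]                  # the even indices 2i
        angles = positions / np.power(10000.0, dims / d_model)    # pos / 10000^(2i / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)    # even dimensions get sine
        pe[:, 1::2] = np.cos(angles)    # odd dimensions get cosine
        return pe

    pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
    print(pe.shape)    # (50, 512)
    # In the model, this matrix is simply added to the token embeddings:
    # x = token_embeddings + pe[:sequence_length]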

Multi-Head Attention

A single attention operation computes one set of attention weights. But different parts of a sentence might need different types of attention patterns simultaneously. Consider: “The bank approved the loan.” The word “bank” might need to attend to “approved” and to “loan” to resolve that it refers to a financial institution, while also attending to “The” to register that it heads a noun phrase.

Multi-head attention runs several attention operations in parallel, each with its own learned projections:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O\]

where each head is:

\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]

Each head projects the queries, keys, and values into a different subspace using learned weight matrices \(W_i^Q\), \(W_i^K\), \(W_i^V\). The attention is computed in this subspace, and the results from all heads are concatenated and projected back to the model dimension through \(W^O\).

The original transformer uses 8 heads, each operating on \(d_{model}/8 = 64\) dimensions. Different heads can learn different attention patterns: one might capture syntactic relationships, another semantic similarities, another positional patterns. The model learns to distribute its attention capacity across these parallel pathways.
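
The sketch below shows the mechanics of splitting into heads and recombining, reusing the scaled_dot_product_attention function from the earlier example. The weight shapes follow the base model; the initialization scale is arbitrary.

    import numpy as np

    def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
        """Self-attention over x of shape (seq_len, d_model).

        W_q, W_k, W_v, W_o all have shape (d_model, d_model); each head
        works on a d_model / num_heads slice of the projections.
        """
        seq_len, d_model = x.shape
        d_head = d_model // num_heads

        # Project once, then split into heads: (num_heads, seq_len, d_head)
        def split(proj):
            return proj.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

        Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)

        heads = []
        for h in range(num_heads):
            out, _ = scaled_dot_product_attention(Q[h], K[h], V[h])  # defined in the earlier sketch
            heads.append(out)

        concat = np.concatenate(heads, axis=-1)   # (seq_len, d_model)
        return concat @ W_o                       # final output projection

    rng = np.random.default_rng(0)
    d_model, seq_len, num_heads = 512, 10, 8
    x = rng.normal(size=(seq_len, d_model))
    W_q, W_k, W_v, W_o = (rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4))
    print(multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads).shape)   # (10, 512)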

The Feed-Forward Network

After attention, each position passes through a position-wise feed-forward network:

\[\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2\]

This is the same fully connected network applied independently to each position. Typically the inner dimension is larger than the model dimension (2048 vs. 512 in the original paper), so each position's representation is expanded, transformed, and projected back down. The FFN provides nonlinear transformation capacity that attention alone lacks.
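
In code, the position-wise FFN is just two matrix multiplications with a ReLU in between, applied to a whole sequence at once (a minimal sketch; the names and initialization scale are ours):

    import numpy as np

    def position_wise_ffn(x, W1, b1, W2, b2):
        """Apply the same two-layer ReLU network to every position independently.

        x: (seq_len, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model).
        """
        hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU expansion to d_ff
        return hidden @ W2 + b2                 # projection back down to d_model

    rng = np.random.default_rng(0)
    d_model, d_ff, seq_len = 512, 2048, 10
    x = rng.normal(size=(seq_len, d_model))
    W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)
    print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (10, 512)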

Residual Connections and Layer Normalization

Each sub-layer (attention, FFN) is wrapped with a residual connection and layer normalization:

\[\text{LayerNorm}(x + \text{Sublayer}(x))\]

Residual connections, borrowed from computer vision (ResNets), allow gradients to flow directly through the network during backpropagation. Layer normalization stabilizes training by normalizing activations across the feature dimension. Together, these techniques enable training of very deep transformer stacks.


Technical Box: Full Transformer Block

A complete transformer encoder block performs these operations:

Input: x (sequence of embeddings)

1. Self-Attention Sub-layer:
   a. Compute Q = xW^Q, K = xW^K, V = xW^V
   b. Compute attention for each head:
      head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
   c. Concatenate heads and project:
      MultiHead = Concat(head_1, ..., head_h) W^O
   d. Add residual and normalize:
      x' = LayerNorm(x + MultiHead)

2. Feed-Forward Sub-layer:
   a. Apply FFN: FFN(x') = ReLU(x'W_1 + b_1)W_2 + b_2
   b. Add residual and normalize:
      output = LayerNorm(x' + FFN(x'))

Output: output (same shape as input)

Typical Dimensions (Base Model):

  • \(d_{model}\) = 512 (embedding/model dimension)
  • \(h\) = 8 (number of attention heads)
  • \(d_k = d_v\) = 64 (dimension per head)
  • \(d_{ff}\) = 2048 (feed-forward inner dimension)
  • Number of layers: 6

Parameter Count (weight matrices only):

  • Each attention sub-layer: \(4 \times d_{model}^2 \approx 1.0\) million (for \(W^Q\), \(W^K\), \(W^V\), \(W^O\))
  • Each FFN sub-layer: \(2 \times d_{model} \times d_{ff} \approx 2.1\) million
  • Total per encoder block: approximately 3.1 million parameters
  • Full base model (encoder, decoder, and embeddings): approximately 65 million parameters
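
The pseudocode above translates almost line for line into NumPy. The sketch below assumes the multi_head_attention and position_wise_ffn functions from the earlier examples and omits dropout and the learned layer-norm scale and shift; it is meant to show the data flow, not to be a faithful reimplementation.

    import numpy as np

    def layer_norm(x, eps=1e-6):
        """Normalize each position's vector to zero mean and unit variance."""
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def encoder_block(x, p, num_heads=8):
        """One encoder block: self-attention and FFN, each wrapped in
        a residual connection followed by layer normalization."""
        # 1. Self-attention sub-layer
        attn = multi_head_attention(x, p["W_q"], p["W_k"], p["W_v"], p["W_o"], num_heads)
        x = layer_norm(x + attn)          # add residual, then normalize

        # 2. Feed-forward sub-layer
        ffn = position_wise_ffn(x, p["W1"], p["b1"], p["W2"], p["b2"])
        return layer_norm(x + ffn)        # output has the same shape as the input

    rng = np.random.default_rng(0)
    d_model, d_ff, seq_len = 512, 2048, 10
    p = {name: rng.normal(scale=0.02, size=(d_model, d_model))
         for name in ("W_q", "W_k", "W_v", "W_o")}
    p.update(W1=rng.normal(scale=0.02, size=(d_model, d_ff)), b1=np.zeros(d_ff),
             W2=rng.normal(scale=0.02, size=(d_ff, d_model)), b2=np.zeros(d_model))
    x = rng.normal(size=(seq_len, d_model))
    print(encoder_block(x, p).shape)      # (10, 512)

    # Weight-matrix count for one block: 4 * 512**2 + 2 * 512 * 2048 = 3,145,728, about 3.1M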


The Encoder-Decoder Structure

For translation, the transformer maintains the encoder-decoder pattern from seq2seq models. The encoder consists of 6 identical layers, each containing self-attention and a feed-forward network. The encoder processes the entire source sentence simultaneously, with each position attending to all positions in the source.

The decoder also has 6 layers, but with a crucial modification. It contains two types of attention: self-attention over the decoder’s own outputs (so far), and cross-attention from the decoder to the encoder outputs. The cross-attention is exactly like Bahdanau attention - decoder positions query encoder positions to gather relevant source information.

Masked Self-Attention in the Decoder

There is a subtle but critical detail in decoder self-attention. During training, we know the entire target sequence. But during inference, we generate one token at a time - we cannot look at future tokens because they do not exist yet. To preserve this autoregressive property during training, we mask the attention scores: the score from position \(i\) to any position \(j > i\) is set to negative infinity, so that its weight becomes zero after the softmax.

This masking ensures that the prediction for position \(t\) depends only on positions \(1, ..., t-1\). Without this mask, the model would “cheat” during training by looking at the answer, and then fail catastrophically during inference when that information is unavailable.
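
In code, the mask is an upper-triangular matrix of "forbidden" positions applied to the scores before the softmax. This sketch adapts the earlier attention example (reusing its softmax helper); in a real decoder the same mask is applied inside every head.

    import numpy as np

    def causal_self_attention(Q, K, V):
        """Scaled dot-product attention in which position i may only
        attend to positions j <= i."""
        n, d_k = Q.shape
        scores = Q @ K.T / np.sqrt(d_k)                    # (n, n) raw scores
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
        scores = np.where(mask, -np.inf, scores)           # block attention to future positions
        weights = softmax(scores, axis=-1)                 # masked entries become exactly 0
        return weights @ V, weights

    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 16))              # 6 decoder positions, dimension 16
    out, w = causal_self_attention(x, x, x)
    print(np.triu(w, k=1).max())              # 0.0 - no weight ever lands on a future position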

The transformer paper demonstrated state-of-the-art results on English-to-German and English-to-French translation. But the authors knew they had created something bigger than a translation model. Their paper ends with a prescient note: “We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio, and video.”

They could not have known just how prophetic these words would prove to be.


12.3 Why Transformers Won

The transformer was not just a new architecture; it was the right architecture for the computational era we were entering. To understand its dominance, we must understand why it succeeded where recurrent networks increasingly struggled.

Parallelization

The fundamental limitation of RNNs is their sequential nature. To compute the hidden state at position \(t\), you must first compute the hidden state at position \(t-1\), which requires position \(t-2\), and so on. This creates a chain of dependencies that cannot be parallelized across time steps. Training an RNN on a sequence of length 100 requires 100 sequential operations, regardless of how many processors you have.

Transformers break this chain. Self-attention relates all positions to all positions in a single operation - a massive matrix multiplication. Computing attention over a sequence of length 100 is a single parallel operation, not 100 sequential ones. On modern GPUs with thousands of cores designed for parallel matrix operations, this difference is transformative.

GPU Utilization

Graphics Processing Units (GPUs) were designed for rendering - computing many pixel values simultaneously using the same operations. This makes them exceptionally good at matrix multiplication, which is fundamentally parallel: each element of the output matrix can be computed independently. Modern GPUs can perform trillions of floating-point operations per second on large matrix multiplications.

The transformer is, at its core, a sequence of matrix multiplications. Attention is matrix multiplication. The feed-forward network is matrix multiplication. The embeddings are matrix multiplication. This architecture maps almost perfectly onto GPU capabilities. RNNs, with their sequential dependencies and complex cell operations, utilize only a fraction of available GPU capacity.

The practical impact is dramatic. Training that once took weeks on RNNs can be accomplished in days with transformers. This is not a small improvement - it is the difference between feasible and infeasible, between exploring one idea and exploring dozens.

Scaling Properties

Perhaps most importantly, transformers scale. As we add more data, more parameters, and more compute, performance continues to improve. This scaling behavior is remarkably predictable - researchers would later discover “scaling laws” showing smooth power-law relationships between compute and performance.

RNNs do not scale as gracefully. The sequential bottleneck limits effective batch sizes. The vanishing gradient problem, despite LSTM’s mitigations, still constrains how much long-range information can be captured. Attempts to make RNNs larger often hit diminishing returns or instabilities.

Transformers, by contrast, seemed to eat compute for breakfast. Double the layers, double the heads, double the dimensions - performance kept improving. This would prove crucial as the field moved toward ever-larger models.

The Bitter Lesson

In 2019, Rich Sutton, a pioneer of reinforcement learning, published a short essay called “The Bitter Lesson.” His thesis: throughout AI history, methods that leverage computation scale better than methods that leverage human knowledge. Hand-crafted features, clever architectures, domain-specific inductive biases - these provide short-term gains but are eventually overtaken by simpler methods that can exploit more compute.

The transformer is a poster child for the bitter lesson. It has almost no sequence-specific inductive biases. It does not know that words come in order until we tell it through positional encoding. It does not know about syntax, semantics, morphology, or phonology. It simply learns to attend - to relate pieces of information, wherever they appear.

This simplicity is its strength. The transformer makes minimal assumptions about its input, which means it can learn whatever patterns are present in the data. Give it enough data and compute, and it learns what took linguists centuries to formalize. Give it more data and compute, and it learns things linguists never noticed.

The End of RNN Dominance

The transformer did not immediately replace RNNs everywhere. For some tasks, especially those with very long sequences or limited compute, RNNs remained competitive. But the trend was unmistakable. By 2019, the top results on almost every major NLP benchmark used transformers. Researchers who had spent years developing better RNN variants found their improvements irrelevant - the entire paradigm had shifted beneath them.

Some viewed this as a loss - all that accumulated knowledge about gating, architecture search, and regularization for RNNs, rendered moot by a single paper. Others saw liberation - we could stop fighting RNN limitations and start exploring what was possible with this new, more capable substrate.

What Comes Next

The transformer paper was published in June 2017. Within two years, it would spawn two models that would reshape the entire field of AI: BERT from Google (October 2018) and GPT-2 from OpenAI (February 2019). These models would demonstrate that transformers, trained on massive text corpora, could learn rich representations of language that transferred to virtually any downstream task.

The era of large language models was about to begin.

As we close this chapter, it is worth pausing to appreciate the magnitude of what has occurred. A team of researchers, building on ideas about attention that had emerged just three years earlier, proposed an architecture that would become the foundation for artificial intelligence systems affecting billions of people. The transformer is not just a neural network architecture; it is the computational substrate on which modern AI is built.

In the next chapter, we follow the transformer into the age of pre-trained language models. We watch as researchers discover that scale and data, combined with this elegant architecture, produce capabilities no one expected. The question “Can we do better than recurrence?” has been answered. Now a new question emerges: just how far can this approach take us?


Chapter Notes

Key Figures

  • Dzmitry Bahdanau - PhD student at University of Montreal, first author on the foundational attention paper
  • Kyunghyun Cho - Researcher at Montreal, co-author on attention paper, also known for GRU
  • Yoshua Bengio (1964-) - Professor at Montreal, Turing Award winner, attention paper co-author
  • Ashish Vaswani - Google Brain researcher, first author of “Attention Is All You Need”
  • Noam Shazeer - Google researcher, key transformer architect, later co-founded Character.AI
  • Jakob Uszkoreit - Google researcher, transformer co-author
  • Illia Polosukhin - Google researcher, transformer co-author, later co-founded NEAR Protocol
  • Llion Jones - Google researcher, transformer co-author, later co-founded Sakana AI
  • Aidan Gomez - Google researcher, transformer co-author, later co-founded Cohere
  • Niki Parmar - Google researcher, transformer co-author
  • Lukasz Kaiser - Google Brain researcher, transformer co-author

Primary Sources

  • Bahdanau, D., Cho, K., & Bengio, Y. (2014). “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv preprint arXiv:1409.0473.
  • Vaswani, A., et al. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.
  • Sutton, R. (2019). “The Bitter Lesson.” Blog post at incompleteideas.net.

Further Reading

  • “The Illustrated Transformer” by Jay Alammar - excellent visual walkthrough of the architecture
  • “The Annotated Transformer” from Harvard NLP - the paper implemented in PyTorch with commentary
  • “Formal Algorithms for Transformers” by Mary Phuong and Marcus Hutter - rigorous mathematical treatment