Central Question: How can networks handle sequential data?
The deep learning revolution, as we saw in Chapter 10, began with images. Convolutional neural networks learned to see: recognizing faces, identifying objects, reading handwritten digits. The ImageNet moment of 2012 demonstrated that neural networks could match and exceed human performance on visual tasks that had seemed impossibly difficult just years before.
But vision is only part of intelligence. Human cognition unfolds in time. We speak in sequences of words. We listen to melodies that derive meaning from the order of notes. We read stories where the beginning shapes our understanding of the end. We make decisions based on histories of events stretching back hours, days, or years.
How can neural networks handle this temporal dimension of intelligence? A convolutional network expects an input of fixed size - an image with a certain number of pixels, arranged in a grid. But a sentence might be five words or fifty. A conversation might last a minute or an hour. A patient’s medical history might span decades of records. The very notion of “input size” becomes problematic when dealing with sequences.
This chapter traces the quest to give neural networks memory - the ability to process information that unfolds through time, maintaining context from the past to inform understanding of the present. We begin with recurrent neural networks, an elegant architecture that reuses the same weights at each time step. We confront the vanishing gradient problem in a new guise, discovering that learning long-range dependencies is profoundly difficult. We then meet the Long Short-Term Memory cell, a baroque but brilliant solution that would dominate sequence modeling for nearly two decades. Finally, we arrive at the sequence-to-sequence paradigm, which enabled neural networks to translate between languages, but also revealed a fundamental bottleneck that would eventually demand a new approach entirely.
11.1 Recurrent Neural Networks
The fundamental insight of recurrent neural networks is simple: what if, instead of processing a fixed input, a network processed one element at a time, maintaining a hidden state that carries information forward from the past?
Consider reading a sentence: “The cat sat on the mat.” A feedforward network would need to receive all six words simultaneously as a single fixed-size input. But that is not how humans read. We process one word at a time, and our understanding evolves as we go. After reading “The cat sat,” we have expectations about what might come next. After reading “on the,” we anticipate a noun. Each word is interpreted in the context of everything that came before.
An RNN mimics this process. At each time step \(t\), the network receives an input \(x_t\) (say, a word embedding) and combines it with its hidden state \(h_{t-1}\) from the previous time step. This combination produces a new hidden state \(h_t\) that captures information from both the current input and the accumulated past. We can then use this hidden state to make predictions, or simply pass it forward to process the next element.
The same operation repeats at every time step. The network “unrolls” through time, applying identical weights at each step but maintaining a flow of information through the hidden state. This weight sharing is crucial: it means the network can, in principle, handle sequences of any length. We do not need separate parameters for position 1, position 2, position 100. The same learned function applies everywhere.
Picture the network as a loop. At time step 1, input \(x_1\) enters, combines with an initial hidden state \(h_0\) (often set to zeros), and produces hidden state \(h_1\). At time step 2, input \(x_2\) enters, combines with \(h_1\), and produces \(h_2\). This continues until we reach the end of the sequence. If we need to produce an output at each step - for instance, predicting the next word - we can add an output layer on top of each hidden state.
The elegance of this design lies in its simplicity. With just a few weight matrices and a nonlinearity, we have a general-purpose sequence processor. The hidden state serves as a kind of memory, compressed and transformed at each step, carrying forward whatever information the network has learned is relevant for its task.
Technical Box: RNN Forward and Backward Passes
Forward Pass:
At each time step \(t\), an RNN computes:
\[h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)\]
\[y_t = W_{hy} \cdot h_t + b_y\]
Here:
- \(x_t\) is the input at time \(t\) (dimension \(d_{input}\))
- \(h_t\) is the hidden state at time \(t\) (dimension \(d_{hidden}\))
- \(y_t\) is the output at time \(t\) (dimension \(d_{output}\))
- \(W_{xh}\) transforms input to hidden space (\(d_{hidden} \times d_{input}\))
- \(W_{hh}\) transforms previous hidden to current hidden (\(d_{hidden} \times d_{hidden}\))
- \(W_{hy}\) transforms hidden to output (\(d_{output} \times d_{hidden}\))
- \(b_h\), \(b_y\) are bias vectors
The \(\tanh\) nonlinearity squashes values to the range \((-1, 1)\), allowing the hidden state to represent both positive and negative activations while remaining bounded.
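These equations translate directly into code. The following is a minimal NumPy sketch of the forward pass, looping over a toy sequence; the weight shapes follow the definitions above, while the dimensions, random initialization, and input data are placeholders rather than anything from a real model.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y, h0=None):
    """Run a vanilla RNN over a sequence.

    xs has shape (T, d_input), one row per time step; returns the stacked
    hidden states (T, d_hidden) and outputs (T, d_output).
    """
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:                                # process one element at a time
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)  # h_t from h_{t-1} and x_t
        ys.append(W_hy @ h + b_y)                 # per-step output y_t
        hs.append(h)
    return np.stack(hs), np.stack(ys)

# Toy run: random weights, a "sentence" of 6 input vectors.
rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 8, 16, 5, 6
W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hid))
hs, ys = rnn_forward(rng.normal(size=(T, d_in)), W_xh, W_hh, W_hy,
                     np.zeros(d_hid), np.zeros(d_out))
print(hs.shape, ys.shape)   # (6, 16) (6, 5)
```

Note that the same three weight matrices are reused at every step of the loop: this is the weight sharing that lets one set of parameters handle sequences of any length.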
Unrolling Through Time:
To train the network, we conceptually “unroll” it through all time steps:
         y_1                y_2                      y_T
          ↑                  ↑                        ↑
h_0 → [RNN Cell] → h_1 → [RNN Cell] → h_2 → ... → [RNN Cell] → h_T
          ↑                  ↑                        ↑
         x_1                x_2                      x_T
Each RNN Cell applies the same weights. The unrolled network looks like a very deep feedforward network with shared weights.
Backpropagation Through Time (BPTT):
The loss \(L\) is typically the sum of losses at each time step: \(L = \sum_t L_t\). To compute gradients with respect to the weights, we apply the chain rule through the unrolled network.
For a weight \(W_{hh}\), the gradient involves contributions from every time step:
\[\frac{\partial L}{\partial W_{hh}} = \sum_t \frac{\partial L_t}{\partial W_{hh}}\]
Each \(\partial L_t / \partial W_{hh}\) requires propagating error backward through all time steps from \(t\) back to \(1\):
\[\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \cdot \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}\]
The Product of Jacobians:
The term \(\partial h_i / \partial h_{i-1}\) is a Jacobian matrix. For our RNN:
\[\frac{\partial h_i}{\partial h_{i-1}} = \text{diag}(\tanh'(z_i)) \cdot W_{hh}\]
where \(z_i\) is the pre-activation and \(\tanh'(z) = 1 - \tanh^2(z)\).
The gradient flowing from time \(t\) back to time \(k\) involves a product of \(t - k\) such terms. This is where trouble arises.
The Vanishing Gradient Problem, Again
In Chapter 8, we encountered the vanishing gradient problem in deep feedforward networks. Error signals, propagated backward through many layers, tended to shrink exponentially as they passed through successive nonlinearities. In recurrent networks the problem reappears with particular severity: because the same weights are applied at every time step, a long sequence behaves like an extremely deep network whose layers all share a single set of weights.
Consider a sequence of 100 time steps. To learn that an event at time step 5 affects the output at time step 100, the gradient must flow backward through 95 multiplicative factors. Each factor involves the derivative of tanh (maximum value 1, typically less) multiplied by the recurrent weight matrix \(W_{hh}\).
If the largest singular value of \(W_{hh}\) is less than 1, these products shrink exponentially. A gradient that starts at 1.0 might become \(10^{-30}\) by the time it reaches the early time steps. Such tiny gradients produce negligible weight updates. The network cannot learn that early inputs matter.
If the largest singular value exceeds 1, we have the opposite problem: gradients explode, growing to infinity and causing numerical overflow. Training becomes unstable, with weights diverging wildly.
There is, in principle, a narrow regime where the spectral radius of \(W_{hh}\) sits at exactly 1, but weight updates during learning make that balance impossible to hold precisely. In practice, vanilla RNNs are caught between vanishing and exploding gradients, on a knife-edge of stability that is difficult to find and harder to keep.
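A small numerical experiment makes the two regimes concrete. The sketch below is an illustration under simplifying assumptions: an orthogonal recurrent matrix rescaled to spectral radius 0.9 or 1.1, and near-zero pre-activations so that \(\tanh'(z) \approx 1\), which isolates the effect of \(W_{hh}\) in the product of Jacobians from the technical box.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthogonal matrix: all singular values equal 1
z = 0.01 * rng.normal(size=d)                  # tiny pre-activations, so tanh'(z) is close to 1

for scale in (0.9, 1.1):
    W_hh = scale * Q                           # spectral radius (and norm) equal to `scale`
    J = np.eye(d)
    for _ in range(95):                        # gradient path from t = 100 back to k = 5
        J = np.diag(1.0 - np.tanh(z) ** 2) @ W_hh @ J
    print(f"scale {scale}: norm of Jacobian product ≈ {np.linalg.norm(J, 2):.1e}")

# Typical output: roughly 4e-05 for scale 0.9 (vanishing) and 8e+03 for scale 1.1 (exploding).
```

With realistic pre-activations the \(\tanh'\) factors are strictly less than 1, which only makes the shrinkage worse.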
The practical consequence is severe: vanilla RNNs cannot learn long-range dependencies. Ask an RNN to remember something from 50 time steps ago, and the gradient signal simply does not make it back that far. The network forgets.
This limitation was understood early. Sepp Hochreiter, a German graduate student at the Technical University of Munich, analyzed this problem rigorously in his 1991 diploma thesis, proving that the vanishing gradient problem made long-term memory fundamentally difficult for standard RNNs. It was not a matter of finding better hyperparameters or training longer. The architecture itself was broken for anything but short-range dependencies.
Language, unfortunately, is full of long-range dependencies. Consider: “The trophy didn’t fit in the suitcase because it was too big.” Understanding what “it” refers to requires remembering information from several words back and reasoning about relative sizes. Or consider: “The man who the professor who the student met criticized left.” Understanding this sentence requires keeping track of nested dependencies across many words.
If RNNs could not handle such dependencies, they could not truly understand language. Something had to change.
11.2 LSTM: Learning to Remember
In 1997, Sepp Hochreiter and Jürgen Schmidhuber, who had supervised his diploma thesis, published a paper titled “Long Short-Term Memory” in the journal Neural Computation. The paper was dense, mathematical, and addressed an obscure problem that most researchers had abandoned years earlier. It would take nearly fifteen years for the world to recognize its importance.
Hochreiter and Schmidhuber’s insight was architectural. The problem with vanilla RNNs was that the hidden state had to do too many jobs at once. It had to store long-term information, process short-term inputs, and do so through a transformation that inevitably caused gradients to vanish or explode. What if we separated these functions?
The LSTM cell introduces a new component: the cell state \(C_t\). Think of the cell state as a conveyor belt running through time. Information can be placed on the belt and ride along unchanged, or it can be modified or removed. Crucially, the cell state is updated through addition rather than multiplication. This additive structure allows gradients to flow backward through time without the compounding decay or explosion that plagues vanilla RNNs.
But if information can flow through unchanged, how does the network decide what to remember, what to forget, and what to output? This is where the gates come in. An LSTM has three gates, each a small neural network that produces values between 0 and 1:
The forget gate decides what information to discard from the cell state. Perhaps we have been tracking the subject of a sentence, but we have reached a period and started a new sentence. Time to clear out the old subject and make room for the new one.
The input gate decides what new information to store in the cell state. We have encountered a new subject - we should remember it.
The output gate decides what to reveal from the cell state as the current hidden state. The cell might store many things, but only some are relevant for the current output.
Each gate looks at the current input \(x_t\) and the previous hidden state \(h_{t-1}\), applies a learned linear transformation, and passes the result through a sigmoid function. The sigmoid produces a value between 0 and 1 for each dimension of the cell state - a “how much to let through” signal.
Technical Box: LSTM Cell Architecture
Gate Computations:
All three gates follow the same pattern - linear transformation of concatenated input and previous hidden state, followed by sigmoid:
Forget gate (what to erase from cell state): \[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]
Input gate (what to write to cell state): \[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\]
Output gate (what to reveal from cell state): \[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\]
Here \(\sigma\) denotes the sigmoid function: \(\sigma(z) = 1/(1 + e^{-z})\), which produces values in \((0, 1)\).
Candidate Cell State:
We also compute a candidate value to potentially add to the cell state: \[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]
The tanh produces values in \((-1, 1)\), allowing both positive and negative updates.
Cell State Update:
The cell state updates through a combination of forgetting and remembering: \[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]
where \(\odot\) denotes element-wise multiplication. The first term keeps parts of the old cell state (controlled by \(f_t\)). The second term adds new information (controlled by \(i_t\)).
Hidden State Output:
Finally, the hidden state is a filtered version of the cell state: \[h_t = o_t \odot \tanh(C_t)\]
The output gate controls which parts of the (tanh-squashed) cell state become visible to the next layer and the next time step.
Why This Solves Vanishing Gradients:
Consider the gradient of \(C_t\) with respect to \(C_{t-1}\): \[\frac{\partial C_t}{\partial C_{t-1}} = f_t\]
This is simply the forget gate value, not a matrix multiplication involving learned weights. If \(f_t \approx 1\), the gradient flows through unchanged. The network can learn to keep the forget gate open for information that should persist, creating a “gradient highway” through time.
Contrast this with vanilla RNNs, where: \[\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(\tanh'(z_t)) \cdot W_{hh}\]
The repeated multiplication by \(W_{hh}\) causes exponential decay or explosion. LSTM’s additive cell state update avoids this trap.
Parameter Count:
An LSTM layer with input dimension \(d_x\) and hidden dimension \(d_h\) has:
- Four weight matrices of shape \((d_h, d_x + d_h)\): one each for \(f\), \(i\), \(o\), and \(\tilde{C}\)
- Four bias vectors of dimension \(d_h\)
Total parameters: \(4 \times d_h \times (d_x + d_h) + 4 \times d_h\)
For \(d_x = d_h = 512\): approximately 2.1 million parameters per layer.
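For readers who find code clearer than equations, here is a minimal NumPy sketch of a single LSTM step following the box above. The weight and bias arguments are assumed to be created elsewhere with the shapes listed under the parameter count; the final lines simply verify that count for \(d_x = d_h = 512\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_o, W_C, b_f, b_i, b_o, b_C):
    """One LSTM step; each W_* has shape (d_h, d_x + d_h) and acts on [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f = sigmoid(W_f @ hx + b_f)             # forget gate: what to erase
    i = sigmoid(W_i @ hx + b_i)             # input gate: what to write
    o = sigmoid(W_o @ hx + b_o)             # output gate: what to reveal
    C_tilde = np.tanh(W_C @ hx + b_C)       # candidate cell state
    C_t = f * C_prev + i * C_tilde          # additive cell-state update
    h_t = o * np.tanh(C_t)                  # filtered hidden state
    return h_t, C_t

# Parameter count for d_x = d_h = 512, matching the figure quoted above.
d_x = d_h = 512
print(4 * d_h * (d_x + d_h) + 4 * d_h)      # 2099200, i.e. about 2.1 million
```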
The LSTM architecture is undeniably complex. Four interacting components (three gates plus the candidate), each with its own weight matrix. Element-wise multiplications at every step. A cell state that is distinct from the hidden state. New practitioners often find LSTMs bewildering.
But this complexity serves a purpose. The vanilla RNN failed because it tried to do everything with a single mechanism. The LSTM succeeds by factoring the problem into pieces: a memory that can persist, and separate learned controllers for reading, writing, and erasing.
The forget gate is particularly crucial. Without it, the cell state would only accumulate information, never clearing old context. Hochreiter and Schmidhuber’s original 1997 paper did not include a forget gate; it was added by Felix Gers, Jürgen Schmidhuber, and Fred Cummins in 2000. This seemingly small modification turned out to be essential for practical performance.
The Long Wait for Recognition
Despite its elegant solution to a fundamental problem, the LSTM languished in relative obscurity for over a decade. The 1997 paper accumulated citations slowly. Most AI researchers in the late 1990s and early 2000s were focused on other approaches: support vector machines, graphical models, kernel methods. Neural networks were unfashionable, and recurrent networks even more so.
The tide began to turn in the early 2010s. As deep learning rose to prominence with the ImageNet breakthrough and the success of deep belief networks, researchers naturally asked: can we apply these ideas to sequences? They rediscovered the LSTM waiting patiently, its theoretical advantages now matched by the computational power needed to train it effectively.
In 2013, Alex Graves and his colleagues at the University of Toronto demonstrated that deep LSTMs could achieve state-of-the-art results on handwriting recognition, speech recognition, and other sequence tasks. The networks could learn to look at an image of handwritten text and produce the text character by character. They could listen to audio waveforms and transcribe speech. The long-range dependencies that had seemed impossible for vanilla RNNs were handled gracefully by LSTM’s gated architecture.
GRUs: A Simpler Alternative
In 2014, Kyunghyun Cho and colleagues proposed the Gated Recurrent Unit (GRU), a simplified variant of the LSTM. The GRU merges the cell state and hidden state into a single representation and uses only two gates instead of three:
Update gate: \[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\]
Reset gate: \[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\]
Candidate hidden state: \[\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])\]
Hidden state update: \[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]
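As with the LSTM, the update translates directly into a few lines of code. Below is a minimal NumPy sketch of one GRU step following these equations; bias terms are omitted to match the formulas above, and the weight arguments are assumed to be created elsewhere.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step; each weight matrix has shape (d_h, d_h + d_x)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                      # update gate
    r = sigmoid(W_r @ hx)                                      # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                    # interpolate old and new
```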
The GRU has fewer parameters than the LSTM and is sometimes easier to train. Empirically, performance differences between LSTMs and GRUs are often small and task-dependent. The GRU’s simplicity made it popular, especially for smaller-scale applications where computational efficiency mattered.
By 2015, LSTMs and GRUs had become the workhorses of neural sequence modeling. They powered speech recognition systems deployed to hundreds of millions of users. They enabled significant improvements in machine translation. They generated text, composed music, and modeled financial time series. The problem of learning long-range dependencies in sequences, unsolved for decades, finally had a practical solution.
11.3 Sequence-to-Sequence
With LSTMs proving capable of modeling sequences, researchers faced the next challenge: transforming one sequence into another. Many important tasks have this structure. Translation converts a sentence in one language to a sentence in another. Summarization compresses a long document into a shorter one. Question answering takes a question and a context and produces an answer. These are all sequence-to-sequence problems.
The difficulty is that input and output sequences may have different lengths. A German sentence of ten words might translate to an English sentence of eight words. A paragraph of 200 words might summarize to a sentence of 20 words. We cannot simply map each input position to a corresponding output position; there is no such correspondence.
In 2014, Ilya Sutskever, Oriol Vinyals, and Quoc Le at Google published a paper that proposed an elegant solution: the encoder-decoder architecture, often called “seq2seq” for short.
The idea is to use two LSTMs. The first, the encoder, reads the input sequence one token at a time, updating its hidden state at each step. When it reaches the end of the input, its final hidden state contains a representation of the entire input sequence, compressed into a single vector. This vector is sometimes called the “context” or “thought” vector.
The second LSTM, the decoder, takes this context vector as its initial hidden state. It then generates the output sequence one token at a time, autoregressively. At each step, the decoder predicts the next output token based on its current hidden state and the previous output token. The decoder continues generating until it produces a special end-of-sequence token.
This separation of concerns is powerful. The encoder’s job is solely to understand the input. The decoder’s job is solely to generate the output, conditioned on the encoder’s understanding. Neither needs to worry about the other’s internal mechanics; they communicate only through the context vector.
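In schematic form, the whole pipeline is just two loops: one that reads, one that writes. The sketch below uses plain tanh RNN cells and greedy decoding to keep things short, where Sutskever and colleagues used deep LSTMs and beam-search decoding; the token ids, embedding table, and random untrained weights are illustrative placeholders only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, vocab_size = 16, 32, 50
EOS = 0                                              # special end-of-sequence token id

# Toy, untrained parameters; a real system would learn all of these.
E = rng.normal(scale=0.1, size=(vocab_size, d_emb))  # embedding table
W_xh_enc = rng.normal(scale=0.1, size=(d_hid, d_emb))
W_hh_enc = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_xh_dec = rng.normal(scale=0.1, size=(d_hid, d_emb))
W_hh_dec = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_out = rng.normal(scale=0.1, size=(vocab_size, d_hid))   # hidden state -> vocabulary logits

def encode(src_tokens):
    """Read the source sequence; the final hidden state is the context vector."""
    h = np.zeros(d_hid)
    for tok in src_tokens:
        h = np.tanh(W_hh_enc @ h + W_xh_enc @ E[tok])
    return h                                         # everything the decoder will ever see

def decode_greedy(context, max_len=20):
    """Generate output tokens one at a time, conditioned only on the context vector."""
    h, tok, out = context, EOS, []                   # decoding conventionally starts from EOS
    for _ in range(max_len):
        h = np.tanh(W_hh_dec @ h + W_xh_dec @ E[tok])
        tok = int(np.argmax(W_out @ h))              # greedy choice of the next token
        if tok == EOS:
            break
        out.append(tok)
    return out

src = [7, 12, 3, 41, 5]                              # a toy "sentence" of token ids
print(decode_greedy(encode(src)))                    # gibberish until trained, but shows the flow
```

Notice that `decode_greedy` never sees `src`: every piece of information about the input has to survive inside the single context vector returned by `encode`.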
Sutskever and colleagues demonstrated their approach on machine translation, training on pairs of English and French sentences. The results were remarkable. Their neural system achieved translation quality approaching that of phrase-based statistical machine translation systems that had been refined over decades. And it did so with a far simpler architecture - no hand-crafted features, no alignment models, no language-specific rules. The network learned to translate from examples alone.
Even more striking were qualitative improvements. The neural translations were often more fluent than statistical translations, even when they contained errors. The network seemed to capture something about the style and flow of natural language that escaped the symbolic rules of earlier systems.
The seq2seq paper ignited an explosion of research. Within months, researchers were applying encoder-decoder architectures to dialogue systems, image captioning (using a CNN encoder instead of an LSTM), text summarization, and dozens of other tasks. The paradigm was general: if you have input-output pairs, you can train a seq2seq model.
The Bottleneck Problem
Yet there was a fundamental limitation hiding in this elegant architecture. All information about the input must pass through the context vector - a single fixed-size vector of perhaps 256 or 512 dimensions. For short inputs, this compression is manageable. But what about a paragraph? A page? An entire document?
Imagine trying to translate a complex 50-word German sentence. The encoder dutifully processes each word, updating its hidden state, until it reaches the end. At that moment, the hidden state must somehow represent everything: the subject, the verb, the objects, the modifiers, the tense, the mood, the subtle interactions between clauses. And all this must fit in 512 numbers.
Now the decoder begins generating English. It produces the first few words well enough, drawing on the most salient aspects of the context vector. But as it continues, the context vector is all it has. If the decoder needs to refer back to the third word of the German input - perhaps an unusual proper noun that must be transliterated precisely - it cannot. That information, if it survived compression at all, is entangled with everything else.
Experiments confirmed what intuition suggested: seq2seq performance degraded significantly on longer sentences. A model that translated short sentences nearly perfectly would produce garbled output on long ones. The bottleneck was real.
Researchers tried various workarounds. Reversing the input sequence (so the first input word was closest to the decoder) helped somewhat. Ensembling multiple models with different random seeds improved robustness. But these were patches, not solutions. The fundamental architecture forced all information through a single point of compression.
The Stage is Set
By 2015, the field was in a curious position. LSTMs had solved the vanishing gradient problem and enabled learning of long-range dependencies. The seq2seq paradigm had unlocked sequence-to-sequence tasks like machine translation. Neural approaches were achieving competitive or superior results on tasks long dominated by hand-engineered systems.
Yet everyone knew the bottleneck was a problem. The field needed a way for the decoder to look back at the full input, not just a compressed summary. It needed a mechanism for selective attention - the ability to focus on relevant parts of the input at each step of generation.
That mechanism was already being developed. In a 2014 paper that appeared on arXiv almost simultaneously with the seq2seq paper, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio proposed a solution: attention. Instead of compressing the entire input into a single vector, their decoder could attend to different parts of the encoder’s output at each generation step. The bottleneck was bypassed.
Attention would prove to be more than just a fix for seq2seq. It was a fundamentally new way of relating pieces of information, regardless of their position in a sequence. Combined with the insight that recurrence itself might be unnecessary, attention would give rise to a new architecture that would reshape the entire field.
But that is the story of Chapter 12.
Where We Stand
In this chapter, we have traced the quest to give neural networks memory. We started with the elegant simplicity of recurrent neural networks, which process sequences one element at a time while maintaining a hidden state. We confronted the vanishing gradient problem and saw how it made long-range dependencies unlearnable.
We then examined the LSTM, Hochreiter and Schmidhuber’s intricate solution that separates memory storage from memory control. Through gates that learn what to remember, forget, and reveal, LSTMs can maintain information across hundreds of time steps. Their success, after more than a decade of obscurity, reminds us that good ideas sometimes wait for their moment.
Finally, we saw how the encoder-decoder paradigm united two LSTMs to transform sequences of one kind into sequences of another. Machine translation, the benchmark that had driven statistical NLP for a generation, fell to neural approaches. But the bottleneck problem remained, limiting how much information could flow from encoder to decoder.
LSTMs and seq2seq brought neural networks to language. They proved that networks could learn the complex, hierarchical, long-range structure of human communication. But they also revealed the limits of sequential processing itself. Recurrence, which had seemed essential for sequences, was becoming a liability - slow to train, difficult to parallelize, and bottlenecked at the interface between encoder and decoder.
Something better was needed. In 2017, it arrived. The next chapter tells the story of attention and the transformer - an architecture that would abandon recurrence entirely and usher in the age of large language models.
The sequential bottleneck is about to break.
Chapter Notes
Primary Sources
- Hochreiter, S., & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation, 9(8), 1735-1780. [The foundational LSTM paper]
- Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). “Learning to Forget: Continual Prediction with LSTM.” Neural Computation, 12(10), 2451-2471. [Adding the forget gate]
- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). “Sequence to Sequence Learning with Neural Networks.” Advances in Neural Information Processing Systems 27. [The seq2seq paradigm]
- Cho, K., et al. (2014). “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.” EMNLP 2014. [Introducing the GRU]
- Graves, A. (2013). “Generating Sequences With Recurrent Neural Networks.” arXiv:1308.0850. [Deep LSTMs for generation]
Further Reading
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 10. MIT Press. [Comprehensive treatment of sequence modeling]
- Olah, C. (2015). “Understanding LSTM Networks.” Blog post at colah.github.io. [Excellent visual explanation of LSTM mechanics]
- Karpathy, A. (2015). “The Unreasonable Effectiveness of Recurrent Neural Networks.” Blog post at karpathy.github.io. [Accessible introduction with striking examples]