Central Question: What changed to make neural networks work?
By the mid-2000s, neural networks had been wandering in the wilderness for nearly two decades. The connectionist revival of 1986 had proven that multi-layer networks could learn. The Universal Approximation Theorem had established their theoretical power. But in practice, neural networks remained niche tools, outperformed on most tasks by support vector machines, random forests, and carefully engineered feature pipelines. The vanishing gradient problem capped network depth at two or three layers. Training was slow, finicky, and unreliable.
And then, in the span of about six years, everything changed.
Between 2006 and 2012, a series of breakthroughs transformed neural networks from an intellectual curiosity into the most powerful learning systems ever built. The transformation required three ingredients that finally came together: hardware capable of massive parallelism, algorithms that could train truly deep networks, and datasets large enough to exploit that depth. When these pieces aligned, the result was not incremental progress but a phase transition. Deep learning went from barely working to dominating nearly every benchmark in sight.
This chapter tells the story of that transformation. We begin with an unlikely source of computational power: video game graphics cards.
10.1 The GPU Revolution
The graphics processing unit, or GPU, was never designed for artificial intelligence. It was designed for video games.
In the 1990s, the video game industry faced a computational crisis. Games were becoming increasingly sophisticated, demanding real-time rendering of complex three-dimensional environments. Every frame required millions of calculations: transforming vertices, computing lighting, applying textures, determining which pixels were visible. A standard CPU, optimized for running sequential instructions very fast, simply could not keep up. Game developers needed something different.
What they needed was parallel processing. Rendering a frame involves performing the same mathematical operations on millions of different data points simultaneously. The color of pixel (100, 200) is independent of the color of pixel (500, 700); both can be computed at the same time if we have enough processors. The video game industry’s solution was to build chips containing not one powerful processor but hundreds or thousands of simpler ones, all working in parallel.
The result was the modern GPU. Where a high-end CPU might have eight or sixteen cores, each capable of complex branching logic and sophisticated caching, a GPU has thousands of cores optimized for one thing: performing the same arithmetic operation on many data points simultaneously. This style of computation, known as Single Instruction Multiple Data (SIMD), is spectacularly well-suited to graphics rendering. Transform a million vertices? Apply the same matrix multiplication to each one, in parallel. Compute lighting for a million pixels? Apply the same shading equation to each one, in parallel.
By the early 2000s, GPUs had become extraordinarily powerful. NVIDIA’s GeForce series and ATI’s Radeon cards were pushing hundreds of gigaflops of compute power, dwarfing what CPUs could achieve for parallel workloads. But this power was locked inside specialized graphics pipelines, accessible only through arcane APIs designed for rendering polygons and filling triangles.
Then, in 2006, NVIDIA released CUDA.
CUDA, which stands for Compute Unified Device Architecture, was a programming platform that opened the GPU to general-purpose computation. For the first time, programmers could write code that ran directly on GPU cores without pretending to render graphics. CUDA provided familiar C-like syntax for expressing parallel algorithms and handled the complex logistics of moving data between CPU memory and GPU memory.
The impact was immediate. Researchers in physics, finance, and scientific computing discovered they could accelerate certain calculations by factors of ten, fifty, even a hundred compared to CPU implementations. Problems that had taken hours now took minutes. Problems that had taken days now took hours. A single desktop workstation with a gaming GPU could outperform a small computing cluster.
Neural network researchers noticed.
Consider what happens during the forward pass of a neural network. At each layer, we compute:
\[\mathbf{a}^{(l)} = \sigma\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)\]
The weight matrix \(\mathbf{W}^{(l)}\) multiplies the activation vector \(\mathbf{a}^{(l-1)}\). For a layer with 1,000 input units and 1,000 output units, this is a million multiply-add operations. For a batch of 100 training examples processed simultaneously, it is 100 million operations. And this is just one layer of one forward pass.
Matrix multiplication is the heart of neural network computation, and matrix multiplication is embarrassingly parallel. To compute the element \((i, j)\) of the output matrix, we take the dot product of row \(i\) of the left matrix with column \(j\) of the right matrix. Every output element can be computed independently of every other. This is exactly the kind of workload GPUs were designed to accelerate.
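To make the arithmetic concrete, here is a minimal NumPy sketch of one layer's forward pass for a batch of examples. The sizes, the sigmoid nonlinearity, and the random initialization are illustrative choices rather than any particular historical implementation; the point is that the whole layer reduces to a single matrix product that a GPU can execute as one parallel kernel.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 1,000 inputs, 1,000 outputs, batch of 100 examples.
rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 1000)) * 0.01   # weight matrix W^(l)
b = np.zeros((1000, 1))                        # bias vector b^(l)
A_prev = rng.standard_normal((1000, 100))      # activations a^(l-1), one column per example

# One layer of the forward pass: 100 million multiply-adds in a single matrix product.
A = sigmoid(W @ A_prev + b)
print(A.shape)  # (1000, 100)
```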
The match between neural networks and GPUs was so natural it seemed almost destined. Both involve performing the same mathematical operations on massive amounts of data in parallel. Both benefit from high memory bandwidth. Both scale naturally: bigger networks mean more parallel computation, and GPUs provide more parallel processors. The video game industry had accidentally built the perfect hardware for artificial intelligence.
Technical Box: CPU vs. GPU Architecture
CPU (Central Processing Unit):
- Few cores (4-16 typical)
- Each core is powerful and versatile
- Optimized for low latency on sequential tasks
- Complex control logic, branch prediction, caching
- High clock speed (3-5 GHz)
- Good at: running operating systems, web servers, general programs

GPU (Graphics Processing Unit):
- Many cores (thousands)
- Each core is simple and specialized
- Optimized for high throughput on parallel tasks
- Simple control logic, massive parallelism
- Lower clock speed (1-2 GHz) per core
- Good at: matrix math, image processing, deep learning
Example Performance Comparison:
Matrix multiplication of two 4096x4096 matrices:
- CPU (Intel i7, optimized): ~10 seconds
- GPU (NVIDIA RTX 3090): ~0.1 seconds
The GPU is not faster at individual operations. It simply does many more operations simultaneously.
Memory Bandwidth:
- CPU memory bandwidth: ~50 GB/s
- GPU memory bandwidth: ~900 GB/s
For data-intensive workloads like neural networks, memory bandwidth often matters more than raw compute.
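The numbers in the box above depend heavily on the specific hardware and libraries involved. A rough way to reproduce the comparison on your own machine is sketched below; it assumes NumPy for the CPU side and, if PyTorch and a CUDA-capable GPU happen to be available, PyTorch for the GPU side.

```python
import time
import numpy as np

N = 4096
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

t0 = time.perf_counter()
C = A @ B                                   # CPU matrix multiply via NumPy's BLAS backend
print(f"CPU: {time.perf_counter() - t0:.3f} s")

try:
    import torch
    if torch.cuda.is_available():
        Ag = torch.from_numpy(A).cuda()
        Bg = torch.from_numpy(B).cuda()
        _ = Ag @ Bg                         # warm-up so setup cost is not timed
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        Cg = Ag @ Bg
        torch.cuda.synchronize()            # wait for the asynchronous kernel to finish
        print(f"GPU: {time.perf_counter() - t0:.3f} s")
except ImportError:
    pass  # PyTorch not installed; CPU timing only
```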
The AI researcher Sara Hooker has called this phenomenon “the hardware lottery.” Some algorithms win not because they are inherently better but because existing hardware happens to favor them. In the 1980s and 1990s, hardware favored algorithms that could be expressed as sequential operations on modest amounts of data. Support vector machines fit this profile. Decision trees fit this profile. Neural networks, which demanded massive parallelism that CPUs could not provide, did not.
The GPU changed the calculus. Suddenly, algorithms that could be expressed as large matrix operations had an enormous advantage. Neural networks, which had been computationally burdensome, became computationally efficient. The same algorithm that had been impractical was now fast enough to be useful.
But hardware alone was not enough. Even with GPU acceleration, deep networks remained difficult to train. The vanishing gradient problem still blocked error signals from reaching early layers. Researchers had the computational power to train deep networks quickly; they just did not know how to train them at all.
That changed in 2006, when Geoffrey Hinton found a way in.
10.2 Hinton’s Deep Belief Networks
Geoffrey Hinton never stopped working on neural networks. Through the dark years of the 1990s and early 2000s, while most of AI chased support vector machines and graphical models, Hinton continued refining his ideas about how the brain might compute. He had moved from Carnegie Mellon to the University of Toronto, where he assembled a small group of students and collaborators who shared his conviction that deep networks held the key to intelligence.
The problem they faced was the same one that had plagued neural networks since backpropagation: depth. A network with twenty layers could, in principle, learn far more sophisticated representations than a network with two. But in practice, twenty-layer networks did not learn anything useful. The gradients vanished. The early layers remained stuck at their random initializations. The networks were deep in architecture but shallow in function.
Hinton’s insight was to sidestep the problem entirely. Instead of trying to train all layers simultaneously with backpropagation, what if we trained them one at a time?
The idea was inspired by a class of models called Boltzmann machines, which Hinton had developed with Terry Sejnowski in the 1980s. A Boltzmann machine is a network of binary units connected by symmetric weights. It learns by adjusting weights to make the network’s statistical behavior match the statistics of the training data. Unlike feedforward networks trained by backpropagation, Boltzmann machines are unsupervised: they learn to model the structure of the input data without requiring explicit labels.
The problem with full Boltzmann machines is that they are computationally intractable. Training them requires running a Markov chain to equilibrium, which can take an impractically long time. But a restricted version, with connections only between visible and hidden units (not among units in the same layer), can be trained efficiently using an algorithm called contrastive divergence.
In 2006, Hinton, along with Simon Osindero and Yee-Whye Teh, published a paper that would reshape the field: “A Fast Learning Algorithm for Deep Belief Nets.” The core idea was elegant. Instead of initializing a deep network randomly and hoping backpropagation could train all layers at once, we could build the network layer by layer:
1. Train a Restricted Boltzmann Machine (RBM) on the raw input data. The hidden units learn to capture statistical structure in the inputs.
2. Use the hidden unit activations as “data” for the next layer. Train another RBM on top of the first.
3. Repeat, stacking RBMs to create a deep network.
4. Finally, add an output layer and fine-tune the entire network using backpropagation.
The crucial innovation is the pre-training phase. Each RBM learns, in an unsupervised way, to build a useful representation of its input. The first layer learns to represent raw pixels. The second layer learns to represent combinations of first-layer features. The third layer learns even higher-level combinations. By the time we add the output layer and run backpropagation, the network already has sensible weights. The gradients no longer need to propagate through randomly initialized layers; they propagate through layers that already encode meaningful features.
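Schematically, the layer-by-layer recipe might be written as the following sketch. It is simplified in two ways: `train_rbm` stands in for the contrastive divergence training routine (a minimal version is sketched after the technical box below), and feeding the deterministic hidden probabilities upward as the next layer's input is one common simplification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_stack(data, layer_sizes, train_rbm):
    """Greedy layer-wise pre-training: train one RBM per layer, then feed its
    hidden activations upward as the 'data' for the next RBM.

    `train_rbm(visible_data, n_hidden)` is assumed to return (W, hidden_biases),
    with W of shape (n_visible, n_hidden)."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(layer_input, n_hidden)
        weights.append((W, b))
        # Deterministic up-pass: hidden probabilities become the next layer's input.
        layer_input = sigmoid(layer_input @ W + b)
    # After this loop, add an output layer and fine-tune the whole stack with backprop.
    return weights
```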
Why does this work? The intuition is that greedy layer-wise pre-training provides a good initialization. Training a deep network from random weights is like searching for a needle in a haystack the size of a galaxy. The loss landscape has countless local minima, saddle points, and plateaus. Gradient descent starting from a random point has little hope of finding a good solution. But pre-training positions us in a good region of weight space from the start. We are no longer searching randomly; we are starting from a sensible place and refining.
The 2006 paper demonstrated that this approach could train networks with many layers on real tasks. Hinton and his collaborators showed state-of-the-art results on the MNIST digit recognition benchmark using networks far deeper than anyone had successfully trained before. The community took notice.
Technical Box: Restricted Boltzmann Machines
Structure: An RBM has two layers: visible units \(\mathbf{v}\) and hidden units \(\mathbf{h}\). Connections exist only between layers, not within them.
Energy Function: \[E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^T\mathbf{v} - \mathbf{b}^T\mathbf{h} - \mathbf{v}^T\mathbf{W}\mathbf{h}\]
where \(\mathbf{a}\) and \(\mathbf{b}\) are bias vectors and \(\mathbf{W}\) is the weight matrix.
Probability Distribution: \[P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}\]
The partition function \(Z\) normalizes the distribution.
Conditional Independence: Because there are no within-layer connections:
- \(P(h_j = 1 | \mathbf{v}) = \sigma(b_j + \mathbf{W}_{:,j}^T \mathbf{v})\)
- \(P(v_i = 1 | \mathbf{h}) = \sigma(a_i + \mathbf{W}_{i,:} \mathbf{h})\)
This allows efficient sampling: we can sample all hidden units in parallel given the visible units, and vice versa.
Contrastive Divergence Training:
1. Start with a training example \(\mathbf{v}^{(0)}\)
2. Sample hidden units: \(\mathbf{h}^{(0)} \sim P(\mathbf{h} | \mathbf{v}^{(0)})\)
3. Sample visible units: \(\mathbf{v}^{(1)} \sim P(\mathbf{v} | \mathbf{h}^{(0)})\)
4. Sample hidden units: \(\mathbf{h}^{(1)} \sim P(\mathbf{h} | \mathbf{v}^{(1)})\)
5. Update weights: \(\Delta \mathbf{W} = \eta \left( \mathbf{v}^{(0)} \mathbf{h}^{(0)T} - \mathbf{v}^{(1)} \mathbf{h}^{(1)T} \right)\)
The “positive phase” \(\mathbf{v}^{(0)} \mathbf{h}^{(0)T}\) pulls the model toward the data. The “negative phase” \(\mathbf{v}^{(1)} \mathbf{h}^{(1)T}\) pushes it away from its own samples.
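As a concrete illustration, here is a minimal NumPy sketch of CD-1 for a binary RBM, following the five steps above one training example at a time. The learning rate, epoch count, and the bias updates (which the box omits but which follow the same positive-minus-negative pattern) are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, lr=0.1, seed=0):
    """One-step contrastive divergence (CD-1) for a binary RBM.

    `data` is an (n_samples, n_visible) array of 0/1 values. Returns the
    weight matrix W (n_visible x n_hidden) and the hidden biases b."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = rng.standard_normal((n_visible, n_hidden)) * 0.01
    a = np.zeros(n_visible)          # visible biases
    b = np.zeros(n_hidden)           # hidden biases

    for _ in range(epochs):
        for v0 in data:
            # Positive phase: sample hidden units given the data (step 2).
            ph0 = sigmoid(b + v0 @ W)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            # Negative phase: reconstruct visible units, then resample hidden (steps 3-4).
            pv1 = sigmoid(a + W @ h0)
            v1 = (rng.random(n_visible) < pv1).astype(float)
            ph1 = sigmoid(b + v1 @ W)
            h1 = (rng.random(n_hidden) < ph1).astype(float)
            # Updates: pull toward the data, push away from the model's own samples (step 5).
            W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
            a += lr * (v0 - v1)
            b += lr * (h0 - h1)
    return W, b
```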
The deep belief network paper had an influence beyond its technical contributions. It demonstrated that deep learning was possible. For years, the conventional wisdom had held that networks with more than two or three layers were untrainable. Hinton proved this was wrong. The problem was not depth itself but how we approached training. With the right initialization strategy, deep networks could learn.
The paper also introduced a powerful concept: representation learning. Traditional machine learning required human engineers to design features by hand. To recognize faces, you might compute edge histograms, color distributions, and geometric ratios. To recognize speech, you might extract spectrograms and formant frequencies. Feature engineering was an art, requiring deep domain expertise and extensive experimentation.
Deep networks promised to automate this process. Each layer of a deep network learns its own features, building progressively more abstract representations. The first layer might learn edge detectors. The second might learn combinations of edges forming textures and shapes. The third might learn object parts. The fourth might learn entire objects. No human needs to specify what features to look for; the network discovers them from data.
Hinton was not alone. Two other researchers who had weathered the neural network winters were about to play crucial roles. Yann LeCun, who had pioneered convolutional networks at Bell Labs in the 1980s, was working on scaling up vision systems at NYU. Yoshua Bengio, at the University of Montreal, was exploring deep learning for language. The three of them, later dubbed the “godfathers of deep learning,” had kept the faith through the dark years. Now their persistence was about to pay off.
But deep belief networks, for all their importance, were not the final answer. Pre-training with RBMs was slow and complex. The algorithm had many hyperparameters to tune. Practitioners found the approach finicky and difficult to scale. Deep learning needed a simpler recipe.
That recipe would emerge from an unexpected source: a massive dataset of labeled images and a competition designed to test who could recognize them.
10.3 ImageNet and AlexNet
Fei-Fei Li had a simple idea that would change the history of artificial intelligence: we need more data.
In the mid-2000s, Li was a young professor at the University of Illinois at Urbana-Champaign (later at Stanford), working on computer vision. She had grown frustrated with the field’s standard datasets. The most popular benchmark, Caltech-101, contained about 9,000 images spread across 101 categories. It was small enough that researchers could memorize its quirks, engineering features and algorithms that worked on this specific dataset but failed to generalize. Progress on benchmarks was not translating into real-world capability.
Li believed the solution was scale. The human visual system learns from billions of images over a lifetime. If machines were to match human visual capabilities, perhaps they too needed to train on billions of images. Or at least millions.
Starting in 2007, Li assembled a team to construct what would become ImageNet. The goal was audacious: collect and label millions of images spanning tens of thousands of categories. The images would be drawn from the internet, capturing the full diversity of the visual world. The labels would follow the WordNet hierarchy, providing not just object categories but semantic relationships between them.
The scale of the labeling task was staggering. Li’s team used Amazon Mechanical Turk to crowdsource annotations from workers around the world. They developed sophisticated quality control mechanisms, having multiple workers label each image and using consensus to eliminate errors. Over three years, they assembled a dataset of over 14 million images labeled across more than 20,000 categories.
In 2010, Li and her collaborators launched the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The challenge focused on a subset of ImageNet: 1.2 million training images in 1,000 categories. The task was straightforward: given an image, predict which of the 1,000 categories it belonged to. Performance was measured by top-5 error rate: the fraction of test images where the correct label was not among the algorithm’s five best guesses.
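For concreteness, the metric can be computed in a few lines. The sketch below assumes a matrix of class scores and an array of true labels; the variable names are invented for the example.

```python
import numpy as np

def top5_error(scores, labels):
    """Fraction of examples whose true label is not among the five highest-scoring classes.

    scores: (n_images, n_classes) array of model scores
    labels: (n_images,) array of true class indices"""
    top5 = np.argsort(scores, axis=1)[:, -5:]          # indices of the five best guesses per image
    hit = (top5 == labels[:, None]).any(axis=1)        # True where the correct label appears
    return 1.0 - hit.mean()

# Toy usage: random scores over 1,000 classes for 8 images.
rng = np.random.default_rng(0)
scores = rng.standard_normal((8, 1000))
labels = rng.integers(0, 1000, size=8)
print(top5_error(scores, labels))
```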
The first two years of the competition saw steady progress. In 2010, the winning entry achieved a 28.2% error rate. In 2011, error dropped to 25.8%. The improvements came from better feature engineering: more sophisticated handcrafted descriptors like SIFT and HOG, combined with powerful classifiers like SVMs. Progress was real but incremental.
Then came 2012.
The entry that year from the University of Toronto was unlike anything the competition had seen. It was submitted by Alex Krizhevsky, a graduate student, along with his collaborators Ilya Sutskever and Geoffrey Hinton. They called their system AlexNet, and it was a deep convolutional neural network running on GPUs.
AlexNet won by a landslide. Its top-5 error rate was 15.3%. The second-place entry, using traditional computer vision methods, achieved 26.2%. The gap was not a few percentage points; it was nearly 11 points. In a field where annual improvements were typically measured in fractions of a percent, this was a revolution.
The architecture of AlexNet drew on ideas that LeCun had developed decades earlier but scaled them to an unprecedented degree. The network had eight layers: five convolutional layers followed by three fully connected layers, totaling about 60 million parameters. It was trained on two NVIDIA GTX 580 GPUs for about six days. The training data consisted of 1.2 million images, augmented with random crops, flips, and color perturbations to prevent overfitting.
Several technical innovations contributed to AlexNet’s success:
Convolutional Layers: Instead of connecting every input to every output, convolutional layers use small filters (e.g., 11x11 or 5x5 pixels) that slide across the image. Each filter learns to detect a particular feature, and the same filter is applied at every location. This dramatically reduces the number of parameters while building in translation invariance: a feature detector that finds edges works regardless of where in the image the edge appears.
ReLU Activation: Previous networks typically used sigmoid or tanh activation functions, which saturate for large inputs and suffer from vanishing gradients. AlexNet used the rectified linear unit: \(f(x) = \max(0, x)\). ReLU is computationally simple, does not saturate for positive inputs, and trains much faster than saturating nonlinearities.
Dropout: To prevent overfitting, AlexNet used dropout, a technique developed around the same time in Hinton’s group: randomly setting a fraction of activations to zero during training. This forces the network to learn redundant representations and acts as a powerful regularizer. At test time, all units are active, but their outputs are scaled to maintain the expected activation level.
GPU Implementation: The entire network was implemented in CUDA and ran on GPUs. This made training feasible in days rather than months. Without GPUs, AlexNet would have been computationally prohibitive.
Technical Box: CNN Architecture Components
Convolutional Layer:
- A set of learnable filters (kernels) slides across the input
- Each filter produces a feature map: \(\text{out}_{ij} = \sum_{m,n} \text{input}_{i+m, j+n} \cdot \text{kernel}_{mn}\)
- Parameters: kernel size, number of filters, stride, padding
- Key property: weight sharing - the same filter is applied everywhere, dramatically reducing parameters
- Example: a 3x3 filter on a 224x224 image has only 9 weights but produces a 222x222 feature map

Pooling Layer:
- Reduces spatial dimensions by taking the maximum or average over regions
- Max pooling: \(\text{out}_{ij} = \max_{m,n \in \text{region}} \text{input}_{i+m, j+n}\)
- Provides translation invariance: small shifts in the input do not change the output
- Example: 2x2 max pooling with stride 2 reduces 224x224 to 112x112

ReLU Activation:
- \(f(x) = \max(0, x)\)
- Gradient: 1 for \(x > 0\), 0 for \(x \leq 0\)
- No vanishing gradient for positive activations
- Sparse activation: many units output zero, providing implicit regularization

Dropout:
- During training: randomly set a fraction \(p\) of activations to zero
- Each forward pass uses a different random mask
- Forces the network to learn redundant, distributed representations
- At test time: use all units, scale outputs by \((1-p)\)
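Each of these components can be written almost directly from its definition. The following single-channel NumPy sketch is illustrative only; real implementations vectorize the loops and handle multiple channels, stride, and padding.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution of a single-channel image with one filter (no stride, no padding)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, size=2, stride=2):
    """Max pooling over square regions."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

def dropout(x, p, rng, train=True):
    """Drop a fraction p of activations during training; scale by (1 - p) at test time."""
    if train:
        return x * (rng.random(x.shape) >= p)
    return x * (1.0 - p)

# Toy usage on a random 8x8 "image" with a 3x3 filter.
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
feat = max_pool(relu(conv2d(img, rng.standard_normal((3, 3)))))
print(feat.shape)  # (3, 3)
```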
AlexNet Architecture Summary:
| Layer  | Type            | Filters | Kernel | Stride | Output     |
|--------|-----------------|---------|--------|--------|------------|
| Input  | -               | -       | -      | -      | 227x227x3  |
| Conv1  | Conv + ReLU     | 96      | 11x11  | 4      | 55x55x96   |
| Pool1  | Max Pool        | -       | 3x3    | 2      | 27x27x96   |
| Conv2  | Conv + ReLU     | 256     | 5x5    | 1      | 27x27x256  |
| Pool2  | Max Pool        | -       | 3x3    | 2      | 13x13x256  |
| Conv3  | Conv + ReLU     | 384     | 3x3    | 1      | 13x13x384  |
| Conv4  | Conv + ReLU     | 384     | 3x3    | 1      | 13x13x384  |
| Conv5  | Conv + ReLU     | 256     | 3x3    | 1      | 13x13x256  |
| Pool5  | Max Pool        | -       | 3x3    | 2      | 6x6x256    |
| FC6    | Fully Connected | -       | -      | -      | 4096       |
| FC7    | Fully Connected | -       | -      | -      | 4096       |
| Output | Softmax         | -       | -      | -      | 1000       |
Total parameters: ~60 million
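The 60 million figure can be roughly checked from the table. The sketch below does a naive count of weights and biases that ignores the original model's split across two GPUs (which used grouped convolutions in some layers), so it slightly overcounts a few convolutional layers; the fully connected layers dominate the total regardless.

```python
# Rough AlexNet parameter count from the table above (weights + biases).
conv_layers = [
    # (kernel_h, kernel_w, in_channels, out_channels)
    (11, 11, 3, 96),     # Conv1
    (5, 5, 96, 256),     # Conv2
    (3, 3, 256, 384),    # Conv3
    (3, 3, 384, 384),    # Conv4
    (3, 3, 384, 256),    # Conv5
]
fc_layers = [
    (6 * 6 * 256, 4096),  # FC6: flattened 6x6x256 feature maps
    (4096, 4096),         # FC7
    (4096, 1000),         # Output
]

total = sum(kh * kw * cin * cout + cout for kh, kw, cin, cout in conv_layers)
total += sum(n_in * n_out + n_out for n_in, n_out in fc_layers)
print(f"{total:,}")  # roughly 62 million, i.e. "about 60 million"
```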
The reaction to AlexNet was immediate and profound. For years, the computer vision community had assumed that progress required better feature engineering. AlexNet proved that learned features could vastly outperform handcrafted ones. The network had discovered its own visual representations, and those representations were far better than anything humans had designed.
The implications were clear to anyone paying attention. If deep learning could revolutionize image classification, what else could it transform?
10.4 The Deep Learning Tsunami
The years following AlexNet were a period of explosive growth. Deep learning moved from academic curiosity to industrial priority almost overnight.
In speech recognition, deep neural networks replaced the hidden Markov models and Gaussian mixture models that had dominated for decades. In 2012, the same year as AlexNet, a collaboration between Hinton’s lab and Microsoft Research showed that deep networks could cut speech recognition error rates by nearly a third. By 2016, Microsoft announced that their speech recognition system had reached human parity on conversational speech.
In natural language processing, deep learning transformed machine translation. Traditional statistical translation systems relied on complex pipelines of alignment models, phrase tables, and language models. Deep learning offered an elegant alternative: encode the source sentence into a vector, then decode that vector into the target language. These encoder-decoder systems, once they scaled up, outperformed decades of painstaking engineering.
In game playing, deep reinforcement learning achieved feats once thought impossible. In 2013, a small London startup called DeepMind trained neural networks to play Atari games directly from pixels. The networks learned, from raw visual input and score signals alone, to match or exceed human performance on games like Breakout and Space Invaders. In 2016, DeepMind’s AlphaGo defeated Lee Sedol, one of the world’s best Go players, in a match that stunned the world. The ancient game of Go, with its vast search space and reliance on intuition, had long been considered beyond the reach of artificial intelligence. Deep learning proved otherwise.
Industry investment followed the breakthroughs. Google acquired DeepMind in 2014 for a reported 500 million dollars. Facebook hired Yann LeCun to lead its AI research lab. Baidu recruited Andrew Ng, a Stanford professor who had pioneered large-scale deep learning, to head its research efforts. Amazon, Microsoft, Apple, and countless startups poured resources into deep learning. The competition for talent became fierce, with top researchers commanding salaries that rivaled professional athletes.
The Neural Information Processing Systems conference (NIPS, renamed NeurIPS in 2018), once a modest gathering of a few hundred researchers, exploded in attendance. In 2012 it drew about 2,500 attendees; by 2019 it had over 13,000, with tickets selling out in minutes. The conference became so oversubscribed that a lottery system was introduced for registration.
Through it all, the basic recipe remained remarkably consistent: big data, big models, big compute. Success required massive labeled datasets, often millions or billions of examples. It required models with millions or billions of parameters. And it required enormous computational resources, typically measured in GPU-hours or TPU-days. The democratization of these resources through cloud computing lowered barriers to entry, but the biggest breakthroughs still came from organizations with the deepest pockets.
Hinton, LeCun, and Bengio, the three researchers who had kept faith with neural networks through the dark years, were vindicated. They jointly received the 2018 Turing Award, computing’s highest honor, “for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.” Their decades of persistence, often against the prevailing wisdom of the field, had transformed artificial intelligence.
But even as deep learning conquered vision and speech, a harder challenge remained. Images and audio are fundamentally spatial: the meaning depends on local patterns and their hierarchical composition. Convolutional networks exploit this structure beautifully. Language, however, is sequential. The meaning of a sentence unfolds over time. Words at the beginning constrain the interpretation of words at the end. Long-range dependencies span entire paragraphs or documents.
The deep learning revolution had proven that neural networks could learn. Could they learn to handle sequences?
Chapter Notes
Primary Sources
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). “A fast learning algorithm for deep belief nets.” Neural Computation, 18(7), 1527-1554.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems, 25.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). “ImageNet: A large-scale hierarchical image database.” CVPR 2009.
Hinton, G. E., & Salakhutdinov, R. R. (2006). “Reducing the dimensionality of data with neural networks.” Science, 313(5786), 504-507.
Hooker, S. (2021). “The hardware lottery.” Communications of the ACM, 64(12), 58-65.
Further Reading
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. [The definitive textbook, freely available online]
Marcus, G. (2018). “Deep Learning: A Critical Appraisal.” arXiv preprint arXiv:1801.00631. [A thoughtful critique of deep learning’s limitations]
Sejnowski, T. J. (2018). The Deep Learning Revolution. MIT Press. [Historical account from a participant in the connectionist movement]
LeCun, Y., Bengio, Y., & Hinton, G. (2015). “Deep learning.” Nature, 521(7553), 436-444. [Review article by the three Turing Award winners]