Chapter 8: Neural Networks Reborn

Central Question: How can networks learn complex functions?


Part III: The Learning Revolution (1980-2017)

The AI winters had frozen the field. Expert systems, once heralded as the future of intelligent machines, were collapsing under the weight of their own brittleness. Symbolic AI, with its elegant rules and crisp logic, could not handle the messy ambiguity of real-world problems. And neural networks? They were dead. Minsky and Papert had driven a stake through the perceptron’s heart in 1969, and serious researchers knew better than to waste their careers on such discredited ideas.

Or so the mainstream believed.

What the mainstream did not know was that neural networks had not died. They had gone underground.


8.1 The Connectionist Revival

In the late 1970s, a small band of researchers scattered across North America and Europe continued working on neural networks despite the field’s pariah status. They published in obscure journals. They met at small workshops. They called themselves “connectionists” partly to avoid the toxic associations with the word “neural networks.” And they were about to change everything.

The intellectual center of this underground movement coalesced at the University of California, San Diego. There, in a cluster of offices overlooking the Pacific, a remarkable group assembled: David Rumelhart, a cognitive psychologist frustrated with the limitations of symbolic models; James McClelland, a psycholinguist seeking computational accounts of language processing; and Geoffrey Hinton, a British computer scientist who had never stopped believing that the brain held the key to intelligence.

Hinton’s path to UCSD was itself a story of intellectual rebellion. As a graduate student at the University of Edinburgh in the early 1970s, he had chosen to study neural networks precisely when they were most unfashionable. “I thought it was obvious that the brain wasn’t doing symbolic AI,” Hinton later recalled. “The brain has billions of neurons. It doesn’t have a central processor running through rules. It has to work differently.” His PhD advisor warned him he was committing career suicide. Hinton persisted anyway.

The UCSD group, which came to be known as the PDP Research Group (for Parallel Distributed Processing), was asking a simple but profound question: What if intelligence does not require explicit rules at all? What if the right answer emerges from the interaction of thousands of simple, interconnected processing units?

Their key insight was the concept of distributed representations. In symbolic AI, concepts are stored in discrete locations. The word “cat” might be represented by a particular symbol in a particular memory cell. But in the PDP vision, “cat” is represented as a pattern of activation across many processing units. No single unit means “cat.” The meaning emerges from the collective pattern.

This seemingly abstract distinction has profound consequences. Distributed representations are robust: damage a few units, and the pattern degrades gracefully rather than failing catastrophically. They support generalization: similar concepts have similar patterns, so the system can draw inferences about novel inputs based on their similarity to known ones. And they learn automatically: the patterns emerge from experience rather than being hand-coded by programmers.

Consider how this differs from a symbolic expert system. An expert system for diagnosing diseases requires a human expert to articulate explicit rules: “If the patient has fever AND cough AND chest pain, THEN consider pneumonia.” Every rule must be specified in advance. Every exception must be enumerated. The knowledge engineer becomes the bottleneck, laboriously extracting knowledge from human experts and encoding it by hand.

A connectionist system, by contrast, learns from examples. Show it thousands of patient records with symptoms and diagnoses, and it discovers the patterns itself. No one needs to articulate the rules explicitly. The rules, to the extent they exist at all, are implicit in the connection weights between units.

In 1986, the PDP Research Group published their manifesto: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, a two-volume work that landed on the field like a thunderclap. Volume 1, edited by Rumelhart and McClelland, laid out the theoretical foundations. Volume 2 showcased applications to psychology and linguistics. The books were dense, technical, and utterly convincing.

The timing was not accidental. By 1986, the expert systems bubble was beginning to deflate. The Japanese Fifth Generation computing project, which had aimed to build intelligent machines through symbolic AI, was floundering. Researchers disillusioned with the symbol-processing paradigm were looking for alternatives. The PDP books gave them one.

But the theoretical arguments alone would not have been enough. What made the connectionist revival succeed where the perceptron had failed was a practical breakthrough: a method for training multi-layer networks. The perceptron learning rule worked only for a single layer of adjustable weights, and Minsky and Papert had proved that such single-layer networks could not compute many useful functions, including XOR. Adding hidden layers would solve the problem in principle, but no one knew how to train them.

Until now.


8.2 Backpropagation

The fundamental challenge of training a multi-layer network is called the credit assignment problem. Suppose we have a network with an input layer, two hidden layers, and an output layer. We show it an input, it produces an output, and the output is wrong. We know there is an error. But where did the error come from?

Was it the fault of the weights connecting the second hidden layer to the output? Or the weights connecting the first hidden layer to the second? Or the input weights? The error at the output tells us that something is wrong, but it does not tell us which weights to blame.

In a single-layer perceptron, this problem does not arise. There is only one set of weights, so they must be responsible for any errors. We can adjust them directly based on the discrepancy between the target and the prediction. But with hidden layers, the responsibility is distributed across multiple stages of processing.

The solution, which came to be called backpropagation, had actually been discovered multiple times. Paul Werbos described it in his 1974 Harvard PhD thesis, though almost no one read it. David Parker independently derived it in 1985. Yann LeCun developed similar ideas in France. But it was the paper by Rumelhart, Hinton, and Ronald Williams, published in Nature in October 1986 with the title “Learning representations by back-propagating errors,” that brought backpropagation to the world’s attention.

The key insight is elegant: we can propagate error information backward through the network, layer by layer, using the chain rule of calculus. If we know how the error depends on the activations of the final hidden layer, and we know how those activations depend on the activations of the previous layer, then we can compute how the error depends on the earlier activations. The chain rule lets us compose these dependencies.

Let us see how this works in detail.

Forward Pass

Consider a network where information flows from input to output through multiple layers. At each layer, we perform two operations. First, we compute a weighted sum of the inputs:

\[z_j = \sum_i w_{ij} a_i + b_j\]

Here, \(a_i\) is the activation of unit \(i\) in the previous layer, \(w_{ij}\) is the weight connecting unit \(i\) to unit \(j\), and \(b_j\) is a bias term. The result \(z_j\) is sometimes called the pre-activation or logit.

Second, we apply a nonlinear activation function \(\sigma\) to get the unit’s output:

\[a_j = \sigma(z_j)\]

In the 1980s, the most common choice was the sigmoid function \(\sigma(z) = 1/(1 + e^{-z})\), which squashes any real number into the range \((0, 1)\). The forward pass simply repeats this computation layer by layer, from input to output.
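To make the forward pass concrete, here is a minimal sketch in Python (using NumPy). The layer sizes, weights, and input values are arbitrary illustrations, not taken from any particular network:

    import numpy as np

    def sigmoid(z):
        # Squashes any real number into (0, 1), as described above.
        return 1.0 / (1.0 + np.exp(-z))

    # One layer: 3 units feeding 2 units.
    a_prev = np.array([0.5, -1.0, 2.0])      # activations a_i of the previous layer
    W = np.array([[ 0.1, -0.3],
                  [ 0.8,  0.2],
                  [-0.5,  0.4]])             # W[i, j] = weight from unit i to unit j
    b = np.array([0.0, 0.1])                 # bias terms b_j

    z = a_prev @ W + b                       # pre-activations z_j = sum_i w_ij a_i + b_j
    a = sigmoid(z)                           # activations a_j = sigma(z_j)
    print(z, a)

Repeating this pair of operations layer by layer, from input to output, is the entire forward pass.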

Computing the Error

After the forward pass, we compare the network’s output to the target value. The difference is the error. Typically, we quantify this using a loss function \(L\). For regression problems, we might use mean squared error:

\[L = \frac{1}{2} \sum_k (y_k - a_k)^2\]

where \(y_k\) is the target for output unit \(k\) and \(a_k\) is the network’s actual output.
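Continuing the sketch, the loss for a made-up output and target (the values are illustrative only) is a one-line computation:

    import numpy as np

    a_out = np.array([0.73, 0.12])       # network outputs a_k (hypothetical values)
    y     = np.array([1.0,  0.0])        # targets y_k

    L = 0.5 * np.sum((y - a_out) ** 2)   # L = 1/2 * sum_k (y_k - a_k)^2
    print(L)                             # about 0.0437 for these values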

Backward Pass

Now comes the key step. We want to compute how the loss changes with respect to each weight: \(\partial L / \partial w_{ij}\). This gradient tells us which direction to adjust the weight to reduce the error.

The chain rule gives us:

\[\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}}\]

The second factor is easy: since \(z_j = \sum_i w_{ij} a_i + b_j\), we have \(\partial z_j / \partial w_{ij} = a_i\). The gradient with respect to a weight is proportional to the activation of the unit it connects from.

The first factor, \(\partial L / \partial z_j\), is called the error signal and is denoted \(\delta_j\). This is the crucial quantity: how sensitive is the loss to the pre-activation of unit \(j\)?

For output units, we can compute \(\delta\) directly:

\[\delta_j = \frac{\partial L}{\partial a_j} \cdot \sigma'(z_j)\]

For mean squared error and sigmoid activation, this works out to \(\delta_j = (a_j - y_j) \cdot a_j (1 - a_j)\).

For hidden units, we propagate the error backward:

\[\delta_j = \sigma'(z_j) \sum_k w_{jk} \delta_k\]

The error signal at a hidden unit depends on the error signals at all the units it connects to in the next layer, weighted by the connecting weights.

This is the “backpropagation” in backpropagation. Errors flow backward through the network, just as activations flow forward.

The Weight Update

Once we have the gradients, we update each weight using gradient descent:

\[w_{ij} \leftarrow w_{ij} - \eta \frac{\partial L}{\partial w_{ij}} = w_{ij} - \eta \cdot \delta_j \cdot a_i\]

Here \(\eta\) is the learning rate, a small positive number that controls how big a step we take.
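Putting the forward pass, the error signals, and the update rule together, here is a minimal sketch of one gradient step on a single training example in a tiny 2-2-1 network. The specific weights, input, target, and learning rate are arbitrary illustrations:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    eta = 0.5                                   # learning rate
    x = np.array([1.0, 1.0])                    # one training input
    y = 0.0                                     # its target

    W1 = np.array([[ 0.5, -0.4],
                   [ 0.3,  0.8]])               # W1[i, j]: input i -> hidden j
    b1 = np.array([0.0, 0.0])
    W2 = np.array([ 0.7, -0.6])                 # hidden j -> single output
    b2 = 0.0

    # Forward pass
    z1 = x @ W1 + b1;  a1 = sigmoid(z1)         # hidden pre-activations and activations
    z2 = a1 @ W2 + b2; a2 = sigmoid(z2)         # network output

    # Backward pass (squared-error loss, sigmoid units)
    delta2 = (a2 - y) * a2 * (1 - a2)           # error signal at the output unit
    delta1 = a1 * (1 - a1) * (W2 * delta2)      # error signals at the hidden units

    # Gradient-descent updates: w <- w - eta * delta_j * a_i
    W2 -= eta * delta2 * a1
    b2 -= eta * delta2
    W1 -= eta * np.outer(x, delta1)
    b1 -= eta * delta1

    print(delta2, delta1)                       # error signals for this example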


Technical Box: Backpropagation Derivation

Notation:

  • \(z_j^{(l)}\): pre-activation of unit \(j\) in layer \(l\)
  • \(a_j^{(l)}\): activation of unit \(j\) in layer \(l\) (after applying \(\sigma\))
  • \(w_{ij}^{(l)}\): weight from unit \(i\) in layer \(l-1\) to unit \(j\) in layer \(l\)
  • \(\delta_j^{(l)} = \partial L / \partial z_j^{(l)}\): error signal at unit \(j\) in layer \(l\)

The Chain Rule:

\[\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} \cdot a_i^{(l-1)}\]

Error Signal at Output Layer (layer \(L\)):

\[\delta_j^{(L)} = \frac{\partial L}{\partial a_j^{(L)}} \cdot \sigma'(z_j^{(L)})\]

For MSE loss with sigmoid: \(\delta_j^{(L)} = (a_j^{(L)} - y_j) \cdot a_j^{(L)}(1 - a_j^{(L)})\)

Error Signal at Hidden Layers (propagating backward):

\[\delta_j^{(l)} = \sigma'(z_j^{(l)}) \sum_k w_{jk}^{(l+1)} \delta_k^{(l+1)}\]

Worked Example: XOR Network

Consider a network with 2 input units, 2 hidden units, and 1 output unit learning XOR.

Forward pass for input (1, 1), target 0:

Hidden unit 1: \(z_1 = w_{11} \cdot 1 + w_{21} \cdot 1 + b_1\), then \(a_1 = \sigma(z_1)\)

Hidden unit 2: \(z_2 = w_{12} \cdot 1 + w_{22} \cdot 1 + b_2\), then \(a_2 = \sigma(z_2)\)

Output: \(z_o = v_1 \cdot a_1 + v_2 \cdot a_2 + b_o\), then \(\hat{y} = \sigma(z_o)\)

Backward pass:

Output error: \(\delta_o = (\hat{y} - 0) \cdot \hat{y}(1 - \hat{y})\)

Hidden errors: \(\delta_1 = a_1(1-a_1) \cdot v_1 \cdot \delta_o\) and \(\delta_2 = a_2(1-a_2) \cdot v_2 \cdot \delta_o\)

Weight updates: \(v_1 \leftarrow v_1 - \eta \cdot \delta_o \cdot a_1\), etc.

After many iterations over all four XOR examples, the network converges to a solution.
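The worked example can be turned into a complete, if minimal, training loop. The sketch below uses arbitrary but typical choices (learning rate 0.5, small random initial weights, a few thousand passes over the four patterns); it is meant to illustrate the procedure rather than serve as a robust implementation, and some random initializations may need more iterations, a point we return to below:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # the four XOR inputs
    Y = np.array([0.0, 1.0, 1.0, 0.0])                            # XOR targets

    W1 = rng.normal(0, 1.0, size=(2, 2)); b1 = np.zeros(2)        # input -> 2 hidden units
    W2 = rng.normal(0, 1.0, size=2);      b2 = 0.0                # hidden -> output
    eta = 0.5

    for epoch in range(10000):
        for x, y in zip(X, Y):
            # Forward pass
            a1 = sigmoid(x @ W1 + b1)
            a2 = sigmoid(a1 @ W2 + b2)
            # Backward pass
            d2 = (a2 - y) * a2 * (1 - a2)
            d1 = a1 * (1 - a1) * W2 * d2
            # Weight updates
            W2 -= eta * d2 * a1;            b2 -= eta * d2
            W1 -= eta * np.outer(x, d1);    b1 -= eta * d1

    for x in X:
        a1 = sigmoid(x @ W1 + b1)
        print(x, sigmoid(a1 @ W2 + b2))   # outputs should approach 0, 1, 1, 0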


The XOR Problem Solved

With backpropagation, we can finally solve the problem that had haunted neural networks since 1969. A network with two input units, two hidden units, and one output unit can learn XOR. The hidden units learn to compute intermediate representations that make the problem linearly separable.

One common solution the network discovers: one hidden unit learns to compute (roughly) “at least one input is on,” while the other learns “both inputs are on.” The output unit then computes “at least one is on AND NOT both are on,” which is exactly XOR. The network has invented its own internal representation of the problem.

This is the profound insight of the connectionist program. We do not need to tell the network what features to look for. We do not need to design clever representations by hand. The network learns the representations it needs from the training data. The hidden layers are not just a mathematical trick; they are a form of automatic feature engineering.

The 1986 Nature paper demonstrated backpropagation not just on toy problems like XOR but on richer tasks such as learning family relationships from examples, and related work soon applied it to practical problems like converting text to speech and compressing images. Each demonstration made the same point: networks with hidden layers can learn complex functions that are far beyond the reach of single-layer perceptrons.

The response was electric. Neural network research, dormant for nearly two decades, roared back to life. Funding agencies that had shunned connectionism began writing checks. Graduate students who had been warned away from neural networks now flocked to them. The second coming of neural networks had begun.


8.3 Universal Approximation

The empirical successes of backpropagation-trained networks naturally raised a theoretical question: What is the computational power of these systems? Can neural networks, in principle, compute any function we might want? Or are there fundamental limits, as Minsky and Papert had found for single-layer perceptrons?

The answer, when it came, was startlingly powerful.

In 1989, George Cybenko, a mathematician at the University of Illinois, proved what became known as the Universal Approximation Theorem. His result, soon extended by Kurt Hornik, Maxwell Stinchcombe, and Halbert White, established that neural networks are universal function approximators.


Technical Box: Universal Approximation Theorem

Statement (Cybenko 1989, Hornik et al. 1989):

Let \(\sigma\) be a continuous, bounded, non-constant activation function (such as the sigmoid). Let \(f\) be any continuous function on a compact subset \(K\) of \(\mathbb{R}^n\). Then for any \(\epsilon > 0\), there exists a feedforward neural network with a single hidden layer containing a finite number of neurons such that

\[\sup_{x \in K} |f(x) - N(x)| < \epsilon\]

where \(N(x)\) is the output of the network.

In words: A neural network with one hidden layer and sufficiently many neurons can approximate any continuous function to arbitrary precision on any bounded region.

Intuition: Think of each hidden unit as implementing a “bump” or “ridge” function localized in some region of input space. By combining enough such bumps with appropriate weights, we can sculpt any continuous surface. This is similar in spirit to how Fourier series combine sine waves to approximate any periodic function, or how polynomial splines approximate curves by piecing together local segments.

Critical Caveat: The theorem guarantees existence, not learnability. It tells us that a sufficiently large network can represent any function, but it says nothing about whether gradient descent will find the right weights, how many neurons are needed, or how much training data is required. A function may be representable in principle but impossible to learn in practice.
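One way to see this representational capacity concretely, without yet worrying about how such networks are trained, is to fix random hidden-unit weights and fit only the output weights by least squares. This is not how networks are normally trained (backpropagation adjusts all layers), and the target function, layer width, and weight scales below are arbitrary choices for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 200)             # a compact interval
    f = np.sin(x)                                   # the continuous target function

    n_hidden = 50
    w = rng.normal(0, 2.0, size=n_hidden)           # random input -> hidden weights
    b = rng.uniform(-2 * np.pi, 2 * np.pi, size=n_hidden)  # random hidden biases

    H = sigmoid(np.outer(x, w) + b)                 # hidden activations, shape (200, 50)
    v, *_ = np.linalg.lstsq(H, f, rcond=None)       # output weights chosen by least squares

    approx = H @ v
    print(np.max(np.abs(f - approx)))               # sup-norm error; typically small

With enough hidden units, the maximum error can be driven as small as desired, which is exactly what the theorem promises; what it does not promise is that gradient descent would have found these weights.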


The Universal Approximation Theorem gave the connectionist program theoretical legitimacy. Neural networks were not merely clever heuristics; they were provably capable of representing the kinds of complex functions needed for intelligent behavior. The XOR limitation was not a fundamental flaw of the neural network idea; it was merely a limitation of insufficient depth.

But the theorem also revealed the limits of what theory alone could tell us. Yes, neural networks can represent any function, but that does not mean they will learn the right one. The learning process depends on the optimization landscape, and that landscape is treacherous.

Vanishing Gradients

The very mechanism that makes backpropagation work also contains the seeds of its failure. Recall that error signals propagate backward by multiplying through derivatives of the activation function. For a sigmoid, the derivative \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) reaches its maximum value of 0.25 when \(z = 0\) and decays toward zero for large \(|z|\).

In a deep network with many layers, the error signal passes through many such multiplicative factors. If each factor is less than one, the signal shrinks exponentially. By the time it reaches the early layers, it has effectively vanished. The early layers receive almost no learning signal; they are nearly frozen in their initial random state.

This vanishing gradient problem placed a practical ceiling on network depth. Networks with more than two or three hidden layers were nearly impossible to train. The theoretical power of deep networks remained locked away, inaccessible to the training algorithms of the day.
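The effect is easy to see numerically. The sketch below pushes an error signal backward through a stack of sigmoid layers with small random weights (the depth, width, and weight scale are arbitrary choices) and prints how quickly its magnitude shrinks:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    width, depth = 20, 10

    # Forward pass through `depth` sigmoid layers, storing weights and activations.
    a = rng.normal(size=width)
    weights, activations = [], []
    for _ in range(depth):
        W = rng.normal(0, 1.0 / np.sqrt(width), size=(width, width))
        a = sigmoid(a @ W)
        weights.append(W)
        activations.append(a)

    # Backward pass: start with a unit error signal at the top layer and
    # propagate it toward the input, one layer at a time.
    delta = np.ones(width)
    for l in range(depth - 1, 0, -1):
        a_l = activations[l - 1]                        # activations of layer l
        delta = a_l * (1 - a_l) * (weights[l] @ delta)  # delta^(l) from delta^(l+1)
        print(np.linalg.norm(delta))                    # typically shrinks layer by layer

Because each layer multiplies the signal by a sigmoid derivative of at most 0.25, the norm printed at each step usually falls by a large factor, and after ten layers almost nothing is left.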

Local Minima

Gradient descent is a greedy, local algorithm. It follows the direction of steepest descent, always seeking to reduce the loss. But the loss landscape of a neural network is not a simple bowl with a single minimum. It is a complex, high-dimensional surface with many valleys, ridges, and saddle points.

A network trained by gradient descent might settle into a local minimum far worse than the global optimum. There is no guarantee that the learned solution is the best one, or even a good one. For practitioners in the 1980s and 1990s, this meant that training was often unreliable. The same architecture trained on the same data might find completely different solutions depending on the random initialization.
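To get a feel for this, one can rerun a small training job several times with different random initializations and compare the final losses. The sketch below reuses the XOR training loop from earlier in this chapter, collapsed into a function; the seed count, epoch budget, and learning rate are arbitrary, and exact outcomes will vary from run to run:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([0.0, 1.0, 1.0, 0.0])

    def train_xor(seed, epochs=2000, eta=0.5):
        rng = np.random.default_rng(seed)
        W1 = rng.normal(0, 1.0, size=(2, 2)); b1 = np.zeros(2)
        W2 = rng.normal(0, 1.0, size=2);      b2 = 0.0
        for _ in range(epochs):
            for x, y in zip(X, Y):
                a1 = sigmoid(x @ W1 + b1)
                a2 = sigmoid(a1 @ W2 + b2)
                d2 = (a2 - y) * a2 * (1 - a2)
                d1 = a1 * (1 - a1) * W2 * d2
                W2 -= eta * d2 * a1;            b2 -= eta * d2
                W1 -= eta * np.outer(x, d1);    b1 -= eta * d1
        preds = np.array([sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2) for x in X])
        return 0.5 * np.sum((Y - preds) ** 2)

    for seed in range(5):
        print(seed, train_xor(seed))   # final losses differ; some runs may stay stuck high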

Computational Constraints

The computers of 1986 were slow by modern standards. Training a network on a dataset of thousands of examples might take hours or days. Datasets of millions of examples were out of the question. The networks that could be practically trained were small: tens to hundreds of units, not thousands or millions.

These constraints kept neural networks in a modest niche throughout the 1990s. They worked well on small, clean problems: recognizing handwritten digits, predicting simple time series, learning basic patterns. But on many practical tasks they were outcompeted by other methods, such as support vector machines and, later, random forests, which did not suffer from the same optimization difficulties.

The Promise and the Wait

By the early 1990s, neural networks had been reborn, but their full potential remained unrealized. Backpropagation had solved the credit assignment problem. The Universal Approximation Theorem had established their theoretical power. Distributed representations offered an elegant alternative to brittle symbolic systems.

Yet the practical limitations were severe. Networks could not be made very deep. Training was slow and unreliable. Larger datasets were computationally prohibitive.

The connectionists had proven that neural networks could work. What they had not yet discovered was how to make them work well at scale. That breakthrough would require new algorithms, new hardware, and new insights that were still twenty years in the future.

In Chapter 10, we will see how those pieces finally came together. But first, we must understand another transformation sweeping through AI in the 1990s: the statistical turn. For while the connectionists were rebuilding neural networks, another group of researchers was asking a different question: What if we approach AI not as engineering intelligent behavior, but as learning statistical patterns from data? Their answer would reshape the field and set the stage for the deep learning revolution to come.

Neural networks were back. But they were not yet deep.


Chapter Notes

Key Figures

  • David Rumelhart (1942-2011): Cognitive psychologist, co-editor of the PDP books, co-author of the backpropagation Nature paper. Rumelhart’s later career was cut short by Pick’s disease, a devastating neurodegenerative disorder.
  • Geoffrey Hinton (1947-): British-Canadian computer scientist, often called the “godfather of deep learning.” Continued developing neural network methods through the 1990s dark years and played a central role in the deep learning revolution of the 2000s. Nobel Prize in Physics (2024).
  • James McClelland (1948-): Cognitive scientist, co-editor of the PDP books. Continued applying connectionist models to language and cognitive development.
  • Ronald Williams: Computer scientist, co-author of the backpropagation Nature paper and inventor of the REINFORCE algorithm for reinforcement learning.
  • George Cybenko: Mathematician who proved the first universal approximation theorem for neural networks (1989).
  • Paul Werbos (1947-): First to describe backpropagation in his 1974 PhD thesis, though the work went largely unnoticed at the time.

Primary Sources

  • Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vols. 1-2). MIT Press.
  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature, 323(6088), 533-536.
  • Cybenko, G. (1989). “Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems, 2(4), 303-314.
  • Hornik, K., Stinchcombe, M., & White, H. (1989). “Multilayer feedforward networks are universal approximators.” Neural Networks, 2(5), 359-366.
  • Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. [For context on what backpropagation overcame]

Further Reading

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 6. MIT Press. [Modern treatment of backpropagation]
  • Olazaran, M. (1996). “A Sociological Study of the Official History of the Perceptrons Controversy.” Social Studies of Science, 26(3), 611-659. [Fascinating account of how the Minsky-Papert critique shaped the field]
  • Anderson, J. A., & Rosenfeld, E. (Eds.) (1988). Neurocomputing: Foundations of Research. MIT Press. [Collected papers including early backpropagation work]