Chapter 3: Information as Physics

Central Question: What is the nature of information?


3.1 Shannon’s Revolution

It is 1948, and we are standing in the corridors of Bell Telephone Laboratories in Murray Hill, New Jersey. The war is over. America is triumphant. And in this sprawling industrial research campus, some of the finest minds of the century are wrestling with a problem that seems, at first glance, merely technical: how do we send messages through wires?

The problem is not new. Since Samuel Morse’s first telegraph message in 1844—“What hath God wrought”—engineers have grappled with the practical challenges of electrical communication. Signals fade over distance. Noise corrupts transmission. Bandwidth is limited and expensive. The telephone network, now threading through every American city, demands constant improvement: more calls through the same wires, clearer voices over longer distances, fewer dropped connections.

But no one has asked the deeper question: What is a message? What, exactly, is being transmitted when we speak into a telephone or tap out Morse code? Everyone assumes they know—words, meanings, ideas. But can we quantify it? Can we measure information the way we measure voltage or current?

A thirty-two-year-old mathematician named Claude Shannon believes we can.

Shannon is an unlikely revolutionary. Raised in Gaylord, Michigan, a small town in the northern part of the state, he shows an early talent for tinkering—building a barbed-wire telegraph to a friend’s house half a mile away, constructing a crude elevator from pulleys and rope. At the University of Michigan, he earns degrees in both mathematics and electrical engineering. At MIT, he writes a master’s thesis demonstrating that Boolean algebra can describe the operation of switching circuits—the work that, as we saw in Chapter 1, unites Boole’s logic with the physical world of relays and switches.

During the war, Shannon works on cryptography and fire-control systems. He encounters the problem of communication in its most urgent form: how do you send secret messages that the enemy cannot decode? How do you transmit targeting data accurately through the noise of battle?

But Shannon’s mind operates at a level of abstraction that transcends any particular application. While others see specific problems—this cipher, that circuit—Shannon sees patterns. He sees the general structure lurking behind the particulars.

In July 1948, the Bell System Technical Journal publishes the first installment of his paper: “A Mathematical Theory of Communication.” It is dense with mathematics. It runs to nearly eighty pages across two issues, the second appearing that October. And it changes everything.

Shannon’s first radical move is to separate information from meaning.

This seems almost perverse. When we communicate, meaning is the whole point. We speak to convey ideas. We write to share knowledge. The semantic content of a message—what it says—is what matters to us as human beings.

But Shannon recognizes that the engineering problem is different. A communication system must be able to transmit any message the sender might choose. The telephone company cannot know in advance whether you will discuss the weather or whisper words of love. The system must be indifferent to content. It must handle all possible messages with equal reliability.

From this perspective, a message is not a vehicle for meaning but a selection from a set of possibilities. Before you speak, many things could be said. After you speak, one thing has been said. The message is whatever distinguishes the actual from the possible. Information, in Shannon’s formulation, is the resolution of uncertainty.

This leads to a natural measure. Suppose you flip a fair coin. Before the flip, two outcomes are equally possible: heads or tails. The flip resolves your uncertainty completely. Shannon calls this one “bit” of information—from “binary digit,” a term coined by his Bell Labs colleague John Tukey.

One bit is the information content of a single yes-or-no question answered truthfully. It is the minimum unit of choice, the quantum of decision.

If you flip the coin twice, you have four equally likely outcomes: HH, HT, TH, TT. Specifying which one occurred requires two bits. Three flips give eight outcomes, requiring three bits. The pattern is clear: n flips require n bits. Or equivalently: specifying one outcome from among N equally likely possibilities requires log_2(N) bits.

But what if the outcomes are not equally likely? Suppose the coin is biased—it lands heads 90% of the time. Now there is less uncertainty to resolve. If you guess “heads” every time, you will usually be right. The biased coin carries less information per flip than the fair coin.

Shannon captures this with a formula of elegant simplicity. Let X be a random variable taking values x_1, x_2, …, x_n with probabilities p_1, p_2, …, p_n. The information content—which Shannon calls entropy, borrowing the term from thermodynamics—is defined in the technical box below.


Technical Box: Shannon Entropy

The entropy of a discrete random variable X is defined as:

\[H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)\]

where p(x_i) is the probability of outcome x_i, and the logarithm is base 2 (so entropy is measured in bits).

Intuition: Entropy measures average surprise. A rare event is surprising; a common event is not. The term -log_2 p(x) gives the “surprise” of event x—higher when p(x) is small, lower when p(x) is large. Entropy is the expected value of this surprise, averaged over all possible outcomes.

Examples:

Fair coin: p(heads) = p(tails) = 0.5 \[H = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 0.5 + 0.5 = 1 \text{ bit}\]

Biased coin (90% heads): p(heads) = 0.9, p(tails) = 0.1 \[H = -0.9 \log_2(0.9) - 0.1 \log_2(0.1) \approx 0.137 + 0.332 = 0.469 \text{ bits}\]

The biased coin has less than half the entropy of the fair coin. There is less uncertainty to resolve.

Maximum entropy: For a fixed number of outcomes, entropy is maximized when all outcomes are equally likely. The fair coin is the most unpredictable.

Connection to compression: Entropy sets a fundamental limit on data compression. A source producing symbols with entropy H bits per symbol cannot be compressed below H bits per symbol on average, no matter how clever the compression scheme. Conversely, optimal compression can achieve arbitrarily close to H bits per symbol. Entropy is the irreducible information content—the core that remains after all redundancy is squeezed out.
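
The formula is easy to check by direct computation. The short Python sketch below simply restates the definition and reproduces the numbers above; the function name entropy and the example distributions are ours, chosen for illustration.

    import math

    def entropy(probs):
        """Shannon entropy in bits: H = -sum of p * log2(p), skipping zero-probability outcomes."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
    print(entropy([0.9, 0.1]))    # biased coin: about 0.469 bits
    print(entropy([0.25] * 4))    # two fair flips (four equally likely outcomes): 2.0 bits

The last line also illustrates the maximum-entropy property: for four outcomes, no distribution has higher entropy than the uniform one, at log_2 4 = 2 bits.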


The connection to thermodynamic entropy is not merely terminological. Ludwig Boltzmann had shown, in the nineteenth century, that physical entropy measures the number of microscopic arrangements consistent with a macroscopic state. Shannon’s entropy measures the number of possible messages consistent with a given probability distribution. Both quantify a kind of hidden possibility, a space of alternatives.

Indeed, when Shannon was developing his theory, he consulted with the legendary mathematician John von Neumann about what to call his measure of information. Von Neumann’s advice was characteristically shrewd: “You should call it entropy, for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.”

But Shannon’s theory goes far beyond defining a measure. He proves two remarkable theorems that establish the fundamental limits of communication.

The first is the source coding theorem, also known as the noiseless coding theorem. It says that a source of information can be encoded using, on average, no fewer bits than its entropy. You cannot compress below the entropy limit. But—and this is the crucial positive result—you can get arbitrarily close to that limit with a sufficiently clever coding scheme.

This explains why compression works. English text is highly redundant. After the letter Q, the letter U almost always follows. Common words like “the” appear vastly more often than arbitrary strings of letters. These patterns mean that the entropy of English is far below the 4.7 bits per character (log_2 of 26) that would be needed if all twenty-six letters were equally likely and independent. By exploiting these redundancies, we can compress text to a fraction of its original size—down toward the entropy limit.
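
The gap can be seen by estimating entropy from letter frequencies alone. The sketch below is a rough illustration rather than a serious measurement, and it assumes a local file, sample.txt, containing ordinary English text. Because a single-character model ignores longer-range structure (Q followed by U, whole words, grammar), it only gives an upper bound on the true per-character entropy, which Shannon later estimated at roughly one bit.

    import math
    from collections import Counter

    def unigram_entropy(text):
        """Per-character entropy (bits) estimated from single-character frequencies, ignoring context."""
        counts = Counter(text)
        total = sum(counts.values())
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    sample = open("sample.txt", encoding="utf-8").read().lower()  # assumed: any English text file
    print(unigram_entropy(sample))  # typically around 4 bits per character, already below 4.7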

The second theorem is even more profound. Consider a noisy channel—a communication link corrupted by random errors. You send a signal; the receiver gets a degraded version. Common sense suggests that some information must inevitably be lost. If the channel is very noisy, surely reliable communication is impossible?

Shannon proves that common sense is wrong.

The noisy channel coding theorem states that every channel has a definite capacity—a maximum rate, measured in bits per second, at which information can be transmitted reliably. Below this capacity, it is possible to encode messages so that the error rate can be made arbitrarily small, as close to zero as we like. Above the capacity, reliable communication is impossible.

The theorem is an existence proof: it shows that good codes exist without constructing them explicitly. Shannon demonstrates that random codes work, on average, but finding practical codes that approach the capacity limit remains an engineering challenge—one that would occupy researchers for decades, culminating in the turbo codes and low-density parity-check codes used in modern communications.
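
The capacity of one simple channel can be written down exactly, which makes the theorem concrete. For the binary symmetric channel, a standard textbook example not worked in the text above, each transmitted bit is flipped with probability p, and the capacity is C = 1 - H(p) bits per use, where H is the binary entropy from the earlier box. The sketch and function names below are ours.

    import math

    def binary_entropy(p):
        """H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits; defined as 0 at p = 0 or 1."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def bsc_capacity(p):
        """Capacity of a binary symmetric channel that flips each bit with probability p."""
        return 1.0 - binary_entropy(p)

    for p in (0.0, 0.01, 0.1, 0.5):
        print(f"flip probability {p:.2f}: capacity {bsc_capacity(p):.3f} bits per use")

A noiseless channel (p = 0) carries one full bit per use; a channel that flips bits half the time carries nothing at all. In between, the theorem promises that any rate below C can be achieved with vanishing error, given a good enough code.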

The conceptual revolution is complete. Information is not a vague notion but a precise quantity. Communication has fundamental limits—but those limits are generous, and we can approach them with sufficient ingenuity. The theory applies to any channel: telephone lines, radio waves, fiber optics, the neural pathways in a brain. Wherever signals are sent and received, Shannon’s framework applies.

The influence spreads far beyond telecommunications. In the 1950s and 1960s, information theory intersects with linguistics, psychology, and biology. Perhaps the genetic code is an information channel, with DNA as the message? Perhaps perception is a process of entropy reduction, extracting signal from noise? Perhaps thinking itself is information processing in Shannon’s sense?

Not all of these applications prove fruitful. Shannon himself cautions against “the bandwagon”—the tendency to apply information theory to everything without careful thought. But the core insight is permanent. Information is physical. It can be measured, quantified, and optimized. It obeys laws as definite as the laws of thermodynamics.

Claude Shannon lives until 2001, long enough to see the digital revolution his work made possible. In his later years, he retreats from the spotlight, tinkering with mechanical toys and solving problems that amuse him. He builds a motorized pogo stick, a juggling machine, a robot mouse that can learn to navigate a maze. He is awarded the National Medal of Science, the Kyoto Prize, and virtually every other honor available to a scientist. But fame matters little to him. He is, to the end, a man who simply loves problems—the more abstract and puzzling, the better.

His work reminds us that the deepest practical advances often come from asking the most abstract questions. “What is information?” is not an engineering question. It is a philosophical question, a mathematical question, a question about the nature of knowledge itself. Shannon answered it, and in doing so, he built the foundation of our digital world.


3.2 Wiener and Cybernetics

While Shannon is developing information theory at Bell Labs, another mathematician is following a path of his own, one that will intersect Shannon’s in profound ways. Norbert Wiener is working on the problem of control—not just communication, but action.

Wiener is a legend even before the war. Born in 1894 in Columbia, Missouri, he is a child prodigy of almost unbelievable precocity. He enters college at eleven, graduates at fourteen, and earns his Ph.D. from Harvard at eighteen. His father, Leo Wiener, a professor of Slavic languages, pushes him relentlessly—a pressure that leaves lasting psychological scars. The young Norbert is brilliant but anxious, socially awkward, and plagued by self-doubt.

By the 1930s, Wiener has established himself at MIT as a mathematician of the first rank. His work ranges across analysis, probability theory, and the foundations of quantum mechanics. He is rotund, nearsighted, and famously absent-minded. Stories circulate: Wiener wandering the halls of MIT, stopping a student to talk, then asking, “Which way was I going when I stopped you?” When told, he replies, “Good, then I’ve already had lunch.” The direction, it seems, told him whether he was heading to lunch or coming back from it.

But the war transforms Wiener from a pure mathematician into something else. In 1940, as German bombers devastate London, the British military faces a desperate problem: how to aim anti-aircraft guns at fast-moving planes. The challenge is not just ballistics but prediction. By the time a shell reaches the altitude of an enemy bomber, the plane has moved. The gunner must aim not at where the plane is, but at where it will be.

The problem is assigned to Wiener and his colleague Julian Bigelow. And as they work on it, Wiener begins to see something deeper than anti-aircraft fire.

The human gunner is part of a loop. He observes the plane’s position, predicts its trajectory, adjusts his aim, fires, observes the result, and corrects. The process is continuous, a dance of observation, prediction, action, and feedback. The gunner and the gun together form a system—and the key to the system is the feedback loop.

Feedback is not a new idea. Engineers have used governors and regulators since James Watt’s steam engine. But Wiener sees that feedback is more than an engineering trick. It is a fundamental principle that unites seemingly disparate phenomena.

Consider a thermostat. It measures temperature, compares it to a target, and activates heating or cooling to reduce the difference. The output (temperature change) feeds back to affect the input (the temperature reading). This circular causality keeps the system stable, hovering around its set point.

Now consider a human reaching for a glass of water. The eye tracks the hand’s position, the brain computes the error between hand and glass, and the motor system adjusts the trajectory. The process is continuous, with visual feedback guiding motor output. Damage the cerebellum—the brain region that integrates this feedback—and the patient develops ataxia, a condition where movements become wild and uncoordinated, overshooting and oscillating around the target.

Wiener realizes that the mathematics is the same. The thermostat and the reaching hand are both feedback control systems. They can be analyzed with the same tools and described by the same equations. The boundary between machine and organism dissolves in the abstract language of control.
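
The shared structure, sense the state, compare it to a set point, act to reduce the error, fits in a few lines of code. The sketch below is a toy proportional controller with made-up dynamics and gain; it is meant only to show the loop Wiener has in mind, not to model a real thermostat or a real arm.

    def control_step(state, setpoint, gain=0.5):
        """One pass through the loop: sense the state, compare to the goal, act to shrink the error."""
        error = setpoint - state       # compare
        action = gain * error          # decide how strongly to correct
        return state + action          # toy dynamics: the action shifts the state directly

    temperature = 10.0
    for step in range(8):
        temperature = control_step(temperature, setpoint=20.0)
        print(f"step {step}: {temperature:.2f}")

With a moderate gain the temperature settles smoothly onto the set point. Raise the gain too far and the system overshoots and oscillates around the target, a crude mechanical cousin of the ataxic reaching described above.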

In 1948—the same year Shannon publishes his information theory—Wiener publishes a book whose very title is his own coinage: Cybernetics: Or Control and Communication in the Animal and the Machine. The word comes from the Greek kubernetes, meaning “steersman” or “governor.” Wiener chooses it deliberately, evoking both the self-regulating systems of engineering and the deeper question of how living things maintain themselves in a changing world.

The book is dense, mathematical, and wide-ranging. It draws connections between neurophysiology and electrical engineering, between servo-mechanisms and reflexes, between information and entropy. Wiener argues that the key to understanding both machines and organisms is not matter or energy but information—the patterns of signals that flow through feedback loops, enabling systems to sense, respond, and adapt.

The cybernetic vision is holistic. Wiener sees the world as a web of information flows, with feedback loops maintaining stability at every level. A cell maintains its internal chemistry through metabolic feedback. An animal maintains its body temperature through physiological feedback. A society maintains its institutions through political feedback. The same principles apply everywhere, from thermostats to governments.

This vision proves enormously influential—and enormously controversial. The Macy Conferences on cybernetics, running from 1946 to 1953, bring together an extraordinary group: mathematicians, neurophysiologists, anthropologists, psychologists, linguists. The participants include Warren McCulloch and Walter Pitts (whose work we will examine shortly), the anthropologists Margaret Mead and Gregory Bateson, the psychologist Lawrence Frank, the neurophysiologist Ralph Gerard. They argue, speculate, and synthesize, seeking a unified science of mind and machine.

Some of the ideas prove fruitful. The notion of feedback control becomes central to systems engineering. The concept of information as distinct from matter and energy reshapes biology, particularly after the discovery of the genetic code. The analogy between computers and brains—both processing information, both using feedback—becomes a guiding metaphor for cognitive science.

Other ideas prove premature or misguided. The cybernetic vision sometimes leads to grandiose claims about understanding consciousness and society. Critics accuse Wiener and his followers of ignoring the differences between organisms and machines, of reducing mind to mechanism without justification.

Wiener himself is ambivalent about the implications of his work. His 1950 book, The Human Use of Human Beings, worries about automation, unemployment, and the concentration of power that intelligent machines might enable. He is among the first to raise what we now call “AI safety” concerns—the dangers of creating systems that might escape human control.

“The world of the future,” Wiener writes, “will be an ever more demanding struggle against the limitations of our intelligence, not a comfortable hammock in which we can lie down to be waited upon by our robot slaves.”

Wiener dies in 1964, at the age of sixty-nine, during a trip to Stockholm. His legacy is complex. Cybernetics as a unified discipline does not survive; its insights are absorbed into control theory, systems biology, cognitive science, and artificial intelligence, each developing its own traditions and terminology. But the core idea persists: feedback loops connect sensing to action, enabling systems—whether mechanical or biological—to pursue goals and maintain stability in a changing world.

This insight will be central to understanding intelligence. A thermostat is not intelligent, but it exhibits something that looks like purpose. It “wants” to maintain a target temperature. Of course, this is mere metaphor—the thermostat has no desires, no experiences. But the structure of goal-directed behavior is present: sense the environment, compare to a goal, act to reduce the difference.

When we ask what intelligence is, and how machines might achieve it, we are asking questions that Wiener helped to formulate. Intelligence is not just computation; it is computation in the service of action. And action requires feedback—the continuous loop of sensing, deciding, acting, and sensing again. The cybernetic perspective does not solve the problem of intelligence, but it gives us a framework for thinking about it.


3.3 The McCulloch-Pitts Neuron

We now turn back five years, to 1943, and to a collaboration that will plant the seed for all neural network research to come.

Warren McCulloch is, at forty-five, a respected neurophysiologist with a philosophical bent. Trained in both medicine and psychology, he has spent years wrestling with the mind-body problem: how does the physical brain produce the phenomenal mind? How does meat think?

McCulloch’s approach is computational. He believes that the brain’s activity can be understood as logical operations, that neural processes implement something like a formal calculus. But turning this intuition into a rigorous theory requires mathematical tools he does not possess.

He finds them in Walter Pitts.

Pitts is one of the strangest and most tragic figures in the history of science. Born in Detroit in 1923, he grows up in poverty and abuse, running away from home repeatedly to escape his violent father. He is largely self-educated, haunting public libraries and teaching himself logic, mathematics, and languages. At fifteen, he reads Bertrand Russell’s Principia Mathematica and writes to Russell pointing out errors. Russell, astonished, invites him to Cambridge. Pitts cannot go—he is, after all, a homeless teenager—but the correspondence marks him as extraordinary.

At seventeen, Pitts is in Chicago, attached informally to the University of Chicago’s circle of mathematical biologists; McCulloch has just arrived in the city to take up a position at the University of Illinois. They meet and form an intense intellectual partnership. McCulloch provides the neurophysiology; Pitts provides the mathematical logic. Together, they produce a paper that will echo through decades: “A Logical Calculus of the Ideas Immanent in Nervous Activity,” published in 1943 in the Bulletin of Mathematical Biophysics.

The paper’s central claim is striking: neurons can be modeled as logic gates. The brain, whatever else it is, is a computing device.

Here is the model. A neuron receives inputs from other neurons through connections called synapses. Each synapse has a weight—a strength of connection. The neuron sums the weighted inputs. If the sum exceeds a threshold, the neuron “fires,” sending a signal to other neurons. If not, it remains silent.

This is a radical simplification. Real neurons are fantastically complex, with elaborate biochemistry, continuous rather than binary signals, and temporal dynamics that the model ignores. McCulloch and Pitts know this. Their model is not meant as a detailed account of neurophysiology. It is an idealization, a minimal abstraction that captures what they believe is the essential logic of neural computation.


Technical Box: The McCulloch-Pitts Formalism

A McCulloch-Pitts neuron is defined as follows:

Inputs: Binary values x_1, x_2, …, x_n (each either 0 or 1)

Weights: Fixed values w_1, w_2, …, w_n (positive for excitatory connections, negative for inhibitory)

Threshold: A value theta

Output: \[y = \begin{cases} 1 & \text{if } \sum_{i} w_i x_i \geq \theta \\ 0 & \text{otherwise} \end{cases}\]

Implementing basic logic gates:

AND gate (output 1 only if all inputs are 1):

  • Two inputs with weights w_1 = w_2 = 1
  • Threshold theta = 2
  • Output is 1 only when both inputs are 1: 1 + 1 = 2 >= 2

OR gate (output 1 if any input is 1):

  • Two inputs with weights w_1 = w_2 = 1
  • Threshold theta = 1
  • Output is 1 when at least one input is 1

NOT gate (inverts the input):

  • One input with weight w = -1
  • Threshold theta = 0
  • When the input is 0: sum = 0 >= 0, so the output is 1
  • When the input is 1: sum = -1 < 0, so the output is 0

The key result: Since AND, OR, and NOT are universal (any Boolean function can be built from them), networks of McCulloch-Pitts neurons can compute any logical function.

The critical limitation: The weights are fixed in advance. There is no mechanism for the network to learn from examples. The designer must specify every weight and threshold by hand. This limitation will take decades to overcome.
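
The formalism in this box translates directly into code. The following Python sketch simply restates the definitions above with function names of our own choosing; it is not drawn from the 1943 paper itself.

    def mp_neuron(inputs, weights, threshold):
        """McCulloch-Pitts unit: output 1 if the weighted sum of binary inputs reaches the threshold."""
        total = sum(w * x for w, x in zip(weights, inputs))
        return 1 if total >= threshold else 0

    def AND(a, b):
        return mp_neuron([a, b], [1, 1], threshold=2)

    def OR(a, b):
        return mp_neuron([a, b], [1, 1], threshold=1)

    def NOT(a):
        return mp_neuron([a], [-1], threshold=0)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
    print("NOT 0:", NOT(0), "NOT 1:", NOT(1))

Note that every weight and threshold is written in by hand, exactly as the box’s last paragraph warns: nothing here learns.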


The implications are profound. McCulloch and Pitts show that networks of their idealized neurons can compute any Boolean function—any function from a set of binary inputs to a binary output. By combining AND, OR, and NOT gates appropriately, we can build any logical circuit. And since computation (as Turing had shown) can be reduced to logical operations on symbols, networks of neurons can compute anything that a Turing machine can compute.
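
A small composition makes the universality claim concrete. A single threshold unit cannot compute exclusive-or (XOR), because no one weighted sum separates its true cases from its false ones, but a two-layer network of such units can. The sketch below is self-contained, and the particular weights are merely one workable choice among many.

    def mp_neuron(inputs, weights, threshold):
        return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

    def XOR(a, b):
        # Layer 1: two units computing "a and not b" and "b and not a".
        a_not_b = mp_neuron([a, b], [1, -1], threshold=1)
        b_not_a = mp_neuron([a, b], [-1, 1], threshold=1)
        # Layer 2: OR the two intermediate signals together.
        return mp_neuron([a_not_b, b_not_a], [1, 1], threshold=1)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", XOR(a, b))   # prints 0, 1, 1, 0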

The brain, in this view, is a kind of computer—not a digital computer with a von Neumann architecture, but a massively parallel network of simple processing elements. The computation is distributed across millions of neurons, emerging from their collective activity rather than proceeding step-by-step through a central processor.

McCulloch and Pitts are explicit about the connection to Turing. Their paper argues that their networks are, in this sense, Turing-complete: given suitable connections and access to unbounded memory in the form of a tape, a net can implement any computable function. The brain has the computational power to do anything a computer can do—in principle.

This result is both liberating and limiting. It tells us that there is no fundamental barrier to neural computation. Whatever the brain does, it can be understood (at some level of abstraction) as computation. But it does not tell us how the brain actually computes—which algorithms it uses, how it learns, what representations it employs.

The McCulloch-Pitts model has a critical flaw: the weights are fixed. A human designer must specify every connection strength in advance. The network cannot learn from experience, cannot adapt to new situations, cannot improve with practice. Real brains, obviously, do all of these things. The question of how they do it—how learning might be implemented in neural networks—remains open.

It will take another fifteen years before Frank Rosenblatt develops the perceptron and shows that neural networks can learn from examples. It will take longer still for the field to mature into modern deep learning. But the seed is planted here, in 1943, in a paper whose title (“A Logical Calculus of the Ideas Immanent in Nervous Activity”) announces its ambition: to find the logic hidden in the brain.

Pitts himself never fulfills his early promise. He and McCulloch move to MIT after the war, joining the cybernetics circle around Wiener. But a rift develops—the exact causes are disputed, though accounts point to accusations, relayed by Wiener’s wife, concerning the conduct of McCulloch’s circle toward the Wieners’ daughter. Wiener abruptly cuts off contact, and Pitts is devastated. He destroys much of his unpublished work. He drinks heavily, retreats from research, and dies in 1969 at the age of forty-six, largely forgotten.

McCulloch outlives him by only a few months. The two architects of computational neuroscience die in the same year, their partnership severed by personal tragedy but their intellectual contribution permanent.


Conclusion: The Stage Is Set

We have traveled, in Part I of this book, from the gears of Babbage’s Difference Engine to the abstract realms of Turing machines and Shannon entropy. Along the way, we have asked what seem like purely theoretical questions: What is computation? What is information? How might a brain compute?

But these questions turn out to be practical. The answers shape the technologies we build and the way we understand ourselves.

From Turing, we learned that computation can be defined precisely, that there exists a universal machine capable of executing any algorithm, and that some problems are forever beyond algorithmic solution. This gives us both a framework for understanding computation and a sense of its limits.

From Shannon, we learned that information can be quantified, that communication has fundamental limits defined by entropy and channel capacity, and that these limits can be approached with clever coding. This gives us the foundation for all digital communication, from telephone calls to internet traffic to the training data that feeds large language models.

From Wiener, we learned that feedback connects sensing to action, that the same principles govern thermostats and organisms, and that goal-directed behavior can emerge from simple mechanisms. This gives us a framework for understanding how systems—including intelligent systems—interact with their environments.

From McCulloch and Pitts, we learned that neurons can be modeled as logic gates, that networks of neurons can compute any logical function, and that the brain might be understood as a kind of computing device. This plants the seed that will eventually grow into neural networks and deep learning.

We have theory (Turing), information (Shannon), feedback (Wiener), and a model of neural computation (McCulloch-Pitts). The conceptual tools are in place. The stage is set.

In Part II, we turn from foundations to applications. The question shifts from “What is intelligence?” to “How do we build it?” We will meet the dreamers and builders who tried—sometimes brilliantly, sometimes naively—to create thinking machines. We will see their triumphs and their failures, their promises and their disappointments. The golden age of AI is about to begin.


Chapter Notes

Key Figures

  • Claude Shannon (1916-2001): Mathematician and engineer; father of information theory
  • Norbert Wiener (1894-1964): Mathematician and founder of cybernetics
  • Warren McCulloch (1898-1969): Neurophysiologist and early computational neuroscientist
  • Walter Pitts (1923-1969): Self-taught logician and mathematician

Primary Sources

  • Shannon, C. E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27(3), 379-423, and 27(4), 623-656.
  • Wiener, N. (1948). Cybernetics: Or Control and Communication in the Animal and the Machine. MIT Press.
  • McCulloch, W. S., & Pitts, W. (1943). “A Logical Calculus of the Ideas Immanent in Nervous Activity.” Bulletin of Mathematical Biophysics, 5(4), 115-133.

Further Reading

  • Gleick, J. (2011). The Information: A History, a Theory, a Flood. Pantheon.
  • Conway, F., & Siegelman, J. (2005). Dark Hero of the Information Age: In Search of Norbert Wiener. Basic Books.
  • Soni, J., & Goodman, R. (2017). A Mind at Play: How Claude Shannon Invented the Information Age. Simon & Schuster.
