Information Theory Fundamentals
Information theory provides the mathematical framework for understanding data: how to quantify it, compress it reliably, and transmit it accurately across noisy channels. While born from communication engineering, its principles now underpin machine learning, data science, cryptography, and neuroscience. Mastering its fundamentals—entropy, mutual information, and capacity—empowers you to analyze and design efficient information-processing systems.
Quantifying Uncertainty: Shannon Entropy
The cornerstone of information theory is Shannon entropy, which measures the uncertainty or information content of a random variable. Formally, for a discrete random variable X with probability mass function p(x), its entropy is defined as:

H(X) = −Σₓ p(x) log₂ p(x)
The units are bits when the logarithm is base 2. Entropy quantifies the average "surprise" inherent in X's outcomes. A fair coin flip, with P(heads) = P(tails) = 0.5, has entropy H(X) = 1 bit. This represents maximum uncertainty for a binary variable. If the coin is biased, say P(heads) = 0.9, the entropy decreases to approximately 0.47 bits; the outcome is more predictable, hence less "information" is gained when you observe it.
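The coin-flip numbers above are easy to verify. A minimal sketch in Python (the `entropy` helper is illustrative, not a library function):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    # Terms with p = 0 contribute nothing (the limit of p*log p is 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit, the maximum for a binary variable
print(entropy([0.9, 0.1]))  # biased coin: ≈ 0.469 bits, more predictable
```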
A key property is that entropy is maximized when all outcomes are equally likely. This concept is foundational for data compression. Intuitively, a source with high entropy (high unpredictability) is harder to compress than one with low entropy (high predictability). The link between entropy and compression is formalized by the Source Coding Theorem.
The Source Coding Theorem and Data Compression
The Source Coding Theorem, also known as Shannon's first theorem, establishes the fundamental limit of lossless data compression. It states that for a discrete memoryless source (independent, identically distributed symbols) with entropy H bits per symbol, you can represent the source's output using, on average, H + ε bits per symbol (for any ε > 0) with negligible risk of information loss as the sequence length grows. Conversely, if you try to compress it to fewer than H bits per symbol on average, information loss becomes inevitable.
This theorem proves that Shannon entropy is not just an abstract measure—it is the definitive lower bound for lossless compression. Practical compression schemes, like Huffman coding or arithmetic coding, aim to approach this bound. Huffman coding, for instance, assigns shorter binary codes to more probable symbols and longer codes to less probable ones, achieving an average code length that closely matches H for the source distribution. The theorem guarantees that no algorithm can beat this entropy limit.
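A small sketch of the Huffman idea makes the entropy bound concrete. This toy implementation (tracking only code lengths, not the codewords themselves, via repeated merges of the two least probable groups) uses a dyadic distribution, for which Huffman coding hits the entropy exactly:

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the Huffman code length (in bits) for each symbol."""
    # Heap entries: (probability, tie-breaker, list of symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two least probable groups
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1               # each merge adds one bit to every symbol inside
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]         # dyadic: all probabilities are powers of 2
lengths = huffman_code_lengths(probs)
avg_len = sum(p, l) if False else sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * math.log2(p) for p in probs)
print(lengths)          # shorter codes for more probable symbols
print(avg_len, H)       # average length equals the entropy: 1.75 bits
```

For non-dyadic distributions, Huffman's average length exceeds H by less than one bit per symbol; arithmetic coding closes the remaining gap.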
Measuring Shared Information: Mutual Information and Channels
Real-world communication involves a sender, a channel (which may introduce noise), and a receiver. To analyze this, we need a measure of how much information the received signal gives us about the transmitted signal. This is mutual information, denoted I(X;Y). For two random variables X (input) and Y (output), it is defined as:

I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
Mutual information is symmetric: I(X;Y) = I(Y;X). It quantifies the reduction in uncertainty about X after observing Y. The conditional entropy H(X|Y) represents the remaining uncertainty in X given Y. If X and Y are independent, I(X;Y) = 0; if they are deterministically linked, I(X;Y) = H(X).
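Both extreme cases can be computed directly from a joint probability table, using the equivalent form I(X;Y) = Σ p(x,y) log₂ [p(x,y) / (p(x)p(y))] (the helper below is an illustrative sketch):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, given a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]             # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]       # marginal of Y (column sums)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0                               # zero-probability cells contribute nothing
    )

# Two independent fair bits: observing Y tells us nothing about X.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# Y deterministically equals X: observing Y removes all uncertainty, I = H(X) = 1 bit.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```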
A communication channel is characterized by its conditional probability distribution P(Y|X). The channel capacity C is the maximum possible mutual information between the input and output of the channel, where the maximum is taken over all possible input distributions P(X):

C = max over P(X) of I(X;Y)
Capacity is measured in bits per channel use and represents the ultimate upper limit on reliable communication speed over that noisy channel.
The Noisy Channel Coding Theorem
The Noisy Channel Coding Theorem, or Shannon's second theorem, is a revolutionary result. It states that for any channel with capacity C and any data rate R < C, there exists an error-correcting code that allows for transmission at rate R with an arbitrarily low probability of error. Conversely, if you attempt to communicate at a rate R > C, error-free communication is impossible.
This is a profound existence proof. It doesn't specify how to construct such codes (that is the field of coding theory), but it guarantees they exist. The intuition comes from a sphere-packing argument in a high-dimensional space of possible transmitted sequences: valid codewords act as the centers of decoding spheres, and with a rate below capacity you can pack enough disjoint spheres that noise is unlikely to push a received signal into the wrong sphere. The theorem also justifies the separation principle in digital communication: you can design an optimal source coder and an optimal channel coder independently.
Practical Applications: Compression, Codes, and ML
Data compression algorithms directly apply entropy concepts. Lossless methods (ZIP, PNG) implement source coding to approach the entropy bound. Lossy compression (JPEG, MP3) intentionally discards some information, often guided by models of human perception, to achieve much higher compression ratios for a target fidelity level.
Error-correcting codes are the constructive realization of the channel coding theorem. Codes like Hamming codes, Reed-Solomon codes (used in CDs and QR codes), and modern Turbo codes or Low-Density Parity-Check (LDPC) codes create structured redundancy that allows receivers to detect and correct errors, bringing practical systems remarkably close to the Shannon capacity limit.
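The smallest of these, the Hamming(7,4) code, already shows the core idea of structured redundancy: three parity bits protect four data bits, and the pattern of failed parity checks (the syndrome) directly spells out the position of any single-bit error. A minimal sketch:

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4                  # parity over positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4                  # parity over positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]  # bit positions 1..7

def hamming74_correct(c):
    """Locate and flip a single-bit error via the parity-check syndrome."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]     # re-check positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]     # re-check positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]     # re-check positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3         # syndrome value = error position (0 = no error)
    if pos:
        c[pos - 1] ^= 1
    return c

word = hamming74_encode([1, 0, 1, 1])
noisy = word.copy()
noisy[2] ^= 1                          # channel flips one bit
fixed = hamming74_correct(noisy)
print(fixed == word)                   # True: the single error is found and corrected
```

Modern LDPC and Turbo codes follow the same principle at vastly larger block lengths, which is what lets them approach the Shannon limit.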
In machine learning, information theory provides essential tools. Mutual information is used for feature selection, identifying which input variables share the most information with the target output. The information bottleneck theory frames deep learning as a trade-off between compressing the input representation and preserving information about the output label. Furthermore, the cross-entropy loss function, ubiquitous in classification, is derived from information-theoretic principles and is minimized when a model's predictions match the true data distribution.
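The cross-entropy connection can be seen numerically: H(p, q) = −Σ p(x) log₂ q(x) is the expected code length when you encode data distributed as p using a code optimized for the model's distribution q, and it is minimized exactly when q = p. A short sketch (distributions chosen for illustration):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log2 q(x) in bits: the expected cost of modeling p with q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.7, 0.2, 0.1]
good_model = [0.7, 0.2, 0.1]   # predictions match the data distribution
bad_model = [0.4, 0.3, 0.3]    # miscalibrated predictions

print(cross_entropy(true_dist, good_model))  # equals H(p), the achievable minimum
print(cross_entropy(true_dist, bad_model))   # strictly larger; the gap is the KL divergence
```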
Common Pitfalls
- Confusing Entropy with Randomness: Entropy measures uncertainty, not the randomness of a sequence itself. A perfectly ordered sequence from a high-entropy source (like fair coin flips) is a low-probability but possible outcome. The entropy describes the expected behavior over all possible sequences from the source distribution.
- Equating Mutual Information with Correlation: Mutual information is a more general measure of dependence than linear correlation. It captures any statistical relationship, including nonlinear ones, whereas correlation only measures linear relationships. Two variables can have zero correlation but high mutual information.
- Misinterpreting Channel Capacity: Capacity is a property of the channel (its conditional distribution P(Y|X)), not of a specific code or modulation scheme. It is the maximum mutual information achievable by optimizing the input distribution. A specific communication system operates at a rate R ≤ C.
- Overlooking the Asymptotic Nature of Theorems: The coding theorems hold in the asymptotic limit of infinitely long block lengths. Practical codes use finite block lengths, so there is always a non-zero probability of error, and achievable rates are slightly below the theoretical capacity. This gap is studied in finite blocklength information theory.
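The second pitfall above has a classic concrete witness: take X uniform on {−1, 0, 1} and Y = X². The linear correlation is exactly zero, yet Y carries real information about X. A small sketch:

```python
import math

# X uniform on {-1, 0, 1}, Y = X^2: zero linear correlation, nonzero mutual information.
xs = [-1, 0, 1]
pairs = [(x, x * x) for x in xs]       # each pair has probability 1/3

# Covariance is zero: E[XY] = E[X^3] = 0 and E[X] = 0, so correlation is 0.
exy = sum(x * y for x, y in pairs) / 3
print(exy)  # 0.0

# Y is a deterministic function of X, so I(X;Y) = H(Y) - H(Y|X) = H(Y).
py = {0: 1 / 3, 1: 2 / 3}              # P(Y=0) = 1/3, P(Y=1) = 2/3
mi = -sum(p * math.log2(p) for p in py.values())
print(mi)  # ≈ 0.918 bits of shared information despite zero correlation
```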
Summary
- Shannon Entropy quantifies the average information content or uncertainty of a random variable and sets the absolute limit for lossless data compression.
- The Source Coding Theorem proves that a source can be losslessly compressed to a rate arbitrarily close to its entropy, but not below it.
- Mutual Information measures the amount of information shared between two variables, and Channel Capacity is the maximum mutual information achievable over a noisy channel.
- The Noisy Channel Coding Theorem establishes that reliable communication is possible at any rate below channel capacity and impossible above it, forming the bedrock of modern communication system design.
- These theoretical concepts have direct, powerful applications in data compression algorithms, the design of error-correcting codes, and foundational frameworks in machine learning and data science.