Information Theory

Information theory is the mathematical study of how information is measured, represented, transmitted, and compressed. It underpins everything from data compression formats and error-correcting codes to modern machine learning methods that use probabilistic models and representation learning. At its core are a few deceptively simple quantities, such as entropy and mutual information, that precisely capture uncertainty and dependence.

The central question: how much information?

Any system that stores or communicates data faces constraints: limited bandwidth, noise in transmission, finite memory, or a requirement to preserve only the most meaningful aspects of a signal. Information theory provides answers to questions like:

  • What is the minimum number of bits needed to encode a source without losing information?
  • What is the maximum reliable communication rate over a noisy channel?
  • If some distortion is acceptable, how far can we compress?

The field’s power comes from treating information as a measurable quantity tied to probability, not meaning. That distinction is essential: information theory does not interpret a message; it quantifies how surprising the message is under a model.

Entropy: the unit of uncertainty

Consider a discrete random variable X with possible outcomes x_1, …, x_n and probabilities p(x_1), …, p(x_n). Shannon entropy measures the average uncertainty in X:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

Entropy is measured in bits when using the base-2 logarithm. Intuitively:

  • If X is deterministic, H(X) = 0: there is no uncertainty.
  • If X is uniform over n outcomes, H(X) = log₂ n: uncertainty is maximized.

A practical way to read entropy is as an idealized lower bound on average code length. If you want to encode outcomes of X into binary strings, you cannot beat H(X) bits per symbol on average without losing information, assuming you are coding long sequences and the probabilities are accurate.
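As a quick illustration, here is a minimal Python sketch (entropy_bits is just an illustrative helper name) that computes the entropy of a few simple distributions:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits for a discrete distribution given as a list of probabilities."""
    # Terms with p = 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # a fair coin: 1.0 bit per flip
print(entropy_bits([0.9, 0.1]))    # a biased coin is more predictable: ~0.47 bits
print(entropy_bits([1.0]))         # a deterministic outcome: 0.0 bits
print(entropy_bits([1/8] * 8))     # uniform over 8 outcomes: log2(8) = 3.0 bits
```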

Cross-entropy and KL divergence

In real systems, we often code using an estimated model q rather than the true distribution p. The expected code length then relates to cross-entropy:

H(p, q) = -\sum_{x} p(x) \log_2 q(x)

The penalty for using the wrong model is captured by Kullback-Leibler divergence:

D_{\mathrm{KL}}(p \| q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)} = H(p, q) - H(p)

These quantities appear throughout machine learning. Minimizing cross-entropy loss is equivalent to maximizing likelihood, and the gap between cross-entropy and entropy is exactly D_KL(p || q).
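The relationship between these three quantities is easy to check numerically. A short sketch, using a made-up source distribution p and a mismatched model q:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Average bits per symbol when the source follows p but the code is designed for q.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" source distribution
q = [0.4, 0.4, 0.2]   # mismatched model

print(entropy(p))                          # ~1.16 bits
print(cross_entropy(p, q))                 # ~1.42 bits
print(kl_divergence(p, q))                 # ~0.27 bits
print(cross_entropy(p, q) - entropy(p))    # same ~0.27: the gap is exactly the KL divergence
```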

Mutual information: dependence quantified

Mutual information measures how much knowing one variable reduces uncertainty about another. For random variables X and Y:

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)

Equivalently,

I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}
Key interpretations:

  • I(X; Y) = 0 if and only if X and Y are independent.
  • Mutual information is symmetric and non-negative.
  • It provides a model-agnostic measure of association, not limited to linear correlation.

In communications, mutual information describes how much information about the transmitted signal can be recovered from the received signal. In machine learning, it shows up in feature selection, representation learning, and analyses of generalization.
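A small sketch makes the definition concrete: mutual information computed directly from a joint probability table, showing that it vanishes for independent variables and equals one bit for perfectly correlated bits (mutual_information_bits is an illustrative helper, not a library function):

```python
import math

def mutual_information_bits(joint):
    """I(X; Y) in bits, from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Independent fair bits: the joint factorizes, so I(X; Y) = 0.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
print(mutual_information_bits(independent))   # 0.0

# Perfectly correlated bits: observing Y removes all uncertainty about X.
correlated = [[0.5, 0.0],
              [0.0, 0.5]]
print(mutual_information_bits(correlated))    # 1.0 bit
```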

Source coding: compressing without losing information

Source coding asks: given a probabilistic source, what is the best lossless compression you can achieve?

Shannon’s source coding theorem

For a memoryless source emitting symbols according to a distribution p(x), Shannon’s theorem states that the average number of bits per symbol required for lossless compression can approach H(X), but cannot be less than H(X), in the limit of long sequences.

This is the theoretical foundation behind practical methods such as Huffman coding and arithmetic coding:

  • Huffman coding constructs variable-length prefix codes close to entropy, optimal among prefix codes for known symbol probabilities.
  • Arithmetic coding can come arbitrarily close to H(X) by encoding entire sequences into subintervals based on probabilities, which is why it often outperforms Huffman coding when probabilities are well-modeled.

In practice, compression algorithms also rely on modeling. Better probability estimates yield shorter codes, which is exactly why language models can function as compressors: predicting likely next tokens reduces surprise and thus reduces required bits.
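To see how close a simple prefix code can get to the entropy bound, here is a rough Python sketch that builds Huffman code lengths for a small alphabet and compares the average length with H(X) (huffman_code_lengths is an illustrative helper written for this example):

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Code length in bits for each symbol under a Huffman code built from probs."""
    # Heap entries: (total probability, tie-breaker, list of (symbol, depth) leaves).
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        merged = [(s, d + 1) for s, d in leaves1 + leaves2]  # merging adds one bit of depth
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    lengths = [0] * len(probs)
    for symbol, depth in heap[0][2]:
        lengths[symbol] = depth
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_code_lengths(probs)
print(lengths)                                      # e.g. [1, 2, 3, 3]
print(sum(p * l for p, l in zip(probs, lengths)))   # 1.75 bits per symbol on average
print(-sum(p * math.log2(p) for p in probs))        # entropy is also 1.75 bits here
```

Because these probabilities are exact powers of two, Huffman coding hits the entropy exactly; for other distributions it can pay up to nearly one extra bit per symbol, which is where arithmetic coding helps.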

Channel capacity: communicating reliably over noise

Channel coding addresses a different constraint: noise. A channel takes an input X and produces an output Y according to some conditional distribution p(y | x). The central quantity is channel capacity, the maximum achievable reliable communication rate:

C = \max_{p(x)} I(X; Y)

Capacity is measured in bits per channel use. It is not about a particular code, but about what is possible in principle given the channel’s statistics.
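For a concrete number, consider the binary symmetric channel, a standard textbook example not worked out above: each transmitted bit is flipped independently with probability p, and the capacity is C = 1 − H(p), where H(p) is the binary entropy function. A small sketch:

```python
import math

def binary_entropy(p):
    """H(p) in bits for a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob):
    """Capacity (bits per channel use) of a binary symmetric channel."""
    return 1.0 - binary_entropy(flip_prob)

for p in [0.0, 0.01, 0.11, 0.5]:
    print(p, round(bsc_capacity(p), 3))
# A noiseless channel carries 1 bit per use; at p = 0.5 the output is
# statistically independent of the input and the capacity drops to 0.
```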

Shannon’s channel coding theorem

Shannon’s channel coding theorem states:

  • If you transmit at any rate R below the capacity C, there exist coding schemes for which the probability of decoding error can be made arbitrarily small by using sufficiently long block lengths.
  • If R > C, the error probability cannot be driven to zero, no matter what code you use.

This result reshaped engineering. It separated the problem into two layers: a physical channel with a fundamental limit, and coding schemes that can approach that limit.
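The simplest possible code makes the tradeoff vivid. The rough simulation below (simulate_repetition_code is an illustrative name) repeats each bit n times over a binary symmetric channel and decodes by majority vote: the error rate does fall as blocks grow, but only because the rate collapses toward zero, which is exactly what capacity-approaching codes avoid.

```python
import random

def simulate_repetition_code(n_repeat, flip_prob, n_bits=100_000, seed=0):
    """Send each bit n_repeat times over a BSC and decode by majority vote.
    Returns (rate in information bits per channel use, decoded bit error rate)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_bits):
        bit = rng.randint(0, 1)
        received = [bit ^ (rng.random() < flip_prob) for _ in range(n_repeat)]
        decoded = int(sum(received) > n_repeat / 2)
        errors += int(decoded != bit)
    return 1.0 / n_repeat, errors / n_bits

for n in [1, 3, 5, 9]:
    rate, ber = simulate_repetition_code(n, flip_prob=0.1)
    print(f"repeat {n}: rate = {rate:.3f}, bit error rate ~ {ber:.4f}")
# Shannon's theorem says we can do far better: vanishing error at a fixed rate below C.
```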

Error-correcting codes in the real world

Modern error-correcting codes aim to operate close to capacity with feasible computation. Examples include turbo codes, LDPC codes, and polar codes, widely used in cellular networks, Wi-Fi, satellite links, and storage devices. Their practical success is a direct consequence of Shannon’s theorem: capacity is achievable, not merely a bound.

Rate-distortion theory: when some loss is acceptable

Many applications do not require perfect reconstruction. Images, audio, and sensor streams can tolerate controlled distortion. Rate-distortion theory formalizes the best tradeoff between compression rate and fidelity.

Given a distortion measure d(x, x̂) and an allowed expected distortion D, the rate-distortion function R(D) is the minimum number of bits per symbol needed to encode X so that the reconstruction X̂ satisfies E[d(X, X̂)] ≤ D.

The details depend on the source distribution and chosen distortion metric, but the conceptual takeaway is consistent: allowing more distortion can dramatically reduce the required rate, and information theory characterizes the optimal frontier. This perspective helps explain why lossy codecs can be extremely efficient: they exploit perceptual or task-specific tolerance for error.
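One case where the frontier has a clean closed form is a memoryless Gaussian source under squared-error distortion, a standard result not derived above: R(D) = max(0, ½ log₂(σ²/D)) bits per sample. A small sketch:

```python
import math

def gaussian_rate_distortion(variance, max_distortion):
    """R(D) in bits per sample for a Gaussian source with squared-error distortion."""
    if max_distortion >= variance:
        # Reconstructing every sample as the mean already meets the distortion target.
        return 0.0
    return 0.5 * math.log2(variance / max_distortion)

for d in [1.0, 0.5, 0.1, 0.01]:
    print(d, round(gaussian_rate_distortion(1.0, d), 3))
# Loosening the distortion target by a factor of 4 saves a full bit per sample;
# driving D toward 0 pushes the required rate toward infinity.
```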

Applications to machine learning

Information theory is not just foundational to communications. It provides tools and language for ML, especially where uncertainty, compression, and generalization matter.

Cross-entropy as a training objective

In classification and language modeling, cross-entropy loss measures the mismatch between the model distribution and the data distribution. Lower cross-entropy means the model assigns higher probability to the observed outcomes, which also corresponds to fewer bits needed to encode the labels or tokens using the model as a compressor.
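Concretely, cross-entropy losses are typically reported in nats (natural log); dividing by ln 2 converts them to bits, i.e. the average code length per label or token if the model were used as a compressor. A tiny sketch, with made-up loss values:

```python
import math

def loss_to_bits_per_token(cross_entropy_nats):
    """Convert a cross-entropy loss reported in nats into bits per token."""
    return cross_entropy_nats / math.log(2)

print(loss_to_bits_per_token(3.0))            # ~4.33 bits per token
print(loss_to_bits_per_token(math.log(50)))   # uniform guessing over 50 classes: log2(50) ~ 5.64 bits
```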

Mutual information and representation learning

Mutual information can be used to reason about what a representation retains. Informally, a good representation Z of an input X for predicting a target Y should keep information relevant to Y while discarding nuisance variability. This idea motivates objectives that encourage I(Z; Y) to be high and, in some formulations, I(Z; X) to be controlled to avoid memorization.

Channel capacity as an analogy for learning systems

Neural networks, optimization noise, and finite-precision computation can be viewed through an information-theoretic lens: internal representations are constrained channels that transform inputs into outputs. While the analogy should be used carefully, it is often helpful for thinking about bottlenecks, robustness, and the limits of recoverability under noise.

Why information theory remains essential

Information theory endures because it provides sharp, quantitative answers to practical questions:

  • Entropy sets limits on lossless compression.
  • Mutual information quantifies dependence and recoverability.
  • Channel capacity defines what reliable communication can achieve.
  • Rate-distortion theory formalizes compression under acceptable loss.

Whether you are designing a transmission system, building a compression pipeline, or training probabilistic models, information theory offers the clearest way to connect probability, uncertainty, and the fundamental limits of performance.
