Information Theoretic Machine Learning
The following is a minimal introduction to the mathematical (statistical) ideas that appear frequently in information-theoretic machine learning and deep learning papers.
Important/Common Concepts and Equations
Shannon Entropy
- Measures the uncertainty or information of a random variable X
- Measured in bits
- Intuition:
- If X is totally predictable -> H(X) = 0 bits
- If X is uniform over n values -> H(X) = log n bits
$$ H(X) = -\sum_x{p(x) \log_2{p(x)}} $$
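The formula above can be sketched in a few lines of NumPy (the `entropy` helper name is just illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([1.0, 0.0]))    # totally predictable -> 0.0 bits
print(entropy([0.25] * 4))    # uniform over 4 values -> log2(4) = 2.0 bits
```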
Mutual Information
- Measures how much information is shared between X and Y
- Intuition:
- I(X;Y) = 0 -> X and Y are independent
- Larger I(X;Y) -> knowing X tells you more about Y
- Coding interpretation: if we know Y, we can save an average of I(X;Y) bits when encoding X
- MI is symmetric: I(X;Y) = I(Y;X)
$$ I(X;Y) = \sum_{x,y}{p(x,y) \log_2{\frac{p(x,y)}{p(x)p(y)}}} $$
$$ I(X;Y) = H(X) - H(X | Y) $$
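Mutual information can be computed directly from a joint probability table; a minimal NumPy sketch of the first formula above (the `mutual_information` helper name is illustrative):

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits from a joint table pxy[i, j] = p(x_i, y_j)."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0                        # skip zero-probability cells
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

# Independent variables -> I = 0; perfectly correlated -> I = H(X) = 1 bit
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```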
Kullback-Leibler (KL) Divergence
- Measures how different two discrete probability distributions P and Q are
- Intuition:
- KL Divergence is 0 if P = Q
- Not symmetric so D_KL(P||Q) != D_KL(Q||P)
- Expected number of extra bits needed to encode samples from P using a code optimized for Q
$$ D_{KL}(P||Q) = \sum_x{P(x) \log_2{\frac{P(x)}{Q(x)}}} $$
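A minimal NumPy sketch of the sum above (`kl_divergence` is an illustrative helper name; it assumes Q(x) > 0 wherever P(x) > 0):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p, q = [0.5, 0.5], [0.75, 0.25]
print(kl_divergence(p, p))   # identical distributions -> 0.0
print(kl_divergence(p, q), kl_divergence(q, p))   # not symmetric
```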
Multivariate Gaussian
- A multidimensional bell curve centered at the mean μ, where the covariance matrix Σ controls the shape/spread
$$ p(x) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(- \frac{1}{2} (x - \mu)^T \Sigma^{-1}(x - \mu)\right) $$
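The density can be evaluated term by term with NumPy; a minimal sketch (`mvn_pdf` is an illustrative name, and a numerically robust implementation would use a Cholesky factorization rather than `det`/`inv`):

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of a multivariate Gaussian N(mu, sigma) at point x."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    k = mu.size
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma))
    quad = diff @ np.linalg.inv(sigma) @ diff   # (x - mu)^T Sigma^{-1} (x - mu)
    return float(np.exp(-0.5 * quad) / norm)

# 2-D standard normal: the density at the mean is 1 / (2*pi)
print(mvn_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2)))
```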
Expectation (E)
- Average value of f(X) if X is drawn according to P
- Weighted average of f(X) over the probability of X
$$ \mathbb{E}_{P(X)}[f(X)] = \int P(x)f(x) dx $$
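When the integral is intractable, the expectation is often estimated by Monte Carlo: draw samples from P and average f over them. A minimal sketch, assuming P = N(0, 1) and f(x) = x² (so the true expectation is Var(X) = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E_{P(X)}[f(X)]: average f over samples drawn from P.
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 2)
print(estimate)   # close to the true value 1.0
```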
Markov Chain
A Markov chain is a sequence of random variables in which each variable depends only on the one immediately before it. This can be viewed from the perspective of a neural network, in which each layer's outputs depend only on the previous layer's outputs.
$$ X \rightarrow Z \rightarrow Y $$
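A minimal sketch of sampling such a chain, assuming an illustrative two-state transition matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state Markov chain: the next state depends only on the current one.
P = np.array([[0.9, 0.1],     # transition probabilities from state 0
              [0.2, 0.8]])    # transition probabilities from state 1

state = 0
trajectory = [state]
for _ in range(10):
    state = rng.choice(2, p=P[state])   # depends only on `state`
    trajectory.append(int(state))
print(trajectory)
```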
Data Processing Inequality (DPI)
The data processing inequality states that processing cannot create information: in a Markov chain, a downstream variable can never carry more information about an upstream variable than an intermediate one does.
$$ \text{For Markov chain } X \rightarrow Z \rightarrow Y \text{ : } I(X;Z) \geq I(X;Y) $$
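The inequality can be checked numerically by building a small chain X → Z → Y from a prior and two noisy channels (all probabilities below are illustrative, and `mi` is an illustrative helper):

```python
import numpy as np

def mi(pxy):
    """I in bits from a joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

px = np.array([0.5, 0.5])                   # prior p(x)
pz_x = np.array([[0.9, 0.1], [0.1, 0.9]])   # channel p(z|x), rows indexed by x
py_z = np.array([[0.8, 0.2], [0.2, 0.8]])   # channel p(y|z), rows indexed by z

pxz = px[:, None] * pz_x                    # joint p(x, z)
pxy = pxz @ py_z                            # p(x, y) = sum_z p(x, z) p(y|z)

print(mi(pxz), mi(pxy))                     # DPI: I(X;Z) >= I(X;Y)
```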
Reparameterization Invariance [1]
For any two invertible functions ϕ, ψ, mutual information is preserved:
$$ I(X;Y) = I(\phi(X);\psi(Y)). $$
For deep neural networks, this means that invertibly transforming (e.g., shuffling) a given layer's representation does not change the mutual information between that layer and the others. This matters when considering computational complexity, as done in the V-Information paper: the mutual information between the two random variables stays the same, but the computational cost of extracting it can grow sharply depending on how hard the functions ϕ, ψ are to invert. The paper introduces a new type of information that takes such computational constraints into consideration.
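For discrete variables, an invertible ϕ or ψ amounts to a relabeling of values, so the invariance can be checked by permuting the rows and columns of a joint table (the table and `mi` helper below are illustrative):

```python
import numpy as np

def mi(pxy):
    """I in bits from a joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

pxy = np.array([[0.30, 0.10, 0.05],
                [0.05, 0.25, 0.25]])

# Invertible relabelings phi, psi: permute the values X and Y can take.
# The joint table's rows/columns are shuffled, but I(X;Y) is unchanged.
perm_x, perm_y = [1, 0], [2, 0, 1]
pxy_relabeled = pxy[perm_x][:, perm_y]

print(mi(pxy), mi(pxy_relabeled))   # equal
```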