Information Theoretic Machine Learning

Posted on October 15, 2025

The following is a minimal introduction to the mathematical (statistical) ideas that appear most often in information-theoretic machine learning and deep learning papers.

Important/Common Concepts and Equations

Shannon Entropy

  • Measures the uncertainty or information of a random variable X
  • Measured in bits
  • Intuition:
    • If X is totally predictable -> H(X) = 0 bits
    • If X is uniform over n values -> H(X) = log n bits

$$ H(X) = -\sum_x{p(x) \log_2{p(x)}} $$
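To make the intuition above concrete, here is a minimal pure-Python sketch (the `entropy` helper is my own naming, not a library function):

```python
from math import log2

def entropy(p):
    """Shannon entropy H(X) in bits of a discrete distribution p."""
    return -sum(px * log2(px) for px in p if px > 0)

# A fully predictable variable carries no information.
print(entropy([1.0]))       # 0.0 bits
# Uniform over n = 4 values: H = log2(4) = 2 bits.
print(entropy([0.25] * 4))  # 2.0 bits
# A biased coin is less uncertain than a fair one.
print(entropy([0.9, 0.1]))  # ~0.469 bits
```

The `if px > 0` guard implements the standard convention that 0 log 0 = 0.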

Mutual Information

  • Measures how much information is shared between X and Y
  • Intuition:
    • I(X;Y) = 0 -> X and Y are independent
    • Larger I(X;Y) -> knowing X tells you more about Y
    • Operationally, if we know Y, we can save an average of I(X;Y) bits when encoding X
    • MI is symmetric: I(X;Y) = I(Y;X)

$$ I(X;Y) = \sum_{x,y}{p(x,y) \log_2{\frac{p(x,y)}{p(x)p(y)}}} $$

I(X; Y) = H(X) − H(X|Y)
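A small sketch of the summation formula, computed directly from a joint distribution table (`mutual_information` is my own illustrative helper):

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Perfectly correlated fair bits: knowing one fully determines the other.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0 bit
# Independent fair bits share no information.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0 bits
```

Transposing the joint table swaps the roles of X and Y and leaves the result unchanged, matching the symmetry property above.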

Kullback-Leibler (KL) Divergence

  • Measures how different two discrete probability distributions P and Q are
  • Intuition:
    • KL Divergence is 0 if P = Q
    • Not symmetric, so D_KL(P||Q) != D_KL(Q||P)
    • Measures the extra bits required to encode samples from P using a code optimized for Q

$$ D_{KL}(P||Q) = \sum_x{P(x) \log_2{\frac{P(x)}{Q(x)}}} $$
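A quick sketch that demonstrates both bullet points, the zero case and the asymmetry (`kl_divergence` is my own helper, not a library call):

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P||Q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # heavily biased coin
print(kl_divergence(p, p))  # 0.0 when P = Q
print(kl_divergence(p, q))  # extra bits paid for coding P with Q's code
print(kl_divergence(q, p))  # a different number: KL is not symmetric
```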

Multivariate Gaussian

  • A multidimensional bell curve where Σ controls the shape/spread

$$ p(x) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(- \frac{1}{2} (x - \mu)^T \Sigma^{-1}(x - \mu)\right) $$
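A minimal sketch of the density for the k = 2 case, with the 2x2 determinant and inverse written out by hand so no libraries are needed (`gaussian_2d` is my own helper):

```python
from math import exp, pi, sqrt

def gaussian_2d(x, mu, sigma):
    """Density of a 2-D Gaussian; sigma is a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return exp(-0.5 * quad) / sqrt((2 * pi) ** 2 * det)

# Standard 2-D Gaussian at its mean: 1 / (2*pi) ~= 0.159.
print(gaussian_2d([0, 0], [0, 0], [[1, 0], [0, 1]]))
# The density decays as we move away from the mean.
print(gaussian_2d([1, 0], [0, 0], [[1, 0], [0, 1]]))
```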

Expectation (E)

  • Average value of f(X) if X is drawn according to P
  • Weighted average of f(X) over the probability of X

$$ \mathbb{E}_{P(X)}[f(X)] = \int{P(x) f(x) \, dx} $$
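A sketch of the discrete case (a sum instead of an integral), plus a Monte Carlo check: averaging f over samples drawn from P approximates the expectation. The distribution and f here are arbitrary examples of my own.

```python
import random

# Discrete case: E[f(X)] = sum_x P(x) f(x).
values = [0, 1, 2]
probs = [0.2, 0.5, 0.3]
f = lambda x: x ** 2
exact = sum(p * f(x) for x, p in zip(values, probs))
print(exact)  # 0.5*1 + 0.3*4 = 1.7

# Monte Carlo estimate: sample X ~ P and average f(X).
random.seed(0)
samples = random.choices(values, weights=probs, k=100_000)
mc_estimate = sum(f(x) for x in samples) / len(samples)
print(mc_estimate)  # close to 1.7
```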

Markov Chain

A Markov chain is a sequence of random variables in which each next state depends only on the current state, not on the full history. This can be viewed from the perspective of a neural network, in which each layer's outputs depend only on the previous layer's outputs.

X → Z → Y
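A tiny simulation of the Markov property: the `step` function below sees only the current state, never the history (the two-state weather chain and its transition probabilities are my own toy example):

```python
import random

# Transition probabilities: T[current][next].
T = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state, rng):
    """Sample the next state; it depends only on the current state."""
    nxt = list(T[state])
    return rng.choices(nxt, weights=[T[state][s] for s in nxt])[0]

rng = random.Random(0)
state, path = "sunny", ["sunny"]
for _ in range(5):
    state = step(state, rng)
    path.append(state)
print(path)
```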

Data Processing Inequality (DPI)

The data processing inequality states that processing can never create information: for any Markov chain, mutual information with the source can only decrease as you move further along the chain.

For Markov chain X → Z → Y : I(X; Z) ≥ I(X; Y)
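A numeric check of the inequality, using my own toy setup: a fair bit X is passed through a binary symmetric channel (which flips the bit with probability eps) to get Z, and Z through a second channel to get Y. The `mi` and `through_bsc` helpers are illustrative, not library functions.

```python
from math import log2

def mi(joint):
    """I in bits from a joint distribution table joint[i][j]."""
    px = [sum(r) for r in joint]
    py = [sum(c) for c in zip(*joint)]
    return sum(p * log2(p / (px[i] * py[j]))
               for i, r in enumerate(joint) for j, p in enumerate(r) if p > 0)

def through_bsc(joint, eps):
    """Pass the second variable through a channel that flips it w.p. eps."""
    out = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            out[i][j] = joint[i][j] * (1 - eps) + joint[i][1 - j] * eps
    return out

joint_xx = [[0.5, 0.0], [0.0, 0.5]]     # X paired with a perfect copy
joint_xz = through_bsc(joint_xx, 0.1)   # joint of (X, Z)
joint_xy = through_bsc(joint_xz, 0.1)   # joint of (X, Y): more noise
print(mi(joint_xz), mi(joint_xy))       # I(X;Z) >= I(X;Y)
```

Each channel pass adds noise, so Y tells us strictly less about X than Z does.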

Reparameterization Invariance [1]

For two invertible functions ϕ, ψ, mutual information is unchanged:

I(X; Y) = I(ϕ(X); ψ(Y)).

For deep neural networks, this means that shuffling the weights of a given layer (an invertible permutation) does not change the mutual information between that layer and the others. This matters when considering computational complexity, as done in the V-Information paper: the mutual information between the two random variables does not change, but the computational cost of extracting it can increase heavily depending on how hard the functions ϕ, ψ are to invert. The paper introduces a new notion of information that takes such computational constraints into consideration.
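The invariance can be checked directly in the discrete case, where an invertible map is just a relabeling of values. Below, ϕ and ψ are taken to be permutations of the labels (a simple special case; the joint table and `mi` helper are my own illustration):

```python
from math import log2

def mi(joint):
    """I(X;Y) in bits from a joint distribution table joint[i][j]."""
    px = [sum(r) for r in joint]
    py = [sum(c) for c in zip(*joint)]
    return sum(p * log2(p / (px[i] * py[j]))
               for i, r in enumerate(joint) for j, p in enumerate(r) if p > 0)

joint = [[0.3, 0.1], [0.2, 0.4]]

# phi and psi as bijections on the labels: swap the two values of X and of Y.
phi = [1, 0]
psi = [1, 0]
relabeled = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        relabeled[phi[i]][psi[j]] = joint[i][j]

print(mi(joint), mi(relabeled))  # identical: MI is reparameterization-invariant
```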