Information Theoretic Machine Learning

Posted on October 15, 2025

The following is a minimal introduction to the mathematical (statistical) ideas that appear most often in information-theoretic machine learning and deep learning papers.

Important/Common Concepts and Equations

Shannon Entropy

  • Measures the uncertainty or information of a random variable X
  • Measured in bits
  • Intuition:
    • If X is totally predictable -> H(X) = 0 bits
    • If X is uniform over n values -> H(X) = log n bits

$$ H(X) = -\sum_x{p(x) \log_2{p(x)}} $$
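To make the intuition above concrete, here is a minimal pure-Python sketch (the `entropy` helper is my own naming, not a library function):

```python
from math import log2

def entropy(p):
    """Shannon entropy H(X) in bits of a discrete distribution p."""
    return -sum(px * log2(px) for px in p if px > 0)

# A fully predictable variable carries no information.
print(entropy([1.0]))       # 0.0 bits
# Uniform over n = 4 values: H = log2(4) = 2 bits.
print(entropy([0.25] * 4))  # 2.0 bits
# A biased coin is less uncertain than a fair one.
print(entropy([0.9, 0.1]))  # ~0.469 bits
```

The `if px > 0` guard implements the standard convention that 0 log 0 = 0.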

Mutual Information

  • Measures how much information is shared between X and Y
  • Intuition:
    • I(X;Y) = 0 -> X and Y are independent
    • Larger I(X;Y) -> knowing X tells you more about Y
    • Operationally, if we know Y, we can save an average of I(X;Y) bits when encoding X
    • MI is symmetric: I(X;Y) = I(Y;X)

$$ I(X;Y) = \sum_{x,y}{p(x,y) \log_2{\frac{p(x,y)}{p(x)p(y)}}} $$

I(X; Y) = H(X) − H(X|Y)
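A small sketch of the summation formula, computed directly from a joint distribution table (`mutual_information` is my own illustrative helper):

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Perfectly correlated fair bits: knowing one fully determines the other.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0 bit
# Independent fair bits share no information.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0 bits
```

Transposing the joint table swaps the roles of X and Y and leaves the result unchanged, matching the symmetry property above.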

Kullback-Leibler (KL) Divergence

  • Measures how different two discrete probability distributions P and Q are
  • Intuition:
    • KL Divergence is 0 if P = Q
    • Not symmetric, so D_KL(P||Q) != D_KL(Q||P)
    • Measures the extra bits required to encode samples from P using a code optimized for Q

$$ D_{KL}(P||Q) = \sum_x{P(x) \log_2{\frac{P(x)}{Q(x)}}} $$
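A quick sketch that demonstrates both bullet points, the zero case and the asymmetry (`kl_divergence` is my own helper, not a library call):

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P||Q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # heavily biased coin
print(kl_divergence(p, p))  # 0.0 when P = Q
print(kl_divergence(p, q))  # extra bits paid for coding P with Q's code
print(kl_divergence(q, p))  # a different number: KL is not symmetric
```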

Multivariate Gaussian

  • A multidimensional bell curve where Σ controls the shape/spread

$$ p(x) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(- \frac{1}{2} (x - \mu)^T \Sigma^{-1}(x - \mu)\right) $$
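A minimal sketch of the density for the k = 2 case, with the 2x2 determinant and inverse written out by hand so no libraries are needed (`gaussian_2d` is my own helper):

```python
from math import exp, pi, sqrt

def gaussian_2d(x, mu, sigma):
    """Density of a 2-D Gaussian; sigma is a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return exp(-0.5 * quad) / sqrt((2 * pi) ** 2 * det)

# Standard 2-D Gaussian at its mean: 1 / (2*pi) ~= 0.159.
print(gaussian_2d([0, 0], [0, 0], [[1, 0], [0, 1]]))
# The density decays as we move away from the mean.
print(gaussian_2d([1, 0], [0, 0], [[1, 0], [0, 1]]))
```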

Expectation (E)

  • Average value of f(X) if X is drawn according to P
  • Weighted average of f(X) over the probability of X

$$ \mathbb{E}_{P(X)}[f(X)] = \int{P(x) f(x) \, dx} $$
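A sketch of the discrete case (a sum instead of an integral), plus a Monte Carlo check: averaging f over samples drawn from P approximates the expectation. The distribution and f here are arbitrary examples of my own.

```python
import random

# Discrete case: E[f(X)] = sum_x P(x) f(x).
values = [0, 1, 2]
probs = [0.2, 0.5, 0.3]
f = lambda x: x ** 2
exact = sum(p * f(x) for x, p in zip(values, probs))
print(exact)  # 0.5*1 + 0.3*4 = 1.7

# Monte Carlo estimate: sample X ~ P and average f(X).
random.seed(0)
samples = random.choices(values, weights=probs, k=100_000)
mc_estimate = sum(f(x) for x in samples) / len(samples)
print(mc_estimate)  # close to 1.7
```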

Markov Chain

A Markov chain is a sequence of random variables in which each next state depends only on the current state, not on the full history. This can be viewed from the perspective of a neural network, in which each layer's outputs depend only on the previous layer's outputs.

X → Z → Y
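A tiny simulation of the Markov property: the `step` function below sees only the current state, never the history (the two-state weather chain and its transition probabilities are my own toy example):

```python
import random

# Transition probabilities: T[current][next].
T = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state, rng):
    """Sample the next state; it depends only on the current state."""
    nxt = list(T[state])
    return rng.choices(nxt, weights=[T[state][s] for s in nxt])[0]

rng = random.Random(0)
state, path = "sunny", ["sunny"]
for _ in range(5):
    state = step(state, rng)
    path.append(state)
print(path)
```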

Data Processing Inequality (DPI)

The data processing inequality states that processing can never create information: for any Markov chain, mutual information with the source can only decrease as you move further along the chain.

For Markov chain X → Z → Y : I(X; Z) ≥ I(X; Y)
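A numeric check of the inequality, using my own toy setup: a fair bit X is passed through a binary symmetric channel (which flips the bit with probability eps) to get Z, and Z through a second channel to get Y. The `mi` and `through_bsc` helpers are illustrative, not library functions.

```python
from math import log2

def mi(joint):
    """I in bits from a joint distribution table joint[i][j]."""
    px = [sum(r) for r in joint]
    py = [sum(c) for c in zip(*joint)]
    return sum(p * log2(p / (px[i] * py[j]))
               for i, r in enumerate(joint) for j, p in enumerate(r) if p > 0)

def through_bsc(joint, eps):
    """Pass the second variable through a channel that flips it w.p. eps."""
    out = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            out[i][j] = joint[i][j] * (1 - eps) + joint[i][1 - j] * eps
    return out

joint_xx = [[0.5, 0.0], [0.0, 0.5]]     # X paired with a perfect copy
joint_xz = through_bsc(joint_xx, 0.1)   # joint of (X, Z)
joint_xy = through_bsc(joint_xz, 0.1)   # joint of (X, Y): more noise
print(mi(joint_xz), mi(joint_xy))       # I(X;Z) >= I(X;Y)
```

Each channel pass adds noise, so Y tells us strictly less about X than Z does.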

Reparameterization Invariance [1]

For two invertible functions ϕ, ψ, mutual information is unchanged:

I(X; Y) = I(ϕ(X); ψ(Y)).

For deep neural networks, this means that shuffling the weights of a given layer (an invertible permutation) does not change the mutual information between that layer and the others. This matters when considering computational complexity, as done in the V-Information paper: the mutual information between the two random variables does not change, but the computational cost of extracting it can increase heavily depending on how hard the functions ϕ, ψ are to invert. The paper introduces a new notion of information that takes such computational constraints into consideration.
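The invariance can be checked directly in the discrete case, where an invertible map is just a relabeling of values. Below, ϕ and ψ are taken to be permutations of the labels (a simple special case; the joint table and `mi` helper are my own illustration):

```python
from math import log2

def mi(joint):
    """I(X;Y) in bits from a joint distribution table joint[i][j]."""
    px = [sum(r) for r in joint]
    py = [sum(c) for c in zip(*joint)]
    return sum(p * log2(p / (px[i] * py[j]))
               for i, r in enumerate(joint) for j, p in enumerate(r) if p > 0)

joint = [[0.3, 0.1], [0.2, 0.4]]

# phi and psi as bijections on the labels: swap the two values of X and of Y.
phi = [1, 0]
psi = [1, 0]
relabeled = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        relabeled[phi[i]][psi[j]] = joint[i][j]

print(mi(joint), mi(relabeled))  # identical: MI is reparameterization-invariant
```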