Information Theoretic Machine Learning
The following is a minimal introduction to the mathematical (statistical) ideas that appear frequently in information-theoretic machine learning and deep learning papers.
Important/Common Concepts and Equations
Shannon Entropy
- Measures the uncertainty or information of a random variable X
- Measured in bits
- Intuition:
- If X is totally predictable -> H(X) = 0 bits
- If X is uniform over n values -> H(X) = log n bits
$$ H(X) = -\sum_x{p(x) \log_2{p(x)}} $$
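The formula above can be sketched in a few lines of NumPy (the `entropy` helper name is just illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([1.0, 0.0]))    # totally predictable -> 0.0 bits
print(entropy([0.25] * 4))    # uniform over 4 values -> log2(4) = 2.0 bits
```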
Mutual Information
- Measures how much information is shared between X and Y
- Intuition:
- I(X;Y) = 0 -> X and Y are independent
- Larger I(X;Y) -> knowing X tells you more about Y
- Coding interpretation: if we know Y, we can save an average of I(X;Y) bits when encoding X
- MI is symmetric: I(X;Y) = I(Y;X)
$$ I(X;Y) = \sum_{x,y}{p(x,y) \log_2{\frac{p(x,y)}{p(x)p(y)}}} $$
$$ I(X;Y) = H(X) - H(X | Y) $$
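Mutual information can be computed directly from a joint probability table; a minimal NumPy sketch of the first formula above (the `mutual_information` helper name is illustrative):

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits from a joint table pxy[i, j] = p(x_i, y_j)."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0                        # skip zero-probability cells
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

# Independent variables -> I = 0; perfectly correlated -> I = H(X) = 1 bit
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```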
Kullback-Leibler (KL) Divergence
- Measures how different two discrete probability distributions P and Q are
- Intuition:
- KL Divergence is 0 if P = Q
- Not symmetric so D_KL(P||Q) != D_KL(Q||P)
- Expected number of extra bits needed to encode samples from P using a code optimized for Q
$$ D_{KL}(P||Q) = \sum_x{P(x) \log_2{\frac{P(x)}{Q(x)}}} $$
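A minimal NumPy sketch of the sum above (`kl_divergence` is an illustrative helper name; it assumes Q(x) > 0 wherever P(x) > 0):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p, q = [0.5, 0.5], [0.75, 0.25]
print(kl_divergence(p, p))   # identical distributions -> 0.0
print(kl_divergence(p, q), kl_divergence(q, p))   # not symmetric
```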
Multivariate Gaussian
- A multidimensional bell curve centered at the mean μ, where the covariance matrix Σ controls the shape/spread
$$ p(x) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(- \frac{1}{2} (x - \mu)^T \Sigma^{-1}(x - \mu)\right) $$
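The density can be evaluated term by term with NumPy; a minimal sketch (`mvn_pdf` is an illustrative name, and a numerically robust implementation would use a Cholesky factorization rather than `det`/`inv`):

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of a multivariate Gaussian N(mu, sigma) at point x."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    k = mu.size
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma))
    quad = diff @ np.linalg.inv(sigma) @ diff   # (x - mu)^T Sigma^{-1} (x - mu)
    return float(np.exp(-0.5 * quad) / norm)

# 2-D standard normal: the density at the mean is 1 / (2*pi)
print(mvn_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2)))
```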
Expectation (E)
- Average value of f(X) if X is drawn according to P
- Weighted average of f(X) over the probability of X
$$ \mathbb{E}_{P(X)}[f(X)] = \int P(x)f(x) dx $$
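When the integral is intractable, the expectation is often estimated by Monte Carlo: draw samples from P and average f over them. A minimal sketch, assuming P = N(0, 1) and f(x) = x² (so the true expectation is Var(X) = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E_{P(X)}[f(X)]: average f over samples drawn from P.
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 2)
print(estimate)   # close to the true value 1.0
```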
Markov Chain
A Markov chain is a sequence of random variables in which each variable depends only on the one immediately before it. This can be viewed from the perspective of a neural network, in which each layer's outputs depend only on the previous layer's outputs.
$$ X \rightarrow Z \rightarrow Y $$
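A minimal sketch of sampling such a chain, assuming an illustrative two-state transition matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state Markov chain: the next state depends only on the current one.
P = np.array([[0.9, 0.1],     # transition probabilities from state 0
              [0.2, 0.8]])    # transition probabilities from state 1

state = 0
trajectory = [state]
for _ in range(10):
    state = rng.choice(2, p=P[state])   # depends only on `state`
    trajectory.append(int(state))
print(trajectory)
```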
Data Processing Inequality (DPI)
The data processing inequality states that processing cannot create information: in a Markov chain, a downstream variable can never carry more information about an upstream variable than an intermediate one does.
$$ \text{For Markov chain } X \rightarrow Z \rightarrow Y \text{ : } I(X;Z) \geq I(X;Y) $$
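The inequality can be checked numerically by building a small chain X → Z → Y from a prior and two noisy channels (all probabilities below are illustrative, and `mi` is an illustrative helper):

```python
import numpy as np

def mi(pxy):
    """I in bits from a joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

px = np.array([0.5, 0.5])                   # prior p(x)
pz_x = np.array([[0.9, 0.1], [0.1, 0.9]])   # channel p(z|x), rows indexed by x
py_z = np.array([[0.8, 0.2], [0.2, 0.8]])   # channel p(y|z), rows indexed by z

pxz = px[:, None] * pz_x                    # joint p(x, z)
pxy = pxz @ py_z                            # p(x, y) = sum_z p(x, z) p(y|z)

print(mi(pxz), mi(pxy))                     # DPI: I(X;Z) >= I(X;Y)
```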
Reparameterization Invariance [1]
For any two invertible functions ϕ, ψ, mutual information is preserved:
$$ I(X;Y) = I(\phi(X);\psi(Y)). $$
For deep neural networks, this means that invertibly transforming (e.g., shuffling) a given layer's representation does not change the mutual information between that layer and the others. This matters when considering computational complexity, as done in the V-Information paper: the mutual information between the two random variables stays the same, but the computational cost of extracting it can grow sharply depending on how hard the functions ϕ, ψ are to invert. The paper introduces a new type of information that takes such computational constraints into consideration.
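For discrete variables, an invertible ϕ or ψ amounts to a relabeling of values, so the invariance can be checked by permuting the rows and columns of a joint table (the table and `mi` helper below are illustrative):

```python
import numpy as np

def mi(pxy):
    """I in bits from a joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

pxy = np.array([[0.30, 0.10, 0.05],
                [0.05, 0.25, 0.25]])

# Invertible relabelings phi, psi: permute the values X and Y can take.
# The joint table's rows/columns are shuffled, but I(X;Y) is unchanged.
perm_x, perm_y = [1, 0], [2, 0, 1]
pxy_relabeled = pxy[perm_x][:, perm_y]

print(mi(pxy), mi(pxy_relabeled))   # equal
```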