Information Theoretic Machine Learning
The following is a minimal introduction to the mathematical (statistical) ideas that appear most often in information-theoretic machine learning and deep learning papers.
Shannon Entropy
- Measures the uncertainty (average information content) of a random variable X
- Measured in bits when the logarithm is base 2
- Intuition:
  - If X is totally predictable -> H(X) = 0 bits
  - If X is uniform over n values -> H(X) = log₂ n bits
$$ H(X) = -\sum_x{p(x) \log_2{p(x)}} $$
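A minimal NumPy sketch of this formula, assuming the distribution is given as a probability vector that sums to 1 (the function name `entropy` and the example values are just for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p (1-D array summing to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0]))                    # totally predictable -> 0 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))      # uniform over 4 values -> log2(4) = 2.0 bits
```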
Mutual Information
- Measures how much information is shared between X and Y
- Intuition:
  - I(X;Y) = 0 -> X and Y are independent
  - Larger I(X;Y) -> knowing X tells you more about Y
$$ I(X;Y) = \sum_{x,y}{p(x,y) \log_2{\frac{p(x,y)}{p(x)p(y)}}} $$
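A minimal sketch computing I(X;Y) in bits directly from the definition, assuming the joint distribution is given as a 2-D array whose entries sum to 1 (the function name and test values are illustrative):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint distribution p_xy (2-D array summing to 1)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)     # marginal p(x): sum over y
    p_y = p_xy.sum(axis=0, keepdims=True)     # marginal p(y): sum over x
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask]))

# Independent variables -> I(X;Y) = 0
print(mutual_information(np.outer([0.5, 0.5], [0.5, 0.5])))   # 0.0
# Perfectly correlated fair bits -> I(X;Y) = 1 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))           # 1.0
```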
Kullback-Leibler (KL) Divergence
- Measures how different two probability distributions P and Q are
- Intuition:
  - D_KL(P||Q) = 0 if and only if P = Q
  - Not symmetric, so in general D_KL(P||Q) != D_KL(Q||P)
  - Expected number of extra bits needed to encode samples from P using a code optimized for Q
$$ D_{KL}(P||Q) = \sum_x{P(x) \log_2{\frac{P(x)}{Q(x)}}} $$
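A minimal sketch of the discrete form, assuming both distributions are probability vectors over the same support with Q(x) > 0 wherever P(x) > 0 (the example values are arbitrary):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions p, q (1-D arrays summing to 1)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                              # terms with P(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))   # ~0.737 bits
print(kl_divergence(q, p))   # ~0.531 bits -> not symmetric
print(kl_divergence(p, p))   # 0.0 when P = Q
```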
Multivariate Gaussian
- A k-dimensional bell curve centered at the mean μ, where the covariance matrix Σ controls the shape/spread
$$ p(x) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(- \frac{1}{2} (x - \mu)^T \Sigma^{-1}(x - \mu)\right) $$
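A minimal sketch that evaluates this density directly with NumPy (the helper name `gaussian_pdf` and the 2-D example parameters are made up for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a k-dimensional Gaussian N(mu, sigma) evaluated at x."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    k = mu.shape[0]
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma))
    quad = diff @ np.linalg.solve(sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm

mu = np.zeros(2)
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(gaussian_pdf([0.0, 0.0], mu, sigma))       # peak density, at the mean
```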
Expectation
- The average value of f(X) when X is drawn according to P
- A weighted average of f(X), with weights given by the probability (density) of each x
$$ \mathbb{E}_{P(X)}[f(X)] = \int P(x)f(x) dx $$
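A minimal sketch of the two forms this usually takes in practice: an exact weighted sum for a discrete P, and a Monte Carlo estimate when all you can do is sample from P (the distribution and f below are made up for illustration):

```python
import numpy as np

# Exact expectation for a discrete distribution: weighted average of f(x)
x_vals = np.array([0.0, 1.0, 2.0])
p_vals = np.array([0.2, 0.5, 0.3])        # P(X = x)
f = lambda x: x ** 2
exact = np.sum(p_vals * f(x_vals))        # 0.2*0 + 0.5*1 + 0.3*4 = 1.7

# Monte Carlo estimate: sample X ~ P, then average f(X)
rng = np.random.default_rng(0)
samples = rng.choice(x_vals, size=100_000, p=p_vals)
estimate = f(samples).mean()              # approaches 1.7 as the sample size grows

print(exact, estimate)
```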
Kronecker Delta
- A simple way to pick out one element or enforce equality in sums
- Basically like a switch
$$ \delta(i, j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} $$
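A minimal sketch of the "switch" behavior (the helper `delta` and the example array are illustrative; over indices 0..n-1 the delta is just the identity matrix):

```python
import numpy as np

def delta(i, j):
    """Kronecker delta: 1 if i == j, else 0."""
    return 1 if i == j else 0

# Collected into a matrix, delta(i, j) is the identity
n = 4
assert np.array_equal(np.eye(n, dtype=int),
                      np.array([[delta(i, j) for j in range(n)] for i in range(n)]))

# Used as a switch inside a sum, it picks out a single element
a = np.array([10, 20, 30, 40])
print(sum(delta(k, 2) * a[k] for k in range(n)))   # 30 == a[2]
```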