Information Theoretic Machine Learning

What follows is a minimal introduction to the mathematical (statistical) ideas that show up frequently in information-theoretic machine learning and deep learning papers.

Shannon Entropy

$$ H(X) = -\sum_x{p(x) \log_2{p(x)}} $$
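
A minimal numerical sketch, assuming NumPy and a discrete distribution given as a probability vector (the function name is illustrative):

```python
import numpy as np

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution given as a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.5, 0.5]))   # 1.0  -- a fair coin carries one bit
print(shannon_entropy([0.9, 0.1]))   # ~0.47 -- a biased coin carries less
```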

Mutual Information

$$ I(X;Y) = \sum_{x,y}{p(x,y) \log_2{\frac{p(x,y)}{p(x)p(y)}}} $$
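
A sketch of the same quantity computed from a joint probability table, again assuming NumPy (function and variable names are illustrative):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint table p_xy[i, j] = p(x_i, y_j)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                          # skip zero-probability terms
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

# Two perfectly correlated bits share exactly 1 bit of information.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(mutual_information(joint))   # 1.0
```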

Kullback-Leibler (KL) Divergence

$$ D_{KL}(P||Q) = \sum_x{P(x) \log_2{\frac{P(x)}{Q(x)}}} $$
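
A sketch under the same assumptions (NumPy, discrete distributions over the same support):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions with matching support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # ~0.74
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ~0.53 -- KL is not symmetric
```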

Multivariate Gaussian

$$ p(x) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(- \frac{1}{2} (x - \mu)^T \Sigma^{-1}(x - \mu)\right) $$
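
A sketch that evaluates this density directly with NumPy (the function below is illustrative; scipy.stats.multivariate_normal provides the same thing):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a k-dimensional Gaussian N(mu, Sigma) evaluated at x."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    k = mu.shape[0]
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Standard 2D Gaussian at the origin: 1 / (2*pi) ~ 0.159
print(gaussian_pdf([0.0, 0.0], mu=[0.0, 0.0], Sigma=np.eye(2)))
```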

Expectation

$$ \mathbb{E}_{P(X)}[f(X)] = \int P(x)f(x) dx $$
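
In papers this integral is usually approximated by Monte Carlo: draw samples from P and average f over them. A minimal sketch with NumPy, using E[X^2] = 1 under a standard normal as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_{P(X)}[f(X)] by averaging f over samples drawn from P.
# Here P is a standard normal and f(x) = x^2, so the true value is 1.
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 2)
print(estimate)   # ~1.0
```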

Kronecker Delta

$$ \delta(i,j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} $$
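
In code this is just an equality check; collected over a finite index set it forms the identity matrix. A tiny sketch:

```python
import numpy as np

def kronecker_delta(i, j):
    """1 if i == j, else 0."""
    return 1 if i == j else 0

# Over indices 0..n-1 the Kronecker delta is the n x n identity matrix.
print(np.eye(3))
```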
