In information theory, the information projection or I-projection of a probability distribution ''q'' onto a set of distributions ''P'' is
:<math>p^* = \arg\min_{p \in P} D_{\mathrm{KL}}(p \parallel q).</math>
where <math>D_{\mathrm{KL}}(p \parallel q)</math> is the Kullback–Leibler divergence from ''q'' to ''p''. Viewing the Kullback–Leibler divergence as a measure of distance, the I-projection <math>p^*</math> is the "closest" distribution to ''q'' of all the distributions in ''P''.
The I-projection is useful in setting up information geometry, notably because of the following inequality, valid when ''P'' is convex:
:<math>D_{\mathrm{KL}}(p \parallel q) \geq D_{\mathrm{KL}}(p \parallel p^*) + D_{\mathrm{KL}}(p^* \parallel q) \quad \text{for all } p \in P.</math>
This inequality can be interpreted as an information-geometric version of Pythagoras' triangle-inequality theorem, where KL divergence is viewed as squared distance in a Euclidean space.
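For comparison, the Euclidean counterpart of this statement is the standard property of projection onto a closed convex set <math>C \subset \mathbb{R}^n</math>: if <math>p^*</math> is the point of <math>C</math> closest to <math>q</math>, then
:<math>\|p - q\|^2 \geq \|p - p^*\|^2 + \|p^* - q\|^2 \quad \text{for all } p \in C.</math>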
It is worthwhile to note that since <math>D_{\mathrm{KL}}(p \parallel q) \geq 0</math> and <math>D_{\mathrm{KL}}(p \parallel q)</math> is continuous in ''p'', if ''P'' is closed and non-empty, then there exists at least one minimizer to the optimization problem framed above. Furthermore, if ''P'' is convex, then the optimal distribution is unique.
The reverse I-projection, also known as moment projection or M-projection, is
:<math>p^* = \arg\min_{p \in P} D_{\mathrm{KL}}(q \parallel p).</math>
Since the KL divergence is not symmetric in its arguments, the I-projection and the M-projection exhibit different behavior. For the I-projection, <math>p^*(x)</math> will typically under-estimate the support of <math>q(x)</math> and will lock onto one of its modes; this is because <math>p^*(x) = 0</math> wherever <math>q(x) = 0</math>, which keeps <math>D_{\mathrm{KL}}(p^* \parallel q)</math> finite. For the M-projection, <math>p^*(x)</math> will typically over-estimate the support of <math>q(x)</math>; this is because <math>p^*(x) > 0</math> wherever <math>q(x) > 0</math>, which keeps <math>D_{\mathrm{KL}}(q \parallel p^*)</math> finite.
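The contrast can be made concrete with a small numerical sketch. In the example below, the target is an arbitrarily chosen bimodal mixture and the approximating family consists of univariate Gaussians; on this family the M-projection reduces to matching the target's mean and variance (mass-covering), while the I-projection, found by direct numerical minimization, settles on a single mode (mode-seeking).

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

xs = np.linspace(-10.0, 10.0, 4001)                            # integration grid
dx = xs[1] - xs[0]
target = 0.5 * norm.pdf(xs, -3, 1) + 0.5 * norm.pdf(xs, 3, 1)  # bimodal q

def kl(a, b):
    """Discretized D_KL(a || b) for densities tabulated on the grid."""
    mask = a > 1e-300
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx)

# M-projection onto {N(mu, sigma^2)}: minimizing D_KL(q || p) over the Gaussian
# family amounts to matching the target's mean and variance.
mu_m = float(np.sum(xs * target) * dx)
var_m = float(np.sum((xs - mu_m) ** 2 * target) * dx)
print(f"M-projection: mu = {mu_m:.2f}, sigma = {np.sqrt(var_m):.2f}")

# I-projection: minimize the reverse divergence D_KL(p || q) numerically.
def reverse_kl(params):
    mu, log_sigma = params
    return kl(norm.pdf(xs, mu, np.exp(log_sigma)), target)

res = minimize(reverse_kl, x0=[1.0, 0.0], method="Nelder-Mead")
print(f"I-projection: mu = {res.x[0]:.2f}, sigma = {np.exp(res.x[1]):.2f}")
</syntaxhighlight>

Starting the second minimization near the other mode yields the mirror-image solution; the Gaussian family is not a convex set of distributions, so the uniqueness guarantee above does not apply.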
The reverse I-projection plays a fundamental role in the construction of optimal
e-variables.
The concept of information projection can be extended to arbitrary
''f''-divergences and other
divergences.
See also
* Sanov's theorem
References
* K. Murphy, ''Machine Learning: A Probabilistic Perspective'', The MIT Press, 2012.