In information geometry, a divergence is a kind of statistical distance: a binary function which establishes the separation from one probability distribution to another on a statistical manifold.
The simplest divergence is squared Euclidean distance (SED), and divergences can be viewed as generalizations of SED. The other most important divergence is relative entropy (also called Kullback–Leibler divergence), which is central to information theory. There are numerous other specific divergences and classes of divergences, notably ''f''-divergences and Bregman divergences (see § Examples).
Definition
Given a differentiable manifold M of dimension n, a divergence on M is a C^2-function D \colon M \times M \to [0, \infty) satisfying:
# D(p, q) \geq 0 for all p, q \in M (non-negativity),
# D(p, q) = 0 if and only if p = q (positivity),
# At every point p \in M, D(p, p + dp) is a positive-definite quadratic form for infinitesimal displacements dp from p.
In applications to statistics, the manifold M is typically the space of parameters of a parametric family of probability distributions.
Condition 3 means that D defines an inner product on the tangent space T_p M for every p \in M. Since D is C^2 on M, this defines a Riemannian metric g on M.
Locally at p \in M, we may construct a local coordinate chart with coordinates x, then the divergence is
:D(x_p, x_p + dx) = \tfrac{1}{2} \, dx^{\mathsf{T}} g_p(x_p) \, dx + O\bigl(|dx|^3\bigr),
where g_p(x_p) is a matrix of size n \times n. It is the Riemannian metric at the point p expressed in coordinates x.
Dimensional analysis of condition 3 shows that divergence has the dimension of squared distance.
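This quadratic behaviour can be checked numerically. The following sketch is an illustrative example rather than part of the standard exposition: it assumes NumPy, picks the Kullback–Leibler divergence on the Bernoulli family (an arbitrary choice), and verifies that the divergence between a point and a nearby point shrinks like the square of the displacement:
<syntaxhighlight lang="python">
import numpy as np

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p = 0.3
for eps in [1e-1, 1e-2, 1e-3]:
    d = kl_bernoulli(p, p + eps)
    # D(p, p + eps) ~ (1/2) g(p) eps^2, so d / eps**2 tends to a constant
    # (here 1 / (2 p (1 - p)) ~ 2.38), confirming the quadratic behaviour.
    print(eps, d, d / eps**2)
</syntaxhighlight>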
The dual divergence D^* is defined as
:D^*(p, q) = D(q, p).
When we wish to contrast D^* against D, we refer to D as the primal divergence.
Given any divergence D, its symmetrized version is obtained by averaging it with its dual divergence:
:D_S(p, q) = \tfrac{1}{2}\bigl(D(p, q) + D(q, p)\bigr).
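For example, the dual of the Kullback–Leibler divergence is the "reverse" KL divergence, and its symmetrized version is (up to the factor of one half) the Jeffreys divergence. A minimal sketch, assuming NumPy and discrete distributions given as probability vectors; the helper names are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def kl(p, q):
    """Primal divergence: D(p || q) for discrete distributions p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

def kl_dual(p, q):
    """Dual divergence D*(p, q) = D(q, p), the 'reverse' KL divergence."""
    return kl(q, p)

def kl_symmetrized(p, q):
    """Symmetrized divergence: the average of the primal and dual divergences."""
    return 0.5 * (kl(p, q) + kl(q, p))

p, q = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
print(kl(p, q), kl_dual(p, q), kl_symmetrized(p, q))
</syntaxhighlight>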
Difference from other similar concepts
Unlike metrics, divergences are not required to be symmetric, and the asymmetry is important in applications. Accordingly, one often refers asymmetrically to the divergence "of ''q'' from ''p''" or "from ''p'' to ''q''", rather than "between ''p'' and ''q''". Secondly, divergences generalize ''squared'' distance, not linear distance, and thus do not satisfy the triangle inequality, but some divergences (such as the Bregman divergence) do satisfy generalizations of the Pythagorean theorem.
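Even the simplest divergence, the squared Euclidean distance, violates the triangle inequality; a minimal sketch in plain Python (the three collinear points are an arbitrary illustrative choice):
<syntaxhighlight lang="python">
def sed(x, y):
    """Squared Euclidean distance between two scalars."""
    return (x - y) ** 2

a, b, c = 0.0, 1.0, 2.0
print(sed(a, c))               # 4.0
print(sed(a, b) + sed(b, c))   # 2.0, which is smaller: the triangle inequality fails
</syntaxhighlight>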
In general statistics and probability, "divergence" broadly refers to any kind of function D(p, q), where p, q are probability distributions or other objects under consideration, such that conditions 1 and 2 are satisfied. Condition 3 is required for "divergence" as used in information geometry.
As an example, the total variation distance, a commonly used statistical divergence, does not satisfy condition 3.
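This can be seen numerically: the total variation distance between a point and a nearby point shrinks only linearly in the displacement, not quadratically as condition 3 requires. A sketch assuming NumPy; the distribution and displacement direction are arbitrary choices:
<syntaxhighlight lang="python">
import numpy as np

def total_variation(p, q):
    """Total variation distance between discrete distributions p and q."""
    return 0.5 * np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float)))

p = np.array([0.2, 0.3, 0.5])
v = np.array([1.0, -1.0, 0.0])          # a displacement direction within the simplex
for eps in [1e-1, 1e-2, 1e-3]:
    # The value scales like eps (first order), not eps**2 as condition 3 would require.
    print(eps, total_variation(p, p + eps * v))
</syntaxhighlight>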
Notation
Notation for divergences varies significantly between fields, though there are some conventions.
Divergences are generally notated with an uppercase 'D', as in D(x, y), to distinguish them from metric distances, which are notated with a lowercase 'd'. When multiple divergences are in use, they are commonly distinguished with subscripts, as in D_\text{KL} for the Kullback–Leibler divergence (KL divergence).
Often a different separator between parameters is used, particularly to emphasize the asymmetry. In information theory, a double bar is commonly used: D(p \parallel q); this is similar to, but distinct from, the notation for conditional probability, P(A \mid B), and emphasizes interpreting the divergence as a relative measurement, as in relative entropy; this notation is common for the KL divergence. A colon may be used instead, as D(p : q); this emphasizes the relative information supporting the two distributions.
The notation for parameters varies as well. Uppercase ''P'', ''Q'' interprets the parameters as probability distributions, while lowercase ''p'', ''q'' or ''x'', ''y'' interprets them geometrically as points in a space, and ''μ'', ''ν'' or ''m'', ''n'' interprets them as measures.
Geometrical properties
Many properties of divergences can be derived if we restrict ''S'' to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system ''θ'', so that for a distribution ''p'' ∈ ''S'' we can write ''p'' = ''p''(''θ'').
For a pair of points ''p'', ''q'' ∈ ''S'' with coordinates ''θ''_''p'' and ''θ''_''q'', denote the partial derivatives of ''D''(''p'', ''q'') as
:D((\partial_i)_p, q) \ \stackrel{\mathrm{def}}{=}\ \tfrac{\partial}{\partial\theta^i_p} D(p, q),
:D((\partial_i\partial_j)_p, (\partial_k)_q) \ \stackrel{\mathrm{def}}{=}\ \tfrac{\partial}{\partial\theta^i_p}\,\tfrac{\partial}{\partial\theta^j_p}\,\tfrac{\partial}{\partial\theta^k_q} D(p, q), \quad \text{etc.}
Now we restrict these functions to the diagonal ''p'' = ''q'', and denote
:D[\partial_i, \cdot] \colon p \mapsto D((\partial_i)_p, p),
:D[\partial_i, \partial_j] \colon p \mapsto D((\partial_i)_p, (\partial_j)_p), \quad \text{etc.}
By definition, the function ''D''(''p'', ''q'') is minimized at ''p'' = ''q'', and therefore
:D[\partial_i, \cdot] = D[\cdot, \partial_i] = 0,
:D[\partial_i\partial_j, \cdot] = D[\cdot, \partial_i\partial_j] = -D[\partial_i, \partial_j] \equiv g^{(D)}_{ij},
where the matrix ''g''^(''D'') is positive semi-definite and defines a unique Riemannian metric on the manifold ''S''.
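This construction can be checked numerically. The following sketch is illustrative only: it assumes NumPy, uses finite differences, and picks the Kullback–Leibler divergence on the one-parameter Bernoulli family (an arbitrary choice). It recovers ''g''^(''D'') from the mixed second derivative on the diagonal and compares it with the Fisher information 1/(''θ''(1 − ''θ'')):
<syntaxhighlight lang="python">
import numpy as np

def kl_bernoulli(a, b):
    """Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def metric_from_divergence(D, theta, h=1e-4):
    """Estimate g = -d^2 D / (d theta_p d theta_q) on the diagonal by finite differences."""
    return -(D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h * h)

theta = 0.3
print(metric_from_divergence(kl_bernoulli, theta))   # finite-difference estimate of g(theta)
print(1 / (theta * (1 - theta)))                     # Fisher information, ~4.7619
</syntaxhighlight>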
Divergence ''D''(·, ·) also defines a unique torsion-free affine connection ∇^(''D'') with coefficients
:\Gamma^{(D)}_{ij,k} = -D[\partial_i\partial_j, \partial_k],
and the dual to this connection ∇* is generated by the dual divergence ''D''*.
Thus, a divergence ''D''(·, ·) generates on a statistical manifold a unique dualistic structure (''g''^(''D''), ∇^(''D''), ∇^(''D''*)). The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced from some globally defined divergence function (which however need not be unique).
For example, when ''D'' is an ''f''-divergence for some function ''f''(·), then it generates the metric ''g''^(''D''_''f'') = ''c''·''g'' and the connection ∇^(''D''_''f'') = ∇^(''α''), where ''g'' is the canonical Fisher information metric, ∇^(''α'') is the α-connection, ''c'' = ''f''′′(1), and ''α'' = 3 + 2''f''′′′(1)/''f''′′(1).
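As a small worked check (an editorial addition; note that sign and scaling conventions for the α-connection vary between references), take the generator ''f''(''t'') = ''t'' log ''t'' of the Kullback–Leibler divergence. Finite differences give ''f''′′(1) ≈ 1 and ''f''′′′(1) ≈ −1, hence ''c'' ≈ 1 and ''α'' ≈ 1 under the convention stated above; a sketch assuming NumPy:
<syntaxhighlight lang="python">
import numpy as np

def f(t):
    """Generator of the Kullback-Leibler divergence as an f-divergence."""
    return t * np.log(t)

h = 1e-3
f2 = (f(1 + h) - 2 * f(1) + f(1 - h)) / h**2                             # f''(1), ~1
f3 = (f(1 + 2*h) - 2*f(1 + h) + 2*f(1 - h) - f(1 - 2*h)) / (2 * h**3)    # f'''(1), ~-1
print("c =", f2)                    # ~1, so the induced metric is the Fisher metric itself
print("alpha =", 3 + 2 * f3 / f2)   # ~1, under the convention stated above
</syntaxhighlight>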
Examples
The two most important divergences are the relative entropy (Kullback–Leibler divergence, KL divergence), which is central to information theory and statistics, and the squared Euclidean distance (SED). Minimizing these two divergences is the main way that linear inverse problems are solved, via the principle of maximum entropy and least squares, notably in logistic regression and linear regression.
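To illustrate this connection concretely (an editorial sketch assuming NumPy; the data and search grids are arbitrary): minimizing the squared Euclidean divergence to a set of observations recovers the sample mean (least squares), while minimizing the cross-entropy of a Bernoulli model against 0/1 labels — the logistic-regression objective, which equals the average KL divergence up to a constant — recovers their empirical frequency.
<syntaxhighlight lang="python">
import numpy as np

# Least squares: the value m minimizing the squared Euclidean divergence
# sum_i (y_i - m)^2 to the observations is the sample mean.
y = np.array([1.0, 2.0, 4.0, 7.0])
ms = np.linspace(0.0, 10.0, 1001)
sed = [np.sum((y - m) ** 2) for m in ms]
print(ms[np.argmin(sed)], y.mean())          # both ~3.5

# Cross-entropy / KL: the Bernoulli parameter minimizing the cross-entropy
# against the 0/1 labels is their empirical frequency.
labels = np.array([1, 0, 1, 1, 0, 1])
qs = np.linspace(0.01, 0.99, 981)
ce = [-np.mean(labels * np.log(q) + (1 - labels) * np.log(1 - q)) for q in qs]
print(qs[np.argmin(ce)], labels.mean())      # both ~0.667
</syntaxhighlight>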
The two most important classes of divergences are the ''f''-divergences and Bregman divergences; however, other types of divergence functions are also encountered in the literature. The only divergence for probabilities over a finite alphabet that is both an ''f''-divergence and a Bregman divergence is the Kullback–Leibler divergence.
The squared Euclidean divergence is a Bregman divergence (corresponding to the function ''x''^2) but not an ''f''-divergence.
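A minimal sketch of a Bregman divergence built from a strictly convex generator (an editorial example assuming NumPy; the generator and gradient functions are supplied by hand here): with F(''x'') = Σ ''x''_''i''^2 it reproduces the squared Euclidean divergence, and with the negative entropy F(''x'') = Σ ''x''_''i'' log ''x''_''i'' it reproduces the (generalized) KL divergence.
<syntaxhighlight lang="python">
import numpy as np

def bregman(F, grad_F, p, q):
    """Bregman divergence D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# Generator F(x) = sum_i x_i^2 yields the squared Euclidean divergence.
sq      = lambda x: np.sum(x ** 2)
sq_grad = lambda x: 2 * x

# Generator F(x) = sum_i x_i log x_i (negative entropy) yields the
# (generalized) Kullback-Leibler divergence.
negent      = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(bregman(sq, sq_grad, p, q), np.sum((p - q) ** 2))                # equal
print(bregman(negent, negent_grad, p, q), np.sum(p * np.log(p / q)))   # equal when both sum to 1
</syntaxhighlight>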
f-divergences
Given a convex function