Divergence (statistics)

In information geometry, a divergence is a kind of statistical distance: a binary function which establishes the separation from one probability distribution to another on a statistical manifold. The simplest divergence is squared Euclidean distance (SED), and divergences can be viewed as generalizations of SED. The other most important divergence is relative entropy (Kullback–Leibler divergence, KL divergence), which is central to information theory. There are numerous other specific divergences and classes of divergences, notably ''f''-divergences and Bregman divergences (see Examples below).


Definition

Given a differentiable manifold M of dimension n, a divergence on M is a C^2-function D: M\times M\to [0, \infty) satisfying:
1. D(p, q) \geq 0 for all p, q \in M (non-negativity),
2. D(p, q) = 0 if and only if p = q (positivity),
3. At every point p\in M, D(p, p+dp) is a positive-definite quadratic form for infinitesimal displacements dp from p.
In applications to statistics, the manifold M is typically the space of parameters of a parametric family of probability distributions. Condition 3 means that D defines an inner product on the tangent space T_pM for every p\in M. Since D is C^2 on M, this defines a Riemannian metric g on M. Locally at p\in M, we may construct a local coordinate chart with coordinates x, and then the divergence is
: D(x(p), x(p) + dx) = \tfrac{1}{2}\, dx^T g_p(x)\, dx + O(|dx|^3),
where g_p(x) is a matrix of size n\times n. It is the Riemannian metric at the point p expressed in coordinates x. Dimensional analysis of condition 3 shows that divergence has the dimension of squared distance.

The dual divergence D^* is defined as
: D^*(p, q) = D(q, p).
When we wish to contrast D against D^*, we refer to D as the primal divergence. Given any divergence D, its symmetrized version is obtained by averaging it with its dual divergence:
: D_S(p, q) = \tfrac{1}{2}\big(D(p,q) + D(q, p)\big).


Difference from other similar concepts

Unlike metrics, divergences are not required to be symmetric, and the asymmetry is important in applications. Accordingly, one often refers asymmetrically to the divergence "of ''q'' from ''p''" or "from ''p'' to ''q''", rather than "between ''p'' and ''q''". Secondly, divergences generalize ''squared'' distance, not linear distance, and thus do not satisfy the triangle inequality, but some divergences (such as the Bregman divergence) do satisfy generalizations of the Pythagorean theorem. In general statistics and probability, "divergence" generally refers to any kind of function D(p, q), where p, q are probability distributions or other objects under consideration, such that conditions 1, 2 are satisfied. Condition 3 is required for "divergence" as used in information geometry. As an example, the total variation distance, a commonly used statistical divergence, does not satisfy condition 3.
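As an illustration of how condition 3 can fail, consider Bernoulli distributions with parameters p and q. Their total variation distance is
: \delta(p, q) = \tfrac{1}{2}\big(|p - q| + |(1-p) - (1-q)|\big) = |p - q|,
which is first order (linear) in the displacement |p - q| rather than quadratic, so near p = q it is not a positive-definite quadratic form and condition 3 fails.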


Notation

Notation for divergences varies significantly between fields, though there are some conventions. Divergences are generally notated with an uppercase 'D', as in D(x, y), to distinguish them from metric distances, which are notated with a lowercase 'd'. When multiple divergences are in use, they are commonly distinguished with subscripts, as in D_\text{KL} for the Kullback–Leibler divergence (KL divergence).

Often a different separator between parameters is used, particularly to emphasize the asymmetry. In information theory, a double bar is commonly used: D(p \parallel q); this is similar to, but distinct from, the notation for conditional probability, P(A | B), and emphasizes interpreting the divergence as a relative measurement, as in relative entropy; this notation is common for the KL divergence. A colon may be used instead, as D(p : q); this emphasizes the relative information supporting the two distributions.

The notation for parameters varies as well. Uppercase P, Q interprets the parameters as probability distributions, while lowercase p, q or x, y interprets them geometrically as points in a space, and \mu_1, \mu_2 or m_1, m_2 interprets them as measures.


Geometrical properties

Many properties of divergences can be derived if we restrict ''S'' to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system ''θ'', so that for a distribution p \in S we can write p = p(\theta). For a pair of points p, q \in S with coordinates \theta_p and \theta_q, denote the partial derivatives of ''D''(''p'', ''q'') as
: D((\partial_i)_p, q) \ \stackrel{\mathrm{def}}{=}\ \tfrac{\partial}{\partial\theta^i_p} D(p, q),
: D((\partial_i\partial_j)_p, (\partial_k)_q) \ \stackrel{\mathrm{def}}{=}\ \tfrac{\partial}{\partial\theta^i_p}\,\tfrac{\partial}{\partial\theta^j_p}\,\tfrac{\partial}{\partial\theta^k_q} D(p, q), \quad \text{etc.}
Now we restrict these functions to the diagonal p = q, and denote
: D[\partial_i, \cdot] :\ p \mapsto D((\partial_i)_p, p),
: D[\partial_i, \partial_j] :\ p \mapsto D((\partial_i)_p, (\partial_j)_p), \quad \text{etc.}
By definition, the function ''D''(''p'', ''q'') is minimized at p = q, and therefore
: D[\partial_i, \cdot] = D[\cdot, \partial_i] = 0,
: D[\partial_i\partial_j, \cdot] = D[\cdot, \partial_i\partial_j] = -D[\partial_i, \partial_j] \ \equiv\ g^{(D)}_{ij},
where the matrix g^{(D)} is positive semi-definite and defines a unique Riemannian metric on the manifold ''S''. The divergence ''D''(·, ·) also defines a unique torsion-free affine connection ∇^{(D)} with coefficients
: \Gamma^{(D)}_{ij,k} = -D[\partial_i\partial_j, \partial_k],
and the dual connection ∇* is generated by the dual divergence ''D''*. Thus, a divergence ''D''(·, ·) generates on a statistical manifold a unique dualistic structure (g^{(D)}, ∇^{(D)}, ∇^{(D*)}). The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced from some globally defined divergence function (which however need not be unique). For example, when ''D'' is an ''f''-divergence for some function ''f''(·), then it generates the metric g^{(D_f)} = c\,g and the connection \nabla^{(D_f)} = \nabla^{(\alpha)}, where ''g'' is the canonical Fisher information metric, \nabla^{(\alpha)} is the α-connection, c = f''(1), and \alpha = 3 + 2 f'''(1)/f''(1).
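As a concrete check of the relation g^{(D)}_{ij} = -D[\partial_i, \partial_j], the following sketch (an illustration added here, not part of the original text; function names are ad hoc) numerically differentiates the KL divergence of the Bernoulli family and compares the result with the Fisher information 1/(\theta(1-\theta)).

import numpy as np

def kl_bernoulli(tp, tq):
    # KL divergence D(Bernoulli(tp) || Bernoulli(tq))
    return tp * np.log(tp / tq) + (1 - tp) * np.log((1 - tp) / (1 - tq))

def induced_metric(D, theta, h=1e-4):
    # central finite difference for -d^2 D / (d theta_p d theta_q) on the diagonal
    return -(D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h * h)

theta = 0.3
print(induced_metric(kl_bernoulli, theta))  # approx 4.7619
print(1 / (theta * (1 - theta)))            # Fisher information: 4.7619...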


Examples

The two most important divergences are the relative entropy (Kullback–Leibler divergence, KL divergence), which is central to information theory and statistics, and the squared Euclidean distance (SED). Minimizing these two divergences is the main way that linear inverse problems are solved, via the principle of maximum entropy and least squares, notably in logistic regression and linear regression.

The two most important classes of divergences are the ''f''-divergences and Bregman divergences; however, other types of divergence functions are also encountered in the literature. The only divergence that is both an ''f''-divergence and a Bregman divergence is the Kullback–Leibler divergence; the squared Euclidean divergence is a Bregman divergence (corresponding to the generator F(x) = \|x\|^2), but not an ''f''-divergence.


f-divergences

Given a convex function f: [0, \infty)\to (-\infty, \infty] such that f(0) = \lim_{t\to 0^+} f(t) and f(1) = 0, the ''f''-divergence generated by f is defined as
: D_f(p, q) = \int p(x)\, f\!\bigg(\frac{q(x)}{p(x)}\bigg)\, dx.
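A minimal sketch of this definition for discrete distributions (an illustration added here; the helper names are ad hoc and the convention follows the integral above). With f(t) = -\log t the ''f''-divergence reduces to the KL divergence, and with f(t) = \tfrac{1}{2}|t-1| it reduces to the total variation distance.

import numpy as np

def f_divergence(p, q, f):
    # D_f(p, q) = sum_x p(x) * f(q(x)/p(x)), discrete analogue of the integral above
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * f(q / p)))

f_kl = lambda t: -np.log(t)           # generator yielding the KL divergence D(p || q)
f_tv = lambda t: 0.5 * np.abs(t - 1)  # generator yielding the total variation distance

p, q = [0.2, 0.8], [0.5, 0.5]
print(f_divergence(p, q, f_kl))   # = sum p*log(p/q), approx 0.1927
print(f_divergence(p, q, f_tv))   # = 0.5 * sum |p - q| = 0.3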


Bregman divergences

Bregman divergences correspond to convex functions on convex sets. Given a strictly convex, continuously differentiable function F on a convex set, known as the ''Bregman generator'', the Bregman divergence measures the convexity of F: the error of the linear approximation of F from q as an approximation of the value at p:
: D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q\rangle.
The dual divergence to a Bregman divergence is the divergence generated by the convex conjugate of the Bregman generator of the original divergence. For example, for the squared Euclidean distance, the generator is F(x) = \|x\|^2, while for the relative entropy the generator is the negative entropy.
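A minimal sketch of the Bregman construction (an illustration added here; helper names are ad hoc). The generator F(x) = \|x\|^2 yields the squared Euclidean distance, and the negative-entropy generator F(p) = \sum_x p(x)\log p(x) yields the KL divergence on probability vectors.

import numpy as np

def bregman(p, q, F, grad_F):
    # D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(F(p) - F(q) - np.dot(grad_F(q), p - q))

sq_norm = lambda x: float(np.dot(x, x))                # F(x) = ||x||^2
grad_sq_norm = lambda x: 2 * x
neg_entropy = lambda x: float(np.sum(x * np.log(x)))   # F(p) = sum p*log(p)
grad_neg_entropy = lambda x: np.log(x) + 1

p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(bregman(p, q, sq_norm, grad_sq_norm))          # squared Euclidean distance = 0.18
print(bregman(p, q, neg_entropy, grad_neg_entropy))  # KL(p || q), approx 0.1927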


History

The use of the term "divergence" – both what functions it refers to, and what various statistical distances are called – has varied significantly over time, but by c. 2000 had settled on the current usage within information geometry, notably in the textbook of Amari and Nagaoka (2000). The term "divergence" for a statistical distance was used informally in various contexts from c. 1910 to c. 1940. Its formal use dates at least to Bhattacharyya (1943), entitled "On a measure of divergence between two statistical populations defined by their probability distributions", which defined the Bhattacharyya distance, and Bhattacharyya (1946), entitled "On a Measure of Divergence between Two Multinomial Populations", which defined the Bhattacharyya angle. The term was popularized by its use for the Kullback–Leibler divergence in Kullback and Leibler (1951) and by its use in Kullback's textbook (1959). The term "divergence" was used generally by Ali and Silvey (1966) for statistical distances. Numerous references to earlier uses of statistical distances are given in these works.

Kullback and Leibler (1951) actually used "divergence" to refer to the ''symmetrized'' divergence (this function had already been defined and used by Harold Jeffreys in 1948), referring to the asymmetric function as "the mean information for discrimination ... per observation", while Kullback (1959) referred to the asymmetric function as the "directed divergence". Ali and Silvey (1966) referred generally to such a function as a "coefficient of divergence", and showed that many existing functions could be expressed as ''f''-divergences, referring to Jeffreys' function as "Jeffreys' measure of divergence" (today "Jeffreys divergence"), and to Kullback–Leibler's asymmetric function (in each direction) as "Kullback's and Leibler's measures of discriminatory information" (today "Kullback–Leibler divergence").

The information geometry definition of divergence (the subject of this article) was initially referred to by alternative terms, including "quasi-distance" and "contrast function", though "divergence" was used for the α-divergence and has become standard for the general class. The term "divergence" is in contrast to a distance (metric), since the symmetrized divergence does not satisfy the triangle inequality. For example, the term "Bregman distance" is still found, but "Bregman divergence" is now preferred. Notationally, Kullback and Leibler denoted their asymmetric function as I(1:2), while Ali and Silvey denote their functions with a lowercase 'd', as d\left(P_1, P_2\right).


See also

* Statistical distance

