In probability theory, an f-divergence is a certain type of function D_f(P\parallel Q) that measures the difference between two probability distributions P and Q. Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence.


History

These divergences were introduced by Alfréd Rényi in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. ''f''-divergences were studied further independently by Csiszár, by Morimoto, and by Ali and Silvey, and are sometimes known as Csiszár f-divergences, Csiszár–Morimoto divergences, or Ali–Silvey distances.


Definition


Non-singular case

Let P and Q be two probability distributions over a space \Omega, such that P\ll Q, that is, P is absolutely continuous with respect to Q (meaning Q>0 wherever P>0). Then, for a convex function f: [0, +\infty)\to(-\infty, +\infty] such that f(x) is finite for all x > 0, f(1)=0, and f(0)=\lim_{t\to 0^+} f(t) (which could be infinite), the f-divergence of P from Q is defined as

: D_f(P\parallel Q) \equiv \int_{\Omega} f\left(\frac{dP}{dQ}\right)\,dQ.

We call f the generator of D_f.

In concrete applications, there is usually a reference distribution \mu on \Omega (for example, when \Omega = \R^n, the reference distribution is the Lebesgue measure), such that P, Q \ll \mu. Then we can use the Radon–Nikodym theorem to take their probability densities p and q, giving

: D_f(P\parallel Q) = \int_{\Omega} f\left(\frac{p(x)}{q(x)}\right)q(x)\,d\mu(x).

When there is no such reference distribution ready at hand, we can simply define \mu = P+Q, and proceed as above. This is a useful technique in more abstract proofs.
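For distributions on a finite set, the defining integral reduces to a sum, which the following minimal sketch makes concrete (an illustration, not part of the article; it assumes numpy, and the helper name f_divergence is ours). The generator shown is the KL one, f(t) = t\ln t:

```python
import numpy as np

def f_divergence(p, q, f):
    """Compute D_f(P || Q) = sum over x of q(x) * f(p(x)/q(x))
    for discrete distributions with P << Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = q > 0                      # P << Q: p vanishes wherever q does
    return np.sum(q[mask] * f(p[mask] / q[mask]))

# KL generator f(t) = t ln t, with the convention 0 * ln 0 = 0
kl_gen = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(f_divergence(p, q, kl_gen))    # = sum_x p(x) ln(p(x)/q(x)) ≈ 0.2319
```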


Extension to singular measures

The above definition can be extended to cases where P\ll Q is no longer satisfied (Definition 7.1 of ). Since f is convex and f(1) = 0, the function \frac{f(x)}{x-1} must be nondecreasing, so there exists f'(\infty) := \lim_{x\to+\infty} f(x)/x, taking value in (-\infty, +\infty]. Since for any p(x)>0 we have \lim_{q(x)\to 0} q(x) f\left(\frac{p(x)}{q(x)}\right) = p(x)\, f'(\infty), we can extend the f-divergence to the case P\not\ll Q by setting

: D_f(P\parallel Q) = \int_{q>0} q(x)\, f\left(\frac{p(x)}{q(x)}\right) d\mu(x) + f'(\infty)\, P[q=0].
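For example (a hedged sketch, not part of the source; it assumes numpy, and the helper name f_divergence_ext is illustrative), the extension matters exactly when P puts mass where Q has none. For total variation, f(t) = \frac{1}{2}|t-1| gives f'(\infty) = \frac{1}{2}, so the singular part contributes \frac{1}{2} P[q=0]; for KL-divergence, f'(\infty) = +\infty, and the divergence becomes infinite:

```python
import numpy as np

def f_divergence_ext(p, q, f, f_inf):
    """D_f(P || Q) extended to the case P not << Q:
    sum of q*f(p/q) over q > 0, plus f'(infinity) * P[q = 0]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = q > 0
    singular_mass = p[~mask].sum()            # P[q = 0]
    main = np.sum(q[mask] * f(p[mask] / q[mask]))
    return main + (f_inf * singular_mass if singular_mass > 0 else 0.0)

tv_gen = lambda t: 0.5 * np.abs(t - 1)        # f'(inf) = 1/2
kl_gen = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

p = [0.6, 0.0, 0.4]                           # P puts mass 0.4 where Q has none
q = [0.5, 0.5, 0.0]
print(f_divergence_ext(p, q, tv_gen, 0.5))    # finite: the singular part adds 0.5 * 0.4
print(f_divergence_ext(p, q, kl_gen, np.inf)) # infinite, since f'(inf) = inf for KL
```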


Properties


Basic relations between f-divergences

* Linearity: D_{\sum_i a_i f_i} = \sum_i a_i D_{f_i} given a finite sequence of nonnegative real numbers a_i and generators f_i.
* D_f = D_g if and only if f(x) = g(x) + c(x-1) for some c\in \R.
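For instance, since E_Q\left[c\left(\frac{dP}{dQ}-1\right)\right] = c(1-1) = 0, adding c(x-1) to a generator leaves the divergence unchanged. A quick numerical check (a sketch assuming numpy; the helper name is illustrative):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x)/q(x)) for discrete p, q with P << Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(q * f(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

f = lambda t: t * np.log(t)                    # KL generator
g = lambda t: t * np.log(t) + 3.7 * (t - 1)    # same generator up to c(t - 1)

print(np.isclose(f_divergence(p, q, f), f_divergence(p, q, g)))  # True
```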


Basic properties of f-divergences

''f''-divergences are nonnegative (D_f(P\parallel Q) \ge 0, with equality when P = Q) and jointly convex, and they satisfy the data-processing inequality (monotonicity): D_f(PK\parallel QK) \le D_f(P\parallel Q) for any Markov kernel K. In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution P^* then D_f(P(t)\parallel P^*) is a monotonic (non-increasing) function of time, where the probability distribution P(t) is a solution of the Kolmogorov forward equations (or Master equation), used to describe the time evolution of the probability distribution in the Markov process. This means that all ''f''-divergences D_f(P(t)\parallel P^*) are Lyapunov functions of the Kolmogorov forward equations. The converse statement is also true: if H(P) is a Lyapunov function for all Markov chains with positive equilibrium P^* and is of the trace form (H(P)=\sum_i f(P_i, P_i^*)) then H(P)= D_f(P(t)\parallel P^*), for some convex function ''f''. For example, Bregman divergences in general do not have such a property and can increase in Markov processes.
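This decay can be observed numerically. The sketch below (an illustration, not from the source; it assumes numpy, and the chain and starting point are arbitrary) iterates a discrete-time Markov chain with a positive stationary distribution and prints the KL-divergence to equilibrium, which is non-increasing by the data-processing inequality:

```python
import numpy as np

# Row-stochastic transition matrix with a positive stationary distribution
K = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Stationary distribution: left eigenvector of K for eigenvalue 1
w, v = np.linalg.eig(K.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.98, 0.01, 0.01])      # initial distribution
for t in range(6):
    print(t, kl(p, pi))               # prints a non-increasing sequence
    p = p @ K                         # one step of the chain
```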


Analytic properties

The f-divergences can be expressed using Taylor series and rewritten using a weighted sum of chi-type distances.


Basic variational representation

Let f^* be the convex conjugate of f. Let \operatorname{dom}(f^*) be the effective domain of f^*, that is, \operatorname{dom}(f^*) = \{y : f^*(y) < \infty\}. Then we have two variational representations of D_f, which we describe below. Under the above setup,

: D_f(P\parallel Q) = \sup_{g} \; E_P[g(X)] - E_Q[f^*(g(X))],

where the supremum is taken over all measurable g: \Omega \to \operatorname{dom}(f^*). This is Theorem 7.24 in .


Example applications

Using this theorem on total variation distance, with generator f(x)= \frac{1}{2}|x-1|, its convex conjugate is

: f^*(x^*) = \begin{cases} x^* & \text{on } [-1/2, 1/2], \\ +\infty & \text{else,} \end{cases}

and we obtain

: TV(P\parallel Q) = \sup_{\|g\|_\infty \le 1/2} E_P[g(X)] - E_Q[g(X)].

For chi-squared divergence, defined by f(x) = (x-1)^2, f^*(y) = y^2/4 + y, we obtain

: \chi^2(P; Q) = \sup_g E_P[g(X)] - E_Q[g(X)^2/4 + g(X)].

Since the variational term is not affine-invariant in g, even though the domain over which g varies ''is'' affine-invariant, we can use up the affine-invariance to obtain a leaner expression. Replacing g by ag + b and taking the maximum over a, b \in \R, we obtain

: \chi^2(P; Q) = \sup_g \frac{(E_P[g(X)] - E_Q[g(X)])^2}{\operatorname{Var}_Q[g(X)]},

which is just a few steps away from the Hammersley–Chapman–Robbins bound and the Cramér–Rao bound (Theorem 29.1 and its corollary in ).

For \alpha-divergence with \alpha \in (-\infty, 0)\cup(0, 1), we have f_\alpha(x) = \frac{x^\alpha - \alpha x - (1-\alpha)}{\alpha(\alpha-1)}, with range x\in [0, +\infty). Applying this theorem, and substituting h = ((\alpha-1)g + 1)^{1/(\alpha-1)}, yields

: D_\alpha(P\parallel Q) = \frac{1}{\alpha(1-\alpha)} - \inf_{h: \Omega\to(0,+\infty)}\left( E_Q\left[\frac{h(X)^\alpha}{\alpha}\right] + E_P\left[\frac{h(X)^{\alpha-1}}{1-\alpha}\right] \right),

or, releasing the constraint on h,

: D_\alpha(P\parallel Q) = \frac{1}{\alpha(1-\alpha)} - \inf_{h: \Omega\to\R}\left( E_Q\left[\frac{|h(X)|^\alpha}{\alpha}\right] + E_P\left[\frac{|h(X)|^{\alpha-1}}{1-\alpha}\right] \right).

Setting \alpha=-1 yields the variational representation of \chi^2-divergence obtained above.

The domain over which h varies is not affine-invariant in general, unlike the \chi^2-divergence case. The \chi^2-divergence is special, since in that case, we can remove the |\cdot| from |h|. For general \alpha \in (-\infty, 0)\cup(0, 1), the domain over which h varies is merely scale-invariant. Similar to above, we can replace h by ah, and take the minimum over a>0 to obtain

: D_\alpha(P\parallel Q) = \sup_{h > 0} \left[\frac{1}{\alpha(1-\alpha)} \left( 1- E_Q[h(X)^\alpha]^{1-\alpha}\, E_P[h(X)^{\alpha-1}]^{\alpha} \right) \right].

Setting \alpha=\frac 1 2, and performing another substitution by g=\sqrt h, yields two variational representations of the squared Hellinger distance:

: H^2(P\parallel Q) = \frac 1 2 D_{1/2}(P\parallel Q) = 2 - \inf_{g: \Omega\to(0,+\infty)}\left( E_Q\left[g(X)\right] + E_P\left[g(X)^{-1}\right] \right),

: H^2(P\parallel Q) = 2 \sup_{g > 0} \left(1-\sqrt{E_Q[g(X)]\, E_P[g(X)^{-1}]}\right).

Applying this theorem to the KL-divergence, defined by f(x) = x\ln x, f^*(y) = e^{y-1}, yields

: D_{KL}(P; Q) = \sup_g E_P[g(X)] - e^{-1}E_Q[e^{g(X)}].

This is strictly less efficient than the Donsker–Varadhan representation

: D_{KL}(P; Q) = \sup_g E_P[g(X)] - \ln E_Q[e^{g(X)}].

This defect is fixed by the next theorem.
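The gap between these two representations of the KL-divergence can be checked numerically (a sketch assuming numpy; it evaluates both bounds at the witness g = \ln(p/q) + c for discrete P and Q): the Donsker–Varadhan bound equals the KL-divergence for every additive shift c, while the basic bound is tight only at c = 1:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
kl_exact = np.sum(p * np.log(p / q))

def naive_bound(g):            # basic variational bound: E_P[g] - e^{-1} E_Q[e^g]
    return np.sum(p * g) - np.exp(-1) * np.sum(q * np.exp(g))

def dv_bound(g):               # Donsker-Varadhan bound: E_P[g] - ln E_Q[e^g]
    return np.sum(p * g) - np.log(np.sum(q * np.exp(g)))

for c in [0.0, 1.0, 2.0]:
    g = np.log(p / q) + c      # optimal witness, up to an additive shift c
    print(c, naive_bound(g), dv_bound(g))
# dv_bound always equals kl_exact; naive_bound matches it only at c = 1
print(kl_exact)
```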


Improved variational representation

Assume the setup in the beginning of this section (see "Basic variational representation" above). This is Theorem 7.25 in .


Example applications

Applying this theorem to KL-divergence yields the Donsker–Varadhan representation. Attempting to apply this theorem to the general \alpha-divergence with \alpha \in (-\infty, 0)\cup(0, 1) does not yield a closed-form solution.


Common examples of ''f''-divergences

The following lists many of the common divergences between probability distributions and possible generating functions to which they correspond (here t = dP/dQ):

* Total variation distance: f(t) = \frac{1}{2}|t-1|
* \chi^2-divergence (Pearson): f(t) = (t-1)^2
* Squared Hellinger distance: f(t) = (\sqrt{t} - 1)^2
* KL-divergence: f(t) = t\ln t
* Reverse KL-divergence: f(t) = -\ln t
* \alpha-divergence: f_\alpha(t) = \frac{t^\alpha - \alpha t - (1-\alpha)}{\alpha(\alpha-1)}
* Jensen–Shannon divergence: f(t) = \frac{1}{2}\left[t\ln t - (t+1)\ln\frac{t+1}{2}\right]

Notably, except for total variation distance, all others are special cases of \alpha-divergence, or linear sums of \alpha-divergences.

For each f-divergence D_f, its generating function is not uniquely defined, but only up to c\cdot(t-1), where c is any real constant. That is, for any f that generates an f-divergence, we have D_{f(t)} = D_{f(t)+c\cdot(t-1)}. This freedom is not only convenient, but actually necessary.

Let f_\alpha be the generator of \alpha-divergence, then f_\alpha and f_{1-\alpha} are convex inversions of each other, so D_\alpha(P\parallel Q) = D_{1-\alpha}(Q\parallel P). In particular, this shows that the squared Hellinger distance and Jensen–Shannon divergence are symmetric.

In the literature, the \alpha-divergences are sometimes parametrized as

: f_\alpha(t) = \begin{cases} \frac{4}{1-\alpha^2}\big(1 - t^{(1+\alpha)/2}\big), & \text{if}\ \alpha\neq\pm1, \\ t \ln t, & \text{if}\ \alpha=1, \\ - \ln t, & \text{if}\ \alpha=-1, \end{cases}

which is equivalent to the parametrization in this page by substituting \alpha \leftarrow \frac{\alpha+1}{2}.
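The symmetry relations above can be checked numerically; the sketch below (assuming numpy; the helper names are illustrative) verifies D_\alpha(P\parallel Q) = D_{1-\alpha}(Q\parallel P) and, as the \alpha = 1/2 special case, the symmetry of the squared Hellinger distance:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x)/q(x)) for discrete p, q > 0."""
    return np.sum(q * f(p / q))

def alpha_gen(a):
    """Generator of the alpha-divergence as parametrized on this page."""
    return lambda t: (t**a - a * t - (1 - a)) / (a * (a - 1))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# D_alpha(P || Q) = D_{1-alpha}(Q || P), e.g. for alpha = 0.3:
a = 0.3
print(np.isclose(f_divergence(p, q, alpha_gen(a)),
                 f_divergence(q, p, alpha_gen(1 - a))))   # True

# alpha = 1/2 is its own partner, so the squared Hellinger distance is symmetric:
print(np.isclose(f_divergence(p, q, alpha_gen(0.5)),
                 f_divergence(q, p, alpha_gen(0.5))))     # True
```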


Relations to other statistical divergences

Here, we compare ''f''-divergences with other statistical divergences.


Rényi divergence

The Rényi divergences are a family of divergences defined by

: R_{\alpha}(P \parallel Q) = \frac{1}{\alpha-1}\log\Bigg( E_Q\left[\left(\frac{dP}{dQ}\right)^\alpha\right]\Bigg)

when \alpha \in (0, 1)\cup (1, +\infty). It is extended to the cases of \alpha = 0, 1, +\infty by taking the limit. Simple algebra shows that R_\alpha(P\parallel Q) = \frac{1}{\alpha-1}\ln \left(1+\alpha(\alpha-1)D_\alpha(P\parallel Q)\right), where D_\alpha is the \alpha-divergence defined above.
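This identity is easy to verify numerically (a brief illustrative sketch, assuming numpy):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
a = 0.7

# Renyi divergence of order alpha
renyi = np.log(np.sum(q * (p / q)**a)) / (a - 1)

# alpha-divergence with the generator used on this page
d_alpha = np.sum(q * ((p / q)**a - a * (p / q) - (1 - a)) / (a * (a - 1)))

print(np.isclose(renyi, np.log(1 + a * (a - 1) * d_alpha) / (a - 1)))  # True
```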


Bregman divergence

The only f-divergence that is also a Bregman divergence is the KL divergence.


Integral probability metrics

The only f-divergence that is also an integral probability metric is the total variation.


Financial interpretation

A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players the expected profit rate has the same general form as the ''ƒ''-divergence.


See also

* Kullback–Leibler divergence
* Bregman divergence

