In probability theory, an $f$-divergence is a function $D_f(P\parallel Q)$ that measures the difference between two probability distributions $P$ and $Q$. Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of $f$-divergence.

History

These divergences were introduced by Alfréd Rényi in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. $f$-divergences were studied further independently by Csiszár, by Morimoto, and by Ali and Silvey, and are sometimes known as Csiszár $f$-divergences, Csiszár-Morimoto divergences, or Ali-Silvey distances.

Definition

Non-singular case

Let $P$ and $Q$ be two probability distributions over a space $\Omega$, such that $P\ll Q$, that is, $P$ is absolutely continuous with respect to $Q$. Then, for a convex function $f: [0, \infty)\to(-\infty, \infty]$ such that $f(x)$ is finite for all $x > 0$, $f(1)=0$, and $f(0)=\lim_{t\to 0^+} f(t)$ (which could be infinite), the $f$-divergence of $P$ from $Q$ is defined as

$$D_f(P\parallel Q) \equiv \int_{\Omega} f\left(\frac{dP}{dQ}\right)\,dQ.$$

$f$ is called the generator of $D_f$.

In concrete applications, there is usually a reference distribution $\mu$ on $\Omega$ (for example, when $\Omega = \R^n$, the reference distribution is the Lebesgue measure), such that $P, Q \ll \mu$. We can then use the Radon-Nikodym theorem to take their probability densities $p$ and $q$, giving

$$D_f(P\parallel Q) = \int_{\Omega} f\left(\frac{p(x)}{q(x)}\right)q(x)\,d\mu(x).$$

When there is no such reference distribution ready at hand, we can simply define $\mu = P+Q$, and proceed as above. This is a useful technique in more abstract proofs.
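For distributions on a finite set, with the counting measure as reference, the density formula above reduces to a sum over outcomes. The following is a minimal sketch under that assumption; the function and generator names (`f_divergence`, `kl_generator`, and so on) are illustrative and not taken from any library.

```python
import math

def f_divergence(p, q, f):
    """Return sum_x q(x) * f(p(x) / q(x)) for discrete P << Q."""
    total = 0.0
    for px, qx in zip(p, q):
        if qx == 0.0:
            if px > 0.0:
                raise ValueError("P is not absolutely continuous with respect to Q")
            continue  # p(x) = q(x) = 0: the point carries no mass
        total += qx * f(px / qx)
    return total

# Example generators; each is convex with f(1) = 0.
def kl_generator(t):
    return t * math.log(t) if t > 0.0 else 0.0  # f(0) = lim t ln t = 0

def tv_generator(t):
    return 0.5 * abs(t - 1.0)

def chi2_generator(t):
    return (t - 1.0) ** 2

# Illustrative distributions on three outcomes (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

kl_pq = f_divergence(p, q, kl_generator)
tv_pq = f_divergence(p, q, tv_generator)
chi2_pq = f_divergence(p, q, chi2_generator)
print(kl_pq, tv_pq, chi2_pq)
```

Swapping the generator is the only change needed to move between divergences, which is the practical content of the definition.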

Extension to singular measures

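The above definition can be extended to cases where $P\ll Q$ is no longer satisfied. Since $f$ is convex and $f(1) = 0$, the function $\frac{f(x)}{x-1}$ must be nondecreasing, so the limit $f'(\infty) := \lim_{x\to\infty}f(x)/x$ exists, taking values in $(-\infty, +\infty]$.

Since for any $p(x)>0$ we have $\lim_{q(x)\to 0} q(x)f \left(\frac{p(x)}{q(x)}\right) = p(x)f'(\infty)$, we can extend the $f$-divergence to the case $P\not\ll Q$ by

$$D_f(P\parallel Q) = \int_{q>0} f\left(\frac{p(x)}{q(x)}\right)q(x)\,d\mu(x) + f'(\infty)\,P[q=0],$$

with the convention that the last term vanishes whenever $P$ puts no mass on the set where $q$ is zero: that is, if $P[q=0]=0$, then $f'(\infty)\,P[q=0]=0$, even if $f'(\infty) =\infty$.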
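For finite distributions, the extension can be sketched as follows: the regular part sums over outcomes with $q(x)>0$, and the mass that $P$ places where $q(x)=0$ is weighted by $f'(\infty)$. This is an illustrative sketch only; the name `f_divergence_ext` and the choice to pass $f'(\infty)$ as an explicit argument are assumptions made for the example.

```python
import math

def f_divergence_ext(p, q, f, f_prime_inf):
    """f-divergence for discrete P, Q without assuming P << Q."""
    # Regular part: integrate q * f(p / q) over the set {q > 0}.
    regular = sum(qx * f(px / qx) for px, qx in zip(p, q) if qx > 0.0)
    # Mass of P on the set {q = 0}.
    mass_off_support = sum(px for px, qx in zip(p, q) if qx == 0.0)
    if mass_off_support == 0.0:
        # Convention: the singular term is 0 even if f'(inf) is infinite.
        return regular
    return regular + f_prime_inf * mass_off_support

def tv_generator(t):
    return 0.5 * abs(t - 1.0)  # f'(inf) = 1/2

def kl_generator(t):
    return t * math.log(t) if t > 0.0 else 0.0  # f'(inf) = +inf

# P puts mass on the second outcome, where Q has none (assumed values).
p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]

tv_ext = f_divergence_ext(p, q, tv_generator, 0.5)
kl_ext = f_divergence_ext(p, q, kl_generator, math.inf)
print(tv_ext, kl_ext)
```

The explicit guard matters in floating point, because `math.inf * 0.0` evaluates to NaN rather than 0.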

Properties

Basic properties

* Linearity: $D_{\sum_i a_i f_i} = \sum_i a_i D_{f_i}$ given a finite sequence of nonnegative real numbers $a_i$ and generators $f_i$.
* $D_f = D_g$ iff $f(x) = g(x) + c(x-1)$ for some $c\in \R$.

In particular, the monotonicity of $f$-divergences under Markov transitions implies that if a Markov process has a positive equilibrium probability distribution $P^*$, then $D_f(P(t)\parallel P^*)$ is a monotonic (non-increasing) function of time, where the probability distribution $P(t)$ is a solution of the Kolmogorov forward equations (or Master equation), used to describe the time evolution of the probability distribution in the Markov process. This means that all $f$-divergences $D_f(P(t)\parallel P^*)$ are Lyapunov functions of the Kolmogorov forward equations.

The converse statement is also true: if $H(P)$ is a Lyapunov function for all Markov chains with positive equilibrium $P^*$ and is of the trace-form ($H(P)=\sum_{i}f(P_{i},P_{i}^{*})$), then $H(P)= D_f(P\parallel P^*)$ for some convex function $f$. Bregman divergences, for example, do not have this property in general and can increase in Markov processes.
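The monotonicity statement can be illustrated numerically. The sketch below uses a discrete-time chain with a small, made-up transition matrix `T` (the text speaks of the continuous-time forward equations, so this is an analogous setting, not the same one), approximates the equilibrium $P^*$ by power iteration, and records the KL divergence $D_{KL}(P(t)\parallel P^*)$ along the trajectory. All names and numbers are assumptions for illustration.

```python
import math

def kl_divergence(p, q):
    """KL divergence for discrete distributions with q > 0 everywhere."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

def step(p, T):
    """One step of the chain: row vector p times transition matrix T."""
    n = len(p)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

# Hypothetical irreducible, aperiodic transition matrix (rows sum to 1).
T = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.1, 0.3, 0.6]]

# Approximate the equilibrium distribution by power iteration.
p_star = [1.0 / 3.0] * 3
for _ in range(2000):
    p_star = step(p_star, T)

# Track the divergence from equilibrium along a trajectory.
p_t = [0.0, 0.0, 1.0]
divergences = []
for _ in range(30):
    divergences.append(kl_divergence(p_t, p_star))
    p_t = step(p_t, T)

print(divergences[:5])
```

Replacing `kl_divergence` with any other $f$-divergence should give a non-increasing sequence as well, up to floating-point error.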

Analytic properties

The f-divergences can be expressed using Taylor series and rewritten as a weighted sum of chi-type distances.

Variational representations

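Let $f^*$ be the convex conjugate of $f$. Let $\mathrm{effdom}(f^*)$ be the effective domain of $f^*$, that is, $\mathrm{effdom}(f^*) = \{y : f^*(y) < \infty\}$. Then we have two variational representations of $D_f$. The first is

$$D_f(P\parallel Q) = \sup_{g:\Omega\to\mathrm{effdom}(f^*)} E_P[g(X)] - E_Q[f^*(g(X))].$$

This is cited as Theorem 7.14 in the reference, although that theorem number has been disputed.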

Example applications

Using this theorem on total variation distance, with generator $f(x)= \frac 1 2 |x-1|$, its convex conjugate is

$$f^*(x^*) = \begin{cases} x^* & \text{on } [-1/2, 1/2], \\ +\infty & \text{else,} \end{cases}$$

and we obtain

$$TV(P\parallel Q) = \sup_{|g|\leq 1/2} E_P[g(X)] - E_Q[g(X)].$$

For chi-squared divergence, defined by $f(x) = (x-1)^2$, $f^*(y) = y^2/4 + y$, we obtain

$$\chi^2(P; Q) = \sup_g E_P[g(X)] - E_Q[g(X)^2/4 + g(X)].$$

Since the variation term is not affine-invariant in $g$, even though the domain over which $g$ varies is affine-invariant, we can use up the affine-invariance to obtain a leaner expression. Replacing $g$ by $a g + b$ and taking the maximum over $a, b \in \R$, we obtain

$$\chi^2(P; Q) = \sup_g \frac{(E_P[g(X)] - E_Q[g(X)])^2}{\operatorname{Var}_Q[g(X)]},$$

which is just a few steps away from the Hammersley–Chapman–Robbins bound and the Cramér–Rao bound.

For $\alpha$-divergence with $\alpha \in (-\infty, 0)\cup(0, 1)$, we have $f_\alpha(x) = \frac{x^\alpha - \alpha x - (1-\alpha)}{\alpha(\alpha-1)}$, with range $x\in [0, \infty)$. Its convex conjugate is $f_\alpha^*(y)=\frac{1}{\alpha}(x(y)^\alpha - 1)$ with range $y\in(-\infty, (1-\alpha)^{-1})$, where $x(y) = ((\alpha-1)y + 1)^{\frac{1}{\alpha-1}}$.

Applying this theorem yields, after substitution with $h = ((\alpha-1)g+1)^{\frac{1}{\alpha-1}}$,

$$D_\alpha(P\parallel Q) = \frac{1}{\alpha(1-\alpha)} - \inf_{h:\Omega\to(0,\infty)}\left( E_Q\left[\frac{h^\alpha}{\alpha}\right] + E_P\left[\frac{h^{\alpha-1}}{1-\alpha}\right] \right),$$

or, releasing the constraint on $h$,

$$D_\alpha(P\parallel Q) = \frac{1}{\alpha(1-\alpha)} - \inf_{h:\Omega\to\R}\left( E_Q\left[\frac{|h|^\alpha}{\alpha}\right] + E_P\left[\frac{|h|^{\alpha-1}}{1-\alpha}\right] \right).$$
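The $\alpha$-divergence representation can be checked numerically on a finite space. In the sketch below, the objective is evaluated at $h = p/q$, where it should match $D_\alpha(P\parallel Q)$, and at randomly drawn positive $h$, where it should not exceed it. The function names and the example distributions are assumptions made for illustration.

```python
import random

def alpha_divergence(p, q, alpha):
    """D_alpha from the generator f_alpha, for strictly positive p and q."""
    s = sum(px ** alpha * qx ** (1.0 - alpha) for px, qx in zip(p, q))
    return (s - 1.0) / (alpha * (alpha - 1.0))

def variational_objective(p, q, h, alpha):
    """1/(alpha(1-alpha)) - (E_Q[h^alpha / alpha] + E_P[h^(alpha-1) / (1-alpha)])."""
    e_q = sum(qx * hx ** alpha for qx, hx in zip(q, h)) / alpha
    e_p = sum(px * hx ** (alpha - 1.0) for px, hx in zip(p, h)) / (1.0 - alpha)
    return 1.0 / (alpha * (1.0 - alpha)) - (e_q + e_p)

# Illustrative strictly positive distributions (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
ratio = [px / qx for px, qx in zip(p, q)]  # candidate optimizer h = p / q

# Random positive test functions h, drawn with a fixed seed.
rng = random.Random(0)
random_h = [[rng.uniform(0.1, 5.0) for _ in p] for _ in range(200)]

for alpha in (0.5, -1.0):
    exact = alpha_divergence(p, q, alpha)
    at_ratio = variational_objective(p, q, ratio, alpha)
    best_random = max(variational_objective(p, q, h, alpha) for h in random_h)
    print(alpha, exact, at_ratio, best_random)
```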

Setting $\alpha=-1$ yields the variational representation of $\chi^2$-divergence obtained above.

The domain over which $h$ varies is not affine-invariant in general, unlike the $\chi^2$-divergence case. The $\chi^2$-divergence is special, since in that case, we can remove the $|\cdot|$ from $|h|$.

For general $\alpha \in (-\infty, 0)\cup(0, 1)$, the domain over which $h$ varies is merely scale-invariant. Similar to above, we can replace $h$ by $a h$, and take the minimum over $a>0$ to obtain

$$D_\alpha(P\parallel Q) = \sup_{h>0} \left[\frac{1}{\alpha(1-\alpha)} \left(1-\frac{E_P[h^{\alpha-1}]^{\alpha}}{E_Q[h^{\alpha}]^{\alpha-1}}\right) \right].$$

Setting $\alpha=\frac 1 2$, and performing another substitution by $g=\sqrt h$, yields two variational representations of the squared Hellinger distance:

$$H^2(P\parallel Q) = \frac 1 2 D_{1/2}(P\parallel Q) = 2 - \inf_{g>0}\left(E_Q\left[g(X)\right] + E_P\left[g(X)^{-1}\right] \right),$$

$$H^2(P\parallel Q) = 2 \sup_{g>0} \left(1-\sqrt{E_P[g^{-1}]\,E_Q[g]}\right).$$

Applying this theorem to the KL-divergence, defined by $f(x) = x\ln x$, $f^*(y) = e^{y-1}$, yields

$$D_{KL}(P; Q) =\sup_g E_P[g(X)] - e^{-1}E_Q[e^{g(X)}].$$

This is strictly less efficient than the Donsker-Varadhan representation

$$D_{KL}(P; Q) = \sup_g E_P[g(X)] - \ln E_Q[e^{g(X)}].$$

This defect is fixed by the next theorem.
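The gap between the two KL representations can be seen on a finite space. The sketch below evaluates both lower bounds for the same test functions $g$: the bound from the convex conjugate is never larger than the Donsker-Varadhan bound (because $\ln x \le x/e$), and both are at most the KL divergence. Function names and example values are assumptions for illustration.

```python
import math
import random

def expect(dist, values):
    """Expectation of a function (given as a list of values) under dist."""
    return sum(w * v for w, v in zip(dist, values))

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

def conjugate_bound(p, q, g):
    """E_P[g] - e^{-1} E_Q[e^g], from f*(y) = e^{y-1}."""
    return expect(p, g) - math.exp(-1.0) * expect(q, [math.exp(v) for v in g])

def donsker_varadhan_bound(p, q, g):
    """E_P[g] - ln E_Q[e^g]."""
    return expect(p, g) - math.log(expect(q, [math.exp(v) for v in g]))

# Illustrative strictly positive distributions (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
kl = kl_divergence(p, q)

# Optimizers: g = ln(p/q) for Donsker-Varadhan, g = 1 + ln(p/q) for the conjugate bound.
log_ratio = [math.log(px / qx) for px, qx in zip(p, q)]
best_dv = donsker_varadhan_bound(p, q, log_ratio)
best_conjugate = conjugate_bound(p, q, [v + 1.0 for v in log_ratio])

# Compare the two bounds on random test functions, drawn with a fixed seed.
rng = random.Random(0)
trials = [[rng.uniform(-2.0, 2.0) for _ in p] for _ in range(200)]
ordering_holds = all(
    conjugate_bound(p, q, g) <= donsker_varadhan_bound(p, q, g) + 1e-12
    and donsker_varadhan_bound(p, q, g) <= kl + 1e-12
    for g in trials
)
print(kl, best_dv, best_conjugate, ordering_holds)
```

The Donsker-Varadhan bound is unchanged when a constant is added to $g$, while the conjugate bound is not, which is one way to read the "defect" mentioned above.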


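The second representation is

$$D_f(P\parallel Q) = \sup_{g:\Omega\to\R} \left( E_P[g(X)] - \Psi^*_{Q,f}(g) \right), \qquad \Psi^*_{Q,f}(g) := \inf_{a\in\R} \left( E_Q[f^*(g(X)-a)] + a \right).$$

This is Theorem 7.15 in the cited reference.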

Example applications

Applying this theorem to KL-divergence yields the Donsker-Varadhan representation. Attempting to apply this theorem to general $\alpha$-divergence with $\alpha \in (-\infty, 0)\cup(0, 1)$ does not yield a closed-form solution.

Common examples of f-divergences

The following table lists many of the common divergences between probability distributions and the possible generating functions to which they correspond. Notably, except for total variation distance, all others are special cases of $\alpha$-divergence, or linear sums of $\alpha$-divergences.

For each f-divergence $D_f$, its generating function is not uniquely defined, but only up to $c\cdot(t-1)$, where $c$ is any real constant. That is, for any $f$ that generates an f-divergence, we have $D_{f(t)} = D_{f(t) + c\cdot(t-1)}$. This freedom is not only convenient, but actually necessary.

Let $f_\alpha$ be the generator of $\alpha$-divergence. Then $f_\alpha$ and $f_{1-\alpha}$ are convex inversions of each other (the convex inversion of $f$ being $t \mapsto t f(1/t)$), so $D_\alpha(P\parallel Q) = D_{1-\alpha}(Q\parallel P)$. In particular, this shows that the squared Hellinger distance and Jensen-Shannon divergence are symmetric.

In the literature, the $\alpha$-divergences are sometimes parametrized as

$$f(t) = \begin{cases}
    \frac{4}{1-\alpha^2}\big(1 - t^{(1+\alpha)/2}\big), & \text{if}\ \alpha\neq\pm1, \\
    t \ln t, & \text{if}\ \alpha=1, \\
    - \ln t, & \text{if}\ \alpha=-1
  \end{cases}$$

which is equivalent to the parametrization in this page by substituting $\alpha \leftarrow \frac{\alpha+1}{2}$.
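The reversal identity $D_\alpha(P\parallel Q) = D_{1-\alpha}(Q\parallel P)$ can be checked directly on a finite space. This sketch uses the generator $f_\alpha(x) = \frac{x^\alpha - \alpha x - (1-\alpha)}{\alpha(\alpha-1)}$ from the parametrization used on this page; the helper names and example distributions are assumptions made for illustration.

```python
def alpha_generator(alpha):
    """Generator f_alpha for alpha not in {0, 1}."""
    def f(x):
        return (x ** alpha - alpha * x - (1.0 - alpha)) / (alpha * (alpha - 1.0))
    return f

def convex_inversion(f):
    """Map f to t -> t * f(1 / t), defined here for t > 0."""
    return lambda t: t * f(1.0 / t)

def f_divergence(p, q, f):
    """Discrete f-divergence for strictly positive p and q."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

# Illustrative strictly positive distributions (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

alpha = 0.3
forward = f_divergence(p, q, alpha_generator(alpha))
reverse = f_divergence(q, p, alpha_generator(1.0 - alpha))

# With alpha = 1/2 the generator is its own convex inversion, so D is symmetric.
half_pq = f_divergence(p, q, alpha_generator(0.5))
half_qp = f_divergence(q, p, alpha_generator(0.5))
print(forward, reverse, half_pq, half_qp)
```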

Relations to other statistical divergences

Rényi divergence

The Rényi divergences are a family of divergences defined by

$$R_\alpha(P\parallel Q) = \frac{1}{\alpha-1}\ln\Bigg(E_Q\left[\left(\frac{dP}{dQ}\right)^\alpha\right]\Bigg)$$

when $\alpha \in (0, 1)\cup (1, \infty)$. It is extended to the cases of $\alpha =0, 1, \infty$ by taking the limit.

Simple algebra shows that $R_\alpha(P\parallel Q) = \frac{1}{\alpha-1}\ln (1+\alpha(\alpha-1)D_\alpha(P\parallel Q))$, where $D_\alpha$ is the $\alpha$-divergence defined above.
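The algebraic relation between the Rényi divergence and the $\alpha$-divergence can be confirmed numerically for finite distributions. The sketch below computes $R_\alpha$ both directly and through $D_\alpha$, and also evaluates $R_\alpha$ near $\alpha = 1$, where it should approach the KL divergence. Function names and example values are assumptions for illustration.

```python
import math

def power_sum(p, q, alpha):
    """E_Q[(p/q)^alpha] for strictly positive discrete p and q."""
    return sum(px ** alpha * qx ** (1.0 - alpha) for px, qx in zip(p, q))

def alpha_divergence(p, q, alpha):
    return (power_sum(p, q, alpha) - 1.0) / (alpha * (alpha - 1.0))

def renyi_divergence(p, q, alpha):
    """Direct definition, for alpha in (0, 1) or (1, infinity)."""
    return math.log(power_sum(p, q, alpha)) / (alpha - 1.0)

def renyi_from_alpha(p, q, alpha):
    """Renyi divergence recovered from the alpha-divergence."""
    d = alpha_divergence(p, q, alpha)
    return math.log(1.0 + alpha * (alpha - 1.0) * d) / (alpha - 1.0)

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

# Illustrative strictly positive distributions (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

for alpha in (0.5, 2.0, 3.0):
    print(alpha, renyi_divergence(p, q, alpha), renyi_from_alpha(p, q, alpha))

# Near alpha = 1 the Renyi divergence is close to the KL divergence.
near_one = renyi_divergence(p, q, 1.0 + 1e-6)
print(near_one, kl_divergence(p, q))
```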

KL divergence

The KL divergence is the f-divergence generated by $f(x) = x\ln x$.

Bregman divergence

The only f-divergence that is also a Bregman divergence is the KL divergence.

Financial interpretation

A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players the expected profit rate has the same general form as the f-divergence.
