In probability theory, an f-divergence is a function D_f(P\parallel Q) that measures the difference between two probability distributions P and Q. Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence.


History

These divergences were introduced by Alfréd Rényi in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by Csiszár, Morimoto, and Ali and Silvey, and are sometimes known as Csiszár f-divergences, Csiszár-Morimoto divergences, or Ali-Silvey distances.


Definition


Non-singular case

Let P and Q be two probability distributions over a space \Omega, such that P\ll Q, that is, P is absolutely continuous with respect to Q. Then, for a convex function f: [0, \infty)\to(-\infty, \infty] such that f(x) is finite for all x > 0, f(1)=0, and f(0)=\lim_{t\to 0^+} f(t) (which could be infinite), the f-divergence of P from Q is defined as
: D_f(P\parallel Q) \equiv \int_\Omega f\left(\frac{dP}{dQ}\right)\,dQ.
f is called the generator of D_f.

In concrete applications, there is usually a reference distribution \mu on \Omega (for example, when \Omega = \R^n, the reference distribution is the Lebesgue measure), such that P, Q \ll \mu. We can then use the Radon-Nikodym theorem to take their probability densities p and q, giving
: D_f(P\parallel Q) = \int_\Omega f\left(\frac{p(x)}{q(x)}\right)q(x)\,d\mu(x).
When there is no such reference distribution ready at hand, we can simply define \mu = P+Q, and proceed as above. This is a useful technique in more abstract proofs.
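For a finite sample space with the counting measure as reference, the integral above reduces to a sum over outcomes. The following is a minimal sketch under that assumption; the function names (`f_divergence`, `kl_generator`, `tv_generator`) and the example distributions are illustrative choices, not part of the article.

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) on a finite space.

    Assumes q(x) > 0 for every outcome, so that P << Q holds trivially.
    """
    if any(qx <= 0 for qx in q):
        raise ValueError("this sketch assumes q(x) > 0 everywhere")
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def kl_generator(t):
    # f(t) = t ln t, extended by f(0) = lim_{t -> 0+} t ln t = 0
    return t * math.log(t) if t > 0 else 0.0

def tv_generator(t):
    # f(t) = |t - 1| / 2
    return 0.5 * abs(t - 1.0)

# Hypothetical example distributions on three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

kl = f_divergence(p, q, kl_generator)
tv = f_divergence(p, q, tv_generator)
print("KL:", kl)
print("TV:", tv)
```

Because f(1) = 0, the divergence of any distribution from itself is zero under either generator.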


Extension to singular measures

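The above definition can be extended to cases where P\ll Q is no longer satisfied. Since f is convex, and f(1) = 0, the function \frac{f(x)}{x-1} must be nondecreasing, so there exists f'(\infty) := \lim_{x\to\infty} f(x)/x, taking value in (-\infty, +\infty]. Since for any p(x)>0, we have \lim_{q(x)\to 0} q(x)f\left(\frac{p(x)}{q(x)}\right) = p(x)f'(\infty), we can extend the f-divergence to the case P\not\ll Q by
: D_f(P\parallel Q) = \int_{q>0} f\left(\frac{p(x)}{q(x)}\right)q(x)\,d\mu(x) + f'(\infty)\,P[q=0],
with the convention that the last term vanishes when P puts no mass there, that is, if P[q=0] = 0, then f'(\infty) P[q=0] = 0, even if f'(\infty) = \infty.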
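On a finite space, the extension amounts to summing q(x) f(p(x)/q(x)) over outcomes with q(x) > 0 and adding f'(\infty) times the mass that P places where q vanishes. The sketch below assumes that reading; the function name `f_divergence_extended` and its `f_prime_inf` argument (the caller supplies f'(\infty), for instance 1/2 for total variation and infinity for KL) are hypothetical.

```python
import math

def f_divergence_extended(p, q, f, f_prime_inf):
    """f-divergence on a finite space without assuming P << Q.

    f_prime_inf is lim_{x -> inf} f(x) / x, supplied by the caller.
    """
    total = 0.0
    mass_where_q_zero = 0.0
    for px, qx in zip(p, q):
        if qx > 0:
            total += qx * f(px / qx)
        else:
            mass_where_q_zero += px
    # Convention: the singular term is dropped when P[q = 0] = 0,
    # even if f'(inf) is infinite (avoids computing inf * 0).
    if mass_where_q_zero > 0:
        total += f_prime_inf * mass_where_q_zero
    return total

def kl_generator(t):
    return t * math.log(t) if t > 0 else 0.0

def tv_generator(t):
    return 0.5 * abs(t - 1.0)

# Hypothetical example: P puts mass 0.5 on an outcome that Q never produces.
p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]

tv = f_divergence_extended(p, q, tv_generator, 0.5)
kl = f_divergence_extended(p, q, kl_generator, math.inf)
print("TV:", tv)
print("KL:", kl)
```

With these inputs the total variation stays finite while the KL divergence is infinite, which matches the fact that f'(\infty) is finite for the former generator and infinite for the latter.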


Properties


Basic properties

* Linearity: D_{\sum_i a_i f_i} = \sum_i a_i D_{f_i} given a finite sequence of nonnegative real numbers a_i and generators f_i.
* D_f = D_g iff f(x) = g(x) + c(x-1) for some c\in \R.

In particular, the monotonicity of f-divergences under Markov transitions (the data-processing inequality) implies that if a Markov process has a positive equilibrium probability distribution P^*, then D_f(P(t)\parallel P^*) is a monotonic (non-increasing) function of time, where the probability distribution P(t) is a solution of the Kolmogorov forward equations (or master equation), which describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences D_f(P(t)\parallel P^*) are Lyapunov functions of the Kolmogorov forward equations. The converse is also true: if H(P) is a Lyapunov function for all Markov chains with positive equilibrium P^* and is of the trace form (H(P)=\sum_i f(P_i, P_i^*)), then H(P) = D_f(P\parallel P^*) for some convex function f. Bregman divergences, for example, do not have this property in general and can increase in Markov processes.
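The monotonicity statement can be illustrated with a discrete-time analogue of the forward equations, P(t+1) = P(t) K. The sketch below assumes a hypothetical three-state transition matrix `K` whose equilibrium distribution is (0.6, 0.3, 0.1), and records several f-divergences from equilibrium along the trajectory; it is an illustration, not a proof.

```python
import math

def f_divergence(p, q, f):
    # Finite-space f-divergence; assumes q(x) > 0 everywhere.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def step(p, K):
    # One step of the chain: (p K)_j = sum_i p_i K[i][j]
    n = len(p)
    return [sum(p[i] * K[i][j] for i in range(n)) for j in range(n)]

# Hypothetical birth-death chain; detailed balance gives p_star below.
K = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]
p_star = [0.6, 0.3, 0.1]

generators = {
    "KL": lambda t: t * math.log(t) if t > 0 else 0.0,
    "TV": lambda t: 0.5 * abs(t - 1.0),
    "chi2": lambda t: (t - 1.0) ** 2,
}

p = [0.0, 0.0, 1.0]  # start far from equilibrium
history = {name: [] for name in generators}
for _ in range(20):
    for name, f in generators.items():
        history[name].append(f_divergence(p, p_star, f))
    p = step(p, K)

for name, values in history.items():
    print(name, [round(v, 4) for v in values[:5]])
```

Each recorded sequence should be non-increasing, whichever convex generator is used.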


Analytic properties

The f-divergences can be expressed using Taylor series and rewritten using a weighted sum of chi-type distances.


Variational representations

Let f^* be the convex conjugate of f. Let \mathrm{effdom}(f^*) be the effective domain of f^*, that is, \mathrm{effdom}(f^*) = \{y : f^*(y) < \infty\}. Then we have two variational representations of D_f. The first, basic representation is
: D_f(P; Q) = \sup_{g: \Omega\to \mathrm{effdom}(f^*)} E_P[g(X)] - E_Q[f^*(g(X))].
(Theorem 7.14 in the cited reference.)


Example applications

Using this theorem on total variation distance, with generator f(x)= \frac 1 2 |x-1|, its convex conjugate is f^*(x^*) = \begin{cases} x^* & \text{on } [-1/2, 1/2],\\ +\infty & \text{else,} \end{cases} and we obtain
: TV(P\parallel Q) = \sup_{|g|\leq 1/2} E_P[g(X)] - E_Q[g(X)].

For chi-squared divergence, defined by f(x) = (x-1)^2, f^*(y) = y^2/4 + y, we obtain
: \chi^2(P; Q) = \sup_g E_P[g(X)] - E_Q[g(X)^2/4 + g(X)].
Since the variation term is not affine-invariant in g, even though the domain over which g varies is affine-invariant, we can use up the affine-invariance to obtain a leaner expression. Replacing g by a g + b and taking the maximum over a, b \in \R, we obtain
: \chi^2(P; Q) = \sup_g \frac{(E_P[g(X)] - E_Q[g(X)])^2}{\mathrm{Var}_Q[g(X)]},
which is just a few steps away from the Hammersley–Chapman–Robbins bound and the Cramér–Rao bound.

For \alpha-divergence with \alpha \in (-\infty, 0)\cup(0, 1), we have f_\alpha(x) = \frac{x^\alpha - \alpha x - (1-\alpha)}{\alpha(\alpha-1)}, with range x\in [0, \infty). Its convex conjugate is f_\alpha^*(y)=\frac{1}{\alpha}(x(y)^\alpha - 1) with range y\in(-\infty, (1-\alpha)^{-1}), where x(y) = ((\alpha-1)y + 1)^{\frac{1}{\alpha-1}}. Applying this theorem yields, after substitution with h = ((\alpha-1)g+1)^{\frac{1}{\alpha-1}},
: D_\alpha(P\parallel Q) = \frac{1}{\alpha(1-\alpha)} - \inf_{h: \Omega\to(0, \infty)}\left( E_Q\left[\frac{h^\alpha}{\alpha}\right] + E_P\left[\frac{h^{\alpha-1}}{1-\alpha}\right] \right)
or, releasing the constraint on h,
: D_\alpha(P\parallel Q) = \frac{1}{\alpha(1-\alpha)} - \inf_{h: \Omega\to\R}\left( E_Q\left[\frac{|h|^\alpha}{\alpha}\right] + E_P\left[\frac{|h|^{\alpha-1}}{1-\alpha}\right] \right).
Setting \alpha=-1 yields the variational representation of \chi^2-divergence obtained above.

The domain over which h varies is not affine-invariant in general, unlike the \chi^2-divergence case. The \chi^2-divergence is special, since in that case, we can remove the |\cdot| from |h|. For general \alpha \in (-\infty, 0)\cup(0, 1), the domain over which h varies is merely scale-invariant. Similar to above, we can replace h by a h, and take the minimum over a>0 to obtain
: D_\alpha(P\parallel Q) = \sup_{h>0} \left[\frac{1}{\alpha(1-\alpha)} \left( 1-\frac{E_P[h^{\alpha-1}]^\alpha}{E_Q[h^\alpha]^{\alpha-1}} \right) \right].
Setting \alpha=\frac 1 2, and performing another substitution by g=\sqrt h, yields two variational representations of the squared Hellinger distance:
: H^2(P\parallel Q) = \frac 1 2 D_{1/2}(P\parallel Q) = 2 - \inf_{g>0}\left( E_Q\left[g(X)\right] + E_P\left[g(X)^{-1}\right] \right)
: H^2(P\parallel Q) = 2 \sup_{g>0} \left(1-\sqrt{E_Q[g(X)]\, E_P[g(X)^{-1}]}\right)

Applying this theorem to the KL-divergence, defined by f(x) = x\ln x, f^*(y) = e^{y-1}, yields
: D_{KL}(P; Q) =\sup_g E_P[g(X)] - e^{-1}E_Q[e^{g(X)}].
This is strictly less efficient than the Donsker-Varadhan representation
: D_{KL}(P; Q) = \sup_g E_P[g(X)] - \ln E_Q[e^{g(X)}].
This defect is fixed by the next theorem, the improved variational representation, which in the case f'(\infty) = +\infty reads
: D_f(P\parallel Q) = \sup_g E_P[g(X)] - \inf_{a\in\R}\left\{ E_Q[f^*(g(X)-a)] + a \right\}.
(Theorem 7.15 in the cited reference.)
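The total variation and chi-squared representations in the examples above can be sanity-checked on a small finite space. The sketch below assumes hypothetical distributions `p` and `q` and plugs in the candidate maximizers g = \frac 1 2 \mathrm{sign}(p-q) for total variation and g = p/q for the chi-squared ratio form; it checks attainment for these inputs rather than proving the supremum.

```python
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

def expect(dist, values):
    # Expectation of a function (given as a list of values) under dist.
    return sum(d * v for d, v in zip(dist, values))

# Direct definitions on a finite space.
tv = 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))
chi2 = sum((px - qx) ** 2 / qx for px, qx in zip(p, q))

def tv_objective(g):
    # E_P[g] - E_Q[g], to be maximized over |g| <= 1/2
    return expect(p, g) - expect(q, g)

def chi2_ratio(g):
    # (E_P[g] - E_Q[g])^2 / Var_Q[g]
    mean_q = expect(q, g)
    var_q = expect(q, [(v - mean_q) ** 2 for v in g])
    return (expect(p, g) - mean_q) ** 2 / var_q

g_tv = [0.5 if px >= qx else -0.5 for px, qx in zip(p, q)]
g_chi2 = [px / qx for px, qx in zip(p, q)]
g_other = [1.0, -1.0, 0.0]  # an arbitrary non-optimal test function

print("TV:", tv, "attained:", tv_objective(g_tv))
print("chi2:", chi2, "attained:", chi2_ratio(g_chi2))
print("chi2 ratio for another g:", chi2_ratio(g_other))
```

The ratio form is unchanged when g is replaced by a g + b with a \neq 0, which is the affine-invariance used in the derivation.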


Example applications

Applying this theorem to KL-divergence yields the Donsker-Varadhan representation. Attempting to apply this theorem to general \alpha-divergence with \alpha \in (-\infty, 0)\cup(0, 1) does not yield a closed-form solution.
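To make the comparison between the two KL representations concrete, the sketch below evaluates the basic objective E_P[g] - e^{-1}E_Q[e^g] and the Donsker-Varadhan objective E_P[g] - \ln E_Q[e^g] on a hypothetical three-point space. The distributions and test functions are illustrative assumptions; the point is that both objectives reach the KL divergence at g = \ln(p/q) + 1, but only the Donsker-Varadhan objective is unaffected by adding a constant to g.

```python
import math

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

def expect(dist, values):
    return sum(d * v for d, v in zip(dist, values))

kl = sum(px * math.log(px / qx) for px, qx in zip(p, q))

def basic_objective(g):
    # E_P[g] - E_Q[f*(g)] with f*(y) = exp(y - 1)
    return expect(p, g) - expect(q, [math.exp(v - 1.0) for v in g])

def dv_objective(g):
    # Donsker-Varadhan: E_P[g] - ln E_Q[exp(g)]
    return expect(p, g) - math.log(expect(q, [math.exp(v) for v in g]))

g_opt = [math.log(px / qx) + 1.0 for px, qx in zip(p, q)]
g_shifted = [v + 0.7 for v in g_opt]  # same function up to a constant
g_other = [0.3, -0.2, 0.1]            # arbitrary test function

for name, g in [("optimal", g_opt), ("shifted", g_shifted), ("other", g_other)]:
    print(name, "basic:", basic_objective(g), "DV:", dv_objective(g))
print("KL:", kl)
```

For every g the Donsker-Varadhan value is at least the basic value, because \ln u \le u/e for all u > 0.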


Common examples of f-divergences

The following table lists many of the common divergences between probability distributions and the possible generating functions to which they correspond. Notably, except for total variation distance, all others are special cases of \alpha-divergence, or linear sums of \alpha-divergences.

For each f-divergence D_f, its generating function is not uniquely defined, but only up to c\cdot(t-1), where c is any real constant. That is, for any f that generates an f-divergence, we have D_{f(t)} = D_{f(t) + c\cdot(t-1)}. This freedom is not only convenient, but actually necessary.

Let f_\alpha be the generator of \alpha-divergence, then f_\alpha and f_{1-\alpha} are convex inversions of each other, so D_\alpha(P\parallel Q) = D_{1-\alpha}(Q\parallel P). In particular, this shows that the squared Hellinger distance and Jensen-Shannon divergence are symmetric.

In the literature, the \alpha-divergences are sometimes parametrized as
: \begin{cases} \frac{4}{1-\alpha^2}\big(1 - t^{(1+\alpha)/2}\big), & \text{if}\ \alpha\neq\pm1, \\ t \ln t, & \text{if}\ \alpha=1, \\ - \ln t, & \text{if}\ \alpha=-1 \end{cases}
which is equivalent to the parametrization in this page by substituting \alpha \leftarrow \frac{\alpha+1}{2}.
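The two facts about generators stated here (invariance under adding c\cdot(t-1), and the symmetry D_\alpha(P\parallel Q) = D_{1-\alpha}(Q\parallel P)) can be checked numerically. The sketch below assumes strictly positive finite distributions and uses the generator f_\alpha(t) = \frac{t^\alpha - \alpha t - (1-\alpha)}{\alpha(\alpha-1)} from this page; the helper names are hypothetical.

```python
import math

def f_divergence(p, q, f):
    # Finite-space f-divergence; assumes q(x) > 0 everywhere.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def alpha_generator(alpha):
    # Generator of the alpha-divergence for alpha not in {0, 1}.
    def f(t):
        return (t ** alpha - alpha * t - (1 - alpha)) / (alpha * (alpha - 1))
    return f

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

f_half = alpha_generator(0.5)

def f_half_shifted(t):
    # Same divergence, different generator: add c * (t - 1) with c = 3.
    return f_half(t) + 3.0 * (t - 1.0)

d_half = f_divergence(p, q, f_half)
d_half_shifted = f_divergence(p, q, f_half_shifted)

# Convex inversion: D_alpha(P || Q) versus D_{1 - alpha}(Q || P).
d_03_pq = f_divergence(p, q, alpha_generator(0.3))
d_07_qp = f_divergence(q, p, alpha_generator(0.7))

# alpha = 1/2 gives twice the squared Hellinger distance sum (sqrt p - sqrt q)^2.
hellinger_sq = sum((math.sqrt(px) - math.sqrt(qx)) ** 2 for px, qx in zip(p, q))

print(d_half, d_half_shifted)
print(d_03_pq, d_07_qp)
print(d_half, 2 * hellinger_sq)
```

The shift leaves the divergence unchanged because E_Q[p/q - 1] = 0 whenever P and Q are both probability distributions.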


Relations to other statistical divergences


Rényi divergence

The Rényi divergences are a family of divergences defined by
: R_\alpha(P \parallel Q) = \frac{1}{\alpha-1}\ln\left( E_Q\left[\left(\frac{dP}{dQ}\right)^\alpha\right]\right)
when \alpha \in (0, 1)\cup (1, \infty). The definition is extended to the cases \alpha = 0, 1, \infty by taking the limit. Simple algebra shows that
: R_\alpha(P\parallel Q) = \frac{1}{\alpha(\alpha-1)}\cdot\alpha\ln\big(1+\alpha(\alpha-1)D_\alpha(P\parallel Q)\big) = \frac{1}{\alpha-1}\ln\big(1+\alpha(\alpha-1)D_\alpha(P\parallel Q)\big),
where D_\alpha is the \alpha-divergence defined above.
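The relation between the Rényi divergence and the \alpha-divergence follows from E_Q[(dP/dQ)^\alpha] = 1 + \alpha(\alpha-1)D_\alpha(P\parallel Q), and can be verified numerically. The sketch below assumes strictly positive finite distributions and natural logarithms; the function names are hypothetical.

```python
import math

def alpha_divergence(p, q, alpha):
    # D_alpha with generator (t^a - a t - (1 - a)) / (a (a - 1)), a not in {0, 1}
    return sum(
        qx * ((px / qx) ** alpha - alpha * (px / qx) - (1 - alpha))
        / (alpha * (alpha - 1))
        for px, qx in zip(p, q)
    )

def renyi_divergence(p, q, alpha):
    # R_alpha = ln E_Q[(p/q)^alpha] / (alpha - 1)
    moment = sum(qx * (px / qx) ** alpha for px, qx in zip(p, q))
    return math.log(moment) / (alpha - 1)

def renyi_from_alpha(p, q, alpha):
    # R_alpha = ln(1 + a (a - 1) D_alpha) / (a - 1)
    d = alpha_divergence(p, q, alpha)
    return math.log(1 + alpha * (alpha - 1) * d) / (alpha - 1)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

for alpha in (0.5, 2.0, 3.0):
    print(alpha, renyi_divergence(p, q, alpha), renyi_from_alpha(p, q, alpha))
```

The two columns should agree up to floating-point rounding for each tested \alpha.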


KL divergence

The KL divergence is the f-divergence generated by f(x) = x\ln x.


Bregman divergence

The only f-divergence that is also a Bregman divergence is the KL divergence.


Financial interpretation

A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players the expected profit rate has the same general form as the f-divergence.


See also

* Kullback–Leibler divergence
* Bregman divergence

