Bretagnolle–Huber Inequality

In information theory, the Bretagnolle–Huber inequality bounds the total variation distance between two probability distributions P and Q by a concave and bounded function of the Kullback–Leibler divergence D_\mathrm{KL}(P \parallel Q). The bound can be viewed as an alternative to the well-known Pinsker's inequality: when D_\mathrm{KL}(P \parallel Q) is large (larger than 2, for instance), Pinsker's inequality is vacuous, while the Bretagnolle–Huber bound remains bounded and hence non-vacuous. It is used in statistics and machine learning to prove information-theoretic lower bounds that rely on hypothesis testing. (The Bretagnolle–Huber–Carol inequality is a variation of a concentration inequality for multinomially distributed random variables which bounds the total variation distance.)


Formal statement


Preliminary definitions

Let P and Q be two probability distributions on a measurable space (\Omega, \mathcal{F}). Recall that the total variation distance between P and Q is defined by

: d_\mathrm{TV}(P,Q) = \sup_{A \in \mathcal{F}} \{ |P(A) - Q(A)| \}.

The Kullback–Leibler divergence is defined as follows:

: D_\mathrm{KL}(P \parallel Q) = \begin{cases} \int_{\Omega} \log\bigl(\frac{dP}{dQ}\bigr)\, dP & \text{if } P \ll Q, \\ +\infty & \text{otherwise.} \end{cases}

In the above, the notation P \ll Q stands for absolute continuity of P with respect to Q, and \frac{dP}{dQ} stands for the Radon–Nikodym derivative of P with respect to Q.
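For distributions on a finite set, both quantities reduce to finite sums. The following minimal Python sketch (not part of the original article; the function names and example vectors are illustrative) computes them directly from probability vectors.

<syntaxhighlight lang="python">
import numpy as np

def tv_distance(p, q):
    """Total variation distance: sup_A |P(A) - Q(A)| = (1/2) * sum_x |p(x) - q(x)|."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q); +inf unless P << Q."""
    if np.any((q == 0) & (p > 0)):
        return np.inf  # P is not absolutely continuous with respect to Q
    mask = p > 0       # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(tv_distance(p, q), kl_divergence(p, q))
</syntaxhighlight>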


General statement

The Bretagnolle–Huber inequality says:

: d_\mathrm{TV}(P,Q) \leq \sqrt{1 - \exp(-D_\mathrm{KL}(P \parallel Q))} \leq 1 - \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q))
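As a hedged numerical illustration (not from the source; the Bernoulli parameters below are arbitrary choices), the sketch compares the Bretagnolle–Huber bound \sqrt{1 - \exp(-D_\mathrm{KL})} with Pinsker's bound \sqrt{D_\mathrm{KL}/2}: once the divergence exceeds 2, Pinsker's bound exceeds 1 and becomes vacuous, while the Bretagnolle–Huber bound stays below 1.

<syntaxhighlight lang="python">
import numpy as np

def kl_bernoulli(p, q):
    """D_KL(Ber(p) || Ber(q)) for 0 < p, q < 1."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bh_bound(kl):
    """Bretagnolle-Huber upper bound on the total variation distance."""
    return np.sqrt(1.0 - np.exp(-kl))

def pinsker_bound(kl):
    """Pinsker's upper bound on the total variation distance."""
    return np.sqrt(kl / 2.0)

for p, q in [(0.5, 0.6), (0.5, 0.9), (0.01, 0.99)]:
    kl = kl_bernoulli(p, q)
    tv = abs(p - q)  # total variation distance between two Bernoulli laws
    print(f"KL={kl:.3f}  TV={tv:.3f}  BH bound={bh_bound(kl):.3f}  Pinsker bound={pinsker_bound(kl):.3f}")
</syntaxhighlight>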


Alternative version

The following version is directly implied by the bound above, but some authors prefer stating it this way. Let A \in \mathcal{F} be any event. Then

: P(A) + Q(\bar{A}) \geq \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q))

where \bar{A} = \Omega \smallsetminus A is the complement of A.

Indeed, by definition of the total variation, for any A \in \mathcal{F},

: \begin{align} Q(A) - P(A) \leq d_\mathrm{TV}(P,Q) & \leq 1 - \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) \\ & = Q(A) + Q(\bar{A}) - \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) \end{align}

Rearranging, we obtain the claimed lower bound on P(A) + Q(\bar{A}).
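To make the event-based form concrete, here is a minimal sketch (an illustration chosen for this purpose, not taken from the source): for the unit-variance Gaussians P = N(0,1) and Q = N(\mu,1), one has D_\mathrm{KL}(P \parallel Q) = \mu^2/2, and taking A = \{x > \mu/2\} (the likelihood-ratio test) the sum of the two error probabilities always stays above \frac{1}{2}e^{-\mu^2/2}.

<syntaxhighlight lang="python">
import math

def normal_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

for mu in (0.5, 1.0, 2.0, 4.0):
    kl = mu ** 2 / 2.0                    # D_KL(N(0,1) || N(mu,1))
    p_A = 1.0 - normal_cdf(mu / 2.0)      # P(A) under N(0,1), with A = {x > mu/2}
    q_Abar = normal_cdf(mu / 2.0 - mu)    # Q(complement of A) under N(mu,1)
    lower = 0.5 * math.exp(-kl)           # Bretagnolle-Huber lower bound
    print(f"mu={mu}: P(A) + Q(A^c) = {p_A + q_Abar:.4f} >= {lower:.4f}")
</syntaxhighlight>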


Proof

We prove the main statement following the ideas in Tsybakov's book (Lemma 2.6, page 89), which differ from the original proof (see C. Canonne's note for a modernized retranscription of their argument). Throughout, we identify P and Q with their densities with respect to a common dominating measure. The proof is in two steps:

1. Prove, using Cauchy–Schwarz, that the total variation is related to the Bhattacharyya coefficient (right-hand side of the inequality):

:: 1 - d_\mathrm{TV}(P,Q)^2 \geq \left(\int \sqrt{PQ}\right)^2

2. Prove, by a clever application of Jensen's inequality, that

:: \left(\int \sqrt{PQ}\right)^2 \geq \exp(-D_\mathrm{KL}(P \parallel Q))

* Step 1:
: First notice that
:: d_\mathrm{TV}(P,Q) = 1 - \int \min(P,Q) = \int \max(P,Q) - 1
: To see this, denote A^* = \arg\max_{A \in \mathcal{F}} |P(A) - Q(A)| and, without loss of generality, assume that P(A^*) > Q(A^*), such that d_\mathrm{TV}(P,Q) = P(A^*) - Q(A^*). Then we can rewrite
:: d_\mathrm{TV}(P,Q) = \int_{A^*} \max(P,Q) - \int_{A^*} \min(P,Q)
: And then adding and removing \int_{\bar{A^*}} \max(P,Q) \text{ or } \int_{\bar{A^*}} \min(P,Q) we obtain both identities.
: Then
:: \begin{align} 1 - d_\mathrm{TV}(P,Q)^2 & = (1 - d_\mathrm{TV}(P,Q))(1 + d_\mathrm{TV}(P,Q)) \\ & = \int \min(P,Q) \int \max(P,Q) \\ & \geq \left(\int \sqrt{\min(P,Q)\max(P,Q)}\right)^2 \\ & = \left(\int \sqrt{PQ}\right)^2 \end{align}
: because PQ = \min(P,Q) \max(P,Q).

* Step 2:
: We write (\cdot)^2 = \exp(2\log(\cdot)) and apply Jensen's inequality:
:: \begin{align} \left(\int \sqrt{PQ}\right)^2 &= \exp\left(2\log\left(\int \sqrt{PQ}\right)\right) \\ & = \exp\left(2\log\left(\int P\sqrt{\tfrac{Q}{P}}\right)\right) \\ & = \exp\left(2\log\left(\operatorname{E}_P\left[\left(\sqrt{\tfrac{dP}{dQ}}\right)^{-1}\right]\right)\right) \\ & \geq \exp\left(\operatorname{E}_P\left[\log\left(\frac{dQ}{dP}\right)\right]\right) = \exp(-D_\mathrm{KL}(P \parallel Q)) \end{align}
: Combining the results of steps 1 and 2 leads to the claimed bound on the total variation.
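The two intermediate inequalities can also be checked numerically. The sketch below (illustrative only, not part of the proof) draws random finite distributions and verifies 1 - d_\mathrm{TV}^2 \geq \mathrm{BC}^2 \geq \exp(-D_\mathrm{KL}), where \mathrm{BC} = \int \sqrt{PQ} is the Bhattacharyya coefficient.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(10))   # random distribution on 10 points (a.s. strictly positive)
    q = rng.dirichlet(np.ones(10))
    tv = 0.5 * np.abs(p - q).sum()   # total variation distance
    bc = np.sum(np.sqrt(p * q))      # Bhattacharyya coefficient
    kl = np.sum(p * np.log(p / q))   # D_KL(P || Q)
    print(f"1-TV^2={1 - tv**2:.4f} >= BC^2={bc**2:.4f} >= exp(-KL)={np.exp(-kl):.4f}")
</syntaxhighlight>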


Examples of applications


Sample complexity of biased coin tosses

The question is ''How many coin tosses do I need to distinguish a fair coin from a biased one?'' Assume you have two coins, a fair coin (Bernoulli distributed with mean p_1 = 1/2) and an \varepsilon-biased coin (p_2 = 1/2 + \varepsilon). Then, in order to identify the biased coin with probability at least 1 - \delta (for some \delta > 0), the number of tosses must satisfy

: n \geq \frac{1}{4\varepsilon^2}\log\left(\frac{1}{4\delta(1-\delta)}\right).

In order to obtain this lower bound we impose that the total variation distance between two sequences of n samples is at least 1 - 2\delta. This is because a procedure that identifies the correct coin with probability at least 1 - \delta under both hypotheses defines an event whose probabilities under the two distributions differ by at least 1 - 2\delta, and the total variation upper bounds this difference. Denote P_1^n and P_2^n the respective joint distributions of the n coin tosses for each coin. Then we have

: \begin{align} (1-2\delta)^2 & \leq d_\mathrm{TV}\left(P_1^n, P_2^n\right)^2 \\ & \leq 1 - e^{-D_\mathrm{KL}(P_2^n \parallel P_1^n)} \\ & = 1 - e^{-n D_\mathrm{KL}(P_2 \parallel P_1)} \\ & \leq 1 - e^{-4n\varepsilon^2} \end{align}

where the last step uses D_\mathrm{KL}(P_2 \parallel P_1) = \left(\tfrac{1}{2}+\varepsilon\right)\log(1+2\varepsilon) + \left(\tfrac{1}{2}-\varepsilon\right)\log(1-2\varepsilon) \leq 4\varepsilon^2 (by \log(1+x) \leq x). The result is obtained by rearranging the terms.
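For concreteness, the short sketch below (with arbitrary example values of \varepsilon and \delta, not taken from the source) evaluates this sample-size lower bound.

<syntaxhighlight lang="python">
import math

def min_tosses(eps, delta):
    """Lower bound n >= log(1/(4*delta*(1-delta))) / (4*eps^2) from the derivation above."""
    return math.log(1.0 / (4.0 * delta * (1.0 - delta))) / (4.0 * eps ** 2)

for eps, delta in [(0.1, 0.05), (0.01, 0.05), (0.01, 0.01)]:
    print(f"eps={eps}, delta={delta}: n >= {math.ceil(min_tosses(eps, delta))}")
</syntaxhighlight>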


Information-theoretic lower bound for ''k''-armed bandit games

In the multi-armed bandit framework, a lower bound on the minimax regret of any bandit algorithm can be proved using Bretagnolle–Huber and its consequence for hypothesis testing (see Chapter 15 of ''Bandit Algorithms'').


History

The result was first proved in 1979 by Jean Bretagnolle and Catherine Huber, and published in the proceedings of the Strasbourg Probability Seminar. Alexandre Tsybakov's book features an early re-publication of the inequality and its attribution to Bretagnolle and Huber, which is presented as an early and less general version of Assouad's lemma (see notes 2.8). A constant improvement on Bretagnolle–Huber was proved in 2014 as a consequence of an extension of Fano's Inequality.


See also

* Total variation distance, for a list of upper bounds
* Bretagnolle–Huber–Carol inequality, in Concentration inequality

