In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation (or equivalently, the opposite inequality for concave transformations).

Jensen's inequality generalizes the statement that the secant line of a convex function lies ''above'' the graph of the function, which is Jensen's inequality for two points: the secant line consists of weighted means of the convex function (for ''t'' ∈ [0,1]),

:t f(x_1) + (1-t) f(x_2),

while the graph of the function is the convex function of the weighted means,

:f(t x_1 + (1-t) x_2).

Thus, Jensen's inequality in this case is

:f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2).

In the context of probability theory, it is generally stated in the following form: if ''X'' is a random variable and \varphi is a convex function, then

:\varphi(\operatorname{E}[X]) \leq \operatorname{E}\left[\varphi(X)\right].

The difference between the two sides of the inequality, \operatorname{E}\left[\varphi(X)\right] - \varphi\left(\operatorname{E}[X]\right), is called the Jensen gap.


Statements

The classical form of Jensen's inequality involves several numbers and weights. The inequality can be stated quite generally using either the language of measure theory or (equivalently) probability. In the probabilistic setting, the inequality can be further generalized to its ''full strength''.


Finite form

For a real convex function \varphi, numbers x_1, x_2, \ldots, x_n in its domain, and positive weights a_i, Jensen's inequality can be stated as:

:\varphi\left(\frac{\sum a_i x_i}{\sum a_i}\right) \le \frac{\sum a_i \varphi(x_i)}{\sum a_i} \qquad (1)

and the inequality is reversed if \varphi is concave, which is

:\varphi\left(\frac{\sum a_i x_i}{\sum a_i}\right) \ge \frac{\sum a_i \varphi(x_i)}{\sum a_i}. \qquad (2)

Equality holds if and only if x_1=x_2=\cdots =x_n or \varphi is linear on a domain containing x_1,x_2,\cdots ,x_n.

As a particular case, if the weights a_i are all equal, then (1) and (2) become

:\varphi\left(\frac{\sum x_i}{n}\right) \le \frac{\sum \varphi(x_i)}{n} \qquad (3)

:\varphi\left(\frac{\sum x_i}{n}\right) \ge \frac{\sum \varphi(x_i)}{n}. \qquad (4)

For instance, the function \log(x) is ''concave'', so substituting \varphi(x) = \log(x) in formula (4) establishes the (logarithm of the) familiar arithmetic-mean/geometric-mean inequality:

:\log\!\left( \frac{\sum_{i=1}^n x_i}{n}\right) \geq \frac{\sum_{i=1}^n \log\!\left( x_i \right)}{n}

:\exp\!\left(\log\!\left( \frac{\sum_{i=1}^n x_i}{n}\right)\right) \geq \exp\!\left(\frac{\sum_{i=1}^n \log\!\left( x_i \right)}{n}\right)

:\frac{x_1 + x_2 + \cdots + x_n}{n} \geq \sqrt[n]{x_1 x_2 \cdots x_n}

A common application has ''x'' as a function of another variable (or set of variables) ''t'', that is, x_i = g(t_i). All of this carries directly over to the general continuous case: the weights a_i are replaced by a non-negative integrable function f(x), such as a probability distribution, and the summations are replaced by integrals.
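As a quick numerical illustration (not part of the original text), the following Python sketch checks the weighted finite form (1) for an arbitrary convex function and the equal-weight arithmetic-mean/geometric-mean consequence; the data and the choice \varphi(x) = e^x are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=6)   # points in the domain of phi
a = rng.uniform(0.1, 1.0, size=6)    # positive weights (not necessarily summing to 1)

phi = np.exp                          # a convex function

# Weighted finite form (1): phi(sum a_i x_i / sum a_i) <= sum a_i phi(x_i) / sum a_i
lhs = phi(np.dot(a, x) / a.sum())
rhs = np.dot(a, phi(x)) / a.sum()
assert lhs <= rhs

# Equal weights with the concave log give the AM-GM inequality
arithmetic_mean = x.mean()
geometric_mean = np.exp(np.log(x).mean())
assert geometric_mean <= arithmetic_mean

print(lhs, rhs, geometric_mean, arithmetic_mean)
```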


Measure-theoretic form

Let (\Omega, A, \mu) be a probability space. Let f : \Omega \to \mathbb{R} be a \mu-measurable function and \varphi : \mathbb{R} \to \mathbb{R} be convex. Then:

:\varphi\left(\int_\Omega f \,\mathrm{d}\mu\right) \leq \int_\Omega \varphi \circ f \,\mathrm{d}\mu.

In real analysis, we may require an estimate on

:\varphi\left(\int_a^b f(x)\, dx\right)

where a, b \in \mathbb{R}, and f\colon [a, b] \to \R is a non-negative Lebesgue-integrable function. In this case, the Lebesgue measure of [a, b] need not be 1. However, by integration by substitution, the interval can be rescaled so that it has measure 1. Then Jensen's inequality can be applied to get

:\varphi\left(\frac{1}{b-a}\int_a^b f(x)\, dx\right) \le \frac{1}{b-a} \int_a^b \varphi(f(x)) \,dx.


Probabilistic form

The same result can be equivalently stated in a probability theory setting, by a simple change of notation. Let (\Omega, \mathfrak{F}, \operatorname{P}) be a probability space, ''X'' an integrable real-valued random variable and \varphi a convex function. Then:

:\varphi\big(\operatorname{E}[X]\big) \leq \operatorname{E}\left[\varphi(X)\right].

In this probability setting, the measure \mu is intended as a probability \operatorname{P}, the integral with respect to \mu as an expected value \operatorname{E}, and the function f as a random variable ''X''.

Note that the equality holds if and only if \varphi is a linear function on some convex set A such that \operatorname{P}(X \in A)=1 (which follows by inspecting the measure-theoretic proof below).
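A minimal Monte Carlo sanity check of the probabilistic form, assuming an arbitrary integrable distribution and convex function (the exponential distribution and \varphi(x) = x^2 below are illustrative choices only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=1_000_000)  # any integrable random variable
phi = np.square                                  # a convex function

# phi(E[X]) should not exceed E[phi(X)] (up to Monte Carlo error)
lhs = phi(X.mean())
rhs = phi(X).mean()
print(lhs, rhs)   # ~4.0 vs ~8.0 for Exp(scale=2): (E X)^2 = 4, E X^2 = 8
assert lhs <= rhs
```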


General inequality in a probabilistic setting

More generally, let ''T'' be a real topological vector space, and ''X'' a ''T''-valued integrable random variable. In this general setting, ''integrable'' means that there exists an element \operatorname{E}[X] in ''T'', such that for any element ''z'' in the dual space of ''T'': \operatorname{E}|\langle z, X \rangle| <\infty, and \langle z, \operatorname{E}[X]\rangle = \operatorname{E}[\langle z, X \rangle]. Then, for any measurable convex function \varphi and any sub-σ-algebra \mathfrak{G} of \mathfrak{F}:

:\varphi\left(\operatorname{E}\left[X \mid\mathfrak{G}\right]\right) \leq \operatorname{E}\left[\varphi(X)\mid\mathfrak{G}\right].

Here \operatorname{E}[\cdot\mid\mathfrak{G}] stands for the expectation conditioned to the σ-algebra \mathfrak{G}. This general statement reduces to the previous ones when the topological vector space ''T'' is the real axis, and \mathfrak{G} is the trivial σ-algebra \{\varnothing, \Omega\} (where \varnothing is the empty set, and \Omega is the sample space).


A sharpened and generalized form

Let ''X'' be a one-dimensional random variable with mean \mu and variance \sigma^2\ge 0. Let \varphi(x) be a twice differentiable function, and define the function

:h(x) \triangleq \frac{\varphi(x)-\varphi(\mu)}{(x-\mu)^2} - \frac{\varphi'(\mu)}{x-\mu}.

Then

:\sigma^2 \inf_x \frac{\varphi''(x)}{2} \le \sigma^2 \inf_x h(x) \le \operatorname{E}\left[\varphi(X)\right] - \varphi\left(\operatorname{E}[X]\right) \le \sigma^2 \sup_x h(x) \le \sigma^2 \sup_x \frac{\varphi''(x)}{2}.

In particular, when \varphi(x) is convex, then \varphi''(x)\ge 0, and the standard form of Jensen's inequality immediately follows for the case where \varphi(x) is additionally assumed to be twice differentiable.
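The bounds can be checked numerically. The sketch below is an illustration only: it takes \varphi(x) = e^x and ''X'' uniform on [0, 1], computes the Jensen gap in closed form, and takes the infimum and supremum of \varphi''/2 over the support of ''X'' (the setting in which these bounds are typically applied).

```python
import numpy as np

# X ~ Uniform(0, 1), phi(x) = exp(x)
mu, var = 0.5, 1.0 / 12.0

# Exact Jensen gap: E[e^X] - e^{E[X]} = (e - 1) - e^{0.5}
gap = (np.e - 1.0) - np.exp(mu)

# phi''(x) = e^x, so on the support [0, 1]: inf = 1, sup = e
lower = var * 1.0 / 2.0
upper = var * np.e / 2.0

print(lower, gap, upper)   # ~0.0417 <= ~0.0696 <= ~0.1133
assert lower <= gap <= upper
```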


Proofs


Intuitive graphical proof

Jensen's inequality can be proved in several ways, and three different proofs corresponding to the different statements above will be offered. Before embarking on these mathematical derivations, however, it is worth analyzing an intuitive graphical argument based on the probabilistic case where ''X'' is a real number. Assuming a hypothetical distribution of ''X'' values, one can immediately identify the position of \operatorname{E}[X] and its image \varphi(\operatorname{E}[X]) in the graph. Noticing that for convex mappings Y = \varphi(X) the corresponding distribution of ''Y'' values is increasingly "stretched up" for increasing values of ''X'', it is easy to see that the distribution of ''Y'' is broader in the interval corresponding to X > X_0 and narrower in X < X_0 for any X_0; in particular, this is also true for X_0 = \operatorname{E}[X]. Consequently, in this picture the expectation of ''Y'' will always shift upwards with respect to the position of \varphi(\operatorname{E}[X]). A similar reasoning holds if the distribution of ''X'' covers a decreasing portion of the convex function, or both a decreasing and an increasing portion of it. This "proves" the inequality, i.e.

:\varphi(\operatorname{E}[X]) \leq \operatorname{E}[\varphi(X)] = \operatorname{E}[Y],

with equality when \varphi(X) is not strictly convex, e.g. when it is a straight line, or when ''X'' follows a degenerate distribution (i.e. is a constant). The proofs below formalize this intuitive notion.


Proof 1 (finite form)

If \lambda_1 and \lambda_2 are two arbitrary nonnegative real numbers such that \lambda_1 + \lambda_2 = 1, then convexity of \varphi implies

:\forall x_1, x_2: \qquad \varphi \left (\lambda_1 x_1+\lambda_2 x_2 \right )\leq \lambda_1\,\varphi(x_1)+\lambda_2\,\varphi(x_2).

This can be generalized: if \lambda_1, \ldots, \lambda_n are nonnegative real numbers such that \lambda_1 + \cdots + \lambda_n = 1, then

:\varphi(\lambda_1 x_1+\lambda_2 x_2+\cdots+\lambda_n x_n)\leq \lambda_1\,\varphi(x_1)+\lambda_2\,\varphi(x_2)+\cdots+\lambda_n\,\varphi(x_n),

for any x_1, \ldots, x_n.

The ''finite form'' of Jensen's inequality can be proved by induction: by convexity hypotheses, the statement is true for ''n'' = 2. Suppose the statement is true for some ''n'', so

:\varphi\left(\sum_{i=1}^{n}\lambda_i x_i\right) \leq \sum_{i=1}^{n}\lambda_i \varphi\left(x_i\right)

for any \lambda_1, \ldots, \lambda_n such that \lambda_1 + \cdots + \lambda_n = 1. One needs to prove it for ''n'' + 1. At least one of the \lambda_i is strictly smaller than 1, say \lambda_{n+1}; therefore by convexity inequality:

:\begin{align} \varphi\left(\sum_{i=1}^{n+1}\lambda_i x_i\right) &= \varphi\left((1-\lambda_{n+1})\sum_{i=1}^{n} \frac{\lambda_i}{1-\lambda_{n+1}} x_i + \lambda_{n+1} x_{n+1} \right) \\ &\leq (1-\lambda_{n+1}) \varphi\left(\sum_{i=1}^{n} \frac{\lambda_i}{1-\lambda_{n+1}} x_i \right)+\lambda_{n+1}\,\varphi(x_{n+1}). \end{align}

Since \lambda_1 + \cdots + \lambda_n + \lambda_{n+1} = 1,

:\sum_{i=1}^{n} \frac{\lambda_i}{1-\lambda_{n+1}} = 1,

applying the inductive hypothesis gives

:\varphi\left(\sum_{i=1}^{n}\frac{\lambda_i}{1-\lambda_{n+1}} x_i\right) \leq \sum_{i=1}^{n}\frac{\lambda_i}{1-\lambda_{n+1}} \varphi(x_i)

therefore

:\varphi\left(\sum_{i=1}^{n+1}\lambda_i x_i\right) \leq (1-\lambda_{n+1}) \sum_{i=1}^{n}\frac{\lambda_i}{1-\lambda_{n+1}} \varphi(x_i)+\lambda_{n+1}\,\varphi(x_{n+1}) =\sum_{i=1}^{n+1}\lambda_i \varphi(x_i).

We deduce that the inequality is true for ''n'' + 1; by induction it follows that the result is also true for all integers ''n'' greater than 2.

In order to obtain the general inequality from this finite form, one needs to use a density argument. The finite form can be rewritten as:

:\varphi\left(\int x\,d\mu_n(x) \right)\leq \int \varphi(x)\,d\mu_n(x),

where \mu_n is a measure given by an arbitrary convex combination of Dirac deltas:

:\mu_n= \sum_{i=1}^n \lambda_i \delta_{x_i}.

Since convex functions are continuous, and since convex combinations of Dirac deltas are weakly dense in the set of probability measures (as could be easily verified), the general statement is obtained simply by a limiting procedure.


Proof 2 (measure-theoretic form)

Let g be a real-valued \mu-integrable function on a probability space \Omega, and let \varphi be a convex function on the real numbers. Since \varphi is convex, at each real number x we have a nonempty set of subderivatives, which may be thought of as lines touching the graph of \varphi at x, but which are below the graph of \varphi at all points (support lines of the graph).

Now, if we define

:x_0:=\int_\Omega g\, d\mu,

because of the existence of subderivatives for convex functions, we may choose a and b such that

:ax + b \leq \varphi(x),

for all real x and

:ax_0+ b = \varphi(x_0).

But then we have that

:\varphi \circ g (\omega) \geq ag(\omega)+ b

for almost all \omega \in \Omega. Since we have a probability measure, the integral is monotone with \mu(\Omega) = 1 so that

:\int_\Omega \varphi \circ g\, d\mu \geq \int_\Omega (ag + b)\, d\mu = a\int_\Omega g\, d\mu + b\int_\Omega d\mu = ax_0 + b = \varphi (x_0) = \varphi \left (\int_\Omega g\, d\mu \right ),

as desired.


Proof 3 (general inequality in a probabilistic setting)

Let ''X'' be an integrable random variable that takes values in a real topological vector space ''T''. Since \varphi: T \to \R is convex, for any x,y \in T, the quantity

:\frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta},

is decreasing as \theta approaches 0^+. In particular, the ''subdifferential'' of \varphi evaluated at x in the direction y is well-defined by

:(D\varphi)(x)\cdot y:=\lim_{\theta \downarrow 0} \frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta}=\inf_{\theta \neq 0} \frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta}.

It is easily seen that the subdifferential is linear in y (that is false and the assertion requires the Hahn–Banach theorem to be proved) and, since the infimum taken in the right-hand side of the previous formula is smaller than the value of the same term for \theta = 1, one gets

:\varphi(x)\leq \varphi(x+y)-(D\varphi)(x)\cdot y.

In particular, for an arbitrary sub-σ-algebra \mathfrak{G} we can evaluate the last inequality when x = \operatorname{E}[X\mid\mathfrak{G}],\; y=X-\operatorname{E}[X\mid\mathfrak{G}] to obtain

:\varphi(\operatorname{E}[X\mid\mathfrak{G}]) \leq \varphi(X)-(D\varphi)(\operatorname{E}[X\mid\mathfrak{G}])\cdot (X-\operatorname{E}[X\mid\mathfrak{G}]).

Now, if we take the expectation conditioned to \mathfrak{G} on both sides of the previous expression, we get the result since:

:\operatorname{E} \left[\left[(D\varphi)(\operatorname{E}[X\mid\mathfrak{G}])\cdot (X-\operatorname{E}[X\mid\mathfrak{G}])\right]\mid\mathfrak{G} \right] = (D\varphi)(\operatorname{E}[X\mid\mathfrak{G}])\cdot \operatorname{E}\left[\left( X-\operatorname{E}[X\mid\mathfrak{G}] \right) \mid \mathfrak{G}\right]=0,

by the linearity of the subdifferential in the ''y'' variable, and the following well-known property of the conditional expectation:

:\operatorname{E} \left[ \left(\operatorname{E}[X\mid\mathfrak{G}]\right) \mid\mathfrak{G} \right] = \operatorname{E}[ X \mid\mathfrak{G}].


Applications and special cases


Form involving a probability density function

Suppose \Omega is a measurable subset of the real line and ''f''(''x'') is a non-negative function such that

:\int_{-\infty}^\infty f(x)\,dx = 1.

In probabilistic language, ''f'' is a probability density function. Then Jensen's inequality becomes the following statement about convex integrals: If ''g'' is any real-valued measurable function and \varphi is convex over the range of ''g'', then

:\varphi\left(\int_{-\infty}^\infty g(x)f(x)\, dx\right) \le \int_{-\infty}^\infty \varphi(g(x)) f(x)\, dx.

If ''g''(''x'') = ''x'', then this form of the inequality reduces to a commonly used special case:

:\varphi\left(\int_{-\infty}^\infty x\, f(x)\, dx\right) \le \int_{-\infty}^\infty \varphi(x)\,f(x)\, dx.

This is applied in Variational Bayesian methods.
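The concave-logarithm case of this density form is the one exploited in variational Bayesian methods, since \log \int g f \, dx \ge \int (\log g) f \, dx yields an evidence lower bound. The quadrature sketch below is purely illustrative; the density and integrand are arbitrary choices, not taken from any particular method.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 20001)
dx = x[1] - x[0]

f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # a probability density (standard normal)
g = np.exp(-np.abs(x))                        # any positive integrand

# Concave log: log( int g f dx ) >= int log(g) f dx
lhs = np.log(np.sum(g * f) * dx)
rhs = np.sum(np.log(g) * f) * dx
print(lhs, rhs)
assert lhs >= rhs
```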


Example: even moments of a random variable

If ''g''(''x'') = ''x''^{2n}, and ''X'' is a random variable, then ''g'' is convex as

:\frac{d^2 g}{dx^2}(x) = 2n(2n - 1)x^{2n-2} \geq 0\quad \forall\ x \in \R

and so

:g(\operatorname{E}[X]) = (\operatorname{E}[X])^{2n} \leq \operatorname{E}[X^{2n}].

In particular, if some even moment 2''n'' of ''X'' is finite, ''X'' has a finite mean. An extension of this argument shows ''X'' has finite moments of every order l\in\N dividing ''n''.
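A short Monte Carlo check of the even-moment bound (\operatorname{E}[X])^{2n} \le \operatorname{E}[X^{2n}]; the Gaussian distribution and the choice ''n'' = 2 below are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=1.0, scale=2.0, size=1_000_000)
n = 2                                   # check the 2n = 4th moment

lhs = X.mean() ** (2 * n)               # (E[X])^4  -> 1
rhs = np.mean(X ** (2 * n))             # E[X^4]    -> mu^4 + 6 mu^2 sigma^2 + 3 sigma^4 = 73
print(lhs, rhs)
assert lhs <= rhs
```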


Alternative finite form

Let \Omega = \{x_1, \ldots, x_n\}, and take \mu to be the counting measure on \Omega; then the general form reduces to a statement about sums:

:\varphi\left(\sum_{i=1}^{n} g(x_i)\lambda_i \right) \le \sum_{i=1}^{n} \varphi(g(x_i)) \lambda_i,

provided that \lambda_i \geq 0 and

:\lambda_1 + \cdots + \lambda_n = 1.

There is also an infinite discrete form.


Statistical physics

Jensen's inequality is of particular importance in statistical physics when the convex function is an exponential, giving:

:e^{\operatorname{E}[X]} \leq \operatorname{E}\left[e^X \right],

where the expected values are with respect to some probability distribution in the random variable ''X''.

Proof: Let \varphi(x) = e^x in \varphi\left(\operatorname{E}[X]\right) \leq \operatorname{E}\left[\varphi(X)\right].
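For a Gaussian ''X'' the right-hand side is available in closed form (\operatorname{E}[e^X] = e^{\mu + \sigma^2/2}), which makes the inequality easy to verify numerically; the sketch below is illustrative, with arbitrary parameter values.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.3, 1.5
X = rng.normal(mu, sigma, size=1_000_000)

lhs = np.exp(X.mean())            # e^{E[X]}  ->  e^{mu}
rhs = np.exp(X).mean()            # E[e^X]    ->  e^{mu + sigma^2/2} for a Gaussian
print(lhs, np.exp(mu), rhs, np.exp(mu + sigma**2 / 2))
assert lhs <= rhs
```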


Information theory

If ''p''(''x'') is the true probability density for ''X'', and ''q''(''x'') is another density, then applying Jensen's inequality for the random variable ''Y''(''X'') = ''q''(''X'')/''p''(''X'') and the convex function \varphi(y) = -\log(y) gives

:\operatorname{E}[\varphi(Y)] \ge \varphi(\operatorname{E}[Y]).

Therefore:

:-D(p(x)\|q(x))=\int p(x) \log \left (\frac{q(x)}{p(x)} \right ) \, dx \le \log \left ( \int p(x) \frac{q(x)}{p(x)}\,dx \right ) = \log \left (\int q(x)\,dx \right ) = 0,

a result called Gibbs' inequality. It shows that the average message length is minimised when codes are assigned on the basis of the true probabilities ''p'' rather than any other distribution ''q''. The quantity that is non-negative is called the Kullback–Leibler divergence of ''q'' from ''p'', where

:D(p(x)\|q(x))=\int p(x) \log \left (\frac{p(x)}{q(x)} \right ) dx.

Since -\log(x) is a strictly convex function for x > 0, it follows that equality holds when p(x) equals q(x) almost everywhere.
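A discrete analogue is easy to verify directly: for probability vectors ''p'' and ''q'' the Kullback–Leibler divergence \sum_i p_i \log(p_i/q_i) is non-negative and vanishes when q = p. The numbers below are arbitrary illustrative values.

```python
import numpy as np

# Two discrete distributions on the same support (arbitrary example values)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

# Gibbs' inequality: D(p || q) = sum p log(p/q) >= 0, with equality iff p == q
kl_pq = np.sum(p * np.log(p / q))
kl_pp = np.sum(p * np.log(p / p))
print(kl_pq, kl_pp)    # positive, and exactly 0 for q = p
assert kl_pq >= 0 and abs(kl_pp) < 1e-12
```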


Rao–Blackwell theorem

If ''L'' is a convex function and \mathfrak{G} a sub-sigma-algebra, then, from the conditional version of Jensen's inequality, we get

:L(\operatorname{E}[\delta(X) \mid \mathfrak{G}]) \le \operatorname{E}[L(\delta(X)) \mid \mathfrak{G}]\quad \Longrightarrow \quad \operatorname{E}[L(\operatorname{E}[\delta(X) \mid \mathfrak{G}])] \le \operatorname{E}[L(\delta(X))].

So if δ(''X'') is some estimator of an unobserved parameter θ given a vector of observables ''X''; and if ''T''(''X'') is a sufficient statistic for θ; then an improved estimator, in the sense of having a smaller expected loss ''L'', can be obtained by calculating

:\delta_1 (X) = \operatorname{E}_{\theta}[\delta(X') \mid T(X')= T(X)],

the expected value of δ with respect to θ, taken over all possible vectors of observations ''X''′ compatible with the same value of ''T''(''X'') as that observed. Further, because ''T'' is a sufficient statistic, \delta_1 (X) does not depend on θ, hence it becomes a statistic.

This result is known as the Rao–Blackwell theorem.
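As an illustration (a standard textbook setup, not taken from the text above), consider i.i.d. Bernoulli(θ) observations: the crude unbiased estimator \delta(X) = X_1 is improved by conditioning on the sufficient statistic T(X) = \sum_i X_i, since \operatorname{E}[X_1 \mid T] = T/n is the sample mean. A Monte Carlo comparison of the two mean squared errors under quadratic loss:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 20, 200_000

X = rng.binomial(1, theta, size=(reps, n))

delta = X[:, 0]                 # crude unbiased estimator: the first observation
T = X.sum(axis=1)               # sufficient statistic
delta1 = T / n                  # Rao-Blackwellized estimator: E[X_1 | T] = T / n

mse = lambda est: np.mean((est - theta) ** 2)   # quadratic (convex) loss
print(mse(delta), mse(delta1))   # ~0.21 vs ~0.0105
assert mse(delta1) <= mse(delta)
```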


Risk aversion

The relation between risk aversion and declining marginal utility for scalar outcomes can be stated formally with Jensen's inequality: risk aversion can be stated as preferring a certain outcome u(\operatorname{E}[x]) to a fair gamble with potentially larger but uncertain outcome of u(x):

:u(\operatorname{E}[x]) > \operatorname{E}[u(x)].

But this is simply Jensen's inequality for a ''concave'' u(x): a utility function that exhibits declining marginal utility.
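A tiny worked example, assuming the concave utility u(x) = \sqrt{x} and a fair gamble paying 0 or 100 with equal probability: u(\operatorname{E}[x]) = \sqrt{50} \approx 7.07 exceeds \operatorname{E}[u(x)] = 5, so the certain outcome is preferred.

```python
import numpy as np

u = np.sqrt                      # a concave utility (declining marginal utility)
outcomes = np.array([0.0, 100.0])
probs = np.array([0.5, 0.5])     # a fair gamble

certain = u(np.dot(probs, outcomes))   # u(E[x]) = sqrt(50) ~ 7.07
gamble = np.dot(probs, u(outcomes))    # E[u(x)] = 5.0
print(certain, gamble)
assert certain > gamble
```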


Generalizations

Beyond its classical formulation for real numbers and convex functions, Jensen's inequality has been extended to the realm of operator theory. In this non-commutative setting the inequality is expressed in terms of operator convex functions, that is, functions defined on an interval I that satisfy

:f\bigl(\lambda x + (1-\lambda)y\bigr)\le\lambda f(x)+(1-\lambda)f(y)

for every pair of self-adjoint operators x and y (with spectra in I) and every scalar \lambda\in[0,1].

Hansen and Pedersen established a definitive version of this inequality by considering genuine non-commutative convex combinations. In particular, if one has an n-tuple of bounded self-adjoint operators x_1,\dots,x_n with spectra in I and an n-tuple of operators a_1,\dots,a_n satisfying

:\sum_{i=1}^{n}a_i^*a_i=I,

then the following operator Jensen inequality holds:

:f\Bigl(\sum_{i=1}^{n}a_i^*x_ia_i\Bigr)\le\sum_{i=1}^{n}a_i^*f(x_i)a_i.

This result shows that the convex transformation "respects" non-commutative convex combinations, thereby extending the classical inequality to operators without the need for additional restrictions on the interval of definition.

A closely related extension is given by the Jensen trace inequality. For a continuous convex function f defined on I, if one considers self-adjoint matrices x_1,\dots,x_n (with spectra in I) and matrices a_1,\dots,a_n satisfying \sum_{i=1}^{n}a_i^*a_i=I, then one has

:\operatorname{Tr}\Bigl(f\Bigl(\sum_{i=1}^{n}a_i^*x_ia_i\Bigr)\Bigr)\le\operatorname{Tr}\Bigl(\sum_{i=1}^{n}a_i^*f(x_i)a_i\Bigr).

This inequality naturally extends to C*-algebras equipped with a finite trace and is particularly useful in applications ranging from quantum statistical mechanics to information theory. Furthermore, contractive versions of these operator inequalities are available when one only assumes \sum_{i=1}^{n}a_i^*a_i\le I, provided that additional conditions such as f(0)\le 0 (when 0 ∈ I) are imposed. Extensions to continuous fields of operators and to settings involving conditional expectations on C*-algebras further illustrate the broad applicability of these generalizations.
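A numerical spot check of the operator Jensen inequality is sketched below (illustrative only). It uses f(t) = t^2, which is operator convex on the whole real line, builds a_1, a_2 from scaled unitaries so that a_1^*a_1 + a_2^*a_2 = I, and verifies that \sum_i a_i^* x_i^2 a_i - \bigl(\sum_i a_i^* x_i a_i\bigr)^2 is positive semidefinite, i.e. that the inequality holds in the Loewner order.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4

def random_hermitian(d):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (A + A.conj().T) / 2

def random_unitary(d):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return Q

x1, x2 = random_hermitian(d), random_hermitian(d)

# a_1, a_2 built from scaled unitaries so that a_1* a_1 + a_2* a_2 = I
t = 0.3
a1, a2 = np.cos(t) * random_unitary(d), np.sin(t) * random_unitary(d)
assert np.allclose(a1.conj().T @ a1 + a2.conj().T @ a2, np.eye(d))

f = lambda h: h @ h      # f(t) = t^2, an operator convex function

lhs = f(a1.conj().T @ x1 @ a1 + a2.conj().T @ x2 @ a2)
rhs = a1.conj().T @ f(x1) @ a1 + a2.conj().T @ f(x2) @ a2

# rhs - lhs should be positive semidefinite (Loewner order)
eigs = np.linalg.eigvalsh(rhs - lhs)
print(eigs.min())
assert eigs.min() >= -1e-8
```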


See also

* Karamata's inequality for a more general inequality
* Popoviciu's inequality
* Law of averages
* A proof without words of Jensen's inequality


Notes


References

* Tristan Needham (1993) "A Visual Explanation of Jensen's Inequality", ''American Mathematical Monthly'' 100(8): 768–771.
* Sam Savage (2012) ''The Flaw of Averages: Why We Underestimate Risk in the Face of Uncertainty'' (1st ed.). Wiley. ISBN 978-0471381976.


External links


* Jensen's Operator Inequality of Hansen and Pedersen.