In estimation theory and statistics, the Cramér–Rao bound (CRB) expresses a lower bound on the variance of unbiased estimators of a deterministic (fixed, though unknown) parameter: the variance of any such estimator is at least as high as the inverse of the Fisher information. Equivalently, it expresses an upper bound on the precision (the inverse of variance) of unbiased estimators: the precision of any such estimator is at most the Fisher information. The result is named in honor of Harald Cramér and C. R. Rao, but it was also derived independently by Maurice Fréchet, Georges Darmois, and Alexander Aitken and Harold Silverstone.

An unbiased estimator that achieves this lower bound is said to be (fully) ''efficient''. Such an estimator achieves the lowest possible mean squared error among all unbiased methods, and is therefore the minimum variance unbiased (MVU) estimator. However, in some cases, no unbiased technique exists which achieves the bound. This may occur either if for any unbiased estimator, there exists another with a strictly smaller variance, or if an MVU estimator exists, but its variance is strictly greater than the inverse of the Fisher information.

The Cramér–Rao bound can also be used to bound the variance of estimators of given bias. In some cases, a biased approach can result in both a variance and a mean squared error that are below the unbiased Cramér–Rao lower bound; see estimator bias.

Statement

The Cramér–Rao bound is stated in this section for several increasingly general cases, beginning with the case in which the parameter is a scalar and its estimator is unbiased. All versions of the bound require certain regularity conditions, which hold for most well-behaved distributions. These conditions are listed later in this section.

Scalar unbiased case

Suppose $\theta$ is an unknown deterministic parameter that is to be estimated from $n$ independent observations (measurements) of $x$, each drawn from a distribution with probability density function $f(x;\theta)$. The variance of any ''unbiased'' estimator $\hat{\theta}$ of $\theta$ is then bounded by the reciprocal of the Fisher information $I(\theta)$:

$$\operatorname{var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

where the Fisher information $I(\theta)$ is defined by

$$I(\theta) = n \operatorname{E}_\theta \left[ \left( \frac{\partial \ell(X;\theta)}{\partial \theta} \right)^2 \right],$$

$\ell(x;\theta) = \log f(x;\theta)$ is the natural logarithm of the likelihood function for a single sample $x$, and $\operatorname{E}_\theta$ denotes the expected value with respect to the density $f(x;\theta)$ of $X$. If $\ell(x;\theta)$ is twice differentiable and certain regularity conditions hold, then the Fisher information can also be written as

$$I(\theta) = -n \operatorname{E}_\theta \left[ \frac{\partial^2 \ell(X;\theta)}{\partial \theta^2} \right].$$

The efficiency of an unbiased estimator $\hat{\theta}$ measures how close this estimator's variance comes to this lower bound; estimator efficiency is defined as

$$e(\hat{\theta}) = \frac{I(\theta)^{-1}}{\operatorname{var}(\hat{\theta})},$$

that is, the minimum possible variance for an unbiased estimator divided by its actual variance. The Cramér–Rao lower bound thus gives

$$e(\hat{\theta}) \le 1.$$
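The efficiency definition can be checked numerically. The following is a minimal sketch, assuming a normal model with unknown mean and known standard deviation, for which $I(\theta) = n/\sigma^2$; the function name `simulate_efficiency` and all parameter values are illustrative choices, not part of the article. It estimates the efficiency of the sample mean (which should be close to 1) and of the sample median (which should be noticeably lower).

```python
import random
import statistics

def simulate_efficiency(theta=1.0, sigma=2.0, n=25, trials=20000, seed=0):
    """Monte Carlo estimate of the efficiency of two unbiased estimators
    of the mean of N(theta, sigma^2): the sample mean and the sample median."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    fisher_information = n / sigma**2      # I(theta) for a normal mean, known sigma
    bound = 1.0 / fisher_information       # Cramér–Rao lower bound on the variance
    return {
        "bound": bound,
        # Efficiency = bound / actual variance; the empirical value can land
        # slightly above 1 because of Monte Carlo noise.
        "efficiency_mean": bound / statistics.variance(means),
        "efficiency_median": bound / statistics.variance(medians),
    }

result = simulate_efficiency()
print(result)
```

Because the variances are estimated from a finite number of trials, the printed efficiencies are approximate.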

General scalar case

A more general form of the bound can be obtained by considering a biased estimator $T(X)$, whose expectation is not $\theta$ but a function of this parameter, say $\psi(\theta)$. Hence $\operatorname{E}\{T(X)\} - \theta = \psi(\theta) - \theta$ is not generally equal to 0. In this case, the bound is given by

$$\operatorname{var}(T) \geq \frac{[\psi'(\theta)]^2}{I(\theta)},$$

where $\psi'(\theta)$ is the derivative of $\psi(\theta)$ with respect to $\theta$, and $I(\theta)$ is the Fisher information defined above.

Bound on the variance of biased estimators

Apart from being a bound on estimators of functions of the parameter, this approach can be used to derive a bound on the variance of biased estimators with a given bias, as follows. Consider an estimator $\hat{\theta}$ with bias $b(\theta) = \operatorname{E}\{\hat{\theta}\} - \theta$, and let $\psi(\theta) = b(\theta) + \theta$. By the result above, any estimator whose expectation is $\psi(\theta)$ has variance greater than or equal to $(\psi'(\theta))^2/I(\theta)$. Thus, any estimator $\hat{\theta}$ whose bias is given by a function $b(\theta)$ satisfies

$$\operatorname{var}\left(\hat{\theta}\right) \geq \frac{[1+b'(\theta)]^2}{I(\theta)}.$$

The unbiased version of the bound is a special case of this result, with $b(\theta)=0$.

A small variance alone is trivial to obtain: an "estimator" that is constant has a variance of zero. But from the above inequality, the mean squared error of a biased estimator is bounded by

$$\operatorname{E}\left((\hat{\theta}-\theta)^2\right) \geq \frac{[1+b'(\theta)]^2}{I(\theta)} + b(\theta)^2,$$

using the standard decomposition of the MSE. Note, however, that if $1+b'(\theta)<1$ this bound might be less than the unbiased Cramér–Rao bound $1/I(\theta)$. For instance, in the example of estimating variance below, $1+b'(\theta) = \frac{n}{n+2} < 1$.
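To make the biased bound concrete, here is a small sketch that evaluates both inequalities for a given bias function. The helper name `biased_bounds` is hypothetical. The inputs are assumptions taken from the normal-variance example later in the article: $\theta$ plays the role of $\sigma^2$, the Fisher information of $n$ observations is $n/(2\theta^2)$, and an estimator that divides by $n+2$ has bias $b(\theta) = -2\theta/(n+2)$.

```python
def biased_bounds(bias, bias_derivative, fisher_information, theta):
    """Return (variance bound, MSE bound) for an estimator with bias b(theta)."""
    info = fisher_information(theta)
    variance_bound = (1.0 + bias_derivative(theta)) ** 2 / info
    mse_bound = variance_bound + bias(theta) ** 2   # MSE = variance + bias^2
    return variance_bound, mse_bound

n = 10
theta = 3.0  # illustrative value of sigma^2

def fisher(t):
    # Fisher information of n normal observations about the variance t
    return n / (2.0 * t**2)

var_bound, mse_bound = biased_bounds(
    bias=lambda t: -2.0 * t / (n + 2),
    bias_derivative=lambda t: -2.0 / (n + 2),
    fisher_information=fisher,
    theta=theta,
)
unbiased_bound = 1.0 / fisher(theta)

print(var_bound, mse_bound, unbiased_bound)
```

With these inputs the MSE bound for the biased estimator falls below the unbiased bound $1/I(\theta)$, which is the situation the text describes when $1+b'(\theta)<1$.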

Multivariate case

Extending the Cramér–Rao bound to multiple parameters, define a parameter column vector

$$\boldsymbol{\theta} = \left[ \theta_1, \theta_2, \dots, \theta_d \right]^T \in \mathbb{R}^d$$

with probability density function $f(x; \boldsymbol{\theta})$ which satisfies the two regularity conditions below. The Fisher information matrix is a $d \times d$ matrix with element $I_{m,k}$ defined as

$$I_{m,k} = \operatorname{E} \left[ \frac{\partial}{\partial \theta_m} \log f\left(x; \boldsymbol{\theta}\right) \frac{\partial}{\partial \theta_k} \log f\left(x; \boldsymbol{\theta}\right) \right] = -\operatorname{E} \left[ \frac{\partial^2}{\partial \theta_m \, \partial \theta_k} \log f\left(x; \boldsymbol{\theta}\right) \right].$$

Let $\boldsymbol{T}(X)$ be an estimator of any vector function of parameters, $\boldsymbol{T}(X) = (T_1(X), \ldots, T_d(X))^T$, and denote its expectation vector $\operatorname{E}[\boldsymbol{T}(X)]$ by $\boldsymbol{\psi}(\boldsymbol{\theta})$. The Cramér–Rao bound then states that the covariance matrix of $\boldsymbol{T}(X)$ satisfies

$$I\left(\boldsymbol{\theta}\right) \geq \phi(\theta)^T \operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right)^{-1} \phi(\theta),$$

$$\operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right) \geq \phi(\theta) I\left(\boldsymbol{\theta}\right)^{-1} \phi(\theta)^T$$

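where
* The matrix inequality $A \ge B$ is understood to mean that the matrix $A-B$ is positive semidefinite, and
* $\phi(\theta) := \partial \boldsymbol{\psi}(\boldsymbol{\theta})/\partial \boldsymbol{\theta}$ is the Jacobian matrix whose $ij$ element is given by $\partial \psi_i(\boldsymbol{\theta})/\partial \theta_j$.

If $\boldsymbol{T}(X)$ is an unbiased estimator of $\boldsymbol{\theta}$ (i.e., $\boldsymbol{\psi}(\boldsymbol{\theta}) = \boldsymbol{\theta}$), then the Cramér–Rao bound reduces to

$$\operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right) \geq I\left(\boldsymbol{\theta}\right)^{-1}.$$

If it is inconvenient to compute the inverse of the Fisher information matrix, then one can simply take the reciprocal of the corresponding diagonal element to find a (possibly loose) lower bound:

$$\operatorname{var}_{\boldsymbol{\theta}}(T_m(X)) = \left[\operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right)\right]_{mm} \geq \left[I\left(\boldsymbol{\theta}\right)^{-1}\right]_{mm} \geq \left(\left[I\left(\boldsymbol{\theta}\right)\right]_{mm}\right)^{-1}.$$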
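The difference between the exact bound (a diagonal element of the inverse Fisher information matrix) and the looser shortcut (the reciprocal of a diagonal element) can be seen on a small example. This is only a sketch: the $2 \times 2$ matrix below is a made-up symmetric positive definite matrix standing in for a Fisher information matrix, and `invert_2x2` is a hypothetical helper written out by hand to keep the example dependency-free.

```python
def invert_2x2(matrix):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical Fisher information matrix (symmetric, positive definite).
fisher = [[4.0, 1.5],
          [1.5, 2.0]]

inverse = invert_2x2(fisher)

# Exact Cramér–Rao bounds on var(T_m): diagonal of the inverse.
tight = [inverse[m][m] for m in range(2)]
# Shortcut: reciprocal of each diagonal element; never larger than the exact bound.
loose = [1.0 / fisher[m][m] for m in range(2)]

print(tight, loose)
```

The two bounds coincide only when the off-diagonal terms vanish, that is, when the parameters do not interact in the Fisher information.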

Regularity conditions

The bound relies on two weak regularity conditions on the probability density function, $f(x; \theta)$, and the estimator $T(X)$:
* The Fisher information is always defined; equivalently, for all $x$ such that $f(x; \theta) > 0$,
:: $\frac{\partial}{\partial\theta} \log f(x;\theta)$
: exists, and is finite.
* The operations of integration with respect to $x$ and differentiation with respect to $\theta$ can be interchanged in the expectation of $T$; that is,
:: $\frac{\partial}{\partial\theta} \left[ \int T(x) f(x;\theta) \,dx \right] = \int T(x) \left[ \frac{\partial}{\partial\theta} f(x;\theta) \right] \,dx$
: whenever the right-hand side is finite.
: This condition can often be confirmed by using the fact that integration and differentiation can be swapped when either of the following cases holds:
:# The function $f(x;\theta)$ has bounded support in $x$, and the bounds do not depend on $\theta$;
:# The function $f(x;\theta)$ has infinite support, is continuously differentiable, and the integral converges uniformly for all $\theta$.
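The role of these conditions can be illustrated with a standard textbook case that is not discussed in this article: the uniform distribution on $[0, \theta]$, whose support depends on $\theta$, so the interchange of integration and differentiation is not justified. The sketch below assumes the unbiased estimator $\frac{n+1}{n}\max_i X_i$ and compares its simulated variance with the value $\theta^2/n$ that one would get by applying the Fisher information formula without checking the conditions. All names and parameter values are illustrative.

```python
import random
import statistics

def uniform_max_estimator(theta=1.0, n=20, trials=20000, seed=1):
    """Simulate the estimator (n+1)/n * max(X_i) for X_i ~ Uniform(0, theta)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        largest = max(rng.uniform(0.0, theta) for _ in range(n))
        estimates.append((n + 1) / n * largest)   # rescaled so that E = theta
    return statistics.fmean(estimates), statistics.variance(estimates)

theta, n = 1.0, 20
mean_estimate, variance_estimate = uniform_max_estimator(theta, n)

# Value obtained by applying 1 / (n * E[(d/dtheta log f)^2]) = theta^2 / n
# even though the regularity conditions fail for this family.
formal_bound = theta**2 / n
# Known closed-form variance of the rescaled maximum.
exact_variance = theta**2 / (n * (n + 2))

print(mean_estimate, variance_estimate, formal_bound, exact_variance)
```

The simulated variance falls well below the formal value, which is not a contradiction: the bound simply does not apply when the regularity conditions are violated.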

Proof

Proof for the general case based on the Chapman–Robbins bound

Proof based on.

A standalone proof for the general scalar case

Assume that $T=t(X)$ is an estimator with expectation $\psi(\theta)$ (based on the observations $X$), i.e. that $\operatorname{E}(T) = \psi(\theta)$. The goal is to prove that, for all $\theta$,

$$\operatorname{var}(t(X)) \geq \frac{[\psi'(\theta)]^2}{I(\theta)}.$$

Let $X$ be a random variable with probability density function $f(x; \theta)$. Here $T = t(X)$ is a statistic, which is used as an estimator for $\psi(\theta)$. Define $V$ as the score:

$$V = \frac{\partial}{\partial\theta} \ln f(X;\theta) = \frac{1}{f(X;\theta)}\frac{\partial}{\partial\theta}f(X;\theta)$$

where the chain rule is used in the final equality above. Then the expectation of $V$, written $\operatorname{E}(V)$, is zero. This is because

$$\operatorname{E}(V) = \int f(x;\theta)\left[\frac{1}{f(x;\theta)}\frac{\partial}{\partial\theta} f(x;\theta)\right] \, dx = \frac{\partial}{\partial\theta}\int f(x;\theta) \, dx = 0$$

where the integral and partial derivative have been interchanged (justified by the second regularity condition), and the last step holds because the density integrates to 1 for every $\theta$.

If we consider the covariance $\operatorname{cov}(V, T)$ of $V$ and $T$, we have $\operatorname{cov}(V, T) = \operatorname{E}(V T)$, because $\operatorname{E}(V) = 0$. Expanding this expression we have

$$\begin{aligned} \operatorname{cov}(V,T) &= \operatorname{E}\left( T \cdot \left[\frac{1}{f(X;\theta)}\frac{\partial}{\partial\theta}f(X;\theta)\right] \right) \\ &= \int t(x) \left[\frac{1}{f(x;\theta)}\frac{\partial}{\partial\theta}f(x;\theta)\right] f(x;\theta) \, dx \\ &= \frac{\partial}{\partial\theta} \left[ \int t(x) f(x;\theta) \, dx \right] = \frac{\partial}{\partial\theta} \operatorname{E}(T) = \psi'(\theta) \end{aligned}$$

again because the integration and differentiation operations commute (second condition).

The Cauchy–Schwarz inequality shows that

$$\sqrt{\operatorname{var}(T) \operatorname{var}(V)} \geq \left| \operatorname{cov}(V,T) \right| = \left| \psi'(\theta) \right|,$$

therefore

$$\operatorname{var}(T) \geq \frac{[\psi'(\theta)]^2}{\operatorname{var}(V)} = \frac{[\psi'(\theta)]^2}{I(\theta)},$$

where $\operatorname{var}(V) = \operatorname{E}(V^2) = I(\theta)$ because $\operatorname{E}(V) = 0$. This proves the proposition.
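The three facts used in the proof (the score has zero mean, its covariance with $T$ equals $\psi'(\theta)$, and its variance equals the Fisher information) can be checked by simulation. The sketch below assumes a normal model with known standard deviation and takes $T$ to be the sample mean, so that $\psi(\theta) = \theta$, $\psi'(\theta) = 1$ and $I(\theta) = n/\sigma^2$; the function names and parameter values are illustrative only.

```python
import random
import statistics

def score_and_estimator(theta=0.5, sigma=1.5, n=10, trials=20000, seed=2):
    """Draw repeated samples from N(theta, sigma^2) and return, per sample,
    the score V (derivative of the log-likelihood in theta) and T (sample mean)."""
    rng = random.Random(seed)
    scores, estimates = [], []
    for _ in range(trials):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        scores.append(sum(x - theta for x in sample) / sigma**2)  # V
        estimates.append(statistics.fmean(sample))                 # T
    return scores, estimates

def sample_covariance(a, b):
    """Unbiased sample covariance (n - 1 in the denominator)."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

scores, estimates = score_and_estimator()
mean_score = statistics.fmean(scores)          # should be near 0
cov_vt = sample_covariance(scores, estimates)  # should be near psi'(theta) = 1
var_v = statistics.variance(scores)            # should be near n / sigma^2
var_t = statistics.variance(estimates)

print(mean_score, cov_vt, var_v, var_t)
```

In this particular model $T$ is an affine function of $V$, so the Cauchy–Schwarz step holds with equality, which is why the sample mean attains the bound.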

Examples

Multivariate normal distribution

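For the case of a ''d''-variate normal distribution

$$\boldsymbol{x} \sim N_d \left( \boldsymbol{\mu}(\boldsymbol{\theta}), \boldsymbol{C}(\boldsymbol{\theta}) \right)$$

the Fisher information matrix has elements

$$I_{m,k} = \frac{\partial \boldsymbol{\mu}^T}{\partial \theta_m} \boldsymbol{C}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \theta_k} + \frac{1}{2} \operatorname{tr} \left( \boldsymbol{C}^{-1} \frac{\partial \boldsymbol{C}}{\partial \theta_m} \boldsymbol{C}^{-1} \frac{\partial \boldsymbol{C}}{\partial \theta_k} \right)$$

where "tr" is the trace.

For example, let $w$ be a sample of $N$ independent observations with unknown mean $\theta$ and known variance $\sigma^2$:

$$w \sim N_N \left(\theta \boldsymbol{1}, \sigma^2 \boldsymbol{I}_N \right),$$

where $\boldsymbol{1}$ is the all-ones vector and $\boldsymbol{I}_N$ is the $N \times N$ identity matrix. Then the Fisher information is a scalar given by

$$I(\theta) = \left(\frac{\partial\boldsymbol{\mu}(\theta)}{\partial\theta}\right)^T \boldsymbol{C}^{-1} \left(\frac{\partial\boldsymbol{\mu}(\theta)}{\partial\theta}\right) = \sum^N_{i=1}\frac{1}{\sigma^2} = \frac{N}{\sigma^2},$$

and so the Cramér–Rao bound is

$$\operatorname{var}\left(\hat \theta\right) \geq \frac{\sigma^2}{N}.$$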
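The mean-dependent part of this formula is easy to evaluate directly when the covariance matrix is diagonal and does not depend on the parameters, since the trace term then vanishes. The following sketch makes exactly that simplifying assumption; the function `fisher_information_diag` and its argument layout are hypothetical, chosen only for illustration.

```python
def fisher_information_diag(mean_jacobian, variances):
    """Fisher information matrix for a normal model whose covariance is
    diagonal and parameter-independent.

    mean_jacobian[i][m] holds d mu_i / d theta_m,
    variances[i] holds the i-th diagonal entry of C.
    """
    d = len(mean_jacobian[0])
    return [
        [
            sum(row[m] * row[k] / var for row, var in zip(mean_jacobian, variances))
            for k in range(d)
        ]
        for m in range(d)
    ]

# N observations sharing one unknown mean: d mu_i / d theta = 1 for every i.
N, sigma2 = 8, 0.5
info = fisher_information_diag([[1.0]] * N, [sigma2] * N)
bound = 1.0 / info[0][0]   # Cramér–Rao bound sigma^2 / N

print(info, bound)
```

For the constant-mean example this reproduces $I(\theta) = N/\sigma^2$ and the bound $\sigma^2/N$.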

Normal variance with known mean

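Suppose ''X'' is a normally distributed random variable with known mean $\mu$ and unknown variance $\sigma^2$. Consider the following statistic, computed from $n$ independent observations $X_1, \ldots, X_n$:

$$T=\frac{\sum_{i=1}^n (X_i-\mu)^2}{n}.$$

Then ''T'' is unbiased for $\sigma^2$, as $\operatorname{E}(T)=\sigma^2$. What is the variance of ''T''?

$$\operatorname{var}(T) = \operatorname{var}\left(\frac{\sum_{i=1}^n (X_i-\mu)^2}{n}\right) = \frac{\operatorname{var}\left((X-\mu)^2\right)}{n} = \frac{1}{n} \left[ \operatorname{E}\left\{(X-\mu)^4\right\} - \left(\operatorname{E}\left\{(X-\mu)^2\right\}\right)^2 \right]$$

(the second equality uses the independence of the observations, and the last follows directly from the definition of variance). The first term is the fourth moment about the mean and has value $3(\sigma^2)^2$; the second is the square of the variance, or $(\sigma^2)^2$. Thus

$$\operatorname{var}(T)=\frac{2(\sigma^2)^2}{n}.$$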

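Now, what is the Fisher information in the sample? Recall that the score $V$ is defined as

$$V=\frac{\partial}{\partial\sigma^2}\log\left[ L(\sigma^2,X)\right]$$

where $L$ is the likelihood function. Thus in this case,

$$\log\left[L(\sigma^2,X)\right] = \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(X-\mu)^2/(2\sigma^2)}\right] = -\log(\sqrt{2\pi\sigma^2})-\frac{(X-\mu)^2}{2\sigma^2}$$

$$V = \frac{\partial}{\partial\sigma^2}\left[-\log(\sqrt{2\pi\sigma^2})-\frac{(X-\mu)^2}{2\sigma^2}\right] = -\frac{1}{2\sigma^2}+\frac{(X-\mu)^2}{2(\sigma^2)^2}$$

where the second equality is from elementary calculus. Thus, the information in a single observation is just minus the expectation of the derivative of $V$, or

$$I = -\operatorname{E}\left(\frac{\partial V}{\partial\sigma^2}\right) = -\operatorname{E}\left(-\frac{(X-\mu)^2}{(\sigma^2)^3}+\frac{1}{2(\sigma^2)^2}\right) = \frac{\sigma^2}{(\sigma^2)^3}-\frac{1}{2(\sigma^2)^2} = \frac{1}{2(\sigma^2)^2}.$$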

Thus the information in a sample of $n$ independent observations is just $n$ times this, or $\frac{n}{2(\sigma^2)^2}$. The Cramér–Rao bound states that

$$\operatorname{var}(T)\geq\frac{2(\sigma^2)^2}{n}.$$

In this case, the inequality is saturated (equality is achieved), showing that the estimator is efficient.

However, we can achieve a lower mean squared error using a biased estimator. The estimator

$$T=\frac{\sum_{i=1}^n (X_i-\mu)^2}{n+2}$$

has a smaller variance, namely

$$\operatorname{var}(T)=\frac{2n(\sigma^2)^2}{(n+2)^2}.$$

Its bias is

$$\left(\frac{n}{n+2}-1\right)\sigma^2=-\frac{2\sigma^2}{n+2},$$

so its mean squared error is

$$\operatorname{MSE}(T)=\left(\frac{2n}{(n+2)^2}+\frac{4}{(n+2)^2}\right)(\sigma^2)^2 =\frac{2(\sigma^2)^2}{n+2}$$

which is less than the $2(\sigma^2)^2/n$ that unbiased estimators can achieve according to the Cramér–Rao bound.

When the mean is not known, the minimum mean squared error estimate of the variance of a sample from a Gaussian distribution is achieved by dividing by $n+1$, rather than $n-1$ or $n+2$.
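The comparison between the unbiased estimator (dividing by $n$) and the biased one (dividing by $n+2$) for the known-mean case can be reproduced with a short simulation. This is a rough sketch with illustrative parameter values and a hypothetical function name; the expected values it is compared against are the closed-form results $2(\sigma^2)^2/n$ and $2(\sigma^2)^2/(n+2)$.

```python
import random

def mse_of_variance_estimators(mu=0.0, sigma2=1.0, n=10, trials=20000, seed=3):
    """Monte Carlo MSE of two estimators of sigma^2 when the mean mu is known:
    sum((x - mu)^2) / n (unbiased) and sum((x - mu)^2) / (n + 2) (biased)."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    squared_error_n, squared_error_n2 = 0.0, 0.0
    for _ in range(trials):
        total = sum((rng.gauss(mu, sigma) - mu) ** 2 for _ in range(n))
        squared_error_n += (total / n - sigma2) ** 2
        squared_error_n2 += (total / (n + 2) - sigma2) ** 2
    return squared_error_n / trials, squared_error_n2 / trials

n, sigma2 = 10, 1.0
mse_unbiased, mse_biased = mse_of_variance_estimators(sigma2=sigma2, n=n)

print("divide by n:    ", mse_unbiased, "theory:", 2 * sigma2**2 / n)
print("divide by n + 2:", mse_biased, "theory:", 2 * sigma2**2 / (n + 2))
```

Both estimators are computed from the same simulated samples, so the ordering of the two MSE values is less sensitive to Monte Carlo noise than either value on its own.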

External links

FandPLimitTool, a GUI-based software tool to calculate the Fisher information and Cramér–Rao lower bound, with application to single-molecule microscopy.