Jeffreys' Prior

In Bayesian statistics, the Jeffreys prior is a non-informative prior distribution for a parameter space. Named after Sir Harold Jeffreys, its density function is proportional to the square root of the determinant of the Fisher information matrix:

p(\theta) \propto \left| I(\theta) \right|^{1/2}.

It has the key feature that it is invariant under a change of coordinates for the parameter vector \theta. That is, the relative probability assigned to a volume of a probability space using a Jeffreys prior will be the same regardless of the parameterization used to define the Jeffreys prior. This makes it of special interest for use with ''scale parameters''. As a concrete example, a Bernoulli distribution can be parameterized by the probability of occurrence, or by the odds, the ratio of the probability of occurrence to the probability of non-occurrence. A uniform prior on one of these is not the same as a uniform prior on the other, even accounting for reparameterization in the usual way, but the Jeffreys prior on one reparameterizes to the Jeffreys prior on the other. In maximum likelihood estimation of exponential family models, penalty terms based on the Jeffreys prior have been shown to reduce asymptotic bias in point estimates.
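
As a minimal illustrative sketch (not part of the original text), the defining formula can be evaluated symbolically. Here sympy derives the Jeffreys prior for a single Bernoulli observation directly from the Fisher information, anticipating the Bernoulli example worked out below:

```python
import sympy as sp

gamma, x = sp.symbols('gamma x', positive=True)

# Log-likelihood of a single Bernoulli observation x in {0, 1}.
log_f = x * sp.log(gamma) + (1 - x) * sp.log(1 - gamma)

# Fisher information: the expectation of the squared score over x.
score_sq = sp.diff(log_f, gamma) ** 2
fisher = sp.simplify(score_sq.subs(x, 1) * gamma
                     + score_sq.subs(x, 0) * (1 - gamma))

print(fisher)           # equivalent to 1/(gamma*(1 - gamma))
print(sp.sqrt(fisher))  # the Jeffreys prior density, up to normalization
```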


Reparameterization


One-parameter case

If \theta and \varphi are two possible parameterizations of a statistical model, and \theta is a continuously differentiable function of \varphi, we say that the prior p_\theta(\theta) is "invariant" under a reparameterization if

p_\varphi(\varphi) = p_\theta(\theta) \left| \frac{d\theta}{d\varphi} \right|,

that is, if the priors p_\theta(\theta) and p_\varphi(\varphi) are related by the usual change of variables theorem. Since the Fisher information transforms under reparameterization as

I_\varphi(\varphi) = I_\theta(\theta) \left( \frac{d\theta}{d\varphi} \right)^2,

defining the priors as p_\varphi(\varphi) \propto \sqrt{I_\varphi(\varphi)} and p_\theta(\theta) \propto \sqrt{I_\theta(\theta)} gives us the desired "invariance".
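
This "invariance" can be checked numerically. The sketch below (an illustration of ours; the Bernoulli model and the log-odds map are arbitrary choices) builds the Jeffreys prior directly in the log-odds parameterization and compares it with the change-of-variables transform of the prior built in the probability parameterization:

```python
import numpy as np

phi = np.linspace(-4.0, 4.0, 9)          # log-odds values
gamma = 1.0 / (1.0 + np.exp(-phi))       # theta as a function of phi

I_gamma = 1.0 / (gamma * (1.0 - gamma))  # Fisher information in theta
dgamma_dphi = gamma * (1.0 - gamma)      # derivative of the map

# Jeffreys prior built directly in phi: sqrt(I_theta * (dtheta/dphi)^2).
direct = np.sqrt(I_gamma * dgamma_dphi ** 2)

# Jeffreys prior in theta, pushed through the change of variables.
transformed = np.sqrt(I_gamma) * dgamma_dphi

assert np.allclose(direct, transformed)  # identical, as claimed above
```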


Multiple-parameter case

Analogous to the one-parameter case, let \vec\theta and \vec\varphi be two possible parameterizations of a statistical model, with \vec\theta a continuously differentiable function of \vec\varphi. We call the prior p_\theta(\vec\theta) "invariant" under reparameterization if

p_\varphi(\vec\varphi) = p_\theta(\vec\theta) \left| \det J \right|,

where J is the Jacobian matrix with entries J_{ij} = \frac{\partial \theta_i}{\partial \varphi_j}. Since the Fisher information matrix transforms under reparameterization as

I_\varphi(\vec\varphi) = J^T I_\theta(\vec\theta) J,

we have \det I_\varphi(\vec\varphi) = \det I_\theta(\vec\theta) \left( \det J \right)^2 and thus defining the priors as p_\varphi(\vec\varphi) \propto \sqrt{\det I_\varphi(\vec\varphi)} and p_\theta(\vec\theta) \propto \sqrt{\det I_\theta(\vec\theta)} gives us the desired "invariance".
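
The determinant identity at the heart of this argument is easy to verify numerically; the sketch below (with arbitrary matrices standing in for the Fisher information and the Jacobian) checks that \det(J^T I J) = \det I \, (\det J)^2:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
I_theta = A @ A.T + 3 * np.eye(3)  # a symmetric positive-definite matrix
J = rng.normal(size=(3, 3))        # an arbitrary Jacobian

I_phi = J.T @ I_theta @ J          # the transformed Fisher information
lhs = np.linalg.det(I_phi)
rhs = np.linalg.det(I_theta) * np.linalg.det(J) ** 2
assert np.isclose(lhs, rhs)
```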


Attributes

From a practical and mathematical standpoint, a valid reason to use this non-informative prior instead of others, like those obtained through a limit in conjugate families of distributions, is that the relative probability of a volume of the probability space is not dependent upon the set of parameter variables that is chosen to describe parameter space. Sometimes the Jeffreys prior cannot be normalized, and is thus an improper prior. For example, the Jeffreys prior for the distribution mean is uniform over the entire real line in the case of a Gaussian distribution of known variance. Use of the Jeffreys prior violates the strong version of the likelihood principle, which is accepted by many, but by no means all, statisticians. When using the Jeffreys prior, inferences about \vec\theta depend not just on the probability of the observed data as a function of \vec\theta, but also on the universe of all possible experimental outcomes, as determined by the experimental design, because the Fisher information is computed from an expectation over the chosen universe. Accordingly, the Jeffreys prior, and hence the inferences made using it, may differ for two experiments involving the same \vec\theta parameter even when the likelihood functions for the two experiments are the same: a violation of the strong likelihood principle.


Minimum description length

In the minimum description length approach to statistics, the goal is to describe data as compactly as possible, where the length of a description is measured in bits of the code used. For a parametric family of distributions one compares a code with the best code based on one of the distributions in the parameterized family. The main result is that in exponential families, asymptotically for large sample size, the code based on the distribution that is a mixture of the elements in the exponential family with the Jeffreys prior is optimal. This result holds if one restricts the parameter set to a compact subset in the interior of the full parameter space. If the full parameter space is used, a modified version of the result should be used.


Examples

The Jeffreys prior for a parameter (or a set of parameters) depends upon the statistical model.


Gaussian distribution with mean parameter

For the Gaussian distribution of the real value x,

f(x \mid \mu) = \frac{e^{-(x - \mu)^2 / (2\sigma^2)}}{\sqrt{2 \pi \sigma^2}},

with \sigma fixed, the Jeffreys prior for the mean \mu is

\begin{align} p(\mu) & \propto \sqrt{I(\mu)} = \sqrt{\operatorname{E}\left[ \left( \frac{d}{d\mu} \log f(x \mid \mu) \right)^2 \right]} = \sqrt{\operatorname{E}\left[ \left( \frac{x - \mu}{\sigma^2} \right)^2 \right]} \\ & = \sqrt{\int_{-\infty}^{+\infty} f(x \mid \mu) \left( \frac{x - \mu}{\sigma^2} \right)^2 dx} = \sqrt{\frac{\sigma^2}{\sigma^4}} \propto 1. \end{align}

That is, the Jeffreys prior for \mu does not depend upon \mu; it is the unnormalized uniform distribution on the real line: the distribution that is 1 (or some other fixed constant) for all points. This is an improper prior, and is, up to the choice of constant, the unique ''translation''-invariant distribution on the reals (the Haar measure with respect to addition of reals), corresponding to the mean being a measure of ''location'' and translation-invariance corresponding to no information about location.
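
A quick Monte Carlo check (an illustration with arbitrary values of \mu and \sigma) confirms that the expectation above equals 1/\sigma^2 regardless of \mu, so its square root is constant in \mu:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
for mu in (-5.0, 0.0, 7.0):
    x = rng.normal(mu, sigma, size=1_000_000)
    score = (x - mu) / sigma ** 2    # d/dmu of log f(x | mu)
    print(mu, np.mean(score ** 2))   # ~ 1/sigma^2 = 0.25 for every mu
```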


Gaussian distribution with standard deviation parameter

For the Gaussian distribution of the real value x,

f(x \mid \sigma) = \frac{e^{-(x - \mu)^2 / (2\sigma^2)}}{\sqrt{2 \pi \sigma^2}},

with \mu fixed, the Jeffreys prior for the standard deviation \sigma > 0 is

\begin{align} p(\sigma) & \propto \sqrt{I(\sigma)} = \sqrt{\operatorname{E}\left[ \left( \frac{d}{d\sigma} \log f(x \mid \sigma) \right)^2 \right]} = \sqrt{\operatorname{E}\left[ \left( \frac{(x - \mu)^2 - \sigma^2}{\sigma^3} \right)^2 \right]} \\ & = \sqrt{\int_{-\infty}^{+\infty} f(x \mid \sigma) \left( \frac{(x - \mu)^2 - \sigma^2}{\sigma^3} \right)^2 dx} = \sqrt{\frac{2}{\sigma^2}} \propto \frac{1}{\sigma}. \end{align}

Equivalently, the Jeffreys prior for \log \sigma = \int d\sigma/\sigma is the unnormalized uniform distribution on the real line, and thus this distribution is also known as the logarithmic prior. Similarly, the Jeffreys prior for \log \sigma^2 = 2 \log \sigma is also uniform. It is the unique (up to a multiple) prior (on the positive reals) that is ''scale''-invariant (the Haar measure with respect to multiplication of positive reals), corresponding to the standard deviation being a measure of ''scale'' and scale-invariance corresponding to no information about scale. As with the uniform distribution on the reals, it is an improper prior.
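
The expectation defining I(\sigma) can also be computed symbolically. The sketch below fixes \mu = 0 (a choice of ours, without loss of generality) and reproduces the 2/\sigma^2 result, hence the 1/\sigma prior:

```python
import sympy as sp

x = sp.symbols('x', real=True)
sigma = sp.symbols('sigma', positive=True)

f = sp.exp(-x ** 2 / (2 * sigma ** 2)) / sp.sqrt(2 * sp.pi * sigma ** 2)
score = sp.diff(sp.log(f), sigma)                    # d/dsigma of log f
I = sp.integrate(f * score ** 2, (x, -sp.oo, sp.oo))

print(sp.simplify(I))  # 2/sigma**2, so sqrt(I) is proportional to 1/sigma
```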


Poisson distribution with rate parameter

For the Poisson distribution of the non-negative integer n,

f(n \mid \lambda) = e^{-\lambda} \frac{\lambda^n}{n!},

the Jeffreys prior for the rate parameter \lambda \ge 0 is

\begin{align} p(\lambda) & \propto \sqrt{I(\lambda)} = \sqrt{\operatorname{E}\left[ \left( \frac{d}{d\lambda} \log f(n \mid \lambda) \right)^2 \right]} = \sqrt{\operatorname{E}\left[ \left( \frac{n - \lambda}{\lambda} \right)^2 \right]} \\ & = \sqrt{\sum_{n=0}^{\infty} f(n \mid \lambda) \left( \frac{n - \lambda}{\lambda} \right)^2} = \sqrt{\frac{1}{\lambda}}. \end{align}

Equivalently, the Jeffreys prior for \sqrt{\lambda} = \int d\lambda/\sqrt{\lambda} is the unnormalized uniform distribution on the non-negative real line.
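
The sum above is easy to check numerically; the sketch below (truncating the infinite sum at an arbitrary cutoff) confirms I(\lambda) = 1/\lambda for a few rate values:

```python
import numpy as np
from scipy.stats import poisson

n = np.arange(0, 200)                       # truncation of the infinite sum
for lam in (0.5, 2.0, 10.0):
    pmf = poisson.pmf(n, lam)
    I = np.sum(pmf * ((n - lam) / lam) ** 2)
    print(lam, I, 1 / lam)                  # the last two columns agree
```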


Bernoulli trial

For a coin that is "heads" with probability \gamma \in [0, 1] and is "tails" with probability 1 - \gamma, for a given (H, T) \in \{(0, 1), (1, 0)\} the probability is \gamma^H (1 - \gamma)^T. The Jeffreys prior for the parameter \gamma is

\begin{align} p(\gamma) & \propto \sqrt{I(\gamma)} = \sqrt{\operatorname{E}\left[ \left( \frac{d}{d\gamma} \log f(x \mid \gamma) \right)^2 \right]} = \sqrt{\operatorname{E}\left[ \left( \frac{H}{\gamma} - \frac{T}{1 - \gamma} \right)^2 \right]} \\ & = \sqrt{\gamma \left( \frac{1}{\gamma} - \frac{0}{1 - \gamma} \right)^2 + (1 - \gamma) \left( \frac{0}{\gamma} - \frac{1}{1 - \gamma} \right)^2} = \frac{1}{\sqrt{\gamma(1 - \gamma)}}\,. \end{align}

This is the arcsine distribution and is a beta distribution with \alpha = \beta = 1/2. Furthermore, if \gamma = \sin^2(\theta) then

\Pr[\theta] = \Pr[\gamma] \left| \frac{d\gamma}{d\theta} \right| \propto \frac{1}{\sqrt{\sin^2 \theta \cos^2 \theta}} \, 2 \sin \theta \cos \theta = 2\,.

That is, the Jeffreys prior for \theta is uniform in the interval [0, \pi/2]. Equivalently, \theta is uniform on the whole circle [0, 2\pi].
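
Since the Jeffreys prior here is a Beta(1/2, 1/2) distribution, posterior inference reduces to the standard beta-binomial conjugate update (a fact not spelled out above): observing H heads and T tails yields the posterior Beta(H + 1/2, T + 1/2). The data in this sketch are hypothetical:

```python
from scipy.stats import beta

H, T = 7, 3                      # hypothetical coin-flip counts
posterior = beta(H + 0.5, T + 0.5)

print(posterior.mean())          # (H + 1/2) / (H + T + 1) = 0.681...
print(posterior.interval(0.95))  # an equal-tailed 95% credible interval
```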


''N''-sided die with biased probabilities

Similarly, for a throw of an N-sided die with outcome probabilities \vec{\gamma} = (\gamma_1, \ldots, \gamma_N), each non-negative and satisfying \sum_{i=1}^N \gamma_i = 1, the Jeffreys prior for \vec{\gamma} is the Dirichlet distribution with all (alpha) parameters set to one half. This amounts to using a pseudocount of one half for each possible outcome. Equivalently, if we write \gamma_i = \varphi_i^2 for each i, then the Jeffreys prior for \vec{\varphi} is uniform on the (N - 1)-dimensional unit sphere (''i.e.'', it is uniform on the surface of an N-dimensional unit ball).
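
The sphere picture suggests a sampler, sketched below (our own illustration): drawing a point uniformly from the unit sphere and squaring its coordinates should reproduce Dirichlet(1/2, ..., 1/2) draws, which can be checked against numpy's Dirichlet sampler:

```python
import numpy as np

rng = np.random.default_rng(2)
N, draws = 4, 200_000

# Uniform points on the unit sphere: normalized standard Gaussians.
z = rng.normal(size=(draws, N))
phi = z / np.linalg.norm(z, axis=1, keepdims=True)
gamma_sphere = phi ** 2                    # gamma_i = phi_i^2

gamma_dirichlet = rng.dirichlet(np.full(N, 0.5), size=draws)

# The first two moments of the first coordinate agree between samplers.
print(gamma_sphere[:, 0].mean(), gamma_dirichlet[:, 0].mean())  # both ~ 1/N
print(gamma_sphere[:, 0].var(), gamma_dirichlet[:, 0].var())
```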


Generalizations


Probability-matching prior

In 1963, Welch and Peers showed that for a scalar parameter θ the Jeffreys prior is "probability-matching" in the sense that posterior predictive probabilities agree with frequentist probabilities and credible intervals of a chosen width coincide with frequentist confidence intervals. In a follow-up, Peers showed that this was not true for the multi-parameter case, instead leading to the notion of probability-matching priors, which are only implicitly defined as the probability distribution solving a certain partial differential equation involving the Fisher information.
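
For the scalar case, the matching property can be probed by simulation. The sketch below (sample size, parameter value, and nominal level are arbitrary choices of ours) estimates the frequentist coverage of the equal-tailed Jeffreys credible interval for a binomial proportion:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)
gamma_true, n, reps = 0.3, 50, 20_000

# Posterior under the Jeffreys Beta(1/2, 1/2) prior: Beta(H + 1/2, n - H + 1/2).
H = rng.binomial(n, gamma_true, size=reps)
lo = beta.ppf(0.025, H + 0.5, n - H + 0.5)
hi = beta.ppf(0.975, H + 0.5, n - H + 0.5)

coverage = np.mean((lo <= gamma_true) & (gamma_true <= hi))
print(coverage)  # approximately 0.95, the nominal frequentist level
```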


α-parallel prior

Using tools from information geometry, the Jeffreys prior can be generalized in pursuit of obtaining priors that encode geometric information of the statistical model, so that they are invariant under a change of parameter coordinates. A special case, the so-called Weyl prior, is defined as a volume form on a Weyl manifold.


Further reading

* Lee, Peter M. (2012). "Jeffreys' rule". ''Bayesian Statistics: An Introduction'' (4th ed.). Wiley. pp. 96–102. ISBN 978-1-118-33257-3.