In mathematics, Laplace's approximation fits an un-normalised Gaussian approximation to a (twice differentiable) un-normalised target density. In Bayesian statistical inference this is useful to simultaneously approximate the posterior and the marginal likelihood; see also approximate inference.
The method works by matching the log density and curvature at a mode of the target density. For example, a (possibly non-linear) regression or classification model with data set \mathcal{D} comprising inputs x and outputs y has (unknown) parameter vector \theta of length D. The likelihood is denoted p(y|x,\theta) and the parameter prior p(\theta). The joint density of outputs and parameters p(y,\theta|x) is the object of inferential desire

: p(y,\theta|x)\;=\;p(y|x,\theta)\,p(\theta)\;=\;p(y|x)\,p(\theta|y,x)\;\simeq\;\tilde q(\theta)\;=\;Zq(\theta).

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, equal to the product of the marginal likelihood p(y|x) and the posterior p(\theta|y,x). Seen as a function of \theta the joint is an un-normalised density. In Laplace's approximation we approximate the joint by an un-normalised Gaussian \tilde q(\theta)=Zq(\theta), where we use q to denote the approximate density, \tilde q the un-normalised density and Z a constant (independent of \theta). Since the marginal likelihood p(y|x) doesn't depend on the parameter \theta and the posterior p(\theta|y,x) normalises over \theta, we can immediately identify them with Z and q(\theta) of our approximation, respectively. Laplace's approximation is

: p(y,\theta|x)\;\simeq\;p(y,\hat\theta|x)\exp\big(-\tfrac{1}{2}(\theta-\hat\theta)^\top S^{-1}(\theta-\hat\theta)\big)\;=\;\tilde q(\theta),

where we have defined

:\begin{align}
\hat\theta &\;=\; \operatorname{argmax}_\theta \log p(y,\theta|x),\\
S^{-1} &\;=\; -\left.\nabla_\theta\nabla_\theta\log p(y,\theta|x)\right|_{\theta=\hat\theta},
\end{align}

where \hat\theta is the location of a mode of the joint target density, also known as the maximum a posteriori or MAP point, and S^{-1} is the D\times D positive definite matrix of second derivatives of the negative log joint target density at the mode \theta=\hat\theta. Thus, the Gaussian approximation matches the value and the curvature of the un-normalised target density at the mode. The value of \hat\theta is usually found using a gradient-based method, e.g. Newton's method.
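As an illustration, the sketch below finds \hat\theta by Newton's method and evaluates the curvature S^{-1} at the mode for a small, hypothetical Bayesian logistic regression model with a standard normal prior on \theta (the data, model, and function names are invented for the example and do not come from any particular library):

```python
import numpy as np

# Hypothetical example: Bayesian logistic regression with prior theta ~ N(0, I).
# The un-normalised log target is  log p(y, theta | x) = log likelihood + log prior.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                           # inputs x (50 points, D = 2)
y = (X @ np.array([1.0, -2.0]) > 0).astype(float)      # synthetic binary outputs

def log_joint(theta):
    """log p(y, theta | x), known only up to the normaliser Z."""
    logits = X @ theta
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))
    log_prior = -0.5 * theta @ theta - 0.5 * len(theta) * np.log(2 * np.pi)
    return log_lik + log_prior

def grad_neg_log_joint(theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))               # predicted class probabilities
    return -(X.T @ (y - p)) + theta                    # gradient of -log p(y, theta | x)

def hess_neg_log_joint(theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    W = p * (1.0 - p)
    return X.T @ (X * W[:, None]) + np.eye(len(theta)) # equals S^{-1} when evaluated at the mode

# Newton's method: repeatedly solve H step = g until the gradient vanishes.
theta_hat = np.zeros(2)
for _ in range(100):
    g = grad_neg_log_joint(theta_hat)
    if np.linalg.norm(g) < 1e-10:
        break
    theta_hat = theta_hat - np.linalg.solve(hess_neg_log_joint(theta_hat), g)

S_inv = hess_neg_log_joint(theta_hat)                  # curvature of the negative log joint at theta_hat
```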
In summary, we have

:\begin{align}
q(\theta) &\;=\; \mathcal{N}(\theta\,|\,\mu=\hat\theta,\Sigma=S),\\
\log Z &\;=\; \log p(y,\hat\theta|x) + \tfrac{1}{2}\log|S| + \tfrac{D}{2}\log(2\pi),
\end{align}

for the approximate posterior over \theta and the approximate log marginal likelihood, respectively.
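These summary formulas translate directly into code. The helper below (an illustrative function, not a library routine) assembles the mean and covariance of q(\theta) and the approximate log marginal likelihood \log Z from a mode and curvature obtained as in the sketch above:

```python
import numpy as np

def laplace_approximation(log_joint, theta_hat, S_inv):
    """Gaussian approximation q(theta) = N(mu = theta_hat, Sigma = S) and log Z.

    log_joint -- callable returning the un-normalised log target log p(y, theta | x)
    theta_hat -- mode of the log target (e.g. found by Newton's method)
    S_inv     -- Hessian of the negative log target evaluated at theta_hat
    """
    D = len(theta_hat)
    S = np.linalg.inv(S_inv)                       # covariance of q(theta)
    log_Z = (log_joint(theta_hat)
             + 0.5 * np.linalg.slogdet(S)[1]       # 0.5 * log|S|, computed stably
             + 0.5 * D * np.log(2 * np.pi))
    return theta_hat, S, log_Z
```

Called with the theta_hat and S_inv from the previous sketch, this returns the approximate posterior moments and the approximate log evidence for that model.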
In the special case of Bayesian linear regression with a Gaussian prior, the approximation is exact. The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used; it was pioneered in the context of neural networks by David MacKay, and for Gaussian processes by Williams and Barber, see references.
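The exactness in the linear-Gaussian case can be checked numerically. In the minimal sketch below (a noise variance \sigma^2 and a standard normal prior on \theta are assumed; all names are illustrative) the quadratic log joint has its mode and curvature available in closed form, and the resulting \log Z coincides with the exact log marginal likelihood obtained by integrating \theta out analytically:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, D, sigma2 = 30, 3, 0.25
X = rng.normal(size=(n, D))
y = X @ rng.normal(size=D) + rng.normal(scale=np.sqrt(sigma2), size=n)

def log_joint(theta):
    """Gaussian likelihood N(y | X theta, sigma2 I) times Gaussian prior N(theta | 0, I)."""
    log_lik = multivariate_normal.logpdf(y, mean=X @ theta, cov=sigma2 * np.eye(n))
    log_prior = multivariate_normal.logpdf(theta, mean=np.zeros(D), cov=np.eye(D))
    return log_lik + log_prior

# Laplace approximation: the log joint is quadratic, so mode and curvature are exact and constant.
S_inv = X.T @ X / sigma2 + np.eye(D)                   # Hessian of the negative log joint
theta_hat = np.linalg.solve(S_inv, X.T @ y / sigma2)   # MAP point (= posterior mean here)
S = np.linalg.inv(S_inv)
log_Z_laplace = log_joint(theta_hat) + 0.5 * np.linalg.slogdet(S)[1] + 0.5 * D * np.log(2 * np.pi)

# Exact log marginal likelihood: integrating theta out gives y ~ N(0, X X^T + sigma2 I).
log_Z_exact = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=X @ X.T + sigma2 * np.eye(n))

print(np.allclose(log_Z_laplace, log_Z_exact))         # True: the approximation is exact here
```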


References


Sources

* MacKay, David J. C. (1992). "A Practical Bayesian Framework for Backpropagation Networks". Neural Computation.
* Williams, Christopher K. I.; Barber, David (1998). "Bayesian Classification with Gaussian Processes". IEEE Transactions on Pattern Analysis and Machine Intelligence.