In statistics and machine learning, Gaussian process approximation is a computational method that accelerates inference tasks in the context of a Gaussian process model, most commonly likelihood
evaluation and prediction. Like approximations of other models, these methods can often be expressed as additional assumptions imposed on the model, which do not correspond to any actual feature but which retain its key properties while simplifying calculations. Many of these approximation methods can be expressed in purely linear algebraic or functional analytic terms as matrix or function approximations. Others are purely algorithmic and cannot easily be rephrased as a modification of a statistical model.
Basic ideas
In statistical modeling, it is often convenient to assume that $y$, the phenomenon under investigation, is a Gaussian process indexed by $X$ which has mean function $\mu$ and covariance function $K$. One can also assume that data $y = (y_1, \dots, y_n)$ are values of a particular realization of this process for indices $x_1, \dots, x_n \in X$.
Consequently, the joint distribution of the data can be expressed as

: $y \sim \mathcal{N}(\mu, K)$,

where $K = \{K(x_i, x_j)\}_{i,j=1,\dots,n}$ and $\mu = (\mu(x_1), \dots, \mu(x_n))^\top$, i.e. respectively a matrix with the covariance function values and a vector with the mean function values at corresponding (pairs of) indices.
The negative log-likelihood of the data then takes the form

: $-\log p(y) = \tfrac{1}{2}\log\det K + \tfrac{1}{2}(y - \mu)^\top K^{-1} (y - \mu) + \tfrac{n}{2}\log 2\pi.$
Similarly, the best predictor of $y^*$, the values of $y$ for indices $X^*$, given data $y$ has the form

: $\hat{y}^* = \mu^* + K^{*\top} K^{-1} (y - \mu),$

where $\mu^*$ is the vector of mean function values at $X^*$ and $K^* = \{K(x_i, x^*_j)\}_{i,j}$ is the matrix of covariances between the observed indices and the prediction indices.
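As a concrete illustration, the following sketch evaluates both formulas exactly with a Cholesky factorization. It assumes scalar inputs, a squared-exponential covariance function, and a constant mean; the names `sq_exp_cov` and `gp_nll_and_predict` are illustrative, not taken from any library.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sq_exp_cov(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance K(x, x') = s^2 exp(-(x - x')^2 / (2 l^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_nll_and_predict(x, y, x_star, mu=0.0, nugget=1e-8):
    """Exact negative log-likelihood and best (kriging) predictor.

    Both quantities hinge on factorizing the n x n matrix K, an O(n^3) step.
    """
    n = len(x)
    K = sq_exp_cov(x, x) + nugget * np.eye(n)   # K = {K(x_i, x_j)}_{i,j}
    K_star = sq_exp_cov(x, x_star)              # K* = {K(x_i, x*_j)}_{i,j}
    L = cho_factor(K)                           # K = L L^T, the O(n^3) step
    r = y - mu
    alpha = cho_solve(L, r)                     # alpha = K^{-1} (y - mu)
    # -log p(y) = 1/2 log det K + 1/2 r^T K^{-1} r + n/2 log 2*pi,
    # where log det K = 2 * sum(log diag(L)) from the Cholesky factor.
    nll = (np.sum(np.log(np.diag(L[0])))
           + 0.5 * r @ alpha
           + 0.5 * n * np.log(2 * np.pi))
    y_hat = mu + K_star.T @ alpha               # mu* + K*^T K^{-1} (y - mu)
    return nll, y_hat

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 200))
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
nll, y_hat = gp_nll_and_predict(x, y, np.linspace(0.0, 10.0, 5))
```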
In the context of Gaussian models, especially in geostatistics, prediction using the best predictor, i.e. the mean conditional on the data, is also known as kriging.
The most computationally expensive component of the best predictor formula is inverting the covariance matrix $K$, which has cubic complexity $\mathcal{O}(n^3)$. Similarly, evaluating the likelihood involves calculating both $K^{-1}(y - \mu)$ and the determinant $\det K$, which have the same cubic complexity.
Gaussian process approximations can often be expressed in terms of assumptions on $K$ under which $K^{-1}$ and $\det K$ can be calculated with much lower complexity. Since these assumptions are generally not believed to reflect reality, the likelihood and the best predictor obtained in this way are not exact, but they are meant to be close to their original values.
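For instance, assuming that $K$ is block diagonal, which amounts to treating groups of observations as mutually independent, lets both quantities be computed blockwise. A minimal sketch under this assumption (the helper name `blockwise_nll` is hypothetical):

```python
import numpy as np

def blockwise_nll(y, mu, blocks):
    """Negative log-likelihood under the assumption that K is block diagonal.

    `blocks` is a list of (index_array, covariance_block) pairs. Factorizing
    each b x b block costs O(b^3), so the total cost is O(n b^2) rather than
    O(n^3): both det K and K^{-1}(y - mu) decompose over the blocks.
    """
    nll = 0.5 * len(y) * np.log(2 * np.pi)
    for idx, K_b in blocks:
        L = np.linalg.cholesky(K_b)
        z = np.linalg.solve(L, y[idx] - mu[idx])   # z = L^{-1} (y_b - mu_b)
        nll += np.sum(np.log(np.diag(L))) + 0.5 * z @ z
    return nll
```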
Model-based methods
This class of approximations is expressed through a set of assumptions which are imposed on the original process and which, typically, imply some special structure of the covariance matrix. Although most of these methods were developed independently, nearly all of them can be expressed as special cases of the sparse general Vecchia approximation.
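To make the idea concrete, the sketch below implements a basic Vecchia-type likelihood for scalar inputs: the joint density is replaced by a product of univariate conditionals, each conditioning only on the $m$ nearest previously ordered points. It assumes a zero prior mean and a covariance function with the signature of `sq_exp_cov` above; the name `vecchia_nll` is illustrative.

```python
import numpy as np

def vecchia_nll(x, y, cov_fn, m=10):
    """Vecchia-type approximate NLL: p(y) ~ prod_i p(y_i | y_{c(i)}),
    where c(i) holds at most m nearest neighbours among x_1, ..., x_{i-1}.

    Each term needs only an m x m solve, so the cost is O(n m^3) rather
    than the O(n^3) of the exact likelihood.
    """
    nll = 0.0
    for i in range(len(y)):
        prev = np.argsort(np.abs(x[:i] - x[i]))[:m]    # conditioning set c(i)
        k_ii = cov_fn(x[i:i+1], x[i:i+1])[0, 0]
        if prev.size == 0:
            mean, var = 0.0, k_ii                      # first point: marginal
        else:
            K_cc = cov_fn(x[prev], x[prev]) + 1e-10 * np.eye(prev.size)
            k_ci = cov_fn(x[prev], x[i:i+1])[:, 0]
            w = np.linalg.solve(K_cc, k_ci)            # kriging weights
            mean = w @ y[prev]                         # conditional mean
            var = max(k_ii - w @ k_ci, 1e-12)          # conditional variance
        nll += 0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return nll
```

With $m = n - 1$ every conditional uses all preceding points and the product recovers the exact likelihood; smaller $m$ trades accuracy for speed.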
Sparse covariance methods
These methods approximate the true model in such a way that the covariance matrix is sparse. Typically, each method proposes its own algorithm that takes full advantage of the sparsity pattern in the covariance matrix. Two prominent members of this class of approaches are covariance tapering and domain partitioning. The first method generally requires a metric $d$ over $X \times X$ and assumes that for $x_i, x_j \in X$ we have $K(x_i, x_j) \neq 0$ only if $d(x_i, x_j) \leq r$ for some range $r$.
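A sketch of covariance tapering for scalar inputs follows: a spherical taper zeroes all covariances beyond range $r$, and the resulting sparse matrix is factorized with a sparse solver. For clarity the dense matrix is built first, which a practical implementation would avoid via neighbour search; the name `tapered_nll` is illustrative.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

def tapered_nll(x, y, r, length_scale=1.0, variance=1.0, nugget=1e-6):
    """NLL under a tapered covariance: entries with d(x_i, x_j) >= r are
    forced to zero, so K is sparse and cheap to factorize."""
    d = np.abs(x[:, None] - x[None, :])                    # metric d on X x X
    K = variance * np.exp(-0.5 * (d / length_scale) ** 2)  # original covariance
    u = np.minimum(d / r, 1.0)
    taper = 1.0 - 1.5 * u + 0.5 * u**3                     # spherical taper: 0 at d >= r
    n = x.size
    K_tap = (sparse.csc_matrix(K * taper)
             + nugget * sparse.identity(n, format="csc"))
    lu = splu(K_tap)                                       # sparse LU of tapered K
    logdet = np.sum(np.log(np.abs(lu.U.diagonal())))       # log det K from U's diagonal
    resid = y - y.mean()                                   # crude constant-mean fit
    return 0.5 * (logdet + resid @ lu.solve(resid) + n * np.log(2 * np.pi))
```

Because the elementwise product of two covariance functions is itself a valid covariance function, the tapered matrix remains positive definite, so the sparse model is a legitimate Gaussian process in its own right.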