In statistics and machine learning, Gaussian process approximation is a computational method that accelerates inference tasks in the context of a Gaussian process model, most commonly likelihood
evaluation and prediction. Like approximations of other models, these methods can often be expressed as additional assumptions imposed on the model, which do not correspond to any actual feature but which retain its key properties while simplifying calculations. Many of these approximation methods can be expressed in purely linear algebraic or functional analytic terms as matrix or function approximations. Others are purely algorithmic and cannot easily be rephrased as a modification of a statistical model.
Basic ideas
In statistical modeling, it is often convenient to assume that $y$, the phenomenon under investigation, is a Gaussian process indexed by $X$ which has mean function $\mu$ and covariance function $K$. One can also assume that data $y = (y_1, \dots, y_n)$ are values of a particular realization of this process for indices $x_1, \dots, x_n \in X$.
Consequently, the joint distribution of the data can be expressed as

: $y \sim \mathcal{N}(\mu, K)$,

where $K = \{K(x_i, x_j)\}_{i,j=1,\dots,n}$ and $\mu = (\mu(x_1), \dots, \mu(x_n))^\top$, i.e. respectively a matrix with the covariance function values and a vector with the mean function values at corresponding (pairs of) indices.
The negative log-likelihood of the data then takes the form

: $-\log p(y) = \tfrac{1}{2}\log\det K + \tfrac{1}{2}(y - \mu)^\top K^{-1} (y - \mu) + \tfrac{n}{2}\log 2\pi.$
Similarly, the best predictor of $y^*$, the values of $y$ for indices $X^*$, given data $y$ has the form

: $\hat{y}^* = \mu^* + K^{*\top} K^{-1} (y - \mu),$

where $\mu^*$ is the vector of mean function values at $X^*$ and $K^* = \{K(x_i, x^*_j)\}_{i,j}$ is the matrix of covariances between the observed indices and the prediction indices.
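As a concrete illustration, the following sketch evaluates both formulas exactly with a Cholesky factorization. It assumes scalar inputs, a squared-exponential covariance function, and a constant mean; the names `sq_exp_cov` and `gp_nll_and_predict` are illustrative, not taken from any library.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sq_exp_cov(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance K(x, x') = s^2 exp(-(x - x')^2 / (2 l^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_nll_and_predict(x, y, x_star, mu=0.0, nugget=1e-8):
    """Exact negative log-likelihood and best (kriging) predictor.

    Both quantities hinge on factorizing the n x n matrix K, an O(n^3) step.
    """
    n = len(x)
    K = sq_exp_cov(x, x) + nugget * np.eye(n)   # K = {K(x_i, x_j)}_{i,j}
    K_star = sq_exp_cov(x, x_star)              # K* = {K(x_i, x*_j)}_{i,j}
    L = cho_factor(K)                           # K = L L^T, the O(n^3) step
    r = y - mu
    alpha = cho_solve(L, r)                     # alpha = K^{-1} (y - mu)
    # -log p(y) = 1/2 log det K + 1/2 r^T K^{-1} r + n/2 log 2*pi,
    # where log det K = 2 * sum(log diag(L)) from the Cholesky factor.
    nll = (np.sum(np.log(np.diag(L[0])))
           + 0.5 * r @ alpha
           + 0.5 * n * np.log(2 * np.pi))
    y_hat = mu + K_star.T @ alpha               # mu* + K*^T K^{-1} (y - mu)
    return nll, y_hat

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 200))
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
nll, y_hat = gp_nll_and_predict(x, y, np.linspace(0.0, 10.0, 5))
```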
In the context of Gaussian models, especially in geostatistics, prediction using the best predictor, i.e. the mean conditional on the data, is also known as kriging.
The most computationally expensive component of the best predictor formula is inverting the covariance matrix $K$, which has cubic complexity $\mathcal{O}(n^3)$. Similarly, evaluating the likelihood involves calculating both $K^{-1}(y - \mu)$ and the determinant $\det K$, which have the same cubic complexity.
Gaussian process approximations can often be expressed in terms of assumptions on $K$ under which $K^{-1}$ and $\det K$ can be calculated with much lower complexity. Since these assumptions are generally not believed to reflect reality, the likelihood and the best predictor obtained in this way are not exact, but they are meant to be close to their original values.
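For instance, assuming that $K$ is block diagonal, which amounts to treating groups of observations as mutually independent, lets both quantities be computed blockwise. A minimal sketch under this assumption (the helper name `blockwise_nll` is hypothetical):

```python
import numpy as np

def blockwise_nll(y, mu, blocks):
    """Negative log-likelihood under the assumption that K is block diagonal.

    `blocks` is a list of (index_array, covariance_block) pairs. Factorizing
    each b x b block costs O(b^3), so the total cost is O(n b^2) rather than
    O(n^3): both det K and K^{-1}(y - mu) decompose over the blocks.
    """
    nll = 0.5 * len(y) * np.log(2 * np.pi)
    for idx, K_b in blocks:
        L = np.linalg.cholesky(K_b)
        z = np.linalg.solve(L, y[idx] - mu[idx])   # z = L^{-1} (y_b - mu_b)
        nll += np.sum(np.log(np.diag(L))) + 0.5 * z @ z
    return nll
```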
Model-based methods
This class of approximations is expressed through a set of assumptions which are imposed on the original process and which, typically, imply some special structure of the covariance matrix. Although most of these methods were developed independently, nearly all of them can be expressed as special cases of the sparse general Vecchia approximation.
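To make the idea concrete, the sketch below implements a basic Vecchia-type likelihood for scalar inputs: the joint density is replaced by a product of univariate conditionals, each conditioning only on the $m$ nearest previously ordered points. It assumes a zero prior mean and a covariance function with the signature of `sq_exp_cov` above; the name `vecchia_nll` is illustrative.

```python
import numpy as np

def vecchia_nll(x, y, cov_fn, m=10):
    """Vecchia-type approximate NLL: p(y) ~ prod_i p(y_i | y_{c(i)}),
    where c(i) holds at most m nearest neighbours among x_1, ..., x_{i-1}.

    Each term needs only an m x m solve, so the cost is O(n m^3) rather
    than the O(n^3) of the exact likelihood.
    """
    nll = 0.0
    for i in range(len(y)):
        prev = np.argsort(np.abs(x[:i] - x[i]))[:m]    # conditioning set c(i)
        k_ii = cov_fn(x[i:i+1], x[i:i+1])[0, 0]
        if prev.size == 0:
            mean, var = 0.0, k_ii                      # first point: marginal
        else:
            K_cc = cov_fn(x[prev], x[prev]) + 1e-10 * np.eye(prev.size)
            k_ci = cov_fn(x[prev], x[i:i+1])[:, 0]
            w = np.linalg.solve(K_cc, k_ci)            # kriging weights
            mean = w @ y[prev]                         # conditional mean
            var = max(k_ii - w @ k_ci, 1e-12)          # conditional variance
        nll += 0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return nll
```

With $m = n - 1$ every conditional uses all preceding points and the product recovers the exact likelihood; smaller $m$ trades accuracy for speed.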
Sparse covariance methods
These methods approximate the true model in such a way that the covariance matrix is sparse. Typically, each method proposes its own algorithm that takes full advantage of the sparsity pattern in the covariance matrix. Two prominent members of this class of approaches are covariance tapering and domain partitioning. The first method generally requires a metric $d$ over $X \times X$ and assumes that for $x_i, x_j \in X$ we have $K(x_i, x_j) \neq 0$ only if $d(x_i, x_j) \leq r$ for some range $r$.
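A sketch of covariance tapering for scalar inputs follows: a spherical taper zeroes all covariances beyond range $r$, and the resulting sparse matrix is factorized with a sparse solver. For clarity the dense matrix is built first, which a practical implementation would avoid via neighbour search; the name `tapered_nll` is illustrative.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

def tapered_nll(x, y, r, length_scale=1.0, variance=1.0, nugget=1e-6):
    """NLL under a tapered covariance: entries with d(x_i, x_j) >= r are
    forced to zero, so K is sparse and cheap to factorize."""
    d = np.abs(x[:, None] - x[None, :])                    # metric d on X x X
    K = variance * np.exp(-0.5 * (d / length_scale) ** 2)  # original covariance
    u = np.minimum(d / r, 1.0)
    taper = 1.0 - 1.5 * u + 0.5 * u**3                     # spherical taper: 0 at d >= r
    n = x.size
    K_tap = (sparse.csc_matrix(K * taper)
             + nugget * sparse.identity(n, format="csc"))
    lu = splu(K_tap)                                       # sparse LU of tapered K
    logdet = np.sum(np.log(np.abs(lu.U.diagonal())))       # log det K from U's diagonal
    resid = y - y.mean()                                   # crude constant-mean fit
    return 0.5 * (logdet + resid @ lu.solve(resid) + n * np.log(2 * np.pi))
```

Because the elementwise product of two covariance functions is itself a valid covariance function, the tapered matrix remains positive definite, so the sparse model is a legitimate Gaussian process in its own right.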