Probabilistic numerics
   HOME

TheInfoList



OR:

Probabilistic numerics is a scientific field at the intersection of statistics,
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
and
applied mathematics Applied mathematics is the application of mathematical methods by different fields such as physics, engineering, medicine, biology, finance, business, computer science, and industry. Thus, applied mathematics is a combination of mathemati ...
, where tasks in
numerical analysis Numerical analysis is the study of algorithms that use numerical approximation (as opposed to symbolic manipulations) for the problems of mathematical analysis (as distinguished from discrete mathematics). It is the study of numerical methods ...
including finding numerical solutions for integration,
linear algebra Linear algebra is the branch of mathematics concerning linear equations such as: :a_1x_1+\cdots +a_nx_n=b, linear maps such as: :(x_1, \ldots, x_n) \mapsto a_1x_1+\cdots +a_nx_n, and their representations in vector spaces and through matrices ...
, optimisation and differential equations are seen as problems of statistical, probabilistic, or Bayesian inference.


Introduction

A numerical method is an algorithm that ''approximates'' the solution to a mathematical problem (examples below include the solution to a
linear system of equations In mathematics, a system of linear equations (or linear system) is a collection of one or more linear equations involving the same variables. For example, :\begin 3x+2y-z=1\\ 2x-2y+4z=-2\\ -x+\fracy-z=0 \end is a system of three equations in ...
, the value of an
integral In mathematics, an integral assigns numbers to functions in a way that describes displacement, area, volume, and other concepts that arise by combining infinitesimal data. The process of finding integrals is called integration. Along wit ...
, the solution of a
differential equation In mathematics, a differential equation is an equation that relates one or more unknown functions and their derivatives. In applications, the functions generally represent physical quantities, the derivatives represent their rates of change, an ...
, the
minimum In mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given r ...
of a multivariate function). In a ''probabilistic'' numerical algorithm, this process of approximation is thought of as a problem of ''estimation'', ''inference'' or ''learning'' and realised in the framework of probabilistic inference (often, but not always, Bayesian inference). Formally, this means casting the setup of the computational problem in terms of a
prior distribution In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken int ...
, formulating the relationship between numbers computed by the computer (e.g. matrix-vector multiplications in linear algebra, gradients in optimization, values of the integrand or the vector field defining a differential equation) and the quantity in question (the solution of the linear problem, the minimum, the integral, the solution curve) in a
likelihood function The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood funct ...
, and returning a
posterior distribution The posterior probability is a type of conditional probability that results from updating the prior probability with information summarized by the likelihood via an application of Bayes' rule. From an epistemological perspective, the posterior p ...
as the output. In most cases, numerical algorithms also take internal adaptive decisions about which numbers to compute, which form an
active learning Active learning is "a method of learning in which students are actively or experientially involved in the learning process and where there are different levels of active learning, depending on student involvement." states that "students partici ...
problem. Many of the most popular classic numerical algorithms can be re-interpreted in the probabilistic framework. This includes the method of conjugate gradients, Nordsieck methods,
Gaussian quadrature In numerical analysis, a quadrature rule is an approximation of the definite integral of a function, usually stated as a weighted sum of function values at specified points within the domain of integration. (See numerical integration for mor ...
rules, and
quasi-Newton method Quasi-Newton methods are methods used to either find zeroes or local maxima and minima of functions, as an alternative to Newton's method. They can be used if the Jacobian or Hessian is unavailable or is too expensive to compute at every iteration. ...
s. In all these cases, the classic method is based on a regularized least-squares estimate that can be associated with the posterior mean arising from a
Gaussian Carl Friedrich Gauss (1777–1855) is the eponym of all of the topics listed below. There are over 100 topics all named after this German mathematician and scientist, all in the fields of mathematics, physics, and astronomy. The English eponymo ...
prior and likelihood. In such cases, the variance of the Gaussian posterior is then associated with a worst-case estimate for the squared error. Probabilistic numerical methods promise several conceptual advantages over classic, point-estimate based approximation techniques: * They return ''structured'' error estimates (in particular, the ability to return joint posterior samples, i.e. multiple realistic hypotheses for the true unknown solution of the problem) * Hierarchical Bayesian inference can be used to set and control internal hyperparameters in such methods in a generic fashion, rather than having to re-invent novel methods for each parameter * Since they use and allow for an explicit likelihood describing the relationship between computed numbers and target quantity, probabilistic numerical methods can use the results of even highly imprecise, biased and stochastic computations. Conversely, probabilistic numerical methods can also provide a likelihood in computations often considered " likelihood-free" elsewhere * Because all probabilistic numerical methods use essentially the same data type – probability measures – to quantify uncertainty over ''both inputs and outputs'' they can be chained together to propagate uncertainty across large-scale, composite computations * Sources from multiple sources of information (e.g. algebraic, mechanistic knowledge about the form of a differential equation, and observations of the trajectory of the system collected in the physical world) can be combined naturally and ''inside the inner loop'' of the algorithm, removing otherwise necessary nested loops in computation, e.g. in
inverse problem An inverse problem in science is the process of calculating from a set of observations the causal factors that produced them: for example, calculating an image in X-ray computed tomography, source reconstruction in acoustics, or calculating the ...
s. These advantages are essentially the equivalent of similar functional advantages that Bayesian methods enjoy over point-estimates in machine learning, applied or transferred to the computational domain.


Numerical tasks


Integration

Probabilistic numerical methods have been developed for the problem of
numerical integration In analysis, numerical integration comprises a broad family of algorithms for calculating the numerical value of a definite integral, and by extension, the term is also sometimes used to describe the numerical solution of differential equations ...
, with the most popular method called ''
Bayesian quadrature Bayesian quadrature is a method for approximating intractable integration problems. It falls within the class of probabilistic numerical methods. Bayesian quadrature views numerical integration as a Bayesian inference task, where function eval ...
''. In numerical integration, function evaluations f(x_1), \ldots, f(x_n) at a number of points x_1, \ldots, x_n are used to estimate the integral \textstyle \int f(x) \nu(dx) of a function f against some measure \nu . Bayesian quadrature consists of specifying a prior distribution over f and conditioning this prior on f(x_1), \ldots, f(x_n) to obtain a posterior distribution over f, then computing the implied posterior distribution on \textstyle \int f(x) \nu(dx) . The most common choice of prior is a Gaussian process as this allows us to obtain a closed-form posterior distribution on the integral which is a univariate Gaussian distribution. Bayesian quadrature is particularly useful when the function f is expensive to evaluate and the dimension of the data is small to moderate.


Optimization

Probabilistic approaches have also been studied for mathematical optimization, which consist of finding the minimum or maximum of some objective function f given (possibly noisy or indirect) evaluations of that function at a set of points. Perhaps the most notable effort in this direction is Bayesian optimization, a general approach to optimization grounded in Bayesian inference. Bayesian optimization algorithms operate by maintaining a probabilistic belief about f throughout the optimization procedure; this often takes the form of a Gaussian process prior conditioned on observations. This belief then guides the algorithm in obtaining observations that are likely to advance the optimization process. Bayesian optimization policies are usually realized by transforming the objective function posterior into an inexpensive, differentiable ''acquisition function'' that is maximized to select each successive observation location. One prominent approach is to model optimization via Bayesian sequential experimental design, seeking to obtain a sequence of observations yielding the most optimization progress as evaluated by an appropriate
utility function As a topic of economics, utility is used to model worth or value. Its usage has evolved significantly over time. The term was introduced initially as a measure of pleasure or happiness as part of the theory of utilitarianism by moral philosoph ...
. A welcome side effect from this approach is that uncertainty in the objective function, as measured by the underlying probabilistic belief, can guide an optimization policy in addressing the classic exploration vs. exploitation tradeoff.


Local optimization

Probabilistic numerical methods have been developed in the context of stochastic optimization for deep learning, in particular to address main issues such as
learning rate In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly ...
tuning and line searches, batch-size selection,
early stopping In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data wit ...
, pruning, and first- and second-order search directions. In this setting, the optimization objective is often an
empirical risk Empirical evidence for a proposition is evidence, i.e. what supports or counters this proposition, that is constituted by or accessible to sense experience or experimental procedure. Empirical evidence is of central importance to the sciences and ...
of the form \textstyle L(\theta) = \frac\sum_^N \ell(y_n, f_(x_n)) defined by a dataset \textstyle \mathcal=\_^N, and a loss \ell(y, f_(x)) that quantifies how well a predictive model f_(x) parameterized by \theta performs on predicting the target y from its corresponding input x . Epistemic uncertainty arises when the dataset size N is large and cannot be processed at once meaning that local quantities (given some \theta ) such as the loss function L(\theta) itself or its gradient \nabla L(\theta) cannot be computed in reasonable time. Hence, generally mini-batching is used to construct estimators of these quantities on a random subset of the data. Probabilistic numerical methods model this uncertainty explicitly and allow for automated decisions and parameter tuning.


Linear algebra

Probabilistic numerical methods for linear algebra have primarily focused on solving
systems of linear equations In mathematics, a system of linear equations (or linear system) is a collection of one or more linear equations involving the same variables. For example, :\begin 3x+2y-z=1\\ 2x-2y+4z=-2\\ -x+\fracy-z=0 \end is a system of three equations in th ...
of the form Ax=b and the computation of
determinants In mathematics, the determinant is a scalar value that is a function of the entries of a square matrix. It characterizes some properties of the matrix and the linear map represented by the matrix. In particular, the determinant is nonzero if and ...
, A, . A large class of methods are iterative in nature and collect information about the linear system to be solved via repeated matrix-vector multiplication v \mapsto Av with the system matrix A with different vectors v. Such methods can be roughly split into a solution- and a matrix-based perspective, depending on whether belief is expressed over the solution x of the linear system or the (pseudo-)inverse of the matrix H=A^. The belief update uses that the inferred object is linked to matrix multiplications y= Av or z=A^\intercal v via b^\intercal z = x^\intercal v and v = A^ y . Methods typically assume a Gaussian distribution, due to its closedness under linear observations of the problem. While conceptually different, these two views are computationally equivalent and inherently connected via the right-hand-side through x = A^b. Probabilistic numerical linear algebra routines have been successfully applied to scale Gaussian processes to large datasets. In particular, they enable exact propagation of the approximation error to a combined Gaussian process posterior, which quantifies the uncertainty arising from both the finite number of data observed and the finite amount of computation expended.


Ordinary differential equations

Probabilistic numerical methods for
ordinary differential equation In mathematics, an ordinary differential equation (ODE) is a differential equation whose unknown(s) consists of one (or more) function(s) of one variable and involves the derivatives of those functions. The term ''ordinary'' is used in contrast ...
s \dot(t) = f(t, y(t)) , have been developed for initial and boundary value problems. Many different probabilistic numerical methods designed for ordinary differential equations have been proposed, and these can broadly be grouped into the two following categories: * Randomisation-based methods are defined through random perturbations of standard deterministic numerical methods for ordinary differential equations. For example, this has been achieved by adding Gaussian perturbations on the solution of one-step integrators or by perturbing randomly their time-step. This defines a probability measure on the solution of the differential equation that can be sampled. * Gaussian process regression methods are based on posing the problem of solving the differential equation at hand as a Gaussian process regression problem, interpreting evaluations of the right-hand side as data on the derivative. These techniques resemble to Bayesian cubature, but employ different and often non-linear observation models. In its infancy, this class of methods was based on naive Gaussian process regression. This was later improved (in terms of efficient computation) in favor of GaussMarkov priors modeled by the
stochastic differential equation A stochastic differential equation (SDE) is a differential equation in which one or more of the terms is a stochastic process, resulting in a solution which is also a stochastic process. SDEs are used to model various phenomena such as stock p ...
\mathrmx(t) = A x(t) \, \mathrm t + B \, \mathrmv(t) , where x(t) is a \nu -dimensional vector modeling the first \nu derivatives of y(t) , and where v(t) is a \nu -dimensional
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
. Inference can thus be implemented efficiently with Kalman filtering based methods. The boundary between these two categories is not sharp, indeed a Gaussian process regression approach based on randomised data was developed as well. These methods have been applied to problems in computational Riemannian geometry, inverse problems, latent force models, and to differential equations with a geometric structure such as symplecticity.


Partial differential equations

A number of probabilistic numerical methods have also been proposed for
partial differential equations In mathematics, a partial differential equation (PDE) is an equation which imposes relations between the various partial derivatives of a multivariable function. The function is often thought of as an "unknown" to be solved for, similarly to ...
. As with ordinary differential equations, the approaches can broadly be divided into those based on randomisation, generally of some underlying finite-element mesh and those based on Gaussian process regression. Probabilistic numerical PDE solvers based on Gaussian process regression recover classical methods on linear PDEs for certain priors, in particular methods of mean weighted residuals, which include Galerkin methods, finite element methods, as well as spectral methods.


History and related fields

The interplay between
numerical analysis Numerical analysis is the study of algorithms that use numerical approximation (as opposed to symbolic manipulations) for the problems of mathematical analysis (as distinguished from discrete mathematics). It is the study of numerical methods ...
and
probability Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speakin ...
is touched upon by a number of other areas of mathematics, including average-case analysis of numerical methods,
information-based complexity Information-based complexity (IBC) studies optimal algorithms and computational complexity for the continuous problems that arise in physical science, economics, engineering, and mathematical finance. IBC has studied such continuous problems as p ...
, game theory, and statistical
decision theory Decision theory (or the theory of choice; not to be confused with choice theory) is a branch of applied probability theory concerned with the theory of making decisions based on assigning probabilities to various factors and assigning numerical ...
. Precursors to what is now being called "probabilistic numerics" can be found as early as the late 19th and early 20th century. The origins of probabilistic numerics can be traced to a discussion of probabilistic approaches to polynomial interpolation by Henri Poincaré in his ''Calcul des Probabilités''. In modern terminology, Poincaré considered a Gaussian prior distribution on a function f \colon \mathbb \to \mathbb, expressed as a formal
power series In mathematics, a power series (in one variable) is an infinite series of the form \sum_^\infty a_n \left(x - c\right)^n = a_0 + a_1 (x - c) + a_2 (x - c)^2 + \dots where ''an'' represents the coefficient of the ''n''th term and ''c'' is a con ...
with random coefficients, and asked for "probable values" of f(x) given this prior and n \in \mathbb observations f(a_) = B_ for i = 1, \dots, n. A later seminal contribution to the interplay of numerical analysis and probability was provided by Albert Suldin in the context of univariate quadrature. The statistical problem considered by Suldin was the approximation of the definite integral \textstyle \int u(t) \, \mathrm t of a function u \colon
, b The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline o ...
\to \mathbb, under a
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
prior on u, given access to pointwise evaluation of u at nodes t_, \dots, t_ \in
, b The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline o ...
/math>. Suldin showed that, for given quadrature nodes, the quadrature rule with minimal
mean squared error In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between ...
is the
trapezoidal rule In calculus, the trapezoidal rule (also known as the trapezoid rule or trapezium rule; see Trapezoid for more information on terminology) is a technique for approximating the definite integral. \int_a^b f(x) \, dx. The trapezoidal rule works by ...
; furthermore, this minimal error is proportional to the sum of cubes of the inter-node spacings. As a result, one can see the trapezoidal rule with equally-spaced nodes as statistically optimal in some sense — an early example of the average-case analysis of a numerical method. Suldin's point of view was later extended by Mike Larkin. Note that Suldin's Brownian motion prior on the integrand u is a
Gaussian measure In mathematics, Gaussian measure is a Borel measure on finite-dimensional Euclidean space R''n'', closely related to the normal distribution in statistics. There is also a generalization to infinite-dimensional spaces. Gaussian measures are nam ...
and that the operations of integration and of point wise evaluation of u are both
linear map In mathematics, and more specifically in linear algebra, a linear map (also called a linear mapping, linear transformation, vector space homomorphism, or in some contexts linear function) is a mapping V \to W between two vector spaces that pr ...
s. Thus, the definite integral \textstyle \int u(t) \, \mathrm t is a real-valued Gaussian random variable. In particular, after conditioning on the observed pointwise values of u, it follows a
normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...
with mean equal to the trapezoidal rule and variance equal to \textstyle \frac \sum_^ (t_ - t_)^. This viewpoint is very close to that of
Bayesian quadrature Bayesian quadrature is a method for approximating intractable integration problems. It falls within the class of probabilistic numerical methods. Bayesian quadrature views numerical integration as a Bayesian inference task, where function eval ...
, seeing the output of a quadrature method not just as a point estimate but as a probability distribution in its own right. As noted by Owhadi and collaborators, interplays between numerical approximation and statistical inference can also be traced back to Palasti and Renyi, Sard, Kimeldorf and Wahba (on the correspondence between Bayesian estimation and spline smoothing/interpolation) and Larkin (on the correspondence between Gaussian process regression and numerical approximation). Although the approach of modelling a perfectly known function as a sample from a random process may seem counterintuitive, a natural framework for understanding it can be found in
information-based complexity Information-based complexity (IBC) studies optimal algorithms and computational complexity for the continuous problems that arise in physical science, economics, engineering, and mathematical finance. IBC has studied such continuous problems as p ...
(IBC), the branch of computational complexity founded on the observation that numerical implementation requires computation with partial information and limited resources. In IBC, the performance of an algorithm operating on incomplete information can be analyzed in the worst-case or the average-case (randomized) setting with respect to the missing information. Moreover, as Packel observed, the average case setting could be interpreted as a
mixed strategy In game theory, a player's strategy is any of the options which they choose in a setting where the outcome depends ''not only'' on their own actions ''but'' on the actions of others. The discipline mainly concerns the action of a player in a game ...
in an adversarial game obtained by lifting a (worst-case) minmax problem to a minmax problem over mixed (randomized) strategies. This observation leads to a natural connection between numerical approximation and Wald's
decision theory Decision theory (or the theory of choice; not to be confused with choice theory) is a branch of applied probability theory concerned with the theory of making decisions based on assigning probabilities to various factors and assigning numerical ...
, evidently influenced by von Neumann's theory of games. To describe this connection consider the optimal recovery setting of Micchelli and Rivlin in which one tries to approximate an unknown function from a finite number of linear measurements on that function. Interpreting this optimal recovery problem as a zero-sum game where Player I selects the unknown function and Player II selects its approximation, and using relative errors in a quadratic norm to define losses, Gaussian priors emerge as optimal mixed strategies for such games, and the covariance operator of the optimal Gaussian prior is determined by the quadratic norm used to define the relative error of the recovery.


Software


ProbNum
Probabilistic Numerics in Python.
ProbNumDiffEq.jl
Probabilistic numerical ODE solvers based on filtering implemented in Julia.
Emukit
Adaptable Python toolbox for decision-making under uncertainty.
BackPACK
Built on top of PyTorch. It efficiently computes quantities other than the gradient.


See also

* Average-case analysis *
Information-based complexity Information-based complexity (IBC) studies optimal algorithms and computational complexity for the continuous problems that arise in physical science, economics, engineering, and mathematical finance. IBC has studied such continuous problems as p ...
*
Uncertainty quantification Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in both computational and real world applications. It tries to determine how likely certain outcomes are if some aspects of the system a ...


References

{{Reflist Applied statistics Machine learning Applied mathematics