Within Bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be an inner product space, but instead a more general reproducing kernel Hilbert space. In Bayesian probability, kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the ''input space'' is usually a ''space of vectors'' while the ''output space'' is a ''space of scalars''. More recently these methods have been extended to problems that deal with multiple outputs, such as in multi-task learning.

A mathematical equivalence between the regularization and the Bayesian point of view is easily proved in cases where the reproducing kernel Hilbert space is ''finite-dimensional''. The infinite-dimensional case raises subtle mathematical issues; we will consider here the finite-dimensional case. We start with a brief review of the main ideas underlying kernel methods for scalar learning, and briefly introduce the concepts of regularization and Gaussian processes. We then show how both points of view arrive at essentially equivalent estimators, and show the connection that ties them together.


The supervised learning problem

The classical supervised learning problem requires estimating the output for some new input point \mathbf{x}' by learning a scalar-valued estimator \hat{f}(\mathbf{x}') on the basis of a training set S consisting of n input-output pairs, S = (\mathbf{X},\mathbf{Y}) = (\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_n,y_n). Given a symmetric and positive bivariate function k(\cdot,\cdot) called a ''kernel'', one of the most popular estimators in machine learning is given by

: \hat{f}(\mathbf{x}') = \mathbf{k}^\top (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}, \qquad (1)

where \mathbf{K} \equiv k(\mathbf{X},\mathbf{X}) is the kernel matrix with entries \mathbf{K}_{ij} = k(\mathbf{x}_i,\mathbf{x}_j), \mathbf{k} = [k(\mathbf{x}_1,\mathbf{x}'),\ldots,k(\mathbf{x}_n,\mathbf{x}')]^\top, and \mathbf{Y} = [y_1,\ldots,y_n]^\top. We will see how this estimator can be derived both from a regularization and a Bayesian perspective.


A regularization perspective

The main assumption in the regularization perspective is that the set of functions \mathcal{F} belongs to a reproducing kernel Hilbert space \mathcal{H}_k.


Reproducing kernel Hilbert space

A reproducing kernel Hilbert space (RKHS) \mathcal{H}_k is a Hilbert space of functions defined by a symmetric, positive-definite function k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R} called the ''reproducing kernel'', such that the function k(\mathbf{x},\cdot) belongs to \mathcal{H}_k for all \mathbf{x} \in \mathcal{X}. There are three main properties that make an RKHS appealing:

1. The ''reproducing property'', which gives the space its name,

: f(\mathbf{x}) = \langle f, k(\mathbf{x},\cdot) \rangle_k, \quad \forall \ f \in \mathcal{H}_k,

where \langle \cdot,\cdot \rangle_k is the inner product in \mathcal{H}_k.

2. Functions in an RKHS are in the closure of linear combinations of the kernel at given points,

: f(\mathbf{x}) = \sum_i k(\mathbf{x}_i,\mathbf{x}) c_i.

This allows the construction in a unified framework of both linear and generalized linear models.

3. The squared norm in an RKHS can be written as

: \|f\|_k^2 = \sum_{i,j} k(\mathbf{x}_i,\mathbf{x}_j) c_i c_j,

and can be viewed as measuring the ''complexity'' of the function.
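To make property 3 concrete, a short sketch (assuming an RBF kernel and arbitrary coefficients c_i chosen only for illustration) computes the squared RKHS norm of a finite kernel expansion:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

X = np.array([[0.0], [1.0], [2.0]])   # kernel centers x_i
c = np.array([0.5, -1.0, 0.3])        # coefficients c_i of f = sum_i c_i k(x_i, .)

K = rbf_kernel(X, X)
sq_norm = c @ K @ c                   # ||f||_k^2 = sum_{i,j} c_i c_j k(x_i, x_j)
print(sq_norm)
</syntaxhighlight>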


The regularized functional

The estimator is derived as the minimizer of the regularized functional

: \frac{1}{n} \sum_{i=1}^n (f(\mathbf{x}_i) - y_i)^2 + \lambda \|f\|_k^2, \qquad (2)

where f \in \mathcal{H}_k and \|\cdot\|_k is the norm in \mathcal{H}_k. The first term in this functional, which measures the average of the squares of the errors between the f(\mathbf{x}_i) and the y_i, is called the ''empirical risk'' and represents the cost we pay by predicting f(\mathbf{x}_i) in place of the true value y_i. The second term in the functional is the squared norm in an RKHS multiplied by a weight \lambda; it serves to stabilize the problem and to enforce a trade-off between fitting and complexity of the estimator. The weight \lambda, called the ''regularizer'', determines the degree to which instability and complexity of the estimator should be penalized (higher penalty for increasing values of \lambda).


Derivation of the estimator

The explicit form of the estimator in equation (1) is derived in two steps. First, the representer theorem states that the minimizer of the functional (2) can always be written as a linear combination of the kernels centered at the training-set points,

: \hat{f}(\mathbf{x}') = \sum_{i=1}^n c_i k(\mathbf{x}_i,\mathbf{x}'), \qquad (3)

for some \mathbf{c} \in \mathbb{R}^n. The explicit form of the coefficients \mathbf{c} = [c_1,\ldots,c_n]^\top can be found by substituting for f(\cdot) in the functional (2). For a function of the form in equation (3), we have that

:\begin{align}
\|f\|_k^2 & = \langle f,f \rangle_k \\
& = \left\langle \sum_{i=1}^n c_i k(\mathbf{x}_i,\cdot), \sum_{j=1}^n c_j k(\mathbf{x}_j,\cdot) \right\rangle_k \\
& = \sum_{i=1}^n \sum_{j=1}^n c_i c_j \langle k(\mathbf{x}_i,\cdot), k(\mathbf{x}_j,\cdot) \rangle_k \\
& = \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i,\mathbf{x}_j) \\
& = \mathbf{c}^\top \mathbf{K} \mathbf{c}.
\end{align}

We can rewrite the functional (2) as

: \frac{1}{n} \|\mathbf{Y} - \mathbf{K} \mathbf{c}\|^2 + \lambda \mathbf{c}^\top \mathbf{K} \mathbf{c}.

This functional is convex in \mathbf{c}, and therefore we can find its minimum by setting the gradient with respect to \mathbf{c} to zero,

:\begin{align}
-\frac{1}{n} \mathbf{K} (\mathbf{Y} - \mathbf{K} \mathbf{c}) + \lambda \mathbf{K} \mathbf{c} & = 0, \\
(\mathbf{K} + \lambda n \mathbf{I}) \mathbf{c} & = \mathbf{Y}, \\
\mathbf{c} & = (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}.
\end{align}

Substituting this expression for the coefficients in equation (3), we obtain the estimator stated previously in equation (1),

: \hat{f}(\mathbf{x}') = \mathbf{k}^\top (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}.
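As a quick numerical check of this derivation, the following sketch (reusing the RBF kernel and synthetic data assumed in the earlier example) solves the linear system for \mathbf{c} and confirms that the gradient of the rewritten functional vanishes at the solution:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
n, lam = len(Y), 0.1

K = rbf_kernel(X, X)
c = np.linalg.solve(K + lam * n * np.eye(n), Y)   # c = (K + lambda n I)^{-1} Y

# Gradient of (1/n)||Y - K c||^2 + lambda c^T K c (constant factors of 2 cancel):
grad = -(1.0 / n) * K @ (Y - K @ c) + lam * K @ c
print(np.max(np.abs(grad)))   # ~0 up to floating-point error
</syntaxhighlight>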


A Bayesian perspective

The notion of a kernel plays a crucial role in Bayesian probability as the covariance function of a stochastic process called the ''Gaussian process''.


A review of Bayesian probability

As part of the Bayesian framework, the Gaussian process specifies the ''prior distribution'' that describes the prior beliefs about the properties of the function being modeled. These beliefs are updated after taking into account observational data by means of a ''likelihood function'' that relates the prior beliefs to the observations. Taken together, the prior and likelihood lead to an updated distribution called the ''posterior distribution'' that is customarily used for predicting test cases.


The Gaussian process

A Gaussian process (GP) is a stochastic process in which any finite number of the random variables sampled follow a joint normal distribution. The mean vector and covariance matrix of the Gaussian distribution completely specify the GP. GPs are usually used as a prior distribution for functions, and as such the mean vector and covariance matrix can be viewed as functions, where the covariance function is also called the ''kernel'' of the GP. Let a function f follow a Gaussian process with mean function m and kernel function k,

: f \sim \mathcal{GP}(m,k).

In terms of the underlying Gaussian distribution, we have that for any finite set \mathbf{X} = \{\mathbf{x}_i\}_{i=1}^n, if we let f(\mathbf{X}) = [f(\mathbf{x}_1),\ldots,f(\mathbf{x}_n)]^\top, then

: f(\mathbf{X}) \sim \mathcal{N}(\mathbf{m},\mathbf{K}),

where \mathbf{m} = m(\mathbf{X}) = [m(\mathbf{x}_1),\ldots,m(\mathbf{x}_n)]^\top is the mean vector and \mathbf{K} = k(\mathbf{X},\mathbf{X}) is the covariance matrix of the multivariate Gaussian distribution.
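This finite-dimensional characterization suggests a simple way to sample functions from a GP prior: evaluate the mean and covariance functions on a finite grid and draw from the resulting multivariate Gaussian. A sketch, assuming a zero mean function and an RBF covariance function (the small diagonal jitter is added only for numerical stability):

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 50)[:, None]   # finite set of input points

m = np.zeros(len(X))                  # mean vector m(X) = 0
K = rbf_kernel(X, X)                  # covariance matrix K = k(X, X)

# f(X) ~ N(m, K); a tiny jitter on the diagonal keeps the factorization stable.
samples = rng.multivariate_normal(m, K + 1e-10 * np.eye(len(X)), size=3)
print(samples.shape)                  # (3, 50): three functions sampled at 50 points
</syntaxhighlight>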


Derivation of the estimator

In a regression context, the likelihood function is usually assumed to be a Gaussian distribution and the observations to be independent and identically distributed (iid),

: p(y \mid f,\mathbf{x},\sigma^2) = \mathcal{N}(f(\mathbf{x}),\sigma^2).

This assumption corresponds to the observations being corrupted with zero-mean Gaussian noise with variance \sigma^2. The iid assumption makes it possible to factorize the likelihood function over the data points given the set of inputs \mathbf{X} and the variance of the noise \sigma^2, and thus the posterior distribution can be computed analytically. For a test input vector \mathbf{x}', given the training data S = \{\mathbf{X},\mathbf{Y}\}, the posterior distribution is given by

: p(f(\mathbf{x}') \mid S,\mathbf{x}',\boldsymbol{\phi}) = \mathcal{N}(m(\mathbf{x}'),\sigma^2(\mathbf{x}')),

where \boldsymbol{\phi} denotes the set of parameters, which include the variance of the noise \sigma^2 and any parameters of the covariance function k, and where

:\begin{align}
m(\mathbf{x}') & = \mathbf{k}^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{Y}, \\
\sigma^2(\mathbf{x}') & = k(\mathbf{x}',\mathbf{x}') - \mathbf{k}^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}.
\end{align}
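A minimal sketch of these predictive equations, again under the assumption of an RBF covariance function and synthetic data:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
sigma2 = 0.01                                     # noise variance sigma^2

K = rbf_kernel(X, X)
x_new = np.array([[0.5]])
k_vec = rbf_kernel(X, x_new)[:, 0]

A = K + sigma2 * np.eye(len(Y))                   # K + sigma^2 I
post_mean = k_vec @ np.linalg.solve(A, Y)         # m(x')
post_var = (rbf_kernel(x_new, x_new)[0, 0]
            - k_vec @ np.linalg.solve(A, k_vec))  # sigma^2(x')
print(post_mean, post_var)
</syntaxhighlight>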


The connection between regularization and Bayes

A connection between regularization theory and Bayesian theory can only be achieved in the case of a ''finite-dimensional RKHS''. Under this assumption, regularization theory and Bayesian theory are connected through Gaussian process prediction. In the finite-dimensional case, every RKHS can be described in terms of a feature map \Phi : \mathcal{X} \rightarrow \mathbb{R}^p such that

: k(\mathbf{x},\mathbf{x}') = \sum_{i=1}^p \Phi^i(\mathbf{x}) \Phi^i(\mathbf{x}').

Functions in the RKHS with kernel \mathbf{K} can then be written as

: f_{\mathbf{w}}(\mathbf{x}) = \sum_{i=1}^p w^i \Phi^i(\mathbf{x}) = \langle \mathbf{w},\Phi(\mathbf{x}) \rangle,

and we also have that

: \|f_{\mathbf{w}}\|_k = \|\mathbf{w}\|.

We can now build a Gaussian process by assuming \mathbf{w} = [w^1,\ldots,w^p]^\top to be distributed according to a multivariate Gaussian distribution with zero mean and identity covariance matrix,

: \mathbf{w} \sim \mathcal{N}(0,\mathbf{I}) \propto \exp(-\|\mathbf{w}\|^2).

If we assume a Gaussian likelihood, we have

: P(\mathbf{Y} \mid \mathbf{X},f) = \mathcal{N}(f(\mathbf{X}),\sigma^2 \mathbf{I}) \propto \exp\left(-\frac{1}{\sigma^2} \|f_{\mathbf{w}}(\mathbf{X}) - \mathbf{Y}\|^2\right),

where f_{\mathbf{w}}(\mathbf{X}) = (\langle\mathbf{w},\Phi(\mathbf{x}_1)\rangle,\ldots,\langle\mathbf{w},\Phi(\mathbf{x}_n)\rangle)^\top. The resulting posterior distribution is then given by

: P(f \mid \mathbf{X},\mathbf{Y}) \propto \exp\left(-\frac{1}{\sigma^2} \|f_{\mathbf{w}}(\mathbf{X}) - \mathbf{Y}\|^2 - \|\mathbf{w}\|^2\right).

We can see that a ''maximum a posteriori (MAP)'' estimate is equivalent to the minimization problem defining Tikhonov regularization, where in the Bayesian case the regularization parameter is related to the noise variance.

From a philosophical perspective, the loss function in a regularization setting plays a different role than the likelihood function in the Bayesian setting. Whereas the loss function measures the error that is incurred when predicting f(\mathbf{x}) in place of y, the likelihood function measures how likely the observations are under the model that was assumed to be true in the generative process. From a mathematical perspective, however, the formulations of the regularization and Bayesian frameworks give the loss function and the likelihood function the same mathematical role of promoting the inference of functions f that approximate the labels y as much as possible.
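The equivalence can also be checked numerically: choosing the regularizer so that \lambda n = \sigma^2 makes the regularized estimator of equation (1) identical to the GP posterior mean m(\mathbf{x}'). A sketch under the same assumed RBF kernel and synthetic data as in the earlier examples:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
n = len(Y)

sigma2 = 0.01
lam = sigma2 / n             # regularizer matched to the noise variance

K = rbf_kernel(X, X)
x_new = np.array([[0.5]])
k_vec = rbf_kernel(X, x_new)[:, 0]

f_reg = k_vec @ np.linalg.solve(K + lam * n * np.eye(n), Y)   # equation (1)
f_gp = k_vec @ np.linalg.solve(K + sigma2 * np.eye(n), Y)     # GP posterior mean m(x')
print(np.isclose(f_reg, f_gp))   # True: the two estimators coincide
</syntaxhighlight>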


See also

* Regularized least squares
* Bayesian linear regression
* Bayesian interpretation of Tikhonov regularization

