
In machine learning, manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is ''smooth'': data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.
Manifold regularizer
Motivation
Manifold regularization is a type of regularization, a family of techniques that reduces overfitting and ensures that a problem is well-posed by penalizing complex solutions. In particular, manifold regularization extends the technique of Tikhonov regularization as applied to reproducing kernel Hilbert spaces (RKHSs). Under standard Tikhonov regularization on RKHSs, a learning algorithm attempts to learn a function <math>f</math> from among a hypothesis space of functions <math>\mathcal{H}</math>. The hypothesis space is an RKHS, meaning that it is associated with a kernel <math>K</math>, and so every candidate function <math>f</math> has a norm <math>\left\| f \right\|_K</math>, which represents the complexity of the candidate function in the hypothesis space. When the algorithm considers a candidate function, it takes its norm into account in order to penalize complex functions.
Formally, given a set of labeled training data <math>(x_1, y_1), \ldots, (x_\ell, y_\ell)</math> with <math>x_i \in X, y_i \in \mathbb{R}</math> and a loss function <math>V</math>, a learning algorithm using Tikhonov regularization will attempt to solve the expression
:<math>f^* = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \frac{1}{\ell} \sum_{i=1}^{\ell} V(f(x_i), y_i) + \gamma \left\| f \right\|_K^2</math>
where <math>\gamma</math> is a hyperparameter that controls how much the algorithm will prefer simpler functions over functions that fit the data better.

Manifold regularization adds a second regularization term, the ''intrinsic regularizer'', to the ''ambient regularizer'' used in standard Tikhonov regularization. Under the manifold assumption in machine learning, the data in question do not come from the entire input space <math>X</math>, but instead from a nonlinear manifold <math>\mathcal{M} \subset X</math>. The geometry of this manifold, the intrinsic space, is used to determine the regularization norm.
Laplacian norm
There are many possible choices for the intrinsic regularizer <math>\left\| f \right\|_I</math>. Many natural choices involve the gradient on the manifold <math>\nabla_{\mathcal{M}}</math>, which can provide a measure of how smooth a target function is. A smooth function should change slowly where the input data are dense; that is, the gradient <math>\nabla_{\mathcal{M}} f(x)</math> should be small where the ''marginal probability density'' <math>\mathcal{P}_X(x)</math>, the probability density of a randomly drawn data point appearing at <math>x</math>, is large. This gives one appropriate choice for the intrinsic regularizer:
:<math>\left\| f \right\|_I^2 = \int_{\mathcal{M}} \left\| \nabla_{\mathcal{M}} f(x) \right\|^2 \, d\mathcal{P}_X(x)</math>
In practice, this norm cannot be computed directly because the marginal distribution <math>\mathcal{P}_X</math> is unknown, but it can be estimated from the provided data.
Graph-based approach to the Laplacian norm
When the distances between input points are interpreted as a graph, then the Laplacian matrix of the graph can help to estimate the marginal distribution. Suppose that the input data include <math>\ell</math> labeled examples (pairs of an input <math>x</math> and a label <math>y</math>) and <math>u</math> unlabeled examples (inputs without associated labels). Define <math>W</math> to be a matrix of edge weights for a graph, where <math>W_{ij}</math> is a distance measure between the data points <math>x_i</math> and <math>x_j</math>. Define <math>D</math> to be a diagonal matrix with <math>D_{ii} = \sum_{j=1}^{\ell+u} W_{ij}</math> and <math>L</math> to be the Laplacian matrix <math>D - W</math>. Then, as the number of data points <math>\ell + u</math> increases, <math>L</math> converges to the Laplace–Beltrami operator <math>\Delta_{\mathcal{M}}</math>, which is the divergence of the gradient <math>\nabla_{\mathcal{M}}</math>.
Then, if <math>\mathbf{f}</math> is a vector of the values of <math>f</math> at the data, <math>\mathbf{f} = [f(x_1), \ldots, f(x_{\ell+u})]^{\mathrm{T}}</math>, the intrinsic norm can be estimated:
:<math>\left\| f \right\|_I^2 = \frac{1}{(\ell + u)^2} \mathbf{f}^{\mathrm{T}} L \mathbf{f}</math>
As the number of data points <math>\ell + u</math> increases, this empirical definition of <math>\left\| f \right\|_I^2</math> converges to the definition when <math>\mathcal{P}_X</math> is known.
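This estimate is straightforward to compute. The following sketch (in Python with NumPy; the Gaussian edge weights, the bandwidth, and the toy data are illustrative choices rather than part of the method) builds the weight matrix <math>W</math>, the diagonal degree matrix <math>D</math>, and the Laplacian <math>L = D - W</math>, and evaluates the empirical intrinsic norm <math>\mathbf{f}^{\mathrm{T}} L \mathbf{f} / (\ell + u)^2</math> for a vector of function values:
<syntaxhighlight lang="python">
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Dense graph Laplacian L = D - W built from the data points in X.

    W uses Gaussian (heat-kernel) edge weights, one common distance-based
    choice; k-nearest-neighbour or epsilon-ball weightings also work.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)           # no self-loops
    D = np.diag(W.sum(axis=1))         # diagonal degree matrix
    return D - W

def intrinsic_norm_estimate(f_values, L):
    """Empirical estimate of ||f||_I^2, i.e. f^T L f / (l + u)^2."""
    n = len(f_values)
    return float(f_values @ L @ f_values) / n ** 2

# Toy usage: points sampled near a one-dimensional manifold (a noisy circle).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.01 * rng.normal(size=(200, 2))
L = graph_laplacian(X, sigma=0.3)
f_smooth = np.sin(theta)               # varies slowly along the manifold
f_rough = rng.normal(size=200)         # ignores the manifold structure
print(intrinsic_norm_estimate(f_smooth, L))   # small penalty
print(intrinsic_norm_estimate(f_rough, L))    # larger penalty
</syntaxhighlight>
A function that is smooth along the manifold should receive a much smaller penalty than one that varies arbitrarily between neighbouring points, which is the behaviour the intrinsic regularizer is meant to encode.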
Solving the regularization problem with the graph-based approach
Using the weights <math>\gamma_A</math> and <math>\gamma_I</math> for the ambient and intrinsic regularizers, the final expression to be solved becomes:
:<math>f^* = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \frac{1}{\ell} \sum_{i=1}^{\ell} V(f(x_i), y_i) + \gamma_A \left\| f \right\|_K^2 + \frac{\gamma_I}{(\ell + u)^2} \mathbf{f}^{\mathrm{T}} L \mathbf{f}</math>
As with other kernel methods, <math>\mathcal{H}</math> may be an infinite-dimensional space, so if the regularization expression cannot be solved explicitly, it is impossible to search the entire space for a solution. Instead, a representer theorem shows that under certain conditions on the choice of the norm <math>\left\| f \right\|_I</math>, the optimal solution <math>f^*</math> must be a linear combination of the kernel centered at each of the input points: for some weights <math>\alpha_i</math>,
:<math>f^*(x) = \sum_{i=1}^{\ell + u} \alpha_i K(x_i, x)</math>
Using this result, it is possible to search for the optimal solution <math>f^*</math> by searching the finite-dimensional space defined by the possible choices of <math>\alpha_i</math>.
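To make the finite-dimensional search concrete, the sketch below (Python/NumPy; the Gaussian kernel and the squared loss are illustrative assumptions, since the loss <math>V</math> above is left generic) writes the regularized objective as a function of the coefficient vector <math>\alpha</math> alone, using <math>\mathbf{f} = K\alpha</math> and <math>\left\| f \right\|_K^2 = \alpha^{\mathrm{T}} K \alpha</math>:
<syntaxhighlight lang="python">
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix with entries K(a_i, b_j); one common RKHS kernel."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def manifold_objective(alpha, K, L, y_labeled, gamma_A, gamma_I):
    """Manifold-regularized objective as a function of alpha only.

    By the representer theorem f(x_j) = sum_i alpha_i K(x_i, x_j), so the vector
    of function values at the data is f = K @ alpha, ||f||_K^2 = alpha^T K alpha,
    and the intrinsic term is f^T L f / (l + u)^2.  Squared loss is used here,
    and the first len(y_labeled) points are assumed to be the labeled ones.
    """
    n, l = K.shape[0], len(y_labeled)
    f = K @ alpha
    loss = np.mean((f[:l] - y_labeled) ** 2)
    ambient = gamma_A * (alpha @ K @ alpha)
    intrinsic = (gamma_I / n ** 2) * (f @ L @ f)
    return loss + ambient + intrinsic
</syntaxhighlight>
A generic optimizer can now search over the <math>\ell + u</math> entries of <math>\alpha</math> instead of over an infinite-dimensional function space; for squared loss the minimizer is even available in closed form, as described for LapRLS below.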
Functional approach to the Laplacian norm
The idea behind the graph Laplacian is to use neighboring points to estimate the Laplacian. This method is akin to local averaging methods, which are known to scale poorly in high-dimensional problems; indeed, the graph Laplacian is known to suffer from the curse of dimensionality.
Fortunately, it is possible to leverage the expected smoothness of the function to be estimated through more advanced functional analysis. This method consists of estimating the Laplacian operator using partial derivatives of the kernel, <math>\partial_{1,j} K(x_i, \cdot)</math>, where <math>\partial_{1,j}</math> denotes the partial derivative with respect to the ''j''-th coordinate of the first variable.
This second approach to the Laplacian norm can be related to meshfree methods, which contrast with the finite difference method in PDEs.
Applications
Manifold regularization can extend a variety of algorithms that can be expressed using Tikhonov regularization, by choosing an appropriate loss function <math>V</math> and hypothesis space <math>\mathcal{H}</math>. Two commonly used examples are the families of support vector machines and regularized least squares algorithms. (Regularized least squares includes the ridge regression algorithm; the related algorithms of LASSO and elastic net regularization can be expressed as support vector machines.) The extended versions of these algorithms are called Laplacian Regularized Least Squares (abbreviated LapRLS) and Laplacian Support Vector Machines (LapSVM), respectively.
Laplacian Regularized Least Squares (LapRLS)
Regularized least squares (RLS) is a family of regression algorithms: algorithms that predict a value <math>y = f(x)</math> for its inputs <math>x</math>, with the goal that the predicted values should be close to the true labels for the data. In particular, RLS is designed to minimize the mean squared error between the predicted values and the true labels, subject to regularization. Ridge regression is one form of RLS; in general, RLS is the same as ridge regression combined with the kernel method. The problem statement for RLS results from choosing the loss function <math>V</math> in Tikhonov regularization to be the mean squared error:
:<math>f^* = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \gamma \left\| f \right\|_K^2</math>
Thanks to the representer theorem, the solution can be written as a weighted sum of the kernel evaluated at the data points:
:<math>f^*(x) = \sum_{i=1}^{\ell} \alpha_i K(x_i, x)</math>
and solving for <math>\alpha</math> gives:
:<math>\alpha = (K + \gamma \ell I)^{-1} Y</math>
where <math>K</math> is defined to be the kernel matrix, with <math>K_{ij} = K(x_i, x_j)</math>, and <math>Y</math> is the vector of data labels.
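A minimal sketch of this closed form, assuming the kernel matrix <math>K</math> and the label vector have already been computed (NumPy; the choice of kernel is left to the caller):
<syntaxhighlight lang="python">
import numpy as np

def rls_fit(K, y, gamma):
    """Regularized least squares: solve (K + gamma * l * I) alpha = y."""
    l = len(y)
    return np.linalg.solve(K + gamma * l * np.eye(l), y)

def rls_predict(alpha, K_new):
    """Predict with f(x) = sum_i alpha_i K(x_i, x); K_new[t, i] = K(x_i, x_new_t)."""
    return K_new @ alpha
</syntaxhighlight>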
Adding a Laplacian term for manifold regularization gives the Laplacian RLS statement:
:<math>f^* = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \gamma_A \left\| f \right\|_K^2 + \frac{\gamma_I}{(\ell + u)^2} \mathbf{f}^{\mathrm{T}} L \mathbf{f}</math>
The representer theorem for manifold regularization again gives
:<math>f^*(x) = \sum_{i=1}^{\ell + u} \alpha_i K(x_i, x)</math>
and this yields an expression for the vector <math>\alpha</math>. Letting <math>K</math> be the kernel matrix as above, <math>Y</math> be the vector of data labels (padded with zeros for the unlabeled examples), and <math>J</math> be the block matrix <math>\begin{bmatrix} I_\ell & 0 \\ 0 & 0 \end{bmatrix}</math>:
:<math>\alpha = \underset{\alpha \in \mathbb{R}^{\ell + u}}{\operatorname{argmin}} \frac{1}{\ell} (Y - J K \alpha)^{\mathrm{T}} (Y - J K \alpha) + \gamma_A \alpha^{\mathrm{T}} K \alpha + \frac{\gamma_I}{(\ell + u)^2} \alpha^{\mathrm{T}} K L K \alpha</math>
with a solution of
:<math>\alpha = \left( J K + \gamma_A \ell I + \frac{\gamma_I \ell}{(\ell + u)^2} L K \right)^{-1} Y</math>
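A sketch of this closed form follows (NumPy; it assumes the first <math>\ell</math> points are the labeled ones and that <math>K</math> and <math>L</math> have already been computed, for example with helpers like those in the earlier sketches):
<syntaxhighlight lang="python">
import numpy as np

def laprls_fit(K, L, y_labeled, gamma_A, gamma_I):
    """Laplacian RLS coefficients:

    alpha = (J K + gamma_A * l * I + gamma_I * l / (l + u)^2 * L K)^{-1} Y

    where J zeroes out the rows for the unlabeled points and Y is the label
    vector padded with zeros.  Setting gamma_I = 0 recovers standard RLS.
    """
    n, l = K.shape[0], len(y_labeled)            # n = l + u
    J = np.zeros((n, n))
    J[np.arange(l), np.arange(l)] = 1.0          # assumes the first l points are labeled
    Y = np.concatenate([y_labeled, np.zeros(n - l)])
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n ** 2) * (L @ K)
    return np.linalg.solve(A, Y)

def laprls_predict(alpha, K_new):
    """f(x) = sum_i alpha_i K(x_i, x); K_new[t, i] = K(x_i, x_new_t)."""
    return K_new @ alpha
</syntaxhighlight>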
LapRLS has been applied to problems including sensor networks, medical imaging, object detection, spectroscopy, document classification, drug-protein interactions, and compressing images and videos.
Laplacian Support Vector Machines (LapSVM)
Support vector machines (SVMs) are a family of algorithms often used for classifying data into two or more groups, or ''classes''. Intuitively, an SVM draws a boundary between classes so that the closest labeled examples to the boundary are as far away as possible. This can be directly expressed as a quadratic program, but it is also equivalent to Tikhonov regularization with the hinge loss function, <math>V(f(x), y) = \max(0, 1 - y f(x))</math>:
:<math>f^* = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \frac{1}{\ell} \sum_{i=1}^{\ell} \max(0, 1 - y_i f(x_i)) + \gamma \left\| f \right\|_K^2</math>
Adding the intrinsic regularization term to this expression gives the LapSVM problem statement:
:<math>f^* = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \frac{1}{\ell} \sum_{i=1}^{\ell} \max(0, 1 - y_i f(x_i)) + \gamma_A \left\| f \right\|_K^2 + \frac{\gamma_I}{(\ell + u)^2} \mathbf{f}^{\mathrm{T}} L \mathbf{f}</math>
Again, the representer theorem allows the solution to be expressed in terms of the kernel evaluated at the data points:
:<math>f^*(x) = \sum_{i=1}^{\ell + u} \alpha_i K(x_i, x)</math>
The vector <math>\alpha</math> can be found by writing the problem as a quadratic program and solving the dual problem. Again letting <math>K</math> be the kernel matrix and <math>J</math> be the block matrix <math>\begin{bmatrix} I_\ell & 0 \end{bmatrix}</math>, the solution can be shown to be
:<math>\alpha = \left( 2 \gamma_A I + 2 \frac{\gamma_I}{(\ell + u)^2} L K \right)^{-1} J^{\mathrm{T}} Y \beta^*</math>
where <math>\beta^*</math> is the solution to the dual problem
:<math>\begin{align} \beta^* = \max_{\beta \in \mathbb{R}^\ell} & \sum_{i=1}^{\ell} \beta_i - \frac{1}{2} \beta^{\mathrm{T}} Q \beta \\ \text{subject to } & \sum_{i=1}^{\ell} \beta_i y_i = 0, \quad 0 \le \beta_i \le \frac{1}{\ell} \end{align}</math>
and <math>Q</math> is defined by
:<math>Q = Y J K \left( 2 \gamma_A I + 2 \frac{\gamma_I}{(\ell + u)^2} L K \right)^{-1} J^{\mathrm{T}} Y</math>
with <math>Y = \operatorname{diag}(y_1, \ldots, y_\ell)</math>.
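The dual is a small quadratic program over <math>\beta</math>. The sketch below solves it with a general-purpose constrained optimizer (SciPy's SLSQP) purely for illustration; practical LapSVM implementations use dedicated SVM/QP solvers, and the unregularized bias term of the full formulation is omitted here for brevity:
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

def lapsvm_fit(K, L, y_labeled, gamma_A, gamma_I):
    """LapSVM expansion coefficients alpha, obtained via the dual problem.

    Assumes the first l points are the labeled ones, with labels in {-1, +1}.
    """
    n, l = K.shape[0], len(y_labeled)            # n = l + u
    J = np.zeros((l, n))
    J[np.arange(l), np.arange(l)] = 1.0          # J = [I 0]
    Y = np.diag(y_labeled)
    M = 2 * gamma_A * np.eye(n) + (2 * gamma_I / n ** 2) * (L @ K)
    Minv_JT_Y = np.linalg.solve(M, J.T @ Y)      # (2*gA*I + 2*gI/(l+u)^2 * L K)^{-1} J^T Y
    Q = Y @ J @ K @ Minv_JT_Y

    # Dual: maximize sum(beta) - 0.5 * beta^T Q beta
    # subject to sum_i beta_i y_i = 0 and 0 <= beta_i <= 1/l.
    def neg_dual(beta):
        return -(beta.sum() - 0.5 * beta @ Q @ beta)

    result = minimize(
        neg_dual,
        np.full(l, 0.5 / l),
        method="SLSQP",
        bounds=[(0.0, 1.0 / l)] * l,
        constraints=[{"type": "eq", "fun": lambda b: b @ y_labeled}],
    )
    beta = result.x

    # Recover alpha from the optimal dual variables.
    return Minv_JT_Y @ beta
</syntaxhighlight>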
LapSVM has been applied to problems including geographical imaging, medical imaging, face recognition, machine maintenance, and brain–computer interfaces.
Limitations
* Manifold regularization assumes that data with different labels are not likely to be close together. This assumption is what allows the technique to draw information from unlabeled data, but it only applies to some problem domains. Depending on the structure of the data, it may be necessary to use a different semi-supervised or transductive learning algorithm.
* In some datasets, the intrinsic norm of a function <math>\left\| f \right\|_I</math> can be very close to the ambient norm <math>\left\| f \right\|_K</math>: for example, if the data consist of two classes that lie on perpendicular lines, the intrinsic norm will be equal to the ambient norm. In this case, unlabeled data have no effect on the solution learned by manifold regularization, even if the data fit the algorithm's assumption that the separator should be smooth. Approaches related to co-training have been proposed to address this limitation.
* If there are a very large number of unlabeled examples, the kernel matrix <math>K</math> becomes very large, and a manifold regularization algorithm may become prohibitively slow to compute. Online algorithms and sparse approximations of the manifold may help in this case.
See also
* Manifold learning
* Manifold hypothesis
* Semi-supervised learning
* Transduction (machine learning)
* Spectral graph theory
* Reproducing kernel Hilbert space
* Tikhonov regularization
* Differential geometry
References
{{Reflist}}
External links
Software
* The ManifoldLearn library and the Primal LapSVM library implement LapRLS and LapSVM in MATLAB.
* The Dlib library for C++ includes a linear manifold regularization function.