
In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that changes the answer of a problem to a simpler one. It is often used to obtain results for ill-posed problems or to prevent overfitting.
Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:
* Explicit regularization is regularization whenever one explicitly adds a term to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
* Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods (such as random forests and gradient boosted trees).
In explicit regularization, independent of the problem or model, there is always a data term, which corresponds to a likelihood of the measurement, and a regularization term, which corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to adhere more closely to the data or to enforce generalization (to prevent overfitting). There is a whole research branch dealing with all possible regularizations. In practice, one usually tries a specific regularization and then figures out the probability density that corresponds to it, in order to justify the choice. A regularization can also be physically motivated by common sense or intuition.
In machine learning, the data term corresponds to the training data and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e. the error of the trained model on the evaluation set rather than on the training data.
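For concreteness, here is a minimal numerical sketch (with an assumed Gaussian noise scale sigma and prior scale tau, chosen only for illustration) showing that the negative log-posterior of a linear model with a Gaussian likelihood and a Gaussian prior is, up to a constant factor, a least-squares data term plus a squared-norm penalty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data (assumed for illustration).
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3                      # assumed Gaussian noise std (likelihood)
tau = 1.0                        # assumed Gaussian prior std on the weights
y = X @ w_true + sigma * rng.normal(size=n)

def neg_log_posterior(w):
    # -log p(y | X, w) - log p(w), dropping constants that do not depend on w.
    data_term = np.sum((y - X @ w) ** 2) / (2 * sigma**2)   # likelihood
    prior_term = np.sum(w ** 2) / (2 * tau**2)              # prior
    return data_term + prior_term

def penalized_least_squares(w, lam):
    # Least squares plus an explicit L2 penalty, lam * ||w||^2.
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# The two objectives agree up to the overall factor 1/(2 sigma^2)
# when lam = sigma^2 / tau^2, so their minimizers coincide.
lam = sigma**2 / tau**2
w = rng.normal(size=d)
print(np.isclose(neg_log_posterior(w),
                 penalized_least_squares(w, lam) / (2 * sigma**2)))  # True
```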
One of the earliest uses of regularization is Tikhonov regularization, related to the method of least squares.
Classification
Empirical learning of classifiers (from a finite data set) is always an underdetermined problem, because it attempts to infer a function of any $x$ given only examples $x_1, x_2, \ldots, x_n$.
A regularization term (or regularizer) $R(f)$ is added to a loss function:

$\min_f \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f),$

where $V$ is an underlying loss function that describes the cost of predicting $f(x)$ when the label is $y$, such as the square loss or hinge loss, and $\lambda$ is a parameter which controls the importance of the regularization term. $R(f)$ is typically chosen to impose a penalty on the complexity of $f$. Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.
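As an illustration, the sketch below evaluates this kind of objective for a linear model with the square loss and a squared-norm regularizer; the data and the values of $\lambda$ are invented for the example:

```python
import numpy as np

def square_loss(y_pred, y):
    # V(f(x), y): cost of predicting y_pred when the label is y.
    return (y_pred - y) ** 2

def regularized_objective(w, X, y, lam, V=square_loss):
    """Sum_i V(f(x_i), y_i) + lam * R(f) for a linear model f(x) = x . w,
    with R(f) = ||w||_2^2 as the complexity penalty."""
    data_term = np.sum(V(X @ w, y))
    penalty = lam * np.dot(w, w)
    return data_term + penalty

# Illustrative data; a larger lam pushes the optimum toward smaller-norm w.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.normal(size=20)
w = rng.normal(size=4)
print(regularized_objective(w, X, y, lam=0.0))   # pure data fit
print(regularized_objective(w, X, y, lam=10.0))  # fit + complexity penalty
```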
A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution: between two functions that explain the data comparably well, the simpler one may be preferred. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
Regularization can serve multiple purposes, including learning simpler models, inducing models to be sparse and introducing group structure into the learning problem.
The same idea arose in many fields of science. A simple form of regularization applied to integral equations (Tikhonov regularization) is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.
Generalization
Regularization can be motivated as a technique to improve the generalizability of a learned model.
The goal of this learning problem is to find a function that fits or predicts the outcome (label) and minimizes the expected error over all possible inputs and labels. The expected error of a function $f_n$ is:

$I[f_n] = \int_{X \times Y} V(f_n(x), y)\, \rho(x, y)\, dx\, dy,$

where $X$ and $Y$ are the domains of input data $x$ and their labels $y$ respectively.

Typically in learning problems, only a subset of input data and labels is available, measured with some noise. Therefore, the expected error is unmeasurable, and the best surrogate available is the empirical error over the $n$ available samples:

$I_S[f_n] = \frac{1}{n} \sum_{i=1}^{n} V(f_n(\hat x_i), \hat y_i).$
Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error. If measurements (e.g. of $x_i$) were made with noise, this model may suffer from overfitting and display poor expected error. Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.
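The following toy sketch (a sine target, a high-degree polynomial model, and noise and penalty levels chosen only for illustration) shows the typical effect: without a penalty the empirical error is driven to nearly zero while the held-out error is poor, and a small Tikhonov-style penalty trades a little empirical error for better generalization:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples of a smooth target; a high-degree polynomial can drive the
# empirical error to (near) zero but generalize poorly without regularization.
def target(x):
    return np.sin(2 * np.pi * x)

def design(x, degree):
    return np.vander(x, degree + 1, increasing=True)

n, degree = 12, 11
x_train = rng.uniform(0, 1, n)
y_train = target(x_train) + 0.2 * rng.normal(size=n)
x_test = rng.uniform(0, 1, 500)
y_test = target(x_test) + 0.2 * rng.normal(size=500)

def fit(lam):
    # Penalized least squares: (A^T A + lam * I)^-1 A^T y.
    A = design(x_train, degree)
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y_train)
    return w

for lam in (0.0, 1e-3):
    w = fit(lam)
    train_err = np.mean((design(x_train, degree) @ w - y_train) ** 2)
    test_err = np.mean((design(x_test, degree) @ w - y_test) ** 2)
    print(f"lambda={lam:g}  empirical error={train_err:.4f}  held-out error={test_err:.4f}")
```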
Tikhonov regularization
These techniques are named for Andrey Nikolayevich Tikhonov, who applied regularization to integral equations and made important contributions in many other areas.
When learning a linear function $f$, characterized by an unknown vector $w$ such that $f(x) = w \cdot x$, one can add the $L_2$-norm of the vector $w$ to the loss expression in order to prefer solutions with smaller norms. Tikhonov regularization is one of the most common forms. It is also known as ridge regression. It is expressed as:

$\min_w \sum_{i=1}^{n} V(\hat x_i \cdot w, \hat y_i) + \lambda \|w\|_2^2,$

where $(\hat x_i, \hat y_i),\ 1 \le i \le n,$ would represent samples used for training.
In the case of a general function, the problem is regularized with the norm of the function in its reproducing kernel Hilbert space:

$\min_f \sum_{i=1}^{n} V(f(\hat x_i), \hat y_i) + \lambda \|f\|_{\mathcal H}^2.$

As the $L_2$ norm is differentiable, learning can be advanced by gradient descent.
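A minimal sketch of that gradient descent, assuming a linear model, the square loss, and an arbitrary choice of $\lambda$ and step size:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 0.5  # assumed regularization strength

def objective(w):
    # sum_i (x_i . w - y_i)^2 + lam * ||w||_2^2  (Tikhonov / ridge objective)
    return np.sum((X @ w - y) ** 2) + lam * np.dot(w, w)

def gradient(w):
    # The squared L2 penalty is differentiable, so plain gradient descent applies.
    return 2 * X.T @ (X @ w - y) + 2 * lam * w

w = np.zeros(d)
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 + 2 * lam)  # safe step size
for _ in range(5000):
    w -= step * gradient(w)

print(objective(w))  # decreases toward the regularized optimum
```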
Tikhonov-regularized least squares
The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, the optimal $w$ is the one for which the gradient of the loss function with respect to $w$ is 0:

$\min_w \frac{1}{n} (\hat X w - \hat Y)^{\mathsf T} (\hat X w - \hat Y) + \lambda \|w\|_2^2$

$\nabla_w = \frac{2}{n} \hat X^{\mathsf T} (\hat X w - \hat Y) + 2 \lambda w$

$0 = \hat X^{\mathsf T} (\hat X w - \hat Y) + n \lambda w$ (first-order condition)

$w = (\hat X^{\mathsf T} \hat X + \lambda n I)^{-1} \hat X^{\mathsf T} \hat Y$

By construction of the optimization problem, other values of $w$ give larger values for the loss function. This can be verified by examining the second derivative $\nabla_{ww} = \frac{2}{n} \hat X^{\mathsf T} \hat X + 2 \lambda I$, which is positive definite.
During training, this algorithm takes $O(d^3 + nd^2)$ time. The terms correspond to the matrix inversion and calculating $\hat X^{\mathsf T} \hat X$, respectively. Testing takes $O(nd)$ time.
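The closed-form solution and the complexity claim can be checked with a short sketch (synthetic data and an assumed $\lambda$); the $d \times d$ solve corresponds to the $O(d^3)$ term and forming $\hat X^{\mathsf T} \hat X$ to the $O(nd^2)$ term:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 0.1  # assumed regularization strength

# Closed-form Tikhonov-regularized least squares:
#   w = (X^T X + lam * n * I)^(-1) X^T y
# X^T X costs O(n d^2); solving the d x d system costs O(d^3).
A = X.T @ X + lam * n * np.eye(d)      # d x d
b = X.T @ y                            # d-vector
w = np.linalg.solve(A, b)

# Check the first-order condition: the gradient of
#   (1/n)||Xw - y||^2 + lam ||w||^2  vanishes at the solution.
grad = (2 / n) * X.T @ (X @ w - y) + 2 * lam * w
print(np.allclose(grad, 0))  # True

# Prediction ("testing") is a matrix-vector product, O(n d) for n points.
y_pred = X @ w
```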
Early stopping
Early stopping can be viewed as regularization in time. Intuitively, a training procedure such as gradient descent tends to learn more and more complex functions with increasing iterations. By regularizing for time, model complexity can be controlled, improving generalization.
Early stopping is implemented using one data set for training, one statistically independent data set for validation and another for testing. The model is trained until performance on the validation set no longer improves and then applied to the test set.
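A minimal sketch of this procedure for unregularized linear least squares trained by gradient descent, with an assumed patience threshold for deciding that validation performance has stopped improving:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 30
w_true = np.concatenate([rng.normal(size=5), np.zeros(d - 5)])

# Three statistically independent splits drawn from the same distribution:
# one for training, one for validation, one for testing.
def make_split(n, noise=0.5):
    X = rng.normal(size=(n, d))
    y = X @ w_true + noise * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_split(40)
X_val, y_val = make_split(40)
X_te, y_te = make_split(200)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(d)
step = len(y_tr) / (2 * np.linalg.norm(X_tr, 2) ** 2)  # safe step for (1/n)||Xw-y||^2
best_w, best_val, patience = w.copy(), np.inf, 0

for t in range(10_000):
    w -= step * (2 / len(y_tr)) * X_tr.T @ (X_tr @ w - y_tr)  # unregularized GD step
    val = mse(X_val, y_val, w)
    if val < best_val:
        best_w, best_val, patience = w.copy(), val, 0
    else:
        patience += 1
        if patience >= 20:  # validation error has stopped improving
            break

print("stopped after", t + 1, "iterations")
print("held-out test error of the early-stopped model:", mse(X_te, y_te, best_w))
```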
Theoretical motivation in least squares
Consider the finite approximation of the Neumann series for an invertible matrix $A$ with $\|I - A\| < 1$:

$\sum_{i=0}^{T-1} (I - A)^i \approx A^{-1}.$

This can be used to approximate the analytical solution of unregularized least squares if $\gamma$ is introduced to ensure that the norm is less than one:

$w_T = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left(I - \frac{\gamma}{n} \hat X^{\mathsf T} \hat X\right)^i \hat X^{\mathsf T} \hat Y.$

The exact solution to the unregularized least squares learning problem minimizes the empirical error, but may fail to generalize. By limiting $T$, the only free parameter in the algorithm above, the problem is regularized for time, which may improve its generalization.
The algorithm above is equivalent to restricting the number of gradient descent iterations for the empirical risk

$I_s[w] = \frac{1}{2n} \left\|\hat X w - \hat Y\right\|^2_{\mathbb{R}^n}$

with the gradient descent update

$w_0 = 0, \qquad w_{t+1} = \left(I - \frac{\gamma}{n} \hat X^{\mathsf T} \hat X\right) w_t + \frac{\gamma}{n} \hat X^{\mathsf T} \hat Y.$

The base case is trivial. The inductive case is proved as follows:

$w_T = \left(I - \frac{\gamma}{n} \hat X^{\mathsf T} \hat X\right) \frac{\gamma}{n} \sum_{i=0}^{T-2} \left(I - \frac{\gamma}{n} \hat X^{\mathsf T} \hat X\right)^i \hat X^{\mathsf T} \hat Y + \frac{\gamma}{n} \hat X^{\mathsf T} \hat Y = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left(I - \frac{\gamma}{n} \hat X^{\mathsf T} \hat X\right)^i \hat X^{\mathsf T} \hat Y.$
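The equivalence can be checked numerically: the truncated Neumann sum and $T$ gradient descent steps from $w_0 = 0$ produce the same iterate, and both approach the exact least-squares solution as $T$ grows (the data, step size $\gamma$, and $T$ below are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Step size gamma chosen so that ||I - (gamma/n) X^T X|| < 1 (Neumann condition).
gamma = n / (np.linalg.norm(X, 2) ** 2 + 1.0)
M = np.eye(d) - (gamma / n) * X.T @ X
b = (gamma / n) * X.T @ y

T = 50

# Truncated Neumann series: w_T = (gamma/n) sum_{i=0}^{T-1} (I - (gamma/n) X^T X)^i X^T y
w_neumann = np.zeros(d)
P = np.eye(d)
for i in range(T):
    w_neumann += P @ b
    P = P @ M

# T iterations of gradient descent on (1/2n)||Xw - y||^2, starting from w_0 = 0.
w_gd = np.zeros(d)
for t in range(T):
    w_gd = M @ w_gd + b

print(np.allclose(w_neumann, w_gd))  # True: the two constructions coincide

# As T grows, both approach the exact (unregularized) least-squares solution;
# keeping T small plays the role of the regularization parameter.
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(w_gd - w_exact))
```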
Regularizers for sparsity
Assume that a dictionary $\phi_j$ with dimension $p$ is given such that a function in the function space can be expressed as:

$f(x) = \sum_{j=1}^{p} \phi_j(x) w_j.$

Enforcing a sparsity constraint on $w$ can lead to simpler and more interpretable models. This is useful in many real-life applications such as computational biology. An example is developing a simple predictive test for a disease in order to minimize the cost of performing medical tests while maximizing predictive power.
A sensible sparsity constraint is the $L_0$ norm $\|w\|_0$, defined as the number of non-zero elements in $w$. Solving an $L_0$-regularized learning problem, however, has been demonstrated to be NP-hard.
The $L_1$ norm (see also Norms) can be used to approximate the optimal $L_0$ norm via convex relaxation. It can be shown that the $L_1$ norm induces sparsity. In the case of least squares, this problem is known as LASSO in statistics and basis pursuit in signal processing:

$\min_{w \in \mathbb{R}^p} \frac{1}{n} \left\|\hat X w - \hat Y\right\|^2 + \lambda \|w\|_1.$
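One standard way to solve this problem is proximal gradient descent (ISTA), whose proximal step is soft-thresholding; the sketch below uses synthetic data and an assumed $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 20
X = rng.normal(size=(n, p))
w_true = np.zeros(p); w_true[:3] = [3.0, -2.0, 1.5]   # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)
lam = 0.1  # assumed regularization strength

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrinks coefficients toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# ISTA (proximal gradient) for (1/n)||Xw - y||^2 + lam * ||w||_1.
step = n / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
w = np.zeros(p)
for _ in range(2000):
    grad = (2 / n) * X.T @ (X @ w - y)
    w = soft_threshold(w - step * grad, step * lam)

print("non-zero coefficients:", np.flatnonzero(np.abs(w) > 1e-8))
```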
$L_1$ regularization can occasionally produce non-unique solutions; a simple example occurs when two features are identical, so that the data term is minimized along an entire line segment of candidate solutions with the same $L_1$ norm. This can be problematic for certain applications, and is overcome by combining $L_1$ with $L_2$ regularization in elastic net regularization, which takes the following form:

$\min_{w \in \mathbb{R}^p} \frac{1}{n} \left\|\hat X w - \hat Y\right\|^2 + \lambda \left(\alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2\right), \qquad \alpha \in [0, 1].$
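A sketch of the same proximal-gradient approach adapted to the elastic net objective, on data with two identical features (so that the pure $L_1$ problem would not have a unique solution); the values of $\lambda$ and $\alpha$ are assumed:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 20
# Two identical columns: the L1-only problem then has non-unique solutions.
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0]
y = X @ np.concatenate([[2.0, 2.0, -1.5], np.zeros(p - 3)]) + 0.1 * rng.normal(size=n)

lam, alpha = 0.1, 0.5  # assumed values of lambda and alpha

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Proximal gradient for the elastic net objective
#   (1/n)||Xw - y||^2 + lam * (alpha * ||w||_1 + (1 - alpha) * ||w||_2^2),
# treating the squared-norm part as smooth and the L1 part via its prox.
L = 2 * np.linalg.norm(X, 2) ** 2 / n + 2 * lam * (1 - alpha)
step = 1.0 / L
w = np.zeros(p)
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ w - y) + 2 * lam * (1 - alpha) * w
    w = soft_threshold(w - step * grad, step * lam * alpha)

# The strictly convex L2 part makes the solution unique and tends to give the
# two identical features equal weight.
print(w[:4])
```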