In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that changes the resulting answer to be "simpler". It is often used to obtain results for ill-posed problems or to prevent overfitting.

Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:

* Explicit regularization is regularization whenever one explicitly adds a term to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
* Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods (such as random forests and gradient boosted trees).

In explicit regularization, independent of the problem or model, there is always a data term, which corresponds to a likelihood of the measurement, and a regularization term, which corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses either to adhere more closely to the data or to enforce generalization (to prevent overfitting). There is a whole research branch dealing with all possible regularizations. The usual workflow is to try a specific regularization and then work out the probability density that corresponds to it in order to justify the choice. The choice can also be physically motivated by common sense or intuition.

In machine learning, the data term corresponds to the training data and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e. the error score of the trained model on the evaluation set rather than on the training data.

One of the earliest uses of regularization is Tikhonov regularization, related to the method of least squares.


Classification

Empirical learning of classifiers (from a finite data set) is always an underdetermined problem, because it attempts to infer a function of any x given only examples x_1, x_2, \dots, x_n.

A regularization term (or regularizer) R(f) is added to a loss function:
:\min_f \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f)
where V is an underlying loss function that describes the cost of predicting f(x) when the label is y, such as the square loss or hinge loss; and \lambda is a parameter which controls the importance of the regularization term. R(f) is typically chosen to impose a penalty on the complexity of f. Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.

A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution (as depicted in the figure above, where the green function, the simpler one, may be preferred). From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.

Regularization can serve multiple purposes, including learning simpler models, inducing models to be sparse and introducing group structure into the learning problem.

The same idea arose in many fields of science. A simple form of regularization applied to integral equations (Tikhonov regularization) is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.
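The regularized objective described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: a linear model f(x) = w \cdot x, the square loss for V, and the squared L_2 norm for R(f). The function name `regularized_objective` and the toy data are hypothetical.

```python
import numpy as np

def regularized_objective(w, X, y, lam):
    """Sum of square losses plus lam times the squared L2 norm of w."""
    residuals = X @ w - y                 # f(x_i) - y_i for every sample
    data_term = np.sum(residuals ** 2)    # sum_i V(f(x_i), y_i) with the square loss
    penalty = lam * np.dot(w, w)          # lambda * R(f), assuming R(f) = ||w||_2^2
    return data_term + penalty

# Toy data (made up for illustration): w fits the three samples exactly.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])

print(regularized_objective(w, X, y, lam=0.0))  # data term only
print(regularized_objective(w, X, y, lam=0.1))  # data term plus penalty on ||w||^2
```

Increasing `lam` makes the penalty dominate, so a minimizer is pushed toward smaller weights even at the cost of a worse fit.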


Generalization

Regularization can be motivated as a technique to improve the generalizability of a learned model.

The goal of this learning problem is to find a function that predicts the outcome (label) while minimizing the expected error over all possible inputs and labels. The expected error of a function f_n is:
:I[f_n] = \int_{X \times Y} V(f_n(x), y) \rho(x, y) \, dx \, dy
where X and Y are the domains of input data x and their labels y respectively.

Typically in learning problems, only a subset of input data and labels are available, measured with some noise. Therefore, the expected error is unmeasurable, and the best surrogate available is the empirical error over the N available samples:
:I_S[f_n] = \frac{1}{N} \sum_{i=1}^{N} V(f_n(\hat x_i), \hat y_i)
Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error. If measurements (e.g. of x_i) were made with noise, this model may suffer from overfitting and display poor expected error. Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.
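The gap between empirical and expected error can be illustrated with a small experiment. The sketch below assumes a target function sin(\pi x), Gaussian label noise, and polynomial models; none of these come from the text, and a large fresh sample stands in for the expected error, which cannot be measured directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_labels(x):
    """Labels from an assumed target sin(pi * x) plus Gaussian noise."""
    return np.sin(np.pi * x) + rng.normal(scale=0.3, size=x.shape)

def empirical_error(coeffs, x, y):
    """Mean square loss of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Eight training samples and a large fresh sample as a proxy for the expected error.
x_train = np.linspace(-1.0, 1.0, 8)
y_train = noisy_labels(x_train)
x_fresh = rng.uniform(-1.0, 1.0, size=2000)
y_fresh = noisy_labels(x_fresh)

# A degree-7 polynomial has 8 coefficients, so it can interpolate all 8 points.
coeffs = np.polyfit(x_train, y_train, deg=7)

train_error = empirical_error(coeffs, x_train, y_train)
fresh_error = empirical_error(coeffs, x_fresh, y_fresh)
print(train_error, fresh_error)
```

The unconstrained model drives the empirical error to essentially zero by fitting the noise, while its error on fresh data stays well above zero.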


Tikhonov regularization

These techniques are named for Andrey Nikolayevich Tikhonov, who applied regularization to integral equations and made important contributions in many other areas.

When learning a linear function f, characterized by an unknown vector w such that f(x) = w \cdot x, one can add the L_2-norm of the vector w to the loss expression in order to prefer solutions with smaller norms. Tikhonov regularization is one of the most common forms. It is also known as ridge regression. It is expressed as:
:\min_w \sum_{i=1}^{n} V(\hat x_i \cdot w, \hat y_i) + \lambda \|w\|_2^2,
where (\hat x_i, \hat y_i), \, 1 \leq i \leq n, represent the samples used for training.

In the case of a general function, the penalty is the norm of the function in its reproducing kernel Hilbert space:
:\min_f \sum_{i=1}^{n} V(f(\hat x_i), \hat y_i) + \lambda \|f\|_{\mathcal H}^2
As the L_2 norm is differentiable, learning can be advanced by gradient descent.
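Because the penalized objective is differentiable, plain gradient descent applies. The following is a minimal sketch assuming the square loss for V; the step-size rule, iteration count, and synthetic data are illustrative choices rather than part of the method as stated.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, num_iters=500):
    """Minimize sum_i (x_i . w - y_i)^2 + lam * ||w||_2^2 by gradient descent."""
    # The gradient is Lipschitz with constant 2 * (sigma_max(X)^2 + lam),
    # so a step of one over that constant guarantees descent.
    lipschitz = 2.0 * (np.linalg.norm(X, 2) ** 2 + lam)
    step = 1.0 / lipschitz
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = 2.0 * X.T @ (X @ w - y) + 2.0 * lam * w
        w = w - step * grad
    return w

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 1.0

w_gd = ridge_gradient_descent(X, y, lam)
# Reference: setting the gradient to zero gives (X^T X + lam I) w = X^T y.
w_ref = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.max(np.abs(w_gd - w_ref)))
```

Note that this objective uses the unnormalized sum of losses, so the reference solution involves \lambda rather than \lambda n.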


Tikhonov-regularized least squares

The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, the optimal w is the one for which the gradient of the loss function with respect to w is 0:
:\min_w \frac{1}{n} (\hat X w - Y)^T (\hat X w - Y) + \lambda \|w\|_2^2
:\nabla_w = \frac{2}{n} \hat X^T (\hat X w - Y) + 2 \lambda w
:0 = \hat X^T (\hat X w - Y) + n \lambda w    (first-order condition)
:w = (\hat X^T \hat X + \lambda n I)^{-1} (\hat X^T Y)
By construction of the optimization problem, other values of w give larger values for the loss function. This can be verified by examining the second derivative \nabla_{ww}.

During training, this algorithm takes O(d^3 + nd^2) time. The terms correspond to the matrix inversion and calculating X^T X, respectively. Testing takes O(nd) time.
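The closed-form solution can be written directly in NumPy. This is a sketch with hypothetical names and synthetic data; it solves the linear system rather than forming the inverse explicitly, which is a common numerical choice and not something the derivation requires.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Solve (X^T X + lam * n * I) w = X^T Y for w."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
lam = 0.5

w = ridge_closed_form(X, Y, lam)

# First-order condition: the gradient (2/n) X^T (X w - Y) + 2 lam w should vanish.
gradient = 2.0 / len(Y) * X.T @ (X @ w - Y) + 2.0 * lam * w
print(np.max(np.abs(gradient)))
```

Forming X^T X costs O(nd^2) and solving the d-by-d system costs O(d^3), matching the training cost quoted above.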


Early stopping

Early stopping can be viewed as regularization in time. Intuitively, a training procedure such as gradient descent tends to learn more and more complex functions with increasing iterations. By regularizing for time, model complexity can be controlled, improving generalization. Early stopping is implemented using one data set for training, one statistically independent data set for validation and another for testing. The model is trained until performance on the validation set no longer improves and then applied to the test set.
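A minimal sketch of such a procedure, assuming gradient descent on a least squares loss and a "patience" rule (stop once the validation loss has not improved for a fixed number of iterations). The patience rule, the function name, and the synthetic data are assumptions; the text only says training stops when validation performance no longer improves. A separate test set would be held out and evaluated once with the returned weights.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              step=0.1, max_iters=1000, patience=10):
    """Gradient descent on the training loss, keeping the best validation weights."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_iter = w.copy(), np.inf, 0
    val_history = []
    for t in range(max_iters):
        grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient of (1/2n)||Xw - y||^2
        w = w - step * grad
        val = np.mean((X_val @ w - y_val) ** 2)         # validation loss after this step
        val_history.append(val)
        if val < best_val:
            best_w, best_val, best_iter = w.copy(), val, t
        elif t - best_iter >= patience:
            break                                       # no improvement for `patience` steps
    return best_w, best_val, val_history

# Synthetic, noisy data with few samples per parameter (assumed for illustration).
rng = np.random.default_rng(0)
w_true = rng.normal(size=20)
X_tr = rng.normal(size=(40, 20))
y_tr = X_tr @ w_true + rng.normal(size=40)
X_val = rng.normal(size=(40, 20))
y_val = X_val @ w_true + rng.normal(size=40)

w_best, best_val, val_history = train_with_early_stopping(X_tr, y_tr, X_val, y_val)
print(len(val_history), best_val)
```

Returning the best weights seen, rather than the last ones, avoids keeping the iterations that ran after validation performance had already peaked.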


Theoretical motivation in least squares

Consider the finite approximation of the Neumann series for an invertible matrix A where \|I - A\| < 1:
:\sum_{i=0}^{T-1} (I - A)^i \approx A^{-1}
This can be used to approximate the analytical solution of unregularized least squares, if \gamma is introduced to ensure the norm is less than one:
:w_T = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left( I - \frac{\gamma}{n} \hat X^T \hat X \right)^i \hat X^T \hat Y
The exact solution to the unregularized least squares learning problem minimizes the empirical error, but may fail to generalize. By limiting T, the only free parameter in the algorithm above, the problem is regularized for time, which may improve its generalization.

The algorithm above is equivalent to restricting the number of gradient descent iterations for the empirical risk
:I_s[w] = \frac{1}{2n} \| \hat X w - \hat Y \|^2_{\mathbb R^n}
with the gradient descent update:
:\begin{align}
w_0 &= 0 \\
w_{t+1} &= \left( I - \frac{\gamma}{n} \hat X^T \hat X \right) w_t + \frac{\gamma}{n} \hat X^T \hat Y
\end{align}
The base case is trivial. The inductive case is proved as follows:
:\begin{align}
w_T &= \left( I - \frac{\gamma}{n} \hat X^T \hat X \right) \frac{\gamma}{n} \sum_{i=0}^{T-2} \left( I - \frac{\gamma}{n} \hat X^T \hat X \right)^i \hat X^T \hat Y + \frac{\gamma}{n} \hat X^T \hat Y \\
&= \frac{\gamma}{n} \sum_{i=1}^{T-1} \left( I - \frac{\gamma}{n} \hat X^T \hat X \right)^i \hat X^T \hat Y + \frac{\gamma}{n} \hat X^T \hat Y \\
&= \frac{\gamma}{n} \sum_{i=0}^{T-1} \left( I - \frac{\gamma}{n} \hat X^T \hat X \right)^i \hat X^T \hat Y
\end{align}
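The equivalence between the truncated Neumann series and T steps of gradient descent can be checked numerically. The sketch below uses random data and an assumed step size \gamma = n / \sigma_{\max}(\hat X)^2, which keeps the series condition satisfied when \hat X has full column rank; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 30, 4, 25
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# Assumed step size: the eigenvalues of (gamma / n) X^T X then lie in (0, 1].
gamma = n / np.linalg.norm(X, 2) ** 2
M = np.eye(d) - (gamma / n) * X.T @ X      # I - (gamma / n) X^T X
b = (gamma / n) * X.T @ Y                  # (gamma / n) X^T Y

# Truncated series: w_T = sum_{i=0}^{T-1} M^i b
w_series = np.zeros(d)
term = b.copy()
for _ in range(T):
    w_series += term
    term = M @ term

# Gradient descent: w_0 = 0, w_{t+1} = M w_t + b, run for T steps
w_gd = np.zeros(d)
for _ in range(T):
    w_gd = M @ w_gd + b

print(np.max(np.abs(w_series - w_gd)))
```

Both loops produce the same vector, so choosing T is the same as choosing how many gradient steps to take.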


Regularizers for sparsity

Assume that a dictionary \phi_j with dimension p is given such that a function in the function space can be expressed as:
:f(x) = \sum_{j=1}^{p} \phi_j(x) w_j
Enforcing a sparsity constraint on w can lead to simpler and more interpretable models. This is useful in many real-life applications such as computational biology. An example is developing a simple predictive test for a disease in order to minimize the cost of performing medical tests while maximizing predictive power.

A sensible sparsity constraint is the L_0 norm \|w\|_0, defined as the number of non-zero elements in w. Solving an L_0 regularized learning problem, however, has been demonstrated to be NP-hard.

The L_1 norm (see also Norms) can be used to approximate the optimal L_0 norm via convex relaxation. It can be shown that the L_1 norm induces sparsity. In the case of least squares, this problem is known as LASSO in statistics and basis pursuit in signal processing.
:\min_{w \in \mathbb R^p} \frac{1}{n} \| \hat X w - \hat Y \|^2 + \lambda \|w\|_1
L_1 regularization can occasionally produce non-unique solutions. A simple example is provided in the figure when the space of possible solutions lies on a 45 degree line. This can be problematic for certain applications, and is overcome by combining L_1 with L_2 regularization in elastic net regularization, which takes the following form:
:\min_{w \in \mathbb R^p} \frac{1}{n} \| \hat X w - \hat Y \|^2 + \lambda \left( \alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2 \right), \quad \alpha \in [0, 1]
Elastic net regularization tends to have a grouping effect, where correlated input features are assigned equal weights.

Elastic net regularization is commonly used in practice and is implemented in many machine learning libraries.
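The elastic net penalty interpolates between the two norms, which a short sketch makes concrete. The function name and example vector are hypothetical; only the penalty term is computed, not the full learning problem.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """lam * (alpha * ||w||_1 + (1 - alpha) * ||w||_2^2), with alpha in [0, 1]."""
    l1 = np.sum(np.abs(w))        # L1 norm, the sparsity-inducing part
    l2_squared = np.sum(w ** 2)   # squared L2 norm, the ridge part
    return lam * (alpha * l1 + (1.0 - alpha) * l2_squared)

w = np.array([3.0, -4.0])
print(elastic_net_penalty(w, lam=1.0, alpha=1.0))  # pure L1 penalty
print(elastic_net_penalty(w, lam=1.0, alpha=0.0))  # pure squared-L2 penalty
print(elastic_net_penalty(w, lam=1.0, alpha=0.5))  # equal mix of the two
```

Setting \alpha = 1 recovers the LASSO penalty and \alpha = 0 recovers the ridge penalty.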


Proximal methods

Although L_1 regularization does not lead to an NP-hard problem, the L_1 norm, while convex, is not differentiable everywhere because of the kink at x = 0. Subgradient methods, which rely on the subderivative, can be used to solve L_1 regularized learning problems. However, faster convergence can be achieved through proximal methods.

For a problem \min_w F(w) + R(w) such that F is convex, continuous, differentiable, with Lipschitz continuous gradient (such as the least squares loss function), and R is convex, continuous, and proper, the proximal method to solve the problem is as follows. First define the proximal operator
:\operatorname{prox}_R(v) = \operatorname{argmin}_w \left\{ R(w) + \frac{1}{2} \|w - v\|^2 \right\},
and then iterate
:w_{k+1} = \operatorname{prox}_{\gamma, R}(w_k - \gamma \nabla F(w_k))
The proximal method iteratively performs gradient descent and then projects the result back into the space permitted by R.

When R is the L_1 regularizer, the proximal operator is equivalent to the soft-thresholding operator,
:S_\lambda(v)_i = \begin{cases} v_i - \lambda, & \text{if } v_i > \lambda \\ 0, & \text{if } v_i \in [-\lambda, \lambda] \\ v_i + \lambda, & \text{if } v_i < -\lambda \end{cases}
This allows for efficient computation.
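The proximal iteration with soft-thresholding can be sketched as follows for the L_1 regularized least squares problem. The step size \gamma = 1/L, with L the Lipschitz constant of the gradient of F, is one common choice and is an assumption here, as are the function names and synthetic data.

```python
import numpy as np

def soft_threshold(v, threshold):
    """Shrink each entry of v toward zero by `threshold`, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def proximal_gradient_l1(X, Y, lam, num_iters=500):
    """Proximal gradient for (1/n) ||X w - Y||^2 + lam * ||w||_1."""
    n, d = X.shape
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of grad F
    gamma = 1.0 / lipschitz                           # assumed step size
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = 2.0 * X.T @ (X @ w - Y) / n            # gradient step on F ...
        w = soft_threshold(w - gamma * grad, gamma * lam)  # ... then the prox of gamma * R
    return w

def l1_objective(w, X, Y, lam):
    return np.sum((X @ w - Y) ** 2) / len(Y) + lam * np.sum(np.abs(w))

# Synthetic sparse problem (assumed for illustration): two active features out of ten.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
w_true = np.zeros(10)
w_true[[1, 6]] = [2.0, -3.0]
Y = X @ w_true + 0.05 * rng.normal(size=40)

w_hat = proximal_gradient_l1(X, Y, lam=0.1)
print(np.round(w_hat, 2))
```

Each iteration costs two matrix-vector products plus an elementwise shrinkage, which is where the efficiency of the method comes from.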


Group sparsity without overlaps

Groups of features can be regularized by a sparsity constraint, which can be useful for expressing certain prior knowledge in an optimization problem.

In the case of a linear model with non-overlapping known groups, a regularizer can be defined:
:R(w) = \sum_{g=1}^{G} \|w_g\|_2, where \|w_g\|_2 = \sqrt{\sum_{j=1}^{|G_g|} (w_g^j)^2}
This can be viewed as inducing a regularizer over the L_2 norm over members of each group followed by an L_1 norm over groups.

This can be solved by the proximal method, where the proximal operator is a block-wise soft-thresholding function:
:\operatorname{prox}_{\lambda, R, g}(w_g) = \begin{cases} \left( 1 - \frac{\lambda}{\|w_g\|_2} \right) w_g, & \text{if } \|w_g\|_2 > \lambda \\ 0, & \text{if } \|w_g\|_2 \leq \lambda \end{cases}
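The block-wise soft-thresholding operator is short enough to write out. In this sketch the groups are given as lists of indices, which is an assumed representation; the function name and example values are illustrative.

```python
import numpy as np

def block_soft_threshold(w, groups, lam):
    """Shrink each group of w toward zero; groups with norm <= lam become zero."""
    out = np.zeros_like(w)
    for idx in groups:
        norm = np.linalg.norm(w[idx])            # L2 norm of the group
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * w[idx]
        # otherwise the whole group stays at zero
    return out

w = np.array([3.0, 4.0, 0.1, 0.2, -1.0])
groups = [[0, 1], [2, 3], [4]]                   # non-overlapping index groups
print(block_soft_threshold(w, groups, lam=1.0))
```

The first group has norm 5 and is scaled by 0.8, while the other two groups have norm at most 1 and are set to zero as a whole.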


Group sparsity with overlaps

The algorithm described for group sparsity without overlaps can be applied to the case where groups do overlap, in certain situations. This will likely result in some groups with all zero elements, and other groups with some non-zero and some zero elements.

If it is desired to preserve the group structure, a new regularizer can be defined:
:R(w) = \inf \left\{ \sum_{g=1}^{G} \|w_g\|_2 : w = \sum_{g=1}^{G} \bar w_g \right\}
For each w_g, \bar w_g is defined as the vector such that the restriction of \bar w_g to the group g equals w_g and all other entries of \bar w_g are zero. The regularizer finds the optimal decomposition of w into parts. It can be viewed as duplicating all elements that exist in multiple groups. Learning problems with this regularizer can also be solved with the proximal method, with one complication: the proximal operator cannot be computed in closed form, but can be effectively solved iteratively, inducing an inner iteration within the proximal method iteration.


Regularizers for semi-supervised learning

When labels are more expensive to gather than input examples, semi-supervised learning can be useful. Regularizers have been designed to guide learning algorithms to learn models that respect the structure of unsupervised training samples. If a symmetric weight matrix W is given, a regularizer can be defined:
:R(f) = \sum_{i,j} w_{ij} (f(x_i) - f(x_j))^2
If W_{ij} encodes the closeness of points x_i and x_j under some distance metric (a large weight for nearby points), it is desirable that f(x_i) \approx f(x_j). This regularizer captures this intuition, and is equivalent (up to a constant factor) to:
:R(f) = \bar f^T L \bar f
where L = D - W is the Laplacian matrix of the graph induced by W, and D is the diagonal degree matrix with D_{ii} = \sum_j w_{ij}.

The optimization problem \min_{f \in \mathbb R^m} R(f), m = u + l, can be solved analytically if the constraint f(x_i) = y_i is applied for all supervised samples. The labeled part of the vector f is therefore obvious. The unlabeled part of f is solved for by:
:\min_{f_u \in \mathbb R^u} f^T L f = \min_{f_u \in \mathbb R^u} \left\{ f_u^T L_{uu} f_u + f_l^T L_{lu} f_u + f_u^T L_{ul} f_l \right\}
:\nabla_{f_u} = 2 L_{uu} f_u + 2 L_{ul} Y
:f_u = -L_{uu}^\dagger (L_{ul} Y)
The last line follows from setting the gradient to zero. Note that the pseudo-inverse can be taken because L_{ul} has the same range as L_{uu}.
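A small worked example may help here. The sketch assumes a chain graph of four points with unit weights, the two end points labeled 0 and 1, and the two middle points unlabeled. Setting the gradient 2 L_{uu} f_u + 2 L_{ul} Y to zero gives L_{uu} f_u = -L_{ul} Y, which is the system the code solves; the graph and labels are made up for illustration.

```python
import numpy as np

# Symmetric weight matrix of an assumed chain graph 0 - 1 - 2 - 3 with unit weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))     # diagonal degree matrix
L = D - W                      # graph Laplacian

labeled = np.array([0, 3])
unlabeled = np.array([1, 2])
Y = np.array([0.0, 1.0])       # labels of nodes 0 and 3

L_uu = L[np.ix_(unlabeled, unlabeled)]
L_ul = L[np.ix_(unlabeled, labeled)]

# Stationary point of the quadratic form: L_uu f_u = -L_ul Y
f_u = -np.linalg.pinv(L_uu) @ (L_ul @ Y)
print(f_u)

# Full label vector and the two forms of the regularizer.
f = np.zeros(4)
f[labeled] = Y
f[unlabeled] = f_u
pairwise = sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(4) for j in range(4))
quadratic = f @ L @ f
print(pairwise, quadratic)
```

On this chain the unlabeled values interpolate linearly between the two labels, and the pairwise sum is twice the quadratic form because each edge is counted in both directions.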


Regularizers for multitask learning

In the case of multitask learning, T problems are considered simultaneously, each related in some way. The goal is to learn T functions, ideally borrowing strength from the relatedness of tasks, that have predictive power. This is equivalent to learning the matrix W \in \mathbb R^{T \times D}.


Sparse regularizer on columns

:R(W) = \|W\|_{2,1} = \sum_{i=1}^{D} \|w_i\|_2
where w_i denotes the i-th column of W. This regularizer defines an L_2 norm on each column and an L_1 norm over all columns. It can be solved by proximal methods.


Nuclear norm regularization

:R(W) = \|\sigma(W)\|_1
where \sigma(W) is the vector of singular values in the singular value decomposition of W.
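The nuclear norm can be computed from the singular values directly. The sketch below uses a made-up diagonal matrix and compares the result with NumPy's built-in nuclear norm; the function name is hypothetical.

```python
import numpy as np

def nuclear_norm(W):
    """Sum of the singular values of W, i.e. the L1 norm of sigma(W)."""
    singular_values = np.linalg.svd(W, compute_uv=False)
    return np.sum(singular_values)

W = np.array([[3.0, 0.0],
              [0.0, -4.0]])
print(nuclear_norm(W))              # sum of singular values
print(np.linalg.norm(W, 'nuc'))     # NumPy's built-in nuclear norm
```

The example matrix has a negative entry on the diagonal: its singular values are non-negative, so the penalty differs from a plain sum of the diagonal entries.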


Mean-constrained regularization

:R(f_1 \cdots f_T) = \sum_{t=1}^{T} \left\| f_t - \frac{1}{T} \sum_{s=1}^{T} f_s \right\|_{H_k}^2
This regularizer constrains the functions learned for each task to be similar to the overall average of the functions across all tasks. This is useful for expressing prior information that each task is expected to share with each other task. An example is predicting blood iron levels measured at different times of the day, where each task represents an individual.
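For linear tasks the mean-constrained penalty reduces to a few array operations. This sketch assumes each task t is a linear function with weight vector given by row t of a T-by-D matrix, and it uses the Euclidean norm as a stand-in for the function-space norm; both are simplifying assumptions.

```python
import numpy as np

def mean_constrained_penalty(W):
    """Sum over tasks of the squared distance to the average task vector.

    W has shape (T, D): one row of linear weights per task (an assumption).
    """
    mean_task = W.mean(axis=0)               # (1/T) * sum_s w_s
    return np.sum((W - mean_task) ** 2)      # sum_t ||w_t - mean||^2

W_same = np.array([[1.0, 2.0], [1.0, 2.0]])  # identical tasks
W_diff = np.array([[0.0, 0.0], [2.0, 0.0]])  # tasks that disagree
print(mean_constrained_penalty(W_same))
print(mean_constrained_penalty(W_diff))
```

Identical tasks incur no penalty, and the penalty grows as the task vectors spread out around their mean.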


Clustered mean-constrained regularization

:R(f_1 \cdots f_T) = \sum_{r=1}^{C} \sum_{t \in I(r)} \left\| f_t - \frac{1}{|I(r)|} \sum_{s \in I(r)} f_s \right\|_{H_k}^2
where I(r) is a cluster of tasks.

This regularizer is similar to the mean-constrained regularizer, but instead enforces similarity between tasks within the same cluster. This can capture more complex prior information. This technique has been used to predict Netflix recommendations. A cluster would correspond to a group of people who share similar preferences.


Graph-based similarity

More generally than above, similarity between tasks can be defined by a function. The regularizer encourages the model to learn similar functions for similar tasks.
:R(f_1 \cdots f_T) = \sum_{t,s=1, \, t \neq s}^{T} \| f_t - f_s \|^2 M_{ts}
for a given symmetric similarity matrix M.
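The graph-based penalty can be sketched for linear tasks in the same way. The rows of the weight matrix stand for tasks and the Euclidean norm replaces the function-space norm, both as assumptions; the similarity matrix is made up for illustration.

```python
import numpy as np

def graph_similarity_penalty(W, M):
    """Sum over ordered pairs t != s of ||w_t - w_s||^2 * M[t, s].

    W has shape (T, D) with one row per task; M is a symmetric (T, T) matrix.
    """
    diffs = W[:, None, :] - W[None, :, :]        # all pairwise differences w_t - w_s
    squared_dists = np.sum(diffs ** 2, axis=-1)  # (T, T) matrix, zero on the diagonal
    return np.sum(squared_dists * M)             # the diagonal contributes nothing

W = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [0.0, 1.0]])
M = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
print(graph_similarity_penalty(W, M))
```

Tasks 0 and 2 have zero similarity, so their disagreement is not penalized, while each similar pair is counted once in each direction.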


Other uses of regularization in statistics and machine learning

Bayesian learning methods make use of a prior probability that (usually) gives lower probability to more complex models. Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC). Alternative methods of controlling overfitting not involving regularization include cross-validation.

Examples of applications of different methods of regularization to the linear model are:


See also

* Bayesian interpretation of regularization
* Bias–variance tradeoff
* Matrix regularization
* Regularization by spectral filtering
* Regularized least squares
* Lagrange multiplier
