Regularized least squares
Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution. RLS is used for two main reasons. The first comes up when the number of variables in the linear system exceeds the number of observations. In such settings, the ordinary least-squares problem is ill-posed and is therefore impossible to fit because the associated optimization problem has infinitely many solutions. RLS allows the introduction of further constraints that uniquely determine the solution. The second reason for using RLS arises when the learned model suffers from poor generalization. RLS can be used in such cases to improve the generalizability of the model by constraining it at training time. This constraint can either force the solution to be "sparse" in some way or to reflect other prior knowledge about the problem, such as information about correlations between features. A Bayesian understanding of this can be reached by showing that RLS methods are often equivalent to priors on the solution to the least-squares problem.


General formulation

Consider a learning setting given by a probabilistic space (X \times Y, \rho(X,Y)), Y \in \mathbb{R}. Let S = \{(x_i, y_i)\}_{i=1}^n denote a training set of n pairs, i.i.d. with respect to the joint distribution \rho. Let V: Y \times \mathbb{R} \to [0, \infty) denote a loss function, so that the expected risk \varepsilon(f) = \int V(y, f(x)) \, d\rho(x,y) is well defined. The main goal is to minimize the expected risk: \inf_{f \in \mathcal{F}} \varepsilon(f). Since the problem cannot be solved exactly, there is a need to specify how to measure the quality of a solution. A good learning algorithm should provide an estimator with a small risk. As the joint distribution \rho is typically unknown, the empirical risk is taken. For regularized least squares the square loss function is introduced: \varepsilon(f) = \frac{1}{n} \sum_{i=1}^n V(y_i, f(x_i)) = \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2. However, if the functions are from a relatively unconstrained space, such as the set of square-integrable functions on X, this approach may overfit the training data and lead to poor generalization. Thus, the method should somehow constrain or penalize the complexity of the function f. In RLS, this is accomplished by choosing functions from a reproducing kernel Hilbert space (RKHS) \mathcal{H}, and adding a regularization term to the objective function, proportional to the norm of the function in \mathcal{H}: \inf_{f \in \mathcal{H}} \varepsilon(f) + \lambda R(f), \quad \lambda > 0.
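
As a concrete illustration of the objective above, the following sketch evaluates the regularized empirical risk for a linear hypothesis f(x) = w^\mathsf{T} x; the function and variable names, the synthetic data, and the chosen value of \lambda are illustrative and not part of the formulation.

```python
import numpy as np

def regularized_empirical_risk(w, X, y, lam):
    """Empirical square loss plus a ridge-style penalty lam * ||w||^2.

    Minimal sketch for a linear model f(x) = w^T x; names and data are illustrative.
    """
    residuals = y - X @ w                      # y_i - f(x_i)
    empirical_risk = np.mean(residuals ** 2)   # (1/n) * sum of squared errors
    penalty = lam * np.dot(w, w)               # lambda * R(f), here R(f) = ||w||^2
    return empirical_risk + penalty

# Tiny usage example with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(regularized_empirical_risk(np.zeros(3), X, y, lam=0.1))
```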


Kernel formulation


Definition of RKHS

An RKHS can be defined by a symmetric positive-definite kernel function K(x,z) with the reproducing property: \langle K_x, f \rangle_{\mathcal{H}} = f(x), where K_x(z) = K(x,z). The RKHS for a kernel K consists of the completion of the space of functions spanned by \left\{ K_x : x \in X \right\}: f(x) = \sum_{i=1}^n \alpha_i K_{x_i}(x), \; f \in \mathcal{H}, where all \alpha_i are real numbers. Some commonly used kernels include the linear kernel, inducing the space of linear functions: K(x,z) = x^\mathsf{T} z, the polynomial kernel, inducing the space of polynomial functions of order d: K(x,z) = \left(x^\mathsf{T} z + 1\right)^d, and the Gaussian kernel: K(x,z) = e^{-\frac{\|x - z\|^2}{\sigma^2}}. Note that for an arbitrary loss function V, this approach defines a general class of algorithms named Tikhonov regularization. For instance, using the hinge loss leads to the support vector machine algorithm, and using the epsilon-insensitive loss leads to support vector regression.
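
The kernels above are simple to implement directly. The sketch below, in NumPy, computes them and assembles a Gram (kernel) matrix; the helper names and the bandwidth and degree defaults are illustrative choices.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3):
    return (x @ z + 1.0) ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def kernel_matrix(X, kernel):
    """Gram matrix K with K[i, j] = kernel(x_i, x_j)."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Example: Gaussian kernel matrix for a few random points
X = np.random.default_rng(0).normal(size=(5, 2))
K = kernel_matrix(X, lambda a, b: gaussian_kernel(a, b, sigma=0.5))
```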


Arbitrary kernel

The representer theorem guarantees that the solution can be written as: f(x) = \sum_{i=1}^n c_i K(x_i, x) for some c \in \mathbb{R}^n. The minimization problem can be expressed as: \min_{c \in \mathbb{R}^n} \frac{1}{n} \left\| Y - Kc \right\|^2_{\mathbb{R}^n} + \lambda \left\| f \right\|^2_H, where, with some abuse of notation, the i,j entry of the kernel matrix K (as opposed to the kernel function K(\cdot, \cdot)) is K(x_i, x_j). For such a function, \left\| f \right\|^2_H = \langle f, f \rangle_H = \left\langle \sum_{i=1}^n c_i K(x_i, \cdot), \sum_{j=1}^n c_j K(x_j, \cdot) \right\rangle_H = \sum_{i=1}^n \sum_{j=1}^n c_i c_j \left\langle K(x_i, \cdot), K(x_j, \cdot) \right\rangle_H = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = c^\mathsf{T} K c. The following minimization problem can be obtained: \min_{c \in \mathbb{R}^n} \frac{1}{n} \left\| Y - Kc \right\|^2_{\mathbb{R}^n} + \lambda c^\mathsf{T} K c. As the sum of convex functions is convex, the solution is unique and its minimum can be found by setting the gradient with respect to c to 0: -\frac{1}{n} K \left( Y - Kc \right) + \lambda K c = 0 \Rightarrow K \left( K + \lambda n I \right) c = K Y \Rightarrow c = \left( K + \lambda n I \right)^{-1} Y, where c \in \mathbb{R}^n.
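
A minimal NumPy sketch of this solution: it solves (K + \lambda n I)c = Y for the coefficient vector c and then evaluates the kernel expansion f(x_*) = \sum_i c_i K(x_i, x_*) at new points (the same formula appears in the Prediction subsection below). The kernel matrix is assumed to be precomputed, for example with the helper from the previous example; the function names are illustrative.

```python
import numpy as np

def fit_kernel_rls(K, y, lam):
    """Solve (K + lam * n * I) c = y for the kernel RLS coefficients c."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict_kernel_rls(K_test_train, c):
    """Prediction f(x_*) = sum_i c_i K(x_i, x_*), vectorised over test points.

    K_test_train[t, i] = K(x_i, x_t) for test point x_t and training point x_i.
    """
    return K_test_train @ c
```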


Complexity

The complexity of training is basically the cost of computing the kernel matrix plus the cost of solving the linear system, which is roughly O(n^3). The computation of the kernel matrix for the linear or Gaussian kernel is O(n^2 D). The complexity of testing is O(n).


Prediction

The prediction at a new test point x_* is: f(x_*) = \sum_{i=1}^n c_i K(x_i, x_*) = K(X, X_*)^\mathsf{T} c


Linear kernel

For convenience a vector notation is introduced. Let X be an n \times d matrix, where the rows are input vectors, and Y an n \times 1 vector where the entries are corresponding outputs. In terms of vectors, the kernel matrix can be written as K = X X^\mathsf{T}. The learning function can be written as: f(x_*) = K_{x_*} c = x_*^\mathsf{T} X^\mathsf{T} c = x_*^\mathsf{T} w. Here we define w = X^\mathsf{T} c, w \in \Reals^d. The objective function can be rewritten as: \frac{1}{n} \left\| Y - Kc \right\|^2_{\mathbb{R}^n} + \lambda c^\mathsf{T} K c = \frac{1}{n} \left\| Y - X X^\mathsf{T} c \right\|^2_{\mathbb{R}^n} + \lambda c^\mathsf{T} X X^\mathsf{T} c = \frac{1}{n} \left\| Y - X w \right\|^2_{\mathbb{R}^n} + \lambda \left\| w \right\|^2_{\mathbb{R}^d}. The first term is the objective function from ordinary least squares (OLS) regression, corresponding to the residual sum of squares. The second term is a regularization term, not present in OLS, which penalizes large w values. As a smooth finite-dimensional problem is considered, it is possible to apply standard calculus tools. In order to minimize the objective function, the gradient is calculated with respect to w and set to zero: X^\mathsf{T} X w - X^\mathsf{T} Y + \lambda n w = 0, so w = \left( X^\mathsf{T} X + \lambda n I \right)^{-1} X^\mathsf{T} Y. This solution closely resembles that of standard linear regression, with an extra term \lambda n I. If the assumptions of OLS regression hold, the solution w = \left( X^\mathsf{T} X \right)^{-1} X^\mathsf{T} Y, with \lambda = 0, is an unbiased estimator, and is the minimum-variance linear unbiased estimator, according to the Gauss–Markov theorem. The term \lambda n I therefore leads to a biased solution; however, it also tends to reduce variance. This is easy to see, as the covariance matrix of the w-values is proportional to \left( X^\mathsf{T} X + \lambda n I \right)^{-1}, and therefore large values of \lambda will lead to lower variance. Therefore, manipulating \lambda corresponds to trading off bias and variance. For problems with high-variance w estimates, such as cases with relatively small n or with correlated regressors, the optimal prediction accuracy may be obtained by using a nonzero \lambda, and thus introducing some bias to reduce variance. Furthermore, it is not uncommon in machine learning to have cases where n < d, in which case X^\mathsf{T} X is rank-deficient, and a nonzero \lambda is necessary to compute \left( X^\mathsf{T} X + \lambda n I \right)^{-1}.
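
A short sketch of the closed-form solution above; the helper name and the synthetic data are illustrative, and the formula w = (X^\mathsf{T} X + \lambda n I)^{-1} X^\mathsf{T} Y is implemented as a linear solve rather than an explicit matrix inverse.

```python
import numpy as np

def fit_linear_rls(X, y, lam):
    """w = (X^T X + lam * n * I)^{-1} X^T y, matching the formula above.

    Minimal dense-matrix sketch; a Cholesky-based solver (see the Complexity
    subsection below) is a common alternative for the same system.
    """
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# Usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
w_hat = fit_linear_rls(X, y, lam=0.1)
```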


Complexity

The parameter \lambda controls the invertibility of the matrix X^\mathsf{T} X + \lambda n I. Several methods can be used to solve the above linear system, Cholesky decomposition being probably the method of choice, since the matrix X^\mathsf{T} X + \lambda n I is symmetric and positive definite. The complexity of this method is O(n D^2) for training and O(D) for testing. The cost O(n D^2) is essentially that of computing X^\mathsf{T} X, whereas the inverse computation (or rather the solution of the linear system) is roughly O(D^3).
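
Assuming SciPy is available, the Cholesky-based solve described above can be sketched as follows; the function name is illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def fit_linear_rls_cholesky(X, y, lam):
    """Solve (X^T X + lam * n * I) w = X^T y via a Cholesky factorisation.

    Exploits the fact that the system matrix is symmetric positive definite.
    """
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # O(n d^2) to form
    factorisation = cho_factor(A)        # O(d^3) Cholesky factorisation
    return cho_solve(factorisation, X.T @ y)
```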


Feature maps and Mercer's theorem

In this section it will be shown how to extend RLS to any kind of reproducing kernel K. Instead of a linear kernel, a feature map \Phi: X \to F is considered for some Hilbert space F, called the feature space. In this case the kernel is defined as: K(x,x') = \langle \Phi(x), \Phi(x') \rangle_F. The matrix X is now replaced by the new data matrix \Phi, where \Phi_{ij} = \varphi_j(x_i), the j-th component of \varphi(x_i). It means that for a given training set K = \Phi \Phi^\mathsf{T}. Thus, the objective function can be written as \min_{c \in \mathbb{R}^n} \frac{1}{n} \left\| Y - \Phi \Phi^\mathsf{T} c \right\|^2_{\mathbb{R}^n} + \lambda c^\mathsf{T} \Phi \Phi^\mathsf{T} c. This approach is known as the kernel trick. This technique can significantly simplify the computational operations. If F is high-dimensional, computing \varphi(x_i) may be rather intensive. If the explicit form of the kernel function is known, we just need to compute and store the n \times n kernel matrix K. In fact, the Hilbert space F need not be isomorphic to \mathbb{R}^m, and can be infinite-dimensional. This follows from Mercer's theorem, which states that a continuous, symmetric, positive definite kernel function can be expressed as K(x,z) = \sum_{i=1}^\infty \sigma_i e_i(x) e_i(z), where the e_i(x) form an orthonormal basis for \ell^2(X), and \sigma_i \in \mathbb{R}. If a feature map is defined with components \varphi_i(x) = \sqrt{\sigma_i} e_i(x), it follows that K(x,z) = \langle \varphi(x), \varphi(z) \rangle. This demonstrates that any kernel can be associated with a feature map, and that RLS generally consists of linear RLS performed in some possibly higher-dimensional feature space. While Mercer's theorem shows one feature map that can be associated with a kernel, in fact multiple feature maps can be associated with a given reproducing kernel. For instance, the map \varphi(x) = K_x satisfies the property K(x,z) = \langle \varphi(x), \varphi(z) \rangle for an arbitrary reproducing kernel.
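
The following sketch illustrates the kernel trick for the degree-2 polynomial kernel on \mathbb{R}^2, for which an explicit six-dimensional feature map is known; the particular ordering of the feature components is one possible choice among several equivalent ones.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2,
    K(x, z) = (x^T z + 1)^2; any map with the same inner products works."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

kernel_value = (x @ z + 1.0) ** 2   # kernel trick: inner product computed in R^2
feature_value = phi(x) @ phi(z)     # explicit feature map: inner product in R^6
assert np.isclose(kernel_value, feature_value)
```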


Bayesian interpretation

Least squares can be viewed as a likelihood maximization under an assumption of normally distributed residuals. This is because the exponent of the Gaussian distribution is quadratic in the data, and so is the least-squares objective function. In this framework, the regularization terms of RLS can be understood to be encoding priors on w. For instance, Tikhonov regularization corresponds to a normally distributed prior on w that is centered at 0. To see this, first note that the OLS objective is proportional to the log-likelihood function when each sampled y^i is normally distributed around w^\mathsf{T} x^i. Then observe that a normal prior on w centered at 0 has a log-probability of the form \log P(w) = q - \alpha \sum_{j=1}^d w_j^2, where q and \alpha are constants that depend on the variance of the prior and are independent of w. Thus, minimizing the logarithm of the likelihood times the prior is equivalent to minimizing the sum of the OLS loss function and the ridge regression regularization term. This gives a more intuitive interpretation for why Tikhonov regularization leads to a unique solution to the least-squares problem: there are infinitely many vectors w satisfying the constraints obtained from the data, but since we come to the problem with a prior belief that w is normally distributed around the origin, we will end up choosing a solution with this constraint in mind. Other regularization methods correspond to different priors. See the list below for more details.
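
Under the additional standard assumptions of Gaussian noise with variance \sigma^2 and a zero-mean Gaussian prior with variance \tau^2 (symbols not used elsewhere in this article), the argument can be written out as follows.

```latex
% Likelihood and prior (assumed Gaussian, with variances \sigma^2 and \tau^2)
p(Y \mid X, w) \propto \exp\Big(-\tfrac{1}{2\sigma^2}\sum_{i=1}^n (y_i - w^\mathsf{T} x_i)^2\Big),
\qquad
p(w) \propto \exp\Big(-\tfrac{1}{2\tau^2}\|w\|_2^2\Big)

% Negative log-posterior, up to an additive constant
-\log p(w \mid X, Y)
  = \tfrac{1}{2\sigma^2}\sum_{i=1}^n (y_i - w^\mathsf{T} x_i)^2
  + \tfrac{1}{2\tau^2}\|w\|_2^2 + \text{const}

% which is minimized by the same w as the ridge objective
% \tfrac{1}{n}\|Y - Xw\|_2^2 + \lambda \|w\|_2^2 with
\lambda = \frac{\sigma^2}{n\,\tau^2}
```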


Specific examples


Ridge regression (or Tikhonov regularization)

One particularly common choice for the penalty function R is the squared \ell_2 norm, i.e., R(w) = \sum_{j=1}^d w_j^2, and the solution is found as \hat{w} = \operatorname{argmin}_{w} \frac{1}{n} \left\| Y - X w \right\|^2_2 + \lambda \sum_{j=1}^d \left| w_j \right|^2. The most common names for this are Tikhonov regularization and ridge regression. It admits a closed-form solution for w: \hat{w} = \left( \frac{1}{n} X^\mathsf{T} X + \lambda I \right)^{-1} \frac{1}{n} X^\mathsf{T} Y = \left( X^\mathsf{T} X + n \lambda I \right)^{-1} X^\mathsf{T} Y. The name ridge regression alludes to the fact that the \lambda I term adds positive entries along the diagonal "ridge" of the sample covariance matrix \frac{1}{n} X^\mathsf{T} X. When \lambda = 0, i.e., in the case of ordinary least squares, the condition that d > n causes the sample covariance matrix \frac{1}{n} X^\mathsf{T} X to not have full rank, so it cannot be inverted to yield a unique solution. This is why there can be an infinitude of solutions to the ordinary least squares problem when d > n. However, when \lambda > 0, i.e., when ridge regression is used, the addition of \lambda I to the sample covariance matrix ensures that all of its eigenvalues will be strictly greater than 0. In other words, it becomes invertible, and the solution is then unique. Compared to ordinary least squares, ridge regression is not unbiased. It accepts bias to reduce variance and the mean squared error.


Simplifications and automatic regularization

If we want to find \hat{w} for different values of the regularization coefficient \lambda (which we denote \hat{w}(\lambda)), we may use the eigenvalue decomposition of the covariance matrix \frac{1}{n} X^\mathsf{T} X = Q \operatorname{diag}(\alpha_1, \ldots, \alpha_d) Q^\mathsf{T}, where the columns of Q \in \mathbb{R}^{d \times d} are the eigenvectors of \frac{1}{n} X^\mathsf{T} X and \alpha_1, \ldots, \alpha_d are its d eigenvalues. The solution is then given by \hat{w}(\lambda) = Q \operatorname{diag}(\alpha_1 + \lambda, \ldots, \alpha_d + \lambda)^{-1} Z, where Z = \frac{1}{n} Q^\mathsf{T} X^\mathsf{T} Y = [Z_1, \ldots, Z_d]^\mathsf{T}. Using the above results, a maximum likelihood estimate of \lambda can be obtained as the fixed point of an update rule expressed in terms of the eigenvalues \alpha_j and the components Z_j. This algorithm, for automatic (as opposed to heuristic) regularization, is obtained as a fixed-point solution in the maximum likelihood estimation of the parameters. Although guarantees of convergence are not provided, the examples indicate that a satisfactory solution may be obtained after a couple of iterations. The eigenvalue decomposition simplifies the derivation of the algorithm and also simplifies the calculations; for example, \left\| \hat{w}(\lambda) \right\|^2 = \sum_{j=1}^d \frac{Z_j^2}{(\alpha_j + \lambda)^2}, and the training residual \frac{1}{n} \left\| Y - X \hat{w}(\lambda) \right\|^2 can likewise be expressed as a sum over the eigenvalues. An alternative fixed-point scheme, the Gull–MacKay algorithm, usually has faster convergence, but may be used only if n exceeds a quantity depending on the eigenvalues \alpha_j and on \lambda; thus, while it can be used without problems for n > d, caution is recommended for n < d.
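
The computational saving can be sketched as follows: a single eigendecomposition of \frac{1}{n} X^\mathsf{T} X is reused to evaluate \hat{w}(\lambda) for a whole grid of \lambda values; the function name, the grid, and the synthetic data are illustrative.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Compute w_hat(lambda) for many lambda values from one eigendecomposition.

    Uses (1/n) X^T X = Q diag(alpha) Q^T and Z = (1/n) Q^T X^T y, so each extra
    lambda costs only O(d^2) instead of a fresh O(d^3) solve.
    """
    n, d = X.shape
    alpha, Q = np.linalg.eigh(X.T @ X / n)    # eigenvalues alpha_j, eigenvectors Q
    Z = Q.T @ (X.T @ y) / n
    return {lam: Q @ (Z / (alpha + lam)) for lam in lambdas}

# Usage with synthetic data and a small grid of regularization values
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
solutions = ridge_path(X, y, lambdas=[0.01, 0.1, 1.0])
```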


Lasso regression

The least absolute shrinkage and selection operator (LASSO) method is another popular choice. In lasso regression, the lasso penalty function R is the \ell_1 norm, i.e. R(w) = \sum_{j=1}^d \left| w_j \right|: \frac{1}{n} \left\| Y - X w \right\|^2_2 + \lambda \sum_{j=1}^d | w_j | \rightarrow \min_{w \in \mathbb{R}^d}. Note that the lasso penalty function is convex but not strictly convex. Unlike Tikhonov regularization, this scheme does not have a convenient closed-form solution: instead, the solution is typically found using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least-angle regression algorithm. An important difference between lasso regression and Tikhonov regularization is that lasso regression forces more entries of w to actually equal 0 than would otherwise be the case. In contrast, while Tikhonov regularization forces entries of w to be small, it does not force more of them to be 0 than would otherwise be the case. Thus, LASSO regularization is more appropriate than Tikhonov regularization in cases in which we expect the number of non-zero entries of w to be small, and Tikhonov regularization is more appropriate when we expect that entries of w will generally be small but not necessarily zero. Which of these regimes is more relevant depends on the specific data set at hand. Besides the feature selection described above, LASSO has some limitations. Ridge regression provides better accuracy in the case n > d for highly correlated variables. In the other case, n < d, LASSO selects at most n variables. Moreover, LASSO tends to select some arbitrary variables from a group of highly correlated variables, so there is no grouping effect.
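
As one concrete convex-optimization approach (not the only one mentioned above), the following sketch minimizes the lasso objective by proximal gradient descent (ISTA); the step size, iteration count, and function names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/n) ||y - Xw||_2^2 + lam * ||w||_1 by proximal gradient (ISTA).

    Minimal sketch: dedicated solvers (coordinate descent, LARS) are usually faster.
    """
    n, d = X.shape
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)          # gradient of the smooth term
        w = soft_threshold(w - step * grad, step * lam)
    return w
```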


ℓ0 penalization

\frac{1}{n} \left\| Y - X w \right\|^2_2 + \lambda \left\| w \right\|_0 \rightarrow \min_{w \in \mathbb{R}^d} The most extreme way to enforce sparsity is to say that the actual magnitude of the coefficients of w does not matter; rather, the only thing that determines the complexity of w is the number of non-zero entries. This corresponds to setting R(w) to be the \ell_0 norm of w. This regularization function, while attractive for the sparsity that it guarantees, is very difficult to solve, because doing so requires optimization of a function that is not even weakly convex. Lasso regression is the minimal possible relaxation of \ell_0 penalization that yields a weakly convex optimization problem.


Elastic net

For any non-negative \lambda_1 and \lambda_2 the objective has the following form: \frac{1}{n} \left\| Y - X w \right\|^2_2 + \lambda_1 \sum_{j=1}^d \left| w_j \right| + \lambda_2 \sum_{j=1}^d \left| w_j \right|^2 \rightarrow \min_{w \in \mathbb{R}^d}. Let \alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2}; then the solution of the minimization problem can be described as: \frac{1}{n} \left\| Y - X w \right\|^2_2 \rightarrow \min_{w \in \mathbb{R}^d} \text{ subject to } (1-\alpha) \left\| w \right\|_1 + \alpha \left\| w \right\|_2 \leq t for some t. Consider (1-\alpha) \left\| w \right\|_1 + \alpha \left\| w \right\|_2 \leq t as an elastic net penalty function. When \alpha = 1, elastic net becomes ridge regression, whereas for \alpha = 0 it becomes lasso. For all \alpha \in (0,1] the elastic net penalty function does not have a first derivative at 0, and it is strictly convex for all \alpha > 0, combining properties of both lasso regression and ridge regression. One of the main properties of the elastic net is that it can select groups of correlated variables. The difference between the weights corresponding to samples x_i and x_j is given by: \left| w^*_i(\lambda_1, \lambda_2) - w^*_j(\lambda_1, \lambda_2) \right| \leq \frac{\| Y \|_1}{\lambda_2} \sqrt{2 (1 - \rho_{ij})}, where \rho_{ij} = x_i^\mathsf{T} x_j. If x_i and x_j are highly correlated (\rho_{ij} \to 1), the weight vectors are very close. In the case of negatively correlated samples (\rho_{ij} \to -1) the samples -x_j can be taken instead. To summarize, for highly correlated variables the weight vectors tend to be equal, up to a sign in the case of negatively correlated variables.
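
The same proximal-gradient idea used for the lasso extends to the elastic net by keeping the squared \ell_2 penalty in the smooth part of the objective; the following sketch is illustrative, with step size and iteration count chosen for simplicity rather than speed.

```python
import numpy as np

def elastic_net_ista(X, y, lam1, lam2, n_iter=1000):
    """Minimise (1/n)||y - Xw||_2^2 + lam1*||w||_1 + lam2*||w||_2^2 by proximal
    gradient, with the squared l2 penalty treated as part of the smooth term.
    """
    n, d = X.shape
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / n + 2.0 * lam2
    step = 1.0 / lipschitz
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y) + 2.0 * lam2 * w   # smooth-part gradient
        v = w - step * grad
        w = np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)  # soft-threshold step
    return w
```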


Partial list of RLS methods

The following is a partial list of possible choices of the regularization function R(\cdot), along with the name for each one, the corresponding prior if there is a simple one, and ways of computing the solution to the resulting optimization problem.

* Ridge regression (Tikhonov regularization): R(w) = \|w\|_2^2; Gaussian prior centered at zero; closed-form solution.
* Lasso: R(w) = \|w\|_1; Laplace prior centered at zero; quadratic programming, proximal and other convex optimization methods, or least-angle regression.
* \ell_0 penalization: R(w) = \|w\|_0; non-convex, generally requiring combinatorial search.
* Elastic net: R(w) = \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2 (up to scaling); convex optimization methods such as proximal gradient descent.


See also

* Least squares
* Regularization (mathematics)
* Generalization error, one of the reasons regularization is used
* Tikhonov regularization
* Lasso regression
* Elastic net regularization
* Least-angle regression




External links


* Regularization and Variable Selection via the Elastic Net (presentation): http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf
* Regularized Least Squares and Support Vector Machines (presentation)
* Regularized Least Squares (presentation)