Within mathematical analysis, regularization perspectives on support-vector machines provide a way of interpreting support-vector machines (SVMs) in the context of other regularization-based machine-learning algorithms. SVM algorithms categorize binary data, with the goal of fitting the training-set data in a way that minimizes the average hinge loss plus an L2-norm penalty on the learned weights. This strategy avoids overfitting via Tikhonov regularization in the L2-norm sense, and corresponds to minimizing the bias and variance of the estimator of the weights; estimators with lower mean squared error predict better, i.e. generalize better, on unseen data. Specifically, Tikhonov regularization algorithms produce a decision boundary that minimizes the average training-set error while constraining the decision boundary, through the L2-norm penalty on the weights, not to be excessively complicated or to overfit the training data. Training- and test-set errors can be measured without bias using metrics such as accuracy, precision, AUC-ROC, and precision-recall curves. Regularization perspectives on support-vector machines interpret SVM as a special case of Tikhonov regularization, specifically Tikhonov regularization with the hinge loss as the loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goal: to generalize without overfitting.

SVM was first proposed in 1995 by Corinna Cortes and Vladimir Vapnik, and framed geometrically as a method for finding hyperplanes that can separate multidimensional data into two categories. This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other machine-learning techniques for avoiding overfitting, such as regularization, early stopping, sparsity, and Bayesian inference. However, once it was discovered that SVM is also a special case of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms. This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, as well as theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.
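
For concreteness, the sketch below fits a linear SVM with scikit-learn's SVC, whose objective is exactly a hinge-loss term plus an L2 penalty on the weights; the toy data and parameter values are assumptions made up for this example.

```python
# A minimal sketch, assuming scikit-learn is available and using made-up
# two-dimensional toy data. SVC with a linear kernel minimizes
# C * (sum of hinge losses) + (1/2) * ||w||^2, the regularized objective
# described above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)),   # class -1 cluster
               rng.normal(+1, 1, (20, 2))])  # class +1 cluster
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)  # learned weights of the separating hyperplane
```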


Theoretical background

In the statistical learning theory framework, an algorithm is a strategy for choosing a function f \colon \mathbf X \to \mathbf Y given a training set S = \{(x_1, y_1), \ldots, (x_n, y_n)\} of inputs x_i and their labels y_i (the labels are usually \pm 1). Regularization strategies avoid overfitting by choosing a function that fits the data, but is not too complex. Specifically:

: f = \underset{f \in \mathcal H}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i=1}^n V\big(y_i, f(x_i)\big) + \lambda \|f\|^2_{\mathcal H} \right\},

where \mathcal H is a hypothesis space of functions, V \colon \mathbf Y \times \mathbf Y \to \mathbb R is the loss function, \|\cdot\|_{\mathcal H} is a norm on the hypothesis space of functions, and \lambda \in \mathbb R is the regularization parameter. When \mathcal H is a reproducing kernel Hilbert space, there exists a kernel function K \colon \mathbf X \times \mathbf X \to \mathbb R whose evaluations on the training inputs form an n \times n symmetric positive-definite matrix \mathbf K. By the representer theorem,

: f(x_i) = \sum_{j=1}^n c_j \mathbf K_{ij}, \quad \text{and} \quad \|f\|^2_{\mathcal H} = \langle f, f \rangle_{\mathcal H} = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = c^T \mathbf K c.
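
The representer theorem reduces the search over an infinite-dimensional hypothesis space to a search over the n coefficients c. As a hedged illustration, the sketch below solves the Tikhonov problem for the square loss, for which the coefficients have the closed form c = (K + λnI)^{-1} y (the hinge loss, treated below, admits no such closed form); the RBF kernel and all function names are assumptions made for this example.

```python
# A minimal sketch of Tikhonov regularization in an RKHS, assuming a
# Gaussian RBF kernel and the square loss (chosen because it admits a
# closed-form solution; the hinge loss does not).
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def fit_tikhonov_square_loss(X, y, lam=0.1, gamma=1.0):
    # Minimizes (1/n) * sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2.
    # By the representer theorem f(x) = sum_j c_j K(x, x_j), and the
    # minimizing coefficients satisfy (K + lam * n * I) c = y.
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    c = np.linalg.solve(K + lam * n * np.eye(n), y)
    return c, K

# The squared norm ||f||_H^2 is then recovered from the coefficients
# as c^T K c, matching the formula above.
```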


Special properties of the hinge loss

The simplest and most intuitive loss function for categorization is the misclassification loss, or 0–1 loss, which is 0 if f(x_i) = y_i and 1 if f(x_i) \neq y_i, i.e. the Heaviside step function on -y_i f(x_i). However, this loss function is not convex, which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0–1 loss. The hinge loss, V\big(y_i, f(x_i)\big) = \big(1 - y_i f(x_i)\big)_+, where (s)_+ = \max(s, 0), provides such a convex relaxation. In fact, the hinge loss is the tightest convex upper bound to the 0–1 misclassification loss function, and with infinite data it returns the Bayes-optimal solution:

: f_b(x) = \begin{cases} 1, & p(1 \mid x) > p(-1 \mid x), \\ -1, & p(1 \mid x) < p(-1 \mid x). \end{cases}
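
The upper-bound property is easy to check numerically. The following sketch (plain NumPy; the function names are illustrative, not from any library) evaluates both losses over a range of margins y f(x): the hinge loss dominates the 0–1 loss everywhere and decreases linearly through the decision boundary instead of jumping.

```python
import numpy as np

def zero_one_loss(y, fx):
    # Heaviside step on -y * f(x): 1 when the prediction disagrees in
    # sign with the label, 0 otherwise.
    return (y * fx <= 0).astype(float)

def hinge_loss(y, fx):
    # (1 - y f(x))_+ : convex, and an upper bound on the 0-1 loss.
    return np.maximum(1.0 - y * fx, 0.0)

margins = np.linspace(-2.0, 2.0, 9)  # values of the margin y * f(x)
for m in margins:
    print(f"y*f(x) = {m:+.1f}   hinge = {hinge_loss(1, m):.1f}   "
          f"0-1 = {zero_one_loss(1, m):.0f}")
```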


Derivation

The Tikhonov regularization problem can be shown to be equivalent to traditional formulations of SVM by expressing it in terms of the hinge loss. With the hinge loss

: V\big(y_i, f(x_i)\big) = \big(1 - y_i f(x_i)\big)_+, where (s)_+ = \max(s, 0),

the regularization problem becomes

: f = \underset{f \in \mathcal H}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+ + \lambda \|f\|^2_{\mathcal H} \right\}.

Multiplying by 1/(2\lambda) yields

: f = \underset{f \in \mathcal H}{\operatorname{argmin}} \left\{ C \sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+ + \frac{1}{2} \|f\|^2_{\mathcal H} \right\}

with C = 1/(2\lambda n), which is equivalent to the standard SVM minimization problem.
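
Since the two objectives differ only by the constant factor 1/(2\lambda), they share the same minimizer, which can be confirmed numerically. The sketch below assumes a linear model f(x) = \langle w, x \rangle and made-up random data; both helper functions and the data are illustrative assumptions, not part of the derivation itself.

```python
# A minimal numerical check of the rescaling argument, assuming a linear
# model f(x) = <w, x> and made-up random data.
import numpy as np

def objective_lambda(w, X, y, lam):
    # Tikhonov form: (1/n) * sum_i (1 - y_i <w, x_i>)_+ + lam * ||w||^2
    return np.maximum(1 - y * (X @ w), 0).mean() + lam * (w @ w)

def objective_C(w, X, y, C):
    # Standard SVM form: C * sum_i (1 - y_i <w, x_i>)_+ + (1/2) * ||w||^2
    return C * np.maximum(1 - y * (X @ w), 0).sum() + 0.5 * (w @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(rng.normal(size=50))
w = rng.normal(size=3)            # an arbitrary weight vector

lam = 0.1
n = len(y)
C = 1 / (2 * lam * n)

# Dividing the Tikhonov objective by 2*lam reproduces the SVM objective
# at every w, so the two formulations have the same minimizer.
print(objective_lambda(w, X, y, lam) / (2 * lam))
print(objective_C(w, X, y, C))
```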


Notes and references

* Vapnik, Vladimir (1999). ''The Nature of Statistical Learning Theory''. New York: Springer-Verlag. ISBN 978-0-387-98780-4. https://books.google.com/books?id=sna9BaxVbj8C&q=vapnik+the+nature+of+statistical+learning+theory&pg=PR7