Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises during the training of support-vector machines (SVMs). It was invented by John Platt in 1998 at Microsoft Research. SMO is widely used for training support vector machines and is implemented by the popular LIBSVM tool. The publication of the SMO algorithm in 1998 generated considerable excitement in the SVM community, as previously available methods for SVM training were much more complex and required expensive third-party QP solvers.


Optimization problem

Consider a binary classification problem with a dataset (x_1, y_1), \ldots, (x_n, y_n), where x_i is an input vector and y_i \in \{-1, +1\} is a binary label corresponding to it. A soft-margin support vector machine is trained by solving a quadratic programming problem, which is expressed in the dual form as follows:

:\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac12 \sum_{i=1}^n \sum_{j=1}^n y_i y_j K(x_i, x_j) \alpha_i \alpha_j,
:subject to:
:0 \leq \alpha_i \leq C, \quad \mbox{for } i = 1, 2, \ldots, n,
:\sum_{i=1}^n y_i \alpha_i = 0,

where ''C'' is an SVM hyperparameter and K(x_i, x_j) is the kernel function, both supplied by the user; the variables \alpha_i are Lagrange multipliers.
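
For concreteness, the dual objective and its constraints can be evaluated directly. The following is a minimal Python sketch, assuming NumPy and a precomputed kernel (Gram) matrix K with K[i, j] = K(x_i, x_j); the function names here are illustrative, not part of any standard library:

import numpy as np

def dual_objective(alpha, y, K):
    # Value of the SVM dual objective for the multipliers alpha:
    # sum_i alpha_i - (1/2) sum_ij y_i y_j K_ij alpha_i alpha_j.
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

def is_feasible(alpha, y, C, tol=1e-9):
    # Box constraints 0 <= alpha_i <= C and the equality constraint
    # sum_i y_i alpha_i = 0, checked up to a numerical tolerance.
    return bool((alpha >= -tol).all() and (alpha <= C + tol).all()
                and abs(alpha @ y) <= tol)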


Algorithm

SMO is an iterative algorithm for solving the optimization problem described above. SMO breaks this problem into a series of smallest possible sub-problems, which are then solved analytically. Because of the linear equality constraint involving the Lagrange multipliers \alpha_i, the smallest possible problem involves two such multipliers. Then, for any two multipliers \alpha_1 and \alpha_2, the constraints are reduced to:

:0 \leq \alpha_1, \alpha_2 \leq C,
:y_1 \alpha_1 + y_2 \alpha_2 = k,

and this reduced problem can be solved analytically: one needs to find a minimum of a one-dimensional quadratic function. Here k is the negative of the sum over the rest of the terms in the equality constraint, which is fixed in each iteration.

The algorithm proceeds as follows:

# Find a Lagrange multiplier \alpha_1 that violates the Karush–Kuhn–Tucker (KKT) conditions for the optimization problem.
# Pick a second multiplier \alpha_2 and optimize the pair (\alpha_1, \alpha_2).
# Repeat steps 1 and 2 until convergence.

When all the Lagrange multipliers satisfy the KKT conditions (within a user-defined tolerance), the problem has been solved. Although this algorithm is guaranteed to converge, heuristics are used to choose the pair of multipliers so as to accelerate the rate of convergence. This is critical for large data sets, since there are n(n-1)/2 possible choices for \alpha_i and \alpha_j.
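
The following is a minimal Python sketch of this loop, in the style of the widely taught "simplified SMO" variant: the first multiplier is found by scanning for KKT violations, but the second is chosen at random rather than by Platt's selection heuristics, so it converges more slowly than a production implementation such as LIBSVM. The function name, default parameters, and the linear-kernel fallback are illustrative assumptions, not Platt's exact formulation:

import numpy as np

def smo_simplified(X, y, C=1.0, tol=1e-3, max_passes=5, kernel=None):
    # Trains a soft-margin SVM by solving the dual with two-multiplier
    # updates. X: (n, d) float array; y: (n,) array of labels in {-1, +1}.
    # Returns the multipliers alpha and the threshold b.
    if kernel is None:
        kernel = lambda u, v: u @ v  # default: linear kernel
    n = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha, b, passes = np.zeros(n), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            # E_i: prediction error on example i under the current model.
            E_i = (alpha * y) @ K[:, i] + b - y[i]
            # Step 1: does alpha_i violate the KKT conditions (within tol)?
            if (y[i] * E_i < -tol and alpha[i] < C) or \
               (y[i] * E_i > tol and alpha[i] > 0):
                # Step 2: pick a second multiplier (here: at random).
                j = np.random.choice([m for m in range(n) if m != i])
                E_j = (alpha * y) @ K[:, j] + b - y[j]
                a_i, a_j = alpha[i], alpha[j]
                # Ends [L, H] of the segment where the box constraints
                # 0 <= alpha <= C meet the line y_i alpha_i + y_j alpha_j = k.
                if y[i] != y[j]:
                    L, H = max(0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0, a_i + a_j - C), min(C, a_i + a_j)
                if L == H:
                    continue
                # eta: second derivative of the 1-D quadratic objective.
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                # Analytic minimizer along the constraint line, clipped.
                alpha[j] = np.clip(a_j - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-5:
                    continue
                # Adjust alpha_i so the equality constraint stays satisfied.
                alpha[i] += y[i] * y[j] * (a_j - alpha[j])
                # Threshold update (Platt's rules for b).
                b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i, i] \
                            - y[j] * (alpha[j] - a_j) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i, j] \
                            - y[j] * (alpha[j] - a_j) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

In this sketch, alpha[j] is moved to the unconstrained minimizer of the one-dimensional quadratic and clipped to [L, H], and alpha[i] is then adjusted so that y_1 \alpha_1 + y_2 \alpha_2 remains fixed; this is exactly the reduced two-multiplier problem described above.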


Related work

The first approach to splitting large SVM learning problems into a series of smaller optimization tasks was proposed by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik. It is known as the "chunking algorithm". The algorithm starts with a random subset of the data, solves this problem, and iteratively adds examples which violate the optimality conditions. One disadvantage of this algorithm is that it is necessary to solve QP problems scaling with the number of support vectors. On real-world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.

In 1997, E. Osuna, R. Freund, and F. Girosi proved a theorem which suggests a whole new set of QP algorithms for SVMs. By virtue of this theorem, a large QP problem can be broken down into a series of smaller QP sub-problems. A sequence of QP sub-problems that always add at least one violator of the Karush–Kuhn–Tucker (KKT) conditions is guaranteed to converge. The chunking algorithm obeys the conditions of the theorem and hence will converge. The SMO algorithm can be considered a special case of the Osuna algorithm, where the size of the optimization is two and both Lagrange multipliers are replaced at every step with new multipliers that are chosen via good heuristics.

The SMO algorithm is closely related to a family of optimization algorithms called Bregman methods or row-action methods. These methods solve convex programming problems with linear constraints. They are iterative methods in which each step projects the current primal point onto each constraint.


See also

* Kernel perceptron

