In machine learning, the kernel perceptron is a variant of the popular perceptron learning algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel function to compute the similarity of unseen samples to training samples. The algorithm was invented in 1964, making it the first kernel classification learner.
Preliminaries
The perceptron algorithm
The perceptron algorithm is an online learning algorithm that operates by a principle called "error-driven learning". It iteratively improves a model by running it on training samples, then updating the model whenever it finds it has made an incorrect classification with respect to a supervised signal. The model learned by the standard perceptron algorithm is a linear binary classifier: a vector of weights w (and optionally an intercept term b, omitted here for simplicity) that is used to classify a sample vector x as class "one" or class "minus one" according to
:ŷ = sgn(w · x)
where a zero is arbitrarily mapped to one or minus one. (The "hat" on ŷ denotes an estimated value.)
In pseudocode, the perceptron algorithm is given by:
:Initialize w to an all-zero vector of length n, the number of predictors (features).
:For some fixed number of iterations, or until some stopping criterion is met:
::For each training example x_i with ground truth label y_i ∈ {-1, 1}:
:::Let ŷ = sgn(w · x_i).
:::If ŷ ≠ y_i, update w ← w + y_i x_i.
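The pseudocode above can be sketched in Python with NumPy; the function name perceptron_train, the iteration cap, and the toy data are illustrative, not taken from any library.

```python
import numpy as np

def perceptron_train(X, y, n_iter=100):
    """Learn a weight vector w for labels y in {-1, +1} by error-driven updates."""
    w = np.zeros(X.shape[1])                  # all-zero vector of length n (features)
    for _ in range(n_iter):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if w @ x_i >= 0 else -1  # a zero score maps to +1 arbitrarily
            if y_hat != y_i:                   # misclassified: update the weights
                w = w + y_i * x_i
    return w

# Linearly separable toy data: class is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
```

After training, sgn(w · x) reproduces the labels of the (separable) training set.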
Kernel Methods
By contrast with the linear models learned by the perceptron, a kernel method is a classifier that stores a subset of its training examples x_i, associates with each a weight α_i, and makes decisions for new samples x' by evaluating
:ŷ = sgn(Σ_i α_i y_i K(x_i, x')).
Here, K is some kernel function. Formally, a kernel function is a non-negative semidefinite kernel (see Mercer's condition), representing an inner product between samples in a high-dimensional space, as if the samples had been expanded to include additional features by a feature map Φ: K(x, x') = Φ(x) · Φ(x'). Intuitively, it can be thought of as a similarity function between samples, so the kernel machine establishes the class of a new sample by a weighted comparison to the training set. Each function x' ↦ K(x_i, x') serves as a basis function in the classification.
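The kernel-as-inner-product view can be checked numerically. As an illustrative example (not from the article), the quadratic kernel K(x, z) = (x · z)² on 2-D inputs corresponds to the explicit feature map Φ(x) = (x₁², x₂², √2·x₁x₂):

```python
import numpy as np

def quad_kernel(x, z):
    """Quadratic kernel: the squared dot product of the raw inputs."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map whose inner product equals quad_kernel (2-D inputs)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# Both routes compute the same similarity; the kernel never materializes phi.
assert np.isclose(quad_kernel(x, z), np.dot(phi(x), phi(z)))
```

The kernel evaluates the high-dimensional inner product at the cost of a single dot product in the original space.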
Algorithm
To derive a kernelized version of the perceptron algorithm, we must first formulate it in dual form, starting from the observation that the weight vector w can be expressed as a linear combination of the training samples. The equation for the weight vector is
:w = Σ_{i=1}^{n} α_i y_i x_i
where α_i is the number of times x_i was misclassified, forcing an update w ← w + y_i x_i. Using this result, we can formulate the dual perceptron algorithm, which loops through the samples as before, making predictions, but instead of storing and updating a weight vector w, it updates a "mistake counter" vector α.
We must also rewrite the prediction formula to get rid of w:
:ŷ = sgn(w · x) = sgn(Σ_{i=1}^{n} α_i y_i (x_i · x))
Plugging these two equations into the training loop turns it into the ''dual perceptron'' algorithm.
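A sketch of this dual form, under the same illustrative naming as before: training touches the samples only through their pairwise dot products (the Gram matrix), and the primal weights can be recovered from the mistake counters afterwards.

```python
import numpy as np

def dual_perceptron_train(X, y, n_iter=100):
    """Dual perceptron: update mistake counters alpha instead of weights w."""
    n = len(X)
    alpha = np.zeros(n)                        # mistake counter per training sample
    G = X @ X.T                                # Gram matrix of pairwise dot products
    for _ in range(n_iter):
        for j in range(n):
            score = np.sum(alpha * y * G[:, j])  # sgn(sum_i alpha_i y_i (x_i . x_j))
            y_hat = 1 if score >= 0 else -1
            if y_hat != y[j]:
                alpha[j] += 1                  # count the mistake on sample j
    return alpha

X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha = dual_perceptron_train(X, y)
# Recover the primal weights: w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X
```

On separable data, the recovered w classifies the training set exactly as a primal perceptron would.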
Finally, we can replace the dot product in the dual perceptron by an arbitrary kernel function, to get the effect of a feature map Φ without computing Φ(x) explicitly for any samples. Doing this yields the kernel perceptron algorithm:
:Initialize α to an all-zeros vector of length n, the number of training samples.
:For some fixed number of iterations, or until some stopping criterion is met:
::For each training example x_j with label y_j:
:::Let ŷ = sgn(Σ_{i=1}^{n} α_i y_i K(x_i, x_j))
:::If ŷ ≠ y_j, perform an update by incrementing the mistake counter:
::::α_j ← α_j + 1
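The pseudocode above can be sketched as follows; the choice of an RBF kernel, the γ value, and the toy data are illustrative assumptions, shown so that the learned classifier is genuinely non-linear.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron_train(X, y, kernel, n_iter=100):
    """Kernel perceptron: increment alpha_j whenever sample j is misclassified."""
    n = len(X)
    alpha = np.zeros(n)                               # mistake counters
    K = np.array([[kernel(a, b) for b in X] for a in X])
    for _ in range(n_iter):
        for j in range(n):
            y_hat = 1 if np.sum(alpha * y * K[:, j]) >= 0 else -1
            if y_hat != y[j]:
                alpha[j] += 1                         # update the mistake counter
    return alpha

def kernel_perceptron_predict(X_train, y_train, alpha, kernel, x_new):
    """Classify x_new by kernel-weighted comparison to the training set."""
    score = sum(a * yi * kernel(xi, x_new)
                for a, yi, xi in zip(alpha, y_train, X_train))
    return 1 if score >= 0 else -1

# Toy non-linear problem: class +1 near the origin, -1 far from it.
X = np.array([[0.0, 0.1], [0.1, -0.1], [3.0, 0.0], [0.0, 3.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1, -1])
alpha = kernel_perceptron_train(X, y, rbf_kernel)
```

No feature map is ever computed; every decision is a sum of kernel evaluations against the stored training samples, which is exactly the sparsity problem discussed below.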
Variants and extensions
One problem with the kernel perceptron, as presented above, is that it does not learn sparse kernel machines. Initially, all the α_i are zero, so that evaluating the decision function to get ŷ requires no kernel evaluations at all, but each update increments a single α_i, making the evaluation increasingly more costly. Moreover, when the kernel perceptron is used in an online setting, the number of non-zero α_i, and thus the evaluation cost, grows linearly with the number of examples presented to the algorithm.
The forgetron variant of the kernel perceptron was suggested to deal with this problem. It maintains an active set of examples with non-zero α_i, removing ("forgetting") examples from the active set when it exceeds a pre-determined budget, and "shrinking" (lowering the weight of) old examples as new ones are promoted to non-zero α_i.
Another problem with the kernel perceptron is that it does not regularize, making it vulnerable to overfitting. The NORMA online kernel learning algorithm can be regarded as a generalization of the kernel perceptron algorithm with regularization.
The sequential minimal optimization (SMO) algorithm used to learn support vector machines can also be regarded as a generalization of the kernel perceptron.
The voted perceptron algorithm of Freund and Schapire also extends to the kernelized case,
giving generalization bounds comparable to the kernel SVM.
References
{{reflist|30em}}