In machine learning, the kernel perceptron is a variant of the popular perceptron learning algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel function to compute the similarity of unseen samples to training samples. The algorithm was invented in 1964, making it the first kernel classification learner.
Preliminaries
The perceptron algorithm
The perceptron algorithm is an online learning algorithm that operates by a principle called "error-driven learning". It iteratively improves a model by running it on training samples, then updating the model whenever it finds it has made an incorrect classification with respect to a supervised signal. The model learned by the standard perceptron algorithm is a linear binary classifier: a vector of weights w (and optionally an intercept term b, omitted here for simplicity) that is used to classify a sample vector x as class "one" or class "minus one" according to
:ŷ = sgn(w · x)
where a zero is arbitrarily mapped to one or minus one. (The "hat" on ŷ denotes an estimated value.)
In pseudocode, the perceptron algorithm is given by:
:Initialize w to an all-zero vector of length n, the number of predictors (features).
:For some fixed number of iterations, or until some stopping criterion is met:
::For each training example x_i with ground truth label y_i ∈ {-1, 1}:
:::Let ŷ = sgn(w · x_i).
:::If ŷ ≠ y_i, update w ← w + y_i x_i.
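The pseudocode above can be sketched in Python with NumPy; the function name perceptron_train, the iteration cap, and the toy data are illustrative, not taken from any library.

```python
import numpy as np

def perceptron_train(X, y, n_iter=100):
    """Learn a weight vector w for labels y in {-1, +1} by error-driven updates."""
    w = np.zeros(X.shape[1])                  # all-zero vector of length n (features)
    for _ in range(n_iter):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if w @ x_i >= 0 else -1  # a zero score maps to +1 arbitrarily
            if y_hat != y_i:                   # misclassified: update the weights
                w = w + y_i * x_i
    return w

# Linearly separable toy data: class is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
```

After training, sgn(w · x) reproduces the labels of the (separable) training set.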
Kernel Methods
By contrast with the linear models learned by the perceptron, a kernel method is a classifier that stores a subset of its training examples x_i, associates with each a weight α_i, and makes decisions for new samples x' by evaluating
:ŷ = sgn(Σ_i α_i y_i K(x_i, x')).
Here, K is some kernel function. Formally, a kernel function is a non-negative semidefinite kernel (see Mercer's condition), representing an inner product between samples in a high-dimensional space, as if the samples had been expanded to include additional features by a feature map Φ: K(x, x') = Φ(x) · Φ(x'). Intuitively, it can be thought of as a similarity function between samples, so the kernel machine establishes the class of a new sample by a weighted comparison to the training set. Each function x' ↦ K(x_i, x') serves as a basis function in the classification.
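The kernel-as-inner-product view can be checked numerically. As an illustrative example (not from the article), the quadratic kernel K(x, z) = (x · z)² on 2-D inputs corresponds to the explicit feature map Φ(x) = (x₁², x₂², √2·x₁x₂):

```python
import numpy as np

def quad_kernel(x, z):
    """Quadratic kernel: the squared dot product of the raw inputs."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map whose inner product equals quad_kernel (2-D inputs)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# Both routes compute the same similarity; the kernel never materializes phi.
assert np.isclose(quad_kernel(x, z), np.dot(phi(x), phi(z)))
```

The kernel evaluates the high-dimensional inner product at the cost of a single dot product in the original space.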
Algorithm
To derive a kernelized version of the perceptron algorithm, we must first formulate it in dual form, starting from the observation that the weight vector w can be expressed as a linear combination of the training samples. The equation for the weight vector is
:w = Σ_{i=1}^{n} α_i y_i x_i
where α_i is the number of times x_i was misclassified, forcing an update w ← w + y_i x_i. Using this result, we can formulate the dual perceptron algorithm, which loops through the samples as before, making predictions, but instead of storing and updating a weight vector w, it updates a "mistake counter" vector α.
We must also rewrite the prediction formula to get rid of w:
:ŷ = sgn(w · x) = sgn(Σ_{i=1}^{n} α_i y_i (x_i · x))
Plugging these two equations into the training loop turns it into the ''dual perceptron'' algorithm.
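A sketch of this dual form, under the same illustrative naming as before: training touches the samples only through their pairwise dot products (the Gram matrix), and the primal weights can be recovered from the mistake counters afterwards.

```python
import numpy as np

def dual_perceptron_train(X, y, n_iter=100):
    """Dual perceptron: update mistake counters alpha instead of weights w."""
    n = len(X)
    alpha = np.zeros(n)                        # mistake counter per training sample
    G = X @ X.T                                # Gram matrix of pairwise dot products
    for _ in range(n_iter):
        for j in range(n):
            score = np.sum(alpha * y * G[:, j])  # sgn(sum_i alpha_i y_i (x_i . x_j))
            y_hat = 1 if score >= 0 else -1
            if y_hat != y[j]:
                alpha[j] += 1                  # count the mistake on sample j
    return alpha

X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha = dual_perceptron_train(X, y)
# Recover the primal weights: w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X
```

On separable data, the recovered w classifies the training set exactly as a primal perceptron would.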
Finally, we can replace the dot product in the dual perceptron by an arbitrary kernel function, to get the effect of a feature map Φ without computing Φ(x) explicitly for any samples. Doing this yields the kernel perceptron algorithm:
:Initialize α to an all-zeros vector of length n, the number of training samples.
:For some fixed number of iterations, or until some stopping criterion is met:
::For each training example x_j with label y_j:
:::Let ŷ = sgn(Σ_{i=1}^{n} α_i y_i K(x_i, x_j))
:::If ŷ ≠ y_j, perform an update by incrementing the mistake counter:
::::α_j ← α_j + 1
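The pseudocode above can be sketched as follows; the choice of an RBF kernel, the γ value, and the toy data are illustrative assumptions, shown so that the learned classifier is genuinely non-linear.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron_train(X, y, kernel, n_iter=100):
    """Kernel perceptron: increment alpha_j whenever sample j is misclassified."""
    n = len(X)
    alpha = np.zeros(n)                               # mistake counters
    K = np.array([[kernel(a, b) for b in X] for a in X])
    for _ in range(n_iter):
        for j in range(n):
            y_hat = 1 if np.sum(alpha * y * K[:, j]) >= 0 else -1
            if y_hat != y[j]:
                alpha[j] += 1                         # update the mistake counter
    return alpha

def kernel_perceptron_predict(X_train, y_train, alpha, kernel, x_new):
    """Classify x_new by kernel-weighted comparison to the training set."""
    score = sum(a * yi * kernel(xi, x_new)
                for a, yi, xi in zip(alpha, y_train, X_train))
    return 1 if score >= 0 else -1

# Toy non-linear problem: class +1 near the origin, -1 far from it.
X = np.array([[0.0, 0.1], [0.1, -0.1], [3.0, 0.0], [0.0, 3.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1, -1])
alpha = kernel_perceptron_train(X, y, rbf_kernel)
```

No feature map is ever computed; every decision is a sum of kernel evaluations against the stored training samples, which is exactly the sparsity problem discussed below.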
Variants and extensions
One problem with the kernel perceptron, as presented above, is that it does not learn sparse kernel machines. Initially, all the α_i are zero, so that evaluating the decision function to get ŷ requires no kernel evaluations at all, but each update increments a single α_i, making the evaluation increasingly more costly. Moreover, when the kernel perceptron is used in an online setting, the number of non-zero α_i, and thus the evaluation cost, grows linearly with the number of examples presented to the algorithm.
The forgetron variant of the kernel perceptron was suggested to deal with this problem. It maintains an active set of examples with non-zero α_i, removing ("forgetting") examples from the active set when it exceeds a pre-determined budget, and "shrinking" (lowering the weight of) old examples as new ones are promoted to non-zero α_i.
Another problem with the kernel perceptron is that it does not regularize, making it vulnerable to overfitting. The NORMA online kernel learning algorithm can be regarded as a generalization of the kernel perceptron algorithm with regularization.
The sequential minimal optimization (SMO) algorithm used to learn support vector machines can also be regarded as a generalization of the kernel perceptron.
The voted perceptron algorithm of Freund and Schapire also extends to the kernelized case,
giving generalization bounds comparable to the kernel SVM.
References
{{reflist|30em}}