In machine learning, the kernel perceptron is a variant of the popular perceptron learning algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel function to compute the similarity of unseen samples to training samples. The algorithm was invented in 1964, making it the first kernel classification learner.


Preliminaries


The perceptron algorithm

The perceptron algorithm is an online learning algorithm that operates by a principle called "error-driven learning". It iteratively improves a model by running it on training samples, then updating the model whenever it finds it has made an incorrect classification with respect to a supervised signal. The model learned by the standard perceptron algorithm is a linear binary classifier: a vector of weights \mathbf{w} (and optionally an intercept term b, omitted here for simplicity) that is used to classify a sample vector \mathbf{x} as class "one" or class "minus one" according to

:\hat{y} = \sgn(\mathbf{w}^\top \mathbf{x})

where a zero is arbitrarily mapped to one or minus one. (The "hat" on \hat{y} denotes an estimated value.) In pseudocode, the perceptron algorithm is given by:

:Initialize \mathbf{w} to an all-zero vector of length p, the number of predictors (features).
:For some fixed number of iterations, or until some stopping criterion is met:
::For each training example \mathbf{x}_i with ground truth label y_i \in \{-1, 1\}:
:::Let \hat{y} = \sgn(\mathbf{w}^\top \mathbf{x}_i).
:::If \hat{y} \ne y_i, update \mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i.
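The pseudocode above can be sketched in Python with NumPy (a minimal illustration; the function and variable names are this sketch's own, not part of any standard library):

```python
import numpy as np

def perceptron_train(X, y, n_iter=100):
    """Train a linear perceptron. X: (n, p) sample matrix, y: labels in {-1, +1}."""
    n, p = X.shape
    w = np.zeros(p)  # all-zero weight vector of length p
    for _ in range(n_iter):
        for i in range(n):
            # sgn(w . x_i), arbitrarily mapping zero to +1
            y_hat = 1 if w @ X[i] >= 0 else -1
            if y_hat != y[i]:        # error-driven update
                w = w + y[i] * X[i]
    return w

# Linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
preds = np.sign(X @ w)
```

On this toy set the algorithm converges after a single pass, since one update on a misclassified negative example already separates the classes.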


Kernel methods

By contrast with the linear models learned by the perceptron, a kernel method is a classifier that stores a subset of its training examples \mathbf{x}_i, associates with each a weight \alpha_i, and makes decisions for new samples \mathbf{x} by evaluating

:\sgn \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}).

Here, K is some kernel function. Formally, a kernel function is a positive semidefinite kernel (see Mercer's condition), representing an inner product between samples in a high-dimensional space, as if the samples had been expanded to include additional features by a feature map \Phi: K(\mathbf{x}, \mathbf{x}') = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{x}'). Intuitively, it can be thought of as a similarity function between samples, so the kernel machine establishes the class of a new sample by weighted comparison to the training set. Each function \mathbf{x} \mapsto K(\mathbf{x}_i, \mathbf{x}) serves as a basis function in the classification.
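As a concrete illustration of the feature-map view (an example chosen for this sketch, not from the original text): the degree-2 homogeneous polynomial kernel K(x, z) = (x · z)² on R² equals an ordinary inner product after an explicit map \Phi into R³, so evaluating the kernel gives the high-dimensional inner product without ever forming \Phi(x):

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel: K(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map for the kernel above on R^2:
    (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
k_direct = poly_kernel(x, z)              # kernel evaluated directly
k_mapped = float(np.dot(phi(x), phi(z)))  # inner product after mapping
# Both equal (1*3 + 2*1)^2 = 25
```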


Algorithm

To derive a kernelized version of the perceptron algorithm, we must first formulate it in dual form, starting from the observation that the weight vector \mathbf{w} can be expressed as a linear combination of the n training samples. The equation for the weight vector is

:\mathbf{w} = \sum_i^n \alpha_i y_i \mathbf{x}_i

where \alpha_i is the number of times \mathbf{x}_i was misclassified, forcing an update \mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i. Using this result, we can formulate the dual perceptron algorithm, which loops through the samples as before, making predictions, but instead of storing and updating a weight vector \mathbf{w}, it updates a "mistake counter" vector \boldsymbol{\alpha}. We must also rewrite the prediction formula to get rid of \mathbf{w}:

:\begin{align} \hat{y} & = \sgn(\mathbf{w}^\mathsf{T} \mathbf{x}) \\ & = \sgn \left( \sum_i^n \alpha_i y_i \mathbf{x}_i \right)^\mathsf{T} \mathbf{x} \\ & = \sgn \sum_i^n \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) \end{align}

Plugging these two equations into the training loop turns it into the ''dual perceptron'' algorithm. Finally, we can replace the dot product in the dual perceptron by an arbitrary kernel function K, to get the effect of a feature map \Phi without computing \Phi(\mathbf{x}) explicitly for any samples. Doing this yields the kernel perceptron algorithm:

:Initialize \boldsymbol{\alpha} to an all-zeros vector of length n, the number of training samples.
:For some fixed number of iterations, or until some stopping criterion is met:
::For each training example \mathbf{x}_j with label y_j:
:::Let \hat{y} = \sgn \sum_i^n \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_j)
:::If \hat{y} \ne y_j, perform an update by incrementing the mistake counter:
::::\alpha_j \leftarrow \alpha_j + 1
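The kernel perceptron pseudocode can be sketched as follows (a minimal NumPy illustration; the choice of a Gaussian/RBF kernel and all names are this sketch's own). The XOR-style toy data is not linearly separable, so the plain perceptron cannot fit it, but the kernelized version can:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron_train(X, y, kernel, n_iter=10):
    """Return the mistake-counter vector alpha for training set (X, y)."""
    n = len(X)
    # Precompute the Gram matrix K[i, j] = kernel(X[i], X[j])
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(n_iter):
        for j in range(n):
            # sgn of the kernel expansion, mapping zero to +1
            y_hat = 1 if np.sum(alpha * y * K[:, j]) >= 0 else -1
            if y_hat != y[j]:
                alpha[j] += 1    # increment the mistake counter
    return alpha

def kernel_perceptron_predict(X_train, y_train, alpha, kernel, x):
    s = sum(alpha[i] * y_train[i] * kernel(X_train[i], x)
            for i in range(len(X_train)))
    return 1 if s >= 0 else -1

# XOR-like data: not linearly separable, but separable with an RBF kernel
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron_train(X, y, rbf_kernel)
preds = [kernel_perceptron_predict(X, y, alpha, rbf_kernel, x) for x in X]
```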


Variants and extensions

One problem with the kernel perceptron, as presented above, is that it does not learn sparse kernel machines. Initially, all the \alpha_i are zero, so that evaluating the decision function to get \hat{y} requires no kernel evaluations at all, but each update increments a single \alpha_i, making the evaluation increasingly more costly. Moreover, when the kernel perceptron is used in an online setting, the number of non-zero \alpha_i, and thus the evaluation cost, grows linearly in the number of examples presented to the algorithm.

The forgetron variant of the kernel perceptron was suggested to deal with this problem. It maintains an active set of examples with non-zero \alpha_i, removing ("forgetting") examples from the active set when it exceeds a pre-determined budget and "shrinking" (lowering the weight of) old examples as new ones are promoted to non-zero \alpha_i.

Another problem with the kernel perceptron is that it does not regularize, making it vulnerable to overfitting. The NORMA online kernel learning algorithm can be regarded as a generalization of the kernel perceptron algorithm with regularization. The sequential minimal optimization (SMO) algorithm used to learn support vector machines can also be regarded as a generalization of the kernel perceptron. The voted perceptron algorithm of Freund and Schapire also extends to the kernelized case, giving generalization bounds comparable to the kernel SVM.
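The budget idea behind forgetron-style variants can be sketched as follows. This is a deliberate simplification (evict the oldest support example when the budget is exceeded), not the published forgetron update, which also shrinks the weights of remaining examples in a specific way:

```python
from collections import deque

def budgeted_update(active, budget, new_index):
    """Keep at most `budget` support indices; evict the oldest when full.
    `active` is a deque of training-sample indices with non-zero alpha."""
    if len(active) >= budget:
        active.popleft()      # "forget" the oldest support example
    active.append(new_index)
    return active

active = deque()
for idx in range(5):          # five mistakes arrive over time
    budgeted_update(active, budget=3, new_index=idx)
# Only the three most recent support indices remain
```

With a fixed budget, the cost of evaluating the decision function stays bounded regardless of how many examples the online algorithm has seen.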

