Neighbourhood components analysis is a supervised learning method for classifying multivariate data into distinct classes according to a given distance metric over the data. Functionally, it serves the same purposes as the K-nearest neighbors algorithm and makes direct use of a related concept termed ''stochastic nearest neighbours''.
Definition
Neighbourhood components analysis aims at "learning" a distance metric by finding a linear transformation of input data such that the average leave-one-out (LOO) classification performance is maximized in the transformed space. The key insight of the algorithm is that a matrix A corresponding to the transformation can be found by defining a differentiable objective function for A, followed by the use of an iterative solver such as conjugate gradient descent. One of the benefits of this algorithm is that the number of classes k can be determined as a function of A, up to a scalar constant. This use of the algorithm therefore addresses the issue of model selection.
Explanation
In order to define A, we define an objective function describing classification accuracy in the transformed space and try to determine the A* that maximizes this objective function:
:<math>A^{*} = \underset{A}{\operatorname{argmax}} \; f(A).</math>
Leave-one-out (LOO) classification
Consider predicting the class label of a single data point by consensus of its k-nearest neighbours with a given distance metric. This is known as ''leave-one-out'' classification. However, the set of nearest neighbours C_i can be quite different after passing all the points through a linear transformation. Specifically, the set of neighbours for a point can undergo discrete changes in response to smooth changes in the elements of A, implying that any objective function f(·) based on the neighbours of a point will be ''piecewise-constant'', and hence ''not differentiable''.
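The piecewise-constant behaviour can be made concrete with a small numerical sketch (the data, labels, and the parametrised transformation below are invented purely for illustration): leave-one-out 1-NN accuracy is computed while one element of the transformation varies smoothly, and the accuracy only jumps at the discrete values where a neighbour set changes.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                                # toy two-dimensional data
y = (X[:, 0] + 0.3 * rng.normal(size=30) > 0).astype(int)   # noisy binary labels

def loo_1nn_accuracy(Z, y):
    """Leave-one-out 1-nearest-neighbour accuracy in the (transformed) space Z."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point may not be its own neighbour
    return np.mean(y[d.argmin(axis=1)] == y)

# Vary one element of the transformation A = diag(1, t) smoothly:
for t in np.linspace(0.5, 1.5, 11):
    A = np.diag([1.0, t])
    print(f"t = {t:.2f}   LOO accuracy = {loo_1nn_accuracy(X @ A.T, y):.3f}")
# The printed accuracy is a step function of t: it stays constant until some
# neighbour set flips, so an objective built on it is not differentiable.
</syntaxhighlight>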
Solution
We can resolve this difficulty by using an approach inspired by stochastic gradient descent. Rather than considering the k-nearest neighbours at each transformed point in LOO-classification, we consider the entire transformed data set as ''stochastic nearest neighbours''. We define these using a softmax function of the squared Euclidean distance between a given LOO-classification point and each other point in the transformed space:
:<math>p_{ij} = \frac{e^{-\|Ax_i - Ax_j\|^2}}{\sum_{k \neq i} e^{-\|Ax_i - Ax_k\|^2}}, \qquad p_{ii} = 0.</math>
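As a concrete sketch of this definition (the array names X and A and the helper function are illustrative, not from the original paper), the matrix of stochastic-neighbour probabilities p_ij can be computed directly:
<syntaxhighlight lang="python">
import numpy as np

def stochastic_neighbour_probs(A, X):
    """p_ij: softmax over negative squared Euclidean distances between rows of X A^T."""
    Z = X @ A.T                                               # transformed points, one per row
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                              # enforces p_ii = 0
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)               # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)                   # each row sums to 1
</syntaxhighlight>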
The probability of correctly classifying data point i is the probability of classifying the points of each of its neighbours with the same class, C_i:
:<math>p_{i} = \sum_{j \in C_i} p_{ij}, \qquad C_i = \{ j \mid c_j = c_i \},</math>
where p_{ij} is the probability of classifying neighbour j of point i.
Define the objective function using LOO classification, this time using the entire data set as stochastic nearest neighbours:
:<math>f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i.</math>
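Continuing the sketch above (the helper stochastic_neighbour_probs and the label vector y are assumed), p_i and f(A) follow directly from the two definitions:
<syntaxhighlight lang="python">
def nca_objective(A, X, y):
    """f(A): expected number of points correctly classified under stochastic LOO."""
    P = stochastic_neighbour_probs(A, X)        # p_ij
    same_class = y[:, None] == y[None, :]       # membership mask for C_i
    p_i = (P * same_class).sum(axis=1)          # probability of classifying i correctly
    return p_i.sum()                            # f(A) = sum_i p_i
</syntaxhighlight>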
Note that under stochastic nearest neighbours, the consensus class for a single point i is the expected value of a point's class in the limit of an infinite number of samples drawn from the distribution over its neighbours (each neighbour j being sampled with probability p_{ij}). Thus the predicted class is an affine combination of the classes of every other point, weighted by the softmax function for each j, where the set of neighbours is now the entire transformed data set.
This choice of objective function is preferable as it is differentiable with respect to A (denote x_{ij} = x_i − x_j):
:<math>\frac{\partial f}{\partial A} = -2A \sum_i \sum_{j \in C_i} p_{ij} \left( x_{ij} x_{ij}^{\top} - \sum_k p_{ik} x_{ik} x_{ik}^{\top} \right) = 2A \sum_i \left( p_i \sum_k p_{ik} x_{ik} x_{ik}^{\top} - \sum_{j \in C_i} p_{ij} x_{ij} x_{ij}^{\top} \right).</math>
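A direct, unoptimised translation of this gradient into code, reusing the assumed helpers above, might look as follows; it is intended only as a readable sketch for small data sets.
<syntaxhighlight lang="python">
import numpy as np

def nca_gradient(A, X, y):
    """Gradient of f with respect to A, in the second (rearranged) form above."""
    P = stochastic_neighbour_probs(A, X)
    same_class = y[:, None] == y[None, :]
    p_i = (P * same_class).sum(axis=1)
    n, d = X.shape
    G = np.zeros((d, d))
    for i in range(n):
        x_ik = X[i] - X                                     # rows are x_i - x_k
        outer = np.einsum('kd,ke->kde', x_ik, x_ik)         # x_ik x_ik^T for every k
        first = (P[i][:, None, None] * outer).sum(axis=0)   # sum_k p_ik x_ik x_ik^T
        second = ((P[i] * same_class[i])[:, None, None] * outer).sum(axis=0)
        G += p_i[i] * first - second                        # inner bracket for point i
    return 2 * A @ G
</syntaxhighlight>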
Obtaining a gradient for f means that it can be found with an iterative solver such as conjugate gradient descent. Note that in practice, most of the innermost terms of the gradient evaluate to insignificant contributions due to the rapidly diminishing contribution of distant points from the point of interest. This means that the inner sum of the gradient can be truncated, resulting in reasonable computation times even for large data sets.
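Putting the pieces together, one possible optimisation loop uses SciPy's general-purpose nonlinear conjugate-gradient solver; this is a sketch under the assumptions of the helpers defined above, not the implementation from the original paper.
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

def learn_nca_transform(X, y, n_components, seed=0):
    """Maximise f(A) by minimising -f(A) with nonlinear conjugate gradient."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A0 = 0.01 * rng.normal(size=(n_components, d))      # small random initialisation

    def neg_f(a):
        return -nca_objective(a.reshape(n_components, d), X, y)

    def neg_grad(a):
        return -nca_gradient(a.reshape(n_components, d), X, y).ravel()

    result = minimize(neg_f, A0.ravel(), jac=neg_grad, method='CG')
    return result.x.reshape(n_components, d)             # the learned transformation A
</syntaxhighlight>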
Alternative formulation
"Maximizing
is equivalent to minimizing the
-distance between the predicted class distribution and the true class distribution (ie: where the
induced by
are all equal to 1). A natural alternative is the KL-divergence, which induces the following objective function and gradient:" (Goldberger 2005)
In practice, optimization of
using this function tends to give similar performance results as with the original.
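In code, the only change relative to the earlier nca_objective sketch is the logarithm (again assuming the helpers above):
<syntaxhighlight lang="python">
import numpy as np

def nca_log_objective(A, X, y):
    """g(A) = sum_i log p_i, the KL-divergence-based alternative objective."""
    P = stochastic_neighbour_probs(A, X)
    same_class = y[:, None] == y[None, :]
    p_i = (P * same_class).sum(axis=1)
    return np.log(p_i + 1e-12).sum()            # small constant guards against log(0)
</syntaxhighlight>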
History and background
Neighbourhood components analysis was developed by Jacob Goldberger, Sam Roweis, Ruslan Salakhutdinov, and Geoff Hinton at the University of Toronto's department of computer science in 2004.
See also
*
Spectral clustering
*
Large margin nearest neighbor
References
*J. Goldberger, G. Hinton, S. Roweis, R. Salakhutdinov (2005). ''Neighbourhood Components Analysis''. Advances in Neural Information Processing Systems 17, 513–520.
External links
Software
* The MLPACK library contains a C++ implementation, nca (C++)
* scikit-learn's NeighborhoodComponentsAnalysis implementation (Python); a usage sketch is shown below
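A minimal usage sketch of the scikit-learn implementation (available in recent scikit-learn versions; pairing it with a k-nearest-neighbours classifier in a pipeline is one common pattern, not a requirement of the library):
<syntaxhighlight lang="python">
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn the linear transformation with NCA, then classify with k-NN in that space.
model = Pipeline([
    ("nca", NeighborhoodComponentsAnalysis(n_components=2, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
</syntaxhighlight>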