In
statistics, single-linkage clustering is one of several methods of
hierarchical clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into tw ...
. It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other.
A drawback of this method is that it tends to produce long thin clusters in which nearby elements of the same cluster have small distances, but elements at opposite ends of a cluster may be much farther from each other than two elements of other clusters. This may lead to difficulties in defining classes that could usefully subdivide the data.
[ ]
Overview of agglomerative clustering methods
In the beginning of the agglomerative clustering process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters, until all elements end up being in the same cluster. At each step, the two clusters separated by the shortest distance are combined. The function used to determine the distance between two clusters, known as the ''linkage function'', is what differentiates the agglomerative clustering methods.
In single-linkage clustering, the distance between two clusters is determined by a single pair of elements: those two elements (one in each cluster) that are closest to each other. The shortest of these pairwise distances that remain at any step causes the two clusters whose elements are involved to be merged. The method is also known as ''nearest neighbour clustering''. The result of the clustering can be visualized as a
dendrogram
A dendrogram is a diagram representing a tree. This diagrammatic representation is frequently used in different contexts:
* in hierarchical clustering, it illustrates the arrangement of the clusters produced by the corresponding analyses.
...
, which shows the sequence in which clusters were merged and the distance at which each merge took place.
Mathematically, the linkage function – the distance ''D''(''X'',''Y'') between clusters ''X'' and ''Y'' – is described by the expression
:
where ''X'' and ''Y'' are any two sets of elements considered as clusters, and ''d''(''x'',''y'') denotes the distance between the two elements ''x'' and ''y''.
Naive algorithm
The following algorithm is an
agglomerative scheme that erases rows and columns in a proximity matrix as old clusters are merged into new ones. The
proximity matrix
contains all distances
. The clusterings are assigned sequence numbers
and
is the level of the
-th clustering. A cluster with sequence number ''m'' is denoted (''m'') and the proximity between clusters
and
is denoted