Consensus clustering is a method of aggregating (potentially conflicting) results from multiple clustering algorithms. Also called cluster ensembles or aggregation of clustering (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings.
Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete, even when the number of input clusterings is three.
Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning.
Issues with existing clustering techniques
* Current clustering techniques do not address all the requirements adequately.
* Dealing with large number of dimensions and large number of data items can be problematic because of time complexity;
* Effectiveness of the method depends on the definition of "distance" (for distance-based clustering);
* If an obvious distance measure doesn't exist, we must "define" it, which is not always easy, especially in multidimensional spaces.
* The result of the clustering algorithm (that, in many cases, can be arbitrary itself) can be interpreted in different ways.
Justification for using consensus clustering
There are potential shortcomings for all existing clustering techniques. This may cause interpretation of results to become difficult, especially when there is no knowledge about the number of clusters. Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods. An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence about the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), this validation becomes somewhat elusive.
Iterative descent clustering methods, such as the SOM and k-means clustering, circumvent some of the shortcomings of hierarchical clustering by providing univocally defined clusters and cluster boundaries. Consensus clustering provides a method that represents the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. It can provide data for a visualization tool to inspect cluster number, membership, and boundaries. However, iterative descent methods lack the intuitive and visual appeal of hierarchical clustering dendrograms, and the number of clusters must be chosen a priori.
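The idea of taking a consensus over random restarts can be sketched as follows. This is a minimal illustration in NumPy, not a production implementation: the toy `kmeans` helper and the two-blob dataset are assumptions introduced here for demonstration, and any clustering algorithm with random initialization could be substituted.

```python
import numpy as np

def kmeans(X, k, rng, iters=50):
    """Minimal k-means with random initial centers (illustrative only)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def consensus_over_restarts(X, k, runs=20, seed=0):
    """Average the N x N connectivity matrices over several random restarts.

    Co-clustering is invariant to label permutation, so runs that find the
    same partition under different label names still agree here.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    consensus = np.zeros((n, n))
    for _ in range(runs):
        labels = kmeans(X, k, rng)
        consensus += (labels[:, None] == labels[None, :]).astype(float)
    return consensus / runs

# Two well-separated blobs: every restart recovers the same partition,
# so the consensus matrix is binary (fully stable).
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
C = consensus_over_restarts(X, k=2)
```

Entries near 0 or 1 indicate pairs on which the restarts agree; intermediate values flag unstable assignments caused by sensitivity to initial conditions.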
The Monti consensus clustering algorithm
The Monti consensus clustering algorithm is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, K. Given a dataset of N total points to cluster, this algorithm works by resampling and clustering the data for each K, and an N × N consensus matrix is calculated, where each element represents the fraction of times two samples clustered together. A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or not together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal K.
More specifically, given a set of N points to cluster, let D^(1), D^(2), ..., D^(H) be the list of H perturbed (resampled) datasets of the original dataset D, and let M^(h) denote the N × N connectivity matrix resulting from applying a clustering algorithm to the dataset D^(h). The entries of M^(h) are defined as follows:

M^(h)(i, j) = 1 if points i and j belong to the same cluster of D^(h), and 0 otherwise.
Let I^(h) be the N × N indicator matrix whose (i, j)-th entry is equal to 1 if points i and j are in the same perturbed dataset D^(h), and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration for the normalisation step. The consensus matrix M is defined as the normalised sum of the connectivity matrices of all the perturbed datasets, and a different one is calculated for every K:

M(i, j) = (Σ_h M^(h)(i, j)) / (Σ_h I^(h)(i, j))
That is, the entry (i, j) in the consensus matrix is the number of times points i and j were clustered together, divided by the total number of times they were both selected. The matrix is symmetric, and each element is defined within the range [0, 1].
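The Monti procedure for a fixed K can be sketched directly from these definitions. This is a hedged NumPy sketch, assuming subsampling without replacement as the perturbation and a caller-supplied `cluster_fn`; the function and parameter names are illustrative, not from a specific library.

```python
import numpy as np

def monti_consensus_matrix(X, k, cluster_fn, resamples=50, frac=0.8, seed=0):
    """Sketch of the Monti consensus matrix for a fixed number of clusters k.

    cluster_fn(X_sub, k) must return integer labels for the rows of X_sub;
    any base clustering algorithm can be plugged in.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    connectivity = np.zeros((n, n))  # running sum of M^(h): co-clustering counts
    indicator = np.zeros((n, n))     # running sum of I^(h): co-selection counts
    for _ in range(resamples):
        # Perturbed dataset D^(h): a random subsample of the points.
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = cluster_fn(X[idx], k)
        indicator[np.ix_(idx, idx)] += 1
        connectivity[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
    # M(i, j) = sum_h M^(h)(i, j) / sum_h I^(h)(i, j); pairs never
    # co-selected are left at 0 to avoid division by zero.
    return connectivity / np.maximum(indicator, 1)

# Toy usage: two well-separated blobs, split by a trivial threshold rule.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
C = monti_consensus_matrix(X, 2, lambda Xs, k: (Xs[:, 0] > 5).astype(int))
```

In practice this matrix is computed for each candidate K, and the K whose consensus matrix is closest to binary (most stable) is selected.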