Consensus clustering is a method of aggregating (potentially conflicting) results from multiple clustering algorithms. Also called cluster ensembles or aggregation of clustering (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings.
Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete, even when the number of input clusterings is three.
Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning.
Issues with existing clustering techniques
* Current clustering techniques do not address all the requirements adequately.
* Dealing with large number of dimensions and large number of data items can be problematic because of time complexity;
* Effectiveness of the method depends on the definition of "distance" (for distance-based clustering);
* If an obvious distance measure doesn't exist, we must "define" it, which is not always easy, especially in multidimensional spaces.
* The result of the clustering algorithm (that, in many cases, can be arbitrary itself) can be interpreted in different ways.
Justification for using consensus clustering
There are potential shortcomings for all existing clustering techniques. This may cause interpretation of results to become difficult, especially when there is no knowledge about the number of clusters. Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods. An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence about the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), this validation becomes somewhat elusive.
Iterative descent clustering methods, such as the SOM and k-means clustering, circumvent some of the shortcomings of hierarchical clustering by providing univocally defined clusters and cluster boundaries. Consensus clustering provides a method that represents the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. It can provide data for a visualization tool to inspect cluster number, membership, and boundaries. However, iterative descent methods lack the intuitive and visual appeal of hierarchical clustering dendrograms, and the number of clusters must be chosen a priori.
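The idea of taking a consensus over random restarts can be sketched as follows. This is a minimal illustration in NumPy, not a production implementation: the toy `kmeans` helper and the two-blob dataset are assumptions introduced here for demonstration, and any clustering algorithm with random initialization could be substituted.

```python
import numpy as np

def kmeans(X, k, rng, iters=50):
    """Minimal k-means with random initial centers (illustrative only)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def consensus_over_restarts(X, k, runs=20, seed=0):
    """Average the N x N connectivity matrices over several random restarts.

    Co-clustering is invariant to label permutation, so runs that find the
    same partition under different label names still agree here.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    consensus = np.zeros((n, n))
    for _ in range(runs):
        labels = kmeans(X, k, rng)
        consensus += (labels[:, None] == labels[None, :]).astype(float)
    return consensus / runs

# Two well-separated blobs: every restart recovers the same partition,
# so the consensus matrix is binary (fully stable).
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
C = consensus_over_restarts(X, k=2)
```

Entries near 0 or 1 indicate pairs on which the restarts agree; intermediate values flag unstable assignments caused by sensitivity to initial conditions.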
The Monti consensus clustering algorithm
The Monti consensus clustering algorithm is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, K. Given a dataset of N total points to cluster, this algorithm works by resampling and clustering the data for each K, and an N × N consensus matrix is calculated, where each element represents the fraction of times two samples clustered together. A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or not together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal K.
More specifically, given a set of N points to cluster, let D^(1), D^(2), ..., D^(H) be the list of H perturbed (resampled) datasets of the original dataset D, and let M^(h) denote the N × N connectivity matrix resulting from applying a clustering algorithm to the dataset D^(h). The entries of M^(h) are defined as follows:

M^(h)(i, j) = 1 if points i and j belong to the same cluster of D^(h), and 0 otherwise.
Let I^(h) be the N × N indicator matrix whose (i, j)-th entry is equal to 1 if points i and j are in the same perturbed dataset D^(h), and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration for the normalisation step. The consensus matrix M is defined as the normalised sum of the connectivity matrices of all the perturbed datasets, and a different one is calculated for every K:

M(i, j) = (Σ_h M^(h)(i, j)) / (Σ_h I^(h)(i, j))
That is, the entry (i, j) in the consensus matrix is the number of times points i and j were clustered together, divided by the total number of times they were both selected. The matrix is symmetric, and each element is defined within the range [0, 1].
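The Monti procedure for a fixed K can be sketched directly from these definitions. This is a hedged NumPy sketch, assuming subsampling without replacement as the perturbation and a caller-supplied `cluster_fn`; the function and parameter names are illustrative, not from a specific library.

```python
import numpy as np

def monti_consensus_matrix(X, k, cluster_fn, resamples=50, frac=0.8, seed=0):
    """Sketch of the Monti consensus matrix for a fixed number of clusters k.

    cluster_fn(X_sub, k) must return integer labels for the rows of X_sub;
    any base clustering algorithm can be plugged in.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    connectivity = np.zeros((n, n))  # running sum of M^(h): co-clustering counts
    indicator = np.zeros((n, n))     # running sum of I^(h): co-selection counts
    for _ in range(resamples):
        # Perturbed dataset D^(h): a random subsample of the points.
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = cluster_fn(X[idx], k)
        indicator[np.ix_(idx, idx)] += 1
        connectivity[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
    # M(i, j) = sum_h M^(h)(i, j) / sum_h I^(h)(i, j); pairs never
    # co-selected are left at 0 to avoid division by zero.
    return connectivity / np.maximum(indicator, 1)

# Toy usage: two well-separated blobs, split by a trivial threshold rule.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
C = monti_consensus_matrix(X, 2, lambda Xs, k: (Xs[:, 0] > 5).astype(int))
```

In practice this matrix is computed for each candidate K, and the K whose consensus matrix is closest to binary (most stable) is selected.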