CURE (Clustering Using REpresentatives) is an efficient data clustering algorithm for large databases. Compared with K-means clustering it is more robust to outliers and able to identify clusters having non-spherical shapes and size variances.
Drawbacks of traditional algorithms
The popular K-means clustering algorithm minimizes the sum of squared errors criterion:

: E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2 ,

where ''m_i'' is the mean of the points in cluster ''C_i''.
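As a concrete illustration, the criterion can be computed for a fixed assignment of points to clusters as follows (a minimal sketch; the data and function name are illustrative, not from the original article):

    import numpy as np

    def sum_of_squared_errors(clusters):
        """Sum of squared distances from each point to its cluster mean."""
        total = 0.0
        for points in clusters:
            points = np.asarray(points)
            mean = points.mean(axis=0)               # m_i, the mean of cluster C_i
            total += ((points - mean) ** 2).sum()    # squared distances to the mean
        return total

    # Two toy clusters; k-means searches for the assignment minimizing this value.
    clusters = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]
    print(sum_of_squared_errors(clusters))  # 1.0 = 0.5 + 0.5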
Given large differences in sizes or geometries of different clusters, the square error method could split the large clusters to minimize the square error, which is not always correct. These problems also exist with hierarchical clustering algorithms, since none of the usual distance measures between clusters (such as the minimum or mean distance between cluster points) works well for all cluster shapes. Moreover, the running time is high when ''n'' is large.
The problem with the BIRCH algorithm is that, once the clusters are generated after step 3, it uses the centroids of the clusters and assigns each data point to the cluster with the closest centroid. Using only the centroid to redistribute the data causes problems when clusters lack uniform sizes and shapes.
CURE clustering algorithm
To avoid the problems with non-uniformly sized or shaped clusters, CURE employs a hierarchical clustering algorithm that adopts a middle ground between the centroid-based and the all-point extremes. In CURE, a constant number ''c'' of well-scattered points of a cluster are chosen, and they are shrunk towards the centroid of the cluster by a fraction ''α''. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.
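A minimal sketch of this representative-selection step follows. It picks scattered points with a farthest-point heuristic and shrinks them toward the centroid by a fraction alpha; the function name and defaults are illustrative assumptions, not the full algorithm:

    import numpy as np

    def representatives(points, c=4, alpha=0.5):
        """Pick c well-scattered points, then shrink them toward the centroid."""
        points = np.asarray(points)
        centroid = points.mean(axis=0)
        # Start from the point farthest from the centroid, then repeatedly
        # add the point farthest from the representatives chosen so far.
        reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
        while len(reps) < min(c, len(points)):
            dists = np.min(
                [np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
            reps.append(points[np.argmax(dists)])
        reps = np.asarray(reps)
        # Shrink each scattered point toward the centroid by the fraction alpha.
        return reps + alpha * (centroid - reps)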
Running time is O(''n''² log ''n''), making it rather expensive, and space complexity is O(''n'').
The algorithm cannot be directly applied to large databases because of its high runtime complexity. The following enhancements address this limitation.
* Random sampling : Random sampling supports large data sets; generally the random sample fits in main memory. Random sampling involves a trade-off between accuracy and efficiency.
* Partitioning : The basic idea is to partition the sample space into ''p'' partitions, each containing ''n/p'' elements. The first pass partially clusters each partition until the number of clusters in it reduces to ''n/pq'' for some constant ''q'' ≥ 1. A second clustering pass then runs over the resulting ''n/q'' partial clusters from all partitions. For the second pass only the representative points need to be stored, since the merge procedure only requires the representative points of the previous clusters to compute the representative points of the merged cluster. Partitioning the input reduces the execution time.
* Labeling data on disk : Given only the representative points for the ''k'' clusters, the remaining data points are assigned to the clusters as well. For this, a fraction of randomly selected representative points of each of the ''k'' clusters is chosen, and each data point is assigned to the cluster containing the representative point closest to it (a sketch of this step follows the list).
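A minimal sketch of the labeling step, assuming the (sampled) representative points of each cluster are already available; names and data layout are illustrative:

    import numpy as np

    def label_points(points, cluster_reps):
        """Assign each point to the cluster owning the nearest representative.

        cluster_reps: list of arrays, one per cluster, holding a sampled
        subset of that cluster's representative points.
        """
        points = np.asarray(points)
        labels = np.empty(len(points), dtype=int)
        for i, p in enumerate(points):
            # Distance from p to the nearest representative of each cluster.
            nearest = [np.linalg.norm(np.asarray(reps) - p, axis=1).min()
                       for reps in cluster_reps]
            labels[i] = int(np.argmin(nearest))
        return labels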
Pseudocode
CURE (no. of points, ''k'')
Input : A set of points S
Output : ''k'' clusters
* For every cluster u (each input point), u.mean and u.rep store the mean of the points in the cluster and a set of ''c'' representative points of the cluster (initially ''c'' = 1, since each cluster has one data point); u.closest stores the cluster closest to u.
* All the input points are inserted into a k-d tree T.
* Treat each input point as a separate cluster, compute u.closest for each u, and then insert each cluster into the heap Q (clusters are arranged in increasing order of the distances between u and u.closest).
* While size(Q) > ''k''
** Remove the top element of Q (say u), merge it with its closest cluster u.closest (say v), and compute the new representative points for the merged cluster w.
** Remove u and v from T and Q.
** For all the clusters x in Q, update x.closest and relocate x in Q.
** Insert w into Q.
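A compact Python sketch of this merge loop follows, reusing the representatives() helper sketched earlier. For brevity it replaces the k-d tree and the heap with plain linear scans, so it is slower than the O(''n''² log ''n'') original; all names and defaults are illustrative assumptions:

    import numpy as np

    def cure(points, k, c=4, alpha=0.5):
        """Simplified CURE: repeatedly merge the two clusters whose
        representative points are closest, until k clusters remain."""
        # Each cluster: (list of member points, array of representative points).
        clusters = [([p], representatives([p], c, alpha))
                    for p in map(np.asarray, points)]
        while len(clusters) > k:
            # Find the pair of clusters with the closest representatives.
            best = (np.inf, None, None)
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = min(np.linalg.norm(r - s)
                            for r in clusters[i][1] for s in clusters[j][1])
                    if d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            # Merge the two clusters and recompute the representatives.
            merged = clusters[i][0] + clusters[j][0]
            w = (merged, representatives(merged, c, alpha))
            clusters = [cl for t, cl in enumerate(clusters) if t not in (i, j)]
            clusters.append(w)
        return [members for members, _ in clusters]

For example, cure([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]], k=2) groups the two nearby pairs together.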
Availability
The open source library pyclustering includes a Python and C++ implementation of the CURE algorithm.
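A minimal usage sketch with pyclustering follows; the sample data is made up, and parameter names and return formats should be checked against the documentation of the installed library version:

    from pyclustering.cluster.cure import cure

    sample = [[1.0, 1.0], [1.2, 0.9], [9.8, 10.1], [10.2, 9.9]]
    cure_instance = cure(sample, 2)          # request 2 clusters
    cure_instance.process()                  # run the algorithm
    clusters = cure_instance.get_clusters()  # lists of point indices, one per cluster
    print(clusters)                          # e.g. [[0, 1], [2, 3]]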
See also
* k-means clustering
* BFR algorithm