Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based
clusters in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig,
Hans-Peter Kriegel
Hans-Peter Kriegel (1 October 1948, Germany) is a German computer scientist and professor at the Ludwig Maximilian University of Munich and leading the Database Systems Group in the Department of Computer Science. He was previously professor at t ...
and Jörg Sander.
Its basic idea is similar to
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
It is a density-based clustering non-parametric algorithm: g ...
, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the points of the database are (linearly) ordered such that spatially closest points become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that must be accepted for a cluster so that both points belong to the same cluster. This is represented as a
dendrogram
A dendrogram is a diagram representing a tree. This diagrammatic representation is frequently used in different contexts:
* in hierarchical clustering, it illustrates the arrangement of the clusters produced by the corresponding analyses.
...
.
Basic idea
Like
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
It is a density-based clustering non-parametric algorithm: g ...
, OPTICS requires two parameters: , which describes the maximum distance (radius) to consider, and , describing the number of points required to form a cluster. A point is a ''core point'' if at least points are found within its -neighborhood
(including point itself). In contrast to
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
It is a density-based clustering non-parametric algorithm: g ...
, OPTICS also considers points that are part of a more densely packed cluster, so each point is assigned a ''core distance'' that describes the distance to the th closest point:
:
The ''reachability-distance'' of another point from a point is either the distance between and , or the core distance of , whichever is bigger:
:
If and are nearest neighbors, this is the
we need to assume to have and belong to the same cluster.
Both core-distance and reachability-distance are undefined if no sufficiently dense cluster (w.r.t. ) is available. Given a sufficiently large , this never happens, but then every -neighborhood query returns the entire database, resulting in
runtime. Hence, the parameter is required to cut off the density of clusters that are no longer interesting, and to speed up the algorithm.
The parameter is, strictly speaking, not necessary. It can simply be set to the maximum possible value. When a spatial index is available, however, it does play a practical role with regards to complexity. OPTICS abstracts from DBSCAN by removing this parameter, at least to the extent of only having to give the maximum value.
Pseudocode
The basic approach of OPTICS is similar to
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
It is a density-based clustering non-parametric algorithm: g ...
, but instead of maintaining known, but so far unprocessed cluster members in a set, they are maintained in a
priority queue
In computer science, a priority queue is an abstract data-type similar to a regular queue or stack data structure in which each element additionally has a ''priority'' associated with it. In a priority queue, an element with high priority is se ...
(e.g. using an indexed
heap
Heap or HEAP may refer to:
Computing and mathematics
* Heap (data structure), a data structure commonly used to implement a priority queue
* Heap (mathematics), a generalization of a group
* Heap (programming) (or free store), an area of memory f ...
).
function OPTICS(DB, ε, MinPts) is
for each point p of DB do
p.reachability-distance = UNDEFINED
for each unprocessed point p of DB do
N = getNeighbors(p, ε)
mark p as processed
output p to the ordered list
if core-distance(p, ε, MinPts) != UNDEFINED then
Seeds = empty priority queue
update(N, p, Seeds, ε, MinPts)
for each next q in Seeds do
N' = getNeighbors(q, ε)
mark q as processed
output q to the ordered list
if core-distance(q, ε, MinPts) != UNDEFINED do
update(N', q, Seeds, ε, MinPts)
In update(), the priority queue Seeds is updated with the
-neighborhood of
and
, respectively:
function update(N, p, Seeds, ε, MinPts) is
coredist = core-distance(p, ε, MinPts)
for each o in N
if o is not processed then
new-reach-dist = max(coredist, dist(p,o))
if o.reachability-distance UNDEFINED then // o is not in Seeds
o.reachability-distance = new-reach-dist
Seeds.insert(o, new-reach-dist)
else // o in Seeds, check for improvement
if new-reach-dist < o.reachability-distance then
o.reachability-distance = new-reach-dist
Seeds.move-up(o, new-reach-dist)
OPTICS hence outputs the points in a particular ordering, annotated with their smallest reachability distance (in the original algorithm, the core distance is also exported, but this is not required for further processing).
Extracting the clusters
Using a ''reachability-plot'' (a special kind of
dendrogram
A dendrogram is a diagram representing a tree. This diagrammatic representation is frequently used in different contexts:
* in hierarchical clustering, it illustrates the arrangement of the clusters produced by the corresponding analyses.
...
), the hierarchical structure of the clusters can be obtained easily. It is a 2D plot, with the ordering of the points as processed by OPTICS on the x-axis and the reachability distance on the y-axis. Since points belonging to a cluster have a low reachability distance to their nearest neighbor, the clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.
The image above illustrates this concept. In its upper left area, a synthetic example data set is shown. The upper right part visualizes the
spanning tree
In the mathematical field of graph theory, a spanning tree ''T'' of an undirected graph ''G'' is a subgraph that is a tree which includes all of the vertices of ''G''. In general, a graph may have several spanning trees, but a graph that is no ...
produced by OPTICS, and the lower part shows the reachability plot as computed by OPTICS. Colors in this plot are labels, and not computed by the algorithm; but it is well visible how the valleys in the plot correspond to the clusters in above data set. The yellow points in this image are considered noise, and no valley is found in their reachability plot. They are usually not assigned to clusters, except the omnipresent "all data" cluster in a hierarchical result.
Extracting clusters from this plot can be done manually by selecting a range on the x-axis after visual inspection, by selecting a threshold on the y-axis (the result is then similar to a DBSCAN clustering result with the same
and parameters; here a value of 0.1 may yield good results), or by different algorithms that try to detect the valleys by steepness, knee detection, or local maxima. Clusterings obtained this way usually are
hierarchical
A hierarchy (from Greek: , from , 'president of sacred rites') is an arrangement of items (objects, names, values, categories, etc.) that are represented as being "above", "below", or "at the same level as" one another. Hierarchy is an important ...
, and cannot be achieved by a single DBSCAN run.
Complexity
Like
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
It is a density-based clustering non-parametric algorithm: g ...
, OPTICS processes each point once, and performs one
-neighborhood query during this processing. Given a
spatial index
A spatial database is a general-purpose database (usually a relational database) that has been enhanced to include spatial data that represents objects defined in a geometric space, along with tools for querying and analyzing such data. Most s ...
that grants a neighborhood query in
runtime, an overall runtime of
is obtained. The authors of the original OPTICS paper report an actual constant slowdown factor of 1.6 compared to DBSCAN. Note that the value of
might heavily influence the cost of the algorithm, since a value too large might raise the cost of a neighborhood query to linear complexity.
In particular, choosing
(larger than the maximum distance in the data set) is possible, but leads to quadratic complexity, since every neighborhood query returns the full data set. Even when no spatial index is available, this comes at additional cost in managing the heap. Therefore,
should be chosen appropriately for the data set.
Extensions
OPTICS-OF is an
outlier detection
In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...
algorithm based on OPTICS. The main use is the extraction of outliers from an existing run of OPTICS at low cost compared to using a different outlier detection method. The better known version
LOF is based on the same concepts.
DeLi-Clu, Density-Link-Clustering combines ideas from
single-linkage clustering
In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of e ...
and OPTICS, eliminating the
parameter and offering performance improvements over OPTICS.
HiSC is a hierarchical
subspace clustering Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology ca ...
(axis-parallel) method based on OPTICS.
HiCO is a hierarchical
correlation clustering Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance.
De ...
algorithm based on OPTICS.
DiSH is an improvement over HiSC that can find more complex hierarchies.
FOPTICS is a faster implementation using random projections.
HDBSCAN* is based on a refinement of DBSCAN, excluding border-points from the clusters and thus following more strictly the basic definition of density-levels by Hartigan.
Availability
Java implementations of OPTICS, OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH are available in the
ELKI data mining framework (with index acceleration for several distance functions, and with automatic cluster extraction using the ξ extraction method). Other Java implementations include the
Weka
The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. It is the only extant member of the genus '' Gallirallus''. Four subspecies are recog ...
extension (no support for ξ cluster extraction).
The
R package "dbscan" includes a C++ implementation of OPTICS (with both traditional dbscan-like and ξ cluster extraction) using a
k-d tree
In computer science, a ''k''-d tree (short for ''k-dimensional tree'') is a space-partitioning data structure for organizing points in a ''k''-dimensional space. ''k''-d trees are a useful data structure for several applications, such as sea ...
for index acceleration for Euclidean distance only.
Python implementations of OPTICS are available in th
PyClusteringlibrary and in
scikit-learn
scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms including support-vector ...
. HDBSCAN* is available in th
hdbscanlibrary.
References
{{Reflist
Cluster analysis algorithms