History
In 1972, Robert F. Ling published a closely related algorithm in "The Theory and Construction of k-Clusters" in ''The Computer Journal'' with an estimated runtime complexity of O(n³). DBSCAN has a worst-case of O(n²), and the database-oriented range-query formulation of DBSCAN allows for index acceleration. The algorithms slightly differ in their handling of border points.

Preliminary
Consider a set of points in some space to be clustered. Let ε be a parameter specifying the radius of a neighborhood with respect to some point. For the purpose of DBSCAN clustering, the points are classified as ''core points'', (''directly'') ''reachable points'' and ''outliers'', as follows:

* A point ''p'' is a ''core point'' if at least minPts points are within distance ε of it (including ''p'').
* A point ''q'' is ''directly reachable'' from ''p'' if point ''q'' is within distance ε from core point ''p''. Points are only said to be directly reachable from core points.
* A point ''q'' is ''reachable'' from ''p'' if there is a path ''p''₁, ..., ''p''ₙ with ''p''₁ = ''p'' and ''p''ₙ = ''q'', where each ''p''ᵢ₊₁ is directly reachable from ''p''ᵢ. Note that this implies that the initial point and all points on the path must be core points, with the possible exception of ''q''.
* All points not reachable from any other point are ''outliers'' or ''noise points''.

Now if ''p'' is a core point, then it forms a ''cluster'' together with all points (core or non-core) that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points.
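A small worked illustration of these definitions, as a Python sketch (the data and the ε and minPts values are made up for the example):

```python
from math import dist  # Euclidean distance

# Hypothetical toy data: three nearby points and one distant one.
points = [(0.0, 0.0), (0.4, 0.0), (0.0, 0.4), (2.0, 2.0)]
eps, min_pts = 0.5, 3

def is_core(p):
    # Core point: at least min_pts points within eps, counting p itself.
    return sum(dist(p, q) <= eps for q in points) >= min_pts

print([is_core(p) for p in points])  # [True, False, False, False]
```

Here (0.4, 0.0) and (0.0, 0.4) are not core points, but they are directly reachable from the core point (0.0, 0.0) and thus belong to its cluster as edge points, while (2.0, 2.0) is reachable from no other point and is therefore noise.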
Algorithm

Original query-based algorithm
DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region (minPts). It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-neighborhood of a different point and hence be made part of a cluster.

If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. Hence, all points that are found within the ε-neighborhood are added, as is their own ε-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.

DBSCAN can be used with any distance function (as well as similarity functions or other predicates). The distance function (dist) can therefore be seen as an additional parameter.
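The original publication presents the algorithm in pseudocode; since that listing is not reproduced here, the following is a minimal Python rendering of the procedure just described (the names region_query and dbscan, and the convention of encoding noise as -1, are choices made for this sketch):

```python
from math import dist  # Euclidean; any distance function could be substituted

UNDEFINED, NOISE = None, -1

def region_query(db, p, eps):
    """Linear-scan range query: indices of all points within eps of db[p].
    This is the O(n) fallback; an index (e.g. an R*-tree) would replace it."""
    return [q for q in range(len(db)) if dist(db[p], db[q]) <= eps]

def dbscan(db, eps, min_pts):
    """Label every point with a cluster id (0, 1, ...) or NOISE."""
    labels = [UNDEFINED] * len(db)
    cluster = -1
    for p in range(len(db)):
        if labels[p] is not UNDEFINED:       # already processed
            continue
        neighbors = region_query(db, p, eps)
        if len(neighbors) < min_pts:         # not a core point
            labels[p] = NOISE                # may become a border point later
            continue
        cluster += 1                         # start a new cluster
        labels[p] = cluster
        seeds = [q for q in neighbors if q != p]
        for q in seeds:                      # seeds grows during iteration
            if labels[q] == NOISE:           # border point: claim, don't expand
                labels[q] = cluster
            if labels[q] is not UNDEFINED:
                continue
            labels[q] = cluster
            q_neighbors = region_query(db, q, eps)
            if len(q_neighbors) >= min_pts:  # q is itself core: expand from it
                seeds.extend(n for n in q_neighbors if n not in seeds)
    return labels

points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.0),
          (8.0, 8.0), (8.1, 8.2), (25.0, 25.0)]
print(dbscan(points, eps=0.5, min_pts=2))    # -> [0, 0, 0, 1, 1, -1]
```

The membership test `n not in seeds` keeps the seed list duplicate-free; a production implementation would use a set for this, and would route region_query through a spatial index.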
Abstract algorithm

The DBSCAN algorithm can be abstracted into the following steps:

# Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors.
# Find the connected components of ''core'' points on the neighbor graph, ignoring all non-core points.
# Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise.

A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory. The original DBSCAN algorithm does not require this by performing these steps for one point at a time.
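Read against the definitions above, these three steps could be sketched as follows (a breadth-first search finds the connected components; all names are illustrative):

```python
from math import dist
from collections import deque

def dbscan_abstract(db, eps, min_pts):
    n = len(db)
    # Step 1: materialize every eps-neighborhood and mark core points
    # (counting the point itself, per the Preliminary section). Storing
    # all neighbor lists is exactly the memory cost noted above.
    neighbors = [[q for q in range(n) if dist(db[p], db[q]) <= eps]
                 for p in range(n)]
    core = [len(neighbors[p]) >= min_pts for p in range(n)]
    # Step 2: connected components of the neighbor graph restricted to
    # core points, via breadth-first search.
    labels = [-1] * n                  # -1 = noise
    cluster = -1
    for p in range(n):
        if not core[p] or labels[p] != -1:
            continue
        cluster += 1
        labels[p] = cluster
        queue = deque([p])
        while queue:
            q = queue.popleft()
            for r in neighbors[q]:
                if core[r] and labels[r] == -1:
                    labels[r] = cluster
                    queue.append(r)
    # Step 3: attach each non-core point to the cluster of some core
    # point in its eps-neighborhood; otherwise it remains noise.
    for p in range(n):
        if not core[p]:
            for q in neighbors[p]:
                if core[q]:
                    labels[p] = labels[q]
                    break
    return labels
```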
Complexity

DBSCAN visits each point of the database, possibly multiple times (e.g., as candidates to different clusters). For practical considerations, however, the time complexity is mostly governed by the number of regionQuery invocations. DBSCAN executes exactly one such query for each point, and if an indexing structure is used that executes a neighborhood query in O(log ''n''), an overall average runtime complexity of O(''n'' log ''n'') is obtained (if parameter ε is chosen in a meaningful way, i.e. such that on average only O(log ''n'') points are returned). Without the use of an accelerating index structure, or on degenerated data (e.g. all points within a distance less than ε), the worst-case run time complexity remains O(''n''²). The (''n''² − ''n'')/2-sized upper triangle of the distance matrix can be materialized to avoid distance recomputations, but this needs O(''n''²) memory, whereas a non-matrix-based implementation of DBSCAN only needs O(''n'') memory.
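As an illustration of such index acceleration (assuming SciPy is available; the original formulation targets database indexes such as the R*-tree), a k-d tree answers each regionQuery without a linear scan:

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(10_000, 2)
tree = cKDTree(points)                 # built once over the data set
eps = 0.01
# One regionQuery: indices of all points within eps of points[0].
neighbors = tree.query_ball_point(points[0], r=eps)
```

Routing the region_query of the earlier sketch through such a structure yields the O(n log n) average behaviour described here.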
Advantages

# DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.
# DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the minPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
# DBSCAN has a notion of noise, and is robust to outliers.
# DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)
# DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R*-tree.
# The parameters minPts and ε can be set by a domain expert, if the data is well understood.
Disadvantages

# DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order in which the data are processed. For most data sets and domains, this situation does not arise often and has little impact on the clustering result: on both core points and noise points, DBSCAN is deterministic. DBSCAN* is a variation that treats border points as noise, and this way achieves a fully deterministic result as well as a more consistent statistical interpretation of density-connected components.
# The quality of DBSCAN depends on the distance measure used in the function regionQuery(P,ε). The most common distance metric used is Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "curse of dimensionality", making it difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on Euclidean distance.
# DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters.
# If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.

See the section below on extensions for algorithmic modifications to handle these issues.
Parameter estimation

Every data mining task has the problem of parameters. Every parameter influences the algorithm in specific ways. For DBSCAN, the parameters ε and minPts are needed. The parameters must be specified by the user. Ideally, the value of ε is given by the problem to solve (e.g. a physical distance), and minPts is then the desired minimum cluster size.

* ''MinPts'': As a rule of thumb, a minimum minPts can be derived from the number of dimensions ''D'' in the data set, as minPts ≥ ''D'' + 1. The low value of minPts = 1 does not make sense, as then every point is a core point by definition. With minPts ≤ 2, the result will be the same as of hierarchical clustering with the single link metric, with the dendrogram cut at height ε. Therefore, minPts must be chosen at least 3.
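One common practice (going back to the sorted k-dist graph of the original paper) is to plot each point's distance to its (minPts − 1)-th nearest neighbor, sorted, and to read ε off the bend of that curve. A sketch, assuming NumPy and scikit-learn are available:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sorted_k_distances(points, min_pts):
    """Sorted distance of every point to its (min_pts - 1)-th nearest
    other point; a knee in this curve is a common choice for eps."""
    nn = NearestNeighbors(n_neighbors=min_pts).fit(points)
    dists, _ = nn.kneighbors(points)   # column 0 is the point itself
    return np.sort(dists[:, -1])

rng = np.random.default_rng(0)
points = rng.random((500, 2))          # toy data with D = 2
min_pts = points.shape[1] + 1          # rule of thumb above: minPts >= D + 1
curve = sorted_k_distances(points, min_pts)
# eps is then chosen at the elbow of `curve`, e.g. by plotting it.
```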
Relationship to spectral clustering

A spectral implementation of DBSCAN is related to spectral clustering in the trivial case of determining connected graph components — the optimal clusters with no edges cut. However, it can be computationally intensive, up to O(''n''³). Additionally, one has to choose the number of eigenvectors to compute. For performance reasons, the original DBSCAN algorithm remains preferable to its spectral implementation.

Extensions
Generalized DBSCAN (GDBSCAN) is a generalization by the same authors to arbitrary "neighborhood" and "dense" predicates. The ε and minPts parameters are removed from the original algorithm and moved to the predicates. For example, on polygon data, the "neighborhood" could be any intersecting polygon, whereas the density predicate uses the polygon areas instead of just the object count.

Various extensions to the DBSCAN algorithm have been proposed, including methods for parallelization, parameter estimation, and support for uncertain data. The basic idea has been extended to hierarchical clustering by the OPTICS algorithm. DBSCAN is also used as part of subspace clustering algorithms like PreDeCon and SUBCLU.

Availability
Different implementations of the same algorithm were found to exhibit enormous performance differences, with the fastest on a test data set finishing in 1.4 seconds, the slowest taking 13803 seconds. The differences can be attributed to implementation quality, language and compiler differences, and the use of indexes for acceleration.

* Apache Commons Math