In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets. It provides operations for adding new sets, merging sets (replacing them by their union), and finding a representative member of a set. The last operation makes it possible to find out efficiently whether any two elements are in the same or different sets.
While there are several ways of implementing disjoint-set data structures, in practice they are often identified with a particular implementation called a disjoint-set forest. This is a specialized type of forest that performs unions and finds in near-constant amortized time. A sequence of m addition, union, or find operations on a disjoint-set forest with n nodes requires total time O(mα(n)), where α(n) is the extremely slow-growing inverse Ackermann function. Disjoint-set forests do not guarantee this performance on a per-operation basis. Individual union and find operations can take longer than a constant times α(n) time, but each operation causes the disjoint-set forest to adjust itself so that successive operations are faster. Disjoint-set forests are both asymptotically optimal and practically efficient.
Disjoint-set data structures play a key role in Kruskal's algorithm for finding the minimum spanning tree of a graph. The importance of minimum spanning trees means that disjoint-set data structures underlie a wide variety of algorithms. In addition, disjoint-set data structures have applications in symbolic computation, as well as in compilers, especially for register allocation problems.
History
Disjoint-set forests were first described by Bernard A. Galler and Michael J. Fischer in 1964. In 1973, Hopcroft and Ullman bounded their time complexity by O(log* n), the iterated logarithm of n. In 1975, Robert Tarjan was the first to prove the O(mα(n)) (inverse Ackermann function) upper bound on the algorithm's time complexity, and, in 1979, showed that this was the lower bound for a restricted case. In 1989, Fredman and Saks showed that Ω(α(n)) (amortized) words must be accessed by ''any'' disjoint-set data structure per operation, thereby proving the optimality of the data structure.
In 1991, Galil and Italiano published a survey of data structures for disjoint-sets.
In 1994, Richard J. Anderson and Heather Woll described a parallelized version of Union–Find that never needs to block.
In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a semi-persistent version of the disjoint-set forest data structure and formalized its correctness using the proof assistant Coq. "Semi-persistent" means that previous versions of the structure are efficiently retained, but accessing previous versions of the data structure invalidates later ones. Their fastest implementation achieves performance almost as efficient as the non-persistent algorithm. They do not perform a complexity analysis.
Variants of disjoint-set data structures with better performance on a restricted class of problems have also been considered. Gabow and Tarjan showed that if the possible unions are restricted in certain ways, then a truly linear time algorithm is possible.
Representation
Each node in a disjoint-set forest consists of a pointer and some auxiliary information, either a size or a rank (but not both). The pointers are used to make parent pointer trees, where each node that is not the root of a tree points to its parent. To distinguish root nodes from others, their parent pointers have invalid values, such as a circular reference to the node or a sentinel value. Each tree represents a set stored in the forest, with the members of the set being the nodes in the tree. Root nodes provide set representatives: two nodes are in the same set if and only if the roots of the trees containing the nodes are equal.
Nodes in the forest can be stored in any way convenient to the application, but a common technique is to store them in an array. In this case, parents can be indicated by their array index. Every array entry requires Θ(log n) bits of storage for the parent pointer. A comparable or lesser amount of storage is required for the rest of the entry, so the number of bits required to store the forest is Θ(n log n). If an implementation uses fixed-size nodes (thereby limiting the maximum size of the forest that can be stored), then the necessary storage is linear in n.
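The array layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the elements are taken to be the integers 0 to n − 1, roots are marked by a self-reference, and the names `parent`, `size`, and `root_of` are hypothetical rather than part of any standard library.

```python
# Hypothetical array-based forest for the elements 0..n-1.
n = 6
parent = list(range(n))  # parent[i] == i marks i as a root
size = [1] * n           # auxiliary information, meaningful only at roots

# Link nodes 1 and 2 under node 0 by hand, so {0, 1, 2} is one tree rooted at 0.
parent[1] = 0
parent[2] = 0
size[0] = 3


def root_of(i):
    """Follow parent indices until a self-referencing entry (a root) is reached."""
    while parent[i] != i:
        i = parent[i]
    return i


# Two elements are in the same set exactly when their roots are equal.
same_set = root_of(1) == root_of(2)
different_sets = root_of(2) != root_of(3)
```

Storing parents as indices rather than object references is what makes the per-entry cost a fixed number of bits once the maximum forest size is fixed.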
Operations
Disjoint-set data structures support three operations: making a new set containing a new element; finding the representative of the set containing a given element; and merging two sets.
Making new sets
The MakeSet operation adds a new element into a new set containing only the new element, and the new set is added to the data structure. If the data structure is instead viewed as a partition of a set, then the MakeSet operation enlarges the set by adding the new element, and it extends the existing partition by putting the new element into a new subset containing only the new element.
In a disjoint-set forest, MakeSet initializes the node's parent pointer and the node's size or rank. If a root is represented by a node that points to itself, then adding an element can be described using the following pseudocode:
function MakeSet(''x'') is
    if ''x'' is not already in the forest then
        ''x''.parent := ''x''
        ''x''.size := 1     ''// if nodes store size''
        ''x''.rank := 0     ''// if nodes store rank''
    end if
end function
This operation has constant time complexity. In particular, initializing a disjoint-set forest with n nodes requires O(n) time.
In practice, MakeSet must be preceded by an operation that allocates memory to hold ''x''. As long as memory allocation is an amortized constant-time operation, as it is for a good dynamic array implementation, it does not change the asymptotic performance of the disjoint-set forest.
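As a rough Python counterpart to the MakeSet pseudocode, the sketch below keeps the forest in dictionaries so that any hashable value can be an element. The dictionary layout and the names `parent`, `size`, and `make_set` are assumptions made for this illustration; the membership test plays the role of "x is not already in the forest".

```python
# Hypothetical dictionary-backed forest; keys are the elements themselves.
parent = {}
size = {}


def make_set(x):
    """Add x as a singleton set unless it is already in the forest."""
    if x not in parent:
        parent[x] = x  # a root points to itself
        size[x] = 1    # if nodes store size (a rank would start at 0)


for element in ["a", "b", "c"]:
    make_set(element)

# Calling make_set again on an existing element leaves the forest unchanged.
parent["b"] = "a"  # pretend "b" was merged under "a" earlier
make_set("b")
```

Python dictionaries grow in amortized constant time per insertion, which matches the allocation assumption stated above.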
Finding set representatives
The Find operation follows the chain of parent pointers from a specified query node ''x'' until it reaches a root element. This root element represents the set to which ''x'' belongs and may be ''x'' itself. Find returns the root element it reaches.
Performing a Find operation presents an important opportunity for improving the forest. The time in a Find operation is spent chasing parent pointers, so a flatter tree leads to faster Find operations. When a Find is executed, there is no faster way to reach the root than by following each parent pointer in succession. However, the parent pointers visited during this search can be updated to point closer to the root. Because every element visited on the way to a root is part of the same set, this does not change the sets stored in the forest. But it makes future Find operations faster, not only for the nodes between the query node and the root, but also for their descendants. This updating is an important part of the disjoint-set forest's amortized performance guarantee.
There are several algorithms for Find that achieve the asymptotically optimal time complexity. One family of algorithms, known as path compression, makes every node between the query node and the root point to the root. Path compression can be implemented using a simple recursion as follows:
function Find(''x'') is
    if ''x''.parent ≠ ''x'' then
        ''x''.parent := Find(''x''.parent)
        return ''x''.parent
    else
        return ''x''
    end if
end function
This implementation makes two passes, one up the tree and one back down. It requires enough scratch memory to store the path from the query node to the root (in the above pseudocode, the path is implicitly represented using the call stack). This can be decreased to a constant amount of memory by performing both passes in the same direction. The constant memory implementation walks from the query node to the root twice, once to find the root and once to update pointers:
function Find(''x'') is
    ''root'' := ''x''
    while ''root''.parent ≠ ''root'' do
        ''root'' := ''root''.parent
    end while

    while ''x''.parent ≠ ''root'' do
        ''parent'' := ''x''.parent
        ''x''.parent := ''root''
        ''x'' := ''parent''
    end while

    return ''root''
end function
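The two path-compression variants above translate fairly directly into Python. The sketch below assumes an array-based forest in which `parent[i] == i` marks a root; the function names are hypothetical. Note that the recursive form is subject to Python's recursion limit on very tall trees, which is one practical reason to prefer the constant-memory form.

```python
def find_recursive(parent, x):
    """Path compression by recursion; the call stack implicitly holds the path."""
    if parent[x] != x:
        parent[x] = find_recursive(parent, parent[x])
    return parent[x]


def find_iterative(parent, x):
    """Path compression in constant extra memory, using two upward walks."""
    # First walk: locate the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second walk: repoint every node on the path directly at the root.
    while parent[x] != root:
        next_node = parent[x]
        parent[x] = root
        x = next_node
    return root


# A chain 4 -> 3 -> 2 -> 1 -> 0, where 0 is the root.
chain_a = [0, 0, 1, 2, 3]
chain_b = list(chain_a)
root_a = find_recursive(chain_a, 4)
root_b = find_iterative(chain_b, 4)
# After either call, every node on the path points directly at the root.
```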
Tarjan and Van Leeuwen also developed one-pass Find algorithms that retain the same worst-case complexity but are more efficient in practice. These are called path splitting and path halving. Both of these update the parent pointers of nodes on the path between the query node and the root. Path splitting replaces every parent pointer on that path by a pointer to the node's grandparent:
function Find(''x'') is
    while ''x''.parent ≠ ''x'' do
        (''x'', ''x''.parent) := (''x''.parent, ''x''.parent.parent)
    end while
    return ''x''
end function
Path halving works similarly but replaces only every other parent pointer:
function Find(''x'') is
    while ''x''.parent ≠ ''x'' do
        ''x''.parent := ''x''.parent.parent
        ''x'' := ''x''.parent
    end while
    return ''x''
end function
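To make the difference between the two one-pass variants concrete, here is a small Python sketch on an array-based forest (again assuming `parent[i] == i` marks a root, with hypothetical function names). Running both on the same five-node chain shows that splitting touches every node on the path, while halving touches every other one.

```python
def find_splitting(parent, x):
    """Path splitting: every node on the path is repointed to its grandparent."""
    while parent[x] != x:
        next_node = parent[x]
        parent[x] = parent[next_node]  # point x at its grandparent
        x = next_node                  # then step to x's old parent
    return x


def find_halving(parent, x):
    """Path halving: every other node on the path is repointed to its grandparent."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # point x at its grandparent
        x = parent[x]                  # then jump to that grandparent
    return x


# A chain 4 -> 3 -> 2 -> 1 -> 0, where 0 is the root.
split_forest = [0, 0, 1, 2, 3]
halve_forest = [0, 0, 1, 2, 3]
split_root = find_splitting(split_forest, 4)  # forest becomes [0, 0, 0, 1, 2]
halve_root = find_halving(halve_forest, 4)    # forest becomes [0, 0, 0, 2, 2]
```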
Merging two sets
The operation Union(''x'', ''y'') replaces the set containing ''x'' and the set containing ''y'' with their union. Union first uses Find to determine the roots of the trees containing ''x'' and ''y''. If the roots are the same, there is nothing more to do. Otherwise, the two trees must be merged. This is done by either setting the parent pointer of the root of ''x'' to the root of ''y'', or setting the parent pointer of the root of ''y'' to the root of ''x''.
The choice of which node becomes the parent has consequences for the complexity of future operations on the tree. If it is done carelessly, trees can become excessively tall. For example, suppose that Union always made the tree containing ''x'' a subtree of the tree containing ''y''. Begin with a forest that has just been initialized with elements 1, 2, 3, ..., n, and execute Union(1, 2), Union(2, 3), ..., Union(n − 1, n). The resulting forest contains a single tree whose root is n, and the path from 1 to n passes through every node in the tree. For this forest, the time to run Find(1) is O(n).
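The degenerate case described above can be reproduced with a short Python sketch. It assumes an array-based forest over the elements 1 to n (index 0 is unused), a Find that performs no pointer updates, and a careless Union that always hangs the first root under the second; all names here are hypothetical.

```python
def find_counting(parent, x):
    """Plain Find with no pointer updates; also reports how many hops were needed."""
    hops = 0
    while parent[x] != x:
        x = parent[x]
        hops += 1
    return x, hops


def careless_union(parent, x, y):
    """Always make the tree containing x a subtree of the tree containing y."""
    root_x, _ = find_counting(parent, x)
    root_y, _ = find_counting(parent, y)
    if root_x != root_y:
        parent[root_x] = root_y


n = 1000
parent = list(range(n + 1))  # elements 1..n; index 0 is unused
for i in range(1, n):
    careless_union(parent, i, i + 1)

# The forest is now a single path 1 -> 2 -> ... -> n.
root, hops = find_counting(parent, 1)
```

Here `hops` comes out as n − 1, which is the linear cost that union by size and union by rank are designed to avoid.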
In an efficient implementation, tree height is controlled using union by size or union by rank. Both of these require a node to store information besides just its parent pointer. This information is used to decide which root becomes the new parent. Both strategies ensure that trees do not become too deep.
Union by size
In the case of union by size, a node stores its size, which is simply its number of descendants (including the node itself). When the trees with roots ''x'' and ''y'' are merged, the node with more descendants becomes the parent. If the two nodes have the same number of descendants, then either one can become the parent. In both cases, the size of the new parent node is set to its new total number of descendants.
function Union(''x'', ''y'') is
    ''// Replace nodes by roots''
    ''x'' := Find(''x'')
    ''y'' := Find(''y'')

    if ''x'' = ''y'' then
        return  ''// x and y are already in the same set''
    end if

    ''// If necessary, rename variables to ensure that''
    ''// x has at least as many descendants as y''
    if ''x''.size < ''y''.size then
        (''x'', ''y'') := (''y'', ''x'')
    end if

    ''// Make x the new root''
    ''y''.parent := ''x''
    ''// Update the size of x''
    ''x''.size := ''x''.size + ''y''.size
end function
The number of bits necessary to store the size is clearly the number of bits necessary to store n. This adds a constant factor to the forest's required storage.
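A possible Python rendering of union by size is sketched below, assuming an array-based forest and using path halving for Find. The boolean return value is an addition made for this illustration (the pseudocode returns nothing), and the function names are hypothetical.

```python
def find(parent, x):
    """Find with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union_by_size(parent, size, x, y):
    """Merge the sets containing x and y; return False if they were already merged."""
    x = find(parent, x)
    y = find(parent, y)
    if x == y:
        return False
    # Ensure x is the root with at least as many descendants as y.
    if size[x] < size[y]:
        x, y = y, x
    parent[y] = x
    size[x] += size[y]
    return True


n = 8
parent = list(range(n))
size = [1] * n
for a, b in [(0, 1), (2, 3), (0, 2), (4, 5), (0, 4)]:
    union_by_size(parent, size, a, b)

# Elements 0..5 now share the root 0, whose recorded size is 6; 6 and 7 are singletons.
merged_again = union_by_size(parent, size, 5, 3)
```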
Union by rank
For union by rank, a node stores its rank, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots ''x'' and ''y'', first compare their ranks. If the ranks are different, then the tree with the larger rank becomes the parent, and the ranks of ''x'' and ''y'' do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a Find operation, so storing ranks avoids the extra effort of keeping the height correct. In pseudocode, union by rank is:
function Union(''x'', ''y'') is
    ''// Replace nodes by roots''
    ''x'' := Find(''x'')
    ''y'' := Find(''y'')

    if ''x'' = ''y'' then
        return  ''// x and y are already in the same set''
    end if

    ''// If necessary, rename variables to ensure that''
    ''// x has rank at least as large as that of y''
    if ''x''.rank < ''y''.rank then
        (''x'', ''y'') := (''y'', ''x'')
    end if

    ''// Make x the new root''
    ''y''.parent := ''x''
    ''// If necessary, increment the rank of x''
    if ''x''.rank = ''y''.rank then
        ''x''.rank := ''x''.rank + 1
    end if
end function
It can be shown that every node has rank ⌊log n⌋ or less. Consequently each rank can be stored in O(log log n) bits and all the ranks can be stored in O(n log log n) bits. This makes the ranks an asymptotically negligible portion of the forest's size.
It is clear from the above implementations that the size and rank of a node do not matter unless a node is the root of a tree. Once a node becomes a child, its size and rank are never accessed again.
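Union by rank can be sketched in Python in much the same way as the pseudocode. The example assumes an array-based forest with a separate `rank` list and uses two-pass path compression for Find; the names are hypothetical. It also shows a rank staying unchanged when the two ranks differ.

```python
def find(parent, x):
    """Find with two-pass path compression."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        next_node = parent[x]
        parent[x] = root
        x = next_node
    return root


def union_by_rank(parent, rank, x, y):
    """Merge the sets containing x and y, attaching the lower-rank root."""
    x = find(parent, x)
    y = find(parent, y)
    if x == y:
        return
    # Ensure x is the root whose rank is at least that of y.
    if rank[x] < rank[y]:
        x, y = y, x
    parent[y] = x
    # Only a tie in ranks makes the new root's rank grow.
    if rank[x] == rank[y]:
        rank[x] += 1


n = 6
parent = list(range(n))
rank = [0] * n
union_by_rank(parent, rank, 0, 1)  # tie at rank 0: root 0 gets rank 1
union_by_rank(parent, rank, 2, 3)  # tie at rank 0: root 2 gets rank 1
union_by_rank(parent, rank, 0, 2)  # tie at rank 1: root 0 gets rank 2
union_by_rank(parent, rank, 4, 0)  # ranks 0 and 2 differ: 4 joins 0, no rank change
```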
Time complexity
A disjoint-set forest implementation in which Find does not update parent pointers, and in which Union does not attempt to control tree heights, can have trees with height O(n). In such a situation, the Find and Union operations require O(n) time.
If an implementation uses path compression alone, then a sequence of n MakeSet operations, followed by up to n − 1 Union operations and f Find operations, has a worst-case running time of Θ(n + f · (1 + log_{2 + f/n} n)).
Using union by rank, but without updating parent pointers during Find, gives a running time of Θ(m log n) for m operations of any type, up to n of which are MakeSet operations.
The combination of path compression, splitting, or halving, with union by size or by rank, reduces the running time for m operations of any type, up to n of which are MakeSet operations, to Θ(mα(n)). This makes the amortized running time of each operation Θ(α(n)). This is asymptotically optimal, meaning that every disjoint-set data structure must use Ω(α(n)) amortized time per operation. Here, the function α(n) is the inverse Ackermann function. The inverse Ackermann function grows extraordinarily slowly, so this factor is 4 or less for any n that can actually be written in the physical universe. This makes disjoint-set operations practically amortized constant time.
Proof of O(m log* n) time complexity of Union-Find
The precise analysis of the performance of a disjoint-set forest is somewhat intricate. However, there is a much simpler analysis that proves that the total time for any m Find or Union operations on a disjoint-set forest containing n objects is O(m log* n), that is, O(log* n) amortized per operation, where log* denotes the iterated logarithm.
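To give a feel for how slowly the iterated logarithm grows, the following Python sketch computes it directly from the definition. It assumes base-2 logarithms, which is the usual convention in this analysis, and the name `log_star` is hypothetical.

```python
import math


def log_star(n):
    """Count how many times log2 must be applied before the result is at most 1."""
    count = 0
    value = n
    while value > 1:
        value = math.log2(value)
        count += 1
    return count


# 16 -> 4 -> 2 -> 1 takes three applications; 65536 needs one more.
small = log_star(16)     # 3
large = log_star(65536)  # 4
```

Even for n = 2^65536, a number with nearly twenty thousand decimal digits, the value is only 5.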
Lemma 1: As the find function follows the path along to the root, the rank of the node it encounters is increasing.
Lemma 2: A node u which is the root of a subtree with rank r has at least 2^r nodes.
Lemma 3: The maximum number of nodes of rank r is at most n/2^r.
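The three lemmas can be spot-checked empirically. The Python sketch below builds a forest with union by rank and path compression from random unions and then tests each statement on the result; Lemma 2 is checked for tree roots only. This is a sanity check on one random instance under those assumptions, not a proof, and the helper names are hypothetical.

```python
import random
from collections import Counter


def find(parent, x):
    """Find with two-pass path compression."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        next_node = parent[x]
        parent[x] = root
        x = next_node
    return root


def union(parent, rank, x, y):
    """Union by rank."""
    x, y = find(parent, x), find(parent, y)
    if x == y:
        return
    if rank[x] < rank[y]:
        x, y = y, x
    parent[y] = x
    if rank[x] == rank[y]:
        rank[x] += 1


def check_lemmas(n, unions, seed=0):
    rng = random.Random(seed)
    parent = list(range(n))
    rank = [0] * n
    for _ in range(unions):
        union(parent, rank, rng.randrange(n), rng.randrange(n))
    # Lemma 1: a parent always has strictly larger rank than its child.
    lemma1 = all(rank[parent[v]] > rank[v] for v in range(n) if parent[v] != v)
    # Lemma 2 (for tree roots): a root of rank r has at least 2^r nodes below it.
    tree_size = [0] * n
    for v in range(n):
        tree_size[find(parent, v)] += 1
    lemma2 = all(tree_size[v] >= 2 ** rank[v] for v in range(n) if parent[v] == v)
    # Lemma 3: at most n / 2^r nodes have rank r.
    lemma3 = all(c <= n / 2 ** r for r, c in Counter(rank).items())
    return lemma1, lemma2, lemma3


results = check_lemmas(1000, 2000)
```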
For convenience, we define "bucket" here: a bucket is a set that contains vertices with particular ranks.
We create some buckets and put vertices into the buckets according to their ranks inductively. That is, vertices with rank 0 go into the zeroth bucket, vertices with rank 1 go into the first bucket, vertices with ranks 2 and 3 go into the second bucket. If the B-th bucket contains vertices with ranks from interval