Chapter 21 Data Structures for
Disjoint Sets
Some applications require the grouping of n
distinct elements into a collection of
disjoint sets. Kruskal's algorithm for
finding a minimum spanning tree for a weighted
undirected graph is an example of such an
application in which the sets are sets of
vertices. The two operations we want to apply
are finding the set an element belongs to and
uniting two sets.
Section 21.1 describes the operations
supported by the disjoint-set data structure
and gives a simple application, finding the
connected components of an undirected graph.
Section 21.2 presents a linked-list
implementation for disjoint sets. Section
21.3 gives a more efficient representation
using rooted trees. The running time for this
representation is linear for practical
purposes, but is theoretically superlinear.
Section 21.4 discusses a very rapidly growing
function and its very slowly growing inverse,
which is used to calculate the running time
bound of the tree representation using
amortized analysis.
21.1 Disjoint-set operations
A disjoint-set data structure maintains a
collection S = {S_1, S_2, ... , S_k} of
disjoint dynamic sets. Each set is represented
by one of its members, the representative. In
some applications (e.g. Kruskal's algorithm) it
does not matter which element is the
representative, so long as the representative
doesn't change if its set doesn't change. In
other applications the representative may be,
for example, the smallest element.
Each element of the set is itself represented
by an object, say x. Then we need to support
the following operations:
MAKE-SET(x) creates a new set whose only
member (and thus its representative) is x.
Since the sets are disjoint, x cannot be in
any other set.
UNION(x,y) unites the sets containing x and
y, say S_x and S_y, into a new set that is the
union of S_x and S_y. S_x and S_y are assumed
to be disjoint before this operation. We pick
some element of S_x union S_y to be the
representative. Also, we remove S_x and S_y
from the collection S and add their union to S.
FIND-SET(x) returns a pointer to the
representative of the (unique) set containing
x.
In this chapter, we analyze the running
times of disjoint-set operations in terms of
two parameters: n, the number of MAKE-SET
operations (assumed to be done first), and m,
the total number of all operations. Since
each UNION reduces the number of sets by one,
there are at most n-1 UNIONs. And since the n
MAKE-SETs are included in m, we have m >= n.
An application of disjoint-set data structures
We use the disjoint-set data structure and
procedure CONNECTED-COMPONENTS below to find
the connected components of an undirected
graph as in Figure 21.1a. After CONNECTED-
COMPONENTS has been run as a preprocessing
step, procedure SAME-COMPONENT answers queries
about whether two vertices are in the same
component. As usual, G.V and G.E denote the
set of vertices and edges of G.
CONNECTED-COMPONENTS(G)
    for each vertex v in G.V do
        MAKE-SET(v)
    for each edge (u,v) in G.E do
        if FIND-SET(u) != FIND-SET(v) then
            UNION(u,v)

SAME-COMPONENT(u,v)
    if FIND-SET(u) == FIND-SET(v)
        then return TRUE
        else return FALSE
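
The two procedures above can be sketched in
Python as follows. This is only an illustrative
sketch: the dictionary-based parent map, the
function names, and the example graph are
assumptions made here, and no heuristics are
applied yet.

```python
def make_sets(vertices):
    # MAKE-SET for every vertex: each starts as its own representative.
    return {v: v for v in vertices}

def find_set(parent, x):
    # Follow parent links until reaching an element that is its own parent.
    while parent[x] != x:
        x = parent[x]
    return x

def union(parent, x, y):
    # Link the representative of x's set under the representative of y's set.
    parent[find_set(parent, x)] = find_set(parent, y)

def connected_components(vertices, edges):
    parent = make_sets(vertices)
    for u, v in edges:
        if find_set(parent, u) != find_set(parent, v):
            union(parent, u, v)
    return parent

def same_component(parent, u, v):
    return find_set(parent, u) == find_set(parent, v)

# Hypothetical example graph (not the graph of Figure 21.1).
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
parent = connected_components(vertices, edges)
print(same_component(parent, "a", "c"))  # True
print(same_component(parent, "a", "d"))  # False
```

Once the preprocessing pass has run, each query
costs only two FIND-SET calls.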
21.2 Linked-list representation of disjoint
sets
One way to represent disjoint sets is with a
linked list for each set; the first object in
the list contains the representative. Each
object contains a set member, a pointer to the
next object, and a pointer to the
representative. Each list maintains two
pointers: head, to the representative, and
tail, to the last object in the list. Objects
may appear in any order in
a list. Figure 21.2a shows two lists, and
Figure 21.2b shows their union.
With linked lists, MAKE-SET and FIND-SET are
easy, requiring O(1) time: for MAKE-SET(x) we
create a new list whose only object is x; for
FIND-SET(x), we just return the back pointer
to the representative.
A simple implementation of union
We perform UNION(x,y) by appending x's list
to the end of y's list using y's tail pointer,
with the representative just being that of y.
Unfortunately, we must update the
representative pointer of every object in x's
list, which takes time linear in the length of
x's list.
In fact, there is a sequence of m operations
on n objects needing Theta(n^2) time: suppose
we have x_1, x_2, ..., x_n and do n MAKE-SET
operations followed by n-1 UNION operations as
shown in Figure 21.3, so that m = 2n - 1. We
spend Theta(n) time on the MAKE-SETs, and

    sum_{i=1}^{n-1} i = Theta(n^2)

time on the UNIONs, since the i-th UNION
operation updates i objects. Thus the average
cost of each of the m operations is Theta(n).
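
To make the quadratic behaviour concrete, here
is a minimal Python sketch of the simple UNION
that counts representative-pointer updates. The
Node class, the choice to store the tail
pointer on the representative, and the helper
worst_case_updates are assumptions made for
illustration, not part of the notes.

```python
class Node:
    def __init__(self, value):
        # MAKE-SET: a one-element list whose only object is its own representative.
        self.value = value
        self.next = None
        self.rep = self    # back pointer to the representative
        self.tail = self   # assumption: tail is kept on the representative only

def find_set(x):
    return x.rep

def union(x, y):
    """Append x's list to y's list; return the number of pointers updated."""
    rx, ry = find_set(x), find_set(y)
    if rx is ry:
        return 0
    ry.tail.next = rx      # splice x's list after y's tail
    ry.tail = rx.tail
    updates = 0
    node = rx
    while node is not None:
        node.rep = ry      # every object of x's list gets a new representative
        updates += 1
        node = node.next
    return updates

def worst_case_updates(n):
    # n MAKE-SETs, then UNION(x_1,x_2), UNION(x_2,x_3), ..., UNION(x_{n-1},x_n).
    nodes = [Node(i) for i in range(n)]
    return sum(union(nodes[i - 1], nodes[i]) for i in range(1, n))

print(worst_case_updates(6))    # 1 + 2 + 3 + 4 + 5 = 15
print(worst_case_updates(100))  # 4950
```

In this sequence the longer list is always the
one being appended, so the i-th UNION touches i
objects.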
A weighted-union heuristic
We can improve this by using a weighted-union
heuristic: also maintain the length of each
list and append the smaller list to the larger
one in the UNION operation. A single UNION
can still take Omega(n) time if both sets have
Omega(n) members, but as the following theorem
says, we get better amortized performance.
Theorem 21.1 In the linked-list representation
using the weighted-union heuristic, a sequence
of m MAKE-SET, UNION, and FIND-SET operations,
with n MAKE-SETs, takes O(m + n lg n) time.
Proof: We start by bounding the number of
times an object x's "representative" pointer
can be updated; an update happens only when x
is in the smaller set. So the first time x is
updated, the resulting union has at least 2
objects; the second time, the resulting union
has at least 4 objects, etc. So for any k <= n,
after x has been updated ceiling(lg k) times,
the resulting set has at least k objects.
Since no set has more than n objects, each
object is updated at most ceiling(lg n) times,
so all the UNIONs together take O(n lg n)
time. Each MAKE-SET and FIND-SET takes O(1)
time and there are O(m) of them, which gives
the result.
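
The counting argument in the proof can be
checked experimentally. The sketch below is an
illustration under stated assumptions: Python
lists stand in for the linked lists (so the
append is not a constant-time splice), and the
function name and the test sequence of UNIONs
are made up here. It records how many times
each object's representative changes.

```python
import math

def weighted_union_updates(n, pairs):
    """Run UNION on each pair; return how often each object's pointer changed."""
    rep = list(range(n))               # rep[x]: representative of x's set
    members = [[x] for x in range(n)]  # members[r]: objects in the list headed by r
    updates = [0] * n
    for x, y in pairs:
        small, large = rep[x], rep[y]
        if small == large:
            continue                   # already in the same set
        if len(members[small]) > len(members[large]):
            small, large = large, small  # weighted union: append the smaller list
        for obj in members[small]:
            rep[obj] = large
            updates[obj] += 1
        members[large].extend(members[small])
        members[small] = []
    return updates

# Merge equal-sized blocks level by level, which doubles set sizes each round.
n = 64
pairs = []
step = 1
while step < n:
    pairs.extend((i, i + step) for i in range(0, n, 2 * step))
    step *= 2
updates = weighted_union_updates(n, pairs)
print(max(updates), math.ceil(math.log2(n)))        # per-object count vs. bound
print(sum(updates), n * math.ceil(math.log2(n)))    # total updates vs. n lg n
```

No object's count can exceed ceiling(lg n),
whatever sequence of pairs is supplied.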
21.3 Disjoint-set forests
In a faster implementation of disjoint sets,
we represent sets by rooted trees, with each
node containing one member and each tree
representing one set. In a disjoint-set
forest, each member points only to its parent
(Figure 21.4a). The root of each tree
contains the representative and is its own
parent. Though the straightforward algorithms
using this representation are no faster than
linked lists, by adding two heuristics, "union
by rank" and "path compression", we obtain the
asymptotically fastest known disjoint-set data
structure.
A MAKE-SET just creates a tree with one node.
A FIND-SET follows parent pointers until we
find the root; the nodes visited are called
the "find path". A UNION causes one root to
point to the other root as in Figure 21.4b.
Heuristics to improve the running time
The first heuristic, union by rank, is
similar to the weighted-union heuristic we
used with linked-lists. The idea is to make
the root of the smaller tree point to the root
of the bigger tree. For each node, we have a
rank field that is an upper bound on the
height of the node. In union by rank, we make
the root with the smaller rank point to the
root with the larger rank during a UNION.
The second heuristic, path compression, is
applied during FIND-SET: we make each node on
the find path point directly to the root, as
shown in Figure 21.5. Path compression does
not change any ranks.
Pseudocode for disjoint-set forests
To implement the union by rank heuristic, for
each node x we maintain a field x.rank, which
is an upper bound on x's height (the number of
edges on the longest path between x and a
descendant leaf). When a singleton set is
created by MAKE-SET, x.rank is set to 0. As
noted above, FIND-SET does not change any
ranks. In the UNION operation there are two
cases: (1) if the ranks are unequal, the root
with the smaller rank points to the root with
the larger rank and all ranks are unchanged,
or (2) if the ranks are the same, we choose
arbitrarily one of the roots to be the parent
and increment its rank (only). As usual, we
let x.p point to the parent of x.
MAKE-SET(x)
    x.p = x
    x.rank = 0

UNION(x,y)
    LINK(FIND-SET(x), FIND-SET(y))

LINK(x,y)
    if x.rank > y.rank
        then y.p = x
        else x.p = y
             if x.rank == y.rank
                 then y.rank = y.rank + 1

FIND-SET(x)
    if x != x.p
        then x.p = FIND-SET(x.p)
    return x.p
Note that FIND-SET is a two-pass method: on
the first pass, during successive recursive
calls, we move up the find path to the root;
on the second pass, returning from the calls,
we make each node on the find path point
directly to the root.
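
The pseudocode translates almost line for line
into Python. The following is a sketch rather
than a definitive implementation: the Node
class is an assumption, and the check in union
that skips linking when both arguments already
share a root is an addition not present in the
pseudocode, which assumes the two sets are
disjoint.

```python
class Node:
    def __init__(self, value):
        # MAKE-SET: a one-node tree whose root is its own parent.
        self.value = value
        self.p = self
        self.rank = 0

def find_set(x):
    # First pass: recurse up to the root. Second pass (on return):
    # repoint every node on the find path directly at the root.
    if x is not x.p:
        x.p = find_set(x.p)
    return x.p

def link(x, y):
    # x and y must be roots. The root with the smaller rank points to the other.
    if x.rank > y.rank:
        y.p = x
    else:
        x.p = y
        if x.rank == y.rank:
            y.rank += 1

def union(x, y):
    rx, ry = find_set(x), find_set(y)
    if rx is not ry:   # added guard: the pseudocode assumes disjoint sets
        link(rx, ry)

nodes = [Node(i) for i in range(8)]
union(nodes[0], nodes[1])
union(nodes[2], nodes[3])
union(nodes[0], nodes[2])
print(find_set(nodes[0]) is find_set(nodes[3]))  # True
print(find_set(nodes[4]) is nodes[4])            # True
```

Because union by rank keeps trees shallow, the
recursion in find_set stays well within
Python's default recursion limit for any
realistic n.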
Effect of the heuristics on running time
If we use union by rank alone, the running
time is O(m lg n) (Exercise 21.4-4) and this
bound is tight (Exercise 21.3-3). If we use
path compression alone with f FIND-SET calls,
the worst-case running time is given by:
    Theta(n + f * (1 + log_{2 + f/n} n)).
If both union by rank and path compression
are used, the worst-case running time is
O(m alpha(n)), where alpha(n) is a very slowly
growing function, with value <= 4 for any
value of n in practice.
21.4 Analysis of union by rank with
path compression
In this section, we define the function alpha
and prove the time bound O(m alpha(n)) by
using the potential method.
A very quickly growing function and its very
slowly growing inverse
For integers k >= 0 and j >= 1 we define the
function A_k(j) as follows:

    A_k(j) = j + 1                 if k = 0
    A_k(j) = A_{k-1}^{(j+1)}(j)    if k >= 1

where the bottom expression uses the
function-iteration notation of Section 3.2; so
we have

    A_{k-1}^{(0)}(j) = j   and
    A_{k-1}^{(i)}(j) = A_{k-1}(A_{k-1}^{(i-1)}(j))

for i >= 1. We call k the level of A.
Lemma 21.2 For j >= 1, A_1(j) = 2*j + 1.

Lemma 21.3 For j >= 1,
A_2(j) = 2^(j+1) * (j+1) - 1.
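
Both lemmas can be spot-checked by evaluating
the definition directly for small arguments.
The function A below is a hypothetical helper
written for this check; it is only practical
for k <= 2, plus A_3(1), because the values
grow so quickly.

```python
def A(k, j):
    """Evaluate A_k(j) straight from the recursive definition."""
    if k == 0:
        return j + 1
    result = j
    for _ in range(j + 1):       # apply A_{k-1} exactly j+1 times, starting at j
        result = A(k - 1, result)
    return result

# Lemma 21.2: A_1(j) = 2j + 1
print(all(A(1, j) == 2 * j + 1 for j in range(1, 11)))
# Lemma 21.3: A_2(j) = 2^(j+1) * (j+1) - 1
print(all(A(2, j) == 2 ** (j + 1) * (j + 1) - 1 for j in range(1, 7)))
print([A(k, 1) for k in range(4)])  # [2, 3, 7, 2047]
```

A check like this does not replace the
induction proofs, but it catches transcription
errors in the closed forms.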
We now compute A_k(1) for k = 0,...,4:

    A_0(1) = 1 + 1 = 2             (by definition)
    A_1(1) = 2*1 + 1 = 3           (by Lemma 21.2)
    A_2(1) = 2^2 * (1+1) - 1 = 7   (by Lemma 21.3)

    A_3(1) = A_2^{(2)}(1) = A_2(A_2(1)) = A_2(7)
           = 2^8 * 8 - 1 = 2^11 - 1 = 2047

    A_4(1) = A_3^{(2)}(1) = A_3(A_3(1)) = A_3(2047)
           = A_2^{(2048)}(2047)
           >> A_2(2047) = 2^2048 * 2048 - 1
           > 2^2048 > 2^1000 > 10^300
Now alpha(n) = min{k : A_k(1) >= n}.
So alpha(n) = 0 for n = 0, 1, 2; alpha(3) = 1;
alpha(n) = 2 for n = 4,...,7; alpha(n) = 3 for
7 < n < 2048; alpha(n) = 4 if
2047 < n <= A_4(1).
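
For every n that can arise in practice, alpha
can be tabulated from the four values of A_k(1)
computed above. The sketch below hard-codes
those thresholds; the function name and the
decision to return 4 for everything larger are
assumptions, valid only while n <= A_4(1).

```python
def alpha(n):
    """Smallest k with A_k(1) >= n, assuming n <= A_4(1)."""
    thresholds = [2, 3, 7, 2047]   # A_0(1), A_1(1), A_2(1), A_3(1)
    for k, bound in enumerate(thresholds):
        if n <= bound:
            return k
    return 4                       # A_4(1) exceeds 10^300, so this covers practical n

print([alpha(n) for n in (0, 2, 3, 4, 7, 8, 2047, 2048)])
# [0, 0, 1, 2, 2, 3, 3, 4]
```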