Chapter 21 Data Structures for Disjoint Sets

Some applications require grouping n distinct elements into a collection of disjoint sets. Kruskal's algorithm for finding a minimum spanning tree of a weighted undirected graph is one such application, in which the sets are sets of vertices. The two operations we want to apply are finding the set that an element belongs to and uniting two sets.

Section 21.1 describes the operations supported by the disjoint-set data structure and gives a simple application, finding the connected components of an undirected graph. Section 21.2 presents a linked-list implementation for disjoint sets. Section 21.3 gives a more efficient representation using rooted trees. The running time for this representation is linear for practical purposes, but is theoretically superlinear. Section 21.4 discusses a very rapidly growing function and its very slowly growing inverse, which is used to state the running time bound of the tree representation obtained by amortized analysis.

21.1 Disjoint-set operations

A disjoint-set data structure maintains a collection S = {S_1, S_2, ..., S_k} of disjoint dynamic sets. Each set is identified by one of its members, the representative. In some applications (e.g. Kruskal's algorithm) it does not matter which member is the representative, so long as the representative does not change while its set does not change. In other applications the representative may have to follow a fixed rule, for example being the smallest element of the set.

Each element of a set is itself represented by an object, say x. We need to support the following operations:

MAKE-SET(x) creates a new set whose only member (and thus its representative) is x. Since the sets are disjoint, x cannot already be in any other set.

UNION(x,y) unites the sets containing x and y, say S_x and S_y, into a new set that is the union of S_x and S_y. S_x and S_y are assumed to be disjoint before this operation.
We pick some element of S_x union S_y to be the representative of the new set. Since the sets in the collection must stay disjoint, we also remove S_x and S_y from the collection S and add their union to S.

FIND-SET(x) returns a pointer to the representative of the (unique) set containing x.

In this chapter, we analyze the running times of disjoint-set operations in terms of two parameters: n, the number of MAKE-SET operations (assumed to be done first), and m, the total number of MAKE-SET, UNION, and FIND-SET operations. Since each UNION reduces the number of sets by one, there are at most n-1 UNIONs. And since the n MAKE-SETs are included in m, we have m >= n.

An application of disjoint-set data structures

We use the disjoint-set data structure and the procedure CONNECTED-COMPONENTS below to find the connected components of an undirected graph, as in Figure 21.1a. After CONNECTED-COMPONENTS has been run as a preprocessing step, the procedure SAME-COMPONENT answers queries about whether two vertices are in the same component. As usual, G.V and G.E denote the sets of vertices and edges of G.

CONNECTED-COMPONENTS(G)
    for each vertex v in G.V do
        MAKE-SET(v)
    for each edge (u,v) in G.E do
        if FIND-SET(u) != FIND-SET(v) then
            UNION(u,v)

SAME-COMPONENT(u,v)
    if FIND-SET(u) == FIND-SET(v)
        then return TRUE
        else return FALSE

21.2 Linked-list representation of disjoint sets

One way to represent disjoint sets is with a linked list for each set; the first object in the list contains the representative. Each object contains a set member, a pointer to the next object, and a pointer back to the representative. Each list maintains a pointer head to the representative and a pointer tail to the last object in the list. Objects may appear in any order within a list. Figure 21.2a shows two lists, and Figure 21.2b shows their union.

With linked lists, MAKE-SET and FIND-SET are easy and require O(1) time: for MAKE-SET(x) we create a new list whose only object is x; for FIND-SET(x) we just return the back pointer to the representative.
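As a rough illustration of the list layout just described, here is a minimal Python sketch of MAKE-SET and FIND-SET. The class name Node and the field names next, rep, and tail are assumptions made for this example; in particular, keeping the tail pointer on the representative's object is a simplification that stands in for a separate list header with head and tail.

```python
class Node:
    """One list object: a set member plus 'next' and representative pointers."""

    def __init__(self, member):
        self.member = member
        self.next = None   # next object in the list (None at the end)
        self.rep = self    # back pointer to the representative
        # Assumption: the representative's object doubles as the list header,
        # so 'tail' is only meaningful on the representative itself.
        self.tail = self


def make_set(member):
    """MAKE-SET: create a one-object list; O(1)."""
    return Node(member)


def find_set(node):
    """FIND-SET: follow the back pointer to the representative; O(1)."""
    return node.rep


a = make_set("a")
b = make_set("b")
print(find_set(a) is a)            # True
print(find_set(a) is find_set(b))  # False
```

Both operations touch a constant number of fields, which is where the O(1) bounds come from.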
A simple implementation of union

We perform UNION(x,y) by appending x's list to the end of y's list, using y's tail pointer; the representative of the combined list is that of y's list. Unfortunately, we must update the representative pointer of every object in x's list, which takes time linear in the length of x's list.

In fact, there is a sequence of m operations on n objects that needs Theta(n^2) time: suppose we have objects x_1, x_2, ..., x_n and do n MAKE-SET operations followed by n-1 UNION operations as shown in Figure 21.3, so that m = 2n - 1. We spend Theta(n) time on the MAKE-SETs, and

\[ \sum_{i=1}^{n-1} i = \Theta(n^2) \]

time on the UNIONs, since the i-th UNION operation updates i objects. Thus the average cost per operation over the m operations is Theta(n).

A weighted-union heuristic

We can improve on this by using a weighted-union heuristic: also maintain the length of each list, and in the UNION operation append the shorter list to the longer one. A single UNION can still take Omega(n) time if both sets have Omega(n) members, but as the following theorem says, we get better amortized performance.

Theorem 21.1 In the linked-list representation using the weighted-union heuristic, a sequence of m MAKE-SET, UNION, and FIND-SET operations, with n MAKE-SETs, takes O(m + n lg n) time.

Proof: We first bound the number of times an object x's representative pointer can be updated. An update happens only when x is in the smaller of the two sets being united. So the first time x's pointer is updated, the resulting set has at least 2 objects; the second time, it has at least 4 objects; and so on. Hence for any k <= n, after x's pointer has been updated ceiling(lg k) times, the resulting set has at least k objects. Since no set has more than n objects, each object's pointer is updated at most ceiling(lg n) times, so the pointer updates over all UNIONs take O(n lg n) time. The remaining work in each of the at most n-1 UNIONs (updating the tail pointer and the list length) takes O(1) time. Each MAKE-SET and FIND-SET takes O(1) time, and there are O(m) of them, so the total time is O(m + n lg n).

21.3 Disjoint-set forests

In a faster implementation of disjoint sets, we represent sets by rooted trees, with each node containing one member and each tree representing one set.
In a disjoint-set forest, each member points only to its parent (Figure 21.4a). The root of each tree contains the representative and is its own parent. Although the straightforward algorithms using this representation are no faster than those using linked lists, adding two heuristics, "union by rank" and "path compression", yields the asymptotically fastest known disjoint-set data structure.

MAKE-SET simply creates a tree with one node. FIND-SET follows parent pointers until it reaches the root; the nodes visited form the "find path". UNION makes one root point to the other root, as in Figure 21.4b.

Heuristics to improve the running time

The first heuristic, union by rank, is similar to the weighted-union heuristic we used with linked lists. The idea is to make the root of the smaller tree point to the root of the bigger tree. Rather than tracking tree sizes, each node has a rank field that is an upper bound on the height of the node. In union by rank, we make the root with the smaller rank point to the root with the larger rank during a UNION.

The second heuristic, path compression, is applied in FIND-SET: we make each node on the find path point directly to the root, as shown in Figure 21.5. Path compression does not change any ranks.

Pseudocode for disjoint-set forests

To implement the union by rank heuristic, for each node x we maintain a field x.rank, which is an upper bound on x's height (the number of edges on the longest path between x and a descendant leaf). When MAKE-SET creates a singleton set, x.rank is set to 0. As noted above, FIND-SET does not change any ranks. In the UNION operation there are two cases: (1) if the ranks of the two roots are unequal, the root with the smaller rank points to the root with the larger rank and all ranks are unchanged; (2) if the ranks are equal, we arbitrarily choose one of the roots to be the parent and increment its rank (only). As usual, we let x.p point to the parent of x.
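The rank rules and path compression described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the Node class and the dictionary-based connected_components and same_component helpers are illustrative choices made here, and the small example graph is hypothetical. The function names mirror MAKE-SET, FIND-SET, LINK, and UNION.

```python
class Node:
    """A forest node holding one set member."""

    def __init__(self, member):
        self.member = member
        self.p = self    # a root is its own parent
        self.rank = 0    # upper bound on the node's height


def make_set(member):
    return Node(member)


def find_set(x):
    # Path compression: on the way back from the recursion, every node
    # on the find path is re-pointed directly at the root.
    if x.p is not x:
        x.p = find_set(x.p)
    return x.p


def link(x, y):
    # Union by rank; x and y are assumed to be two distinct roots.
    if x.rank > y.rank:
        y.p = x
    else:
        x.p = y
        if x.rank == y.rank:
            y.rank += 1   # equal ranks: the new root's rank grows by one


def union(x, y):
    link(find_set(x), find_set(y))


def connected_components(vertices, edges):
    # Assumption: vertices are hashable labels, mapped to nodes by a dict.
    nodes = {v: make_set(v) for v in vertices}
    for u, v in edges:
        if find_set(nodes[u]) is not find_set(nodes[v]):
            union(nodes[u], nodes[v])
    return nodes


def same_component(nodes, u, v):
    return find_set(nodes[u]) is find_set(nodes[v])


nodes = connected_components("abcdef", [("a", "b"), ("b", "c"), ("d", "e")])
print(same_component(nodes, "a", "c"))  # True
print(same_component(nodes, "a", "d"))  # False
```

The recursive find_set keeps the sketch short; for very deep trees an iterative two-pass loop would avoid Python's recursion limit.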
MAKE-SET(x)
    x.p = x
    x.rank = 0

UNION(x,y)
    LINK(FIND-SET(x), FIND-SET(y))

LINK(x,y)
    if x.rank > y.rank
        then y.p = x
        else x.p = y
             if x.rank == y.rank
                 then y.rank = y.rank + 1

FIND-SET(x)
    if x != x.p
        then x.p = FIND-SET(x.p)
    return x.p

Note that FIND-SET is a two-pass method: on the first pass, during the successive recursive calls, we move up the find path to the root; on the second pass, as the calls return, we make every node on the find path point directly to the root.

Effect of the heuristics on running time

If we use union by rank alone, the running time is O(m lg n) (Exercise 21.4-4) and this bound is tight (Exercise 21.3-3). If we use path compression alone with f FIND-SET calls, the worst-case running time is

Theta(n + f*(1 + log_(2 + f/n)(n))).

If both union by rank and path compression are used, the worst-case running time is O(m alpha(n)), where alpha(n) is a very slowly growing function whose value is at most 4 for any n that arises in practice.

21.4 Analysis of union by rank with path compression

In this section, we define the function alpha and prove the time bound O(m alpha(n)) by using the potential method.

A very quickly growing function and its very slowly growing inverse

For integers k >= 0 and j >= 1 we define the function A_k(j) as follows:

\[
A_k(j) =
\begin{cases}
j + 1 & \text{if } k = 0, \\
A_{k-1}^{(j+1)}(j) & \text{if } k > 0,
\end{cases}
\]

where the bottom expression uses the function-iteration notation of Section 3.2; so we have

\[
A_{k-1}^{(0)}(j) = j
\quad\text{and}\quad
A_{k-1}^{(i)}(j) = A_{k-1}\bigl(A_{k-1}^{(i-1)}(j)\bigr) \text{ for } i > 0.
\]

We call k the level of the function A.

Lemma 21.2 For j >= 1, A_1(j) = 2*j + 1.

Lemma 21.3 For j >= 1, A_2(j) = 2^(j+1) * (j+1) - 1.

We now compute A_k(1) for k = 0,...,4: A_0(1) = 1 + 1 = 2 (by definition), A_1(1) = 2*1 + 1 = 3 by Lemma 21.2, and A_2(1) = 2^2 * (1+1) - 1 = 7 by Lemma 21.3. For the next two levels,

\[
\begin{aligned}
A_3(1) &= A_2^{(2)}(1) = A_2(A_2(1)) = A_2(7) \\
       &= 2^{8} \cdot 8 - 1 = 2^{11} - 1 = 2047, \\
A_4(1) &= A_3^{(2)}(1) = A_3(A_3(1)) = A_3(2047) \\
       &= A_2^{(2048)}(2047) \\
       &\gg A_2(2047) = 2^{2048} \cdot 2048 - 1 \\
       &> 2^{2048} > 2^{1000} > 10^{300}.
\end{aligned}
\]

Now

\[
\alpha(n) = \min\{k : A_k(1) \ge n\}.
\]

So alpha(n) = 0 for n = 0, 1, 2; alpha(3) = 1; alpha(n) = 2 for n = 4,...,7; alpha(n) = 3 for 7 < n < 2048; and alpha(n) = 4 for 2047 < n <= A_4(1).
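The small values above can be checked numerically. The following Python sketch evaluates A_k(j) directly from the recursive definition and derives alpha(n) from the values A_0(1), ..., A_3(1). The function names A and alpha and the precomputed list are assumptions made for this example, and alpha simply returns 4 for any larger n, which is valid only under the assumption n <= A_4(1); A_4(1) itself is far too large to evaluate this way.

```python
def A(k, j):
    """A_k(j) from the recursive definition; feasible only for small k and j."""
    if k == 0:
        return j + 1
    result = j
    for _ in range(j + 1):          # apply A_{k-1} a total of j + 1 times
        result = A(k - 1, result)
    return result


# A_k(1) for the levels that can be evaluated directly.
SMALL_LEVELS = [A(k, 1) for k in range(4)]


def alpha(n):
    """min{k : A_k(1) >= n}, assuming n <= A_4(1)."""
    for k, value in enumerate(SMALL_LEVELS):
        if value >= n:
            return k
    return 4


print(SMALL_LEVELS)                          # [2, 3, 7, 2047]
print([alpha(n) for n in (2, 3, 7, 8, 2048)])  # [0, 1, 2, 3, 4]
```

The check A(2, 7) == 2047 also agrees with Lemma 21.3, since 2^8 * 8 - 1 = 2047.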