Part II  Sorting and Order Statistics

II.1  Introduction

This part gives several algorithms that solve the sorting problem:

    Input:  A sequence of n numbers: a_1, a_2, ..., a_n.
    Output: A permutation (reordering) a_1', a_2', ..., a_n' of the
            input sequence such that a_1' <= a_2' <= ... <= a_n'.

The structure of the data

In the real world, each number, called the "key", is part of a record
that contains other "satellite data". Often the satellite data is
carried around with the key when the records are permuted, but if the
satellite data is large, usually only a pointer to it is carried with
the key (or just the pointers are permuted). In this course, we are
interested only in the sorting process, not these implementation
details, which are usually straightforward.

II.2  Why sorting?

Sorting is considered to be fundamental since

- Sometimes it is required by the application, for example sorting
  checks for a statement.
- Some algorithms use sorting as a subroutine.
- Due to the large variety of sorting algorithms, they have provided
  examples of many methods of algorithm design over the years, and so
  have historical interest as well.
- There is a theoretical lower bound, which we can match, and we can
  usually prove good upper bounds for sorting algorithms. The lower
  bound can often be used to establish lower bounds on algorithms that
  use sorting.
- Many software engineering issues arise when implementing sorting
  algorithms. Obtaining the best performance often depends more on
  choosing the best algorithm for the situation than on "tweaking" the
  code.

II.3  Sorting algorithms

In Chapter 2, we saw insertion sort, which has Theta(n^2) worst (and
average) case running time, and merge sort, which does not sort "in
place" (recall that an algorithm sorts in place if only a constant
number of elements are ever stored outside the input array). In
Chapter 6, we introduce heapsort, which sorts in place in O(n lg n)
time. It uses an important data structure called a heap, which can
also be used to implement a priority queue.
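The key/satellite-data arrangement described above can be sketched in Python. This is an illustration I am adding, not part of the text; the record contents and field names are invented:

```python
# Records: each has a "key" plus (possibly large) satellite data.
records = [
    {"key": 31, "data": "check #31 ..."},
    {"key": 7,  "data": "check #7 ..."},
    {"key": 19, "data": "check #19 ..."},
]

# Option 1: permute whole records along with their keys.
by_key = sorted(records, key=lambda r: r["key"])

# Option 2: satellite data is large, so permute only "pointers"
# (here, indices into the records list).
order = sorted(range(len(records)), key=lambda i: records[i]["key"])

print([r["key"] for r in by_key])  # keys in nondecreasing order
print(order)                       # the index permutation
```

Either way the keys end up in sorted order; option 2 never moves the satellite data at all.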
In Chapter 7, we study quicksort, which also sorts in place; it has
Theta(n^2) worst-case running time, but Theta(n lg n) average-case
running time and a smaller constant factor than heapsort.

The sorting algorithms above are "comparison sorts": they determine
the sorted order only by comparing elements. Chapter 8 shows that such
sorts have a lower bound of Omega(n lg n) on the worst-case running
time. Chapter 8 also presents three algorithms for sorting in linear
time -- if their input data satisfies special conditions: counting
sort, radix sort, and bucket sort.

II.4

    Algorithm        Worst-case       Average-case
                     running time     running time
    ---------------  ---------------  ---------------
    Insertion sort   Theta(n^2)       Theta(n^2)
    Merge sort       Theta(n lg n)    Theta(n lg n)
    Heapsort         O(n lg n)        O(n lg n) ??
    Quicksort        Theta(n^2)       Theta(n lg n)
    Counting sort    Theta(k + n)     Theta(k + n)
    Radix sort       Theta(d(k + n))  Theta(d(k + n))
    Bucket sort      Theta(n^2)       Theta(n)

The i-th order statistic of a set of n numbers is the i-th smallest
number in that set. One way to obtain the i-th order statistic is to
sort the numbers in O(n lg n) time and extract the i-th one. Chapter 9
presents two algorithms: one that finds the i-th smallest element in
O(n^2) time in the worst case but Theta(n) on average, and a more
complicated algorithm that runs in O(n) worst-case time.

Background

Average-case analyses often require some mathematical sophistication
and involve probabilistic methods (which are reviewed in Appendix C).
Such analysis is also required for randomized algorithms.

Chapter 6  Heapsort

6.1.1

    Algorithm        Running time   Memory
    ---------------  -------------  ------------------------
    Insertion sort   O(n^2)         sorts in place
    Merge sort       O(n lg n)      requires auxiliary array
    Heapsort         O(n lg n)      sorts in place

In addition to sorting, a heap can be used to implement priority
queues and is useful in other data structures. The term "heap" was
first used with this meaning; its later use for the pool of
allocatable storage ("heap memory") is entirely different.

6.1  Heaps

Q: What is a (binary) heap?
A: A "nearly complete" binary tree: every level is full except
   possibly the bottom level, which is filled from the left.

                  16
                /    \
              14      10
             /  \    /  \
            8    7  9    3
           / \  /
          2   4 1

6.1.2

This (max-)heap has the property that, for every node other than a
leaf, its value is greater than or equal to the values of its
children. A heap can be implemented as an array A:

      +--+--+--+--+--+--+--+--+--+--+
    A |16|14|10| 8| 7| 9| 3| 2| 4| 1|
      +--+--+--+--+--+--+--+--+--+--+
        1  2  3  4  5  6  7  8  9 10

where nodes are identified by array indexes i. The array has two
attributes: A.length, the size of the array, and A.heap-size, the
number of elements that are actually part of the heap, so

    A.heap-size <= A.length

The root is A[1], and given an index i of a node, the parent, left
child, and right child can easily be found by the following functions:

    PARENT(i)
        return floor(i/2)

    LEFT(i)
        return 2i

    RIGHT(i)
        return 2i + 1

Note: these can be implemented by bit shifts.

6.1.3

Example: if i = 4 = 0100, then

    LEFT(i)   => 8 = 1000   (shift left one bit)
    RIGHT(i)  => 9 = 1001   (shift left one bit and "bit-or" 1)
    PARENT(i) => 2 = 0010   (shift right one bit)

There are two kinds of heaps, max-heaps and min-heaps, each satisfying
a heap property. The max-heap property:

    A[PARENT(i)] >= A[i],  for all i except the root

so that the largest element is in the root. Similarly, we have the
min-heap property:

    A[PARENT(i)] <= A[i],  for all i except the root

with the smallest element in the root.

The height of a node is the number of edges on the longest path
downward from the node to a leaf; the height of the heap is the height
of its root. Since a heap of n elements is based on a nearly complete
binary tree, its height h is Theta(lg n) (Exercise 6.1-2). We shall
see that the basic heap operations run in time at most proportional to
h, and thus in O(lg n) time.

6.1.4

A useful utility routine is MAX-HEAPIFY, which runs in O(lg n) time
and maintains the max-heap property.

The five basic operations on a max-heap are:

BUILD-MAX-HEAP, which runs in Theta(n) time and produces a max-heap
from an unordered array.
MAX-HEAP-INSERT, which runs in O(lg n) time and inserts an element
into a max-heap.

HEAP-EXTRACT-MAX, which runs in O(lg n) time and removes and returns
the maximum element.

HEAP-MAXIMUM, which runs in Theta(1) time and just returns the maximum
element.

HEAP-INCREASE-KEY, which runs in O(lg n) time and increases the key
value of a node.

The last four allow the heap to be used as a priority queue.

Also, a heap can be used to sort an array by using the procedure
HEAPSORT, which runs in O(n lg n) time.

6.2  Maintaining the heap property

6.2.1

MAX-HEAPIFY(A,i) assumes the binary trees rooted at LEFT(i) and
RIGHT(i) are max-heaps, but that A[i] may be smaller than its
children. MAX-HEAPIFY lets the value at A[i] "float down" the heap so
that the subtree rooted at i becomes a max-heap.

MAX-HEAPIFY(A,i)
 1  l = LEFT(i)
 2  r = RIGHT(i)
 3  if l <= A.heap-size and A[l] > A[i]
 4      largest = l
 5  else largest = i
 6  if r <= A.heap-size and A[r] > A[largest]
 7      largest = r
 8  if largest != i
 9      exchange A[i] <-> A[largest]
10      MAX-HEAPIFY(A,largest)

At each step, we determine "largest", the index of the largest of
A[i], A[LEFT(i)], and A[RIGHT(i)]. If A[i] is largest, MAX-HEAPIFY
terminates. Otherwise one of the children is largest, and we swap it
with A[i], which makes node i satisfy the max-heap property. But the
subtree rooted at "largest" may no longer satisfy the max-heap
property, so we recursively call MAX-HEAPIFY on it. Note that
MAX-HEAPIFY is tail-recursive, so it could be implemented as a while
loop.

Figure 6.2 shows how MAX-HEAPIFY works.

6.2.2

(a) The value 4 at index 2 is out of place.

                  16
                /    \
        i = 2 (4)     10
             /  \    /  \
           14    7  9    3
          / \   /
         2   8 1

We fix it with a call to MAX-HEAPIFY(A,2), which determines that 14 is
the largest of 4, 14, and 7, and so swaps 4 and 14, giving:

(b) The value 4 at index 4 is out of place.
                  16
                /    \
              14      10
             /  \    /  \
      i = 4 (4)   7 9    3
           / \   /
          2   8 1

6.2.3

We fix it by calling MAX-HEAPIFY(A,4), which determines that 8 is the
largest of 4, 2, and 8, and so swaps 4 and 8, giving:

(c) The value 4 at index 9 is all right.

                  16
                /    \
              14      10
             /  \    /  \
            8    7  9    3
           / \  /
          2 (4) 1
            i = 9

The conditions of all the if-statements are FALSE, so MAX-HEAPIFY
terminates. Here is the trace:

    MAX-HEAPIFY(A,2)
    |  swap A[2] <-> A[4]
    |  MAX-HEAPIFY(A,4)
    |  |  swap A[4] <-> A[9]
    |  |  MAX-HEAPIFY(A,9)
    |  |  |  largest <- 9
    |  |  |  return
    |  |  return   <- tail recursion optimization can eliminate this
    |  return      <- and this

The running time of MAX-HEAPIFY on a subtree of size n rooted at node
i is Theta(1) to fix up the relationships among A[i], A[LEFT(i)], and
A[RIGHT(i)], plus the time for the recursive call on one of the
subtrees.

6.2.4

The size of such a subtree is at most 2n/3, which occurs when the
bottom level of the tree is exactly half full. So the worst-case
running time T(n) satisfies the recurrence:

    T(n) <= T(2n/3) + Theta(1)            (*)

which, by case 2 of the master theorem (Theorem 4.1), has the solution

    T(n) = O(lg n).

Or we may get the same result by noticing that MAX-HEAPIFY spends
Theta(1) time at each level, it visits at most O(h) levels, and
h = Theta(lg n).

Or use the substitution method and prove by induction that (*) has the
solution T(n) = O(lg n), i.e. that T(n) <= c lg n for some c > 0
(definition of big-O). Write the Theta(1) term of (*) as a constant a.

    Induction hypothesis (ind. hyp.): T(k) <= c lg k, for all k < n

    T(n) <= T(2n/3) + a                given by (*)
         <= c lg(2n/3) + a             by ind. hyp.
         =  c lg(2/3) + c lg n + a     by log property
         =  c lg n + c lg(2/3) + a
         <= c lg n + c(-1/2) + a       since lg(2/3) = -0.58... < -1/2
         =  c lg n - (c/2 - a)

So T(n) <= c lg n if we pick c > 2a.

6.3  Building a heap

6.3.1

We can use BUILD-MAX-HEAP to convert any array A[1..n], where
n = A.length, into a max-heap. We note that the entries in
A[floor(n/2)+1..n] are all leaves (by Exercise 6.1-7) and so are
already 1-element heaps. Then we go through the remaining nodes
applying MAX-HEAPIFY to each.
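As an aside, the MAX-HEAPIFY pseudocode can be sketched in executable Python. This translation is mine, not the text's: it uses 0-based indexing (so LEFT(i) = 2i+1 and RIGHT(i) = 2i+2), and it converts the tail recursion into a while loop, as noted above:

```python
def max_heapify(a, i, heap_size):
    """Float a[i] down until the subtree rooted at i is a max-heap.

    Assumes the subtrees rooted at the children of i are already
    max-heaps. 0-based indexing, unlike the 1-based pseudocode.
    """
    while True:
        l, r = 2 * i + 1, 2 * i + 2          # LEFT(i), RIGHT(i)
        largest = i
        if l < heap_size and a[l] > a[largest]:
            largest = l
        if r < heap_size and a[r] > a[largest]:
            largest = r
        if largest == i:
            return                           # max-heap property holds
        a[i], a[largest] = a[largest], a[i]  # exchange with larger child
        i = largest                          # tail recursion as a loop

# Figure 6.2's example: the 4 at 1-based index 2 (0-based index 1).
a = [16, 4, 10, 14, 7, 9, 3, 2, 8, 1]
max_heapify(a, 1, len(a))
print(a)  # [16, 14, 10, 8, 7, 9, 3, 2, 4, 1]
```

The result is exactly the heap array shown in 6.1.2.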
BUILD-MAX-HEAP(A)
1  A.heap-size = A.length
2  for i = floor(A.length/2) downto 1
3      MAX-HEAPIFY(A,i)

Figure 6.3 shows an example:

      +--+--+--+--+--+--+--+--+--+--+
    A | 4| 1| 3| 2|16| 9|10|14| 8| 7|
      +--+--+--+--+--+--+--+--+--+--+
        1  2  3  4  5  6  7  8  9 10

(a)
                   4
                /     \
               1       3
             /  \     /  \
            2  (16)  9    10
           / \  /  i = 5
         14   8 7

6.3.2

(b)
                   4
                /     \
               1       3
             /  \     /  \
      i = 4 (2)   16 9    10
           / \    /
         14   8  7

(c)
                   4
                /       \
               1   i = 3 (3)
             /  \       /  \
           14    16    9    10
           / \   /
          2   8 7

6.3.3

(d)
                   4
                /     \
         i = 2 (1)     10
             /  \     /  \
           14    16  9    3
           / \   /
          2   8 7

(e)
             i = 1 (4)
                /     \
              16       10
             /  \     /  \
           14    7   9    3
           / \  /
          2   8 1

(f) finished.

6.3.4

                  16
                /    \
              14      10
             /  \    /  \
            8    7  9    3
           / \  /
          2   4 1

To prove that BUILD-MAX-HEAP works correctly, we use the following
loop invariant:

    At the start of each iteration of lines 2-3, each node
    i+1, i+2, ..., n is the root of a max-heap.

Initialization: Prior to the first iteration, i = floor(n/2), and each
node floor(n/2)+1, floor(n/2)+2, ..., n is a leaf and thus the root of
a trivial max-heap.

Maintenance: Each child of node i has a higher index than i, so by the
loop invariant both children are roots of max-heaps. This is exactly
the condition MAX-HEAPIFY(A,i) needs to make node i the root of a
max-heap, and it preserves the property that nodes i+1, i+2, ..., n
are roots of max-heaps. Decrementing i re-establishes the loop
invariant.

6.3.5

Termination: At termination, i = 0, so by the loop invariant each node
1, 2, ..., n is the root of a max-heap -- node 1 in particular.

Q: What is BUILD-MAX-HEAP's running time?

One answer: sum up the cost of MAX-HEAPIFY at each NODE of the heap:

    Total cost = number of calls to MAX-HEAPIFY x cost of MAX-HEAPIFY
              <= n/2 x O(lg n)
               = O(n lg n)

This is correct, but not asymptotically tight.

Better: sum up the cost at each _height_ h, noting that the heap's
height is floor(lg n), that there are at most ceiling(n/2^(h+1)) nodes
at height h, and that the cost of MAX-HEAPIFY at height h is O(h).
So:

                 floor(lg n)
    Total cost <=    Sum     ( number of nodes   x  MAX-HEAPIFY's    )
                    h = 0    ( at height h          cost at height h )

6.3.6

                 floor(lg n)
               =     Sum     ceiling(n / 2^(h+1)) x O(h)

                    h = 0

                  lg n                       inf
              <=  Sum  n O(h) (1/2^h)  <=  O( n Sum h (1/2^h) )
                 h = 0                         h = 0

                  (      1/2     )              inf             x
               = O( n ---------- )       by     Sum k x^k = ---------
                  (   (1-1/2)^2  )             k = 0         (1-x)^2

                                         (Formula A.8, Appendix A)
                                         with x = 1/2

               = O(2n) = O(n)

So BUILD-MAX-HEAP builds a max-heap from an unordered array in
_linear_ time.

Similarly, we can use BUILD-MIN-HEAP to build a min-heap in linear
time. BUILD-MIN-HEAP is the same as BUILD-MAX-HEAP, except that
MAX-HEAPIFY is replaced by MIN-HEAPIFY (Exercise 6.2-2).

6.4  The heapsort algorithm

6.4.1

Heapsort starts by building a max-heap on the array A[1..n], where
n = A.length, by using BUILD-MAX-HEAP. Since A[1] is the largest
element, it can be put in its correct final position by swapping it
with A[n]. Then we "discard" node n from the heap by decrementing
A.heap-size. Then we use MAX-HEAPIFY(A,1) to re-establish the heap
property for A[1..n-1]. We repeat the process on the subarray
A[1..n-1], ending with the two largest elements in A[n-1] and A[n] and
a max-heap in A[1..n-2]. We keep doing this until the heap-size is
decremented to 1, at which point we are done. This can be shown to be
correct using a loop invariant (Exercise 6.4-2).

HEAPSORT(A)
1  BUILD-MAX-HEAP(A)               // cost O(n)
2  for i = A.length downto 2       // n-1 times
3      exchange A[1] <-> A[i]      // O(1)
4      A.heap-size = A.heap-size - 1   // O(1)
5      MAX-HEAPIFY(A,1)            // O(lg n)

    Total cost = O(n) + (n-1)( 2 O(1) + O(lg n) )
              <= O(n) + n O(lg n)
               = O(n lg n)

Figure 6.4  The operation of HEAPSORT.
6.4.2

The array after it has been heapified by the call to BUILD-MAX-HEAP:

      +--+--+--+--+--+--+--+--+--+--+
    A |16|14|10| 8| 7| 9| 3| 2| 4| 1|
      +--+--+--+--+--+--+--+--+--+--+
        1  2  3  4  5  6  7  8  9 10

(a) original heap

                  16
                /    \
              14      10
             /  \    /  \
            8    7  9    3
           / \  /
          2   4 1

(b) after line 5 when i = 10

                  14
                /    \
               8      10
             /  \    /  \
            4    7  9    3
           / \
          2   1 (16)
                 i = 10

6.4.3

(c) after line 5 when i = 9

                  10
                /    \
               8      9
             /  \    / \
            4    7  1   3
           /
          2     (14) (16)
                i = 9

(d) after line 5 when i = 8

                   9
                /    \
               8      3
             /  \    / \
            4    7  1   2

          (10) (14) (16)
          i = 8

6.4.4

(e) after line 5 when i = 7

                   8
                /    \
               7      3
             /  \    /
            4    2  1

          (9) (10) (14) (16)
          i = 7

(f) after line 5 when i = 6

                   7
                 /   \
                4     3
               / \
              1   2

          (8) (9) (10) (14) (16)
          i = 6

(g) after line 5 when i = 5

                   4
                 /   \
                2     3
               /
              1

          (7) (8) (9) (10) (14) (16)
          i = 5

6.4.5

(h) after line 5 when i = 4

                   3
                 /   \
                2     1

          (4) (7) (8) (9) (10) (14) (16)
          i = 4

(i) after line 5 when i = 3

                   2
                 /
                1

          (3) (4) (7) (8) (9) (10) (14) (16)
          i = 3

(j) after line 5 when i = 2

                   1

          (2) (3) (4) (7) (8) (9) (10) (14) (16)
          i = 2

(k) final array:

      +--+--+--+--+--+--+--+--+--+--+
    A | 1| 2| 3| 4| 7| 8| 9|10|14|16|
      +--+--+--+--+--+--+--+--+--+--+
        1  2  3  4  5  6  7  8  9 10

6.5  Priority queues

6.5.1

Binary heaps make efficient priority queues. There are two kinds of
priority queues: max-priority queues and min-priority queues, which
are based on max-heaps and min-heaps respectively.

A priority queue is a data structure for maintaining a set S of
elements, each with an associated key value. A max-priority queue
supports the following operations:

INSERT(S,x) inserts the element x into the set S, which we can write
S <- S union {x}.

MAXIMUM(S) returns the element in S with the largest key.

EXTRACT-MAX(S) returns the element in S with the largest key and
removes it.

INCREASE-KEY(S,x,k) increases the value of x's key to k (assumed to be
>= x's current key).
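The four operations above map directly onto Python's standard heapq module, which provides a binary min-heap; a max-priority queue can be simulated by negating the keys. This is a sketch I am adding for illustration, not part of the text:

```python
import heapq

# heapq is a MIN-heap, so store negated keys for max-priority behavior.
pq = []
for key in [4, 1, 3, 16, 9]:
    heapq.heappush(pq, -key)      # INSERT, O(lg n)

maximum = -pq[0]                  # MAXIMUM, Theta(1): peek at the root
largest = -heapq.heappop(pq)      # EXTRACT-MAX, O(lg n)
print(maximum, largest)           # 16 16
```

heapq has no direct INCREASE-KEY; a common workaround is to push a new entry and lazily discard stale ones on extraction.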
One application: job scheduling on a shared computer. When a job is
interrupted or finishes, the highest-priority pending job is obtained
using EXTRACT-MAX. New jobs are added, or suspended jobs put back,
using INSERT.

6.5.2

A min-priority queue supports the operations INSERT, MINIMUM,
EXTRACT-MIN, and DECREASE-KEY. It can be used in an event-driven
simulator, where each element is an event and its key value is the
time at which the event is to occur. See Exercise 6.5-3.

We can use a heap to implement a priority queue. In an application, it
is often necessary to know which application object corresponds to a
given priority-queue element, and vice versa. So we store a handle (a
pointer or an index, depending on the application) to the application
object in each heap element. Similarly, the application object has a
handle (e.g. an array index) back into the heap, which needs to be
updated whenever a heap operation moves the element. We ignore these
details below.

The procedure HEAP-MAXIMUM implements MAXIMUM in Theta(1) time:

HEAP-MAXIMUM(A)
1  return A[1]

HEAP-EXTRACT-MAX implements EXTRACT-MAX:

HEAP-EXTRACT-MAX(A)
1  if A.heap-size < 1
2      error "heap underflow"
3  max = A[1]
4  A[1] = A[A.heap-size]
5  A.heap-size = A.heap-size - 1
6  MAX-HEAPIFY(A,1)
7  return max

6.5.3

HEAP-EXTRACT-MAX runs in O(lg n) time, since it performs only a
constant amount of work on top of MAX-HEAPIFY, which takes O(lg n)
time.

HEAP-INCREASE-KEY(A,i,key) first updates the key of A[i] to "key".
This may violate the max-heap property, so we push A[i] up the heap to
its proper place by repeatedly exchanging it with its parent --
similar to the insertion loop of INSERTION-SORT.

HEAP-INCREASE-KEY(A,i,key)
1  if key < A[i]
2      error "new key < current key"
3  A[i] = key
4  while i > 1 and A[PARENT(i)] < A[i]
5      exchange A[i] <-> A[PARENT(i)]
6      i = PARENT(i)

HEAP-INCREASE-KEY runs in O(lg n) time, since the path traced from
node i to the root has length O(lg n) in an n-element heap.
It can be proved to be correct by using a loop invariant (Exercise
6.5-5). Figure 6.5 shows an example of its operation:

6.5.4

(a) original heap, i = 9

                  16
                /    \
              14      10
             /  \    /  \
            8    7  9    3
           / \  /
          2 (4) 1

(b) heap with the key increased

                  16
                /    \
              14      10
             /  \    /  \
            8    7  9    3
           / \   /
          2 (15) 1
            i = 9

6.5.5

(c) after one iteration of the while loop

                  16
                /    \
              14      10
             /  \    /  \
     i = 4 (15)   7 9    3
           / \   /
          2   8 1

(d) after a second iteration of the while loop

                  16
                /    \
       i = 2 (15)     10
             /  \    /  \
           14    7  9    3
           / \  /
          2   8 1

At this point the loop terminates, since A[PARENT(i)] >= A[i].

6.5.6

MAX-HEAP-INSERT(A,key) implements the INSERT operation. It expands the
heap by one element and sets its key value to -infinity. Then it makes
a call to HEAP-INCREASE-KEY to set the key value to "key" and restore
the max-heap property.

MAX-HEAP-INSERT(A,key)
1  A.heap-size = A.heap-size + 1
2  A[A.heap-size] = -infinity
3  HEAP-INCREASE-KEY(A, A.heap-size, key)

The running time of MAX-HEAP-INSERT is O(lg n), since that is the
running time of HEAP-INCREASE-KEY and only a constant number of other
operations are performed.

Note: one could implement a stack or an ordinary FIFO queue with a
priority queue (see Exercise 6.5-7).

Variations on max-heaps:

1. A min-priority queue based on a min-heap.
2. One could use a d-ary heap instead of a binary heap (Problem 6-2).
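The chapter's max-priority-queue operations can be collected into one self-contained Python sketch. This is my translation of the pseudocode, using 0-based indexing (PARENT(i) = (i-1)//2) and pythonized method names, not the text's code:

```python
class MaxPriorityQueue:
    """Array-based max-heap: MAXIMUM, INSERT, INCREASE-KEY, EXTRACT-MAX."""

    def __init__(self):
        self.a = []

    def maximum(self):                  # HEAP-MAXIMUM, Theta(1)
        return self.a[0]

    def insert(self, key):              # MAX-HEAP-INSERT, O(lg n)
        self.a.append(float("-inf"))    # new node with key -infinity
        self.increase_key(len(self.a) - 1, key)

    def increase_key(self, i, key):     # HEAP-INCREASE-KEY, O(lg n)
        if key < self.a[i]:
            raise ValueError("new key < current key")
        self.a[i] = key
        while i > 0 and self.a[(i - 1) // 2] < self.a[i]:
            p = (i - 1) // 2            # PARENT(i), 0-based
            self.a[p], self.a[i] = self.a[i], self.a[p]
            i = p

    def extract_max(self):              # HEAP-EXTRACT-MAX, O(lg n)
        if not self.a:
            raise IndexError("heap underflow")
        m, last = self.a[0], self.a.pop()
        if self.a:
            self.a[0] = last
            i = 0                       # MAX-HEAPIFY(A,1), iteratively
            while True:
                l, r, largest = 2 * i + 1, 2 * i + 2, i
                if l < len(self.a) and self.a[l] > self.a[largest]:
                    largest = l
                if r < len(self.a) and self.a[r] > self.a[largest]:
                    largest = r
                if largest == i:
                    break
                self.a[i], self.a[largest] = self.a[largest], self.a[i]
                i = largest
        return m

pq = MaxPriorityQueue()
for k in [4, 1, 3, 2, 16, 9, 10, 14, 8, 7]:   # Figure 6.3's input
    pq.insert(k)
out = [pq.extract_max() for _ in range(10)]
print(out)  # [16, 14, 10, 9, 8, 7, 4, 3, 2, 1]
```

Repeated extract_max yields the keys in decreasing order, which is exactly the idea behind HEAPSORT.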