Chapter 11  Hash Tables

11.0.1 Some dynamic set applications, such as the symbol table of a compiler, require only the dictionary operations: SEARCH, INSERT, and DELETE. Although a hash table can have O(n) worst-case behavior for these operations, it is usually possible to obtain O(1) expected times. A hash table generalizes an array: direct addressing, covered in Section 11.1, allows Theta(1) access to the table, but requires one array position for every possible key. Often the number of actual keys is quite a bit smaller than the number of possible keys. In this case, a "hash function" is used to compute an index into a smaller array from the value of the key. This is discussed in Section 11.2, where "chaining" is used to resolve "collisions". Section 11.3 discusses hash functions. Section 11.4 discusses another technique for handling collisions, namely "open addressing". With a well-chosen hash function, each of these methods can give O(1) average-case run time. If the hash table is static (no inserts or deletes), Section 11.5 shows how to obtain O(1) worst-case run time by using "perfect hashing".

11.1 Direct-address tables

11.1.1 If the set of keys is drawn from a relatively small "universe" U = {0, 1, ..., m-1}, and no two elements have the same key, we can use a "direct-address table" T[0..m-1], in which each position, or "slot", T[k] corresponds to the key k in U. If there is no element with key k, then T[k] = NIL. Figure 11.1, page 223, shows an example. We can implement the dictionary operations as:

DIRECT-ADDRESS-SEARCH(T,k)
    return T[k]

DIRECT-ADDRESS-INSERT(T,x)
    T[key[x]] <- x

DIRECT-ADDRESS-DELETE(T,x)
    T[key[x]] <- NIL

each of which takes only O(1) time. We can either store all the satellite data in T, or store a key and a pointer to the satellite data. We can even omit storing the key, since it is just the array index, which we know. However, we then need some way to tell whether a slot is empty.
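The three operations above can be sketched in Python (a minimal illustration, not from the text; the class and Item type are made up, and keys are assumed to be integers in range(m)):

```python
class Item:
    """An element with a key and optional satellite data."""
    def __init__(self, key, data=None):
        self.key = key
        self.data = data

class DirectAddressTable:
    """Direct-address table over the key universe {0, 1, ..., m-1}."""
    def __init__(self, m):
        self.slots = [None] * m      # slot k holds the element with key k, or None

    def search(self, k):
        return self.slots[k]         # O(1)

    def insert(self, x):
        self.slots[x.key] = x        # O(1); x.key must lie in range(m)

    def delete(self, x):
        self.slots[x.key] = None     # O(1)
```

Here None plays the role of NIL, which also answers the "is the slot empty?" question when the key itself is not stored.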
11.2 Hash tables

11.2.1 Problems with direct addressing:

1) storing an array T of size |U| may be impractical
2) the set of keys actually used may be small compared to |U|, so most of the space is wasted

Cure: use a "hash function" h that maps each key k into the range 0..m-1, where m is smaller than |U|. That is, h maps U into the slots of a "hash table" T[0..m-1]:

    h : U --> {0, 1, ..., m-1}

We say an element with key k "hashes" to slot h(k), and h(k) is the "hash value" of k. Storage needs are thus reduced from Theta(|U|) to Theta(m). The run times of the operations are Theta(1) on average, but can be Theta(n) in the worst case. The complication is the need to handle "collisions", which occur when two keys hash to the same slot. Figure 11.2 shows an example. We would like to avoid collisions by choosing a good hash function, perhaps a "random"-looking one (which captures the random mixing suggested by the verb "to hash"). But collisions cannot be avoided entirely, since |U| > m, so we need a way of resolving them. The remainder of this section presents "chaining" as one such method; another, "open addressing", is discussed in Section 11.4.

Collision resolution by chaining

11.2.2 In "chaining", collisions are resolved by putting all elements that hash to the same slot j into a linked list T[j], as shown in Figure 11.3 (page 225). The dictionary operations then become:

CHAINED-HASH-SEARCH(T,k)
    search for an element with key k in the list T[h(k)]

CHAINED-HASH-INSERT(T,x)
    insert x at the head of the list T[h(key[x])]

CHAINED-HASH-DELETE(T,x)
    delete x from the list T[h(key[x])]

The worst-case run time of a search is O(n): if there are n items in the table, they could all hash to the same slot. The worst-case run time of insertion is O(1) (unless we first search the slot for duplicates). The worst-case run time of deletion is also O(1) if the lists are doubly linked, since we are given a pointer to the element x.
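A chained hash table can be sketched as follows (an illustrative sketch, not the text's code; keys are stored directly with no satellite data, Python lists stand in for linked lists, and the division-method hash of Section 11.3.1 is assumed):

```python
class ChainedHashTable:
    """Hash table with collision resolution by chaining.
    Each slot holds a Python list used as the chain (head = index 0)."""
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, k):                 # division-method hash, h(k) = k mod m
        return k % self.m

    def insert(self, k):             # O(1): insert at the head of the chain
        self.slots[self._h(k)].insert(0, k)

    def search(self, k):             # Theta(1 + alpha) expected
        chain = self.slots[self._h(k)]
        return k if k in chain else None

    def delete(self, k):             # O(chain length) here; a real doubly
        self.slots[self._h(k)].remove(k)   # linked list with a node pointer gives O(1)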
Analysis of hashing with chaining

11.2.3 As mentioned above, hash tables can have bad worst-case performance, but good average-case run times, depending on how evenly the hash function distributes keys. To analyze the average case, we make the assumption of "simple uniform hashing": any element is equally likely to hash into any of the m slots. For the analysis, we define the "load factor" alpha as n/m, where n is the number of elements and m the number of slots in the hash table T. Note that alpha can be less than, equal to, or greater than 1. For j = 0, 1, ..., m-1, we let n_j denote the length of the list T[j], so that n = n_0 + n_1 + ... + n_(m-1), and the average value of n_j is E[n_j] = alpha = n/m. We compute the expected number of elements examined for both a successful and an unsuccessful search. We note that there is an O(1) cost for computing h(k).

Theorem 11.1  In a hash table in which collisions are resolved by chaining, an unsuccessful search takes expected time Theta(1 + alpha), under the assumption of simple uniform hashing.

Proof: Any key k not already in the table is equally likely to hash to any of the m slots. An unsuccessful search examines every element of the list T[h(k)], whose expected length is E[n_(h(k))] = alpha. Thus the total running time is Theta(1 + alpha), accounting for the cost of computing h(k) and searching a list of expected length alpha.

11.2.4 For a successful search, the probability that a list is searched is proportional to its length. Nevertheless, Theta(1 + alpha) is still the expected time.

Theorem 11.2  In a hash table in which collisions are resolved by chaining, a successful search takes expected time Theta(1 + alpha), on average, under the simple uniform hashing assumption.

Proof: We assume that the target x is equally likely to be any of the n elements stored in T. The number of elements examined is one more (for x itself) than the number of elements that appear before x in x's list; because insertion is at the head, those are exactly the elements that were inserted after x.
So, to find the expected number of elements examined, we average, over the n elements x in T, 1 plus the expected number of elements added to x's list after x was added to it. Let x_i denote the i-th element inserted into T, and let k_i = key[x_i], for i = 1, 2, ..., n. For keys k_i and k_j, we define the indicator random variable X_ij = I{h(k_i) = h(k_j)}. Assuming simple uniform hashing, Pr{h(k_i) = h(k_j)} = 1/m, so by Lemma 5.1 (page 95), E[X_ij] = 1/m as well.

11.2.5 Thus, the expected number of elements examined in a successful search is:

  E[ (1/n) * Sum_{i=1}^{n} ( 1 + Sum_{j=i+1}^{n} X_ij ) ]

    = (1/n) * Sum_{i=1}^{n} ( 1 + Sum_{j=i+1}^{n} E[X_ij] )    by linearity of expectation

    = (1/n) * Sum_{i=1}^{n} ( 1 + Sum_{j=i+1}^{n} 1/m )

    = 1 + 1/(nm) * Sum_{i=1}^{n} ( n - i )

    = 1 + 1/(nm) * n(n-1)/2

    = 1 + (n-1)/(2m)

    = 1 + alpha/2 - alpha/(2n)

So the total time for a successful search is:

  Theta(2 + alpha/2 - alpha/(2n)) = Theta(1 + alpha)

So, if m is proportional to n, we have n = O(m), and thus alpha = n/m = O(m)/m = O(1), and therefore searching takes O(1) time on average. Recalling that insertion takes O(1) worst-case time and deletion takes O(1) time for doubly linked lists, we see that all the dictionary operations can be supported in O(1) time on average.

11.3 Hash functions

11.3.1 We first discuss what makes a good hash function, then we list three methods for creating them: hashing by division, hashing by multiplication, and universal hashing, which gives provably good results via randomization.

What makes a good hash function? The ideal goal is simple uniform hashing, but we usually do not know the distribution of the keys, and the keys may not be drawn independently. As an example, if the keys are known to be random real numbers k in the range [0,1), the hash function h(k) = floor(k*m) satisfies the condition of simple uniform hashing. A good strategy is to make the hash value independent of any patterns in the data. E.g., the division method gives good results if we divide by a prime unrelated to the data.
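For the uniform-[0,1) example above, the hash function h(k) = floor(k*m) is a one-liner (an illustrative snippet; the function name is made up):

```python
import math

def h_uniform(k, m):
    """For keys k uniformly distributed in [0,1), h(k) = floor(k*m)
    satisfies simple uniform hashing: each of the m slots is hit
    with probability exactly 1/m."""
    return math.floor(k * m)
```

Each slot j receives exactly the keys in [j/m, (j+1)/m), an interval of length 1/m, which is why the uniform-key assumption makes this hash simply uniform.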
Heuristic methods, such as hashing by division and hashing by multiplication, can give good hash functions. Information about the keys can be useful. For example, the "nearby" keys num and nums may occur in the symbol table of a program that is being compiled -- it would be useful if they hashed to different places, preferably far apart if we are using linear probing in open addressing. Universal hashing often provides such hash-value "spreading".

Interpreting keys as natural numbers

11.3.2 Most hash functions assume that the universe of keys is the set N = {0, 1, 2, ...} of natural numbers. So if keys are not natural numbers, we must find a way to interpret them as natural numbers. As an example, character strings can be interpreted as (large) integers in base 128: num expressed as a radix-128 integer is 110*128^2 + 117*128 + 109 = 1817325, since the ASCII codes for n, u, and m are 110, 117, and 109, respectively. So, in the following, we assume the keys are integers.

11.3.1 The division method

In the "division method", we define the hash function by h(k) = k mod m, which is quite fast since it requires just one division. With the division method, it is best to avoid certain values of m. E.g., if m = 2^p, then h(k) is just the p lowest-order bits of k, which is fine if we know all such low-order bit patterns are equally likely, but not a good choice if we want the hash value to depend on all the bits of k. Also, choosing m = 2^p - 1 may not be good if k is a character string (Exercise 11.3-3). As an example, if we want a table to hold about 2000 strings, allowing an average of 3 strings per slot, then m = 701 is a good choice, since it is a prime near 2000/3 and not near a power of 2.

11.3.2 The multiplication method

11.3.3 The multiplication method takes two steps: first multiply k by a constant A in the range 0 < A < 1 and extract the fractional part of kA; then multiply this value by m and take the floor of the result.
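The radix-128 interpretation and the division method together can be sketched as (an illustrative sketch; the function names are made up):

```python
def string_to_radix128(s):
    """Interpret an ASCII string as a radix-128 integer (Horner's rule)."""
    k = 0
    for ch in s:
        k = 128 * k + ord(ch)    # ord gives the ASCII code of ch
    return k

def h_division(k, m=701):
    """Division-method hash: h(k) = k mod m, with m a prime
    not near a power of 2 (701 as in the text's example)."""
    return k % m
```

For example, string_to_radix128("num") reproduces the value 1817325 computed in the text.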
So the hash function is:

    h(k) = floor( m * ( kA mod 1 ) )

where "kA mod 1" is the fractional part of kA, i.e. kA - floor(kA). An advantage of this method is that the value of m is not critical. We can choose m = 2^p and proceed as follows. Assume the word size of the computer is w bits and that k fits in a word. We choose A to be a fraction of the form s/2^w, where s is an integer with 0 < s < 2^w. First multiply k by the w-bit integer s = A*2^w, giving the 2w-bit integer r1*2^w + r0, where r1 and r0 are the high and low words of the product. The desired p-bit hash value consists of the p most significant bits of r0. Figure 11.4 on page 232 illustrates this process. Some choices of A work better than others, possibly depending on the data. Knuth suggests that A close to (sqrt(5)-1)/2 = 1/phi, where phi is the golden ratio, often works well. Page 232 has an example with k = 123456, p = 14, and w = 32, where A = 2654435769/2^32 is the binary fraction closest to (sqrt(5)-1)/2.

11.3.3 Universal hashing (skipped)

11.4 Open addressing

11.4.1 In "open addressing", all the elements are stored in the table itself, so the load factor satisfies alpha <= 1 and the table T can fill up. We could store the pointers of a linked list in the table, but to save space we instead compute the sequence of slots to examine, or "probe". The sequence of slots examined is called the "probe sequence", and it depends on the key being considered. The hash function now takes the "probe number" 0, 1, ..., m-1 as a second argument:

    h : U x {0,1,...,m-1} --> {0,1,...,m-1}

where we require that for every key k, the probe sequence <h(k,0), h(k,1), ..., h(k,m-1)> be a permutation of <0,1,...,m-1>, so that every slot is eventually considered as the table fills up. We assume the elements in T are keys with no satellite data, so each slot contains either a key or NIL.
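The word-level computation described above can be sketched in Python, using the text's example parameters (k = 123456, p = 14, w = 32, s = 2654435769); this is an illustrative sketch, with Python's big integers standing in for machine words:

```python
def h_multiplication(k, p=14, w=32, s=2654435769):
    """Multiplication-method hash for m = 2**p on a w-bit word machine.
    A = s / 2**w, so k*s is the 2w-bit integer r1*2**w + r0; the hash
    is the p most significant bits of the low word r0 (Figure 11.4)."""
    r0 = (k * s) % (2 ** w)        # low w bits of the 2w-bit product
    return r0 >> (w - p)           # top p bits of r0
```

Note that no division is needed: one multiplication, a mask, and a shift suffice, which is the practical appeal of choosing m = 2^p.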
Here is the insertion algorithm:

HASH-INSERT(T,k)
1  i <- 0
2  repeat j <- h(k,i)
3         if T[j] = NIL
4            then T[j] <- k
5                 return j
6            else i <- i + 1
7  until i = m
8  error "hash table overflow"

11.4.2 The search algorithm uses the same probe sequence as insertion, so if it finds a NIL slot, the key is not in the table (assuming no keys are deleted from the table).

HASH-SEARCH(T,k)
1  i <- 0
2  repeat j <- h(k,i)
3         if T[j] = k
4            then return j
5         i <- i + 1
6  until T[j] = NIL or i = m
7  return NIL

Deletion is tricky: if we simply set slot j to NIL, we could no longer find any key whose insertion had previously probed j and found it occupied. One solution is to mark slot j as DELETED instead of NIL; then HASH-SEARCH still works, and HASH-INSERT can be modified to insert into DELETED slots as well. However, search times can then become longer and no longer depend only on alpha, so chaining is usually used when hashing with deletions.

For the run-time analysis, we make the assumption of "uniform hashing": each key is equally likely to have any of the m! permutations of <0,1,...,m-1> as its probe sequence. This generalizes simple uniform hashing, in which just one number was produced. Double hashing, defined below, gives a good approximation.

11.4.3 We look at three common methods to compute probe sequences: linear probing, quadratic probing, and double hashing. The first two yield only m probe sequences; the third yields m^2 sequences and the best results.

Linear probing

Let h' : U --> {0,1,...,m-1} be an ordinary hash function, the "auxiliary hash function". "Linear probing" then uses the hash function:

    h(k,i) = (h'(k) + i) mod m,  for i = 0,1,...,m-1.

I.e., the slots probed are T[h'(k)], T[h'(k)+1], T[h'(k)+2], ..., T[m-1], T[0], T[1], ..., T[h'(k)-1]. The whole sequence is determined by the first slot T[h'(k)], so there are only m such sequences. Linear probing is prone to the problem of "primary clustering" -- long runs of occupied slots.
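HASH-INSERT and HASH-SEARCH translate directly to Python (an illustrative sketch; None plays the role of NIL, and the probe function h(k, i) is passed in as a parameter):

```python
def hash_insert(T, k, h):
    """HASH-INSERT: probe slots h(k,0), h(k,1), ... until a free one."""
    m = len(T)
    for i in range(m):
        j = h(k, i)
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def hash_search(T, k, h):
    """HASH-SEARCH: follow the same probe sequence; a None slot
    means the key is absent (assuming no deletions)."""
    m = len(T)
    for i in range(m):
        j = h(k, i)
        if T[j] == k:
            return j
        if T[j] is None:
            return None
    return None

def make_linear_probe(m):
    """Linear probing with auxiliary hash h'(k) = k mod m."""
    def h(k, i):
        return (k % m + i) % m
    return h
```

Passing h as a parameter lets the same insert/search code run with linear probing, quadratic probing, or double hashing.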
This happens because an empty slot preceded by i full slots is the next to be filled with probability (i+1)/m, much higher than the 1/m it would be if the preceding slot were empty.

Quadratic probing

"Quadratic probing" uses a hash function of the form:

    h(k,i) = (h'(k) + c1*i + c2*i^2) mod m

where h' is an auxiliary hash function, c1 and c2 != 0 are constants, and i = 0,1,...,m-1.

11.4.4 This method probes T[h'(k)] first, but later probes are separated by increasing distances. It works better than linear probing in avoiding primary clustering, but m, c1, and c2 must be picked carefully. Also, it suffers from "secondary clustering": if h(k1,0) = h(k2,0), then h(k1,i) = h(k2,i) for all i > 0 -- i.e., the two keys produce the same probe sequence. And since h(k,0) determines the entire sequence, there are only m distinct sequences.

Double hashing

Double hashing is one of the best methods, since it approximates uniform hashing. It uses a hash function of the form:

    h(k,i) = ( h1(k) + i*h2(k) ) mod m

where h1 and h2 are auxiliary hash functions. This avoids secondary clustering: even if h1(k1) = h1(k2), there is only a 1/m chance that h2(k1) = h2(k2) as well. Figure 11.5 shows an example. h2(k) must be relatively prime to m for all of T to be searched. One way to ensure this is to let m = 2^p and have h2 always produce an odd number. Another is to make m prime and have h2 always produce a positive integer less than m. For example, we can let:

    h1(k) = k mod m
    h2(k) = 1 + (k mod m')

where m' is a bit less than m (e.g., m-1).

Analysis of open-address hashing

11.4.5 As with chaining, we express the number of probes in an operation in terms of the load factor, alpha = n/m < 1 (since n < m).

Theorem 11.6  Given an open-address hash table with load factor alpha = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1-alpha), assuming uniform hashing.

Proof: On pages 241-242.
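The prime-m variant of double hashing described above can be sketched as (an illustrative sketch; the factory name is made up):

```python
def make_double_hash(m):
    """Double hashing with h1(k) = k mod m and h2(k) = 1 + (k mod (m-1)).
    For prime m, h2(k) lies in 1..m-1 and is therefore relatively prime
    to m, so the probe sequence visits every slot exactly once."""
    def h(k, i):
        h1 = k % m
        h2 = 1 + (k % (m - 1))
        return (h1 + i * h2) % m
    return h
```

Because the first probe h1(k) and the step size h2(k) vary independently, there are on the order of m^2 distinct probe sequences, versus only m for linear or quadratic probing.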
Here is an intuitive argument that the expected number is about 1/(1-alpha) = 1 + alpha + alpha^2 + alpha^3 + ... We always make one probe. With probability approximately alpha, we hit an occupied slot and must make a second probe; with probability approximately alpha^2, the first two probed slots are occupied and we must make a third probe; with probability approximately alpha^3, the first three are occupied and we need a fourth probe; and so on. If T is half full (alpha = 0.5), the average number of probes in an unsuccessful search is at most 1/(1 - 0.5) = 2; if alpha = 0.9, the average number is at most 1/(1 - 0.9) = 10.

Corollary 11.7  Inserting an element into an open-address hash table with load factor alpha requires at most 1/(1-alpha) probes on average, assuming uniform hashing.

Proof: Inserting a key performs an unsuccessful search followed by placement into the first empty slot found, so the expected number of probes is at most 1/(1-alpha).

Theorem 11.8  Given an open-address hash table with load factor alpha < 1, the expected number of probes in a successful search is at most

    (1/alpha) * ln( 1/(1-alpha) )

assuming uniform hashing and that each key in T is equally likely to be searched for.

11.4.6 Proof: A search for a key k follows the same probe sequence that was followed when k was inserted. By Corollary 11.7, if k was the (i+1)st key inserted, the expected number of probes in a search for k is at most 1/(1 - i/m) = m/(m-i). Averaging over all n keys gives the expected number of probes:

  (1/n) * Sum_{i=0}^{n-1} m/(m-i)

    = (m/n) * Sum_{i=0}^{n-1} 1/(m-i)

    = (1/alpha) * ( 1/(m-n+1) + 1/(m-n+2) + ... + 1/m )

    = (1/alpha) * ( H_m - H_{m-n} )      where H_j is the j-th harmonic number

    < (1/alpha) * Integral_{m-n}^{m} (1/x) dx

    = (1/alpha) * ln( m/(m-n) )

    = (1/alpha) * ln( 1/(1-alpha) )

If T is half full (alpha = 0.5), the average number of probes in a successful search is less than 1.387; if alpha = 0.9, the average number is less than 2.559.
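The two bounds are easy to evaluate numerically (an illustrative snippet; the function names are made up), which reproduces the figures quoted above:

```python
import math

def probes_unsuccessful(alpha):
    """Upper bound on expected probes in an unsuccessful search (Thm 11.6)."""
    return 1 / (1 - alpha)

def probes_successful(alpha):
    """Upper bound on expected probes in a successful search (Thm 11.8)."""
    return (1 / alpha) * math.log(1 / (1 - alpha))
```

The contrast is striking: even at 90% full, a successful search is expected to need fewer than 3 probes, while an unsuccessful search may need about 10.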