Chapter 11 Hash Tables
Some dynamic set applications, such as the
symbol table for a compiler, only require the
dictionary operations: SEARCH, INSERT, DELETE.
Although a hash table can have O(n) worst-case
behavior for these operations, it is usually
possible to obtain O(1) expected time.
A hash table generalizes an array: direct
addressing, covered in Section 11.1, allows
Theta(1) access to the table, but requires one
array position for every possible key.
Often the number of actual keys is quite a
bit smaller than the number of possible keys.
In this case, a "hash function" is used to
compute an index into a smaller array from
the value of the key. This is discussed
in Section 11.2, where "chaining" is used to
resolve "collisions". Section 11.3 discusses
hash functions.
Section 11.4 discusses another technique for
handling collisions, namely "open addressing".
By using a well-chosen hash function, each of
these methods can give O(1) average case run
time. If the hash table is static (no inserts
or deletes), Section 11.5 shows how to obtain
O(1) worst-case run time by using "perfect
hashing".
11.1 Direct-address tables
If the keys are drawn from a reasonably small
"universe" U = {0, 1, ..., m-1} of values, and
no two elements have the same key, we can use
a "direct-address table", T[0..m-1], in which
each position, or "slot", T[k] corresponds to
the key k in U. If there is no element with
key k, then T[k] = NIL. Figure 11.1 shows an
example.
We can implement the dictionary operations as:
DIRECT-ADDRESS-SEARCH(T,k)
1 return T[k]
DIRECT-ADDRESS-INSERT(T,x)
1 T[x.key] = x
DIRECT-ADDRESS-DELETE(T,x)
1 T[x.key] = NIL
each of which takes only O(1) time.
We can either store all the satellite data in
T or we can store a key and a pointer to the
satellite data. We can also omit storing the
key, since it is just the array index, which
we know. However, then we need a way to tell
if the slot is empty.
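As a concrete illustration, here is a minimal Python sketch of a direct-address table. The names `Element` and `DirectAddressTable` are ours, not from the text; `None` plays the role of NIL.

```python
from collections import namedtuple

# A toy element type; real elements may carry arbitrary satellite data.
Element = namedtuple("Element", ["key", "data"])

class DirectAddressTable:
    """Direct-address table T[0..m-1]; slot k holds the element with key k."""
    def __init__(self, m):
        self.slots = [None] * m      # None plays the role of NIL

    def search(self, k):             # DIRECT-ADDRESS-SEARCH
        return self.slots[k]

    def insert(self, x):             # DIRECT-ADDRESS-INSERT
        self.slots[x.key] = x

    def delete(self, x):             # DIRECT-ADDRESS-DELETE
        self.slots[x.key] = None
```

Each operation is a single array access, so all three run in O(1) time.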
11.2 Hash tables
Problems with direct addressing:
1) storing array T of size |U| is impractical
2) the set of keys actually used is small
compared to |U|, so much space is wasted
Cure: use a "hash function" h that maps each
key k into the range 0..m-1, where m is
smaller than |U|. That is, h maps U into
the slots of a "hash table" T[0..m-1]:
h : U --> {0, 1, ..., m-1}
We say an element with key k "hashes" to slot
h(k), and h(k) is the "hash value" of k. Thus
storage needs are reduced from Theta(|U|) to
Theta(m). The run times of the operations are
Theta(1) on average, but can be Theta(n) in
the worst case. This is caused by the need to
handle "collisions", when two keys hash to the
same slot. Figure 11.2 shows an example.
We would like to avoid collisions by choosing
a good hash function, maybe a "random" one
(which captures the random mixing idea of the
concept "to hash" something). But collisions
can't be avoided entirely since |U| > m, so we
need a way of resolving them. The remainder
of this section presents "chaining" as one
such method; another, "open addressing" is
discussed in Section 11.4.
Collision resolution by chaining
In "chaining", collisions are resolved by
putting all elements that hash to the same
slot j into a linked list T[j], as shown in
Figure 11.3. The dictionary operations then
become:
CHAINED-HASH-INSERT(T,x)
1 insert x at the head of list T[h(x.key)]
CHAINED-HASH-SEARCH(T,k)
1 search for an element with key k in T[h(k)]
CHAINED-HASH-DELETE(T,x)
1 delete x from the list T[h(x.key)]
The worst-case run time of a search is O(n),
if there are n items in the table and they all
hash to the same slot. The worst-case run
time of insertion is O(1) (unless we want to
search the slot for duplicates first). Also,
the worst-case run time of deletion is O(1)
if the lists are doubly linked, since we are
given a pointer to the element, x.
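The three chained operations can be sketched in Python as follows. The class and names are ours; Python lists stand in for the linked lists, so only insertion matches the O(1) bound exactly (deletion would need a doubly linked list, as noted above).

```python
from collections import namedtuple

Element = namedtuple("Element", ["key", "data"])   # toy element type (ours)

class ChainedHashTable:
    """Hash table resolving collisions by chaining; T[j] is the chain for slot j."""
    def __init__(self, m, h):
        self.h = h                                 # hash function U -> {0..m-1}
        self.table = [[] for _ in range(m)]

    def insert(self, x):                           # CHAINED-HASH-INSERT
        self.table[self.h(x.key)].insert(0, x)     # insert at head of chain: O(1)

    def search(self, k):                           # CHAINED-HASH-SEARCH
        for x in self.table[self.h(k)]:            # scan the chain for slot h(k)
            if x.key == k:
                return x
        return None

    def delete(self, x):                           # CHAINED-HASH-DELETE
        # O(1) with a doubly linked list; Python's list.remove scans the chain.
        self.table[self.h(x.key)].remove(x)
```

For example, with m = 5 and h(k) = k mod 5, the keys 7 and 12 both hash to slot 2 and end up in the same chain.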
Analysis of hashing with chaining
As mentioned above, hash tables can have bad
worst-case performance, but good average case
run times depending on how evenly the hash
function distributes keys.
To analyze the average case, we make the
assumption of "simple uniform hashing":
that any element is equally likely to hash
into any of the m slots. For the analysis, we
define the "load factor" alpha as n/m where n
is the number of elements and m is the number
of slots in the hash table T. Note that alpha
can be less than, equal to, or greater than 1.
For j = 0, 1, ..., m-1, we let n_j denote the
length of the list T[j], so that the average
value of n_j is E[n_j] = alpha = n/m, and
n = n_0 + n_1 + ... + n_(m-1). (*)
We compute the expected number of elements
to be examined for both a successful and for
an unsuccessful search. We note that there is
an O(1) cost for computing h(k).
Theorem 11.1
In a hash table in which collisions are
resolved by chaining, an unsuccessful search
takes expected time Theta(1 + alpha), under
the assumption of simple uniform hashing.
Proof: Any key k not already in the table is
equally likely to hash to any of the m slots.
For an unsuccessful search, we will examine
each of the elements of T[h(k)], which has
expected length E[n_(h(k))] = alpha. Thus
the total running time is Theta(1 + alpha),
accounting for the cost of computing h(k)
and searching the list of length alpha.
For a successful search, the probability that
a list is searched is proportional to its
length. However, Theta(1 + alpha) is still
the expected time.
Theorem 11.2
In a hash table where collisions are resolved
by chaining, a successful search takes
expected time Theta(1 + alpha), on average,
under the simple uniform hashing assumption.
Proof: We assume that the target is equally
likely to be any of the n elements in T. The
number of elements examined is one more than
the number of elements that precede x in x's
list (the extra one being x itself). So, to
find the expected number of
elements examined, we take the average over
the n elements in T, of 1 plus the expected
number of elements added to x's list after x
was added to it. Let x_i denote the i-th
element inserted into T, and k_i = x_i.key
for i = 1, 2, ..., n. For keys k_i and k_j,
we define the indicator random variable
X_ij = I{h(k_i) = h(k_j)}. Assuming simple
uniform hashing, Pr{h(k_i) = h(k_j)} = 1/m, so
by Lemma 5.1 (page 118) E[X_ij] = 1/m also.
Thus, the expected number of elements examined
in a successful search is:
E[ (1/n) * Sum_{i=1}^{n} ( 1 + Sum_{j=i+1}^{n} X_ij ) ]
= (1/n) * Sum_{i=1}^{n} ( 1 + Sum_{j=i+1}^{n} E[X_ij] )   (linearity of expectation)
= (1/n) * Sum_{i=1}^{n} ( 1 + Sum_{j=i+1}^{n} 1/m )
= 1 + (1/(nm)) * Sum_{i=1}^{n} (n - i)
= 1 + (1/(nm)) * n(n-1)/2
= 1 + (n-1)/(2m)
= 1 + alpha/2 - alpha/(2n)
So the total time for a successful search is:
Theta(2+alpha/2-alpha/(2n)) = Theta(1 + alpha)
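The closed form can be sanity-checked with exact rational arithmetic: averaging 1 + (n-i)/m over i = 1..n must give exactly 1 + (n-1)/(2m) = 1 + alpha/2 - alpha/(2n). A small sketch (function names are ours):

```python
from fractions import Fraction

def avg_cost_term_by_term(n, m):
    """(1/n) * Sum_{i=1}^{n} (1 + (n-i)/m), the sum from the proof."""
    return sum(1 + Fraction(n - i, m) for i in range(1, n + 1)) / n

def avg_cost_closed_form(n, m):
    """1 + alpha/2 - alpha/(2n), with alpha = n/m."""
    alpha = Fraction(n, m)
    return 1 + alpha / 2 - alpha / (2 * n)

# The two expressions agree exactly for any n and m:
assert avg_cost_term_by_term(100, 8) == avg_cost_closed_form(100, 8)
assert avg_cost_term_by_term(7, 13) == avg_cost_closed_form(7, 13)
```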
So, if m is proportional to n, we have
n = O(m), and thus alpha = n/m = O(m)/m = O(1),
and therefore searching takes O(1) time on
average. Recalling
that insertion takes O(1) worst-case and
deletion takes O(1) for doubly-linked lists,
we see that all dictionary operations can be
supported in O(1) time on the average.
11.3 Hash functions
We first discuss what makes a good hash
function, then we list three methods for
creating them: hashing by division, hashing by
multiplication, and universal hashing, which
gives provably good results via randomization.
What makes a good hash function?
The ideal goal is simple uniform hashing, but
we usually don't know the distribution of keys
and they may not be drawn independently. As
an example, if the keys are known to be random
real numbers k drawn uniformly and
independently from the range [0,1), the
hash function h(k) = floor(km) satisfies the
condition of simple uniform hashing.
A good strategy is to make the hash value
independent of patterns in the data. E.g.
the division method gives good results if we
divide by a prime unrelated to the data.
Heuristic methods, such as hashing by division
and hashing by multiplication, can give good
hash functions. Information about the keys
can be useful. For example the "nearby" keys
num and nums may occur in the symbol table of
a program that is being compiled -- it would
be useful if they hashed to different places,
preferably far apart if we are using linear
probing in open addressing. Universal hashing
often provides such hash value "spreading".
Interpreting keys as natural numbers
Most hash functions assume that the universe
of keys is the set N = {0, 1, 2, ...} of
natural numbers. So if keys are not natural
numbers, we must find a way to interpret them
as natural numbers. As an example, character
strings can be interpreted as (large) integers
in base 128, so num expressed as a radix-128
integer is 110*128^2 + 117*128 + 109 =
1817325, since the ASCII codes for n, u, and
m are 110, 117, and 109 respectively. So, in
the following we assume the keys are integers.
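The radix-128 interpretation is easy to sketch in Python (the function name is ours):

```python
def string_to_natural(s):
    """Interpret an ASCII string as an integer written in radix 128."""
    n = 0
    for ch in s:
        n = 128 * n + ord(ch)    # ord() gives the character's ASCII code
    return n

# "num": 110*128^2 + 117*128 + 109
assert string_to_natural("num") == 1817325
```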
11.3.1 The division method
In the "division method", we define the hash
function by h(k) = k mod m, which is quite
fast since it requires just one division.
With the division method, it is best to avoid
some values of m. E.g., if m = 2^p, then h(k)
is just the p lowest-order bits of k, which is
fine if we know all such lowest-order bits are
equally likely, but not a good choice if we
want to involve all bits in the hash value.
Also, choosing m = 2^p - 1 may not be good if
k is a character string (Exercise 11.3-3).
As an example, if we want a table to hold
2000 strings, allowing an average of 3 strings
per slot, then m = 701 is a good choice since
it is a prime near 2000/3 and not near a
power of 2.
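A one-line sketch of the division method with the m = 701 from this example (any prime away from a power of 2 works similarly):

```python
def h_division(k, m=701):
    """Division-method hash: h(k) = k mod m, with m = 701 as in the example."""
    return k % m

# One division per hash, and every value lands in 0..m-1:
assert all(0 <= h_division(k) < 701 for k in (0, 700, 701, 1817325))
```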
11.3.2 The multiplication method
The multiplication method takes two steps:
first multiply k by a constant A in the range
0 < A < 1, and extract the fractional part of
kA, then multiply this value by m and take the
floor of the result. So the hash function is:
h(k) = floor( m * ( kA mod 1 ) )
where "kA mod 1" is the fractional part of kA.
An advantage of this method is that the value
of m is not critical. We can choose m = 2^p,
and proceed as follows. Assume the word size
of the computer is w bits and k fits in a word.
We choose A to be a fraction of the form s/2^w
where s is an integer with 0 < s < 2^w. First
multiply k by the w-bit integer s = A*2^w,
giving the 2w-bit integer r1*2^w + r0, r1 & r0
being the high and low words of the product.
The desired p-bit hash value consists of the p
most significant bits of r0. See Figure 11.4.
Some choices of A work better than others,
possibly depending on the data. Knuth says
A close to (sqrt(5)-1)/2 = 1/phi often works
well, where phi is the golden ratio. Page 232
has an example with k = 123456, p = 14, and
w = 32. A = 2654435769/2^32 is the fraction
of the form s/2^32 closest to (sqrt(5)-1)/2.
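The word-level recipe can be sketched as follows, using the parameters w = 32, p = 14, and s = 2654435769 from the text (the function name is ours):

```python
def h_multiplication(k, w=32, p=14, s=2654435769):
    """Multiplication-method hash with m = 2^p and A = s/2^w.
    r0 is the low w-bit word of the 2w-bit product k*s; the hash
    value is the p most significant bits of r0."""
    r0 = (k * s) & ((1 << w) - 1)   # low w bits of k*s
    return r0 >> (w - p)            # top p bits of r0

# The example from page 232: k = 123456 hashes to slot 67.
assert h_multiplication(123456) == 67
```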
11.3.3 Universal hashing (skipped)
11.4 Open addressing
In "open addressing", all the elements are
stored in the table itself, so the load factor
alpha = n/m can never exceed 1 and the table,
T, can fill up. We
could store the pointers of a linked list in
the table, but to save space, we compute the
slots to examine or "probe". The sequence of
slots examined is called the "probe sequence",
and depends on the key being considered. The
hash function now includes the "probe number",
0, 1, ..., m-1 as a second argument:
h: U x {0,1,...,m-1} --> {0,1,...,m-1}
where we require that for every key k, the
probe sequence <h(k,0), h(k,1), ..., h(k,m-1)>
be a permutation of <0,1,...,m-1>, so that
eventually every slot is considered as the
table gets full.
We assume the elements in T are keys with no
satellite data, so each slot contains a key
or NIL. Here is the insertion algorithm:
HASH_INSERT(T,k)
1 i = 0
2 repeat j = h(k,i)
3 if T[j] == NIL
4 T[j] = k
5 return j
6 else i = i + 1
7 until i == m
8 error "hash table overflow"
The search algorithm uses the same
probe sequence as insertion, so if it finds a
NIL, the key is not in the table, assuming no
keys are deleted from the table.
HASH_SEARCH(T,k)
1 i = 0
2 repeat j = h(k,i)
3 if T[j] == k
4 return j
5 i = i + 1
6 until T[j] == NIL or i == m
7 return NIL
Deletion is tricky, since if we just set the
slot j to NIL, we couldn't find any key whose
insertion previously probed j and found it
occupied. One solution is to mark slot j as
DELETED instead of NIL, then HASH_SEARCH still
works and HASH_INSERT can be modified to
insert into DELETED slots also. However, now
search times can become longer and no longer
depend on alpha, so chaining is usually used
in hashing with deletions.
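Both algorithms can be sketched in Python, instantiated here with the simple probe sequence h(k,i) = (k mod m + i) mod m; the table size and keys in the usage example are illustrative:

```python
NIL = None

def probe(k, i, m):
    """Illustrative probe function: h(k, i) = (h'(k) + i) mod m
    over the auxiliary hash h'(k) = k mod m (linear probing)."""
    return (k % m + i) % m

def hash_insert(T, k):
    """HASH_INSERT: follow k's probe sequence until an empty slot is found."""
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is NIL:
            T[j] = k
            return j
    raise RuntimeError("hash table overflow")

def hash_search(T, k):
    """HASH_SEARCH: same probe sequence; a NIL slot means k is absent
    (assuming no deletions)."""
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] == k:
            return j
        if T[j] is NIL:
            return NIL
    return NIL
```

For example, with m = 11, inserting key 10 and then key 21 (both have h'(k) = 10) places 21 in slot 0, the next slot in its probe sequence.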
For run time analysis, we make the assumption
of "uniform hashing": each key is equally
likely to have any of the m! permutations of
<0,1,...,m-1> as its probe sequence. This
generalizes simple uniform hashing in which
just one number was produced. Double hashing,
defined below, gives a good approximation.
We look at three common methods to
compute probe sequences: linear probing,
quadratic probing, and double hashing. The
first two yield m probe sequences; the third
yields m^2 sequences and the best results.
Linear probing
Let h': U --> {0,1,...,m-1} be an ordinary
hash function, the "auxiliary hash function",
then "linear probing" uses the hash function:
h(k,i) = (h'(k) + i) mod m
for i = 0,1,...,m-1. I.e. the slots probed
are T[h'(k)], T[h'(k)+1], ..., T[m-1], T[0],
T[1], ..., T[h'(k)-1]. The entire sequence is
determined by the first slot T[h'(k)], so
there are only m distinct probe sequences.
Linear probing is prone to the problem of
"primary clustering" -- long runs of occupied
slots. This happens because an empty slot
preceded by i full slots gets filled next with
probability (i+1)/m, much higher than the 1/m
probability if the preceding slot were empty.
Quadratic probing
"Quadratic probing" uses a hash function of
the form:
h(k,i) = (h'(k) + c1*i + c2*i^2) mod m
where h' is an auxiliary hash function, c1 and
c2 != 0 are constants, and i = 0,1,...,m-1.
This method probes T[h'(k)] first, but
later probes are separated by increasing
distances. This works better than linear
probing in avoiding primary clustering, but m,
c1, and c2 must be picked carefully. Also, it
suffers from "secondary clustering": if
h(k1,0) = h(k2,0), then h(k1,i) = h(k2,i) for
i > 0 -- i.e. they produce the same probe
sequence. And since h(k,0) determines the
entire sequence, there are only m sequences.
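Secondary clustering is easy to observe: the probe sequence depends on k only through h'(k), so two keys with equal auxiliary hash share their whole sequence. A sketch (the constants c1, c2 and table size m here are illustrative only and would need careful selection in practice):

```python
def quadratic_probe(k, i, m=13, c1=1, c2=3):
    """Quadratic probing: h(k, i) = (h'(k) + c1*i + c2*i^2) mod m,
    with auxiliary hash h'(k) = k mod m. Constants are illustrative."""
    return (k % m + c1 * i + c2 * i * i) % m

# 5 and 18 have the same auxiliary hash (5 mod 13 == 18 mod 13),
# so they generate identical probe sequences:
assert all(quadratic_probe(5, i) == quadratic_probe(18, i) for i in range(13))
```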
Double hashing
Double hashing is one of the best methods,
since it approximates uniform hashing. Double
hashing uses a hash function of the form:
h(k,i) = ( h1(k) + i*h2(k) ) mod m
where h1 and h2 are auxiliary hash functions.
This avoids secondary clustering since, even
if h1(k1) = h1(k2), there is only a 1/m chance
that h2(k1) = h2(k2) also. Figure 11.5 shows
an example. h2(k) must be relatively prime to
m for all of T to be searched. One way to do
this is to let m = 2^p and have h2 always
produce an odd number. Another way to do this
is to make m prime and have h2 always produce
a positive integer less than m. For example,
we can let
h1(k) = k mod m
h2(k) = 1 + (k mod m')
where m' is a bit less than m (e.g. m-1).
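The h1/h2 pair above can be sketched directly; with m prime and h2(k) in 1..m-1, every probe sequence is a permutation of the slots (m = 13 here is illustrative):

```python
def double_hash_probe(k, i, m=13):
    """Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m, with the text's
    h1(k) = k mod m and h2(k) = 1 + (k mod m'), where m' = m - 1.
    m = 13 (a prime) is an illustrative choice."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    return (h1 + i * h2) % m

# Since gcd(h2(k), m) = 1, each key's probe sequence visits every slot once:
assert sorted(double_hash_probe(123456, i) for i in range(13)) == list(range(13))
```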
Analysis of open-address hashing
As with chaining, we express the number of
probes in an operation in terms of the load
factor, alpha = n/m < 1, since n < m.
Theorem 11.6 Given an open-address hash table
with load factor alpha = n/m < 1, the expected
number of probes in an unsuccessful search is
at most 1/(1-alpha), assuming uniform hashing.
Proof: On pages 274-275. Here is an intuitive
argument that the expected number is about
1/(1-alpha) = 1 + alpha + alpha^2 + alpha^3 + ...
We always make one probe. The probability is
alpha that we hit an occupied slot and must
make a second probe. There is probability
alpha^2 that both probed slots are occupied
and we must make a third probe; probability
alpha^3 that the first 3 probed slots are
occupied and we need a fourth probe; and so on.
If T is half full (alpha = .5), the average
number of probes in an unsuccessful search is
at most 1/(1 - .5) = 2; if alpha = .9, the
average number is at most 1/(1 - .9) = 10.
Corollary 11.7 Inserting an element into an
open-address hash table with load factor alpha
requires at most 1/(1-alpha) probes on
average, assuming uniform hashing.
Proof: Inserting a key uses an unsuccessful
search followed by placement into an empty
slot, so the expected number of probes is at
most 1/(1-alpha).
Theorem 11.8 Given an open-address
hash table with load factor alpha < 1, the
expected number of probes in a successful
search is at most 1/alpha * ln(1/(1-alpha))
assuming uniform hashing and that each key in
T is equally likely to be searched for.
Proof: A search for key k follows the same
probe sequence as when it was inserted. By
Corollary 11.7, the expected number of probes
to insert the (i+1)st key k is at most
1/(1 - i/m) = m/(m-i). Averaging over all n
keys gives the average number of probes:
(1/n) * Sum_{i=0}^{n-1} m/(m-i)
= (m/n) * Sum_{i=0}^{n-1} 1/(m-i)
= (m/n) * (H_m - H_{m-n})
= (m/n) * ( 1/(m-n+1) + 1/(m-n+2) + ... + 1/m )
< (m/n) * Integral_{m-n}^{m} (1/x) dx
= (m/n) * ln(m/(m-n))
= (1/alpha) * ln(1/(1-alpha))
If T is half full (alpha = .5), the average
number of probes in a successful search is
less than 1.387; if alpha = .9, the average
number is less than 2.5585.
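These two numeric claims can be checked directly (a quick sketch; the function name is ours):

```python
import math

def successful_search_bound(alpha):
    """Theorem 11.8 upper bound: (1/alpha) * ln(1/(1 - alpha))."""
    return (1 / alpha) * math.log(1 / (1 - alpha))

assert successful_search_bound(0.5) < 1.387   # 2 * ln 2 ~= 1.3863
assert successful_search_bound(0.9) < 2.5585  # (10/9) * ln 10 ~= 2.5584
```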