Hash Tables
The basic idea of hashing is that you provide some function that maps
data to table locations. Once we have such a function, h, we can place any object x in the table by saying:
A[h(x)] = x
... and can retrieve it from the table by looking at:
A[h(x)]
Terminology
Each item that you want to hash into the table
has a key that represents it. For simple objects
such as ints, the key is merely the object itself. For more complex objects,
some user-defined key must be provided. For example, a good key for students
might be the student ID number. The set of all possible keys
is called the universe of keys, abbreviated U.
A hash function takes any key from U and maps it to some
location in the table. Let us call the table size m.
A hash function is some function h(k) such that:
h : U -> [0,m-1]
This says that h is a function that maps keys from U onto the range
[0,m-1]. For an array in C/C++, these are the available indices
in a table of size m.
Hash Functions: The Division Method
One of the simplest hash functions maps ints onto the table range using the mod operator:
h(k) = k % m
Hash functions of this form are said to use the division method of hashing. Note
that this guarantees that we will end up with a value in the range
[0,m-1] (assuming nonnegative keys).
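As a sketch, here is the division method in C++; the function name is just illustrative, and taking the operands as unsigned avoids the negative results that % can produce for signed negative keys:

// A minimal sketch of the division method (names are illustrative).
unsigned int hash_div(unsigned int k, unsigned int m)
{
    return k % m;   // always lands in [0, m-1]
}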
Multi-part Hash Functions
This works very well when keys are ints, but what if they aren't? We can
split the hash function into two parts; the first part takes a key
from U and maps it to an int. The second part takes an int and maps
it onto the range [0,m-1]. We end up with a pair of functions like
this:
f : U -> Z
g : Z -> [0,m-1]
h(k) = g(f(k))
For example, you might hash floats by saying:
f(x) = (int)(100*x)
g(x) = x % m
h(k) = g(f(k))
or
h(k) = ((int)(100*k)) % m
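As a sketch, the float example above might look like this in C++; the scale factor 100 and the cast come straight from the formula, and keys are assumed to be nonnegative:

// A minimal sketch of the two-part hash for floats described above.
unsigned int hash_float(float x, unsigned int m)
{
    unsigned int f = (unsigned int)(100 * x);  // f : U -> Z
    return f % m;                              // g : Z -> [0, m-1]
}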
Desirable Properties of Any Hash Function
There are two desirable properties for a hash function: it should be easy to compute and it should evenly distribute keys across the available table positions. We'll talk more about this later, but remember these two properties.
So what?
The biggest benefit of hash tables is that insertion, find, and deletion
(sort of, see below) can all happen in constant time. This means
that unlike trees and lists, the number of elements in the table has no
bearing on its performance. This is a nice idea in theory, but in practice,
things get a little more complicated. When do complications arise?
Collisions
What if we have a table of size 10 and we want to hash the value 20 into it?
If h(k) = k % m, we simply compute h(k) and perform the insertion:
A[h(k)] = k
A[20 % 10] = 20
A[0] = 20
So we insert 20 at position 0. What if we want to insert 30?
Since 30 % 10 is also 0, we have what is called a collision.
For our purposes, collisions are unavoidable, so we must have some mechanism for collision resolution. Two main strategies exist: separate chaining and open addressing.
Separate Chaining
In separate chaining we make each table location (also called a bucket)
a list. Collision resolution is handled by inserting the element into the list.
With a very bad hash function, we end up with a disproportionate number of elements in a single bucket, and find and remove operations become costly. In a perfect world, with n elements in a table of size m, an evenly distributing hash function would leave about
n/m values in each bucket. The ratio n/m is called the load factor
of the table. Note that the load factor can exceed 1 (and grow without bound) with separate chaining because the lists can be of arbitrary length.
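As a sketch, a separately chained table of nonnegative int keys might look something like this in C++; the struct and member names are illustrative, not a fixed interface:

#include <list>
#include <vector>

// A minimal sketch of separate chaining for nonnegative int keys.
struct ChainedTable {
    std::vector<std::list<int>> buckets;                    // one list per table slot

    ChainedTable(int m) : buckets(m) {}

    int h(int k) const { return k % (int)buckets.size(); }  // division method

    void insert(int k) { buckets[h(k)].push_back(k); }      // a collision just makes the list longer

    bool find(int k) const {
        for (int v : buckets[h(k)])                         // search only the one bucket
            if (v == k) return true;
        return false;
    }

    void remove(int k) { buckets[h(k)].remove(k); }         // delete from the bucket's list
};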
Open Addressing
In open addressing, a collision means that another position must be chosen.
This introduces the idea of the number of tries (or probes) it takes
to find an available slot in the table. Our hash function now takes
a second parameter, the probe number. For example:
h(k,p) = (k % m + p) % m
Because of the addition of p itself to the hash value, this is called linear probing. If we had added p² instead, this would be
quadratic probing. Note that we have to mod the result by the table size
to ensure that our new address is also inside the table.
We begin with a probe value of 0. If h(k,0) is already taken, we try h(k,1).
If that is already taken, we try h(k,2), etc. We give up when the table is
full. Searching is a similar process. Look at h(k,0); if that's not the value
we want, try h(k,1), and so on. We stop when we find the value, when we reach an empty position (meaning the value is not in the table), or when we run out of indices to try.
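As a sketch, insertion and search with linear probing might look like this in C++, again for nonnegative int keys; EMPTY is an assumed sentinel value marking an unused slot:

#include <vector>

// A minimal sketch of open addressing with linear probing.
const int EMPTY = -1;   // assumed sentinel for an unused slot

struct ProbedTable {
    std::vector<int> slots;

    ProbedTable(int m) : slots(m, EMPTY) {}

    // h(k,p) = (k % m + p) % m, the linear probing function from above
    int h(int k, int p) const { return (k % (int)slots.size() + p) % (int)slots.size(); }

    bool insert(int k) {
        for (int p = 0; p < (int)slots.size(); ++p) {       // try each probe in turn
            int i = h(k, p);
            if (slots[i] == EMPTY) { slots[i] = k; return true; }
        }
        return false;                                       // give up: table is full
    }

    bool find(int k) const {
        for (int p = 0; p < (int)slots.size(); ++p) {
            int i = h(k, p);
            if (slots[i] == k)     return true;             // found it
            if (slots[i] == EMPTY) return false;            // empty slot: not present
        }
        return false;                                       // ran out of indices to try
    }
};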
Deletion from a Hash Table
While find and insert have obvious implementations based on the table structure,
deletion is a little trickier. With separate chaining, we simply delete the value from its bucket.
With open addressing, we can't simply remove the value.
Using the example above (the table of size 10 holding 20 and 30), insert both
values and then delete 20. If index 0 is now marked as empty, is it possible
to find 30 by probing? No: the search stops as soon as it sees the empty slot at index 0 and never reaches 30 at index 1. Because of this, we can only mark slots
as deleted; we cannot simply empty them. Since the state of the table at the time of insertion (what was in it and where) determines where an element lands, erasing earlier insertions breaks the probe sequences needed to find or remove
later values.
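As a sketch, deletion by marking could extend the ProbedTable sketch above with a second assumed sentinel, DELETED; find already probes past such slots (they match neither the key nor EMPTY), and insert could be adjusted to reuse them:

// A minimal sketch of deletion by marking, building on the probing sketch above.
const int DELETED = -2;   // assumed sentinel for "something was here once"

void remove(ProbedTable& t, int k)
{
    for (int p = 0; p < (int)t.slots.size(); ++p) {
        int i = t.h(k, p);
        if (t.slots[i] == k)     { t.slots[i] = DELETED; return; }  // mark, don't empty
        if (t.slots[i] == EMPTY) return;                            // never inserted
    }
}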
Resizing?
Another issue that arises with hash tables is the desire to resize them
when either (1) you run out of space (open addressing) or (2) your load
factor gets too high (separate chaining). The problem with changing the
table's size m is that the hash function depends on m, so objects
already in the table generally hash to different locations under the
new size. The way around this is to go through and rehash every value
from the old table into the new table.
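As a sketch, rehashing into a larger ChainedTable (from the separate chaining sketch above) just re-inserts every value so it lands where the hash function mod the new m says it should:

// A minimal sketch of rehashing into a resized table.
ChainedTable resize(const ChainedTable& old, int new_m)
{
    ChainedTable bigger(new_m);
    for (const auto& bucket : old.buckets)   // walk every old bucket
        for (int k : bucket)
            bigger.insert(k);                // h(k) is recomputed with the new m
    return bigger;
}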