Hash Tables
The basic idea of hashing is that you provide some function that maps
data to table locations. Once we have such a function, h, we can place any object x in the table by saying:
A[h(x)] = x
... and can retrieve it from the table by looking at:
A[h(x)]
Terminology
Each item that you want to hash into the table
has a key that represents it. For simple objects
such as ints, the key is merely the object itself. For more complex objects,
some user-defined key must be provided. For example, a good key for students
might be the student ID number. The set of all possible keys
is called the universe of keys, abbreviated U.
A hash function takes any key from U and maps it to some
location in the table. Let us call the table size m.
A hash function is some function h(k) such that:
h : U -> [0,m-1]
This says that h is a function that maps keys from U onto the range
[0,m-1]. For an array in C/C++, these are the available indices
in a table of size m.
Hash Functions: The Division Method
One of the simplest hash functions maps ints onto the table range using the mod operator:
h(k) = k % m
Hash functions of this form are said to use the division method of hashing. Note
that this guarantees that we will end up with a value in the range
[0,m-1] (assuming nonnegative keys).
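As a sketch, here is the division method in C++; the function name is just illustrative, and taking the operands as unsigned avoids the negative results that % can produce for signed negative keys:

// A minimal sketch of the division method (names are illustrative).
unsigned int hash_div(unsigned int k, unsigned int m)
{
    return k % m;   // always lands in [0, m-1]
}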
Multi-part Hash Functions
This works very well when keys are ints, but what if they aren't? We can
split the hash function into two parts; the first part takes a key
from U and maps it to an int. The second part takes an int and maps
it onto the range [0,m-1]. We end up with a pair of functions like
this:
f : U -> Z
g : Z -> [0,m-1]
h(k) = g(f(k))
For example, you might hash floats by saying:
f(x) = (int)(100*x)
g(x) = x % m
h(k) = g(f(k))
or
h(k) = ((int)(100*k)) % m
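As a sketch, the float example above might look like this in C++; the scale factor 100 and the cast come straight from the formula, and keys are assumed to be nonnegative:

// A minimal sketch of the two-part hash for floats described above.
unsigned int hash_float(float x, unsigned int m)
{
    unsigned int f = (unsigned int)(100 * x);  // f : U -> Z
    return f % m;                              // g : Z -> [0, m-1]
}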
Desirable Properties of Any Hash Function
There are two desirable properties for a hash function: it should be easy to compute and it should evenly distribute keys across the available table positions. We'll talk more about this later, but remember these two properties.
So what?
The biggest benefit of hash tables is that insertion, find, and deletion
(sort of, see below) can all happen in constant time. This means
that unlike trees and lists, the number of elements in the table has no
bearing on its performance. This is a nice idea in theory, but in practice,
things get a little more complicated. When do complications arise?
Collisions
What if we have a table of size 10 and we want to hash the value 20 into it?
If h(k) = k % m, we simply compute h(k) and perform the insertion:
A[h(k)] = k
A[20 % 10] = 20
A[0] = 20
So we insert 20 at position 0. What if we want to insert 30?
Since 30 % 10 is also 0, we have what is called a collision.
For our purposes, collisions are unavoidable, so we must have some mechanism for collision resolution. Two main strategies exist: separate chaining and open addressing.
Separate Chaining
In separate chaining we make each table location (also called a bucket)
a list. Collision resolution is handled by inserting the element into the list.
With a very bad hash function, we end up with a disproportionate number of elements in a single bucket, and find and remove operations become costly. In a perfect world, with n elements in a table of size m, an evenly distributing hash function would leave about
n/m values in each bucket. The ratio n/m is called the load factor
of the table. Note that the load factor can exceed 1 (and grow without bound) with separate chaining because the lists can be of arbitrary length.
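As a sketch, a separately chained table of nonnegative int keys might look something like this in C++; the struct and member names are illustrative, not a fixed interface:

#include <list>
#include <vector>

// A minimal sketch of separate chaining for nonnegative int keys.
struct ChainedTable {
    std::vector<std::list<int>> buckets;                    // one list per table slot

    ChainedTable(int m) : buckets(m) {}

    int h(int k) const { return k % (int)buckets.size(); }  // division method

    void insert(int k) { buckets[h(k)].push_back(k); }      // a collision just makes the list longer

    bool find(int k) const {
        for (int v : buckets[h(k)])                         // search only the one bucket
            if (v == k) return true;
        return false;
    }

    void remove(int k) { buckets[h(k)].remove(k); }         // delete from the bucket's list
};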
Open Addressing
In open addressing, a collision means that another position must be chosen.
This introduces the idea of the number of tries (or probes) it takes
to find an available slot in the table. Our hash function now takes
a second parameter, the probe number. For example:
h(k,p) = (k % m + p) % m
Because of the addition of p itself to the hash value, this is called linear probing. If we had added p² instead, this would be
quadratic probing. Note that we have to mod the result by the table size
to ensure that our new address is also inside the table.
We begin with a probe value of 0. If h(k,0) is already taken, we try h(k,1).
If that is already taken, we try h(k,2), etc. We give up when the table is
full. Searching is a similar process. Look at h(k,0); if that's not the value
we want, try h(k,1), and so on. We stop when we find the value, when we reach an empty position (meaning the value is not in the table), or when we run out of indices to try.
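As a sketch, insertion and search with linear probing might look like this in C++, again for nonnegative int keys; EMPTY is an assumed sentinel value marking an unused slot:

#include <vector>

// A minimal sketch of open addressing with linear probing.
const int EMPTY = -1;   // assumed sentinel for an unused slot

struct ProbedTable {
    std::vector<int> slots;

    ProbedTable(int m) : slots(m, EMPTY) {}

    // h(k,p) = (k % m + p) % m, the linear probing function from above
    int h(int k, int p) const { return (k % (int)slots.size() + p) % (int)slots.size(); }

    bool insert(int k) {
        for (int p = 0; p < (int)slots.size(); ++p) {       // try each probe in turn
            int i = h(k, p);
            if (slots[i] == EMPTY) { slots[i] = k; return true; }
        }
        return false;                                       // give up: table is full
    }

    bool find(int k) const {
        for (int p = 0; p < (int)slots.size(); ++p) {
            int i = h(k, p);
            if (slots[i] == k)     return true;             // found it
            if (slots[i] == EMPTY) return false;            // empty slot: not present
        }
        return false;                                       // ran out of indices to try
    }
};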
Deletion from a Hash Table
While find and insert have obvious implementations based on the table structure,
deletion is a little trickier. With separate chaining, we simply delete the value from its bucket.
With open addressing, we can't simply remove the value.
Using the example above (the table of size 10 holding 20 and 30), insert both
values and then delete 20. If index 0 is now marked as empty, is it possible
to find 30 by probing? No: the search stops as soon as it sees the empty slot at index 0 and never reaches 30 at index 1. Because of this, we can only mark slots
as deleted; we cannot simply empty them. Since the state of the table at the time of insertion (what was in it and where) determines where an element lands, erasing earlier insertions breaks the probe sequences needed to find or remove
later values.
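As a sketch, deletion by marking could extend the ProbedTable sketch above with a second assumed sentinel, DELETED; find already probes past such slots (they match neither the key nor EMPTY), and insert could be adjusted to reuse them:

// A minimal sketch of deletion by marking, building on the probing sketch above.
const int DELETED = -2;   // assumed sentinel for "something was here once"

void remove(ProbedTable& t, int k)
{
    for (int p = 0; p < (int)t.slots.size(); ++p) {
        int i = t.h(k, p);
        if (t.slots[i] == k)     { t.slots[i] = DELETED; return; }  // mark, don't empty
        if (t.slots[i] == EMPTY) return;                            // never inserted
    }
}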
Resizing?
Another issue that arises with hash tables is the desire to resize them
when either (1) you run out of space (open addressing) or (2) your load
factor gets too high (separate chaining). The problem with changing the
table's size m is that the hash function depends on m, so objects
already in the table generally hash to different locations under the
new size. The way around this is to go through and rehash every value
from the old table into the new table.
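As a sketch, rehashing into a larger ChainedTable (from the separate chaining sketch above) just re-inserts every value so it lands where the hash function mod the new m says it should:

// A minimal sketch of rehashing into a resized table.
ChainedTable resize(const ChainedTable& old, int new_m)
{
    ChainedTable bigger(new_m);
    for (const auto& bucket : old.buckets)   // walk every old bucket
        for (int k : bucket)
            bigger.insert(k);                // h(k) is recomputed with the new m
    return bigger;
}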