In the last lecture we defined the notion of pairwise independence, and saw some ways to use this concept in reducing the randomness of a randomized algorithm. Today we will apply these ideas to designing hash functions and “perfect” hash tables.
1. Universal Hash Families
Last time we showed how to create a joint distribution on random variables $Y_0, \ldots, Y_{u-1}$ such that
- each $Y_i$ is uniformly distributed on some finite set $[m] = \{0, 1, \ldots, m-1\}$, and
- the joint distribution on the $Y_i$'s is pairwise independent.
Let $\Omega$ be the sample space underlying the $Y_i$'s. Drawing a sample $\omega$ from $\Omega$ determines values for all the $Y_i$'s. We can think of this another way: after choosing $\omega$, then associated with every $i$ there is a value in $[m]$ (namely, the value of $Y_i$).
Formally, for every $\omega \in \Omega$ there is a function $h_\omega : \{0, \ldots, u-1\} \to [m]$ given by $h_\omega(i) = Y_i(\omega)$. (The random variable $Y_i$ is implicitly a function of $\omega$.) This family of functions $\mathcal{H} = \{\, h_\omega : \omega \in \Omega \,\}$ satisfies the following important property:
$$\Pr_{\omega \in \Omega}[\, h_\omega(x) = h_\omega(y) \,] \;\le\; \frac{1}{m} \qquad \text{for all distinct } x, y.$$
A family of functions satisfying this property is called a $2$-universal hash family. So we can restate our construction from last time as follows.
Theorem 1 For any $u$ a power of two, there exists a $2$-universal hash family $\mathcal{H}$ with domain $U$ of size $u$, range $[u]$, and $|\mathcal{H}| = u^2$. Sampling a function from this family requires $O(\log u)$ mutually independent, uniform random bits. Consequently, each hash function can be represented using $O(\log u)$ bits of space. Evaluating any hash function in this family at any point of $U$ takes only a constant number of arithmetic operations involving $O(\log u)$-bit integers.
One can easily modify this construction to handle any $u$ and $m$ with $m \le u$, both $u$ and $m$ powers of two (for example, by keeping only the low-order $\log_2 m$ bits of each output).
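To make this concrete, here is a minimal sketch of sampling and evaluating such a hash function. The lecture's construction works over a field of size $2^k$; this sketch instead uses the classic Carter–Wegman modular family $h_{a,b}(x) = ((ax + b) \bmod p) \bmod m$, which satisfies the same $2$-universal guarantee. The names and the choice of prime are illustrative assumptions, not part of the lecture.

```python
import random

# Illustrative 2-universal family (Carter-Wegman), an assumption standing in
# for the lecture's field-based construction: h_{a,b}(x) = ((a*x + b) % p) % m.
P = (1 << 61) - 1  # a prime exceeding the universe size (2^61 - 1 is prime)

def sample_hash(m):
    """Sample h uniformly from the family; two random words suffice."""
    a = random.randrange(1, P)  # a is nonzero
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = sample_hash(8)
print(h(42) in range(8))  # True: every output lands in {0, ..., m-1}
```

Note that, as Theorem 1 promises, a sampled function is described by just two words ($a$ and $b$) and evaluated with a constant number of arithmetic operations.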
2. Perfect Hashing
Let $U = \{0, 1, \ldots, u-1\}$ be a “universe” of items. We will assume that our computational model has words of $O(\log u)$ bits and that any standard arithmetic operation involving $O(\log u)$-bit words takes $O(1)$ time.
Suppose we wish to store a given subset $S \subseteq U$ with $|S| = n$ as a “dictionary”, so that we can efficiently test whether any element of $U$ belongs to $S$. There are many well-known solutions to this problem. For example, a balanced binary tree allows us to store the dictionary in $O(n)$ words of space, while search operations take $O(\log n)$ time in the worst case. (Insertion and deletion of items also take $O(\log n)$ time.)
We will present an alternative dictionary design based on hashing. This is a “static dictionary”, which does not allow insertion or deletion of items.
Theorem 2 There is a randomized algorithm that, given a set $S \subseteq U$ with $|S| = n$, stores the set in a data structure using $O(n)$ words of space. There is a deterministic algorithm that, given any item $x \in U$, can search for $x$ in that data structure in $O(1)$ time in the worst case.
Let $S \subseteq U$ be any set with $|S| = n$, and let $m$ be a power of two (to be chosen below). Theorem 1 gives us a $2$-universal hash family $\mathcal{H}$ mapping $U$ to $[m]$ for which any hash function can be evaluated in $O(1)$ time.
Using this family, consider the following simple design for our dictionary. First create a table of size $m$ and randomly choose a function $h$ from the hash family. Can we simply store each item $x \in S$ in the table in location $h(x)$? This would certainly be very efficient, because finding $x$ in the table only requires evaluating $h(x)$, which takes only $O(1)$ time. But the trouble of course is that there might be collisions: there might be two distinct items $x, y \in S$ such that $h(x) = h(y)$, meaning that $x$ and $y$ want to lie in the same location of the table.
Most likely you have discussed hash table collisions in an undergraduate class on data structures. One obvious way to try to avoid collisions is to take $m$ to be larger; the problem is that this increases the storage requirements, and we are aiming for a data structure that uses $O(n)$ words. Alternatively, one can allow collisions to happen and use some other technique to deal with them. For example, one could use chaining, in which all items hashing to the same location of the table are stored in a linked list, or linear probing, in which some items hashing to the same location are relocated to nearby locations. Unfortunately, neither of those techniques solves our problem: when the table size is $O(n)$, both of those techniques will typically require super-constant time to search the table in the worst case.
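For contrast with what follows, here is a minimal sketch of chaining as just described: colliding items share a list, so a search may have to scan an entire chain. The hash family is again the illustrative Carter–Wegman family (an assumption, not the lecture's field-based construction).

```python
import random

# Chaining sketch: each table slot holds a list of the items hashing there.
P = (1 << 61) - 1  # prime modulus for the illustrative Carter-Wegman family

def sample_hash(m):
    a, b = random.randrange(1, P), random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

def build_chained(S, m):
    h = sample_hash(m)
    table = [[] for _ in range(m)]
    for x in S:
        table[h(x)].append(x)  # collisions pile up in the same list
    return h, table

def search_chained(h, table, x):
    return x in table[h(x)]  # worst case: scan the longest chain

h, table = build_chained(range(0, 1000, 7), m=256)
print(search_chained(h, table, 700), search_chained(h, table, 701))  # True False
```

With $m = O(n)$ the longest chain is typically super-constant, which is exactly why this design does not achieve worst-case $O(1)$ search.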
The solution we present today is based on a simple two-level hierarchical hashing scheme. At the top level, a single hash function is used to assign each item to a “bucket” in the top-level table. Then, for every bucket of the top-level table we create a new second-level hash table to store the items that hashed to that bucket.
Collisions. Using the pairwise independence property, we can easily determine the probability of any two items having a collision. For any unordered pair of distinct items $x, y \in S$,
$$\Pr_h[\, h(x) = h(y) \,] \;\le\; \frac{1}{m}. \qquad (1)$$
Let $C$ denote the number of collisions of $h$ on $S$, i.e., the number of unordered pairs $\{x, y\} \subseteq S$ with $h(x) = h(y)$. By (1) and linearity of expectation, $\mathrm{E}[C] \le \binom{n}{2}/m < n^2/(2m)$. In particular, if we choose $m \ge n^2$ then $\mathrm{E}[C] < 1/2$, so
$$\Pr[\, h \text{ has any collision on } S \,] \;=\; \Pr[\, C \ge 1 \,] \;\le\; \mathrm{E}[C] \;<\; \frac{1}{2} \qquad (2)$$
by Markov's inequality. The problem is that storing a table of size $m \ge n^2$ would require $\Omega(n^2)$ words of space.
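The Markov bound above can be checked empirically. The sketch below samples many hash functions for a table of size $m = n^2$ and measures how often the sampled function is collision-free on $S$; the fraction should exceed $1/2$. The hash family is the illustrative Carter–Wegman family, not the lecture's field-based construction.

```python
import random

# Empirical check: with m = n^2, a sampled hash function is collision-free
# on S with probability greater than 1/2 (Markov's inequality).
P = (1 << 61) - 1  # prime modulus for the illustrative hash family

def sample_hash(m):
    a, b = random.randrange(1, P), random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

def collision_free(S, h):
    return len({h(x) for x in S}) == len(S)

n = 32
S = random.sample(range(10**6), n)
trials = 2000
ok = sum(collision_free(S, sample_hash(n * n)) for _ in range(trials))
print(ok / trials)  # empirically above 1/2, as the bound predicts
```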
Our design. Instead, we will choose $m$ to be $2n$ (rounded up to a power of two). This will typically result in some collisions, but the second-level hash tables are used to deal with those collisions. The top-level hash function is denoted $g$, and the elements of $[m]$ are called buckets. By (1), the expected number of collisions of $g$ on $S$ satisfies $\mathrm{E}[C] \le \binom{n}{2}/m < n/4$, so Markov's inequality gives
$$\Pr[\, C \ge n/2 \,] \;\le\; \frac{\mathrm{E}[C]}{n/2} \;<\; \frac{1}{2}.$$
Let us condition on that event happening: the number of collisions $C$ of the top-level hash function is at most $n/2$. For every bucket $j \in [m]$, let
$$S_j \;=\; \{\, x \in S \,:\, g(x) = j \,\}$$
be the set of items that are hashed to bucket $j$ by the top-level hash function. Let $n_j = |S_j|$. For each $j$ we will sample a new second-level hash function $h_j$ mapping $U$ to $[m_j]$, where $m_j$ is $n_j^2$ rounded up to a power of two. With probability at least $1/2$ there will be no collisions of the hash function $h_j$ among the items of $S_j$, as shown by (2). We will repeatedly sample $h_j$ and count its number of collisions until finding a function with no collisions; since each sample succeeds with probability at least $1/2$, in expectation at most two samples are needed.
Search operation. Every item in $S$ is stored in exactly one location in these second-level tables. So to search our data structure for an item $x \in U$ we follow a very simple procedure. First we compute $j = g(x)$, the top-level bucket containing item $x$. Then we compute $h_j(x)$, which gives the unique location where $x$ could be stored in the second-level table for bucket $j$; the item $x$ belongs to $S$ if and only if it is stored at that location. This process takes $O(1)$ time.
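The whole two-level scheme can be sketched as follows. The resampling loops mirror the construction above: the top-level function $g$ is resampled until it has at most $n/2$ collisions, and each second-level function $h_j$ is resampled until it is collision-free on its bucket. The hash family is the illustrative Carter–Wegman family (an assumption; the lecture's field-based family would work the same way).

```python
import random

P = (1 << 61) - 1  # prime modulus for the illustrative hash family

def sample_hash(m):
    a, b = random.randrange(1, P), random.randrange(0, P)
    return (a, b, m)

def evaluate(h, x):
    a, b, m = h
    return ((a * x + b) % P) % m

def next_pow2(v):
    return 1 << (v - 1).bit_length() if v > 1 else v

def build(S):
    n = len(S)
    m = next_pow2(2 * n)
    while True:  # resample g until at most n/2 top-level collisions
        g = sample_hash(m)
        buckets = [[] for _ in range(m)]
        for x in S:
            buckets[evaluate(g, x)].append(x)
        if sum(len(b) * (len(b) - 1) // 2 for b in buckets) <= n / 2:
            break
    tables = []
    for bucket in buckets:  # resample each h_j until it is collision-free
        if not bucket:
            tables.append((None, []))
            continue
        mj = next_pow2(len(bucket) ** 2)
        while True:
            hj = sample_hash(mj)
            slots = [None] * mj
            for x in bucket:
                i = evaluate(hj, x)
                if slots[i] is not None:
                    break  # collision: resample hj
                slots[i] = x
            else:
                tables.append((hj, slots))
                break
    return g, tables

def search(structure, x):
    g, tables = structure
    hj, slots = tables[evaluate(g, x)]
    return hj is not None and slots[evaluate(hj, x)] == x

D = build(list(range(0, 200, 3)))
print(search(D, 99), search(D, 100))  # True False
```

Each search evaluates exactly two hash functions and inspects one slot, matching the $O(1)$ worst-case bound.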
Space requirements. The only thing left to analyze is the space requirements. The size of the top-level table is at most $4n$, by construction. The total size of the second-level tables is
$$\sum_{j \in [m]} m_j \;\le\; \sum_{j} 2 n_j^2 \;=\; 2 \sum_j \Big( n_j + 2 \tbinom{n_j}{2} \Big) \;=\; 2n + 4C \;\le\; 4n,$$
using $\sum_j n_j = n$, $\sum_j \binom{n_j}{2} = C$, and our conditioning that $C \le n/2$.
We also need to write down the descriptions of the hash functions themselves. By Theorem 1, each hash function can be represented using only a constant number of words, and there is one second-level function per bucket plus the top-level function, i.e., $O(n)$ functions in total. So the total space required is $O(n)$ words.