One recent line of research I've really been enjoying is the work on cuckoo hashing variants. Currently, my graduate student Adam Kirsch and I are working on a submission tackling the question of how to design cuckoo hashing variants that are more practical in the setting of hardware implementation, such as for routers. So this gives me an excuse to write about cuckoo hashing, explain the problems we see in applying them in practice, and describe the interesting theoretical and practical questions that result. I'll do this over a series of posts.
To start, let us recall the basics of the power of two choices, or multiple-choice hashing. In the standard setting, we are hashing n elements into a hash table with n buckets. Normally, we would hash each item once to determine its bucket; this gives a maximum load, or largest number of elements in any bucket, of about log n / log log n. (All the results I state hold with high probability, so I'll avoid repeating the phrase.) Suppose instead we use two hash functions, giving two choices of locations for each element. Elements are sequentially placed in the least loaded of their two buckets. When we do a lookup, we'll now have to look in two places for the item. In return, the maximum load now falls to about log log n / log 2 -- we've got an extra log in there! (With d > 2 choices, the maximum load would be log log n / log d, just a constant factor better than d = 2.) Results of this type are due to Azar, Broder, Karlin, and Upfal, in their paper
Balanced Allocations. (There was some earlier work with a
similar result by Karp, Luby, and Meyer auf der Heide.)
In the course of these results, the question arose as to whether one could do better if placing the elements offline. That is, if you had all n elements and their hashes in advance, could you place each element in one of their bucket choices and achieve a better load than the O(log log n) obtained by the sequential placement? The answer turns out to be yes. The best placement only has a constant maximum load. Another way to think of this is to say that if we place the elements sequentially, but retain the power to move them later, as long as each element ends up at one of its bucket choices, we could get down to constant maximum load.
The idea of cuckoo hashing is that such movements can actually be done online quite efficiently. Cuckoo hashing is, essentially, multiple-choice hashing plus moves, and it appears very effective. Let us consider an extreme case, analyzed in the original
Cuckoo Hashing paper by Pagh and Rodler. (If my description is unsatisfactory, there is a
good explanation already on Wikipedia!) Each bucket can contain just 1 item, and we will have n buckets, split into two subtables of size n/2. Each element has one choice of bucket in each subtable, where the choices are given by independent hash functions. Each hashed element resides at one of its two choices. A lookup therefore always requires
at most two accesses into the hash table.
What to do, however, on an insertion -- what if both spaces for an item are already full? When we insert a new item, we put it in the first subtable. If another item is already there, we kick it out, and make it move to the second subtable. If another item is already there, we kick it out, and make it move back to the first table, and so on. This is where the name cuckoo hashing comes from; as the Wikipedia article says,
The name derives from the behavior of some species of cuckoo, where the mother bird pushes eggs out of another bird's nest to lay her own.
Pagh and Rodler prove that if the number of elements inserted is at most (1/2 - epsilon)n for some constant epsilon, then this process finishes in constant expected time! In other words, one can achieve memory utilization of up to 50 percent of capacity, with guaranteed constant lookup time and average constant insertion time.
Further variants significantly improve the memory utilization that can be achieved. Specifically, allowing more than two choices of location helps (more choices of where to go), as does having more than one item per bucket (more choices of what to move). Memory utilization of over 90 percent is easily possible in practice.
This paper and
this one have more information.
That summarizes the theoretical background, and hopefully convinces you that the idea of cuckoo hashing is a good one. In the next post, we'll look at why you might want to implement cuckoo hashing in hardware, and what obstacles lie therein.