Wednesday, June 13, 2007

Cuckoo Hashing, Theory and Practice : Part 1

One recent line of research I've really been enjoying is the work on cuckoo hashing variants. Currently, my graduate student Adam Kirsch and I are working on a submission tackling the question of how to design cuckoo hashing variants that are more practical for hardware implementation, such as in routers. So this gives me an excuse to write about cuckoo hashing, explain the problems we see in applying it in practice, and describe the interesting theoretical and practical questions that result. I'll do this over a series of posts.

To start, let us recall the basics of the power of two choices, or multiple-choice hashing. In the standard setting, we are hashing n elements into a hash table with n buckets. Normally, we would hash each item once to determine its bucket; this gives a maximum load, or largest number of elements in any bucket, of about log n / log log n. (All the results I state hold with high probability, so I'll avoid repeating the phrase.) Suppose instead we use two hash functions, giving two choices of locations for each element. Elements are sequentially placed in the least loaded of their two buckets. When we do a lookup, we'll now have to look in two places for the item. In return, the maximum load now falls to about log log n / log 2 -- we've got an extra log in there! (With d > 2 choices, the maximum load would be log log n / log d, just a constant factor better than d = 2.) Results of this type are due to Azar, Broder, Karlin, and Upfal, in their paper Balanced Allocations. (There was some earlier work with a similar result by Karp, Luby, and Meyer auf der Heide.)
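To make the two-choice rule concrete, here is a minimal Python sketch (my own illustration, not code from any of the papers mentioned; the salted calls to Python's built-in hash are just a stand-in for two independent hash functions):

```python
import random

class TwoChoiceTable:
    """Each element hashes to two candidate buckets and is placed in the
    less loaded one; a lookup must check both candidates."""

    def __init__(self, n, seed=0):
        self.n = n
        self.buckets = [[] for _ in range(n)]
        rng = random.Random(seed)
        # Two salts simulate two independent hash functions.
        self.salts = (rng.getrandbits(64), rng.getrandbits(64))

    def _choices(self, x):
        return [hash((salt, x)) % self.n for salt in self.salts]

    def insert(self, x):
        b1, b2 = self._choices(x)
        # Place the element in the less loaded of its two buckets.
        target = b1 if len(self.buckets[b1]) <= len(self.buckets[b2]) else b2
        self.buckets[target].append(x)

    def lookup(self, x):
        return any(x in self.buckets[b] for b in self._choices(x))

if __name__ == "__main__":
    n = 100000
    table = TwoChoiceTable(n)
    for i in range(n):
        table.insert(i)
    # With two choices the maximum load is about log log n / log 2.
    print("max load:", max(len(b) for b in table.buckets))
```

Running something like this, the printed maximum load stays tiny even for large n, whereas with a single hash function you would see the roughly log n / log log n maximum load described above.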

In the course of these results, the question arose as to whether one could do better if placing the elements offline. That is, if you had all n elements and their hashes in advance, could you place each element in one of its bucket choices and achieve a better load than the O(log log n) obtained by sequential placement? The answer turns out to be yes: the best placement has only a constant maximum load. Another way to think of this is that if we place the elements sequentially but retain the power to move them later, then as long as each element ends up at one of its bucket choices, we can get down to constant maximum load.

The idea of cuckoo hashing is that such movements can actually be done online quite efficiently. Cuckoo hashing is, essentially, multiple-choice hashing plus moves, and it appears very effective. Let us consider an extreme case, analyzed in the original Cuckoo Hashing paper by Pagh and Rodler. (If my description is unsatisfactory, there is a good explanation already on Wikipedia!) Each bucket can contain just 1 item, and we will have n buckets, split into two subtables of size n/2. Each element has one choice of bucket in each subtable, where the choices are given by independent hash functions. Each hashed element resides at one of its two choices. A lookup therefore always requires at most two accesses into the hash table.
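As a quick sketch of this layout (mine, not code from the paper), the two subtables and the two-probe lookup might look like the following; again the salted hash calls are only stand-ins for independent hash functions:

```python
class CuckooLayout:
    """n slots split into two subtables of size n/2; each element has one
    candidate slot in each subtable, so a lookup makes at most two probes."""

    def __init__(self, n):
        self.half = n // 2
        self.t1 = [None] * self.half  # first subtable
        self.t2 = [None] * self.half  # second subtable

    def _h1(self, x):
        return hash(("h1", x)) % self.half

    def _h2(self, x):
        return hash(("h2", x)) % self.half

    def lookup(self, x):
        # At most two memory accesses: one slot in each subtable.
        return self.t1[self._h1(x)] == x or self.t2[self._h2(x)] == x
```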

What to do, however, on an insertion -- what if both slots for an item are already full? When we insert a new item, we put it in its bucket in the first subtable. If another item is already there, we kick that item out and move it to its bucket in the second subtable. If another item is already there in turn, we kick that one out and move it back to its bucket in the first subtable, and so on. This is where the name cuckoo hashing comes from; as the Wikipedia article says,
The name derives from the behavior of some species of cuckoo, where the mother bird pushes eggs out of another bird's nest to lay her own.
Pagh and Rodler prove that if the number of elements inserted is at most (1/2 - epsilon)n for some constant epsilon, then this process finishes in constant expected time! In other words, one can achieve memory utilization of up to 50 percent of capacity, with guaranteed constant lookup time and average constant insertion time.
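In code, the insertion procedure might look like the sketch below (again just an illustration, not the paper's implementation). The cap on the number of displacements is a practical safeguard; a real implementation would pick new hash functions and rehash if the cap were ever hit.

```python
class CuckooHash:
    """Two subtables, one item per slot, insertion by eviction ("kicking")."""

    MAX_KICKS = 500  # illustrative bound on displacements before giving up

    def __init__(self, n, seed=0):
        self.half = n // 2
        self.tables = [[None] * self.half, [None] * self.half]
        self.seed = seed

    def _slot(self, which, x):
        # Stand-in for two independent hash functions, one per subtable.
        return hash((self.seed, which, x)) % self.half

    def lookup(self, x):
        return any(self.tables[i][self._slot(i, x)] == x for i in (0, 1))

    def insert(self, x):
        if self.lookup(x):
            return True
        cur, side = x, 0
        for _ in range(self.MAX_KICKS):
            pos = self._slot(side, cur)
            # Swap the current item into the slot, evicting any occupant.
            cur, self.tables[side][pos] = self.tables[side][pos], cur
            if cur is None:      # the slot was empty, so we are done
                return True
            side = 1 - side      # the evicted item tries its other subtable
        return False  # too many displacements; in practice, rehash
```

With the number of elements kept below (1/2 - epsilon)n, the eviction loop above terminates after a constant expected number of swaps, which is the Pagh-Rodler guarantee just stated.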

Further variants significantly improve the memory utilization that can be achieved. Specifically, allowing more than two choices of location helps (more choices of where to go), as does having more than one item per bucket (more choices of what to move). Memory utilization of over 90 percent is easily possible in practice. This paper and this one have more information.
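To make those two knobs concrete, here is a rough sketch of a bucketed variant with d hash choices and b slots per bucket, using a simple random-walk eviction when all candidate buckets are full. The parameters and the random-walk policy are my own illustrative choices, not those of any particular paper.

```python
import random

class BucketedCuckoo:
    """Cuckoo-style table with d choices per element and b slots per bucket."""

    def __init__(self, num_buckets, d=2, b=4, max_kicks=500, seed=0):
        self.num_buckets, self.d, self.b = num_buckets, d, b
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _choices(self, x):
        return [hash((i, x)) % self.num_buckets for i in range(self.d)]

    def lookup(self, x):
        return any(x in self.buckets[c] for c in self._choices(x))

    def insert(self, x):
        cur = x
        for _ in range(self.max_kicks):
            choices = self._choices(cur)
            # Use any candidate bucket that still has a free slot.
            for c in choices:
                if len(self.buckets[c]) < self.b:
                    self.buckets[c].append(cur)
                    return True
            # All candidates full: evict a random victim and reinsert it.
            c = self.rng.choice(choices)
            victim = self.rng.randrange(self.b)
            cur, self.buckets[c][victim] = self.buckets[c][victim], cur
        return False  # would trigger a rehash (or a small stash) in practice
```

With, say, d = 2 and b = 4, simulations along these lines typically reach the 90-plus percent utilization mentioned above.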

That summarizes the theoretical background, and hopefully convinces you that the idea of cuckoo hashing is a good one. In the next post, we'll look at why you might want to implement cuckoo hashing in hardware, and what obstacles lie therein.

10 comments:

Anonymous said...

Hi Professor Mitzenmacher,

This blog really helps a lot to find the right things to read!

Zhenming

Josiah Carlson said...

In practical work, I usually choose open-address double hashing, if only because I already have an implementation ;).

In any case, cuckoo hashing is very interesting. My personal concern is that unless one chooses hashes carefully, one could get into a situation where there is a relocation cycle, though this is perhaps only of concern with two tables, and perhaps very unlikely in practice.

A simple variant of this, which considers entries within k steps of a single computed hash per key without relocation, seems to perform reasonably well in tests (I get a tested maximum load of 3 with k=2, and 2 with k=6, both at 90% full).

Michael Mitzenmacher said...

Josiah --

The relocation cycle is a problem I glossed over (it will show up in part 2), but you're right, it is a problem!

The within k steps approach is also useful for many types of hash tables, though maybe not for hardware. (Again, part 2 is coming...)

Anonymous said...

hi mike, i am sure you will cover the following "intuition" at some point, but couldn't resist this comment .. is it reasonable to conjecture that if there are at most (1 - epsilon)*n elements to hash, then you expect to spend only O(1/epsilon) time to hash an element -- basically because only (1 - epsilon)*n locations are occupied, and the random process will find an empty slot in expected O(1/epsilon) time? is such a result known to be true?

aravind

Michael Mitzenmacher said...

Aravind,

I'm not sure exactly what your model is for "finding an empty slot" and "random process" in your comment; could you elaborate?

I haven't got a good handle on how well intuition maps to reality for cuckoo schemes, as there's a lot of unpleasant dependencies that generally need to be worked around. For example, once you "find" a long sequence of moves to insert an item, it tells you something significant -- if you ever hash into one of the locations on the chain, you'll likely have to do a lot of moves. I think that sort of intuition may be useful for thinking of the possible limitations of these schemes, but I'm not sure it's the right direction for analysis.

Anonymous said...

Mike, yes, I understand that dependency issues are likely to crop up. I was suggesting the heuristic idea that since epsilon * n slots are always empty and since independent probes will find an empty slot in 1/epsilon expected time, whether we can believe (heuristically) that a scheme such as Cuckoo will do well even with (1 - epsilon) * n occupancy. And yes, I understand the slippery slope of such "heuristics". My question was, do we think that this (1 - epsilon)*n conjecture is true but we don't yet know how to prove it, or is it false to start with?

thanks, aravind

Michael Mitzenmacher said...

Aravind,

It seems like a great conjecture. The Fotakis/Pagh/Sanders/Spirakis paper seems to have the best known results, and (if I'm reading correctly) they have that they can find an augmenting shortest path by doing a Breadth First Search that hits (1/epsilon)^O(log d) vertices for d choices. Their simulation results, though, seem to suggest that with enough choices, you could get close to full utilization with something like O(1/epsilon) probes.

Anonymous said...

It sounds like if this is done in hardware (I'm assuming a use case is routing tables), it might be possible to construct a worst-case sequence of inserts that would cause a lot of moves. I suppose randomizing the hash function could solve this, though.

Also, what happens when the hash table is full? It seems like when utilization exceeds a certain percentage, the time per insert would degenerate badly.

Anonymous said...

Sounds great, but I suspect this method is hard on the cache lines.
Regards, Frank Hirsch

Anonymous said...

With IP addresses on routers, the need is to find the longest prefix match, not an exact one. Have you considered how to use hashing to solve this?
Hashing a full (say, IPv4) 32-bit IP address may give us a miss, but that doesn't mean we cannot route the packet. We need to look for a prefix shorter than 32 bits that matches the first k bits of the address.
I can imagine this could be done by hashing, but it would need a hash table for each possible prefix length... and then selecting the longest match.