Thursday, December 03, 2009

Tight Thresholds for Cuckoo Hashing via XORSAT

As I promised sometime back, we now have a paper (up on the arxiv) giving the thresholds for cuckoo hashing, a problem that had been open and was part of my survey (pdf) on open problems in cuckoo hashing.

One interesting thing about our paper is that our (or at least "my") take is that, actually, the thresholds were right there, but people hadn't put all the pieces together. The argument is pretty easy to sketch without any actual equations.

In cuckoo hashing, we have n keys, m > n buckets, and each key is hashed to k possible buckets. The goal is to put each key into one of its k buckets, with each bucket holding at most one key. We can represent this as a random hypergraph, with each key being a hyperedge consisting of k buckets, represented by vertices; our goal is to "orient" each edge to one of its vertices. A natural first step is to repeatedly "peel off" any vertex that doesn't already have an edge oriented to it and has exactly one adjacent unoriented edge, and orient that edge toward it. At the end of this process, we're left with what is called the 2-core of the random hypergraph; every vertex has 2 adjacent edges. Can we orient the remaining edges of the 2-core? In particular, we're interested in the threshold behavior; as m,n grow large, what initial ratio m/n is required to have a suitable mapping of keys to buckets with high probability. (This corresponds to the memory overhead of cuckoo hashing.)

Now consider the random k-XORSAT problem, where we have m variables and n clauses, where each clause is randomly taken from the possible clauses
x_{a_1} + x_{a_2} + ... + x_{a_k} = b_a.
Here the x_{a_i} are distinct variables from the m variables, b_a is a random bit, and addition is modulo 2. The question is whether a random k-XORSAT problem has a solution. Again, we can put this problem in the language of random hypergraphs, with each clause being a hyperedge of k variables, represented by vertices. Again, we can start by oriented edges to vertices as we did with the cuckoo hashing representation; here, a clause has an associated orientation to a variable if that variable is "free" to take on the value to make sure that clause is satisfied. Is the remaining formula represented by the 2-core satisfiable?

The correspondence between the two problems is almost complete. To finish it off, we notice (well, Martin Dietzfelbinger and Rasmus Pagh noticed, in their ICALP paper last year) that if we have a solution to the k-XORSAT problem, there must be a permutation mapping keys to buckets in the corresponding cuckoo hashing problem. This is because if the k-XORSAT problem has a solution, the corresponding matrix for the graph has full rank, which means there must be a submatrix with non-zero determinant, and hence somewhere in the expansion of the determinant there's a non-zero term, which corresponds to an appropriate mapping.

And the threshold behavior of random k-XORSAT problems is known (using the second moment method -- see, for example, this paper by Dubois and Mandler.)

While in some sense it was disappointing that the result was actually right there all along, I was happy to nail down the threshold behavior. In order to add something more to the discussion, our paper does a few additional things.

First, we consider irregular cuckoo hashing (in the spirit of irregular low-density parity-check codes). Suppose that you allow that, on average, your keys have 3.5 buckets. How should you distribute buckets to keys, and what's the threshold behavior? We answer these questions.

Second, what if you have buckets that can hold more than one key? We have a conjecture about the appropriate threshold, and provide evidence for it, using a new fast algorithm for assigning keys to buckets (faster than a matching algorithm).

I should point out that since I originally "announced" we had this result two other papers have appeared that also have nailed down the cuckoo hashing threshold (Frieze/Melsted ; and Fountoulakis/Panagiotou). Each seems to have different takes on the problem.

4 comments:

Anonymous said...

Your last paragraph leads to an interesting question how it'll be resolved who first got the bound?

Anonymous said...

I guess Michael's point is that the bound was already known - you just had to look at existing results in the right way.

Anonymous said...

The paper by Dubois and Mandler seems to handle only the case of 3-XORSAT. They claim that similar methods could be used to handle general k, but they neither give explicitly the value of the threshold nor any hint what is different in the proof. Do you know if there is any more complete reference?

Michael Mitzenmacher said...

Anon #3 -- check our references; Montanari's book devotes an entire chapter to k-XORSAT and explains how to get the thresholds. The 2nd moment calculation seems to be just that -- an ugly moment calculation. There's a more detailed analysis in a related coding theory paper we cite (also with Montanari as a co-author).