Wednesday, October 22, 2014

Cuckoo Filters

An upcoming paper appearing at CoNext will be of interest to any Bloom Filter user or aficionado:

Cuckoo Filter:  Practically Better than Bloom
(Bin Fan, David Andersen, Michael Kaminsky, Michael Mitzenmacher)

Let me describe a cuckoo filter and some of what's in the paper for you.  If you want to avoid a technical discussion, all you need to know is that for reasonably large sets, at the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/construction), and, amazingly, also allow deletions of keys (which Bloom filters cannot do).  If you want to look at code, there's even a github repository for you with code for cuckoo filters.

Note:  I'll assume you probably know what a Bloom filter is in the discussion that follows.  As a reminder, it's a hash-based data structure for set membership that can give false positives;  by allowing false positives, you can save space in your representation of the set.  I'll also assume you probably know what cuckoo hashing is.  As a reminder, in cuckoo hashing you have buckets that can contain a fixed number of keys, and each key gets some number d of buckets (obtained by computing d hash values of the key) where it can be placed.  When you insert a new key and all of its buckets are full, you can move a key out of one of those buckets to one of that key's other buckets to make room.  Since apparently there are drinking games based on taking a drink whenever I say "Bloom filter" or "cuckoo hashing" in a talk, I imagine most readers of the blog know about these things.  (Or, as always, Wikipedia.)
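For anyone who wants the Bloom filter half of that reminder in code form, here is a minimal sketch (the class name and parameters are my own, just for illustration):

    import hashlib

    class BloomFilter:
        # m_bits bits, k_hashes hash functions (k_hashes <= 8 in this toy version).
        def __init__(self, m_bits, k_hashes):
            self.m, self.k = m_bits, k_hashes
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, key):
            # Derive k positions from one strong hash of the key.
            digest = hashlib.sha256(key.encode()).digest()
            return [int.from_bytes(digest[4*i:4*i+4], "big") % self.m
                    for i in range(self.k)]

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            # True can be a false positive; False is always correct.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(key))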

The framework used in the paper comes from the following idea.  One way of implementing Bloom filter functionality when given a static set of items is to use a perfect hash function;  that is, a hash function that maps the n elements of your set into an array of size n in a 1-1 fashion.  Then, you use another hash function to store a fingerprint of each element, instead of the element itself, in its array location.  To do a set membership test on an element z, you hash z once to find its location in the array, hash z again to get its fingerprint, and check against the fingerprint stored in that array location.  For elements not in the set, you get a false positive rate of 2^{-(# of fingerprint bits)}.
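Here is a rough sketch of that static construction (my own toy code, not from the paper;  it assumes you are handed a perfect_hash function mapping the n items 1-1 onto 0..n-1, which in practice would come from a perfect-hashing library):

    import hashlib

    FP_BITS = 12  # bits per fingerprint; false positive rate is roughly 2**(-FP_BITS)

    def fingerprint(x):
        # Hash the element and keep FP_BITS bits of it as the fingerprint.
        digest = hashlib.sha256(x.encode()).digest()
        return int.from_bytes(digest[:4], "big") & ((1 << FP_BITS) - 1)

    def build_table(items, perfect_hash):
        # perfect_hash maps the n items of the static set 1-1 onto 0..n-1.
        table = [0] * len(items)
        for x in items:
            table[perfect_hash(x)] = fingerprint(x)
        return table

    def query(table, perfect_hash, z):
        # Always True for members; for non-members, True (a false positive)
        # only with probability about 2**(-FP_BITS).
        return table[perfect_hash(z)] == fingerprint(z)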

The problem is that the perfect hashing framework doesn't allow you to insert new items naturally.  Or to do deletions (though standard Bloom filters don't allow deletions either -- you need a counting Bloom filter for that).  So a natural thing to do is to think of using some other sort of hash table, besides a perfect hash table, and see what happens.  This idea was explored way back in, for instance, some papers I worked on in 2006 (here and here) using "d-left" hash tables, which are based on a multiple-choice hashing scheme.

If you're going to use multiple-choice hashing schemes, though, you should think about using cuckoo hashing.  The ability to move keys around means you should get better space utilization;  for example, even with 2 choices, if your buckets can hold 4 items, cuckoo hashing can get you about 95% space utilization.  The problem with cuckoo hashing in this setting is that, for a Bloom filter, you want to keep just fingerprints of keys, not the keys themselves.  So, when you want to move a key, how do you figure out where to move it, when you no longer have the key to hash?

The approach used in the paper is a trick called partial-key cuckoo hashing.  To get two buckets for our key, we do a hash to get the first bucket (idealized to be uniformly at random), but to get the second bucket, we hash not the key but just the fingerprint of the key;  we XOR that hash with the index of the first bucket to get the second.  The upside of this approach is that, given a bucket and a fingerprint stored in it, we can find the other bucket corresponding to that fingerprint.  (XORing with the hash of the fingerprint works whether the fingerprint currently sits in its first bucket or its second, as is easily checked.)  The downside is that if our fingerprint has F bits, the second bucket is really chosen from only 2^F possibilities;  it is not uniform conditioned on the first.
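In code, the bucket arithmetic looks roughly like this (my own toy version, not the code from the paper's repository;  it assumes a power-of-two number of buckets so the XOR keeps indices in range):

    import hashlib

    FP_BITS = 8            # fingerprint size F
    NUM_BUCKETS = 1 << 16  # power of two, so XORing indices keeps them in range

    def _h(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def fingerprint(key):
        # Nonzero, so that 0 can mean "empty slot" in a real table.
        return (_h(b"fp" + key.encode()) & ((1 << FP_BITS) - 1)) or 1

    def bucket1(key):
        return _h(b"b1" + key.encode()) % NUM_BUCKETS

    def alt_bucket(bucket, fp):
        # Partial-key trick: the alternate bucket depends only on the current
        # bucket and the fingerprint, not on the original key.  Hashing the
        # fingerprint (rather than XORing it in directly) spreads the
        # alternates over the whole table, not just a window of 2^F buckets.
        return (bucket ^ _h(fp.to_bytes(2, "big"))) % NUM_BUCKETS

    # Given only (bucket, fingerprint) of a stored entry, we can recover its
    # other candidate bucket -- which is what makes cuckoo kicks possible
    # without the original key:
    fp = fingerprint("example")
    b1 = bucket1("example")
    b2 = alt_bucket(b1, fp)
    assert alt_bucket(b2, fp) == b1   # the XOR mapping is an involution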

Because of this limitation, one interesting aspect of the paper is that we show, in theory, this approach just will not work.  Specifically, there is an Omega(log n) lower bound on the size of the fingerprint that must be stored; this means the cuckoo filter is super-linear in space, unlike standard Bloom filters which are linear in space.  This lower bound is easy to derive.  Suppose that you have buckets that can hold b keys.  If 2b+1 keys have the same pair of buckets, then there is trouble;  you can't fit 2b+1 keys into 2b bucket slots.  So the question is whether there are collections of 2b+1 keys that have the same pair of buckets.  With standard cuckoo hashing, this would be a low probability event.  With partial-key cuckoo hashing, the probability depends on the size of the fingerprint, since this determines how many possible second choices of a bucket there are for each key (given their first random choice of a bucket).  If you have too few choices, you will get too many collisions, or too many keys in a pair of buckets.  Some straightforward calculations yield the Omega(log n) lower bound on the size of the fingerprint.
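For the curious, here is a back-of-the-envelope version of that calculation, glossing over constants and the second-moment details.  With m = Theta(n) buckets, the chance that another key lands on the same pair of buckets as a given key is about 2/(m 2^F):  roughly 2/m for its first bucket to hit one of the pair, times 2^{-F} for its fingerprint-derived offset to supply the other one.  So the expected number of sets of 2b+1 keys all sharing one pair of buckets is about (n choose 2b+1) * (2/(m 2^F))^{2b}, which for constant b and m = Theta(n) is on the order of n/2^{2bF}.  To make that o(1), you need 2^{2bF} to grow faster than n, which means F must be at least roughly (log_2 n)/(2b) = Omega((log n)/b).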

So now that we've proven theoretically that this won't work, how could it work in practice?  If you go back to those calculations, you find that the lower bound has some nice constant factors.  The bound you get on the fingerprint size is actually (roughly speaking) (log n)/(2b).  That factor of 2b in the denominator, even when b=4, covers a large range of n.  In practice, fingerprints can be quite a reasonable size and still avoid the potential bottleneck of too large a group of keys all hashing to the same small number of buckets.
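To put rough numbers on that (numbers of my own, just for illustration):  with b = 4 and n = 2^32 keys, the bound only asks for fingerprints of about 32/(2*4) = 4 bits.  Meanwhile, the false positive rate is what really drives the fingerprint size:  a lookup checks two buckets of four entries each, so the false positive rate is roughly 8/2^F, and F = 12 already gets you down to about 0.2%.  Any fingerprint large enough to be useful sits comfortably above the lower bound for tables you would actually build.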

In any case, this may be the first paper I've written that contains a proof that the described data structure fails in theory but works in practice.  (The other way around I generally find much easier.) 

Big open theoretical question:  try to analyze why partial-key cuckoo hashing gets roughly the same load threshold as regular cuckoo hashing.  (Anyone want to work on that?  Let me know...)

This post is long enough;  there's a lot more in the paper, including comparisons with other schemes, bit-saving tricks, and lots of performance results.  Have a look.  

Postscript/Tangent:  In a recent review (for a different paper, with a different theme) one reviewer made the following comment/criticism:  "Despite years of research, no one has found a variant of cuckoo hashing that can outperform simpler methods in realistic situations."  To which I say, hunh?  Yes, I understand linear probing is wonderful in practice due to cache effects and block sizes, and in many and perhaps even most situations it can yield better performance than cuckoo hash tables.  But certainly not everywhere.  Cuckoo hash tables seem at least potentially better for many hardware settings, and for at least some important applications -- such as this one -- cuckoo hashing seems to be a clear win.  (The issue here is that the number of fingerprints you have to check on a lookup affects the false positive probability -- so even if linear probing lookups are "faster" due to cache effects and block sizes, trying to reach this sort of space utilization gives you larger contiguous blocks, which increase your false positives.)

My co-author David Andersen also violently disagrees with this review comment, and points to these recent papers at NSDI and Eurosys.  

Perhaps I'm missing something, but my thought was that this was a sign of a bad review (both in content and in its overall rudeness).  I felt obliged to try to correct the record (and I can only hope the reviewer sees this post).
 

5 comments:

Anonymous said...

re: Linear probing. Bin Fan has pointed out that linear probing is not so great for high-load hash tables, like the 95% load you can get with cuckoo hashing and buckets of size 4.

Mark Hammond said...
This comment has been removed by the author.
Mark Hammond said...

Thanks for the comparative analysis with the quotient filter and the BQF and CF variants.
Given its appealing properties, I found the performance statistics for the standard operations enlightening.

Unknown said...

One clarification regarding a frequently asked question: why does it use "xor Hash(fingerprint)" rather than "xor fingerprint" to derive the second hash bucket index?

This is because when the fingerprint is relatively small (e.g., a few bits), "xor fingerprint" moves elements only within a very small area of the table, leading to a higher chance of cuckoo failures and worse space utilization, while "xor hash(fingerprint)" makes the kick operations of cuckoo hashing "global" to the table, achieving high space utilization.
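A quick numeric check of that point (toy numbers, not from the paper):  with 8-bit fingerprints, "bucket xor fingerprint" can never leave the aligned window of 2^8 = 256 buckets containing the current bucket, while "bucket xor hash(fingerprint)" can land anywhere in the table.

    fp = 0x5A                            # any 8-bit fingerprint
    bucket = 713_421
    alt = bucket ^ fp                    # "xor fingerprint": only the low 8 bits change,
    assert alt // 256 == bucket // 256   # so kicks stay inside one 256-bucket window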

Han said...

Is there any hardware (FPGA/ASIC) implementation of this work?

Regarding hardware implementation (for example, on an FPGA): which one do you think is better (or more suitable), a hash table with linear probing or cuckoo hashing?