Wednesday, March 04, 2009

Journal Version : Two Choices + Bloom Filters

A paper that's been sitting around in the pipeline far too long is finally out in its journal version.

Using the Power of Two Choices to Improve Bloom Filters, by Lumetta/Mitzenmacher.

This paper pretty much came about from a dare. I was visiting my friend Steve Lumetta while at Allerton, and as he was ribbing me about all my papers being on either Two Choices or Bloom filters, he suggested I should be able to combine the two in one paper. It took a few minutes for me to work out that he wasn't entirely teasing me, he actually had an idea.

The basic idea is the following: instead of having one group of k hash functions we use to insert an item into a Bloom filter, we have two groups of hash functions. We insert items sequentially into the Bloom filter, but we use our power of two choices as follows: when we insert an item, we use the group of k hash functions that sets the smaller number of bits to 1 (since fewer 1s means fewer false positives). When checking if an item in the Bloom filter, we have to see if all corresponding bits are set to 1 for either group of k hash functions, since we don't know which group we used initially for each item.

It turns out you can analyze this scheme, and somewhat surprisingly to me, there are settings where it leads to improved false positive rates. That is, the gain you get from putting in fewer ones actually more than makes up for the fact that now you have to check two groups of k hash locations. We also explore even better schemes (based on better choosing which group to use for each item) that we can't analyze.

I admit I took this as a funny mathematical curiosity more than a practical scheme. And I couldn't resist the title. (If you know my work, it really does sound like a joke.) But someone saw the paper and ran with the idea: the follow-on SIGMETRICS paper Building high accuracy bloom filters using partitioned hashing gives a potentially practical variation. (It's not clear to me when you'd use this variation over a perfect-hash-function-based Bloom filter; I suppose their variation is more amenable to adding a small number of additional items on the fly.)

My hope is perhaps it will inspire other non-obvious valuable uses of the power of two choices.

3 comments:

Miguel J said...

I saw the old version of your paper now turned into journal. I am impressed by the improvements in false positive (2 times in some cases).
I wrote this paper: http://www.csee.usf.edu/~mjimeno/pubs/ipccc07.pdf, which is similar to yours, but we pick from N groups of hash functions the best one and use it for mapping. For testing, we know which one is the best already. Actually, compared to the SIGMETRICS paper you point to, we don't need to go through all the groups of functions each time we test for a string.

I wasn't able to quite understand your technique, specially the rounds in the offline setting. Do you have an implementation available to be shared that I can use to compare? Thanks,
Miguel

Christian Esteve Rothenberg said...

This work did really inspire our recent intern-networking proposal (http://www.psirp.org/files/Deliverables/PSIRP-TR09-0001_LIPSIN.pdf), where we used the notion of power of choices to construct d candidate "forwarding Bloom filters". However, there are important differences on our application. We work with very compact, packet header size Bloom filters (e.g., 256 bits). Given 20-30 link identities to be included, we compute d candidate Bloom filters using each time a different set of k hash functions. Since the Bloom filter is carried in the packet header and queries are done in a distributed manner by forwarding nodes, we can include the identifier of the set of d hash functions used at Bloom filter construction time. Blind testing for the multiple candidates in parallel did perform worse in terms of false positives after choosing the Bloom filter with the best fill factor. This can be explained by our low m/n ratio.

However, by including the identifier d we have the best of both worlds. Additionally, we can limit the effects of false positives for our specific forwarding purposes (e.g., include policy-based routing or link avoidance). Currently, we are working on an extended set of design choices for practical "in-packet Bloom filters" and relevant (emergent) network applications.

Best,
-Christian

Michael Mitzenmacher said...

Miguel and Chesteve --

I'm glad to see the idea -- although not well publicized (it was rejected from a theory conference, naturally, because it was neither "deep" enough nor "immediately practical" enough -- perhaps reasonable complaints, but it seemed to biased me at the time that the idea could inspire other work), has led to some more practical follow-on work. I admit I'm also (in my biased way) happy to see that the Kirsch/Mitzenmacher result (based on the work of Dillinger and Manolios) that you really only need 2 hash functions has come into play as well!

If you ever want to talk further (or need a Bloom filter "consultant") let me know -- I love finding practical problems that can use these techniques. (In particular, Miguel, perhaps you could clarify your question and send it to me via e-mail.)