My Biased Coin: Hash Table with Preemptive Bloom Filter

Friday, February 27, 2009

Hash Table with Preemptive Bloom Filter

There's a useful hash table/Bloom filter trick I've been seeing across a number of papers. (If you don't know what a Bloom filter is, then I haven't been doing my job...) For the record, I'm not claiming this trick is mine; maybe it originated in the paper Longest Prefix Matching Using Bloom Filters, where it plays a key role. But I also can't recall a paper that clearly highlights the idea, so I'm doing it here.

Suppose you'll be doing lookups of (large) keys in a (slow) hash table, and many of your lookups will be for items that actually are not in the table. For example, you may be storing URLs on blacklist or dangerous packet signatures or prefixes of addresses or whatever else. Then it makes sense to keep a Bloom filter for the items in your hash table in (fast) memory. Before you do a lookup into the table, you check the Bloom filter. A false positive in the Bloom filter means you do a lookup when you didn't have to, but this allows you to avoid a lookup for a key that's not in your hash table the large majority of the time. The approach works for any type of hash table.

This should be considered a standard approach in hash table design, so I suggest we come up with a catchy name for it, so it doesn't have to be continually repeated in future papers. I'll suggest it be called a "hash table with preemptive Bloom filter", but feel free to suggest a cooler sounding name for it in the comments.

This example highlights the need for coming up with names for nice ideas that generalize to many situations. I've seen this idea described repeatedly in multiple papers precisely because nobody has thought to give it a good name.

13 comments:

Matt Welsh said...: I propose we call it a "Mitzenfilter".; February 27, 2009 at 7:56 AM
Anonymous said...: Well, the same technique applies in front of other data structures, so more generally it could be "preemptive Bloom Filtering", or "Bloom preFiltering", or even just "using a Bloom Filter"?; February 27, 2009 at 9:25 AM
Anonymous said...: How about "Bloom-accelerated hash table"?; February 27, 2009 at 10:26 AM
Anonymous said...: Also, a fairly similar technique is used in database hash join implementations when the tables are much larger than disk. You can filter out tuples that won't join and avoid writing them back to disk partitions. Implemented in multiple commercial databases.; February 27, 2009 at 12:18 PM
Anonymous said...: A related technique is
where you project the data
in a low dimensional space,
to prune quickly the candidates.

An interesting question then... is why stop there? Why not have several filters, progressively more expensive and tighter?

I wrote a blog post about this:

http://www.daniel-lemire.com/blog/archives/2008/11/20/how-to-speed-up-retrieval-without-any-index/

See also my paper...

Daniel Lemire, Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound, to appear in Pattern Recognition.
http://arxiv.org/abs/0811.3301; February 27, 2009 at 12:25 PM
Anonymous said...: Why not just call it a Blooming hash table?; February 27, 2009 at 12:57 PM
Michael Mitzenmacher said...: Anon 2: I agree in spirit that really this is just "using a Bloom filter", but it's helpful to give it a name so people will stop feeling the need to describe it. Although perhaps just a standard sentence, "We use a Bloom filter on the keys of the hash table to preempt most unsuccessful searches" should suffice.

Anon 6: Blooming is unsatisfactory. Bloom is the man's name. We don't call it a Karping reduction, even though that sounds kind of funny too.; February 27, 2009 at 3:12 PM
Anonymous said...: A blushing filter?; February 27, 2009 at 3:23 PM
Greg Morrisett said...: Blash table!; February 28, 2009 at 8:32 AM
Anonymous said...: bloomin' onion; February 28, 2009 at 12:22 PM
Anonymous said...: ...and hashbrowns.; February 28, 2009 at 12:24 PM
Anonymous said...: Naming things like this puts up barriers to entry, thus hurting outsiders. It does make things easier for the crowd that knows the language, but the cost is too high. When we teach, we should just say "This is a Bloom Filter; it is named after a person and has nothing to do with flowers; here are some uses."; March 3, 2009 at 4:42 AM
David Andersen said...: access-filtered hash table

pre-screened hash table

mostly-positive hash table
(humor: generally-positive hash table)

...

btw: I've been spending too much time thinking about bloom filters lately. I'd love to chat @ NSDI or at the sigcomm TPC meeting, if you're going to be at either.; March 24, 2009 at 1:15 PM

My Biased Coin

Friday, February 27, 2009

Hash Table with Preemptive Bloom Filter

13 comments:

My Book on Randomized Algorithms

Blog Archive

Science/Tech Blogs I Read

Contributors

Legal Statement

Analytics