Friday, February 27, 2009

Hash Table with Preemptive Bloom Filter

There's a useful hash table/Bloom filter trick I've been seeing across a number of papers. (If you don't know what a Bloom filter is, then I haven't been doing my job...) For the record, I'm not claiming this trick is mine; maybe it originated in the paper Longest Prefix Matching Using Bloom Filters, where it plays a key role. But I also can't recall a paper that clearly highlights the idea, so I'm doing it here.

Suppose you'll be doing lookups of (large) keys in a (slow) hash table, and many of your lookups will be for items that actually are not in the table. For example, you may be storing URLs on blacklist or dangerous packet signatures or prefixes of addresses or whatever else. Then it makes sense to keep a Bloom filter for the items in your hash table in (fast) memory. Before you do a lookup into the table, you check the Bloom filter. A false positive in the Bloom filter means you do a lookup when you didn't have to, but this allows you to avoid a lookup for a key that's not in your hash table the large majority of the time. The approach works for any type of hash table.

This should be considered a standard approach in hash table design, so I suggest we come up with a catchy name for it, so it doesn't have to be continually repeated in future papers. I'll suggest it be called a "hash table with preemptive Bloom filter", but feel free to suggest a cooler sounding name for it in the comments.

This example highlights the need for coming up with names for nice ideas that generalize to many situations. I've seen this idea described repeatedly in multiple papers precisely because nobody has thought to give it a good name.

13 comments:

Matt Welsh said...

I propose we call it a "Mitzenfilter".

Anonymous said...

Well, the same technique applies in front of other data structures, so more generally it could be "preemptive Bloom Filtering", or "Bloom preFiltering", or even just "using a Bloom Filter"?

Anonymous said...

How about "Bloom-accelerated hash table"?

Anonymous said...

Also, a fairly similar technique is used in database hash join implementations when the tables are much larger than disk. You can filter out tuples that won't join and avoid writing them back to disk partitions. Implemented in multiple commercial databases.

Anonymous said...

A related technique is
where you project the data
in a low dimensional space,
to prune quickly the candidates.

An interesting question then... is why stop there? Why not have several filters, progressively more expensive and tighter?

I wrote a blog post about this:

http://www.daniel-lemire.com/blog/archives/2008/11/20/how-to-speed-up-retrieval-without-any-index/


See also my paper...

Daniel Lemire, Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound, to appear in Pattern Recognition.
http://arxiv.org/abs/0811.3301

Anonymous said...

Why not just call it a Blooming hash table?

Michael Mitzenmacher said...

Anon 2: I agree in spirit that really this is just "using a Bloom filter", but it's helpful to give it a name so people will stop feeling the need to describe it. Although perhaps just a standard sentence, "We use a Bloom filter on the keys of the hash table to preempt most unsuccessful searches" should suffice.

Anon 6: Blooming is unsatisfactory. Bloom is the man's name. We don't call it a Karping reduction, even though that sounds kind of funny too.

Anonymous said...

A blushing filter?

Greg Morrisett said...

Blash table!

Anonymous said...

bloomin' onion

Anonymous said...

...and hashbrowns.

Anonymous said...

Naming things like this puts up barriers to entry, thus hurting outsiders. It does make things easier for the crowd that knows the language, but the cost is too high. When we teach, we should just say "This is a Bloom Filter; it is named after a person and has nothing to do with flowers; here are some uses."

David Andersen said...

access-filtered hash table

pre-screened hash table

mostly-positive hash table
(humor: generally-positive hash table)

...

btw: I've been spending too much time thinking about bloom filters lately. I'd love to chat @ NSDI or at the sigcomm TPC meeting, if you're going to be at either.