## Tuesday, March 02, 2010

### Teaching Bloom Filters

As I finished my lecture today for my undergraduate algorithms and data structures course, after spending about half an hour explaining Bloom filters and what they did, I couldn't help but wonder, yet again, why this incredibly useful, simple data structure, which also didactically demonstrates some very nice concepts like one-sided error and the ability to trade off speed/memory with correctness, isn't in the standard undergraduate textbooks?  (Of course, you can always buy this book....)

If you're teaching algorithms and data structures, do you students a favor, and sneak Bloom filters in one lecture.

Anonymous Rex said...

This might be answered if I read your book, but why is Alice on the cover? (Is it just that Alice and Bob show up a lot?)

Michael Mitzenmacher said...

It's the cards -- cards go well with randomness.

And we (Eli and I) both seem to like Alice in Wonderland.

Pall Melsted said...

I can testify to the effectiveness/awesomeness of Bloom filters. In my new job in genetics I was analyzing short reads of DNA sequence and trying to keep all k-mers (dna seq of length k) in memory and how often I'd seen them. The idea is simple, you go once over your input keeping track of each distinct k-mer in a hash-table incrementing counts when you see them. Since there is so much overlap between the sequences, we expect to see each k-mer about 10 times.

The problem is you run out of memory, even with 10G. The reason is that the new fancy high-throughput machines have higher error rates, so every time there is an error you get a unique never-seen-again k-mer.

Since we have no intention of keeping k-mers seen only once we filter them using a Bloom filter, and store only information for items already found in the Bloom filter. After this step you have a list of all k-mers seen at least twice, plus the false positives from the Bloom filter (which can be found with another pass through the data). Using a Bloom filter is the key to fitting the whole thing in memory.

I'd say, bin balanced trees entirely in favor of teaching Bloom filters.

Anonymous said...

Eden H said...

One problem with presenting Bloom Filters is that the simple space analysis relies on uniform random hash functions. Yes, you can relax this and get strong results with succinctly represented hash functions using (e.g.) pairwise independence, but the analysis is more involving. As a result, it's easy to get the (wrong) impression that the remarkable space bounds are just an accounting trick: you're not including the space required to store the hash functions! This is what I thought for a long time, anyway.

I think it's worth emphasizing to students that space savings stem from the set indicator function being represented in a lossy way.

Mihai said...

Michael, when you say Bloom filter, do you actually mean approximate dictionary? If this is so, it's a good data structure to have in some cases (eg, what a previous commenter is saying about DNA sequences). I am actually very fond of a related data structure, dictionaries with retrieval but not membership queries.

But the original Bloom data structure is truly horrible. I can't think of any setting of parameters where linear probing that stores a small hash of the element is not better. So I think teaching Bloom filters, in the most proper sense, is actually detrimental. I hope you include this warning in the course.

Arvind Subramanian said...

Dear Professor Michael Mitzenmacher ,

I am Arvind, a CS grad from Georgia Tech.
I agree-Yes, it would have been great if there was exposure to Bloom filters at the undergrad level.

More importantly, I feel that this should be introduced in context- packet classification, file systems- and how a Bloom filter might give an optimal solution.

At present I think there is a wide gap between variants of Bloom filters published in Conference paper/journals, and off-the-shelf libraries of Bloom filter in mainstream languages.

I am extremely interested in use-cases of Bloom filter variants, and I had written an article on the same :

http://appolo85.wordpress.com/2010/08/03/bloom-filter/

It would be really great, if you could comment about your opinion of the same.

Do you feel the CS research community is going for an over-kill when yet another paper on yet another Bloom filter variant is published?

Best Regards,
Arvind

Anonymous said...

Hi Mike,

I decided to follow you and teach Bloom filters in my upcoming course. But reading your book and your survey on the topic, the first thing that annoyed me was the assumption that the hash function maps elements in a completely independent way. This seems to be critical to the analysis, while at the same time it is not clear why such a function can be stored using less bits than the table itself.

Is there a way to remedy this assumption? That is, can the analysis be carried out (perhaps with some non-trivial complication) while making a much milder assumption on the hash functions?

Michael Mitzenmacher said...

Anon --

The answer is that this is something only a theorist would care about....

But I'd point you to the paper Mitzenmacher/Vadhan, Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream as a (partial) explanation of why this assumption is often reasonable in practice.

Anonymous said...

I do find it odd how the web is full of incorrect derivations of the bound for false positives for Bloom filters (they all assume independence for each bit being set). Mitzenmacher's book is the only place I have seen that mentions there is a problem.