Tuesday, March 06, 2018

Optimizing Learned Bloom Filters with Sandwiching


For the small-ish subset of people out there who care about "learned Bloom filters" (the subject of my last post), I have a small-ish update.  I guess the data structure has been unconsciously sitting around in the back of my head, and when I paged it back into my conscious self, I noticed there was an (in hindsight) obvious improvement. 

Recall that the learned Bloom filter uses a (magic) black-box predictor obtained from learning theory that predicts whether an element is in a set or not;  to avoid false negatives, it then uses a backup Bloom filter to rescue any set elements that have been incorrectly predicted as not being in the set.  This leads to two sources of false positives, from the predictor and the backup Bloom filter.
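To make that structure concrete, here is a minimal sketch in Python of the learned Bloom filter's query path. The BloomFilter class, the score function, and the threshold tau are stand-ins I've made up for illustration, not anything from the paper; the point is just to show where the two kinds of false positives come from.

```python
import hashlib

class BloomFilter:
    """A bare-bones Bloom filter, just enough to illustrate the query path."""
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _indexes(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = True

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indexes(item))

def learned_bf_contains(item, score, tau, backup):
    """Query path of a learned Bloom filter (illustrative sketch).

    score(item) is the learned predictor's estimate that item is in the set;
    items scoring at least tau are reported as present (this is where the
    predictor's false positives come from).  Items scoring below tau fall
    through to the backup Bloom filter, which stores exactly the set elements
    the predictor missed, so there are no false negatives overall -- but the
    backup filter contributes its own false positives.
    """
    if score(item) >= tau:
        return True
    return item in backup
```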

It turns out it is better to get rid of some of those false positives up front.  That is, you get better performance if you have a regular old Bloom filter up front, followed by the learned predictor, followed by a (much smaller) backup Bloom filter to round up stray false negatives.  Because you have the predictor sandwiched between two Bloom filters, I refer to this as the "sandwiched learned Bloom filter". 
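The sandwiched version just adds one more membership test in front. A sketch, using the same kind of made-up components as above (the initial filter is an ordinary Bloom filter built over the whole set):

```python
def sandwiched_contains(item, initial, score, tau, backup):
    """Query path of the sandwiched learned Bloom filter (illustrative sketch).

    initial is a regular Bloom filter over the whole set: anything it rejects
    is definitely not in the set, so those queries never reach the predictor.
    Survivors go through the predictor and, if predicted negative, the (much
    smaller) backup filter holding the predictor's false negatives.
    """
    if item not in initial:
        return False          # the front filter has no false negatives
    if score(item) >= tau:
        return True           # predictor false positives slip through here
    return item in backup     # backup-filter false positives slip through here
```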

The math is surprisingly easy -- and at the same time, really beautiful in terms of the result.  It turns out that, if you think of distributing "Bloom filter bits" out to the up-front Bloom filter and the backup Bloom filter, the right thing to do is first put bits on the backup Bloom filter, up to a certain fixed amount, which depends on the parameters of the predictor (false positive rate, false negative rate).  After that, all the bits should go to the up-front Bloom filter.
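Here is my own rendering of that calculation as a small Python sketch, under the standard approximation that an optimally configured Bloom filter with j bits per stored element has false positive rate roughly alpha^j (alpha about 0.6185), and treating the predictor's rates F_p and F_n as unchanged by the front filter, which is the modeling assumption here. The function names and the example numbers are mine, for illustration only.

```python
import math

# Standard approximation: a well-configured Bloom filter with j bits per
# stored element has false positive rate about ALPHA**j.
ALPHA = 0.5 ** math.log(2)   # roughly 0.6185

def optimal_split(Fp, Fn, b):
    """Split a budget of b bits per set element between the two filters.

    Fp, Fn: false positive / false negative rates of the learned predictor.
    Returns (b1, b2): bits per element for the initial and backup filters.
    The optimal b2 is a fixed amount depending only on Fp and Fn; every
    remaining bit goes to the up-front filter.
    """
    b2 = Fn * math.log(Fp / ((1 - Fp) * (1 / Fn - 1)), ALPHA)
    b2 = min(max(b2, 0.0), b)      # clamp in case the budget is very small
    return b - b2, b2

def false_positive_rate(Fp, Fn, b1, b2):
    """Overall false positive rate of the sandwiched filter.

    A negative query must pass the front filter (probability ALPHA**b1), then
    either fool the predictor (probability Fp) or fool the backup filter,
    which stores the Fn fraction of elements the predictor missed using b2
    bits per original element, i.e. b2/Fn bits per stored element.
    """
    return ALPHA ** b1 * (Fp + (1 - Fp) * ALPHA ** (b2 / Fn))

# Made-up example: a predictor with Fp = 1%, Fn = 50%, and 8 bits per element.
b1, b2 = optimal_split(0.01, 0.5, 8.0)
print(b1, b2, false_positive_rate(0.01, 0.5, b1, b2))
```

Note that the optimal b2 in this sketch does not depend on the total budget b, which is exactly the "fixed amount to the backup filter, everything else up front" behavior described above.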

Like most Bloom filter optimizations, the gains are worthwhile but do not seem to be another factor of two;  it looks like this can cut the size by another 10-25% or so, depending on the settings, and it's essentially free.   

The whole writeup is less than 2 pages, so rather than write and explain it all here, if you're interested, check the arXiv draft.  Hopefully, more fun in this setting will be forthcoming....

1 comment:

Vijayshankar Raman said...

Hi, thanks for the post, it was very useful and easy to follow. I wonder whether the neural network's error rates (F_p and F_n) will be affected by the presence of the Bloom filter in front, because now a different distribution of data is reaching the neural network (I assume the network itself will be trained on this data, which has passed the up-front Bloom filter).