There has already been some pushback on some of the claims from the draft paper (see here for a response to the paper's discussion of B-trees, and here for a discussion regarding that cuckoo hash tables seem to do just as well as "learned" hash tables). I also read the paper and had some back and forth with the authors about some thoughts and concerns I had; they have been both responsive and thoughtful, and I think the conversation was fruitful on both sides. (At least, they helped me understand their work, and I hope my suggestions for them were useful.) In what follows, I'll explain some of specific thoughts and concerns regarding their proposed learned Bloom filters. I'll try to keep the blog post self-contained, but you may want to read their paper. Also, I've tried to write my ideas down more formally in a short write-up. Here's a link to a draft; this link may age out or disappear if/when I put this on arxiv.
What follows is my description (so any issues are mine, not the authors of the Learned Index paper).
Recall that a Bloom filter approximately represents a set S in small space. One can ask queries of the form: is y an element of S. If y is an element of S, the answer will always be yes (no false negatives), if not there is some chance of false positive. That is, it trades off some amount of error to save space. A key feature is that the probability that any given input y yields a false positive does not depend on y. (Well, it does after the Bloom filter has been instantiated -- some elements give false positives, some don't. But a priori, before you build the Bloom filter using hash functions, if you don't know the hash functions, you don't know what the false positives will be.)
For the learned Bloom filter, suppose we have a very good (black-box) predictor P, which takes an input x and returns P(x), which should be interpreted as an estimate of the probability that x is an element of S. That is, we have a function that can give a very good "guess" if an input x is an element of S. Then we can use P as a "pre-filter" to improve our Bloom filter as follows. When asked if x is an element of S, compute P(x), and simply return that x is in S if P(x) is larger than some suitable threshold z. This may introduce false positives, but it may also introduce false negatives; there may be elements x of S for which P(x) < z. To deal with those, we use a backup filter, which is just a standard Bloom filter that holds the false negatives from the pre-filter using the predictor P. So now we accept that x is an element of S if P(x) is above the threshold, or x obtains a positive answer from the backup filter. The backup filter also introduces false positives.
The authors of the learned index paper suggest that machine learning allows for good predictors P that require small space, thereby improving the Bloom filter in terms of space while achieving the same false positive probability. Note that the predictor does not necessarily have to be that great; even if half the elements of S are false negatives, the backup filter would then be a little bigger than half the size of a standard Bloom filter, so if the predictor P is suitably small, it is plausible that one could get the same performance with smaller space.
I don't know if this idea is new. I don't think I've seen a probabilistic pre-filter for a Bloom filter before; usually a Bloom filter is a pre-filter to some other part of the system. It seems like an interesting idea; certainly, the potential to layer filters is useful, and putting a filter before a Bloom filter is sensible.
A major issue, though, is that the nature of the guarantees one obtains are quite different for learned Bloom filters and standard Bloom filters, which is a subtlety that I worry may not be clear to all readers of the original paper. In particular, it seems in part that confusion might arise because the term "false positive rate" appears to have slightly different meanings in the Bloom filter community and the machine learning community; in the Bloom filter setting, false positive rate generally refers to that false positive probability, which is independent of the query. In the machine learning setting, the false positive rate is an empirical quantity, which depends on the input data and the queries.
An example may be helpful; here's one I used from the writeup. (This is a "thought-experiment" example, not based on actual experiments.) Suppose we are working with numbers in the range [0,1000000), and our learned Bloom filter is supposed to hold a set S that contains a 1000 elements. Half of those elements are random numbers in the range [1000,2000], and half those elements are random over the rest of the range. A good machine learning function might learn a predictor P that says that if x is in the range [1000,2000], then P(x) is "large" -- say around 1/2 -- and otherwise P(x) is small. It could use a threshold z of around 0.4, in which case about half the elements of S are false negatives, and must be stored in the backup filter. This seems like a reasonable outcome for learning.
What is the false positive rate now? Well, it depends on the queries. If all the queries are uniform over the range [1000,2000], the false positive rate would be quite high -- every query to a non-set element would be a false positive. If the queries are uniform over the range [0,1000000), the false positive rate is quite low, since one will rarely get unlucky and ask a query in the range [1000,2000], and otherwise the only false positives will come from the backup Bloom filter. The point is that the false positive probability depends on the distribution of the queries (as well as the predictor). But there may be many applications where the queries are naturally more often for elements that are "similar to" set elements -- that is, they have high P values, like the values in [1000,2000] that are not elements of S -- which may lead to high false positive rates. And if you don't know the query stream in advance, you don't even know what the false positive rate will be.
The response to this seems to be that if you have data about the queries, you can figure out the false positive rate empirically. That is, if you have a test set of queries that you assume is a sample from a distribution over all possible queries, you can determine the false positive rate of that test set, and use it to predict the false positive rate of future queries. My writeup provides basic definitions to formalize this idea, which seems correct. However, this still assumes that "future queries" will come from the same distribution as the test set. Returning to our example, if in our test set queries are uniform over [0,1000000), but then in the future queries change so that they are uniform over a smaller interval [0,100000), that significantly changes the false positive rate of the filter.
My takeaways were the following:
- The idea of adding learning into data structures is certainly an interesting idea.
- But it's still important to try to set up a theoretical framework, to understand what kinds of guarantees can be obtained.
- At the moment, I can see applications where learned Bloom filters might be helpful -- if your data set has significant structure that can be useful to a machine learning function, if you don't have to deal with insertions and deletions frequently, if you have access to query data, and if you have reason to believe the query stream isn't changing (in a distributional sense).
- At the same time, I can see applications where I would be very concerned about the use of learned Bloom filters, such as in a security setting. Also, modern Bloom filter alternatives, such as cuckoo filters (which use less space, and allow insertions/deletions more easily) should be considered and compared when thinking about using learned Bloom filters.
- There's probably more to think about here. In particular, I wonder if there are other places where additional "filter layering" may be useful.