On my
publications page there is a recently finished
new paper I'm quite excited about. Part of that excitement is because it's (finally!) my first paper with my remarkable colleague
Salil Vadhan. The rest is because I think it answers a question that's been bothering me for nearly a decade.
A fair bit of my research has been on hashing, and often I end up doing simulations or experiments to check or demonstrate my results. For many analyses of hashing processes -- including
Bloom filters and the
power of two choices -- we (or at least
I) assume the hash function is truly random -- that is, each element is independently mapped to a uniform location by the hash function. This greatly helps in the analysis. Such an assumption is downright painful to strict theorists (like Salil, who has questioned me about this for years), who rightly point out that hash functions being used are not anything close to truly random, and in fact truly random functions would be far too expensive in terms of space to even write down in practice.
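To make "truly random" concrete: in a simulation it just means each item's hash values are fresh calls to the random number generator. Here is a minimal Python sketch of the power-of-two-choices process under that assumption (the parameters are arbitrary, just for illustration):

import random

def two_choice_max_load(n_items, n_bins, seed=None):
    """Balls-and-bins with the power of two choices, under the truly random
    hash assumption: each item's two candidate bins are fresh, independent,
    uniform random draws."""
    rng = random.Random(seed)
    loads = [0] * n_bins
    for _ in range(n_items):
        i = rng.randrange(n_bins)
        j = rng.randrange(n_bins)
        # Place the item in the less loaded of its two candidate bins.
        loads[i if loads[i] <= loads[j] else j] += 1
    return max(loads)

# With truly random choices the maximum load grows like log log n,
# so it stays tiny even for large n.
print(two_choice_max_load(100_000, 100_000))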
Of course theorists have developed ways to deal with this -- with the seminal work being Carter and Wegman's introduction of
universal hashing. The key idea is that you choose a hash function at random from a small family of hash functions that are easy to compute and require little space to express, and this choice guarantees some randomness property, even if it is not as strong as what you would get from a truly random hash function. For example, the basic universal hash function families guarantee that for any two distinct elements x and y in the universe, if you choose a hash function uniformly from the family, then the probability that x and y land in the same bucket is at most 1/(# of buckets). That is, the collision probability of any specific pair is what it should be. Once you consider 3 elements at the same time, however, the probability that all 3 land in the same bucket
need not be (1/(# of buckets))^2, as it would be for a truly random hash function.
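For concreteness, here is the standard Carter-Wegman construction of such a family: pick a prime p at least as large as the universe, and let h_{a,b}(x) = ((a*x + b) mod p) mod m. A small Python sketch (the specific prime and parameters are just illustrative):

import random

# Classic Carter-Wegman universal family: h_{a,b}(x) = ((a*x + b) mod p) mod m,
# with p a prime at least as large as the key universe. A function from the
# family is described by just the pair (a, b), yet any two distinct keys
# collide with probability at most 1/m over the random choice of (a, b).
P = 2**31 - 1  # a prime, assumed here to exceed any key we will hash

def random_universal_hash(m):
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = random_universal_hash(m=100)   # one random member of the family
print(h(12345), h(67890))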
The most commonly used variation is probably k-wise independent hash families, which have the property that when a hash function is chosen at random from the family, the hash values of any k distinct elements are independent and uniformly distributed.
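The standard way to build such a family is to evaluate a random polynomial of degree k-1 over a prime field. A rough Python sketch, ignoring the slight non-uniformity introduced by the final reduction mod m:

import random

P = 2**31 - 1  # a prime larger than the key universe (illustrative)

def random_kwise_hash(k, m):
    """A random polynomial of degree k-1 over Z_P: its values at any k
    distinct points are independent and uniform over Z_P, which is the
    k-wise independence property (the final reduction mod m is only
    approximately uniform unless m divides P)."""
    coeffs = [random.randrange(P) for _ in range(k)]
    def h(x):
        value = 0
        for c in coeffs:               # Horner's rule evaluation
            value = (value * x + c) % P
        return value % m
    return h

h = random_kwise_hash(k=5, m=1024)     # e.g., a 5-wise independent function
print([h(x) for x in range(4)])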
A negative with using these weakened hash families is that, naturally, you usually get weaker results from the analysis than you would using truly random hash functions. A fine example of this is the recent work by Pagh, Pagh, and Ruzic on
Linear Probing with Constant Independence. With truly random hash functions, linear probing takes expected search time O(1/(1-alpha)), where alpha is the load on the hash table. They show that using only pairwise (2-wise) independence, linear probing can take expected time as large as Omega(log n) -- the bound no longer depends just on the load. Using 5-wise independence they get expected time polynomial in 1/(1-alpha). That's a great result, although there's still a gap from truly random.
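To pin down the setting, here is a bare-bones linear probing sketch with the hash function left as a parameter; the question in all of these analyses is how the probe counts behave when hash_fn is drawn from one of these weak families rather than being truly random. (This is just an illustrative sketch, not code from any of the papers.)

class LinearProbingTable:
    """Bare-bones linear probing; hash_fn is pluggable so the same code can
    be run with a truly random table, a pairwise independent function, or a
    5-wise independent one. Assumes the table never becomes completely full
    and that there are no deletions."""

    def __init__(self, size, hash_fn):
        self.size = size
        self.hash_fn = hash_fn
        self.slots = [None] * size

    def insert(self, key):
        i = self.hash_fn(key) % self.size
        probes = 1
        while self.slots[i] is not None:   # scan forward to the next empty slot
            i = (i + 1) % self.size
            probes += 1
        self.slots[i] = key
        return probes                      # probe count tracks the search cost

    def search(self, key):
        i = self.hash_fn(key) % self.size
        while self.slots[i] is not None:
            if self.slots[i] == key:
                return True
            i = (i + 1) % self.size
        return False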
The
real problem, though, is the following: pretty much whenever you do an experiment on real data, even if you use a weak hash function (like a pairwise independent hash function, or even just some bits from MD5), you get results that match the truly random analysis. I've found this time and time again in my experiments, and it's been noted by others for decades. That's part of the reason why, when I do an analysis, I don't mind doing the truly random analysis. However, it leaves a nagging question. What's going on?
The obvious answer is that there is also randomness in the data. For example, if your data items were chosen uniformly from the universe, then all you would need from your hash function is that it partitioned the universe (roughly) equally into buckets. Such a broad assumption allows us to return to the truly random analysis. Unfortunately, it's also amazingly unrealistic. Data is more complicated than that. If theorists don't like the idea of assuming truly random hash functions, you can imagine practitioners don't like the idea of assuming truly random data!
For years I've wanted an explanation, especially when I get a paper review back that questions the truly random analysis on these grounds. Now I think I have one. Using the framework of randomness extraction, we can analyze the "combination" of the randomness from choosing the hash function and the randomness in the data. It turns out that even with just pairwise independent hash functions, you only need a small amount of randomness in your data, and the results will appear uniform. The resulting general framework applies to all of the above applications -- Bloom filters, multiple-choice hashing, linear probing -- and many others.
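For those who want a taste of the math: the classic Leftover Hash Lemma captures the flavor of the argument (I'm stating it roughly here as an illustration; the theorems in the paper are adapted to these hashing applications):

\[
  H_\infty(X) \;\ge\; \log_2 M + 2\log_2(1/\varepsilon)
  \quad\Longrightarrow\quad
  \Delta\bigl( (h, h(X)),\, (h, U) \bigr) \;\le\; \varepsilon
\]

Here h is chosen from a pairwise independent family mapping the universe into a range of size M, H_\infty(X) is the min-entropy of the data X, U is uniform on the range, and Delta denotes statistical (total variation) distance. In words: a modest amount of entropy in the data, combined with the randomness used to choose the hash function, is enough to make the hash values statistically close to uniform.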
In the next post, I'll give a few more details.