Monday, July 23, 2007

Why Simple Hash Functions Work Well

On my publications page there is a recently finished new paper I'm quite excited about. Part of that excitement is because it's (finally!) my first paper with my remarkable colleague Salil Vadhan. The rest is because I think it answers a question that's been bothering me for nearly a decade.

A fair bit of my research has been on hashing, and often I end up doing simulations or experiments to check or demonstrate my results. For many analyses of hashing processes -- including Bloom filters or the power of two choices, we (or at least I) assume the hash function is truly random -- that is, each element is independently mapped to a uniform location by the hash function. This greatly helps in the analysis. Such an assumption is downright painful to strict theorists (like Salil, who has questioned me about this for years), who rightly point out that hash functions being used are not anything close to truly random, and in fact truly random functions would be far too expensive in terms of space to even write down in practice.

Of course theorists have developed ways to deal with this -- with the seminal work being Carter and Wegman's introduction of universal hashing. The key idea is that you choose a hash function at random from a small family of hash functions that are easy to compute and require little space to express, and this choice guarantees some random property, even it is is not as strong as choosing a truly random hash function. For example, the basic universal hash function families guarantee that for any elements x and y in the universe, if you choose a hash function uniformly from the family, then the probability that x and y land in the same bucket is 1/#of buckets. That is, the collision probability of any specific pair is what it should be. Once you consider 3 elements at the same time, however, the probability that all 3 land in the same bucket need not be (1/# of buckets)^2, as it would be for a truly random hash function.

The most used variation is probably k-wise independent hash families, which have the property that when a hash function is chosen from that family, any collection of k elements are independently and uniformly distributed.

A negative with using these weakened hash families is that usually, naturally, you get weaker results from the analysis than you would using truly random hash functions. A fine example of this is the recent work by Pagh, Pagh, and Ruzic on Linear Probing with Constant Independence. With truly random hash functions, linear probing takes expected search time O(1/(1-alpha)) where alpha is the load on the hash tale. They show that using pairwise (2-wise) independence linear probing can take time O(log n) on average -- it no longer depends on the load. Using 5-wise independence they get expected time polynomial in 1/(1-alpha). That's a great result, although there's still a gap from truly random.

The real problem though, is the following: pretty much whenever you do an experiment on real data, even if you use a weak hash function (like a pairwise independent hash function, or even just some bits from MD5), you get results that match the truly random analysis. I've found this time and time again in my experiments, and it's been noted by other for decades. That's part of the reason why when I do analysis, I don't mind doing the truly random analysis. However, it leaves a nagging question. What's going on?

The obvious answer is that there is also randomness in the data. For example, if your data items were chosen uniformly from the universe, then all you would need from your hash function is that it partitioned the universe (roughly) equally into buckets. Such a broad assumption allows us to return to the truly random analysis. Unfortunately, it's also amazingly unrealistic. Data is more complicated than that. If theorists don't like the idea of assuming truly random hash functions, you can imagine practitioners don't like the idea of assuming truly random data!

For years I've wanted an explanation, especially when I get a paper review back that questions the truly random analysis on these grounds. Now I think I have one. Using the framework of randomness extraction, we can analyze the "combination" of the randomness from choosing the hash function and the randomness in the data. It turns out that even with just pairwise independent hash functions, you only need a small amount of randomness in your data, and the results will appear uniform. The resulting general framework applies to all of the above applications -- Bloom filters, multiple-choice hashing, linear probing -- and many others.

In the next post, I'll give a few more details.


Anonymous said...

I have a few questions on your paper :
1-You didn't mention cuckoo hashing. Is the analysis for multiple choice hash functions also include cuckoo hashing.
2-In your paper you mention 2-wise and 4-wise hashing. Is the k-wise independence the only measure of strength of hash functions : Perhaps the space that is used to describe a hash function could also be a good measure of strength of hash functions : for example zobrist hash is 3-wise independent but uses a lot of space 256*l for a keys of l bytes.

Michael Mitzenmacher said...

Anonymous #1:

Yes, the analysis will work for cuckoo hashing as well. (This is mentioned in the paper.)

We mention 2-wise and 4-wise because they are well-known in theory and have efficient implementations. (Mikkel Thorup, in particular, has written some very nice papers describing how to do 2-wise and 4-wise independence efficiently.) The analysis in the paper, however, is general, and just depends on the "collision probability" of the hash function family.

Anonymous said...

From the paper:

In the future, we hope that there will be a collaboration between theory and systems researchers aimed at fully understanding the behavior of hashing in practice.

This truly sounds like you :)

By the way, Thorup's data structure can evaluate 5-wise independent hash functions as quickly at 4-wise (a point not made in the paper). Does that help?

Pseudonym said...

I don't know about you, but I've found that even universal hash functions (weak ones) perform measurably worse on "real" data (in a constant factor sense) than "good" ones unless your hash table size is prime or close to it.

(Using a prime hash table size covers up many sins, but a) it requires computing primes, b) it requires a division operation rather than a bit mask operation to determine the bucket number, and c) it fails for any situation where you don't get to choose the table size, such as in external hashing.)

I think the important point here is that constant factors sometimes really do matter. Wasting a small amount of CPU time per operation is one thing if that's not the bottleneck, but adding an extra disk seek per operation is a big deal.

Michael Mitzenmacher said...

Anonymous 2: I'm sure 5-wise independence helps a bit, but I don't have a clean way to add it in to the analysis; those moment inequalities always work out much easier with even moments.

Anonymous said...

Hi Prof. Mitzenmacher,

Where could I find a version with the appendix ... there are some theorems in the paper that I'm having trouble figuring out myself, and I can't find the appendix in any of the versions online ...