A subject I've grown interested in, related to multiple-choice hashing schemes, is when (and why) double hashing can be used in place of "random hashing" with an asymptotically negligible difference in performance.

One early, useful work on this subject is by Lueker and Molodowitch. They provide a very nice coupling argument between double hashing and random hashing in the setting of open address hashing in their paper More Analysis of Double Hashing. (The original paper appeared in STOC 1988.) In this post I'll summarize their argument. I apologize that both the text and my exposition might be a little rough.

They work in the open address hashing setting; each key runs through a permutation of the table locations when it is being placed, and it is placed in the first empty location, with each location holding a single key. When searching for a key, we run sequentially through its permutation; we either eventually find the key or we find an empty slot, in which case we know the key was not in the table, and the search was unsuccessful. We measure the expected time for an unsuccessful search when a table with m slots is loaded with pm keys for a constant fraction p. For convenience we will have m be prime, as this will simplify matters when we consider double hashing. If each key's permutation is completely uniform over all permutations, we call this random hashing, and the expected time to search for a key not in the table is 1/(1-p) + o(1); with some work you can get that it is 1/(1-p)+O(1/m), but we will not concern ourselves so much with the low order terms here.
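To make the random hashing baseline concrete, here is a minimal simulation sketch. The function names, the table size, and the choice to model a slot as a boolean are my own assumptions for illustration; nothing here comes from the paper.

```python
import random

def insert_random_hashing(table, rng):
    """Place one key by probing the slots in a uniformly random order."""
    order = list(range(len(table)))
    rng.shuffle(order)              # the key's random permutation
    for slot in order:
        if not table[slot]:         # first empty slot wins
            table[slot] = True
            return slot
    raise RuntimeError("table is full")

def unsuccessful_search_cost(table, rng):
    """Probes made by a fresh key until it first sees an empty slot."""
    order = list(range(len(table)))
    rng.shuffle(order)
    for probes, slot in enumerate(order, start=1):
        if not table[slot]:
            return probes
    return len(table)               # full table: every slot was probed

def average_unsuccessful_cost(m, p, trials, seed=0):
    """Load int(p*m) keys, then average the cost of `trials` failed searches."""
    rng = random.Random(seed)
    table = [False] * m
    for _ in range(int(p * m)):
        insert_random_hashing(table, rng)
    total = sum(unsuccessful_search_cost(table, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    # With p = 0.5 the average should land near 1/(1-p) = 2.
    print(average_unsuccessful_cost(m=1009, p=0.5, trials=2000))
```

The sketch shuffles a full permutation for every key, which is wasteful but keeps the correspondence with the definition obvious.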

With double hashing, for a key x, the permutation is given by h_1(x)+ j h_2(x) mod m for hash functions h_1 and h_2, where h_1(x) is uniform over the range [0,m-1], h_2(x) is uniform over the range [1,m-1], and the permutation takes the values in the order j=0,1,2,... This gives a permutation (because m is prime), and with double hashing, you just need two random hash values, which from a theoretical standpoint is "much less randomness" than a fully random permutation, and from a practical standpoint is easier to implement.
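As a quick sanity check on the claim that this probe order is a permutation when m is prime, here is a small sketch. The function name is hypothetical, and the two hash values are passed in directly rather than computed from a key.

```python
def double_hash_probe_sequence(h1, h2, m):
    """Probe order (h1 + j*h2) mod m for j = 0, 1, ..., m-1."""
    return [(h1 + j * h2) % m for j in range(m)]

def is_permutation(seq, m):
    """True when seq visits every slot 0..m-1 exactly once."""
    return sorted(seq) == list(range(m))

if __name__ == "__main__":
    print(double_hash_probe_sequence(3, 4, 7))   # [3, 0, 4, 1, 5, 2, 6]
    # Prime m: every step size in [1, m-1] gives a permutation.
    print(all(is_permutation(double_hash_probe_sequence(0, h2, 11), 11)
              for h2 in range(1, 11)))
    # Composite m: a step sharing a factor with m revisits slots.
    print(is_permutation(double_hash_probe_sequence(0, 4, 12), 12))
```

The composite case shows why taking m prime is convenient: with m = 12 and step 4, the sequence only ever touches slots 0, 4, and 8.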

What Lueker and Molodowitch show is that for any (constant) load factor p, with double hashing, the expected time for an unsuccessful search remains 1/(1-p) + o(1). They show this through a coupling, which shows that double hashing and random hashing can be coupled so that "the same thing happens" -- that is, the key goes into the same slot -- under both double hashing and random hashing most of the time. Unfortunately, it doesn't happen all the time; the coupling is not strong enough to say that all the keys are placed the same with high probability. But they show that they can arrange the coupling so that things work out nicely just the same.

To start, consider a setting where we have loaded our table with n keys using random hashing. Now take two copies of our state, and consider a single step of random hashing in one copy and double hashing in the other copy, side by side. Clearly, for random hashing, the probability that a key is placed in any empty slot is 1/(m-n) for each slot. In expectation (over the random past), by symmetry, for double hashing the probability that a key is placed in any empty slot is 1/(m-n), but the actual probability for each slot will depend on the configuration. But what they show, using Chernoff bounds, is that the actual probability the key is placed in each slot is at most q/(m-n) for some q that is (1+o(1)), with high probability over the past random placements of the n keys.
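For a small table one can compute these per-slot probabilities exactly by enumerating all m(m-1) hash pairs. The sketch below does that for one configuration and reports the smallest q that works for it. The function names are mine, m is assumed prime, and the filled slots are drawn as a uniformly random n-subset, which is the distribution random hashing produces.

```python
import random

def double_hashing_slot_probabilities(table):
    """Exact probability that the next double-hashed key lands in each slot.

    Assumes len(table) is prime and at least one slot is empty, so every
    probe sequence eventually reaches an empty slot.
    """
    m = len(table)
    if all(table):
        raise ValueError("table must have at least one empty slot")
    counts = [0] * m
    for h1 in range(m):
        for h2 in range(1, m):
            slot = h1
            while table[slot]:              # follow h1, h1+h2, h1+2*h2, ...
                slot = (slot + h2) % m
            counts[slot] += 1
    total = m * (m - 1)                     # all (h1, h2) pairs, equally likely
    return [c / total for c in counts]

def worst_case_ratio(table):
    """Smallest q with Pr[slot] <= q/(m-n) for every slot in this configuration."""
    m, n = len(table), sum(table)
    return max(double_hashing_slot_probabilities(table)) * (m - n)

def random_hashing_table(m, n, seed=0):
    """A table whose n filled slots form a uniformly random n-subset."""
    rng = random.Random(seed)
    table = [False] * m
    for slot in rng.sample(range(m), n):
        table[slot] = True
    return table

if __name__ == "__main__":
    table = random_hashing_table(m=101, n=50, seed=3)
    print(worst_case_ratio(table))   # modestly above 1 is what the argument predicts
```

At this tiny size q will not be especially close to 1; the 1+o(1) statement is about large m.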

Now for the coupling. Starting from empty, at each step we use double hashing in both of our copies with probability 1/q = 1- o(1). Note that this ensures that, so far, the probability that any given empty slot receives the key in the "random hashing" copy of the process is at most (1/q)(q/(m-n)) = 1/(m-n). So with probability 1/q, we have placed the key in the same slot in both tables, and so it is as though we've done random hashing for this step.

But what about what happens with probability 1-1/q? Maybe we could ignore it, if 1/q was 1-o(1/n) for example, as a low probability event; unfortunately, that's not the case. In particular, we actually expect that the coupling will fail for some smallish (polylogarithmic) number of steps.

Instead, with probability 1-1/q we place the key so that the step follows random hashing in total. I'm not saying with probability 1-1/q we place the key at random; I'm saying we place the key so that, in total (including the 1/q probability first step where the key was placed by double hashing), the probability any empty slot obtains the key is 1/(m-n). Another way of thinking about this is in the other direction; my coupling always places the key according to random hashing, and with probability 1/q (which again is very close to 1) that matches what would be done with double hashing.
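One way to spell out a single coupled step is sketched below. If d(s) is the double hashing probability for empty slot s and there are k = m-n empty slots, the "leftover" distribution used with probability 1-1/q is r(s) = (1/k - d(s)/q) / (1 - 1/q), which is nonnegative exactly because d(s) <= q/k. The names `d`, `residual_distribution`, and `coupled_step` are my own, and the toy distribution in the demo is made up.

```python
import random

def residual_distribution(d, q):
    """Leftover distribution r over empty slots so the mixture is uniform.

    d maps each empty slot to its double hashing probability; requires
    q > 1 and d[s] <= q/k for every slot, where k = len(d).
    """
    k = len(d)
    if q <= 1:
        raise ValueError("residual is only needed (and defined) when q > 1")
    # max(0.0, ...) guards against tiny negative values from rounding.
    return {s: max(0.0, (1 / k - d[s] / q) / (1 - 1 / q)) for s in d}

def coupled_step(d, q, rng):
    """Return (slot, placed_by_double_hashing) for one step of the coupling."""
    slots = list(d)
    if rng.random() < 1 / q:
        # Both copies agree: the key is placed by double hashing.
        slot = rng.choices(slots, weights=[d[s] for s in slots])[0]
        return slot, True
    # Otherwise an "extra" key is dropped so the overall step is uniform.
    r = residual_distribution(d, q)
    slot = rng.choices(slots, weights=[r[s] for s in slots])[0]
    return slot, False

if __name__ == "__main__":
    d = {0: 0.3, 1: 0.3, 2: 0.4}         # toy double hashing distribution
    q = max(d.values()) * len(d)          # 1.2 for this toy example
    r = residual_distribution(d, q)
    # Mixture d/q + (1-1/q)*r should be 1/3 on every slot.
    print({s: d[s] / q + (1 - 1 / q) * r[s] for s in d})
```

Marginally, the slot returned by `coupled_step` is uniform over the empty slots, which is the sense in which the step "follows random hashing in total."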

So in our random hashing copy of the table, we just placed a key according to random hashing. How should we think of what is happening over in the double hashing copy? For that table, with probability 1/q all went fine -- a key was placed by double hashing -- and with probability 1-1/q some key just dropped into the table that wasn't placed by double hashing. It's like an extra present from above. But it's not a key placed by double hashing.

The next part of the argument is to recognize that that's OK, in the following sense. If you simply add a key anywhere in an open addressed hash table, you just make things worse, in a very specific way. Any slot in the table that would have been filled if you hadn't put in that key will still be filled at the end of the process even when you add that key. That is, if S is the set of slots that would contain a key if no extra keys get placed, and S' is the set of slots that contain a key if you, at various points in the process, just add some extra keys anywhere, then a simple induction gives that S is a subset of S'.
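The subset claim is easy to check empirically. The sketch below runs the same sequence of double-hashed keys twice, once cleanly and once with extra keys dropped into arbitrary slots along the way, and compares the occupied sets. The helper names and the way extras are scheduled (as hypothetical `(step, slot)` pairs) are my own choices, and m is assumed prime.

```python
import random

def place(table, h1, h2):
    """Insert one key with double hashing; assumes an empty slot exists."""
    m = len(table)
    slot = h1
    while table[slot]:
        slot = (slot + h2) % m
    table[slot] = True
    return slot

def occupied_slots(m, keys, extras=()):
    """Occupied set after inserting `keys`, a list of (h1, h2) pairs.

    `extras` is a list of (step, slot) pairs: just before the key at index
    `step` is inserted, an extra key is dropped directly into `slot`.
    """
    table = [False] * m
    extra_at = {}
    for step, slot in extras:
        extra_at.setdefault(step, []).append(slot)
    for step, (h1, h2) in enumerate(keys):
        for slot in extra_at.get(step, []):
            table[slot] = True              # no-op if already filled
        place(table, h1, h2)
    return {s for s, filled in enumerate(table) if filled}

if __name__ == "__main__":
    rng = random.Random(7)
    m = 101
    keys = [(rng.randrange(m), rng.randrange(1, m)) for _ in range(40)]
    extras = [(rng.randrange(40), rng.randrange(m)) for _ in range(20)]
    S = occupied_slots(m, keys)
    S_prime = occupied_slots(m, keys, extras)
    print(S <= S_prime)                     # subset check
```

Individual keys may land in different slots in the two runs; the claim is only about the set of occupied slots.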

So now let's consider multiple steps of this coupling. At each step, the key is actually placed according to random hashing, so at every point in the process, the "state" is that of a random hashing process. On the double hashing side of the coupling, with probability 1/q a key was placed by double hashing, and with probability 1-1/q an extra key was just placed. So if we count the number of keys placed by double hashing, when we reach the time when n keys have been placed by double hashing in this process, n/(1/q) = nq = n(1+o(1)) keys in expectation (by Chernoff bounds one can get a high probability result) have been placed overall.

The result: placing n keys by double hashing is stochastically dominated (in terms of the keys that have been placed) by placing n(1+o(1)) keys by random hashing. In particular, after we place n=pm keys using double hashing, the expected time for an unsuccessful search is bounded above by the expected time for an unsuccessful search after putting in pm+o(pm) keys using random hashing, which is 1/(1-p) + o(1). You can do a similar sort of coupling to show that double hashing stochastically dominates placing n(1-o(1)) keys by random hashing. As a result, asymptotically, there's only an o(1) difference in terms of the expected time for unsuccessful search, a result which explains the negligible difference in performance one sees in implementation.
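To see the conclusion in practice, here is a rough sketch that loads a table purely with double hashing and estimates the unsuccessful search cost by sampling. The table size, trial count, and function names are assumptions of mine, and the hash values are simply drawn from a seeded random generator rather than computed from real keys.

```python
import random

def load_with_double_hashing(m, n, rng):
    """Insert n keys into an m-slot table (m prime) using double hashing."""
    table = [False] * m
    for _ in range(n):
        slot, step = rng.randrange(m), rng.randrange(1, m)   # h_1, h_2
        while table[slot]:
            slot = (slot + step) % m
        table[slot] = True
    return table

def average_unsuccessful_probes(table, trials, rng):
    """Sampled average number of probes until a fresh key finds an empty slot."""
    m = len(table)
    total = 0
    for _ in range(trials):
        slot, step = rng.randrange(m), rng.randrange(1, m)
        probes = 1
        while table[slot]:
            slot = (slot + step) % m
            probes += 1
        total += probes
    return total / trials

if __name__ == "__main__":
    rng = random.Random(2015)
    m, p = 1009, 0.5
    table = load_with_double_hashing(m, int(p * m), rng)
    observed = average_unsuccessful_probes(table, trials=20000, rng=rng)
    print(observed, 1 / (1 - p))   # the two numbers should be close
```

This is only an empirical illustration at one table size; the o(1) statement itself is what the coupling argument establishes.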

## Wednesday, March 18, 2015
