My Biased Coin: March 2015

A subject I've grown interested in, related to multiple-choice hashing schemes, is when (and why) double hashing can be used in place of "random hashing" with an asymptotically negligible difference in performance.

One early, useful work on this subject is by Lueker and Molodowitch. They provide a very nice coupling argument between double hashing and random hashing in the setting of open address hashing in their paper More Analysis of Double Hashing. (The original paper appeared in STOC 1988.) In this post I'll summarize their argument. I apologize that both the text and my exposition might be a little rough.

They work in the open address hashing setting; each key runs through a permutation of the table locations when it is being placed, and it placed in the first empty location, with each location holding a single key. When searching for a key, we run sequentially through its permutation; we either eventually find the element or we find an empty slot, in which case we know the key was not in the table, and the search was unsuccessful. We measure the expected time for an unsuccessful search when a table with m slots is loaded with pm keys for a constant fraction p. For convenience we will have m be prime, as this will simplify matters when we consider double hashing. If each key's permutation is completely uniform over all permutations, we call this random hashing, and the expected time to search for key not in the table is 1/(1-p) + o(1); with some work you can get that it is 1/(1-p)+O(1/m), but we will not concern ourselves so much with the low order terms here.

With double hashing, for a key x, the permutation is given by h_1(x)+ j h_2(x) mod m for hash functions h_1 and h_2, where h_1(x) is uniform over the range [0,m-1], h_2(x) is uniform over the range [1,m-1], and the permutation takes the values in the order j=0,1,2,... This gives a permutation (because m is prime), and with double hashing, you just need two random hash values, which from a theoretical standpoint is "much less randomness" than a fully random permutation, and from a practical standpoint is easier to implement.

What Lueker and Molodowitch show is that for any (constant) load factor p, with double hashing, the expected time for an unsuccessful search remains 1/(1-p) + o(1). They show this through a coupling, which shows that double hashing and random hashing can be coupled so the "the same thing happens" -- that is, the key goes into the same slot -- under both double hashing and random hashing most of the time. Unfortunately, it doesn't happen all the time; the coupling is not strong enough to say that all the keys are placed the same with high probability. But they show that they can arrange the coupling so that thing work out nicely just the same.

To start, let us start with a setting where we have loaded our tables with n keys using random hashing, and now take two copies of our state, and consider a single step of random hashing in one copy and double hashing in the other copy, side by side. Clearly, for random hashing, the probability that a key is placed in any empty slot is 1/(m-n) for each slot. In expectation (over the random past), by symmetry, for double hashing the expected probability that a key is placed in any empty is 1/(m-n), but the actual probability for each slot will depend on the configuration. But what they show, using Chernoff bounds, is the the actual probability the key is placed in each slot is at most q/(m-n) for some q that is (1+o(1)), with high probability over the past random placements of the n keys.

Now for the coupling. Starting from empty, at each step we use double hashing in both of our copies with probability 1/q = 1- o(1). Note that this ensures that the probability a key is placed in the "random hashing" copy of the process is at most 1/(m-n), so far. So with probability 1/q, we have placed the key in the same slot in both tables, and so it is as though we've done random hashing for this step.

But what about what happens with probability 1-1/q? Maybe we could ignore it, if 1/q was 1-o(1/n) for example, as a low probability event; unfortunately, that's not the case. In particular, we actually expect that the coupling will fail for some smallish (polylogarithmic) number of steps.

Instead, with probability 1-1/q we place the key so that the step follows random hashing in total. I'm not saying with probability 1-1/q we place the key at random; I'm saying we place the key so that, in total (including the 1-1/q probability first step where they key was placed by double hashing) we place the key so that, overall, the probability any empty slot obtains the key is 1/(m-n). Another way of thinking about this is in the other direction; my coupling always placed the key according the random hashing, and with probability 1/q (which again is very close to 1) that matches what would be done with double hashing.

So in our random hashing copy of the table, we just placed a key according to random hashing. How should we think of what is happening over in the double hashing copy? For that table, with probability 1/q all went fine -- a key was placed by double hashing -- and with probability 1-1/q some key just dropped into the table that wasn't placed by double hashing. It's like an extra present from above. But it's not a key placed by double hashing.

The next part of the argument is to recognize that that's OK, in the following sense. If you simply add a key anywhere is an open addressed hash table, you just make things worse, in a very specific way. Any slot in the table that would have been filled if you hadn't put in that key will still be filled at the end of the process even when you add that key. That is, if S is the set of slots that would contain a key if no extra keys get placed, and S' is the set of slots that contain a key if you, at various points in the process, just add some extra keys anywhere at any point, then a simple induction gives S is a subset S'.

So now let's consider multiple steps of this coupling. At each step, the ball is actually placed according to random hashing, so at every point in the process, the "state" is that of a random hashing process. On the double hashing side of the coupling, with probability 1/q a ball was placed by double hashing, and with probability 1-1/q an extra ball was just placed. So if we count the number of balls placed by double hashing, when we reach the time when n keys have been placed by double hashing in this process, on average n/(1-1/q) = n(1+o(1)) keys (in expectation -- by Chernoff bounds one can get a high probability result) have been placed overall.

The result: placing n keys by double hashing is stochastically dominated (in terms of the keys that have been placed) by placing n(1+o(1)) keys by random hashing. In particular, after we place n=pm keys using double hashing, the expected time for an unsuccessful search is bounded above the expected time for unsuccessful search after putting in pm+o(pm) keys using random hashing, which is 1/(1-p) + o(1). You can do a similar sort of coupling to show that double hashing stochastically dominates placing n(1-o(1)) keys by random hashing. As a result, asymptotically, there's only an o(1) difference in terms of the expected time for unsuccessful search, a result which explains the negligible difference in performance one sees in implementation.

I'm spending (part of) the week at "The Power of Randomness in Computation Workshop", an IMA (Institute for Mathematics and its Applications) and ARC (Georgia Tech Algorithm and Randomness Center) co-sponsored workshop at Georgia Tech. Here's the schedule. I'm told slides will eventually be put up somewhere on the IMA website for such things. Great organization at Georgia Tech -- a big crowd in a very nice room, lots of food and coffee, all very well organized. They even had Ben Affleck waiting in front of the building for us this morning. He seemed to be a little busy shooting a movie to greet us properly, but maybe he'll have a bit more time to chat tomorrow.

Besides Ben, a few other highlights:

Leslie Goldberg started things of talking about the complexity of approximating complex-valued Ising and Tutte partition functions. I remember the Ising/Tutte models (mostly from graduate school and shortly after); now there are connections between various problems in quantum computing and these functions on complex values, which (or course) I had not known.

Nike Sun gave a talk on the exact k-SAT threshold (for large k). It was very clearly presented and gave the argument at the intuitive level. I gained some insight into why the "locally random tree" type argument I've enjoyed in coding/belief propagation arguments breaks down in certain satisfiability problems, due to clustering of solutions and other challenging correlations, and how those issues can be handled. I started to understand (I think) the point of replica symmetry breaking arguments and how they were used to guide the analysis of the k-SAT problem.

Other talks from the day: Amin-Coja Oghlan also talked about replica symmetry techniques and their uses for random graph coloring problems, Eli Upfal talked about some new shuffling techniques for oblivious storage dubbed the Melbourne shuffle, Aravind Srinivasan gave a talk on the Lovasz Local Lemma (staring from the Moser-Tardos results and showing how these arguments carry forward and give greater power and insight into the use of the LLL for additional problems), and I talked about invertible Bloom lookup tables and briefly mentioned a few other unrelated things in progress.

My Biased Coin

Saturday, March 28, 2015

Links: HBR article on Women in STEM and AAUW Report

Wednesday, March 18, 2015

Double Hashing (Lueker and Molodowitch)

Monday, March 16, 2015

Power of Randomness at Georgia Tech

Tuesday, March 03, 2015

Hate EasyChair

My Book on Randomized Algorithms

Blog Archive

Science/Tech Blogs I Read

Contributors

Legal Statement

Analytics