Friday, September 26, 2008

Allerton: Moves on Deletions and Insertions

To wrap up, the other paper I presented at Allerton (pdf, talk slides), joint with Adam Kirsch, continues our work analyzing the power of moving objects around in multi-level hash tables. The gold standard here, at least in theory, is cuckoo hashing, which achieves very high loads but can require a large number of moves when inserting an item. That might not be practical for many applications (especially in network router hardware). In previous work, we had considered the gains obtainable from moving at most 1 item on an insertion. In this work, we add the power to move items on a deletion as well. (Lookups we keep fast, doing no extra work there.)
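To make the setting concrete, here's a minimal toy sketch in Python (my own illustration, not the precise scheme from the paper): a d-level table where each item checks one bucket per level, and an insertion is allowed at most one move.

    import random

    class MultiLevelHashTable:
        # Toy d-level hash table: an item goes in the first level
        # where its bucket is empty. If every level is blocked, we
        # allow ONE move: push a single blocking item deeper.
        def __init__(self, level_sizes, seed=0):
            rng = random.Random(seed)
            self.levels = [[None] * size for size in level_sizes]
            self.salts = [rng.getrandbits(64) for _ in level_sizes]

        def _bucket(self, key, i):
            # stand-in for an independent hash function per level
            return hash((self.salts[i], key)) % len(self.levels[i])

        def insert(self, key):
            # first pass: the standard insert, no moves
            for i, level in enumerate(self.levels):
                b = self._bucket(key, i)
                if level[b] is None:
                    level[b] = key
                    return True
            # second pass: try relocating one blocking item deeper
            for i, level in enumerate(self.levels):
                b = self._bucket(key, i)
                blocker = level[b]
                for j in range(i + 1, len(self.levels)):
                    b2 = self._bucket(blocker, j)
                    if self.levels[j][b2] is None:
                        self.levels[j][b2] = blocker  # the single move
                        level[b] = key
                        return True
            return False  # overflow; hardware would use a small stash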
[The discussion below is, admittedly, probably more for those at least familiar with our previous work...]

Moving an item after another item is deleted is difficult -- generally, you'd want to move some other item from deeper in your multi-level hash table into the now-empty space higher up, but how can you find which item to move? Actually, we'd had ideas on how to do this back when we were working on the insertion paper. But we didn't include them then, for several reasons. First, our ideas required additional space (as explained below). Second, we couldn't analyze them, and one of the points we were aiming for in that paper was to analyze schemes. Third, we didn't have space -- not in the hardware, but in the actual paper. Blame page limits (yes, even for the journal version).

A recent INFOCOM paper offering another idea for handling moves on deletions inspired us to finally write down our approach -- because I feel ours is clearly better, as we discuss in the paper. Even if we can't prove what we'd like, we can still show through simulations that the scheme works well.

Our (obvious) idea is just to keep a hint at each hash table cell recording, if there was a collision there, where the collided item was placed, so you can find it later on a delete. Hints take space, but not too much. If multiple collisions occur, you have to choose which hint to keep; we found keeping the most recent collision worked best. Moves on deletes work reasonably well, but the real gains naturally come from combining moves on insertions and deletions. This gives pretty good performance -- you can get space utilizations more in line with (though of course still not as good as) cuckoo hashing, albeit with more hash functions -- but again, using just 1 move per insert/delete. More details in the paper.
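For the curious, here's a toy continuation of the sketch above (again just illustrative Python, not the paper's exact scheme; it stores hints as full (level, bucket) pairs, while the paper considers more compact encodings, and it omits the one-move insertion path for brevity):

    class HintedMLHT(MultiLevelHashTable):
        # Each cell also keeps a hint: where the most recent item
        # that collided at that cell ended up. A delete consults
        # the hint to pull one item up into the freed cell.
        def __init__(self, level_sizes, seed=0):
            super().__init__(level_sizes, seed)
            self.hints = [[None] * len(lvl) for lvl in self.levels]

        def insert(self, key):
            passed = []  # cells this key collided at on the way down
            for i, level in enumerate(self.levels):
                b = self._bucket(key, i)
                if level[b] is None:
                    level[b] = key
                    for (pi, pb) in passed:
                        # most recent collision wins
                        self.hints[pi][pb] = (i, b)
                    return True
                passed.append((i, b))
            return False  # one-move insert path omitted for brevity

        def delete(self, key):
            for i, level in enumerate(self.levels):
                b = self._bucket(key, i)
                if level[b] == key:
                    level[b] = None
                    hint, self.hints[i][b] = self.hints[i][b], None
                    if hint is not None:
                        hi, hb = hint
                        cand = self.levels[hi][hb]
                        # hints can go stale; verify before moving
                        if cand is not None and self._bucket(cand, i) == b:
                            level[b] = cand
                            self.levels[hi][hb] = None
                    return True
            return False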

7 comments:

Anonymous said...

could the analysis be related to some caching problems?

Michael Mitzenmacher said...

It's certainly possible that one could use some techniques from caching analysis here... though I don't know how.

I admit that personally I look for techniques that provide (near-)exact numerical answers for performance, without resorting to simulation (which is always another alternative). Many caching techniques I'm aware of are based more on competitive ratios or other measures; I'm perhaps less interested in such results in this setting.

Unknown said...

Since you first mentioned cuckoo hashing, I've been wondering about the possibilities when dealing with a single hash and extending it by multiplying it by various primes modulo 2^n (assuming 2^n bins and a bit of special-case handling for zero).

As well as giving you an arbitrary number of hash functions from the one initial hash (which was interesting to me for Python/C++/Java implementations), there's the chance to play with the inverses (and there are always inverses, since odd multipliers are invertible mod 2^n) when it comes to deletion time. Although you'd still have to track which function got each entry into each bin.
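Roughly what I have in mind, as a quick Python sketch (the constants are arbitrary odd values, and zero still needs that special case since it's a fixed point of every multiplier):

    N = 16                                # 2**N bins
    MASK = (1 << N) - 1
    MULTS = [0x9E37, 0xC2B3, 0x85EB]      # odd => invertible mod 2**N

    def derived_hash(base, i):
        # i-th hash function derived from the single base hash
        return (base * MULTS[i]) & MASK

    def recover_base(bin_index, i):
        # odd multipliers have inverses mod 2**N, so a bin index
        # maps back to the base hash (assuming base < 2**N)
        inv = pow(MULTS[i], -1, 1 << N)
        return (bin_index * inv) & MASK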

Just a thought, don't know whether it's helpful or not. I'll probably have a further look when I'm less busy.

Anonymous said...

What about reducing the size of hints while retaining almost the same performance? For example, the size of a hint can be reduced to about log(n)/2 bits. For that, one could first use a mother hash function H to map keys to a range of n^{3/2} (about 1.5 log(n) bits). Then we can use H to compute the hashes for all the tables. Collisions will increase only by roughly sqrt(n) (which might be considered negligible). To compute the hashes for the d tables, we use an easily computable bijection followed by a division. Instead of storing a full hint for a key, we simply store the quotient of the division rather than the key's position.
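A quick sketch of what I mean (sizes illustrative, with the bijection taken to be an odd multiplier modulo the mother hash range):

    M = 1 << 24          # mother hash range, ~ n**(3/2) for n = 2**16
    N_BUCKETS = 1 << 16                  # buckets per table, n
    ODD = [0x9E3779B1, 0x85EBCA6B]       # odd => bijections mod M

    def place(i, mother):
        # bucket in table i, plus the short quotient used as a hint
        v = (mother * ODD[i]) % M
        return v % N_BUCKETS, v // N_BUCKETS  # quotient: ~8 bits here

    def recover_mother(i, bucket, quotient):
        # bucket + quotient rebuild v; inverting the bijection
        # recovers the mother hash, hence the key's other buckets
        v = quotient * N_BUCKETS + bucket
        return (v * pow(ODD[i], -1, M)) % M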

Michael Mitzenmacher said...

Dave: I think the method you're talking about is equivalent to the approach I don't think works so well -- you might want to check the paper. You may be giving up too much in randomness.

Anon #4: You could certainly optimize the hints (just by keeping the "right" number of bits at each level). The more extensive approach you're suggesting looks like it trades off overflow probability against the randomness needed. It's not clear what the "right" tradeoff is; my instinct is that the "simplicity" of a separate hash per level would be the deciding factor in terms of getting people to actually implement it.

Anonymous said...

My suggestion is that by storing only log(n)/2 bits per hint, we can recover the mother hash value of the key that collided at that bucket. From that we can deduce the position of the key. One possible advantage is that we could pack two or three hints into a 32-bit word and manage the hints in FIFO order. I know this is probably too complicated to be useful in practice.
Another observation is that using hints is only advantageous if the size of key+data is relatively large. If key+data is about 16 bytes, then using 4 bytes for a hint will increase space by 25%, which may outweigh any space advantage of using hints.
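For example, a toy packing (numbers illustrative) of three 10-bit hints kept FIFO-style in one 32-bit word:

    HINT_BITS = 10
    SLOTS = 3
    HINT_MASK = (1 << HINT_BITS) - 1
    WORD_MASK = (1 << (SLOTS * HINT_BITS)) - 1  # 30 of 32 bits used

    def push_hint(word, hint):
        # newest hint enters the low bits; the oldest falls off
        return ((word << HINT_BITS) | (hint & HINT_MASK)) & WORD_MASK

    def get_hint(word, k):
        # k = 0 is the most recent hint
        return (word >> (k * HINT_BITS)) & HINT_MASK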

Anonymous said...

Clarification: my approach to computing hash functions is the same as the one used in your paper on improved counting Bloom filters via d-left hashing. The goal is not to reduce randomness but rather to reduce the size of the hint. If we have 1 million elements, it might be easier to find 10 spare bits than 20 bits in the existing bucket structure.