Friday, August 29, 2008

A Survey on Hash-Based Packet-Processing Algorithms

Sometime way back, Graham Comorde invited me to a DIMACS workshop on Algorithms for Next Generation Networks, where I talked the usual talk about hash tables, Bloom filters, and things. The organizers later had the wherewithal to undertake putting together a book or collected chapters on the topic, and asked us (us being, of course, me and my student Adam Kirsch) for a chapter related to the talk. Adam was just finishing his thesis, and was willing to take on modifying his thesis to form the basis for a survey chapter. I wrote up a section or two, and for good measure, we got George Varghese, who is the world expert -- and author of the book Network Algorithmics -- to join in as well. (In particular, we really wanted George's much more practical perspective on what works and what doesn't!) The result here is a hopefully pleasant light read on current goings-on in the area of hash-based network algorithms, attempting to blend and show the connections between theory and practice.

There's still time for feedback for the final version; please mail me if you have any constructive suggestions.


Anonymous said...

So I read it through (quickly) and enjoyed it but reacted that it didn't fully cover the scope of hashing in networking, or, indeed, packet processing.

So, it left out packet tracing (which requires work during packet processing). The first use of Bloom filters in networking was the paper on SPIE by Snoeren et al (where, warning, et al includes me) in 2001 which describes a novel form of packet tracing.

Second, if we're tracing the history of hashing there are two more notable uses:

* for routing tables, hashing was common in routing protocol implementations (e.g. the earliest routers and GATED) until IP shifted to classless (CIDR) addressing. Indeed, GATED had a famous bug for a while -- its hash algorithm always returned the same hash value (for about a year?), with, as you expect, horrible performance implications. So this is in the period 1977 to 1987 or so, well before this chapter suggests the use of hashing for routing.

* for protocol verification -- Gerard Holzmann's SPIN verifier used hashing to probabalistically deal with state space explosion (back c. 1988). Search for Gerard and you'll find a long list of SPIN publications. I mention it because Gerard's work led me to suggest using hashing in SPIE, which in turn led Alex Snoeren to suggest Bloom filters (when we found the false positive rate for regular hashing was too high), which in turn seems to have reintroduced Bloom filters to networking (as the torrent of Bloom filter papers post 2001 shows).

Michael Mitzenmacher said...

Craig --

Certainly the survey doesn't cover everything. (Heck, we went well over the requested page length to put in what we had.)

I personally wouldn't consider the "packet tracing" algorithms as "packet-processing" in the sense of the survey, although that's an arguable point.

For the history, I'd really give the credit for "re-introducing" the Bloom filter to the networking community to the Summary Cache paper. I'd suppose you'd have to ask Alex if that's where he saw it, or elsewhere.

The other references sound interesting from the historical context, particularly GATED,sound interesting.

Anonymous said...

Hi Michael:

Well we can always disagree about what constitutes packet processing.

Alex did indeed get the Bloom filter idea from the Summary Cache paper (he was part of the MIT group that read the paper and found it striking). And while I consider the Web a distributed system (and thus distinct from networking) the paper appeared at SIGCOMM -- which I suppose says I'm wrong.

For gated -- I had remembered that there was a gated paper somewhere (which, of course, doesn't mention the bug) but it may predate the web and a search of citations to gated doesn't reveal it. (Part of the problem here is that few folks considered route lookup algorithms interesting until CIDR showed up -- and, actually, what they cared about was not lookup but update costs -- the concern for lookup costs really took off in the mid-1990s as we moved to multi-gigabit speeds).

Michael Mitzenmacher said...

I'd love to see the GATED paper, if you can find a reference.

Also, I'm aware that hashing was used pre-1990 for various lookup actions, even in routers, but as far as I know the sort of technology being used was a "standard" vanilla hash-table. The point of the survey is more focused on "modern" hashing -- multiple choice hash functions, Bloom filters, etc. -- and there I think the GigaSwitch we cite, which used multi-level hash tables as described by Broder and Karlin in the 1990 SODA paper -- is the first real example I'm aware of.

Thanks for your help!