The "final version" for our NSDI paper, Carousel: Scalable Logging for Intrusion Prevention Systems, is now up.
Here's the main idea. In IPS systems, the logger can get overwhelmed during an attack. Bad sources (with other related info) need to be recorded for later examination, but there's only a small amount of on chip memory to buffer bad sources, and only a small about of bandwidth (compared to the rate data is coming into the system) from the memory to the more robust recording infrastructure. How do you get all, or almost all, of the sources?
In our model, bad sources are hitting the system repeatedly -- the denial of service attack setting. On the plus side, this means you don't have to log a bad source the first time it appears - the assumption is it will come back again. On the negative side, you want to avoid sending the same source to the recorder multiple times, as it wastes your small bandwidth. (Dups can be removed at the recording side, though.)
The baseline solution is to just grab a bad source to record for the buffer as soon as you have room after you send one out. We consider a random model -- there's a bunch of bad sources, and the next one to appear is random from that set -- that shows that this approach is bad; by connecting it to the coupon collector's problem, we show it can be a logarithmic factor off of optimal. Other experiments show it can be even worse than this in realistic situations.
Our solution has two parts. We hash-and-partition the bad sources, adaptively finding a partition size that so that the bad sources in each partition fit into our small memory. That is, we hash each source, and put in a partition according to the last k bits. This breaks our N bad sources into groups of size (roughly) N/2^k, and if N/2^k is small enough so that the partition fits into memory, then (assuming we can avoid duplicates), we no longer have a memory problem. We just run through all the partitions, giving us our "Carousel".
To deal with duplicates, we do the obvious -- we use a Bloom filter (or equivalent structure) to avoid sending duplicates within each partition.
This hash-and-partition plus Bloom filter type framework seems like a potential generally useful trick; more details -- including both theoretical analysis and experiments -- are in the paper.
An interesting thing came up in the reviews. We pointed out that the "straw man" approach -- do nothing -- was quite bad. We mentioned that just using a Bloom filter -- without partitioning -- wouldn't really help, but then ignored that option. Apparently, this was a big concern for the reviewers; my understanding is that it almost "sunk" the paper. They wanted to see more details, including simulations, on this option. Luckily, it was still accepted, and in the final version, in response to the reviews, we've added some theory and simulations to prove our point. (The point is you'd need a really, really big Bloom filter -- too big for your memory -- to track all the sources, so eventually you have to clear the Bloom filter, which causes you to lose whatever it was going to gain you in the first place.) I'm not sure what the takeaway is there. The right outcome happened -- they should have accepted the paper, and we could add the appropriate stuff. But I'd have hated for what in my mind is a minor issue to have killed the paper. Perhaps we should have said more about it, although at some point space prevents you from providing details on every possible variation you could consider (even if you have actually considered them).
This seems like a good place to remind people about this book -- Algorithms for Next Generation Networks -- which includes a survey about these sorts of hashing applications in networks by Adam Kirsch, George Varghese, and me. The book does seem a little pricey; we have a slightly older version of the survey available online.