Friday, May 30, 2008

More Robust Cuckoo Hashing (ESA paper)

I have a paper at ESA, with Adam Kirsch and Udi Wieder, entitled "More Robust Hashing: Cuckoo Hashing with a Stash." (Here's an extended preprint version.) The starting point of the paper is that cuckoo hashing is wonderful, and everybody should know about it and use it. (If you don't know about it or use it, see here or here. But basically, cuckoo hashing is multiple choice hashing with the added bonus of getting to move items around among their multiple choices as needed to balance load on-line.)

So, of course, we want to make cuckoo hashing even better. And one of the problems of cuckoo hashing is that it "fails" with non-trivial probability. For example, in standard 2-choice cuckoo hashing, putting n items into 2(1+eps)n cells with at most 1 item per cell can be done with probability 1 - Theta(1/n). That gives a Theta(1/n) "failure" rate, which doesn't hurt "average" behavior, since a failure can always be handled by re-hashing everything with new hash functions -- so that's the end of the story in most theory papers.

In practice, that kind of failure rate is a bit high. Re-hashing an entire hash table could just be too expensive an operation to have occur that frequently for many applications. Can we improve the failure rate somehow?

Let's look at where that Theta(1/n) failure rate comes from. When there are 2 choices per item, the simplest type of failure one might imagine is that 3 items try to use the same two cells -- that is, 3 items have the same 2 choices of location. Indeed, such a problem occurs with probability Theta(1/n) [exercise left to the reader]. But when such a problem occurs, we can imagine taking one of those 3 items and stashing it somewhere -- in a stash, as suggested by the paper title. The stash should be a little space on the side that we may have to check whenever we look for something. If we implement a stash, how does the size of the stash affect the failure rate?
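(A rough back-of-the-envelope version of that Theta(1/n) exercise, for the curious: with m = 2(1+eps)n cells, once the first of three items has picked its pair of cells, each of the other two picks that same unordered pair with probability about 2/m^2, so a fixed triple collides with probability about 4/m^4. There are roughly n^3/6 triples, so the expected number of colliding triples is Theta(n^3/m^4) = Theta(1/n); and since that expectation is small, the probability that at least one such triple appears is of the same order.)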

If failures were essentially independent, so that each item "fails" independently with probability proportional to 1/n^2, then we'd expect f failed items with probability proportional to O(n^{-f}). This turns out to be the right intuition; our analysis shows that a stash sized for s items (for constant s) reduces the failure rate to O(n^{-s-1}). The analysis is a bit trickier, since of course item failures aren't independent, but the result seems natural.
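To make the data structure concrete, here is a minimal Python sketch of 2-choice cuckoo hashing with a constant-sized stash. It's purely illustrative and not the implementation from the paper: the "hash functions" are just Python's built-in hash salted with random seeds, insertion uses a simple eviction loop with a cutoff, and the stash is just a short list checked on every lookup (in hardware it would be something like a small content-addressable memory).

```python
import random

class CuckooHashWithStash:
    """Toy 2-choice cuckoo hash table with a small constant-sized stash."""

    def __init__(self, num_cells, stash_size=4, max_kicks=100):
        self.table = [None] * num_cells
        self.stash = []                      # small list checked on every lookup
        self.stash_size = stash_size
        self.max_kicks = max_kicks           # cutoff for the eviction loop
        self.seeds = (random.random(), random.random())

    def _cells(self, key):
        # The item's two candidate cells, one per (salted) hash function.
        return [hash((key, s)) % len(self.table) for s in self.seeds]

    def lookup(self, key):
        return key in self.stash or any(self.table[c] == key for c in self._cells(key))

    def insert(self, key):
        for _ in range(self.max_kicks):
            a, b = self._cells(key)
            if self.table[a] is None:
                self.table[a] = key
                return True
            if self.table[b] is None:
                self.table[b] = key
                return True
            # Both cells occupied: evict from one of them and re-insert the evictee.
            c = random.choice((a, b))
            key, self.table[c] = self.table[c], key
        # The eviction loop ran too long; stash the leftover item instead of rehashing.
        if len(self.stash) < self.stash_size:
            self.stash.append(key)
            return True
        return False                         # stash full -- only now is a rehash needed
```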

So a small, constant-sized stash -- easily implementable in software or hardware -- reduces the failure probability dramatically, allowing one to avoid rehashing in practice. What I found particularly interesting from the theory+practice perspective is the power in this setting of a constant-sized stash. It's a very natural addition for practitioners -- I don't think it would cause an engineer to blink -- but I think it really changes the potential for cuckoo hashing, making it a structure you could imagine employing on devices used by millions of customers, without having to worry about this otherwise far-too-common failure mode.

We show similar results for multiple cuckoo hashing variations. (The analysis, unfortunately, is different for all of the variations; it would be nice for someone to develop a cleaner, nicer, more unified analysis of all the varieties of cuckoo hashing.) The case of 2 choices is of non-trivial interest, however, since Udi, along with Moni Naor and Gil Segev, has recently shown that 2-choice cuckoo hashing can be used to develop a very efficient history-independent hash table (check out Udi's web page for the paper). The main problem with such a structure is, naturally, the non-trivial probability of rehashing, which can be avoided using a stash.

The stash idea also fits nicely with my previous paper with Adam on the Power of One Move for cuckoo-hashing-style schemes, although there the stashes (with suggested implementation as Content Addressable Memories in hardware) were larger than constant-sized. I'm beginning to think the idea of stashes (or CAMs) should be a standard section in practically oriented, hashing-related papers.

Wednesday, May 21, 2008

The "Conceptual" Debate

I'm not at STOC, but I see the debate about "conceptual" papers appeared again at the business meeting, as it rightly should. It's obviously an area where there's disagreement, in terms of whether there is an actual problem currently, as well as what to do about it. It's something that, as a community, we need to talk and think about.

Rather than engage in debate specifics, let me suggest (yet another) way of thinking about the problem. As a community, we need to think in terms of our field having a regular check-up. Not that we think there's any emergency, but as a community we should act as our own primary care physician: look at where we are, and make sure that problems that need to be looked into are being looked into.

As part of that check-up, I think it's reasonable to ask if our community has produced a healthy number of good concepts. (This, to me, is the heart of the debate.) At the end of the day, in many important ways (including NSF funding, prestige in comparison with other areas of CS, possibly even job production) the production of concepts is at least one significant way in which we'll be judged. Arguably, it's the most important way; certainly externally to CS, and in large part internally to CS and even within TCS, people remember and latch onto concepts more than technical details.

I think the vision workshop before STOC played something of this role -- transforming concepts into nuggets for an NSF-style audience -- albeit looking more forward than backward. As a community, though, perhaps we need some metrics to go along with these check-ups. We need some goals to let us know if we're being successful, and numbers to let us know if we're achieving those goals.

My bias is that one way to see if we're succeeding in concept production is to consider how much our work is used (measured, say, by citations) outside of TCS. If we come up with a good concept, often it will show by spreading beyond our limited community. I know that's just one metric, but it's one I like. We should regularly see how FOCS/STOC conferences are doing in terms of their concepts making their way into other communities, and compare them to other theory conferences and to conferences in related fields (say, the International Symposium on Information Theory). We should see which theory papers become highly cited outside TCS and, when they don't appear in FOCS/STOC, we should ask why, since FOCS/STOC are supposed to be our premier conferences.

And, personally, I think we need to think of other check-up items we ought to be looking into. (Example: how is hiring going in TCS? How many people are leaving the area after getting their Ph.D.s? Why?) Overall, I think these debates and discussions are quite healthy for the community, and we actually need more of them.

Tuesday, May 20, 2008

Summer Salary Shift

At Harvard (as at most US institutions), we are paid a "9-month" salary; we can earn 3 additional months of "summer salary" by various means, including grants. For example, you can use NSF grants to pay up to two months of summer salary (the 2 months is an NSF limitation -- the 3rd month would have to come from somewhere else).

When I arrived at Harvard, I found it a little odd that the 9-month salary was distributed over the 12-month year, but the summer salary was all paid over the summer. What this meant is that if you took 3 months summer salary, say, you got paid almost 1/2 (well, 7/16, I think -- consider this a good word problem for your average high-schooler) of your annual salary over the 3-month summer. But I got used to it.
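(For the curious, here's the arithmetic, under the usual convention that each summer month pays 1/9 of the nine-month salary: over the three summer months you'd get 3/12 of the nine-month salary from the regular distribution plus 3/9 from summer salary, or 1/4 + 1/3 = 7/12 of base pay, out of a yearly total of 1 + 1/3 = 4/3 of base pay; and (7/12)/(4/3) = 7/16.)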

This summer it appears the practice is changing, with summer salary now being set to be distributed over the 12-month financial year. There is some nominal loss (for me and other faculty), in that the money we were going to get paid this summer will now be paid over the 2008-2009 academic year, costing us potential interest. I'm actually also concerned about what havoc this could eventually play with taxes, although with any luck there will be nothing significant. I suppose I'll get used to this as well, but I don't see it as a good thing.

Inside sources have suggested to me that this is part of a move to better keep up with compliance issues for federal grants, inspired by the fact that Yale is apparently currently under an unpleasant government microscope. A quick perusal of Google didn't turn up any useful information on Yale's plight, and I'm not exactly sure why the change in practice would make a difference. Anyone with similar stories or insight is welcome to comment...

Wednesday, May 14, 2008

Writing a Final Exam

For me, one of the hardest challenges of teaching is making up homework assignments and final exams. For homework assignments, there's some assistance from the textbooks (and I re-use problems a lot), but final exams are very challenging: coming up with problems that test a student's learning under a 3-hour time limit. Making a final exam usually takes me several hours.

For my final exam, usually about half consists of True-False, Multiple-Choice, Give-an-Example-or-Counterexample, or Execute-the-Algorithm-on-This-Small-Example type problems. These are very different from the homework assignments, which are usually proof/computation/programming-oriented, and hence have "bigger" problems. The other half is more like the homework assignments -- proof-type problems -- but sufficiently easier (or with sufficient hints) that students can hopefully get through them quickly. Interestingly, although you might think the first kind of problem is easier, student averages are about 70% on both halves.

I've toyed with the idea of giving take-home finals (which I do sometimes in my graduate classes), but these days it's just too easy for students to cheat. I know I've had students anonymously mail questions around looking for people to answer them. And arguably it's useful to have the final exam test something different from the type of problem-solving that they do on the homework.

When I think of the time spent making a final, I know I'd be happier not to give one. And of course the students would be happier too. Not better off, I think, but happier. Perhaps there's a different solution....

Monday, May 12, 2008

A Bubblesearch (Semi-)Success Story

A colleague was backing up files to DVDs and wanted to know a good bin-packing heuristic. I pointed him to First-Fit-Decreasing and Best-Fit-Decreasing, nicely described in the Wikipedia entry on bin-packing, but suggested that, as long as he was doing that, he ought to try Bubblesearch as well. Recall that Bubblesearch just tries random perturbations of the greedy ordering to find a better solution; the full description is in this paper.
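Here's a rough Python sketch of this setup: First-Fit-Decreasing plus Bubblesearch-style random perturbations of the decreasing order. It's my own illustrative version, not code from the paper -- in particular, the geometric "take the next remaining item with probability p, otherwise skip past it" rule, the parameter p, and the tie-breaking score are simplifications; the paper describes the actual perturbation schemes and parameter choices.

```python
import random

def first_fit(sizes, capacity):
    """Pack items, in the given order, with First Fit; returns a list of bins."""
    bins = []
    for s in sizes:
        for b in bins:
            if sum(b) + s <= capacity:
                b.append(s)
                break
        else:
            bins.append([s])
    return bins

def perturbed_order(items, p=0.75):
    """One random perturbation of a greedy ordering: repeatedly take the next
    remaining item with probability p, otherwise skip past it (wrapping around)."""
    remaining = list(items)
    order = []
    while remaining:
        i = 0
        while random.random() > p:
            i = (i + 1) % len(remaining)
        order.append(remaining.pop(i))
    return order

def bubblesearch_ffd(sizes, capacity, trials=10000, p=0.75):
    """First-Fit-Decreasing, then many perturbed orderings; keep the best packing
    (fewest bins, with an emptier straggler bin -- i.e., fuller other bins -- as tie-break)."""
    def score(bins):
        return (len(bins), min(sum(b) for b in bins))
    base = sorted(sizes, reverse=True)
    best = first_fit(base, capacity)
    for _ in range(trials):
        candidate = first_fit(perturbed_order(base, p), capacity)
        if score(candidate) < score(best):
            best = candidate
    return best
```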

His reported results: with FFD everything fit into 5 DVDs, which was the minimum necessary; but he did find that Bubblesearch improved the utilization of the first 4 DVDs from 94% to 99% (with about 10,000 random perturbations and some small parameter tuning), which he thought was impressive.

Your mileage may vary, and I'm sure there are better specialized heuristics for bin-packing. The point behind Bubblesearch is that it is a natural extension of a wide class of Greedy heuristics, and moreover is really easy to understand and to implement. My completely self-interested and biased opinion is that it (or something like it) should be taught in all undergraduate algorithms classes....

Friday, May 09, 2008

Stopover at Google Boston

I gave a talk yesterday over at Google Boston, which is now actually located in Cambridge, in the building above the Kendall Square Legal Seafood [5 Cambridge Center]. They're new digs -- they've pretty much just moved in -- and the place had the look of still being put together, in terms of decorations and all. It did look like a typical Google building -- very open cubicles, lots of conference rooms of various sizes, a game room, a lounge, a colorful cafeteria, etc. I talked about the Hiring Problem; they were going to record my talk and put it on YouTube, but nobody checked the camera battery before the talk, and neither it nor the backup was charged. So I get to avoid for another day the issue of having my talks on YouTube. (The good: anyone can see your talk; the bad: anyone can see your talk.)

Although there's no "research" there per se, they're trying to build up a connection to Harvard, having speakers from EECS and other disciplines (economics, law, etc.) come by and give talks. I'm hoping they'll eventually put some research people there; it would be a nice complement to the soon-to-open Microsoft Cambridge lab. [Note to Big Tech companies -- research labs or lablets in Cambridge/the Boston area should be a no-brainer. If you want access to good students early in their careers, there are lots of good students around here...]

Wednesday, May 07, 2008

Suresh Venkatasubramanian and the "Foundations" Book

Suresh Venkatasubramanian of the famed Geomblog has offered to co-manage this Future of CS: Foundations book project with me, so we hopefully will get more chapters soon. Our goal is to arm-twist a half dozen or so people over the summer, and then once people see it get started, maybe more will join. At least, now I have a good excuse to arm-twist Suresh!

I've had some people e-mail me with interest -- I'm very enthused, and I'm hoping you'll all follow through! (And if you haven't yet, please e-mail us with interest.) I think the key rules, again, to remember are that it should be written for smart high school sophomores, and that each chapter should include a "glimpse into the future", which I think of as a description of a specific, big, open problem that one hopes might be solved in about the next 10 years.

It also, I think, helps to think of right-sized "nuggets" for such chapters. (Although I'm open to more experimental approaches, a la Chazelle.) One person suggested they'd write a chapter on universal hashing, which I think is a great idea. Trying to cover all of hashing seems too big. But explaining the basics of hashing, and then the specific role of universal hashing and why it's interesting, seems just right.