My Biased Coin: 2009

Wednesday, December 30, 2009

New Year's Again

It's time for the annual New Year posting. Last year, I talked about the potential power of affirmations -- in the sense of thinking about concrete goals you hope to accomplish for the year. This year, I'll work in the same theme, but I'll be a bit more specific, and specifically for graduate students.

Now is a good time for graduate students to ask themselves some important questions:

1) If you're planning on graduating this coming semester (or the end of summer), what do you need to do to get there? Do you have a timeline -- with a few extra weeks of padding for unforeseen delays or writer's block? Are you on top of your job search? What can you do to make the thesis and job search go more smoothly? (Hint -- how can you make things easier on your advisor, faculty readers, letter-writers, etc.?)

It's a challenging and scary time -- but you can and will do it, and graduating will feel good. Very, very good.

2) If you're not graduating this year, are you ready to commit to graduating next year? Now is a great time to talk to your advisor and plan what needs to be done to get you out the door within 18 months. (Almost surely your advisor will be thrilled to see such initiative -- even if they don't think you're ready...)

There's an old saying that there's two types of students -- those who are 6 months from graduating and those who are 2 years+ from graduating. The point is that it's mindset -- when you start thinking you're ready to graduate, you'll start moving toward graduating. Are you ready to start thinking that way?

3) If you're early on in the process, what can you do to make this year a good one? Perhaps you can start a collaboration with someone somewhat outside your area -- generally a good experience -- through a class project this semester or by just talking to people about what they're up to. Figure out ways you can talk to people besides your advisor, like going to seminars or serving as a "grad student host" for visitors. Also, now is the time to be applying for summer internships.

4) Finally, if you've found yourself struggling with graduate school, and wondering if you've made the right choice, now is the time for some careful thinking. Think about what you want, possibly talk about it with friends and colleagues -- and then talk to your advisor. Maybe you and your advisor can come up with a plan to make things better. Or maybe you're better off leaving (with a useful Master's Degree), and now is the right time to look for jobs and finish off remaining projects before the end of the academic year. It can be much better to ponder this all now rather than wait until summer nears and realize your options are limited.

Whatever you're up to, I wish you all a happy, healthy, successful 2010.

Wednesday, December 23, 2009

New Result: Tight Asymptotic Bounds for the Deletion Channel with Small Deletion Probabilities

Posting will be light over winter break, in part because so little is going on, but more because I'm busy working on papers for upcoming deadlines. I'll describe a fun little nugget, which is being submitted to ISIT, joint work with Adam Kalai and Madhu Sudan. (The starting point for the result was my giving my survey talk on deletion channels at MSRNE... always nice when something like that works out!) The goal was to get better bounds on the capacity of the deletion channel. (If you haven't been reading the blog long enough: In a binary deletion channel, n bits are sent, and the channel deletes each bit independently with probability p. So, for example, the message sent might be 00110011 and the received message could be 010011 if the 2nd and 4th bits were deleted.) It was known that for deletion probability p the capacity was at least 1-H(p). We show an essentially tight upper bound of 1-(1-o(1))H(p), where the o(1) term goes to 0 as p goes to 0. Here's the draft (a full two weeks before the submission deadline!).
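To make the channel definition concrete, here is a minimal C sketch. The function names `apply_deletions` and `deletion_channel`, the use of '0'/'1' character strings, and the use of `rand()` as the randomness source are all assumptions made for illustration; none of this comes from the paper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Deterministic helper (hypothetical name): copy to `out` every bit of
 * `sent` whose entry in `deleted` is 0. `out` needs strlen(sent) + 1 bytes.
 * Returns the number of bits received. */
size_t apply_deletions(const char *sent, const unsigned char *deleted, char *out)
{
    size_t n = strlen(sent), k = 0;
    for (size_t i = 0; i < n; i++)
        if (!deleted[i])
            out[k++] = sent[i];
    out[k] = '\0';
    return k;
}

/* Binary deletion channel (hypothetical name): each bit of `sent` is
 * deleted independently with probability p. Uses rand(), so the caller
 * is assumed to have seeded it with srand() if variety is wanted. */
size_t deletion_channel(const char *sent, double p, char *out)
{
    size_t n = strlen(sent), k = 0;
    assert(p >= 0.0 && p <= 1.0);
    for (size_t i = 0; i < n; i++) {
        /* r is uniform in [0, 1) */
        double r = rand() / ((double)RAND_MAX + 1.0);
        if (r >= p)
            out[k++] = sent[i];
    }
    out[k] = '\0';
    return k;
}
```

Calling `apply_deletions` on 00110011 with the 2nd and 4th positions marked as deleted reproduces the example above, giving 010011. Note that the receiver sees only the shorter string, not the positions of the deletions.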

In English, the binary deletion channel looks very much like a standard binary symmetric error channel when p is small. This is not the case when p is larger. (Here's a link to my survey on deletion channels and related channels for more info.)

Here's an attempt at describing the intuition. Let's first look back at the error channel. Suppose we had a code of N codewords each with n bits and a perfect decoder for at most pn errors. Then here's a funny way I could store data -- instead of storing n bits directly, I could store a codeword with pn errors that I introduce into it. To get back my data, I decode. Notice that when I decode, I automatically also determine the locations where the errors were introduced. This gives me N*{n choose pn} \approx N2^{nH(p)} possibilities, each of which I can use to represent a different data sequence. Since I'm only storing n bits, I better have N2^{nH(p)} <= 2^n, or else I've found a way to store more than n bits of data into n bits. So (log N)/n, or the rate, satisfies (log N)/n <= (1-H(p)). This is a different way of thinking about the Shannon upper bound on capacity. Of course, it's sweeping away details -- like what if you don't have a perfect decoder -- but it gives the right sort of insight into the bound.

Now consider the deletion channel, and apply the same sort of reasoning. Suppose that we had a decoder for the deletion channel, and further, we had a method of determining which bits were deleted given the received string. Then we could use it to store data in the same way as above and obtain a similar (1-H(p)) upper bound on the rate. Now we have to worry about the details -- like what to do when you don't have a perfect decoder. But more importantly, we have to show that, most of the time with non-trivial probability, you can use the decoder to guess which bits were deleted. (This is where we use the fact that p is going to 0.)

The details work out. Surprisingly, this doesn't seem to have been known before. The best (only) upper bound I know for this case previously was the work by Fertonani and Duman, mentioned in this blog post. Their upper bound as p goes to 0 was of the form 1 - cp for some constant c, so it was different in kind.
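For readers who want the error-channel counting step written out, here is a sketch of the arithmetic. It is the same heuristic argument as in the text, not a rigorous proof, and it assumes logs are base 2 and uses the standard estimate for the binomial coefficient.

```latex
\begin{align*}
  H(p) &= -p \log_2 p - (1-p)\log_2 (1-p), \\
  \binom{n}{pn} &= 2^{\,n(H(p) - o(1))}, \\
  N \binom{n}{pn} \le 2^{n}
    \;&\Longrightarrow\; \log_2 N + n\bigl(H(p) - o(1)\bigr) \le n \\
    \;&\Longrightarrow\; \frac{\log_2 N}{n} \le 1 - H(p) + o(1).
\end{align*}
```

Here the o(1) term goes to 0 as n grows, which is why it disappears in the informal statement (log N)/n <= 1-H(p).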

Slowly but surely, the mysteries of the deletion channel become, well, less mysterious.

Friday, December 18, 2009

Text-book Algorithms at SODA (Guest Post, Mikkel Thorup)

Mikkel Thorup sent in the following guest post:

Text-book algorithms at SODA

This is a pitch for promoting good text-book algorithms at SODA. Erdos promoted book proofs, but book algorithms are in some sense far more important in that they could end up being understood and used by every programmer with a degree in CS. This can yield a huge external impact, and I do think we do ourselves and the world a big favor taking this work seriously. Instead of taking papers on this theme (which would, incidentally, be a great idea), perhaps the area could serve as the basis for a lighter afternoon entertainment session, providing cool stuff that one could take home and show students.

To me the greatest text-book algorithms work well in both theory and practice. They have a cool non-obvious idea that will impress the students, yet, once you get the idea, they are simple to understand and implement. Unfortunately, to get into a top conference, it is best if you also have 5-10 pages worth of complications. Typically the complications are not themselves that interesting, but they are helpful in making the paper look hard enough; otherwise some referees are bound to complain about lack of meat (fat?).

Note that by insisting on complications, we narrow the contributors to the small pond of theoreticians. Simple efficient algorithms are sought by every smart practitioner, and it is no coincidence that many of the elegant algorithms theorists analyze are discovered outside theory. On the other hand, I do think theorists are the ones in the best position to develop great simple algorithms thanks to our fundamental understanding, and I think we should celebrate it when it happens.

To be more clear, let me present a somewhat controversial example of what I consider a great text-book contribution which is buried in a paper [DHKP97] about very different issues. The example is for universal hashing (low collision probability) where the new scheme is simpler and much faster than the classic method.

Suppose we want universal hashing of w-bit keys to u-bit indices. The classic solution is to pick a prime p>2^w, and a random a in [p], and then use the hash function

h_a(x) = ((ax) mod p) mod 2^u --- math terminology.

The alternative from [DHKP97] is to let b be a random odd w-bit number and use

h_b(x) = ((bx) mod 2^w) div 2^(w-u) --- math terminology.

To prove that this is universal is a nice text-book exercise using that odd numbers are relatively prime to powers of two.

There may be no obvious advantage of one scheme over the other on an established theory model like the unit-cost RAM, but the difference is major in reality. Implementing the scheme from [DHKP97] is extremely simple with standard portable C-code. We exploit that C-multiplication (*) of unsigned w-bit numbers is done mod 2^w, and get

h_b(x) = (b*x) >> (w-u) --- C code.
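As a concrete instance of the one-liner above, here is a sketch for w = 64 using the fixed-width types from `<stdint.h>`. The function name `multiply_shift` and the choice w = 64 are assumptions for illustration; the caller is assumed to supply a random odd multiplier b.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply-shift hashing for w = 64 (hypothetical function name).
 * b must be a random odd 64-bit number chosen by the caller;
 * u is the number of output bits, 1 <= u <= 64.
 * Unsigned 64-bit multiplication wraps modulo 2^64 in C, which gives
 * the "mod 2^w" for free; the shift keeps the top u bits of the product. */
uint64_t multiply_shift(uint64_t b, uint64_t x, unsigned u)
{
    assert((b & 1) == 1);
    assert(u >= 1 && u <= 64);
    return (b * x) >> (64 - u);
}
```

The result is always less than 2^u, so it can be used directly as an index into a table of size 2^u. The restriction u >= 1 avoids shifting a 64-bit value by 64, which is undefined in C.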

By comparison, the implementation of the classic scheme is problematic. One issue is that the mod operation is very slow, and has been so for more than 30 years. Already when Carter and Wegman introduced universal hashing at STOC'77, they were aware of the issue. They suggested using Mersenne primes (p=2^i-1) allowing us to bypass mod p with some faster bit-trick operations. Even using that, we still have the issue that the classic scheme requires us to compute ax exactly, and ax has more than 2w bits. Since w-bit multiplication is mod 2^w, we need 6 w-bit multiplications to compute ax in its full length, and that is even ignoring the issue of mod p. If 2w-bit multiplication is available, it suffices with two multiplications, but these are often more expensive than w-bit multiplication.
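For comparison, here is a sketch of the classic scheme with the Mersenne prime p = 2^31 - 1, where the reduction mod p is done with shifts and masks instead of the mod operator. The function name `classic_hash` is hypothetical, and the small prime is an assumption chosen so that the full product fits in a 64-bit integer; it restricts keys to values below p (roughly 30-bit keys).

```c
#include <assert.h>
#include <stdint.h>

#define MERSENNE_P ((uint64_t)2147483647) /* 2^31 - 1, a Mersenne prime */

/* Classic scheme h_a(x) = ((a*x) mod p) mod 2^u (hypothetical name).
 * Assumes a and x are both below p, and 1 <= u <= 31.
 * Since 2^31 = 1 (mod p), the high bits of the product can be folded
 * onto the low 31 bits without changing its value mod p. */
uint32_t classic_hash(uint32_t a, uint32_t x, unsigned u)
{
    assert(a < MERSENNE_P && x < MERSENNE_P);
    assert(u >= 1 && u <= 31);
    uint64_t y = (uint64_t)a * x;          /* exact product, below 2^62 */
    y = (y & MERSENNE_P) + (y >> 31);      /* first fold: below 2^32 */
    y = (y & MERSENNE_P) + (y >> 31);      /* second fold: at most 2^31 */
    if (y >= MERSENNE_P)
        y -= MERSENNE_P;                   /* final correction into [0, p) */
    return (uint32_t)(y & (((uint64_t)1 << u) - 1));
}
```

Even in this favorable setting the code needs a double-width product, two folds, and a conditional subtraction, against a single multiply and shift for the scheme from [DHKP97].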

The impact of the scheme from [DHKP97] is big in that it unites theory and practice in what is probably the world's most common non-trivial inner loop. The classic prime based scheme is so slow that practitioners have come up with all kinds of alternatives that are not even remotely universal, e.g., some combination of shifts and xor, hence no entropy. The new scheme is faster than all these hacks, so now we can convince practitioners to use real universal hashing, often leading them to better more robust results.

Is the above theory? It certainly involves a mathematical observation about the use of relative primality, and I like to think of algorithms as math with mathematically well-defined impact on computing. To get a mathematically well-defined measure, we can, for example, look at how many operations are needed in C, which has been used for efficient portable code since 1972. A theoretical irritant is that we have a whole array of measures, e.g., depending on how we count 2w-bit multiplication and mod-operations. However, the new scheme is clearly better: the variations only affect exactly how much it is better---some factor between 3 and 15.

It is a bit interesting to contrast the above situation with, say, the more robust world of polytime approximation with, say, a very well-defined difference between a worst-case factor 2 and 3. Translating to reality, if the polytime algorithm is superquadratic, it is typically too slow to finish on large scale problems. Moreover, one often gets much better results using simple heuristics with bad worst-case behavior.

For the hashing we are not just talking worst-case but all-cases (same instructions performed on all keys), and I have never tried a real computer on which the new scheme didn't gain at least a factor 4 in speed compared with the classic scheme tuned with Mersenne primes. On top of that, the new scheme is much simpler to implement. While this difference is very convincing for anyone experienced with efficient programming, it may be a bit hard to appreciate for "THEORY of Computing" conferences like STOC/FOCS. However, I see algorithms and SODA more as a "Theory of COMPUTING" with a scope closer to the reality of computing, hence with a bigger interest in text-book algorithms that unite theory and practice. Highlighting such simple, but incredibly useful, practical computing algorithms would both increase the impact of SODA (and arguably theory more generally) and provide a useful distinguishing characteristic for the conference.

[DHKP97] M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen.
A reliable randomized algorithm for the closest-pair problem.
J. Algorithms, 25:19-51, 1997.

Wednesday, December 16, 2009

Avoiding Travel

I'm spending the morning as part of the committee for a thesis defense over in Europe. I'm watching the talk over Skype; we're using a conference call for sound (and as a backup); I have copies of the slides on my laptop.

Is it the same as being there? No. But it's probably 90+% as good in terms of the experience of listening to the defense. It saves me a full day of plane travel (never mind other time overhead), it saves the institution multiple thousands of dollars in air fare and other travel expenses, and if you happen to feel that flying has negative externalities due to greenhouse gas emissions, then it's good for other reasons as well.

If the timing had worked out better, I might have made the trip, and arranged to give some talks to amortize the "travel cost" over more than the defense. But I'm glad to avoid the travel -- thank goodness for modern technology. And as much as I enjoyed the NSDI PC meeting Monday, if it hadn't been down the block at MIT, I would have enjoyed it much less. (Indeed, the location of the meeting was one of the incentives to accept the PC invitation.) I'm still waiting for the tech to get good enough so we can have an online PC meeting (w/video, sound, etc. to mimic the face-to-face discussions of a face-to-face meeting) that we don't have to travel to.

Monday, December 14, 2009

LiveBlogging NSDI PC Meeting

My post today is live-blogging the NSDI PC Meeting -- with a delay for security purposes, of course.

My take on the reviews (and from past experience) is that the NSDI PC is a very, very tough committee. People are looking for exciting and novel ideas, with clear and detailed experiments demonstrating real-world benefits (which usually means comparing against a real implementation from previous work). It's hard to get all that into one paper -- and to get everything so that the reviewers are all happy. And once in a while you can run into a reviewer like me, who expects your "good idea" to also have a suitable mathematical formulation when that makes sense. (If you're claiming to optimize something, I -- and others -- want a clear notion of what you're trying to optimize, and why your idea should help optimize it.)

So it's not surprising that, 4th paper in from the top, we've already hit our first paper where we're deferring our decision instead of accepting, and we're already getting some detailed discussions on whether a paper is good enough to accept. We'll have to speed up to make dinner....

As if to underscore that I did not have a great set of papers, I have just a few that I reviewed in our "DiscussFirst" pile, which takes us through lunch. Good thing I can keep busy with blog entries. And I have a review to write for another conference...

My submission appears (tentatively) accepted. Hooray! For this PC, we're not kicking people out of the room for conflicts -- people are just supposed to be keeping their mouths shut on papers where they have a conflict. For PC members with papers, however, you get kicked out of the room. So I've just spent a tense 15 minutes or so outside, but I'm happy to see the news once I'm back in. (More on this paper in another post....) Overall, I'd say (as expected) PC papers had no special treatment -- they were as harshly judged as all the other papers.

We're now having an interesting discussion about "experience" papers -- what do you learn after building/running a system after several years? A lot of people really think that having experience papers is a good idea, but there's some discussion of the bar -- what makes such papers interesting, and how interesting should they be? (Good anecdotes, but with quantifiable data to support the lessons.)

We're now about in the middle of the papers we're meant to discuss. Things here could go either way. Lots of technical discussion. As an aside, I can give my point of view on what are "hot topics". Data centers seem to be a big topic. There seemed to be a number of papers about scheduling/optimizing/choosing the right configuration in cloud computing systems -- how that could be done without making the programmer explicitly figure out what configuration to use (but just give hints, or have the tools figure it out automatically). There's a significant number of EE-focused papers -- essentially, trying to get gains with some detailed, explicit look at the wireless signal, for example.

Headed to the end, more or less on schedule. Now we're assigning shepherds to all of the papers.

Sorry to say, I won't be revealing any information on specific papers -- everyone will find out when the "official" messages go out, or from their own connections on the PC...

I think the PC chairs have done a good job pushing us to keep on schedule; I think the discussions have been detailed and interesting. I think the committee is perhaps overly harsh (a target of 30 papers for about 175 submissions, or 17-18% acceptance; we ended up with 29). But I think we did a good job overall, and have earned our post-meeting dinner out.

Friday, December 11, 2009

Harvard Governance

Harry Lewis (and Fred Abernathy) take a stand against the Harvard Corporation. Required reading for anyone associated with Harvard, or interested in its current predicaments.

Thursday, December 10, 2009

News Items

The people at Microsoft Research New England are excited to announce that Boaz Barak will be joining them. I imagine the situation is similar to their hiring of Madhu Sudan. They continue to build up a remarkable collection of researchers.

Harvard's Allston complex is officially suspended.

JeffE doesn't post so much these days, so you might have missed this great post with his favorite useless theorem. (I hope this starts a series of "Favorite Useless Theorems" -- please send him, or me, if you have examples of your own.)

Nela Rybowicz, staff senior editor for IEEE (and who I've dealt with for pretty much all of my Transactions on Information Theory papers), informed me that she'll soon be retiring. She'll be missed.

Wednesday, December 09, 2009

TSA Oops

You may have heard on the news that the TSA (temporarily) put up their Screening Management Standard Operating Procedure on the web. As pointed out on this blog (and elsewhere), they made a small error:

"So the decision to publish it on the Internet is probably a questionable one. On top of that, however, is where the real idiocy shines. They chose to publish a redacted version of the document, hiding all the super-important stuff from the public. But they apparently don’t understand how redaction works in the electronic document world. See, rather than actually removing the offending text from the document they just drew a black box on top of it. Turns out that PDF documents don’t really care about the black box like that and the actual content of the document is still in the file."

Oops. Apparently, somebody hasn't read Harry Lewis's book Blown to Bits; the start of Chapter 3 discusses several similar cases where somebody thought they had redacted something from a PDF file... but didn't.

Sunday, December 06, 2009

Faculty Ratios

Perhaps the most basic breakdown one can make of a college faculty is into the categories Natural Sciences / Social Sciences / Humanities. (Yes, one could naturally think of "engineering" as a fourth category separate from "natural sciences"; feel free to do so.) So here are some basic questions:

1. What is the faculty breakdown in your institution into these three categories?
2. Do you think that breakdown is appropriate?
3. What do you think the breakdown will be (or should be) ten years from now? Twenty years from now?
4. Do you think your administration is thinking in these terms? Do you think they have a plan to get there? (And if so, are they open about it -- is the faculty involved in these decisions?)

[These questions were inspired by a comment of Harry Lewis over at Shots in the Dark.]

Friday, December 04, 2009

Retirement

No, not for me. But Harvard has announced its plans to encourage faculty to retire. I won't call it an "early retirement" package, since it is for people over 65. Though I suppose that is early for academia.

Note that (according to the article) Harvard's Faculty of Arts and Sciences will have 127 offers to its 720 (junior and senior) faculty. So I conclude (as of next year) 1/6 of Harvard's faculty would be over 65. And the article states that the average age of tenured Harvard professors is 56. Can people from elsewhere tell me if this is unusual? Harvard has a reputation for having an older faculty; this seems to confirm it. Does this suggest a lack of planning somewhere along the line? Or is this a good thing?

I don't expect drastic changes arising from this plan; it will be interesting to see how many faculty take the offer. In general, however, is it a good idea for a university to "encourage" retirement for older faculty? And if so, what means should they use to do it?

Viewed as an optimization problem, one can ask what is the "best" distribution of faculty ages at a university, and what mechanisms could (and should) be used to maintain that distribution?

Thursday, December 03, 2009

Tight Thresholds for Cuckoo Hashing via XORSAT

As I promised sometime back, we now have a paper (up on the arxiv) giving the thresholds for cuckoo hashing, a problem that had been open and was part of my survey (pdf) on open problems in cuckoo hashing.

One interesting thing about our paper is that our (or at least "my") take is that, actually, the thresholds were right there, but people hadn't put all the pieces together. The argument is pretty easy to sketch without any actual equations.

In cuckoo hashing, we have n keys, m > n buckets, and each key is hashed to k possible buckets. The goal is to put each key into one of its k buckets, with each bucket holding at most one key. We can represent this as a random hypergraph, with each key being a hyperedge consisting of k buckets, represented by vertices; our goal is to "orient" each edge to one of its vertices. A natural first step is to repeatedly "peel off" any vertex that doesn't already have an edge oriented to it and has exactly one adjacent unoriented edge, and orient that edge toward it. At the end of this process, we're left with what is called the 2-core of the random hypergraph; every remaining vertex has at least 2 adjacent edges. Can we orient the remaining edges of the 2-core? In particular, we're interested in the threshold behavior; as m,n grow large, what initial ratio m/n is required to have a suitable mapping of keys to buckets with high probability? (This corresponds to the memory overhead of cuckoo hashing.)
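The peeling step described above can be sketched in C as follows. This is a simple quadratic-time illustration, not the algorithm from the paper; the name `peel`, the fixed k = 3, and the bound `MAX_BUCKETS` are assumptions, and each key's k buckets are assumed to be distinct.

```c
#include <assert.h>

#define K 3            /* buckets per key; 3 is an assumed value */
#define MAX_BUCKETS 64 /* assumed upper limit on m for this sketch */

/* edges[e][0..K-1] lists the K distinct buckets (vertices) of key e.
 * After the call, assigned[e] is the bucket key e was oriented to,
 * or -1 if key e is still in the unpeeled core.
 * Returns the number of keys left in the core. */
int peel(int n, int m, int edges[][K], int assigned[])
{
    int degree[MAX_BUCKETS] = {0}; /* unoriented edges touching each bucket */
    int remaining = n;
    int progress = 1;

    assert(m >= 1 && m <= MAX_BUCKETS);
    for (int e = 0; e < n; e++) {
        assigned[e] = -1;
        for (int j = 0; j < K; j++) {
            assert(edges[e][j] >= 0 && edges[e][j] < m);
            degree[edges[e][j]]++;
        }
    }

    /* Keep sweeping until a full pass orients nothing. */
    while (progress) {
        progress = 0;
        for (int e = 0; e < n; e++) {
            if (assigned[e] != -1)
                continue;
            for (int j = 0; j < K; j++) {
                int v = edges[e][j];
                if (degree[v] == 1) {
                    /* e is the only unoriented edge at v: orient e to v */
                    assigned[e] = v;
                    for (int i = 0; i < K; i++)
                        degree[edges[e][i]]--;
                    remaining--;
                    progress = 1;
                    break;
                }
            }
        }
    }
    return remaining;
}
```

A bucket that already holds a key always has degree 0 here, so it is never chosen twice. With the four keys {0,1,2}, {0,1,3}, {0,2,3}, {1,2,3} on four buckets, nothing can be peeled and all four keys stay in the core, even though a valid assignment exists; that gap is exactly why the 2-core needs a separate argument.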

Now consider the random k-XORSAT problem, where we have m variables and n clauses, where each clause is randomly taken from the possible clauses
x_{a_1} + x_{a_2} + ... + x_{a_k} = b_a.
Here the x_{a_i} are distinct variables from the m variables, b_a is a random bit, and addition is modulo 2. The question is whether a random k-XORSAT problem has a solution. Again, we can put this problem in the language of random hypergraphs, with each clause being a hyperedge of k variables, represented by vertices. Again, we can start by orienting edges to vertices as we did with the cuckoo hashing representation; here, a clause has an associated orientation to a variable if that variable is "free" to take on the value to make sure that clause is satisfied. Is the remaining formula represented by the 2-core satisfiable?
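Whether a particular XORSAT instance has a solution can be checked by Gaussian elimination modulo 2. The sketch below is only meant to make the problem concrete; the name `xorsat_solvable`, the bitmask encoding of clauses, and the limit of 63 variables are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Each clause is one 64-bit word: bit j (j < m) is set iff variable x_j
 * appears in the clause, and bit m holds the right-hand side b.
 * Bits above m are assumed to be zero. Requires 1 <= m <= 63.
 * The rows array is modified in place.
 * Returns 1 if the system has a solution over GF(2), else 0. */
int xorsat_solvable(uint64_t rows[], int n, int m)
{
    int rank = 0;
    assert(m >= 1 && m <= 63);

    for (int col = 0; col < m && rank < n; col++) {
        /* find a row at or below `rank` that contains variable col */
        int pivot = -1;
        for (int i = rank; i < n; i++) {
            if ((rows[i] >> col) & 1) {
                pivot = i;
                break;
            }
        }
        if (pivot < 0)
            continue;

        uint64_t tmp = rows[rank];
        rows[rank] = rows[pivot];
        rows[pivot] = tmp;

        /* XOR the pivot row into every other row containing col */
        for (int i = 0; i < n; i++)
            if (i != rank && ((rows[i] >> col) & 1))
                rows[i] ^= rows[rank];
        rank++;
    }

    /* A row reading "0 = 1" means the clauses are inconsistent. */
    uint64_t rhs_bit = (uint64_t)1 << m;
    for (int i = 0; i < n; i++)
        if (rows[i] == rhs_bit)
            return 0;
    return 1;
}
```

For example, the two clauses x_0 + x_1 + x_2 = 1 and x_1 + x_2 + x_3 = 0 are satisfiable, while x_0 + x_1 + x_2 = 0 together with x_0 + x_1 + x_2 = 1 is not.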

The correspondence between the two problems is almost complete. To finish it off, we notice (well, Martin Dietzfelbinger and Rasmus Pagh noticed, in their ICALP paper last year) that if we have a solution to the k-XORSAT problem, there must be a permutation mapping keys to buckets in the corresponding cuckoo hashing problem. This is because if the k-XORSAT problem has a solution, the corresponding matrix for the graph has full rank, which means there must be a submatrix with non-zero determinant, and hence somewhere in the expansion of the determinant there's a non-zero term, which corresponds to an appropriate mapping.

And the threshold behavior of random k-XORSAT problems is known (using the second moment method -- see, for example, this paper by Dubois and Mandler.)

While in some sense it was disappointing that the result was actually right there all along, I was happy to nail down the threshold behavior. In order to add something more to the discussion, our paper does a few additional things.

First, we consider irregular cuckoo hashing (in the spirit of irregular low-density parity-check codes). Suppose that you allow that, on average, your keys have 3.5 buckets. How should you distribute buckets to keys, and what's the threshold behavior? We answer these questions.

Second, what if you have buckets that can hold more than one key? We have a conjecture about the appropriate threshold, and provide evidence for it, using a new fast algorithm for assigning keys to buckets (faster than a matching algorithm).

I should point out that since I originally "announced" we had this result, two other papers have appeared that also have nailed down the cuckoo hashing threshold (Frieze/Melsted; and Fountoulakis/Panagiotou). Each seems to have different takes on the problem.

Tuesday, November 24, 2009

4-year Masters

Harvard, like many other places, has an option by which students (with "Advanced Standing" from AP classes) can obtain a Master's (in some programs) as well as their undergraduate degree in 4 years. The School of Engineering and Applied Sciences, and CS in particular, offers this option.

Every year some students I advise are interested in and want advice about the program. My first question back is always why do they want to do it -- what do they think they'll get out of it? Often, they don't have a good reason; it seems they just see it as an opportunity to get an additional degree, which I don't think is a particularly good reason in and of itself. (Harvard students are, after all, high-achievers and trained to respond to such incentives.) Some possible reasons that come up include:

1) They're going into industry, and they believe that the Master's will give them a higher starting salary (which may, over a lifetime, translate into a substantial benefit).
2) They're going into graduate school, and believe the degree will allow them to move faster through their graduate program, or at least allow them to take fewer classes.
3) They're coming from another area, like economics or chemistry, but have found late in the game that they like computer science, and would like to have a credential in that area in case that is where they move in terms of their career.
4) Their parents have figured out that they can technically graduate in three years and are unwilling to pay for four, but can be convinced to pay for it if there is a Master's degree involved.

Are there other reasons? And how good are these reasons? Is there still a "Master's premium" in starting salaries, even for a "4th year" class-based Master's program? Does starting with a Master's of this kind really get you through graduate school faster anywhere? (Personally, I already think graduate students don't take enough classes, so I'm biased against reason 2.) The third reason -- a student coming in from another field -- is reasonable, but Harvard does now have minors (called "secondary concentrations" here) instead. Pressuring parents is not a reason I can throw support behind, but I can certainly understand it. Maybe it's the best reason on my current list.

Of course, it's important to consider what, if any, are the downsides. Here's my starting list:
1) A loss of flexibility in choosing classes. Doing the two degrees in four years saddles a student with so many requirements they lose the chance to take that art or history class, learn a language, or do some other exploratory things in college. Isn't that part of what college is supposed to be about?
2) It often ends up taking the place of other college experiences, like doing a senior thesis or other research. For students who want to go to graduate school, I'd personally recommend doing a senior thesis over a 4th year master's degree, so that they can begin to get some insight into how research works -- and if they'll really like it.
3) It can be hard. Most graduate courses have large-scale final projects; trying to take 6-8 such courses in two or three semesters can be a real challenge, and is certainly a time-commitment.

As always, my biggest goal in advising students on such matters is to make sure they're well-informed and have thought through the various implications of their choices. I'd appreciate any thoughts anyone has on the matter, either from their own personal experience or their experience with students who have done such programs.

Friday, November 20, 2009

In Reverse

The Crimson is reporting that the Harvard Faculty of Arts and Sciences will plan to decrease the size of the faculty* in response to budget woes. The key point seems to be that "There are now 720 associate professors and professors in FAS, an increase of 20 percent since 2000." The reductions will occur in the standard way -- not filling some positions after retirement, and offering some sort of early retirement package for faculty. (Harvard has, comparatively, a much older faculty than most institutions. Here's a 2004 Crimson article on the subject that pointed out that 7% of the tenured faculty was over age 70.)

I admit, I'd like to see some concrete statements that there's to be a similar if not more extensive effort to decrease the size of administration, although to be fair some of that has also been occurring.

* The first comment, by one menckenlite, to the article seems so funny I have to quote it here:
"Disappointed to learn that Harvard is just reducing the number of faculty members. I thought they were going to get smaller professors so that they did not overload the sidewalks and streets with their over sized egos and girths."

Wednesday, November 18, 2009

Job Advice?

Not for me, thank goodness.

A very talented graduating senior (who may or may not be at Harvard...) has obtained a number of job offers, and asked me if I had any advice on what job would make the best place to start a career -- or look best on a resume. (Similar questions come up most every year.) I explained that besides having some potential conflicts of interest, I was removed enough from the job circuit these days to not have any useful advice. But I could ask others....

Let's consider a number of possible jobs a talented student might easily obtain:

Software Developer at Google
Program Manager at Microsoft
Developer at Facebook
Entry-level position at a tech-oriented "boutique" consulting firm
Something else you'd like to suggest

What advice would you give them on what to choose? Or how to choose, which is probably more useful?

A warning to students: free advice is often worth what you pay for it....

Monday, November 16, 2009

Conference Reviewing Update

A few weeks ago, I talked about the reviewing process for NSDI and LATIN. I suppose now is a reasonable time for an update.

LATIN is nearing the end of the reviewing process. I think it went well -- my top ranked papers seem to be being accepted, my low ranked papers are not. There's been some electronic discussion of papers where there was wide disagreement, but we're not having an on-site PC meeting, and overall there's been surprisingly little discussion on my papers. Because LATIN is a "2nd tier" conference, I had previously suggested that I expected there would be some wide deviations among review scores, "corresponding to different opinions about how interesting something is". There were in fact some wide scoring discrepancies, though this may not have been the primary reason. I was a reviewer on multiple papers where one reviewer really didn't seem to "get" the paper -- in most cases, ranking it high when I thought the ranking should be lower. (I imagine the scores will change before the reviews come back to the authors.) I've seen similar problems even in other, stronger theory conferences -- selecting 3 reviewers who are expert on the subject of a paper in a broad theory conference is very difficult to consistently get right, especially when subreviewers come into play -- though I think it was more problematic here, where the papers are weaker on average in any case. Finally, I still don't like the easychair interface that much.

The NSDI reviews have been, for me, substantially more interesting, no doubt in part because the papers are more interesting (to me). The "first round" is nearing the end, and at least on my papers, the review scores are remarkably consistent. In cases where they aren't consistent, there's usually a clear reason for the discrepancy that comes out in the reviews, which tend to be longer and more detailed for systems conferences. While that's all very satisfying, at this point I'm hoping to be offered some dramatically more controversial papers to look at for Round 2, or I'll be finding the PC meeting pretty boring. (I should note I have a paper submitted to NSDI, so I reserve the right to either completely trash the reviewing system, or sing its praises ever-higher, depending on the eventual outcome.) Finally, I still like the hotcrp interface a lot.

I get asked to serve on a number of PCs, and usually, I make efforts to serve, because I believe such service is important to the community. But I must say, doing these two at roughly the same time has led me to think I'll be more circumspect in the future. The older I get, the more precious time seems to become, and perhaps I've just reached a point where I think I've done enough PC service that I can be choosier about what I agree to, aiming for things that are more enjoyable. At the same time, I wouldn't want everyone to start acting that way, since then I imagine it would be tough to get good PCs together for the many conferences we have.

Thursday, November 12, 2009

Surveys

This post is about surveys. It's motivated by one of my tasks last night, as I had to spend some time going over the final proofs for the survey Hash-Based Techniques for High-Speed Packet Processing, written with Adam Kirsch and George Varghese, which happily will finally be "officially" published (as part of a DIMACS book). The link is to a submitted version; I'll try to find a "final" version to be put up when possible, but the delta is small. I'll take the liberty of complimenting my co-authors on the writing; if you want a quick guide on the connection between hashing and routers, it should be a good starting point.

It occurs to me that I've written a number of surveys -- I mean, a lot*. Indeed, I'm sure in some circles I'm known mostly (and perhaps only) for some survey I've written. That is not meant as self-promotion; indeed, I'm well aware that some people would view this as quite a negative. After all, as comments in a recent post felt it important to point out (and argue over), the mathematician Hardy wrote: "Exposition, criticism, appreciation, is work for second-rate minds." I would disagree, and I would hope to encourage others to write surveys as well.

I've found writing surveys a useful tool in both doing and promoting my research. I've done surveys for multiple reasons. It's a good way to learn a new topic. It's a good way to bring some closure to a long line of work for oneself. It's a good way to frame and popularize a research direction or a set of open problems. And finally, I've found it's a good way to provide a bridge between the theoretical and practical communities.

Earlier in my career, there didn't seem to be much of a home for publishing surveys. I was fortunate that the journal Internet Mathematics started when it did, and was willing to take surveys. Otherwise, I'm not sure where my surveys on Bloom filters and power laws -- my two most cited (and I would guess read) -- would have ended up. These days, surveys seem to have become more acceptable, thankfully. The Foundations and Trends series, in particular, has provided a natural outlet that has spurred a number of impressive and useful surveys. I admit these booklets tend to be a bit longer than what I have usually aimed for, but I was usually hoping just to find some journal (or conference) that would take a survey, so length was actually a negative. I imagine someday I'll get up the energy to write something for this series.

But perhaps there are now other mechanisms for producing and publishing surveys. I view Dick Lipton's blog as providing one or more well-written mini-surveys every week (a truly amazing feat). Wikipedia provides a means for what seem to be essentially collaborative mini-surveys to also be written on technical topics; perhaps some Wiki-based tool or archive could be developed that would allow for richer, growing and changing surveys with multiple contributing authors.

In any case, when somebody suggests to you that exposition is for a second-rate mind, keep in mind that not everybody agrees. Writing a survey has become downright respectable. If you feel like disagreeing strongly, in the most vocal way possible, then please, go ahead and write a survey as well.

* Here's a possibly complete list. Co-author information and other related information can be found on my List-of-Papers page. Current links are provided here for convenience.

Some Open Questions Related to Cuckoo Hashing
Hash-Based Techniques for High-Speed Packet Processing
A Survey of Results for Deletion Channels and Related Synchronization Channels
Human-Guided Search
Toward a Theory of Networked Computation
Network Applications of Bloom Filters: A Survey
Digital Fountains: A Survey and Look Forward
A Brief History of Generative Models for Power Law and Lognormal Distributions
The Power of Two Random Choices: A Survey of Techniques and Results

Tuesday, November 10, 2009

Graduate School? How to Decide...

What do people think about students going to work for a year or two and then applying to graduate school? Or applying but then deferring to work for a year or two?

It's that time of year when seniors are thinking about graduate school. (I have multiple requests for NSF letters pending...) So, naturally, the other day I talked with a student who, essentially, had the question, "Should I go to graduate school?"

In this case, the question wasn't one of talent; the student would, I'm sure, do very well in graduate school. But he also has a job offer from a top company in computing where he could do interesting work and, I'm sure, also do very well.

In these tough situations, I try my best not to give direct advice, but instead try to get a student to talk about their own concerns and issues to help them realize which way they really want to go. While I feel positive about the outcomes from my having gone to graduate school, I'm a very biased sample, and I know lots of others -- very bright, talented, capable people -- who found it wasn't worth it for them. I don't think I would attempt to give advice even if I thought I could perfectly distinguish those who would find great personal success from graduate school from those who won't, and it's perfectly clear to me that I'm far from a perfect distinguisher.

Where possible, I try to give facts. Inevitably, people who find both work and graduate school compelling options want to know how difficult it would be to switch from working back to school. My take was that at the application level, a year or two working generally, at worst, does minimal harm to an application. Your professors still remember who you are well enough to write useful and informative letters, and your academic skills are assumed to have not gotten rusty. Coming back after an extended period, however, might make the application harder to judge.

The greater difficulty in switching is that the longer you work, the harder it can become. You get used to a real paycheck instead of a subsistence wage. Who wants to move again, uprooting their life (friends, relationships, etc.)? And you probably start to become attached to your job and your co-workers in various ways. [Interestingly, the same sorts of issues can arise for people who are thinking about academic jobs vs. research labs/other jobs after their PhD.]

Happily, the student seemed to not need my most important advice -- that both possibilities offered him great opportunities for success and happiness, so he should not stress about making a choice that was "wrong".

Does anyone have further, general advice for those facing this decision?

Friday, November 06, 2009

Ranking

An anonymous commenter asked an insightful question, worthy of a real answer: "Hi Prof, Why are you so obsessed with ranking things?"*

Honestly, I don't think I am. I have 3 children, and I have thus far avoided assigning them a preference ordering.** If you asked me for my favorite TV shows (or movies, or songs, etc.) I could think of some off the top of my head, but I haven't ever thought hard about coming up with a list of favorites.*** Same with restaurants, food, vacation destinations, whatever. I don't spend my time giving rankings for Netflix or things like that. I could probably come up with rankings with some thought, but it's not like I go around ranking things constantly.

That is, I don't do that in my personal life. In my professional life, come to think about it, I spend an awful lot of my time ranking things. I serve on multiple program committees each year where I'm asked to rank papers. (And I send my papers to conferences, where they are in turn ranked, and my submission is, implicitly, a ranking of sorts on the conference.) I serve on NSF panels to rank grants. I write letters of recommendation which, implicitly or explicitly, provide a ranking of students (and, occasionally, faculty). I interview and evaluate faculty candidates. I grade and assign grades in my classes, and similarly grade senior theses. I serve on a Harvard committee that decides undergraduate thesis prizes. And I'm sure if I thought about it some more, I could come up with even more examples.

My blog is meant to be a professional blog, about my professional life. If it seems that I'm obsessed about ranking, that is a reflection of my professional life. I am asked to rank a lot as part of my job.

So I think I can turn the question back -- why are all of you so obsessed with ranking, that I end up having to spend so much time doing it?

* This comment came up in my last post about the possibility of FOCS/STOC asymmetry, where ranking was at most a tangential concern. But my previous post was on ranking networking conferences, so I can understand where the comment comes from.
** That's meant to be humorous.
*** Well, that's perhaps not quite true. Any undergraduate who has taken my algorithms class can correctly tell you that my favorite TV show of all time is Buffy the Vampire Slayer, so I'd best admit to it before it comes up in the comments.

FOCS/STOC and Asymmetry

I had a funny conversation with Madhu Sudan yesterday, with him relaying an idea he said he heard from Umesh Vazirani (and perhaps the trail goes on further from there) -- roughly that FOCS should double in size and STOC should halve in size. Or, I guess vice versa -- the point is that right now the two are pretty symmetric, and it's not clear that's the best setup.

The idea (or my interpretation of it) is that in theory we could use a more selective "top" conference -- one that people felt they should really try to go to, even if they didn't have a paper in it, because it would have the major results from the year. Hence we halve one of the conferences and make it more selective (and, naturally, make it single-session, maybe have some special tutorials or other activities). At the same time, we don't want to lessen the number of papers that currently are taken in FOCS/STOC -- indeed, since the community (or at least the number of papers being written) has expanded, we should probably accept more. (So maybe people wouldn't feel the need to start yet more conferences, like ICS.) So we double the other. Again, this would be a conference that, ideally, more people would attend, because more people would have papers in it. Indeed, this could help get papers out of the system faster (instead of papers being resubmitted quite so frequently). By introducing asymmetry, perhaps we could make both conferences more appealing and better attended.

I pointed out that one community I know of already does this -- this is very similar to SIGCOMM and INFOCOM in networking. I think that model works, though there are certainly tensions and problems with it -- as you can see in the comments on my recent post on Ranking Networking Conferences. (Bigger conferences are more variable in quality, primarily; also, they require large-scale parallel sessions.) Again, we'd have asymmetry -- the larger conference might become perceived as "weaker", but it would play the important role of bringing the community together and being an outlet for more papers.

Interesting though the idea is, I have trouble imagining the theory community moving in that direction. Big changes are always hard to get moving, and it's not clear how many people really think the current system is broken -- though the ICS movement clearly seemed to think something was wrong. I'd be willing to try it, myself, but of course I also like the "two-tiered" (or maybe 1.5-tiered) SIGCOMM/INFOCOM system.

Thursday, November 05, 2009

Harvard Financial Aid

This post will talk about Harvard's financial aid program, and why it's a perfectly good thing to give money to Harvard, despite what you might read in the New York Times.

I am motivated to write about this also because some weeks ago, I got into a blog-argument with some Chronicle of Higher Education writer who gave an incoherent argument that Harvard should have been spending its endowment increasing its undergraduate class size. (See the bottom of this post for the starting point if you want.) One point I argued was that Harvard had in fact been spending its endowment to make college more affordable through its financial aid program, and that that was probably doing more to open Harvard up to a wider talent pool than simply admitting more students would do.

Certainly one can argue whether teaching more students or making Harvard financially available to more students is a more important goal. But one thing that became clear is that that author, the author of the New York Times opinion piece, and I presume many other people, just don't understand the financial aid picture at Harvard. So I'll say something about it that's actually based on facts and numbers.

Let me start with a back of the envelope calculation. (I recently got access to some official numbers, but they may be confidential, and the back of the envelope calculation is easy and accurate enough.) About 2/3 of Harvard undergraduates get financial aid from Harvard, and on average it covers about 2/3 of their tuition. That's approximately 4000 students, getting an average of about $35,000 per year in aid from Harvard, for about $140 million per year. Let's call it $125 million in case my numbers are off and to make the math easier.

Long-term endowment spending rates are about 5%. (This seems to be a standard rule across most major universities, but I haven't seen an economic analysis to explain this number. Please give pointers in the comments.) So Harvard's undergrad financial aid corresponds to roughly $2.5 billion of endowment money.

This is a much bigger proportion of the endowment than people realize. Usually people bandy about a figure of $27 billion or so post-crash for Harvard's endowment, but the endowment for the Faculty of Arts and Sciences -- that is, for the undergrads, as opposed to the law/business/medical/graduate/etc. schools -- is only about $11 billion. So Harvard is now using, by my estimates, well over 20% of its annual endowment spending (for FAS) for financial aid. I've argued in the past that Harvard should make itself tuition-free for undergraduates -- but even I'm impressed by and happy with these numbers.
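For anyone who wants to check the arithmetic, here is a minimal sketch in Python of the back-of-the-envelope calculation above. The variable names are my own, and the inputs are just the rough figures quoted in this post (4000 students, $35,000 average aid, a 5% spending rate, an $11 billion FAS endowment), not official numbers.

```python
# Rough inputs quoted in the post; all names here are illustrative.
students_on_aid = 4000            # about 2/3 of undergraduates
average_aid = 35_000              # dollars per student per year
annual_aid = students_on_aid * average_aid
print(f"Annual aid: ${annual_aid:,}")

# The post rounds down to $125 million to be conservative.
rounded_annual_aid = 125_000_000

# A 5% spending rate means each dollar spent per year needs $20 of endowment.
spending_rate_percent = 5
endowment_equivalent = rounded_annual_aid * 100 // spending_rate_percent
print(f"Endowment needed to fund it: ${endowment_equivalent:,}")

# Share of the (approximate) FAS endowment tied up in financial aid.
# This equals the share of annual FAS endowment spending going to aid,
# since both sides scale by the same 5% rate.
fas_endowment = 11_000_000_000
share = endowment_equivalent / fas_endowment
print(f"Share of FAS endowment: {share:.1%}")
```

With these inputs the share comes out a bit under 23%, which is where the "well over 20%" figure comes from.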

Think of it this way: the projected deficit for FAS over the next few years, roughly speaking, could disappear entirely without any budget cuts if we just turned off financial aid. Of course that's a terrible idea, and financial aid is one area where Harvard, so far, is making sure not to cut. But that gives an idea of the scope.

So when I hear people say that Harvard isn't doing enough to open its educational doors, or suggesting that giving to Harvard is not morally sound, I admit I feel obliged to politely correct them. (Or, sometimes, less politely correct them.) If you believe that affordable education is important, there are of course many institutions deserving of support. Harvard remains on that list.

Tuesday, November 03, 2009

Conference Reviews

I promised at some point to get back to discussing the reviewing process for two conferences I am currently on the PC for, NSDI and LATIN. Since I happily just finished my "first drafts" of the reviews for both conferences, now seems like a good time. As usual, I've finished a bit early and most reviews are not yet in, so I'm writing this without benefit of seeing most of the other reviews yet.

I should point out that comparing NSDI and LATIN is definitely an apples and oranges comparison, and not just because one is systems and one is theory. LATIN is a "2nd tier" conference (and one would probably argue that was being polite), held every other year, with no specific theme other than theory; the acceptance rate is probably in the 25-35% range. That is not to say the papers are bad, but the papers generally utilize known techniques, and the question is whether the underlying question seems interesting, the paper was written well, etc. I'm not looking for papers that everyone would want to read; I'm looking for papers that I think somebody wants to read. Since interests vary greatly, I suspect there may be some substantial score deviations among reviewers, corresponding to different opinions about how interesting something is. I don't mean to sound negative about the conference; some very nice papers have appeared in LATIN, with my favorites including The LCA Problem Revisited, and On Clusters in Markov chains. But I don't think it's a first choice destination for many papers -- unless, of course, an author lives in Latin America or wants to go to Latin America.

NSDI is arguably a "1st tier" systems conference for networks/distributed systems. While it doesn't have the prestige of a SIGCOMM, it's certainly aiming at that level -- although I think perhaps even more than SIGCOMM there's a bit of bias at NSDI for concrete implementations demonstrating actual improvements. In the last two years the acceptance rate has dropped below 20% and I expect it to be there again. Generally I'm looking for a solid, well-explained idea or system design, with some experimental evidence to back up that the idea really could be useful. I admit I would prefer to have some definitions, equations, theorems, or at least well-structured arguments in these submissions -- this is something I push on regularly -- as for me these are highlights of having a well-explained idea, but a paper can still possibly be good without them (and sometimes a paper that is too theoretically oriented wanders too far off from reality, even for an open-minded idealist such as myself).

Now for concrete differences. For LATIN I only have 10 or so papers to review; there's a big PC and the meeting will all be electronic. I imagine I might get asked to read one or two more papers where the reviews don't agree but that's probably it. Most papers will probably have 3 reviews. There's a 0-5 point scale, from strong reject to strong accept, but no "percentages" assigned to the ratings. There's also a whole lot of other scores (originality, innovation, correctness, presentation, difficulty) I have to give that I think are overkill. Even though the number of papers is small, it seems a number of people are using outside reviewers. (I generally don't, unless I feel I'm so far from the area of the paper I need someone else to read it.) We're using Easychair, which these days seems passable, but is far from my favorite.

For NSDI, we have a first round of 20 or so papers. Each paper is getting 3 reviews in the first round, and then we'll probably cut the bottom X% (about 40-50%?). Everyone reviews their own papers. In the second round papers will probably get 1-2 more reviews (or more), and outside reviewers will be used if it's thought their expertise could help. (Usually the chairs, I believe, assign outside reviewers, often based on comments or suggestions by the first-round reviewers.) After the second-round reviews are in, we have a face-to-face PC meeting. We're using the standard 1-5 networking scale with 1 being "bottom 50%", and 5 being "top 5%". I've actually found that helpful; I was going over my scores, realized I had a bit less than 50% with scores of 1, and went back and decided that there were papers I was being a bit too generous to. (Giving scores of 1 is hard, but if everyone tries to follow the guidelines -- unless they really believe they had a well-above-average set of papers -- I find it makes things run much more smoothly.) We're using hotcrp, which I like much better than Easychair -- I can easily see from the first screen the other scores for each paper, the average over all reviews, how many other reviews have been completed, etc.

Once all the reviews are in, we'll see how things work beyond the mechanics.

Monday, November 02, 2009

ICS Papers Announced

As pointed out in many places, the papers for the (strangely named) new theory conference Innovations in Computer Science are out, with the list here and list with abstracts here.

I suppose the future will tell how "innovative" these papers are compared to, say, the normal collection at FOCS/STOC/SODA. I'm not surprised to see the trendy areas of game theory and quantum fairly heavily represented. I was a bit shocked, however, to see a number of papers on what I would consider "mainstream" coding/information theory, in that I wouldn't be at all shocked to see papers with similar abstracts (but different authors) at say an International Symposium on Information Theory. The example nearest and dearest to me would have to be

Global Alignment of Molecular Sequences via Ancestral State Reconstruction
Authors: Alexandr Andoni, Constantinos Daskalakis, Avinatan Hassidim, Sebastien Roch

which, while sounding all biological, is really just studying trace reconstruction problems on a tree. I'm a fan of the under-studied trace reconstruction problem, as it's tied closely to insertion and deletion channels; I was a co-author on a paper on a different variant of the problem back in SODA 2008. (I also cover the problem in my survey on insertion/deletion channels.) I guess I'm glad to see that work on this very challenging problem is considered "innovative".

Wednesday, October 28, 2009

Ranking Networking Conferences

I'm curious if various readers out there would be willing to offer their ranking of networking conferences. The issue has come up in some conversations recently, and I was wondering what other possibly more informed sources think.

Besides your ranking, of course, I'm interested in the reasons behind the rankings. Is it just acceptance rate? Do certain networking conferences specialize in subareas where they are remarkably strong? How does/did such a ranking get built and maintained; does it ever get lost?

Friday, October 23, 2009

Major (and Minor) Happenings : Gu-Yeon Wei

I'm thrilled to announce that my colleague Gu-Yeon Wei, in EE here at Harvard, received tenure.

I feel this is worth a mention because:

1) Strangely, people sometimes seem to forget we have EE here at Harvard. They shouldn't. (Gu, for example, is one of the leaders of the RoboBee project, and with David Brooks in CS, has been writing a slew of papers that spans the circuits and architecture divide.)
2) Strangely, people sometimes seem to have kept the notion that Harvard EE+CS do not tenure their junior people. That's an outdated impression. Like most other universities, we aim to hire people who will eventually earn tenure.

That's the major happening of the day. The minor happening, from yesterday, was that I visited Yale and gave my talk on Open Questions in Cuckoo Hashing. I had a great day, but will pass on just one great insight : if Joan Feigenbaum recommends a restaurant, it's worth listening to. (Dinner after the talk was a true treat.)

Tuesday, October 20, 2009

Old References

One interesting aspect of our WSDM paper is that we have multiple references from the 1930's and 40's. It turns out our problem is related to some of the problems from the early (earliest?) days of experiment design.

This was actually a stumbling block for us for a while. In one sense, we had a very positive starting point, in that I knew there was something out there related to our problem. As a youth (literally, back in high school) I had seen some stuff on the theory of combinatorial design, and while it was too abstract for me to find a direct connection, I knew there must be some stuff out there we better be aware of. We eventually found what we really needed by random searching of keywords; some variation of "experiment design" led us to the Design of experiments Wikipedia page, which used Hotelling's problem as an example. Once we had this (our magic keyword!), we could forward track to other relevant references and information.

In many cases, the problem is not only that you don't know what you should be referencing -- you may not even know you should be referencing something at all. This happens a lot in problems at the boundaries -- econ/CS problems, for example. Most notably, this was a big problem in the early work on power laws, as I pointed out in my survey on power laws -- that's the most egregious example I know, where a lot was "re-invented" without people realizing it for quite some time.

I still get the feeling that, despite the great tools we now have available to us, people don't do enough searching for related work. I can understand why. First, it's not easy. If you don't know what the right keywords are, you have to use trial and error (possibly helped by asking others who might have a better idea). For multiple papers I have written, I have spent multiple hours typing semi-random things into Google and Google Scholar, looking around for related work. (In the old days, as a graduate student, I actually pulled out lots of physical books from the library shelves on anything that seemed related -- I like this new system better.) It can seem like a waste of time -- but I really, really encourage authors to do this before submitting a paper. Second, in many cases there's a negative payoff. Who wants to find out (some of) what they did was already done? (In fact, I think everyone who expects to have a long research career would actually prefer to find this out as soon as possible -- but it still can be hard to actively seek such news out.)

On the positive side, I can say that good things can come out of it. Reading all the original work and related problems really helped us (or at least me) better understand the right framework for our variation of the problem. It also, I think, can help get your paper accepted. I feel we tried hard to clearly explain the historical context of our problem -- I think it makes our paper richer than it would be without it, exposing some interesting connections -- and I think it paid off; one reviewer specifically mentioned our strong discussion of related work.

Monday, October 19, 2009

WSDM Paper : Acceptance Rates

I'm happy to announce our paper "Adaptive Weighing Designs for Keyword Value Computation" -- by me, John Byers, and Georgios Zervas -- was accepted to WSDM 2010 -- The Third ACM Int'l Conference on Web Search and Data Mining. (The submission version is available as a technical report here.) The abstract is at the bottom of the post for those who are interested.

The paper's acceptance gives me an excuse to discuss some issues on paper writing, research, conferences, and so on, which I'll do this week. To start, I found it interesting that WSDM had 290 submissions, a 70% increase in submissions over 2009. Apparently, Web Search and Data Mining is a healthy research area in terms of the quantity of papers and researchers. They accepted 45, or just about 15.5%. This turns out not to be too far off from the first two years, where acceptance rates were also in the 16-17% range. I'm glad I didn't know that ahead of time, or I might not have submitted!

I'm curious -- why would a new conference, trying to establish itself and gain a viable, long-term group of researchers who will attend, limit itself to such small acceptance rates when starting out? Apparently they thought the key to success would be a high quality bar, but I find the low acceptance rate quite surprising. I can imagine that the rate is low because there are a number of very poor submissions -- even the very top conferences, I've found, get a non-trivial percentage of junk submitted, and although I have no inside knowledge I could see how a conference with the words "International" and "Web" in the title might receive a number of obviously subpar submissions. But even if I assume that a third of the submissions were immediate rejects, the acceptance rate on the remaining papers is a not particularly large 23.3%.
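The percentages above are easy to reproduce. Here is a small sketch in Python; the function name and its `junk_fraction` parameter are my own inventions for illustration, and the one-third figure is only the hypothetical assumption made above, not a known statistic.

```python
from fractions import Fraction

def acceptance_rate(accepted, submissions, junk_fraction=Fraction(0)):
    """Acceptance rate among submissions left after discarding a junk fraction."""
    serious_submissions = submissions * (1 - junk_fraction)
    return float(accepted / serious_submissions)

# WSDM 2010 numbers quoted in the post.
overall = acceptance_rate(45, 290)

# Hypothetical: suppose a third of the submissions were immediate rejects.
adjusted = acceptance_rate(45, 290, junk_fraction=Fraction(1, 3))

print(f"Overall: {overall:.1%}")
print(f"Excluding a third as junk: {adjusted:.1%}")
```

Using `Fraction` for the one-third keeps the intermediate arithmetic exact; the two rates round to about 15.5% and 23.3%.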

The topic of low acceptance rates for CS conferences has been a subject of some discussion lately -- see Birman and Schneider's article at the CACM, Matt Welsh's thoughts, Dan Wallach's thoughts, and Lance Fortnow's article at the CACM for instance. Here we have an interesting example case to study -- a new conference that starts out with an accept rate in the 16% range, and an apparent abundance of submissions. Anyone have any thoughts on why that should be? (I'll see if I can get some of the conference organizers to comment.) Or opinions on if that's the way it should be?

Now for that abstract:

Attributing a dollar value to a keyword is an essential part of running any profitable search engine advertising campaign. When an advertiser has complete control over the interaction with and monetization of each user arriving on a given keyword, the value of that term can be accurately tracked. However, in many instances, the advertiser may monetize arrivals indirectly through one or more third parties. In such cases, it is typical for the third party to provide only coarse-grained reporting: rather than report each monetization event, users are aggregated into larger channels and the third party reports aggregate information such as total daily revenue for each channel. Examples of third parties that use channels include Amazon and Google AdSense.

In such scenarios, the number of channels is generally much smaller than the number of keywords whose value per click (VPC) we wish to learn. However, the advertiser has flexibility as to how to assign keywords to channels over time. We introduce the channelization problem: how do we adaptively assign keywords to channels over the course of multiple days to quickly obtain accurate VPC estimates of all keywords? We relate this problem to classical results in weighing design, devise new adaptive algorithms for this problem, and quantify the performance of these algorithms experimentally. Our results demonstrate that adaptive weighing designs that exploit statistics of term frequency, variability in VPCs across keywords, and flexible channel assignments over time provide the best estimators of keyword VPCs.
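To make the setup concrete, here is a toy sketch in Python of the underlying estimation problem. To be clear, this is not the adaptive algorithm from the paper: it just treats each day's channel report as one linear equation (clicks per keyword times unknown VPC, summed over the keywords in that channel) and solves the resulting system by ordinary least squares. The data are made up, the reports are assumed noise-free, per-keyword click counts are assumed known, and the function names are hypothetical.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Solve the square system a x = b exactly by Gauss-Jordan elimination."""
    n = len(a)
    # Augmented matrix [a | b] with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in a row with a nonzero pivot (raises StopIteration if singular).
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]

def estimate_vpc(rows, totals):
    """Least-squares VPC estimates via the normal equations (A^T A) x = A^T y."""
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * t for r, t in zip(rows, totals)) for i in range(n)]
    return solve_linear(ata, aty)

# Made-up example: 3 keywords, 2 channels, 2 days. Each row holds the clicks
# each keyword contributed to one channel on one day (0 if not assigned there).
rows = [
    [10, 4, 0],   # day 1, channel A: keywords 0 and 1
    [0, 0, 20],   # day 1, channel B: keyword 2
    [8, 0, 15],   # day 2, channel A: keywords 0 and 2
    [0, 6, 0],    # day 2, channel B: keyword 1
]
totals = [40, 20, 31, 30]  # reported revenue per channel per day

vpc = estimate_vpc(rows, totals)
print([float(v) for v in vpc])  # [2.0, 5.0, 1.0]
```

Changing which keywords share a channel from day to day is what makes the system solvable with only two channels; if the assignment never changed, keywords 0 and 1 could not be separated in this example.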

Sunday, October 18, 2009

Harvard Finances

For those who are interested in such things, Harvard's latest financial report appears to be available. Rumor has it that the report was made (widely) public in part because of a Boston Globe article, showing that Harvard was doing some unwise things with its "cash" accounts. (Our own Harry Lewis gets a quote.) That comes on top of previously reported news (in the financial report) that Harvard had to pay about $500 million to get out of some bad hedges on interest rates. I'm sure many will get a kick out of it.

I hope, at least, that people will make use of the information properly when discussing things Harvard. A couple of weeks ago I pointed to a truly fact-impaired Kevin Carey opinion in the Chronicle of Higher Education (that I still can't bring myself to link to). One mistake he made, which is common, is to refer to Harvard's $37 billion endowment (now about $26 billion) as though it were all for undergraduate education. In fact, the Faculty of Arts and Sciences (FAS, the "home" for undergraduates) "owns" about $11 billion of the $26 billion; the med school, business school, law school, and various other sub-organizations within Harvard all have their pieces. Also, of this $11 billion, only a fraction is in money that can be used for "general purposes"; much of it is tied to specific purposes (chairs for faculty, financial aid, libraries, etc.). Anyhow, when someone comes along and spouts off about how Harvard should spend its money, I'll have a new pointer for where to start an informed discussion.

Also, on Friday Cherry Murray, Dean of Harvard's School of Engineering and Applied Sciences, had an all-hands meeting, where naturally the topic of SEAS finances was part of what was addressed. (The budget for SEAS is independent of FAS, approximately.) While we're not in great shape, we appear to be somewhat better off, as less of our budget comes from the endowment distribution, and we've had a bit of a buildup in our reserves the last few years that will help us through the next few. This should mean that SEAS will be (slowly) hiring again soon; I'm hoping that computer science and/or applied mathematics will be areas where we'll be advertising for new faculty.

Friday, October 16, 2009

Welcoming Myself to CACM

I'd like to welcome myself to the Blogroll for the Communications of the ACM! My colleague Greg Morrisett suggested I get my blog into the CACM Blogroll, so a few e-mail messages later, and apparently I'm in. Just goes to show, they must have a pretty low bar. Actually, since I'm a regular reader of most of the blogs on their Blogroll, it's a pleasure to join the list. It's not clear how this will affect the tone and style of my blog posts -- probably not at all -- but perhaps it will encourage me to branch out into yet more topics of more general interest.

While poking around the CACM I was pleased to see some press on the Harvard RoboBee project, one of the 3 NSF Expeditions awards from this year. While I'm not on the RoboBee team, it's already getting some of my attention; I'm co-advising a senior who wants to do her undergrad thesis on some algorithmic problems related to RoboBees. I'm imagining I'll be drawn into other related sub-projects, as there seems to be lots of possible algorithms questions one might want to tackle in developing artificial insects. Perhaps that's the power of these large-scale, Expeditions style projects: by setting seemingly very distant, almost impossible goals, they push people to think and do new things.

Also of note: Lance Fortnow's article on the P versus NP problem is still on their list of top articles, as is his viewpoint on Time for Computer Science to Grow Up. And their front page has a review article on Smoothed Analysis from this month's issue.

I've said it before but it bears repeating: it's amazing how CACM has changed to become, in my mind, a really relevant resource for computer science and computer scientists. And I'm not just saying that to welcome my new blog overlords.

Wednesday, October 14, 2009

New Book on Concentration Bounds

I spent an hour or more today perusing the book Concentration of Measure for the Analysis of Randomized Algorithms, by Devdatt Dubhashi and Alessandro Panconesi (that Alessandro was kind enough to send me). It's a very nice book covering a variety of tail bound arguments and applications, with a number of exercises. I'd recommend it for use in a graduate-level seminar, or as a general reference for people working in probabilistic analysis of algorithms. Theory graduate students should have a copy nearby if not on their shelf.

It treats the standard approaches very well -- Chernoff-Hoeffding bounds, martingales, isoperimetric inequalities, and so on -- but I think what particularly stands out in this book's treatment is the consideration of what to do when the random variables are not quite so nice. Tail bounds tend to be "easy" to apply when all the random variables are independent, or when your martingale satisfies a nice simple Lipschitz condition; it's when the variables are dependent or there's some special side case that wrecks your otherwise pleasant martingale that you need to pull out some heavier hammers. This book makes those hammers seem not quite so heavy. Chapter 3 is all about Chernoff-Hoeffding bounds in dependent settings; another chapter has a subsection on martingale bounds for handling rare bad events. I've had these things come up in the past, so it will be nice now to have a compact resource to call on with the appropriate bounds at hand.
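For readers who want to see the "easy" independent case in action, here is a small sketch (mine, not from the book) comparing the basic Hoeffding bound with a simulation. It assumes n independent fair coin flips; for a sum S of n independent variables in [0, 1], Hoeffding's inequality gives P(S - E[S] >= t) <= exp(-2t^2/n). The function names and parameter choices are just for illustration.

```python
import math
import random


def hoeffding_upper_tail(n, t):
    """Hoeffding bound on P(S - E[S] >= t) for a sum S of n independent [0, 1] variables."""
    return math.exp(-2.0 * t * t / n)


def empirical_upper_tail(n, t, trials=2000, seed=1):
    """Fraction of trials in which n fair coin flips exceed their mean n/2 by at least t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if heads - n / 2 >= t:
            hits += 1
    return hits / trials


# Illustrative parameters: 100 flips, deviation of at least 10 above the mean.
bound = hoeffding_upper_tail(100, 10)
observed = empirical_upper_tail(100, 10)
print(f"Hoeffding bound: {bound:.4f}, observed frequency: {observed:.4f}")
```

The bound is valid but not tight here, which is typical. The dependent and rare-bad-event settings the book emphasizes are exactly the cases where this simple recipe no longer applies directly.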

I don't think this book is for "beginners"; I'd recommend, for instance, my book, which covers all the basic Chernoff-Hoeffding bounds and martingale bounds for people who just need the basics. But if you really need something with a little more power in your analysis, look here. While it's a bit steep at $56.00 at Amazon for a book that comes in under 200 pages (including bibliography and index), I'm sure it will be showing up in the references of some of my papers down the line.

Tuesday, October 13, 2009

SOSP congratulations

One conference I've never had a paper in -- though I'd like to someday -- is SOSP, the Symposium on Operating Systems Principles, one of the flagship conferences in systems. A friend pinged me from there today, so I went to look at the program. Besides learning that Microsoft Research is dominating big systems work, I found a paper co-authored by a number of well known theorists:

Quincy: Fair Scheduling for Distributed Computing Clusters: Michael Isard (Microsoft Research), Vijayan Prabhakaran (Microsoft Research), Jon Currey (Microsoft Research), Udi Wieder (Microsoft Research), Kunal Talwar (Microsoft Research), Andrew Goldberg (Microsoft Research). (pdf)

Congrats to all the authors, and especially Udi, Kunal, and Andrew.

Friday, October 09, 2009

PCing

This week I got my batches of papers to review for NSDI and LATIN. If I'm quiet for a while, I'm busy reading (and writing reviews).

Needless to say, I didn't quite realize I'd get the papers for the two within a couple of days of each other. But it actually seems fine. They're just so different from each other, it's almost refreshing to go from one type of paper to the other.

This will also give me a chance to experience HotCRP and EasyChair "head-to-head". It looks like EasyChair has improved since the last time I used it, but HotCRP still seems easier to use so far.

Wednesday, October 07, 2009

More Harvard Classes Available Online

Thanks to the Harvard Extension School, the lectures for several more Harvard courses have been put online. My understanding is that these are classes taught at Harvard that are also given through the extension school. I suspect my course may end up here too next time it is offered.

The list of courses available right now includes:

Concepts of the Hero in Greek Civilization, by Gregory Nagy and Kevin McGrath
Bits, by Harry Lewis
Intensive Introduction to Computer Science Using C, PHP, and JavaScript, by David J. Malan
Shakespeare After All: The Later Plays, by Marjorie Garber
Organizational Change Management for Sustainability, by John Spengler and Leith Sharp
China: Traditions and Transformations, by Peter Bol and William Kirby
World War and Society in the Twentieth Century: World War II, by Charles S. Maier
Sets, Counting, and Probability, by Paul Bamberg
Abstract Algebra, by Benedict Gross

Monday, October 05, 2009

Job Competitions

Stefan Savage made an insightful comment related to the issue of jobs:

I've long felt that its a fallacy that there exists a fine-grained Platonic ideal of "goodness" for researchers (so too for papers), but its an even bigger fallacy is to expect that decision makers would abide by such a scale even if it existed. In my experience, job offers are job offers, just as paper acceptances are paper acceptances. Trying to analyze such results at a finer or deeper scale is unlikely to reveal many useful truths.

The whole comment, well worth reading, can be found somewhere in here.

There seems to be in the previous comments (mostly from anonymous commenters) the idea that getting a job is like those contests many of us did back in high school -- you get more points than the next person, you get the prize. This idea, in my mind, requires some underlying assumptions. First, that merit can be precisely measured -- if you get a high enough score, you get the corresponding job, and anything else is a failure of the system. Second, merit [for a position at a top research university] corresponds explicitly to quality of research, and again, using other considerations is a failure of the system. (I should point out these ideas are in no way novel; indeed, this argument seems to arise constantly in debates on undergraduate admissions, regarding admission of underrepresented minorities/legacies/athletes and so on.)

I think both assumptions are invalid in the setting of faculty hires. First, even if you think research quality is the sole criterion on which to base a hire, how do you measure it? Number of papers? Number of citations? Practical impact/number of actual users? Convene a panel of experts to assign a score? There can be, and will be, disagreements; in some cases, only the test of time will tell. Of course it's often easy to separate "the top" as a rough equivalence class, but going beyond that to a rank ordering is often difficult, especially when comparing people in even slightly different research areas.

Second, I don't think research output alone is the sole measure for a faculty position. Obviously, there's teaching, advising, and administration to consider, but there are other less tangible issues as well. Joining a faculty is like joining a team, and the question is what person can best help the team -- the quality of a team is not merely the sum of the quality of the individual members. Will the potential hire collaborate with others, fill in an area where the department needs someone, or offer useful leadership? Can they fit into, and enhance, the department culture? And yes, the question of whether this is someone everyone can get along with for a couple of decades also comes to mind. Certainly research quality is a primary consideration -- really the primary consideration -- but most or all of the people brought in for interviews have passed a very high bar for research already, and the other issues can come into sharp focus in the late hiring stages. People might skip such considerations for a suitably good researcher -- I imagine many departments, for instance, would take a Turing award winner, even if the person had a destructive personality, assuming the benefits would outweigh the costs. (I don't actually know of a case like that, but the issue has come up, as a purely theoretical issue, in discussions on hiring in the past.)

This may not be the way some people wish things would work, but it's counterproductive to not recognize that this is the way it generally works -- as Stefan suggests. Further, I strongly suspect that the idea that a pure "merit-based" system, whatever that means in this context, is the universally right approach to faculty hiring is based on assumptions that are faulty in both theory and practice.

[Interestingly enough, I recall a similar topic comes up in the Justice class I posted about before; I'll have to review those lectures!]

Saturday, October 03, 2009

"Core" TCS

Enough time has perhaps passed from Mihai's controversial post to consider, constructively I hope, some of the comments that arose here on this blog from it.

One issue that arose is what a PhD in theoretical computer science should know -- what's the "core" of theoretical computer science? The issue arose as some anonymous commenter claimed to have a PhD in TCS but not know of Voronoi diagrams, range counting, etc. after some other commenter claimed that these were topics one learned as an undergraduate. For the record, I think it's the rare undergraduate that learns about these data structures/algorithms; I learned about them in graduate school, and only because I took the computational geometry course (from the fantastic Raimund Seidel, with the occasional guest lecture from fellow-graduate-student Jeff Erickson).

As TCS expands, so that it's harder and harder to have exposure to everything as a graduate student, the question of what is the TCS core will become more challenging. The same problem, of course, happens at "all levels of the tree" -- what is the core knowledge that a PhD (or undergraduate) in CS should learn across all areas of CS, or the core for an undergraduate in a liberal arts college? Anyone who has served on a faculty committee to deal with this issue knows that this is a challenge -- what is "core" is usually defined as the area the person on the committee is working on. (From my point of view, how could a PhD in TCS not know Azuma's inequality, the fundamentals of entropy, and Bloom filters?... But I am sure there are many that don't.) Arguably, TCS has been small enough until fairly recently that one could describe a core that most everyone knew most of, but I think that's increasingly less true. (I imagine it's been true of mathematics for some time.)

In any case, I think the people who expressed disbelief and dismay that a PhD in theory might not know a fair number of this list of things ("Range counting, predecessor, voronoi diagrams, dynamic optimality, planar point location, nearest neighbor, partial sums, connectivity.") should consider that they might have overreacted -- I believe most PhDs in TCS learn them, but I don't think it's by any means close to mandatory.

This leaves two questions for comments:

1) What should be the "core" of TCS that (almost) all PhDs should be expected to know? This is going to be a moving target, of course, but what topics would you place on it now? [It's not clear to me whether one would want to put specific examples or thematic concepts in the list -- for example, would you put "Azuma's inequality" or simply "probability tail bounds" -- feel free to suggest both.]

2) How do we enforce that this core gets learned? I find increasingly PhDs expect to get right to research, and view classes as a hindrance rather than an opportunity. I have long found this problematic. I personally find that courses are a great way to inculcate core material. After all, it's because of that course in graduate school that I learned about Voronoi diagrams, and they've proven useful enough that they appeared in a paper I co-authored.

Friday, October 02, 2009

Semantic Communication, Madhu Sudan

Madhu Sudan gave a colloquium at Harvard yesterday on his work on Universal Semantic Communication and Goal-Oriented Communication (both with Brendan Juba, the latter also with Oded Goldreich). The papers are available here, and here are the slides (pdf). One of the motivating examples for the work is the following: you're visiting another department for the day, and need to print something out. You have access to a printer, but your machine doesn't speak the same language as it does. So you have to get a driver, install it, and so on, and all of a sudden it takes half an hour to do a simple print job. Why can't the machines figure out how to get the simple job done -- printing -- without this additional work?

More abstractly, one can pose the high-level idea in the following way: Shannon's theory was about the reliable communication of bits, and we've solved a great deal about those types of communication problems. But even if we assume that bits are transmitted correctly over a channel, how can we ensure the meaning of those bits is interpreted properly? In particular, if those bits represent a task of the form, "Please do this computation for me," how do we ensure the other side performs the computation we want done if we don't have a prior agreed-upon semantic framework?

I've seen in various settings criticism of this line of work, which is quite abstract and certainly a bit unusual for CS theory. The original paper is often referred to as "the aliens paper" because it set the question in terms of communicating with aliens (where there may naturally be no shared semantic framework), and my impression is that several people felt it is too far removed from, well, everything, to be of interest. It was, I understand, rejected multiple times before being accepted.

I have to say the impression that this paper is "too far removed" is incorrect based on the reaction at the talk Madhu gave. Part of it may be a change in message -- no talk of aliens, and more talk of novel devices connecting to the Internet makes the problem more tangible. But it was remarkable how many people were interested -- our computational linguist and programming languages faculty seemed really intrigued, and there was also great interest from some of our other systems people. (It helps to be in a department where faculty outside of theory are generally perfectly comfortable seeing PSPACE-completeness and reductions show up on slides -- is that usual, or are we spoiled here? Multiple questions on the details of the definitions were asked by non-theorists...) Many people shared the impression that this was a new way to think about some very challenging problems, and while the work so far is too theoretical to be of practical use -- indeed, it's arguably still at the stage where perhaps the right frameworks or questions aren't entirely clear -- they seemed to view it as a start worth pursuing further.

I think this sort of paper is a rare beast, but perhaps it does serve as an example that a new conference like ICS is needed as an outlet for this kind of work. In particular, it's not clear to me that FOCS/STOC always has a good handle on where theory could be of great interest to, and have a significant impact on, other communities. My complaint generally takes the form that FOCS/STOC (and even SODA) weighs mathematical "difficulty" far, far more heavily than practical utility when judging work in algorithms and data structures, but this seems to be a related issue.

Anyhow, thanks to Madhu for an excellent talk.

Thursday, October 01, 2009

GPU News

Since I've now co-authored a paper on GPUs, I'm now "in-the-loop" (thanks to my co-author John Owens) on the news of NVIDIA's announcement of its "next generation" of GPUs, code-named Fermi. (To be out in 2010? Competitors are Intel's Larrabee and AMD's Evergreen.) Some articles on it are: Ars Technica, the Tech Report, PC Perspective. I'm still trying to figure out what it all means, myself, but it seems like there's a future in figuring out how to do high-performance computing (algorithms, data structures) on GPU-style chips. Expect more workshops of this type (Uzi Vishkin's workshop on theory + multi-core from earlier this year).

Wednesday, September 30, 2009

Justice!

Harvard is putting the lectures (and other materials) online for a fantastic course, Justice, taught by Michael Sandel. It's a class on moral reasoning, exactly the sort of thing you'd hope a college freshman or sophomore would take to get them thinking about how to think. For years, it has been one of the most popular courses at Harvard. It is billed as "the first course Harvard has ever made available to everyone, online and on the air." (The lectures will, I understand, also be appearing on public television.)

I can vouch for the class, since I indeed took it as a sophomore, some large number of years ago. I can also vouch for it in that I've watched the first lecture online, and the production quality is extremely high. Funnily enough, the first lecture was exactly as I remember it 20 years ago -- the script hasn't changed that much. (If you watch the lecture, you'll see the examples are quite memorable -- I honestly do remember them from when I took the class. But I won't spoil them for you here.) I'm going to watch all the lectures, and see how the class stands up after all these years. I hope you'll join in for some of the fun.

Tuesday, September 29, 2009

Blog Posts of the Day

A blog post worth reading is Mihai Patrascu's post on, essentially, coming in second, if only for the chance to play armchair psychologist and try to deconstruct Mihai based on his blog posts. Of particular interest to me was his reaction to being offered a job at UCSD as the second-choice candidate -- an offer which he turned down, and apparently would have taken if offered first.

This is interesting to me because this very issue came up in our last search (which I was leading), where we ended up making 6 offers (and got 3 acceptances). We (the hiring committee) recognized that we were making a rather significant request to have 6 simultaneous outstanding offers. We also recognized the dangers in trying to sequentialize these offers. First, there was the internal danger -- the complex discussions (we had such a great committee, we wouldn't have argued) we would have had to undertake to rank-order everyone we wanted to make an offer to. And second, there's the external danger that the candidate -- who will, of course, find out they were the "second choice" -- takes the ordering as a negative signal and becomes more inclined to take another offer. One can argue whether or not a candidate should take such an ordering as a signal, or whether such a reaction is a purely emotional response. (See Mihai's post, for example, and judge for yourself in that case.) But it was clear to us that, even if no such signal was intended, there was a high risk that would be the interpretation from the standpoint of the candidate.

Mihai's post provides a solid empirical data point that we were right to have this concern; it's something I will keep in mind (and if necessary point to) in future hiring discussions. I'm glad we were able to make 6 simultaneous offers, and give all of the candidates we made offers to the right signal.

Something not worth reading is Kevin Carey's article in the Chronicle of Higher Education, where he seems to be saying that Harvard should be doing more for undergraduates and in particular admitting more undergraduate students. It's so bad, I can't bring myself to link to it. Without judging the point of criticism, I've pointed out that his rant is pretty much devoid of an actual argument; if you care (and really, I'd suggest reading Mihai's stuff first!), you can see that I'm somehow now embroiled in an argument with him here and here.

Saturday, September 26, 2009

UC President Interview

Luca Trevisan points to this NY Times Magazine interview with UC president Mark Yudof. Is it just me, or is this guy just completely tone deaf to the current situation in the UC system? If I were a faculty member, or student, in the UC system, this would do the opposite of inspire me. (I checked with my wife, as I often do in such situations, to check my reading -- she had a similar reaction to the piece.) In fact, I'm trying to think of the last time I heard of an administrator that seemed so out of touch... oh, wait, that's right, I can think of that...

Thursday, September 24, 2009

Extending the Sketching of Sketches Result

I'm putting online a paper with Zhenming Liu and Kai-Min Chung (both graduate students at Harvard) that extends one of the results from Declaring Independence via the Sketching of Sketches by Indyk and McGregor (SODA 2008).

They consider a model where a stream of pairs (i,j) in [n]^2 arrives, giving a joint distribution (X,Y), and the goal is to determine how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close X and Y are to being independent. All the normal goals in the streaming setting hold (small space, small computation per new pair, one or a small number of passes, etc.). We extended one of their main results to higher dimensions (a problem they had left open): the stream is now k-dimensional vectors in [n]^k, and we wish to approximate the L_2 distance between the joint distribution and the product of the marginal distributions in a single pass. We give a randomized algorithm that is a (1 ± epsilon)-approximation (with probability 1 - delta) that requires space logarithmic in n and the stream size m and proportional to 3^k. The paper should be of interest to those who know and like the original Indyk/McGregor paper, which I assume is a significant part of the streaming community.
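To give a feel for the two-dimensional setting, here is a rough sketch of the quantity being approximated and of a sign-sketch style estimator for it. This is a simplification for illustration only, not the algorithm from either paper. It assumes indices run from 0 to n-1, it stores fully random ±1 vectors explicitly (a real small-space algorithm would generate them from limited-independence hash functions), and it averages repetitions rather than using a more careful combination. The function names are hypothetical.

```python
import random
from collections import Counter


def exact_l2_squared(stream):
    """Exact squared L_2 distance between the empirical joint distribution
    and the product of its marginals (uses space linear in the support)."""
    m = len(stream)
    joint = Counter(stream)
    left = Counter(i for i, _ in stream)
    right = Counter(j for _, j in stream)
    total = 0.0
    # Cells outside the product of the marginal supports contribute zero.
    for i in left:
        for j in right:
            diff = joint[(i, j)] / m - (left[i] / m) * (right[j] / m)
            total += diff * diff
    return total


def sketch_estimate(stream, n, repetitions=200, seed=0):
    """Average of single-pass sign-sketch estimates of the same quantity."""
    rng = random.Random(seed)
    m = len(stream)
    estimates = []
    for _ in range(repetitions):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        y = [rng.choice((-1, 1)) for _ in range(n)]
        t_joint = t_left = t_right = 0
        for i, j in stream:  # one pass, three counters per repetition
            t_joint += x[i] * y[j]
            t_left += x[i]
            t_right += y[j]
        # Sketch of the joint minus the product of the marginal sketches.
        gap = t_joint / m - (t_left / m) * (t_right / m)
        estimates.append(gap * gap)
    return sum(estimates) / len(estimates)


# Toy streams (assumed): one perfectly dependent, one perfectly independent.
dependent = [(0, 0), (1, 1)]
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The reason this works is that the sign of cell (i,j) is x[i]*y[j], so the sketch of the product distribution is just the product of the two marginal sketches, and the squared gap has expectation equal to the squared L_2 distance. Controlling the variance of that estimate is where the real work is.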

The Indyk/McGregor proof required a clever application of the Arithmetic Mean-Geometric Mean inequality. To move the proof up into higher dimensions, we end up having to use the AM-GM inequality twice, along with a clever partitioning of the objects (edges in [n]^k) to get the right constant factor 3^k out. It also seems that this is the "right" dependence on k, at least for this algorithmic approach. (A previous result by Braverman/Ostrovsky also tackled this problem, but had a 2^O(k^2) dependence on k.)

We submitted this to SODA, where it didn't get in, but seemed to be on the borderline. The reviews were on the whole quite reasonable. My take is that in this paper, we (by which, of course, one should read "Kai-Min and Zhenming") took a nice previous SODA result, presented a non-trivial extension, and really wrote it up very clearly and nicely -- I'd like to think we really clearly explained the essence of the argument and why it works. The reviewers, I think, recognized this, but on the other hand had to deal with the issue that it was just a single result. The original paper had more results, and, at the risk (or hope) of instigating commentary, I'll suggest that SODA as a whole is more interested in quantity than in writing quality; a hodgepodge of lemmas and theorems stands a better chance of getting in than a single well-written idea. (The original Indyk/McGregor paper, I should be clear, has both a number of results and excellent ideas, and is not what I'm referring to in this statement.) I understood going in that this would probably be borderline for SODA; I had hoped it would make it in, but this is certainly not a case where I would complain about the decision or the review content by the PC. (One might argue that perhaps SODA should take more papers, and in particular seek out more papers of this type, but that's a different, higher-level argument.)

There are other conferences we could submit to, but the conferences with upcoming deadlines didn't seem particularly suitable. (In particular, they weren't in the US, and would require either extensive travel for me, or visa issues for the students.) Since the result seemed to us essentially done as it was, we're submitting to a journal. But it's available now for anyone who is interested.