Thursday, March 06, 2008

What is the Purpose of Program Committees?

The subject of Program Committees and conferences seems to be a timely one; besides my recent posts, here is an interesting post on the theme by Mihai Patrascu, and some counter-thoughts by Daniel Lemire. Here are some more thoughts.

It's actually important that, as a community, we have a good and relatively consistent story about what conferences are for, for many reasons. Funding of conferences, certainly. So students know the rules of the game coming in. So we all know how publications in conferences should, or should not, affect hiring decisions.

As a practical matter, it is also useful to have a reasonably consistent story for specific conferences about what their goals are, so that the Program Committee can perform its function appropriately. A reasonable question is why have PCs at all? Many other fields don't.

When I'm on a PC, I think my primary job is to prioritize the papers to help determine which ones make it in. In a way, it feels somewhat depressing that this is (in my mind) the main job of the PC. I do believe quality control and helping guide the direction of the community are important jobs, and that this is a powerful method for both of these things. But there is, in the end, always a non-trivial bit of arbitrariness, if you have 60 good papers for 40 slots (or if you have 20 good papers for 40 slots), around the boundary. [Joan Feigenbaum has suggested to me that we should be much more explicit about this as a community; otherwise (and these are my thoughts, not Joan's), we start finding the false notion that conferences are to be perfectly fair and essentially correct in their decisions, a standard which is impossible to reach and leads to time-wasting measures like endless PC discussions and, shudder, rebuttals for authors.]

I also think my secondary job is to offer what feedback I can to the authors. But really, there isn't sufficient time for detailed criticism, given the way theory PCs are set up. I once told an AI person I was working with that I was on a PC and had 50 papers to read, and he couldn't believe it. Apparently for AI PCs something like 10-20 papers is the norm, and 20 would be considered high. If we're going to make feedback a higher priority in the role of the PC, we're going to have to increase PC sizes dramatically, and restructure how they work. The way they're set up now, there's hardly time to read all the papers, never mind read them in sufficient detail to offer significant constructive suggestions. (That's what peer-reviewed journals are supposed to be for.)

With this in mind, I'll also throw out two wacky ideas that I'd like to see conferences try.

1) Instead of numerical scores, each PC member just gives a ranking of the papers they've read. Then use some ranking algorithm to give a first cut of where papers fall (instead of numerical averages, like PCs use now). I think this would reduce arbitrariness, since the variance in how people assign numerical scores would disappear, but it would take an experiment to tell.
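
The post leaves "some ranking algorithm" open. As one possible illustration, here is a minimal sketch using a normalized Borda count, which is my choice for the example and not something the idea depends on. The function name, the reviewer names, and the paper ids are all hypothetical. Each PC member ranks only the papers they read, so positions are rescaled to the range 0 to 1 before averaging.

```python
from collections import defaultdict


def aggregate_rankings(rankings):
    """Combine partial rankings into one ordering via a normalized Borda count.

    `rankings` maps a reviewer to their list of paper ids, best first.
    Reviewers may have read different (overlapping) sets of papers.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings.values():
        m = len(ranking)
        for position, paper in enumerate(ranking):
            # 1.0 for a reviewer's top paper, 0.0 for their bottom paper.
            # A single-paper ranking carries no order information, so use 0.5.
            score = 0.5 if m == 1 else (m - 1 - position) / (m - 1)
            totals[paper] += score
            counts[paper] += 1
    means = {paper: totals[paper] / counts[paper] for paper in totals}
    # Best mean score first; ties broken by paper id so the output is stable.
    return sorted(means, key=lambda p: (-means[p], p))


# Hypothetical data: three reviewers, four papers, partial overlap.
rankings = {
    "alice": ["P1", "P3", "P2"],
    "bob": ["P3", "P1", "P4"],
    "carol": ["P2", "P4", "P1"],
}
first_cut = aggregate_rankings(rankings)
print(first_cut)  # ['P3', 'P1', 'P2', 'P4']
```

Note that P1 and P2 tie here and are separated only by the arbitrary id tie-break, which is a small reminder that a ranking scheme moves the arbitrariness around the boundary without removing it.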

2) Rather than assign each paper to three people for a detailed review, initially assign each paper to five (or more) people for a quick Yes/Maybe/No vote, and chop off the bottom 50% (or whatever the right percentage is). My idea is that, statistically speaking, a larger number of less accurate votes is as accurate as, or more accurate than, a small number of more accurate votes, accurate enough that we can pre-process the bottom 1/2 or more and then spend more time on the quality papers. The negative of this is that the bottom 1/2 would necessarily get even less feedback than they do now. (I think I heard something like this idea was used in a networking conference; in my limited experience, networking PCs are much more ruthless than theory conferences about quickly finding and putting aside the bottom 1/2 or more of the papers to focus on the good ones.)
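
The statistical claim can be checked in a toy model. The sketch below simplifies the Yes/Maybe/No vote to a binary keep/cut decision, assumes reviewers err independently, and uses made-up accuracy figures (70% for a quick read, 75% for a careful one); none of these numbers come from a real PC. It computes the exact probability that a majority vote gets a paper right.

```python
from math import comb


def majority_accuracy(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is right.

    Assumes n_voters is odd (no ties) and that each voter is correct
    independently with probability p_correct.
    """
    need = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )


# Hypothetical accuracies: quick votes are noisier than detailed reviews.
quick = majority_accuracy(5, 0.70)
careful = majority_accuracy(3, 0.75)
print(f"five quick votes:    {quick:.3f}")    # 0.837
print(f"three careful votes: {careful:.3f}")  # 0.844
```

Under these assumptions the two setups land within a percentage point of each other. The independence assumption is doing a lot of work: if quick readers tend to make the same mistakes, adding more of them helps less than the formula suggests.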

8 comments:

David said...

But there is, in the end, always a non-trivial bit of arbitrariness

Ken Arrow agrees with you.

Isabel Lugo said...

Alternatively to #1, let people assign numerical scores (since it sounds like that's what they're used to), but then normalize them to z-scores -- i.e., measure how many standard deviations X's score for a certain paper is above the mean of the scores X gave. This allows people to communicate the relative sizes of gaps between papers, but in the end everybody's ratings have roughly the same effect due to the normalization.
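
For concreteness, a minimal sketch of that normalization follows. The function name and the sample scores are hypothetical, and using the population standard deviation (rather than the sample one) is an assumption made for the example.

```python
from statistics import mean, pstdev


def z_normalize(scores_by_reviewer):
    """Convert each reviewer's raw scores to z-scores within that reviewer.

    `scores_by_reviewer` maps reviewer -> {paper id: raw score}.
    """
    normalized = {}
    for reviewer, scores in scores_by_reviewer.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation (an assumption)
        normalized[reviewer] = {
            # A reviewer who gave every paper the same score expresses no
            # preference, so map all of their scores to 0.
            paper: 0.0 if sigma == 0 else (score - mu) / sigma
            for paper, score in scores.items()
        }
    return normalized


# Hypothetical data: a harsh reviewer and a generous one who agree on order.
raw = {
    "harsh": {"P1": 2, "P2": 4, "P3": 6},
    "generous": {"P1": 7, "P2": 8, "P3": 9},
}
z = z_normalize(raw)
print(z["harsh"])
print(z["generous"])
```

The two reviewers use very different parts of the scale, but after normalization their scores for each paper coincide, since both spaced the three papers evenly.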

This might be silly -- I don't know enough about the process to know whether it makes sense.

Also, re #2, it sounds a lot like how a lot of admissions processes work.

Daniel Lemire said...

Here is an algorithm.

1) Ask reviewers to identify which papers are wrong.

2) Discard them.

3) Choose randomly x papers among the "not obviously wrong papers".

I bet it would work surprisingly well.

rgrig said...

Ranking has its problems.

Anonymous said...

But there is, in the end, always a non-trivial bit of arbitrariness, if you have 60 good papers for 40 slots (or if you have 20 good papers for 40 slots), around the boundary.

It would be nice if the last 10-15 spots in a conference were chosen at random among the remaining top 30 unselected papers. This would make it explicit that there is randomness in the process and reduce bias around the area where it can play a larger role (after all, a really good paper will get in, bias or no bias).
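
A minimal sketch of this selection rule, assuming the PC has already produced a single ranked list of submissions. The function name, parameter names, and the sizes in the example (40 slots, 10 of them drawn from a pool of 30) are hypothetical choices consistent with the numbers above.

```python
import random


def select_with_lottery(ranked_papers, n_slots, n_lottery_slots, pool_size, seed=None):
    """Accept the top papers outright, then fill the last slots by lottery.

    `ranked_papers` is ordered best first. The top (n_slots - n_lottery_slots)
    papers are accepted directly; the remaining slots are drawn uniformly at
    random from the next `pool_size` papers.
    """
    rng = random.Random(seed)
    n_sure = n_slots - n_lottery_slots
    sure = ranked_papers[:n_sure]
    pool = ranked_papers[n_sure:n_sure + pool_size]
    # Sample without replacement; guard against a pool smaller than requested.
    drawn = rng.sample(pool, min(n_lottery_slots, len(pool)))
    return sure, drawn


# Hypothetical example: 100 submissions, 40 slots, last 10 chosen from the next 30.
ranked = [f"P{i}" for i in range(1, 101)]
sure, drawn = select_with_lottery(ranked, n_slots=40, n_lottery_slots=10, pool_size=30, seed=0)
print(len(sure), "accepted outright;", len(drawn), "accepted by lottery")
```

Passing a fixed `seed` makes the draw reproducible, which might matter if a committee wanted the lottery to be auditable.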

Anonymous said...

To #3, most papers are not obviously wrong. A number of papers are always obviously correct, for example, because they are trivial, perhaps following from a previous result that conveniently was not cited.
I think your suggested algorithm would work quite poorly, because there are always some papers that stand out above the others.

To #5, arbitrariness is not the same as randomness. The program committee uses its judgement to put together the best conference it can with the submissions it has. Your first motivation, "[making] it explicit that there is randomness in the process" is not an advantage at all. Your second motivation makes more sense.

David Molnar said...

The Principles of Programming Languages (POPL) committee this year had an A,B,C,D system. From what I understand as a submitter and a non-PC member, A means "I will champion," D means "I will anti-champion," and the middle grades state a preference but not a willingness to argue for or against. The idea is that this allows discussion to focus mainly on papers that need discussion: those which have both a champion and an anti-champion. Mostly positively reviewed papers (all As and Bs) are provisionally accepted, mostly negative (all Cs and Ds) are rejected. I don't know how the POPL PC felt about this, but the system made sense to me reading about it later.
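
The triage rule as described above can be sketched in a few lines. This is only an illustration of the description, not the actual POPL procedure; the function name and the category labels are hypothetical, and the handling of mixed B/C papers with no champion or anti-champion is an assumption, since the description does not cover that case.

```python
def triage(grades):
    """Classify one paper from its reviewers' A/B/C/D grades."""
    if not grades:
        raise ValueError("a paper needs at least one grade")
    g = set(grades)
    if not g <= {"A", "B", "C", "D"}:
        raise ValueError(f"unexpected grades: {sorted(g)}")
    if "A" in g and "D" in g:
        # Both a champion and an anti-champion: this is where discussion goes.
        return "needs discussion"
    if g <= {"A", "B"}:
        return "provisional accept"
    if g <= {"C", "D"}:
        return "provisional reject"
    # Assumption: mixed grades without an A/D clash are left for the PC to decide.
    return "unresolved"


print(triage(["A", "B", "B"]))  # provisional accept
print(triage(["A", "D", "B"]))  # needs discussion
print(triage(["B", "C"]))       # unresolved
```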

In my own reviewing/PC discussions, I do find the ranking helpful, as well. There's always a grain of salt, though, because papers can bunch up around specific places in the ranking. That is, my #2 and #3 are probably a lot closer to each other than the #8 and #10. I'm not sure what to do about this other than note it in the comments to the PC.

Michael Mitzenmacher said...

David,

I like the A,B,C,D scheme. In a similar vein, I really like the scheme from the year I did SIGCOMM (which I think is still in use -- and I think I'm remembering correctly):

Bottom half
Top 25 to 50 percent
Top 10 to 25 percent
Top 5 to 10 percent
Top 5 percent

This similarly gets the focus on where there are sharp disagreements, and allows easy provisional accepts/rejects.