Wednesday, February 27, 2013

Discussing STOC 2013 PC with Joan Feigenbaum

Joan Feigenbaum is the Program Committee Chair for STOC 2013, where paper decisions were recently announced; I served as part of the Executive Committee. Joan did an excellent job running the entire process, and experimented with a "two-tiered" PC. We agreed that it would be interesting to talk about her experience on the blog, and she agreed to answer some questions I posed. We hope you'll find the discussion interesting.

1.  You're now completing your stint as Program Committee Chair for STOC 2013.  How do you think the program looks? 
I think it looks great.  We had roughly 20% more submissions than last year, and many of them were excellent -- an embarrassment of riches.  Once we decided to stick with the recent STOC practice of a three-day program with two parallel tracks of talks, we were faced with the usual problem for STOC PCs, namely having to reject many clearly acceptable submissions.  I guess that's a much better problem to have than an insufficient number of clearly acceptable submissions, but I still have reservations about this approach to conferences.  (There's more on that in my answers to questions 2 and 5 below.)

2. You tried a number of new things this year -- a "two-tiered" PC being the most notable.  How do you think it worked?  Where do you think it improved things, and where did it not work as you might have hoped?
When Lance Fortnow, SIGACT Past Chair, asked me to be the Program Chair for STOC 2013, he strongly encouraged me to "experiment" and, in particular, strongly encouraged me to try a two-tiered PC.  I agreed to do so, but it was a strange "experiment" in that it was not clear to me (or to anyone, for that matter) what problem a two-tiered PC might solve.  There was no hypothesis to test, and the whole exercise wasn't a controlled experiment in any well defined sense.  Nonetheless, I was able to reverse engineer my way into some potential advantages of a two-tiered PC and hence some good reasons for trying it.
      Before I get into those reasons, however, I should state the primary conclusion that I drew from this experience: Given the extraordinarily high quantity and quality of STOC submissions, it's extremely easy to put together a good program, and any reasonable PC structure will do.  That is, assuming that you don't want to change the nature of the product (where the product is a three-day, two-track STOC that has a fairly but not ridiculously low acceptance rate), you have a lot of latitude in the program-committee process that you use to produce it.  There's nothing sacred about the "traditional," 20-person PC with one chair and no PC-authored submissions; there's nothing definitively wrong with it either.
      Now what did we try this year, and what were some of its potential advantages?  First of all, we briefly considered changing the product, e.g., by having three parallel sessions, but decided against it; we set out to put together a STOC program that was similar in quality and quantity to other recent STOC programs but to do so using a different process.  We had an Executive Committee (EC) of nine people (including me) and a Program Committee (PC) of 62 people.  PC members were allowed to submit, but EC members were not.  The job of the PC was to read the submissions in detail and write reviews, and the job of the EC was to oversee and coordinate the reviewing process.  For example, EC members reassigned submissions that HotCRP had assigned to inappropriate reviewers, looked for submissions that required extra scrutiny because they might have subtle technical flaws, and, most importantly, looked for pairs of submissions that were directly comparable and needed to have at least one reviewer in common.  In order to promote high-quality reviews (which I thought should be attainable, because each PC member had fewer submissions to review than he would have in a traditional PC), I put together a list of suggested review questions and regularly reminded PC members to flesh out, revise, and polish their reviews based on committee discussions.  We made accept/reject decisions about a hefty fraction of the submissions fairly early in the process, based on two reviews of each submission.  For the rest of the submissions, we got additional reviews or asked the original two reviewers to consider them in more detail or both; for each set of comparable submissions that survived the first cut, an EC member conducted an online "meeting" (using both email and HotCRP comments) of all of the reviewers of submissions in the set.
      One potential big advantage of this way of doing things over the traditional way is that PC service can be much less burdensome.  Each PC member can review far fewer submissions than he would for a traditional program committee and can also submit his own papers.  He can devote considerably more time and attention to each submission assigned to him and still wind up spending considerably less total time and effort than he would under the old system.  He's also less likely to have to review submissions that are outside of his area(s) of expertise, because there are many more PC members to choose from when finalizing assignments.  The hope is that almost everyone in the theory community will be willing to serve on a STOC PC when asked if the workload is manageable, that PC members will be more satisfied with the quality of their work if they can spend more time on each submission and don't have to review submissions outside of their area(s), and that authors will get higher quality reviews.
     A second potential advantage is that the managerial and oversight responsibilities can be shared by the entire EC and don't all fall on the chair.  In almost every traditional program committee I've served on (not just STOC committees), there has been a great deal of last-minute scrambling.  In particular, I've been in many face-to-face program-committee meetings at which we discovered that various pairs of papers needed to be compared but had been read by disjoint sets of reviewers.  That's not surprising, of course, when everyone (except the chair) had spent the previous few months trying to read the 60 submissions assigned to him and hence hadn't had a minute in which to at least skim all of the other submissions.  These relationships among submissions can be discovered early in the process if there are enough people whose job it is to look for them.  Having an EC that can facilitate many parallel, online "meetings" about disjoint sets of gray-area submissions is also a big win over a monolithic face-to-face program-committee meeting.  The latter inevitably requires each PC member to sit through long, tense discussions of submissions that he hasn't read and isn't interested in; our procedure enabled everyone to participate in the discussions to which he could really make a contribution -- and only those.
     I think that most of these hoped-for improvements actually materialized.  Certainly almost everyone whom I invited to serve on the PC said yes, and many said explicitly "OK, I'll do it because the workload looks as though it won't be crushing," or "I really appreciate the opportunity to submit papers!"  Similarly, we had no last-minute scrambling, and I attribute that to the oversight work done by the EC.  All of the potential technical flaws in submissions that we discovered were discovered early in the process and resolved one way or the other (sometimes with the help of outside experts); similarly, all of the pairs of submissions that, by the end, we thought should be compared were assigned to common reviewers early in the process.
      Unfortunately, the effect of the lower workload on quality of reviews was disappointing.  There was some improvement over the reviews produced by traditional STOC PCs but not as much as I had hoped for.

3. In my experience, our major PCs -- STOC and FOCS -- have small amounts of institutional memory and even smaller amounts of actual analysis of performance.  What data would you like to have to help evaluate whether the PC process went better this year?
For this year, I'd like to hear from PC members whether they did in fact spend less time overall but more time per submission than they have in the past on "traditional" PCs.  I'd also like to know whether they found the whole experience to be manageable and unstressful (if that's a word) enough to be willing to do it often, by which I mean significantly more often than they'd be willing to serve on traditional PCs.  Finally, I'd like to know whether the opportunity to submit papers was a factor in their willingness to serve and whether they found it awkward to review their fellow PC members' submissions.
      If future PC Chairs continue to experiment with the process or even with the product, as I suggest that they do in my answer to question 5 below, then I hope they'll capture their PC members' opinions of the experimental steps they take.

4. Are there things you did for the PC that you would change if you had to do it again?
Because the goals of this "experiment" were so amorphous, I and the rest of the EC members made up a great deal of the process as we went along.  If I were to run this committee process again, I would start by creating a detailed schedule, and I would distribute and explain it to the entire PC at the beginning of the review process.  I'd also lengthen the amount of time PC members had to write their first round of reviews (used to make the "first-cut" accept/reject decisions) by a week or two.  I'd also assign second-round reviewers at the beginning, rather than waiting as we did until after the first round of decisions had already been made; we wound up losing a fair amount of time while we figured out whom to ask for additional reviews, and I suspect that many PC members wound up losing interest during this down time.  So each submission would still receive just two reviews in the first round, but third (and perhaps fourth) reviewers would have their assignments and be ready to start immediately on all submissions on which early decisions weren't made.

5. Are there things you would strongly recommend to future PC chairs?
I hope that the theory community as a whole will consider fundamental changes to the form and function of STOC.  As I said in my answer to question 2, if we want to continue producing the same type of product (a three-day, two-track conference with an acceptance rate somewhere between 25% and 30%), then there are many PC processes that would work well enough; each PC chair might as well choose the process that he or she thinks will be easiest for all concerned.  The more interesting question is whether we want to change the product.  Do we want more parallel sessions, no parallel sessions, different numbers of parallel sessions on different days, more invited talks, more papers but the same number of talks (which could be achieved by having some papers presented only in poster sessions), or something even more radical?  What do we want the goals of STOC to be, and how should we arrange the program to achieve our goals?
     The community should discuss these and other options.  We should elect SIGACT officers who support experimentation and empower future PC Chairs to try fundamentally new things.
      More specifically, I recommend that future PC chairs include, as we did, a subcommittee whose job it is to oversee the reviewing process rather than actually to review submissions; in our case, this oversight function was the responsibility of the executive "tier," but there might be other ways to do it.  As I said in my answer to question 2, giving oversight and management responsibility to more people than just the PC Chair really helped in uncovering problems early and in making sure that related submissions were compared early.
      Finally, I'd of course recommend that future PC chairs not make the same mistakes I made -- see my answer to question 4.

6. In my experience, the theoretical computer science community is known for comparatively poor conference reviewing.  Having been PC chair, do you agree or disagree?  Do you think the two-tiered structure helped make for better reviews? Do you have any thoughts on how to make reviewing better in the future?
In my experience, reviews on submissions to theory conferences range enormously in quality.  The worst consist of just a few tossed-off remarks and the best of very clear, well thought out, constructive criticism.  As I said in my answer to question 2, I had hoped that the two-tiered PC and its concomitant lighter reviewing load (together with my suggested review questions and regular prodding) would lead to a marked improvement in the quality of reviews, but we got only a small improvement.  I was extremely disappointed.  Frankly, I don't know what the theory community can do about review quality.  Maybe we should start by discussing it frankly and finding out whether people really think it's a problem.  If most people don't see it as a serious problem, then perhaps we don't have to do anything.

7. As you know, I'm a big fan of HotCRP.  How did you like it?
I've used three web-based conference-management systems: HotCRP, EasyChair, and Shai Halevi's system (the name of which I don't remember).  In my experience, they're all reasonable and certainly capable of getting the job done, but none of them is great; HotCRP is the best, but not by a wide margin.  Part of my problem was that I had unrealistic expectations going in.  I'd been told that HotCRP was almost infinitely flexible and configurable, and I thought that it would be easy to set things up exactly as I wanted them; that turned out not to be true.  On the other hand, if you use HotCRP exactly as it was designed to be used, it works quite well.  I have the feeling that it is a "system builder's system" in that it's very powerful and very efficient but not all that easy on users; the UI is not great.  Anyway, you and I do agree on one thing: HotCRP's "tagging" feature is amazing; PCs of all shapes and sizes should make heavy use of it.

43 comments:

Anonymous said...

I have heard that the PC was encouraged to perform reviews themselves and not seek external reviewers. Is this accurate?

I think such encouragement could be dangerous. For example, the few experts in the area might not be on the PC (or might be on it but have COIs), and the PC reviewers might then make mistakes. Indeed, I know of at least one paper in this program whose results follow from previous work, and a scan of the PC, despite its being larger than normal, shows no experts in the relevant research area (any expert external reviewer would hopefully have seen this immediately).

One issue is that it is hard to gather statistics about incidents like the above (one example does not amount to a trend), since people are generally hesitant to point out such papers and potentially create bad relations with colleagues. Still, I think there are large benefits not only to allowing external reviewers but to strongly encouraging their use, to ensure high-confidence, expert evaluations of submissions.

Anonymous said...

I found that the reviews of my papers (both accepted and rejected) contained much more detail than my previous STOC/FOCS submission reviews. While I may still disagree with some specific comments, I appreciate the dedicated effort of the PC members and, in particular, of the PC chair, who ran all these experiments in order to improve STOC/FOCS. (Somewhat to my disappointment, the upcoming FOCS reverts to the one-tier PC system.)

Anonymous said...

One thing not addressed in this post is the submission format. We heard quite a lot of noise about it. One thing I don't understand is why so many people complained that no one would read the double-column format. As far as I know, Science, Nature, and PNAS all use double columns.

Either single or double column is fine with me. My point is simple: we should keep the submission format the same as the proceedings format. If the proceedings one day adopt a single-column format, so be it.

Anonymous said...

I don't think we can change the problem of poor reviewing quality overnight. Receiving poor reviews for many years sets certain precedents and expectations in people's minds. When the same people write reviews, this is what they expect because these are the reviews that they have always seen!

Improving reviewing quality will be a long process, and making life easier for PC members is a step in the right direction. I would also suggest providing reviewers with detailed guidelines on how to write a review, including examples of what constitutes good and bad reviews. Finally, instituting a best reviewer award might help; each review could be rated on a 1-5 scale by the authors, and the reviewer who gets the most points would get an award.

Anonymous said...

"...20-person PC with one chair and no PC-authored submissions; there's nothing definitively wrong with it either."

I was a bit surprised by this comment. Having a single person read 40-60 submissions for something as career changing as a STOC acceptance seems rather wrong to me.

Aside from increasing the noise, it leads to a timid reviewing process. Time constraints mean that incremental improvements on well-known problems are favored over deeper research that takes longer to digest.

The reluctance to change the program size is also very surprising. The percentage of papers in the field that STOC captures has gone down by an order of magnitude (if not two) within our lifetimes, yet for some reason forestalling this decline is considered the "radical" choice. Going to a three-session conference would put us back roughly to where STOC was in 1999.

Jeff Erickson said...

I can answer Joan's questions for PC members:

Yes, I spent significantly less time overall but significantly more time per paper.

Yes, I found the whole experience significantly less stressful than a traditional PC.

No, the opportunity to submit was not a factor in my decision to serve.

I don't think I reviewed any submissions from other PC members, but in light of HotCRP's handling of conflicts of interest, I would not have found it awkward.

The only thing that was really awkward was the discussions about whether a particular paper should really be accepted given the competitiveness of this year's submissions. Because I only reviewed a dozen papers (and looked briefly at maybe twice that many), and discussions were limited to individual papers (or sometimes pairs), it was essentially impossible to have the global view necessary to make that judgement.

But this is a relatively minor issue. The program would still have been excellent if all the final-round decisions had been made by independent coin flips.

Michael Mitzenmacher said...

I'm enjoying all of the comments; I especially appreciate Jeff's, as I know Joan is hoping to see feedback from the PC on these issues. Of course, you can always mail her directly, but if you feel like commenting on your PC experience as Jeff did, it's very welcome.

(Of course, comments from everyone else -- those who submitted papers, or people in other communities who want to relate their PC experience to this discussion -- remain eagerly welcome as well!)

Giorgos said...

I am not sure if that was the case for STOC, but one thing that strikes me as odd is that I often have to rate my own reviews by providing some confidence level. I find this rather useless, and I wonder if it ends up making any difference. (Why would I volunteer to write a review if my confidence is low in the first place?)

Instead, to improve review quality, I think reviewers should rate each other's reviews, possibly revealing these scores only to the EC/PC chair. This would both incentivize better reviewing and provide useful feedback to the EC/PC chair. Not to mention that it would be very little extra work for reviewers who have just spent a few hours each reading the paper in question.

Anon 11:05 AM: I think authors rating reviews won't work: it's done too late in the process to make a difference, and I guess it would be severely biased. Would the reviewer of a rejected paper ever win the award you are proposing?

Anonymous said...

After serving on many committees over the years, I've been on two or three where the number of submissions was below the historical average. As a consequence, the PC load was lower than expected. In all cases the quality of the reviews by the PC members was invariably much better, and the papers were discussed in much greater depth.

Joan Feigenbaum said...

Thanks to all of you who have posted these great comments. I will read them once a day and post replies to all of those that pose questions I can answer or otherwise cry out for a direct reply.

Anonymous at 9:52 AM: PC members were not allowed to delegate unconditionally to "subreviewers," but they were not discouraged from asking students, colleagues, and outside experts for input. In fact, there were some submissions on which we were totally dependent on outside experts. It turns out that the scope of STOC is so broad that even 62 very highly credentialed PC members don't collectively have all of the expertise that's needed.

Anonymous at 10:42 AM: I'm glad that you got some detailed reviews! I wish that everyone had. Still don't know how to accomplish that, though.

Anonymous at 10:50 AM: Noise indeed. Everyone (literally!) who submitted obeyed the submission instructions. The STOC 2013 Executive Committee agrees with you that the most straightforward thing to do is to use precisely the same page limit and format for the submissions as we use for the proceedings. Many other CS communities do the same thing; it's not a barrier to entry for anyone who really wants to submit to STOC. So the real question is what the page limit and format should be for our proceedings. ACM told me that there's a one-column format in the works and that it will be suitable for reading on screens, including tablet screens and small screens. Hallelujah! I hate two-column format, and it's a vestige of the print era; we'll be rid of it soon. As far as the number of pages goes, I think we should preserve a clear distinction between conference publications and journal publications; the former should clearly be "extended abstracts" (and hence have strict and fairly small page limits), and the latter should be full papers (and hence not limited wrt number of pages). This should be part of the general discussion about the form and function of STOC that I advocated in my answers to Michael's questions.

Anonymous at 11:16 AM: I had a very weak interpretation of "nothing definitively wrong" in mind, to wit: If all we want to do is continue producing three-day, two-track conferences with 20-minute talks each one of which corresponds to one high-quality proceedings contribution, then we can stick with traditional PCs if we want to. We can also use two-tiered PCs like the one I used, and we can probably follow lots of other committee procedures as well. There's a plethora of good submissions, and it's easy to choose a good subset. Each PC Chair should just follow the committee procedure that he or she likes best. The interesting question is whether all we want for the future of STOC is traditional three-day, two-track programs of 20-minute talks. I hope we'll consider many other possibilities.

Jeff E.: THANKS!

Joan Feigenbaum said...

Anonymous at 11:16: I forgot to reply to your comment about three parallel tracks vs two. As I said in my answers to Mike's questions, I'm strongly in favor of experimentation with the format of the conference, including the number of parallel sessions. I think that some of the STOC 2013 EC members agree with me about that. We simply thought that it would be cleaner to experiment with exactly one of the product or the process but not both simultaneously. Future program committees and their chairs should conduct more experiments. We may need a groundswell of demand for more experimentation; there's a very conservative contingent among those who wield power in the theory community.

Russell Impagliazzo said...

As a PC member, my general impression of the experiment is that it was a failure. I was somewhat biased against it to start, but worried more about the results than the experience. It surprised me how negative the actual experience of being a PC member was.

1. I spent about half as much time as I would have for a standard STOC PC, reviewing about 1/4 of the papers I would have. This is about twice as much time per paper. But I ended up feeling like I had spent an inadequate amount of time per paper.

2. The process was immensely stressful, much more so than being on a traditional PC. That is because I felt blinded by not having a global view. I was asked to make recommendations about where a paper stood relative to others when I hadn't read or even seen the abstracts of the others we were comparing it to.

3. This year, being able to submit papers was not a big issue, although I took advantage of it. In steady state, I expect it would become very important, since I would expect to be asked to be on a STOC/FOCS committee three times as often as with a committee 1/3 the size.

I didn't ``lose interest'' in the process, but not having a PC meeting meant it became harder to keep track of the schedule, and other obligations began to take priority. It was like Jaws 2: just when I thought I was safe, something else came along. I really feel bad that I didn't do a better job of writing clear reviews and spending more time getting the context of a paper right.

I think, as others have said, the high quality of submissions made the process fault-tolerant. Any heuristic would lead to a pretty good conference, if not the ``optimal'' one. But papers were only initially reviewed by two people, who were usually not the world's experts on the particular subject since sub-referees were discouraged. And if neither reviewer thought it was great (which was the standard for being further considered), it was then rejected without further review at all. So I think probably many deserving papers got rejected through getting the wrong reviewers.

I also feel bad that the reviews I wrote are, e.g., nowhere near as helpful as if I were writing a review of a paper I was really familiar with as a sub-referee. I was assigned papers that met my general interests, but not usually where I had detailed knowledge of the subject in advance. So I might not know about related work, for example. Like on a normal committee, I was mainly reviewing for interest not technical correctness, and didn't have time or expertise to verify most of the claims.

After serving on a normal PC, I have a vastly improved vision of the state of ToC as a whole. Right now, I really don't have much idea even about what papers got into STOC, never mind their significance. So the experience is personally less rewarding and more frustrating.

In short, I doubt I'd agree to be on such an extended PC again, even with the ability to submit papers.

Russell

Maurice said...

As a PC member, I can testify that I did spend more time on each paper, I did find it less onerous than usual, and I would have been reluctant to sign up for the usual PC load, given the other demands on my time.

Suresh Venkatasubramanian said...

Regarding Russell's comments:

* on the feeling of still not having enough time per paper.

I wonder if the experiment is incomplete in that regard. The typical data mining conference has 700-800 submissions (and increasing!). They have two-tier PCs, and while the functions of the EC and PC are not exactly the same, they are essentially similar. However, a typical PC there is 300+ people, and a typical "EC" is 35-40 people. Given that their number of papers is only about 2.5X ours, we may not yet be seeing the right PC size or EC size.

* on the frustration over not getting a global view

I wonder if this is a consequence of the blending of the two modes of PC. In our traditional 1-tier model, PC members participate actively in discussions of papers. In the two-tier model (again referring back to the PCs I've been on), the "PC" members act like external reviewers, and it's the "EC" that does the deliberation. In other words, one should not expect to get a global view from reviewing 12 papers, and one should not be asked to rate papers in that way. That's the (larger) EC's job.

Joan Feigenbaum said...

Suresh: I'm not sure where you got this idea, but it's not true that "the 'EC' [did] the deliberation." PC members absolutely "deliberated" (at great length!) about the submissions that were assigned to them and, in some cases, about other submissions as well. The main responsibility of the EC during the period between the first round of (relatively easy) accept/reject decisions and the final (much harder) round of decisions was precisely to facilitate discussions by PC members who disagreed about the submissions whose fate hadn't been decided. Groups of PC members used the HotCRP "comment" facility extensively both to discuss individual submissions and to discuss sets of comparable submissions.
I think that there are also some misapprehensions out there about two aspects of the procedure:
(1) As I said in response to Anonymous at 9:52 AM, we did get extensive input from people who were not on the PC. Although PC members were not allowed simply to delegate a review unconditionally to a "subreviewer," and they were required to come up with their own numerical scores, they were advised to ask for comments from outside experts whenever such outside expertise was needed and to incorporate outsiders' comments into their reviews. EC members also sought comments from outside experts when they were needed; I myself obtained long, detailed reviews from outside experts on a few submissions that no PC member felt comfortable evaluating. Both reviews and HotCRP comments for this conference were full of prefatory remarks by PC and EC members of the form "the following comments were provided by so-and-so."
(2) Although each PC member was *required* to review and/or comment on considerably fewer submissions than he would have been on a traditional PC, all PC members were allowed (indeed explicitly encouraged) to comment on any submissions that they wanted to comment on (avoiding conflicts of interest, of course). Indeed, some PC members chimed in on submissions that hadn't been assigned to them, and in several cases their comments led to long discussions, to requests for outside experts' input, and to unexpected decisions.

On the "global view" question: Indeed, there is a stark tradeoff between individual PC members' having a very light workload and their naturally developing a global view of all of the submissions. To the extent that there are any precise "hypotheses" behind what I hope will be ongoing experimentation with the process, one of them is definitely that many members of our community who are reluctant to commit the level of time and effort required for traditional committee service (even with a "global view" in the offing) would be glad to make the much more modest commitment of time and effort that it takes to serve on a two-tiered committee (even if they have little prospect of a global view). Let's hope that we can continue to experiment with this and other tradeoffs.
I should add that, over many years in the field, I've heard from many members of the community that they don't feel capable of a global view -- they think that the total scope of STOC is huge and that they can understand only a small portion of it. One of the most common reasons given for declining an invitation to serve on a traditional STOC or FOCS committee is that one is bound to be asked to review numerous submissions outside of one's area(s) that one is not interested in. Many people strongly prefer just to concentrate on fewer submissions all of which they understand and care about deeply.

ryanw said...

I'm not as pessimistic as Russell...

1. I definitely spent less time overall and more time per submission. In pretty much all cases, I felt I had a good sense of what the papers were about and what distinguished them from prior work. (In contrast, I was simultaneously on the PODS program committee, and there I had a much longer/tougher time...)

2. The experience was fine, I could easily do it again, but next time there should be a little more flexibility regarding subreviewers. There were a few cases where consulting an outside expert not at my institution would have been best for everyone. Perhaps a mix of "traditional" PC member roles (who may consult subreferees in the usual way but may not submit) and "extended" PC member roles (as in the experiment under discussion) would help.

3. For me, the ability to submit papers was indeed a factor in joining. I had some ideas cooking on low boil for several months, and I figured they might be ready around November. It wasn't awkward, as the topics/areas of the papers I reviewed had essentially no relation to that of my submission. That was perhaps lucky, and could have gone differently -- it is not hard to imagine cases where recusing myself from reviewing a paper would have been the best course of action, even though I was the one assigned to review it.

Suresh Venkatasubramanian said...

Joan, maybe I wasn't clear. I didn't imply anything about the STOC EC - I was merely saying that in *other* venues that have a two-tier format, the EC does most of the deliberation, and this may be as it should be.

As you rightly point out, there's a tradeoff between review load and the "global view" and it might very well be that the right balance is towards a less global view for the PC (and therefore lower load) combined with an active EC (or meta-reviewing body) that does have the global view.

Anonymous said...

Reasonable objections by well-regarded members of the community: "Noise."

Confirmation bias: No one risked outright rejection by submitting in the wrong format, thereby validating our choice.

robert-kleinberg said...

My answers to Joan's questions:

1. I spent significantly less time overall than when I'm on a traditional PC, and I spent more time per paper. The time spent per paper was in between (a) the amount of time I spend on a paper that I assign to a subreviewer when I am on a traditional PC; (b) the amount of time I spend when I am a subreviewer for someone else serving on a traditional PC. The time spent per paper was much closer to (b) than (a).

2. I'd be willing to do it more often than serving on a traditional PC, because the workload is so much lighter. On the other hand, given a choice between serving on one traditional PC or two PCs like this one, I would greatly prefer to serve on one traditional PC (even if it was somewhat more than twice the workload) for the reason Russell explained: "After serving on a normal PC, I have a vastly improved vision of the state of ToC as a whole." This new PC format doesn't afford the same benefit to its members.

3. Once again, Russell's answer sums up my view: "This year, being able to submit papers was not a big issue, although I took advantage of it. In steady state, I expect it would become very important, since I would expect to be asked to be on a STOC/FOCS committee three times as often as with a committee 1/3 the size."

4. For this conference, none of the papers assigned to me had an author on the PC. I've served on other PCs for conferences that allowed PC-authored submissions and didn't find it excessively awkward.

Mikkel Thorup said...

This was my 10th STOC/FOCS PC. I think Joan did an excellent job and introduced many good ideas, but I do think that the 2-tier system is a fundamental mistake that we will suffer from in the long run if it is continued.

CRITICISM

I will go straight to what I see as the main problem, and return with some praise in the end.

The easy job of a PC is to compare like papers on similar subjects. The hard part is to compare across areas, attempting some uniform acceptance level. This is where STOC/FOCS traditionally do much, much better than journals.

What makes cross-area comparison work with a traditional PC is that many of the more senior members have worked in several areas. Handling papers from multiple areas, they offer direct cross-comparisons. Here, by a direct comparison of two papers, I mean a comparison done by a PC member who is responsible for both papers.

One of the ideas in the 2-tiered EC/PC was to minimize the number of cross-comparisons. The attempt was to let PC members handle like (same-area) papers, so most of the direct comparisons were not across areas.

On top of that, note that with the same total reading load, we get only half as many direct comparisons if we double the PC size and give each PC member only half the load: a single member reading 2k papers makes (2k choose 2) direct comparisons, versus 2*(k choose 2) for two members reading k papers each.
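Mikkel's count can be checked with a few lines of Python; the function name and the sample load k = 20 below are illustrative, not taken from the actual committee data.

```python
from math import comb

def direct_comparisons(members: int, papers_per_member: int) -> int:
    """Pairs of papers handled by the same PC member, summed over members."""
    return members * comb(papers_per_member, 2)

k = 20  # hypothetical per-member load after halving
# One member reading 2k papers vs. two members reading k papers each:
print(direct_comparisons(1, 2 * k))  # C(40, 2) = 780
print(direct_comparisons(2, k))      # 2 * C(20, 2) = 380, roughly half
```

So halving each member's load while doubling the committee roughly halves the number of paper pairs that any single person sees side by side.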

The result is that we have an EC doing cross-comparisons indirectly, relying on the scores from different PC members for different fields. This means that we rely on PC members from different fields having a uniform sense of quality. This is pretty much the problem with journals, where standards fluctuate much more across fields.

I do not think the problem is big yet, because thanks to the work of past STOC/FOCS PCs, we all have a pretty good feeling for STOC/FOCS standards. I am not saying that these are even remotely perfect, and there will always be lots of papers in the gray zone, but compared with the journal situation, I think STOC/FOCS has a far more uniform acceptance level (the fact that conference papers are not refereed for correctness is a totally different issue).

Without direct cross-comparisons, subfields get to evaluate themselves, and establish their own standards.

So stepping back: the hard part of selecting a program is to select across subfields. Normally, we have a PC of 20 members doing this in a discussion involving many direct cross-comparisons. Instead, this was done by a hardworking EC with only 10 members relying on different PC members for reviews in different fields.

Joan was very focused on getting good direct comparisons of like papers, but I think that has been a main emphasis of every PC chair I have served under... while it is the easier thing to do, it is also the one that is most embarrassing to get wrong. As mentioned above, with more PC members handling fewer papers each, we simply had many fewer direct comparisons than usual.


Because of character limit, praise and conclusion follows in second comment.

Mikkel Thorup said...

This is the second part of my comment.

PRAISE

I think Joan was successful in reducing the workload per PC member. One of the interesting new ideas was to have an initial round with only two reviewers per paper, thus trying to handle all the easy cases with minimal resources. I do think such an adaptive approach is a good idea. Many of us procrastinate, doing things as close to the deadline as possible. Having an initial round has two advantages: (1) spreading out the work, getting some of it done early for the initial round, and (2) minimizing the work on easy cases. I later got called in on other papers needing more reviews.

About PC members submitting: I have no strong opinion. Obviously this meant that we got some extra submissions to this STOC. The next FOCS will be missing the submissions that it normally gets from the preceding STOC PC.

Also, I was quite happy with the electronic PC. I have tried electronic PCs at both SODA and one previous STOC.

All in all, this time I felt like a glorified referee, with very little feel for the overall program. With more effort, I could have helped much more with the global program, but getting a global view would mean spending the same time as on a normal PC, and having 50 PC members do that would hugely increase the total workload for the community. Moreover, since I knew the EC would make all the final decisions, I didn't feel very motivated to get too involved outside my batch. That being said, I was totally happy with all the decisions made on the papers I handled.

Russell Impagliazzo said...

As someone posted earlier, the quality of reviews is affected by the number of submissions relative to the expected number. In other words, if you get fewer than you expected, you do a better job, and if you get more than you expected, you do worse. The time required here was less than for a normal PC, but more than I expected and more than I budgeted for. If I did it again, I would budget more time, and maybe it would be less unpleasant and I would do a better job.

Maybe the actual sub-referee policy was less strict than what I thought. The papers were in my general areas, but for only a few would I (or someone I work with) be a suitable sub-referee. In particular, what I worried about was being ignorant of related work that was not cited in the submission.

Anonymous said...

This post is an utter shame, and this experiment is an utter failure.

With few (or no?) external reviewers summoned, and a committee that was not representative enough, especially once conflicts of interest are taken into account, the review process was terribly bad. Many submissions were reviewed by totally irrelevant and incompetent reviewers, though they're real experts in their own areas.

As a result, we might expect the lowest quality in general since the inception of STOC. It will surprise nobody if FOCS'13 turns out to contain better papers than STOC. WHAT A SHAME!

Anonymous said...

With most high-quality conferences, I accept that random choices (after filtering) lead to a good program (albeit not optimal), and I don't mind (that much) even if my own deserving papers get rejected.

That is how I feel about the normal reviewing structure. But in this case, I was left unsatisfied by a 2-review rejection of one of my submissions: one review was by an admitted nonexpert, and hence the paper probably had little chance of receiving the strongest score. The other review was one of the lowest-quality reviews I have ever received. An outright rejection at this point is disappointing.

In my opinion, if this process is continued, the first-round rejections should be augmented to account for the quality of reviews. A paper should not be rejected based on first-round scores if the scores themselves are meaningless.

Anonymous said...

I was a member of the outer PC. (I prefer not to identify myself for reasons that will be obvious from the contents of this comment.)

I am afraid that my experience is largely consistent with that of "Anonymous March 1, 2013 at 5:49 PM". A significant fraction of the papers I dealt with received at least one very superficial and low-quality review. Such reviews almost always corresponded to mediocre (though usually not really low) scores. I think that in the normal three-review process one such review, if matched by two positive and more thorough reviews, would not torpedo a paper. This year, with only two reviews, even in cases where the other review was significantly more positive and more thorough, the paper rarely survived to receive a third review.

Like Russell and Mikkel, I felt that I did not get a 'global' view of the submissions pool, so the experience of being on the PC was somehow unsatisfying (while certainly less time-consuming than it would have been to manage 40+ papers under the usual process).

Apart from other aspects of the process, I think the fact that so many papers received only two reviews is a significant drawback of this STOC as opposed to previous conferences. This feels to me like a step in the wrong direction -- a colleague recently told me over lunch how his submission to a top systems conference received *nine* detailed reviews. I was ashamed to say that our flagship conferences in theory are experimenting with two reviews, down from three, and that those reviews are sometimes very superficial.

Anonymous said...

(Same commenter as previous comment) I would like to emphasize that I don't at all blame Joan for the negative aspects I described above. She obviously worked extremely hard and repeatedly urged the committee members to do good reviewing, but this didn't always stick.

I think it's good that STOC did this experiment, but my sense is that having many papers receive only two reviews should be avoided in the future. The real issue is how to raise the overall cultural norms for reviewing quality in our field.

Michael Mitzenmacher said...

While I'm making an effort not to respond, I do have to clear something up for the last anonymous.

Papers were initially assigned 2 reviewers. After the first stage, they were rejected only if they got two very low (clear-reject) scores. Even if these papers had gotten a 3rd review, they were never going to make the top list. They didn't have one positive score, such that a third review might have given them two positive scores; they had two (very) negative scores. They weren't getting accepted.

Now, I do think that the papers left after the first round should have consistently received a 3rd (or even 4th) review; that's a different matter, and I believe Joan said that if she did it again she would have additional reviewers pre-assigned for papers in the 2nd round.

I think there's nothing wrong (and a lot right) with a system that does 2 initial reviews and cuts a number of papers off at that point, and then focuses on the remaining X%, for some reasonable X and cutoff method.

Joan Feigenbaum said...

Thanks again to all who have commented on the STOC 2013 experiment. There are three themes in these comments that I'd like to respond to. I'll address two here and one in my next message.

"Failure": I am taken aback by Russell's and others' use of this extremely derogatory word. As I explained in my original post, the goals of this experiment were (unavoidably, given how the experiment was undertaken) extremely modest and extremely vague. To the extent that I can sum them up briefly, they were to answer the questions "Can we produce a high-quality STOC program with a different committee process from the one we've traditionally used?" and "Which changes to the process should we consider?" We now have one (multidimensional) data point with which to start answering those questions. If the answer to the first question is "no," then the community has the option of simply reverting to the traditional process. That doesn't make the experiment a "failure."

Are some of you using "the experiment was a failure" to mean "I did not enjoy participating in this process"? If so, I think that's nonsensical. One thing we can learn from experimentation is what type of process potential committee members prefer participating in. Eliciting that information is a sign that the experiment was a success, not that it was a failure.

(BTW, unsurprisingly, the PC members who really liked this alternative process have, for the most part, been emailing their views to me privately rather than blogging about them. I feel sort of like a professor reading course evaluations -- it's the dissatisfied students who feel the deepest need to make their views known. :=))

Early accept/reject decisions based on two reviews: No decisions were made based only on numerical scores; the EC read the reviews carefully and in many cases asked for additional input from PC members and outside experts before making any decisions. Moreover, all PC members were empowered to prevent early decisions. Approximately a week and a half after the first round of scores and reviews was due, I sent email to the entire PC telling them (1) that tentative early decisions had been made, and PC members had 12 days to raise objections and concerns before those early decisions were finalized, and (2) that they were (as I had told them earlier) welcome to comment on any submissions they wished to comment on, regardless of whether those submissions had been assigned to them (except, of course, for those with which they had CoIs). In fact, some PC members did raise objections and concerns during those 12 days, and some decisions were postponed for exactly that reason. If there are PC members who now think that particular submissions got short shrift, we need to ask why those PC members did not raise these concerns during the deliberation process when they were asked for their input. Perhaps that problem can be avoided in the future by (as I also suggested in my original post) setting a firm schedule at the beginning and making sure that everyone is aware of it.

That being said, I certainly agree (and said in my original post) that the quality of reviews in our community has always been inadequate and was inadequate this time as well. I hoped that the markedly light load on each PC member would result in (almost) uniformly thorough reviews, but it did not. Rather than focus on whether we should demand two or three reviews of each submission, I think we should figure out how to get good reviews of submissions to theory conferences. Does anyone have any ideas?

Joan Feigenbaum said...

This is the continuation of my previous post.

Global view: As I said in an earlier post, it is impossible for a PC member to have both an extremely light work load and a global view. The main value proposition to potential PC members whom I invited to serve on this committee was the chance to handle few papers but to do a very thorough job on each one. Almost everyone whom I invited accepted the invitation, and many expressed delight that they'd be handling few submissions and could do a great job on each one. It never occurred to me to warn them that they would not have a "global view"; I thought that that was obvious. Recall that PC members who wanted a broader view than just their assigned submissions were welcome to read and comment on other submissions. Some of them did so; perhaps that's a disjoint set from people now regretting that they didn't have a global view!

Anyway, this is another example of valuable information that we've gained from this experiment: Even though they didn't think about it when they accepted the invitation to serve, some people care a lot about "having a global view." Another way in which the experiment was a success!

Note that all of the so-called "subreferees" that traditional PCs enlist for detailed reviews also don't have a global view. (BTW, I think that we should simply call these people "reviewers" or "outside experts" or something else more accurate. We don't do full-fledged "refereeing" for conferences, and so there is no "referee" to which these people are sub'd.)

There are several important functions that must be performed in order to put together the program. They include:

1. Detailed (written) reviewing.
2. Discussion of individual submissions (especially those about which reviewers disagree).
3. Discussion/comparison of comparable submissions to determine which are better than others.
4. Overall balance, apportionment of space to various technical areas, and attempts to compare, for lack of a better word, incomparable submissions. Here is where one needs the vaunted "global view."

Somehow, we need to assign these functions to PC members and outside experts. Traditionally, we have a monolithic PC that takes on all of 1 - 4 and delegates some of 1. I maintain that there's no reason that that division of labor is sacrosanct, and I think that the STOC 2013 experiment, which produced a high-quality program, is evidence that I'm right. I think we should continue to experiment with a variety of divisions of labor.

Mikkel Thorup said...

I agree with Joan that it was an important experiment, and that it is not a failure to perform such experiments.

My fundamental objection is that we had only 10 EC members with the global view, relying on different local PC members for reviews for different areas.

This is to be contrasted with 20 traditional PC members being involved both globally and directly with the papers.

As Joan pointed out, having 50 local PC members work globally would defeat the purpose of reduced work load for local PC members.

I don't think more thorough reports are going to have any significant effect on acceptances/rejections. The significance of a paper should be explained in its abstract and introduction. The body of the paper (and possibly appendices) serves mostly to justify the claims made in the abstract and intro. If there are some particularly interesting technical details, then the intro should at least point to them, so the referee knows to check them out. I normally read further details only if the claims made surprise me, and very often this is when I end up finding a bug in the submission.

I myself had a paper rejected based on two reviews. While I disagree with the conclusion of the referees (otherwise I would not have submitted), I do not think it would have made any difference if these particular referees had read and reviewed my paper in any more detail.

Conference refereeing is extremely taxing for the community, making it very hard to get reviewers for the proper journal refereeing, which is where the detailed refereeing is really needed.

Finally, given the submissions (this time including the extra PC submissions that would otherwise have gone to FOCS'13), getting a good program is not hard. The hard part is to be as fair as possible in the rejections.

Let me also say that I am proud of our community. I have had experience with systems conferences (both accepts and rejects), and while they may generate more reviews, the quality of those reviews in terms of understanding has typically been abysmal. I don't think we have much to learn from them.

Mikkel Thorup said...

Concerning participating in the PC: I participated for the sake of STOC. I was skeptical about the 2-tier system, and thought it very important that such an experiment not be run only by people who believed in it. I also wanted to help STOC'13 as much as possible, giving the 2-tier system the best chance. I told Joan up front about my skepticism, and that I would write up my opinion of the experiment.

I expected even worse, and I think Joan did a surprisingly good job.

Suresh Venkatasubramanian said...

Joan's framing of the different tasks a PC performs and how one might refactor them is incredibly useful, and is an excellent basis for continued discussion. I don't see how this experiment is a failure at all. It's an experiment, and it has generated observations and thoughts for the next experiment. Of course there's something both hilarious and ironic about theoreticians arguing about the nature of an experiment :).

Russell Impagliazzo said...

I apologize to Joan for using loaded language. I only meant that, for me, the answer to the question "Does the new process produce good results with less work and stress?" was no. Thus, the hypothesis the experiment was designed to test (if all the other runs were like mine) was not confirmed, and hence the experiment "failed". I didn't mean to imply that performing the experiment was not worthwhile, nor that it was not correctly designed, only that the outcome did not verify the hypothesis.

Russell

Michael Mitzenmacher said...

So, just to be clear Russell, while you have your opinion, it seems far from universal.

Anonymous said...

Just as one more data point: this was the first time I got a STOC/FOCS review from a reviewer who showed no understanding of the field, and barely any understanding of the results of the paper. The review was close to nonsensical. The other review was extremely high quality and very useful.

The obvious explanation is that there was only one expert on the PC qualified to review the paper (checking the list of PC members confirms this). I do understand the usefulness of making a first round of quick decisions after two reviews. However, I think that once a paper makes the cut and is eventually accepted, it should receive an additional review -- a great review and a bad one felt like less feedback than I get from other STOC/FOCS committees.

Also, it seems that even though reviewers could ask the opinion of outside experts, some felt disinclined to do so, even when they were ill prepared to judge a paper.

Anonymous said...

While I do find this open discussion useful and interesting, I think it would be good to collect some harder data. Joan could run a poll among the members of the PC and EC and then release the numbers (maybe separately for the PC and EC). For instance, we could ask a question like the following.

Do you think STOC should adopt a two-tier PC approach in the future:

a) no
b) yes, possibly with minor changes from this year
c) yes, but with major changes

This would give us some quantitative feedback complementing the more qualitative feedback. In particular since, as Joan pointed out, the feedback received so far might suffer from a selection bias.

Joan Feigenbaum said...

Russell: Apology accepted.

Anonymous on March 1, 2013 at 11:47 PM: Nine reviews per conference submission is just about the most asinine thing I've ever heard about a CS conference! We simply do not need nine reviews in order to make sound accept/reject decisions and to give constructive feedback to authors. Perhaps systems people don't mind wasting their time on this type of overkill, but wasting STOC and FOCS people's time in that manner would be criminal.

Anonymous on March 4, 2013 at 9:24 AM: In fact, after Jeff E. took the lead and, in his comment, answered the four questions I mentioned in my original post, I emailed the entire PC and asked them either to blog their answers to those four questions or to email me their answers. I plan to summarize the results and distribute the summary in some medium or other. However, I must warn everyone that the majority of the PC members have not answered. Somehow, this feels analogous to the fact that many PC members did not write thorough reviews. There seem to be some things that STOC PC members feel they must do and some that they don't. Fortunately for all of us, they do seem to feel obligated to put together a high-quality STOC program by whatever means necessary!

Anonymous said...

I completely agree with Anonymous on March 1, 2013 at 11:47 PM: accepting and, more importantly, rejecting papers based on only two reviews was a failure of this two-tiered PC. Indeed, this was my first experience of a CS theory conference with only two reports.

Aleksander Madry said...

1) I spent probably half as much time as on the SODA 2013 PC I had just been on.

Also, it was a much more enjoyable job as most of the papers I got to look at I truly wanted to read.

Needless to say, this resulted in my spending more time reading the submissions and being able to make much more knowledgeable recommendations on them.

2) It was definitely less stressful and more manageable. So, indeed, I would be willing to do it more often.

3) Yes, being able to submit was a factor in my decision (although I ended up not submitting anything). I don't like the aspect of traditional PCs where you have to decide half a year in advance whether there will be something you want to submit. (And if you have co-authors, this becomes even more complicated.)

I did not feel awkward when reviewing a PC member's submission. (I do not see much difference between reviewing a PC member's submission and evaluating submissions by my other colleagues.) Also, the fact that I had the time (and expertise) to make a fairly knowledgeable assessment of most of the submissions I was handling (and thus to stand by my decisions) was a factor here too.

Finally, one point I would like to make is that, similarly to others commenting here, I really felt that the process behind the final accept/reject decisions made by the EC was very opaque to me. Of course, I could see that our reviews were taken into account, and there was a chance for us to discuss some of the preliminary decisions. But, overall, the thought process of the EC, especially in deciding between papers from different areas, was not disclosed to the PC.

I am not saying that this is necessarily bad, but I would feel better if the invitation to the PC was more upfront about the fairly limited role of PC members in the whole selection process.

However, I must say that I am happy with the final program of STOC (at least, regarding the papers I have some knowledge about).


Joan Feigenbaum said...

Alex: Thanks for your comments. I agree with you that the PC Chair should explain the entire process that will be followed in making accept/reject decisions when the invitations to serve on the PC are issued; I certainly would have done so had I known exactly what the process would be. This time, we were making much of this "experimental" process up as we went along, and we really didn't know exactly what we were going to do until we did it. If the community continues to experiment with two-tiered PCs, I expect future chairs to benefit from our experience, to determine the entire process and schedule before getting started, and to make potential PC members aware of them.

Sanjeev Arora said...

I enjoyed reading about this experiment. I have told SIGACT chairs before that there is no reason why traditional PCs have to stay at a size of 20. (The main reason, as far as I can tell, is to keep down the budget for the PC meeting.) SIGACT has a big reserve fund! Use it to subsidize the PC meeting and raise the size to 30. That would again reduce the load per PC member to a more manageable 30-35 papers.

I'll also send this suggestion to Paul Beame (new SIGACT chair) privately.

On a different note, I think experimenting with the product (to use Joan's term) is more important.

Mikkel Thorup said...

Hi Sanjeev,

About the magic size of 20: it is of course not magic, but a balance based on experience.

As discussed by many above, the cost of a global view is an additive cost per PC member. The total global workload is thus proportional to the PC size, so on this count, the workload on the community is minimized by a small PC.

Joan realized this, which is why, for a larger PC, she introduced a small EC, with only the EC having the global view.

The large PC was focused on the refereeing, which parallelizes perfectly, since the total refereeing load depends only on the number of submissions and the number of reviews per submission (Joan cut this to an initial 2, which could also be done with a normal PC).

One price paid was that for selecting papers from different fields, the EC relied on reports from different PC members. In my opinion, this makes it much harder to make fair selections across areas. In a traditional PC, besides the simple comparison of like papers, you have PC members handling papers from different areas, all of whom are involved in the global outcome.

Direct comparisons are also an argument for fewer PC members each taking many papers (a member with k papers makes roughly k^2/2 direct comparisons).

All in all, from a total work load and direct comparison perspective, we are better off with a small PC, but this has to be balanced with how much work a single person can take.

To match the regular load of 20 PC members, in the new system with a 10-member EC and 50 local PC members, each regular PC slot now corresponds to serving on 0.5 ECs and 2.5 local PCs.

Best, Mikkel