Tuesday, August 19, 2014

Reviewing Scales

I'm just about finished reviewing for CoNEXT (Conference on Emerging Networking Experiments and Technologies), and am starting reviewing for ITCS (Innovations in Theoretical Computer Science).  One notable variation in the process is the choice of the score scale.  For CoNEXT, the program chairs chose a 2-value scale: accept or reject.  For ITCS, the program chair chose a 9-point scale.  Scoring from 1-9 or 1-10 is not uncommon for theory conferences.

I dislike both approaches, but, in the end, believe that it makes minimal difference, so who am I to complain?

The accept-or-reject choice is a bit too stark.  It hides whether you generously thought the paper should possibly get in if there's room, or whether you really are a champion for the paper.  A not-too-unusual situation is a paper that (at least initially) gets a majority of accept votes -- but nobody really likes the paper or has confronted its various flaws.  (Or, of course, something similar the other way around, although I believe the first case is more common, since it feels better to accept a close call than to reject one.)  Fortunately, I think the chairs have been doing an excellent job (at least on the papers I reviewed) of encouraging discussion on such papers as needed to get us to the right place.  (Apparently, the chairs aren't just looking at the scores, but reading the reviews!)  As long as there's actual discussion, I think the problems of the 2-score solution can be mitigated.

The 9-point scale is a bit too diffuse.  This is pretty clear.  In the description of score semantics we were given, I see:

"1-3 : Strong rejects".

I'm not sure why we need 3 different numbers to represent a strong reject (strong reject, really strong reject, really really strong reject), but there you have it.  The boundaries between "weak reject", "a borderline case" and "weak accept" (scores 4-6) also seem vague, and could easily lead to different people using different interpretations.  Still, we'll see how it goes.  As long as there's good discussion, I think it will all work out here as well.

I prefer the Goldilocks scale of 5 values.  I further think "non-linear" scoring is more informative:  something like top 5%, top 10%, top 25%, top 50%, bottom 50%, but even scores corresponding to strong accept/weak accept/neutral/weak reject/strong reject seem more useful when trying to make decisions.
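The non-linear scale described above can be made concrete with a small sketch. The percentile cutoffs follow the text (top 5%, 10%, 25%, 50%, bottom 50%); the function name and the score labels 1-5 are my own illustrative choices, not any conference's actual policy.

```python
def percentile_to_score(percentile):
    """Map a reviewer's estimated percentile rank (0 = best possible,
    100 = worst) to a 5-value non-linear score:
    5 = top 5%, 4 = top 10%, 3 = top 25%, 2 = top 50%, 1 = bottom 50%."""
    for score, cutoff in ((5, 5), (4, 10), (3, 25), (2, 50)):
        if percentile <= cutoff:
            return score
    return 1

print(percentile_to_score(3))   # a top-5% paper -> 5
print(percentile_to_score(40))  # a top-50% paper -> 2
```

The point of the geometric spacing is that each step up in score roughly halves the pool of eligible papers, so a high score is a much stronger statement than one step on an evenly spaced scale.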

Finally, as I have to say whenever I'm reviewing, HotCRP is still the best conference management software (at least for me as a reviewer).

4 comments:

Paul Beame said...

I have been on only one PC that used a geometric percentage scale and I really didn't like it. The problem is that it doesn't distinguish between huge ranges of papers - leaving a whole bunch of papers in the region around what should be the acceptance threshold with indistinguishable numerical ratings despite very different PC member attitudes and reviews.

PC members probably do a bad job of estimating the overall percentages anyway given their small samples.

The high-rated and low-rated papers don't need quite so much distinction. The most important area where numerical ratings matter is in helping the committee understand the attitudes of their fellow PC members near the borderline. In theory conferences this is in the 20-40% range, given that PCs generally accept between 25% and 33% of submissions. In other communities it may be at a different level.

Very often I find that PC members end up not wanting to make firm statements but having leanings that have a big effect on their behavior in PC discussions. The more that numerical ratings can tease these out at the start, the better.

Anonymous said...

Several years ago some conference decided to simplify the scoring system from -3...3 to 0...3 (or something like that), with the comment that anything in the range -3...0 should now be named 0, since in the past all papers with scores in that range were rejected anyway; this is somewhat similar to your criticism of the scores 1-3. The PC chair was very direct and told all PC members that they should think about the scale -3...3 and give 0 for papers with scores in the range -3...0. However, the outcome was that the average score was higher than 1, since everyone was calibrating to the range 0...3.
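A toy simulation shows why the average drifted up. The distribution of underlying opinions (uniform over -3...3) and the linear recalibration are purely illustrative assumptions; the point is only the direction of the effect the comment describes.

```python
import random

random.seed(0)
# Reviewers' underlying opinions on the old -3..3 scale (uniform, illustrative).
true_scores = [random.randint(-3, 3) for _ in range(1000)]

# What the chair asked for: clamp everything in -3..0 to 0, keep positives.
intended = [max(s, 0) for s in true_scores]

# What the commenter says happened: reviewers spread their opinions
# linearly across the new 0..3 range instead of clamping.
recalibrated = [(s + 3) / 2 for s in true_scores]

print(sum(intended) / len(intended))        # stays below 1 in this model
print(sum(recalibrated) / len(recalibrated))  # rises above 1, as observed
```

Under the clamping rule the mean stays well below 1 (most of the mass sits at 0), while linear recalibration centers the scores near 1.5, matching the outcome the commenter reports.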

Anonymous said...

Below is the scale a machine learning conference uses. I like it because it assigns a particular interpretation to each score (if you give a paper a 1 and it gets in, you should boycott the conference).
10: Top 5% of accepted NIPS papers, a seminal paper for the ages.
I will consider not reviewing for NIPS again if this is rejected.
9: Top 15% of accepted NIPS papers, an excellent paper, a strong accept.
I will fight for acceptance.
8: Top 50% of accepted NIPS papers, a very good paper, a clear accept.
I vote and argue for acceptance.
7: Good paper, accept.
I vote for acceptance, although would not be upset if it were rejected.
6: Marginally above the acceptance threshold.
I tend to vote for accepting it, but leaving it out of the program would be no great loss.
5: Marginally below the acceptance threshold.
I tend to vote for rejecting it, but having it in the program would not be that bad.
4: An OK paper, but not good enough. A rejection.
I vote for rejecting it, although would not be upset if it were accepted.
3: A clear rejection.
I vote and argue for rejection.
2: A strong rejection. I'm surprised it was submitted to this conference.
I will fight for rejection.
1: Trivial or wrong or known. I'm surprised anybody wrote such a paper.
I will consider not reviewing for NIPS again if this is accepted.

From http://nips.cc/Conferences/2013/PaperInformation/ReviewerInstructions

Michael Mitzenmacher said...

Paul:

What you see as a problem I see as a benefit. The geometric scale rather consistently divides the papers into 3 categories:
Definitely accept
Definitely reject
Discuss

Papers that need to be discussed should, actually, be discussed.

I find that your suggestion that the numerical rankings help "tease out" what is going on with borderline papers (at least with a 10-point scale) just isn't borne out in my experience; different people don't use the numbers consistently in these ranges, and they are not sufficiently accurate scorers in their first pass. More scores just increase the arbitrariness. And, I believe, it makes it harder to identify the "clearly accept" and "clearly reject" papers (although probably more the latter than the former). (As I said, a 5-point non-geometric scale still seems fine; I just prefer it a little less.)

Anonymous #3: The NIPS scale is really funny. Thanks for posting it!

But again, it highlights the point that scores of 1/2/10 are probably rarely used (and fairly arbitrary; they're more likely to be given by PC members who like to express strong opinions than based on the actual merit of the paper). Also, the distinction between scores 4 and 5 (and 6 and 7) seems so vague and reviewer-dependent that it again increases arbitrariness.