Sunday, March 01, 2009

Double-blind reviewing

The issue of double-blind reviewing came up a number of times when I was blogging about the STOC PC. I had wanted to wait and write a well-thought-out post on the subject, and happily, was beaten to it by Sorelle, who has written an excellent post outlining the various arguments and counter-arguments (with more in the comments). Sorelle has asked that this conversation continue, so I'm continuing it here.

My primary concern about double-blind reviewing is the restrictions it puts on author dissemination. I have limited experience with double-blind reviewing as an author, and my impression was that authors were not to disseminate their work in any public way when submitting to a conference with double-blind reviewing. Perhaps times have changed, or my memory/understanding was faulty, but such strict dissemination restrictions do not seem to be the default.

For SIGCOMM (which is double-blind) the important statements in the call seem to be:

"As an author, you are required to make a good faith effort to preserve the anonymity of your submission..."

but at the same time,

"The requirement for anonymity is not meant to extend beyond the submission process to the detriment of your research. In particular, you may circulate your submission among colleagues or discuss it on a mailing list if you see fit. "

I wish this second statement was clearer, but my understanding (and apparently the understanding of at least some others) is this wouldn't prevent an author from putting the paper on their web page or on arxiv or giving a talk somewhere about it.

If we have this understanding, we should move to double-blind reviewing. To try to limit myself to something new in the argument -- again, go read Sorelle's post and comments as a baseline! -- let me say that after serving as a PC chair, it's absolutely clear to me that biases based on the authors' names do regularly enter into the picture. Let me take a hypothetical situation. Three initial reviewers give bad reviews to a paper. Another PC member takes a look -- because they know the author -- and argues that they've heard a bit about the work, that the reviewers might not realize how this is tying together two areas (quantum and e-commerce auctions! -- again, that's hypothetical) in a novel way, and that an outside reviewer is in order.

At this point, bias has almost surely entered the picture -- unless you think with high probability the other PC member would have noticed this paper without the name on it. It's a limited and seemingly harmless entry of bias -- all that's being asked is someone else look at the paper! -- but what about all the other papers that aren't going to get a second (well, fourth) look because the "right person's" name isn't on the paper? In my mind, it's bias pure and simple. At this point, it seems to me, if you want to argue against double-blind reviewing you have to argue that the cost of this solution is outweighed by the actual benefit. (Hence my concern about the "costs" of limiting authors' ability to disseminate -- which for now I'm understanding is a less severe issue than I might have originally thought.)

And by the way, I think there are much more severe examples of bias that regularly come into play in PC meetings. Much of this is readily admitted by people who give the argument that we in fact should give more trust to "name" authors than unknown authors, since they have a potentially bigger loss of reputation to protect. (See comments at Sorelle's page for those that give this argument.) People presenting this argument are at least being open that in their mind they are considering a cost-benefit analysis on the bias they admit is there. I don't think this argument holds up -- though we can leave that discussion for the comments if needed.

To wrap up, I should say that I personally don't recall a setting where I thought there was specific bias against female authors in a theory PC, which was an issue Sorelle brought up. I'm not saying it doesn't exist, but that unlike many other clear issues of bias that I have seen, I can't recall a specific one where I felt sexism was playing a role. Of course that doesn't change my opinion that double-blind reviewing, implemented suitably, would be of positive benefit for our major conferences. And if there is bias against female authors, an additional benefit is it would likely mitigate that as well.

25 comments:

Anonymous said...

While we're discussing potential imports from system conferences to theory conferences, allow me to mention: rebuttals. It can be quite annoying when a reviewer claims that technique X from some paper he read a while back already solved your problem (or some non-trivial part of it) and therefore gives you a bad review, when actually X fundamentally doesn't work in the current setting. You could say that the authors should explain why previous approaches don't work, but it requires (infinitely?) more pages to explain why "for all previously known X, X doesn't work here" vs. "we give a Y that works", and even if you make an attempt at arguing the coNP statement, it won't help if you don't manage to mention the particular X which the reviewer had in mind!

Anonymous said...

Wrt bias against female authors, in my PC experience I have often noticed that a single female authored paper that is borderline tends to be "torn apart" in the reviews. I have never seen an instance when such a paper was rejected when I thought the paper should be accepted, but the reviews seemed to be much more negative than other borderline rejected papers.

Jeffe said...

I'm exactly backwards on the fame-bias axis; I always encourage reviewers to be harder on famous names, not easier. Well-known authors have more responsibility to make the paper stand on its own merits -- they, more than anyone else, understand (or even define) the community standards. It's the newer authors, especially students, who should be given the benefit of the doubt in borderline cases.

Perhaps that's an argument that I should only be given anonymous papers to review. <shrug/>

Yes, anonymous #1, it is the author's responsibility to explain why previous techniques don't work. Yeah, that means more work. So what?

Yuriy said...

How can a paper be a borderline reject if the reviews "tear it apart"? Is the paper of excellent quality but the reviews are terrible and thus it is judged to be a borderline paper? Or are you using some metric other than the reviews to decide that the paper is borderline?

Anonymous said...

Yes, anonymous #1, it is the author's responsibility to explain why previous techniques don't work. Yeah, that means more work. So what?

Yes, but which previous techniques? If I'm writing a paper in area A, and the reviewer conjures up some random technique from area B which he thinks would solve our problem, and he's wrong, then what? What I'm saying is, even if you make a good faith effort to discuss previous techniques, it is not possible to explain why "for all previous techniques X, X doesn't work here". It may be possible to explain why "for all previous techniques X for related problems, X doesn't work here", but that's not what I'm discussing.

Anyway, the short story: people make mistakes, including reviewers, so I think it makes sense to allow rebuttals to deal with the cases when reviewers goof (after all, we already have reviewers to deal with the case when authors goof up!). Delaying a paper's visibility for half a year because of one human error doesn't make sense when there's an easy way to avoid that error (rebuttals).

Michael Mitzenmacher said...

I think it's off-topic, but regarding rebuttals -- anon #5 claims rebuttals are an "easy way" to avoid errors. I think they're so rarely used precisely because that's not actually the case. At a high level, rebuttals increase the number of rounds of communication from 1 to 2. Rounds of communication are expensive, in terms of time delay. They also serve as a forcing function (it makes no sense to have rebuttals if you don't have all the reviews) in a setting where, quite frankly, many PC members are not good at meeting deadlines.

Again, we discuss rebuttals in terms of cost-benefits. But your assumption that the cost is low seems way off base as a starting point.

11011110 said...

Re the cost of rebuttals: along with the cost to PC members that MM mentions, there's also a cost to submitters: all must be prepared to set aside a day of time to prepare a rebuttal, usually at very short notice. And all authors will feel obliged to defend their paper against whatever criticism they see in the initial review. The number of people who end up preparing rebuttals is far greater than the number of cases in which a genuine error in a review is found and corrected.

As for the "why doesn't technique X solve this problem more easily" type criticisms, they are difficult to answer even when the criticism is misguided. The critiques of this form usually don't provide enough detail to determine exactly what the reviewer means, and it may take more time than provided in the rebuttal process to answer questions like that. If it was an obvious technique to try, the authors should have already addressed it in their paper, and if it wasn't obvious then the possibility that it works shouldn't be very critical.

Anonymous said...

Though I have submitted papers to conferences with double-blind reviewing I have never been on a PC for one.

One item, which has come up on almost every PC I have served on, is the question of closely-related papers that solve very similar problems or use some common idea. If there are several such papers by disjoint sets of authors, should one treat them differently from the same papers with the same authors or largely overlapping sets of authors? I believe so.

Do any conference PCs you have been involved in that use double-blind reviewing use hashes of author names, or some other method, so that the PC can resolve such issues should the need arise?

Anonymous said...

We already have a system for rebuttals: Good conferences spread throughout the year.

You can write your rebuttal in your resubmission. Theory is not like many CS fields where there is only one major annual conference.

Lev said...

I'm not against double-blind reviewing, but here's a strange possible objection that I haven't seen brought up: I'm guessing some people would think twice before sending a marginal paper to a good conference, partly because it may damage their reputation to spam conferences with bad papers. When reviewing is anonymous, there's less incentive to refrain from sending bad papers in the hope that they get in by chance. I wonder if this would increase the number of submissions and place more burden on conference PCs. The statistics are probably out there for conferences that switched to double-blind reviewing in recent years...

Anonymous said...

I hadn't thought of Lev's objection, but I really like it. I don't know how much of an effect there would be in the real world, but certainly it demonstrates how difficult it is to predict the effects of a change.

How about adding more openness, rather than more secrecy? Right now there's not much accountability for the PC, since few outsiders know which papers were rejected.

Suppose all submissions were required to be made public at the time of submission, for example on the arXiv or ECCC, and suppose that the fact of submission were also public. Then anybody who cared could form their own opinion of the decisions, and the PC would know that any biases or mistakes would be visible to everyone.

Michael Mitzenmacher said...

Anon #11 (and Lev): I'm certainly willing to yield that the Law of Unintended Consequences could take effect. If someone were to try this, it would be nice to have some post-analysis afterward. (I tried to offer post-analysis for the stuff I tried as PC chair; but it's clear not everyone necessarily agreed with my post-analysis.)

However, I think bad authors already send in bad papers. Good authors ostensibly avoid sending in bad papers on the off chance they'll get in (with either system) because it would damage their reputation to have them accepted! I don't see people associating much cost to sending in what end up being judged "less-good" papers currently.

That being said, you both certainly have a valid and interesting point. Thanks!

Warren said...

As someone pointed out in comments in one of the blog posts about this (I forget who and where), there are two general classes of arguments against DBR:
1) The author list has little influence on acceptance currently so it's not worth the trouble
2) The author list has a significant influence currently, and that's a good thing.

I don't know how to reply to viewpoint #2. Regarding viewpoint #1, the only way to really determine if author names matter is to run a controlled experiment. Here's how I'd imagine such an experiment working.

1) Authors submit two versions of their submission PDF, one with author names and one without. They also submit some information that is hidden from the PC but used for experimental evaluation:
* the max, over all authors X, of the number of papers X published in STOC or FOCS in the past 10 years
* How many authors are from [list of 10 institutions that publish the most in STOC and FOCS]
* Whether an author has won a Gödel Prize or Turing Award
* A guess as to the score the paper will receive
* Gender and ethnicity of the authors
2) Papers are randomly assigned to either an anonymous group or an eponymous group. The author-provided information can perhaps be used to cluster the population, reducing the noise inherent in random sampling.
3) The PC deliberates on the eponymous papers in the traditional manner using the eponymous PDFs and selects 50 eponymous papers for definite presentation and 15 alternates.
4) The PC deliberates on the anonymous papers using the anonymous PDFs and double-blind procedures. The PC selects 50 papers from this group for definite presentation and 15 alternates.
5) The PC is told the authors of the borderline papers in the anonymous group of alternates. The PC then selects an additional 15 papers for presentation from among the alternates.
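The random assignment in step 2 can be sketched in a few lines of Python. This is only an illustration, not part of the proposal as written: the function name assign_groups, the paper fields "id" and "big_name", and the choice to balance each cluster by shuffling and then alternating are all assumptions made for the example.

```python
import random
from collections import defaultdict


def assign_groups(papers, stratum_of, seed=0):
    """Split papers into 'anonymous' and 'eponymous' groups at random,
    keeping the two groups balanced within each stratum (cluster).

    papers     -- list of dicts, each with a unique "id" (hypothetical field)
    stratum_of -- function mapping a paper to its stratum label
    seed       -- fixed seed so the assignment can be audited later
    """
    rng = random.Random(seed)

    # Cluster the population using the author-provided information.
    strata = defaultdict(list)
    for paper in papers:
        strata[stratum_of(paper)].append(paper)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        # In an odd-sized stratum, pick at random which group gets the extra paper.
        offset = rng.randrange(2)
        for i, paper in enumerate(members):
            group = "anonymous" if (i + offset) % 2 == 0 else "eponymous"
            assignment[paper["id"]] = group
    return assignment


# Hypothetical data: 12 submissions, every third one flagged as "big name".
papers = [{"id": i, "big_name": i % 3 == 0} for i in range(12)]
groups = assign_groups(papers, stratum_of=lambda p: p["big_name"], seed=1)
```

Stratifying this way guarantees that, for example, big-name papers are split as evenly as possible between the two tracks, which is the noise reduction step 2 has in mind. A real design would likely stratify on several of the hidden attributes at once.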


Notes:
1) The number of papers that "should be accepted" that are assigned to the anonymous group is a random variable with mean 50 and standard deviation around sqrt(50) ≈ 7. The "alternates" are designed to correct for this, so if the null hypothesis that names are irrelevant is true then this experiment should not affect which papers are accepted.
2) To analyze the data, compare the fraction of papers with (for example) big-name authors that get accepted in the anonymous track with the corresponding fraction in the eponymous track.
3) With so many possible correlations and so little data, there's a risk of testing 20 hypotheses and getting 1 true by chance. Therefore we might need to run two experiments, one to develop specific hypotheses and a second to test them. Fortunately the overhead is relatively low (30% perhaps?), so while this isn't trivial it's not impossible either.
4) Is anyone reading this an expert in design and analysis of experiments of this sort? To my knowledge no controlled experiment has been done before that measures bias throughout a real PC process, not just the opinions of undergrads taking an intro to psychology class. One might therefore interest a social scientist in helping with this experiment.
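The arithmetic behind notes 1 through 3 can be checked with a short Python sketch. Everything here is an assumption made for illustration: the function names, the model of exactly 100 acceptable papers each assigned by a fair coin flip, the 0.05 significance level, the independence of the 20 tests, and the use of a pooled two-proportion z statistic as the comparison in note 2.

```python
import math


def split_std(n_good, p=0.5):
    """Standard deviation of the number of acceptable papers that land in
    one group, if each of n_good papers is assigned independently with
    probability p (a binomial model)."""
    return math.sqrt(n_good * p * (1 - p))


def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive among n_tests independent
    tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests


def two_proportion_z(acc_a, n_a, acc_b, n_b):
    """Pooled z statistic comparing acceptance rates acc_a/n_a and acc_b/n_b."""
    pooled = (acc_a + acc_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (acc_a / n_a - acc_b / n_b) / se


sd = split_std(100)                     # binomial model for note 1
expected_false_positives = 20 * 0.05    # note 3: about one spurious "hit"
fwer = familywise_error(20)             # chance of at least one such hit
```

Under the binomial model the standard deviation comes out to 5; the sqrt(50) ≈ 7 figure in note 1 is what a Poisson approximation with mean 50 would give, so the two are the same order of magnitude and the 15 alternates cover either. With 20 independent tests at the 0.05 level, one false positive is expected on average and the chance of at least one is roughly 64%, which is why a second, confirmatory experiment is suggested.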

Anonymous said...

To my knowledge no controlled experiment has been done before that measures bias throughout a real PC process, not just the opinions of undergrads taking an intro to psychology class.

It's almost impossible to do a proper controlled experiment for phenomena like this. One key problem is that the experimental half-anonymous process isn't really the same as either process done in isolation. For example, suppose no statistical difference shows up. That could be because single-blind reviewing is in fact unbiased, or it could be because participating in the new process sensitized PC members to the issues and changed their behavior even in the non-anonymous half. One can make up lots of stories of this sort, some admittedly less likely than others, going in either direction. The net effect is that it's just not worth the effort of running the experiment when the results couldn't be considered in any way definitive. (Plus there's the issue that even relatively small effects, involving only a handful of papers in a typical conference, could be important here but are difficult to detect in a small experiment.)

In general, experimental results can be seriously biased if anybody involved even knows they are participating in an experiment. For example, this comes up often in education reform: it appears that merely trying something new and experimental (regardless of what it is) is exciting and motivational and causes gains that don't necessarily survive beyond the experimental stage.

The best experiment I've thought of regarding academic evaluation could be carried out beautifully by the NSF. NSF review panels are somewhat secretive (word of who is on them may leak out but doesn't become widely known) and rank lots of grant proposals without consulting anyone off the panel. They could easily invite two panels to review exactly the same proposals, and probably nobody would find out until after everything was over. It would be fascinating to see how different the two rankings turned out to be. It would also be interesting to see what sorts of factors (gender, prestige, seniority) were correlated with disagreement between the panels.

Anonymous said...

But NSF panel judgements are based on info about the applicant like CV, are they not?

I think judging NSF panels is very different from judging papers. Why can't we just run DBR for a major conference in 2010 and see what happens?

Warren said...

Dear Anonymous March 3, 2009 8:59 PM:
Of course my experimental design is imperfect. The question to ask is not whether this experiment would be "definitive", but whether it would improve our knowledge of bias in TCS reviewing more than any other study of the same cost. Can you supplement your criticisms of my proposal with a proposal of a better way to get data? I can think of several alternative ways to study bias in TCS reviewing, but they all have even worse issues than the one you mentioned IMHO. For example:

1) There's plenty of anecdotal evidence of bias or lack thereof, but anecdotes are an especially poor methodology for studying unconscious bias! Without an alternate experimental proposal you are in effect arguing that anecdotal evidence is the best we can do.

2) One could try to deduce something from a comparison of math journals with and without DBR, replicating a previous study of that sort in Ecology. The flaws with such an approach include the confounding factors inherent in a non-controlled study and the possibility that math and TCS may have different levels of bias.

3) One could run an entire TCS conference double-blind and then compare to a comparable single-blind conference, but:
* people would know that double-blind was an experiment so the results would still be distorted by the effects you mention
* The non-random assignment of papers to the two conferences would make analysis trickier.

Anonymous said...

But NSF panel judgements are based on info about the applicant like CV, are they not?

I think judging NSF panels is very different from judging papers.


Definitely, that experiment is to answer a very different question (and has nothing to do with double-blind reviewing). It's just a tangent.

Why can't we just run DBR for a major conference in 2010 and see what happens?

What do you expect will happen? Some people will love it and will get terribly excited. Some will hate it and will complain bitterly about it. Very few people will decide that the experience has changed their opinion. There will be huge debates about what the effects were, if any, since in any given conference they are sure to be small enough to be debatable. In the end, we'll have to make the same decision with very little useful information gained, while both sides triumphantly claim that all their beliefs have been validated.

This is like a lot of government policy decisions (should we lower taxes, etc.). There are enough variables that it is hard to convince people by trying an experiment, while lots of people will view any experiment as a precedent.

Anonymous said...

Make FOCS have DBR and STOC not. Run for five years. See what happens.

Anonymous said...

Without an alternate experimental proposal you are in effect arguing that anecdotal evidence is the best we can do.

I'm arguing that I can't think of any practical experiment in this area that I consider likely to change many people's minds or to be worth the trouble of trying.

One could try to deduce something from a comparison of math journals with and without DBR, replicating a previous study of that sort in Ecology.

I don't know of any math journals that currently use double-blind reviewing. I'm told that some have experimented with it in the past but have given it up. I couldn't find any good web references, beyond vague references like this one:

http://sci.tech-archive.net/Archive/sci.math/2007-01/msg05349.html

Hearsay like this suggests the experiments in mathematics journals were considered failures, but I don't know anything about them. Does anybody know more?

Anonymous said...

Make FOCS have DBR and STOC not. Run for five years. See what happens.

What could we learn? Suppose FOCS ends up with measurably more papers from marginalized groups. Maybe those people were just sending their better papers to FOCS. Even if overall submission rates don't change, the better or worse papers may be getting shuffled between the two conferences, and there's no way to tell how much of a role the actual blinding played. (This is what Warren was addressing by randomizing things within a single conference.)

In fact, I suspect that this experiment would end with accusations that people were deliberately biasing the results in this way, by sending their best papers to their favorite conference, not because they really expected the review process would make a big difference for them, but rather to cast a vote in favor of the system they liked. And it wouldn't even be a fair vote, since each side could accuse the other of "voting" more vigorously.

Anonymous said...

"I'm arguing that I can't think of any practical experiment in this area that I consider likely to change many people's minds or to be worth the trouble of trying."

How do you know that Warren's suggestion wouldn't change someone's mind? Why do you use this dismissive tone as if nothing can be done?

The fact is that a lot of people think that something can be done to make the system fairer or convince people that it is. Whether or not it is "trouble" to run an experiment is for everyone to decide. We should take some sort of organized vote (like when people vote for SIGACT officers).

Anonymous said...

Wow, no wonder you people are theoretical computer scientists. If you ran a real computational experiment and it didn't give you the answer right away, you'd just argue that no experiment could ever give you the answer.

I really think if we ran Warren's experiment, it's very likely we could discover something unexpected, perhaps good, perhaps bad, but definitely useful.

Warren said...

Something similar to giving two NSF panels the same proposals has apparently been done before. See "Chance and consensus in peer review" by S. Cole, J. R. Cole, and G. A. Simon in Science, Vol. 214, Issue 4523, pp. 881-886.

abstract and full text

Anonymous said...

Something similar to giving two NSF panels the same proposals has apparently been done before. See "Chance and consensus in peer review" by S Cole, Cole JR, and GA Simon in Science, Vol 214, Issue 4523, 881-886

Great, that's really interesting and somewhat depressing (about 25% of decisions would have been different given different reviewers). The paper actually predates the panel system and involved mail review instead. I wonder whether panel review increases or decreases the instability.

If you ran a real computational experiment and it didn't give you the answer right away, you'd just argue that no experiment could ever give you the answer.

This is different - it's more of a social/emotional issue than a statistical issue. People have been publishing papers comparing double-blind and single-blind reviewing for decades, but the process hasn't converged to a consensus and I don't expect it ever will. The evidence is always problematic, the people involved always have an agenda, and nobody can give any principled estimate for just how bad the evidence is because it depends on too many unknowables. Lots of people have their beliefs reinforced by the statistics but very few people change their minds. Without some powerful new idea (which I don't think exists), we won't be able to settle the issue by real-world statistical evidence and we'll just have to fall back on judgement and gut feelings. Warren's approach is much better than the analyses I've read in the literature, but I still don't think it will be anywhere near decisive.

The fact is that a lot of people think that something can be done to make the system fairer or convince people that it is. Whether or not it is "trouble" to run an experiment is for everyone to decide. We should take some sort of organized vote (like when people vote for SIGACT officers).

Community-wide direct democracy is tricky (who gets to decide what to hold a vote on?). One possibility is to put pressure on the organizations that run your favorite conferences. If you want double-blind reviewing, I don't think FOCS/STOC is the right place to start. I'd pick a somewhat less prestigious conference that may view a superior reviewing process as a competitive advantage vs. the top conferences. This may arouse less resistance than starting at the top, too. At the very least, this approach will give a chance to gather evidence and get people more comfortable with the whole process; you can then use that experience as an argument for extending it to more conferences, if it works out well. (I don't think the evidence will prove compelling, but the increased comfort level might.) In the best case scenario, the reviewing process will work markedly better and the conference will start to attract some really excellent papers, which will certainly make FOCS/STOC take notice and change their ways.

There's also the approach of starting a new conference with very high standards and double-blind reviewing. That's hard to pull off well, but it would be doing the community a real service. Double-blind reviewing might be a good hook to get people involved, since it seems to create a lot of interest. I'm actually serious about this - I don't want double-blind reviewing, but I do want more first-rate conferences (FOCS/STOC dominance is a bad thing) and psychologically it's easier to set the bar high for a new conference than to raise it for an old one. Double-blind reviewing may be what it takes to raise enough interest to get a new conference off the ground.

Warren said...

If you want double-blind reviewing, I don't think FOCS/STOC is the right place to start.

I disagree because:
1) It is quite plausible that there might be a big name effect in top conferences but not in secondary ones (or vice versa). We need to run the experiment that answers the question we really care about, which is bias in FOCS/STOC.
2) FOCS and STOC are quite broad, forcing lots of tricky apples-to-oranges comparisons by a PC that is not completely expert, which is a great opportunity for bias. Narrower conferences have a more focused PC comparing similar papers, making objective decisions easier.

There's also the approach of starting a new conference with very high standards and double-blind reviewing.

In algorithms at least there are three major conferences spaced pretty evenly throughout the year (FOCS/STOC/SODA). I doubt there's room in the year for another conference of comparable prestige. When would you put its submission deadline? Perhaps one could move the STOC deadline to October and then have a new conference deadline in January.