Tuesday, January 09, 2018

Double-blind, ALENEX

I wanted to point people to a pair of blog posts (Part 1, Part 2) by Suresh Venkatasubramanian at the geomblog discussing the experience of having the ALENEX conference go double-blind for reviewing this year, from his perspective and that of his co-chair Rasmus Pagh.  The high order bits are that they thought it worked fine despite the additional work, and they recommend doing it again in future years.

I am in favor of double-blind reviewing, which is standard for many other conferences in many other areas, but is somehow considered impossible for CS theory.  (Suresh does an excellent job addressing why that has been the case historically, and why the arguments against double-blind reviewing do not really hold up.)  The theory community hasn't historically taken "conflicts of interest" issues seriously, as I've written about before (I guess a rather long time ago!).  Double-blind reviewing helps with CoI issues, but as Suresh discusses, it also deals with the huge implicit bias issues that I think are also a problem in the theory community.

(Except that, really, calling it implicit bias is something of a misnomer -- in many cases it's really explicit bias.  As Suresh notes, in providing responses to common arguments:
But how am I supposed to know if the proof is correct if I don't know who the authors are. 
Most theory conferences are now comfortable with asking for full proofs. And if the authors don't provide full proofs, and I need to know the authors to determine if the result is believable, isn't that the very definition of bias?   
This is a real argument commonly presented at PC meetings.  I've had people say, "I haven't verified the proofs, but XXX is one of the authors, so I'm sure it's all right.")

I encourage people to go read Suresh's discussion, and I would like to thank Suresh and Rasmus for taking it upon themselves to perform this experiment. 

(If you're interested, you can also read the paper by Tomkins et al. discussing recent double-blind reviewing experiences at WSDM.)

14 comments:

James R Lee said...


About the statistical confidence in proof correctness based on the authors' identity, Suresh's argument is: "And if the authors don't provide full proofs, and I need to know the authors to determine if the result is believable, isn't that the very definition of bias?"

Sure, that's the definition of "bias." But it's not "bias" that's generally considered negative or unfair (lest you start disliking anti-bias bias, say), it's bias for or against things that shouldn't matter.

One might argue that this data is not used responsibly, but it's hard for me to understand--on the face of it--how the expertise of the author is not relevant in making a statistical judgement about correctness.

Here is a thought experiment. We all have only limited time. Most of us can't read, for instance, every proof that claims to resolve the P vs. NP question. I don't read those papers at all, but if Razborov posted such a preprint to the arxiv, I would start reading it that night. Am I making a sound decision in basing allocation of my attention on credibility in a field? Isn't the role of conferences essentially "attention allocation"?

Michael Mitzenmacher said...

James --

When you say that that's not bias that's considered negative or unfair, I think you'll find a good number of people that disagree with you. Particularly the ones that are being biased against.

I would say that blindly trusting the expertise of the author is not a good idea for our long-term science. You try to dress it up by saying just that it is relevant in making a "statistical judgement", but in reality, I don't believe that's how it works. I've never heard anyone say at a PC meeting, "My prior odds for believing this result are only 1:1, but because XXX wrote it, I'm updating my odds to 3:1, and that's really good enough, isn't it?"

You have gone on to posit an extreme case, regarding P vs NP proofs. You'll notice that such proofs are generally NOT handled by conferences, as a matter of course, precisely because they require such attention. They are more properly -- and fairly -- handled outside the conference mechanism. So I think your example doesn't shed light on this situation.

In short, your argument is pretty much the standard sort of argument I've heard in response to promoting double-blind reviewing, which I find deeply uncompelling.


Suresh Venkatasubramanian said...

James, just to add to Michael's response, consider a scenario where it's not as "clear cut" as P vs NP. The fundamental problem with reasoning based on author identity is not that it provides no signal. It's that it isn't applied consistently. In other words, imagine we had a numerical confidence score for each author, a rule that says that the confidence of a paper is some function of the confidence scores of the authors, and a well defined rule to update our prior confidence in the paper based on these scores. And that each and every one of us agreed on this scoring and also agreed on the same update model for our beliefs. And had the same beliefs.

Then I might be willing to entertain the idea even briefly that author identity might be of some value.

But of course such a scenario doesn't exist in reality. We have different ways in which we might update priors based on author identity and that lack of consistency is a real problem, because that's precisely where subjective implicit biases appear to kick in.

Anonymous said...

The argument about correctness has nothing to do with "xxx is an author so surely it is correct." The point is the following: for conferences, correctness is the responsibility of authors rather than the PC. This works because it is very embarrassing to publish faulty proofs - you lose reputation. But this only works if you have a reputation within the community to lose. For everybody else, the PC should work harder to verify the proof. In addition, even within the community, some authors have a history of bugs. It makes perfect sense to hold it against them. There are many more arguments against double-blind reviewing, and I have stories to back them up (from my experience reviewing in crypto). Nevertheless, there is an argument in the other direction as well.

Anonymous said...

To add to this discussion, I think that focusing on the example of determining correctness obscures a more important issue: author identity may affect reviewer's general credulousness when evaluating author assertions regarding issues like:

1) novelty of ideas and techniques
2) generally arguing for the "importance" of whatever direction the paper is pursuing.
3) reviewer optimism regarding future applications or progress that the new ideas may enable
4) making sure the authors haven't missed some prior work that solves or nearly solves the problem, or missed a much simpler/obvious solution

I can see the argument for letting author identity affect issue 4), since an expert on a topic may be less likely to miss prior work or simple solutions. But it seems much more dangerous to let author identity affect the other issues: these aspects are inherently somewhat subjective, and in the absence of an error, these are precisely the issues that papers get evaluated on.

In short, if well-known researchers' claims of novelty and importance are treated with more credulousness than others, then it's easier for them to sell their papers. And that's a huge part of the game. So I think that a major argument for blinding is to help keep the playing field level in this regard.

James R Lee said...


Michael:

I realize I was unclear: My comment about "bias" is only that "bias" is generally used as shorthand for certain negative types of bias. "Bias" itself cannot be negative (hence my joke about "anti-bias bias"), and thus instances of bias need to be argued on their merits. I did not see the argument. I still don't see you making one, and "I've heard all this before" is not a great way of dismissing a position.

Positing an extreme case is what one does in an existence proof. I exhibited the existence of a situation in which such bias is productive. The point of the extremity is so that everyone can agree. From there, one can argue about the relative merits, and whether the value of such bias outweighs its potential harm. But I find it disingenuous to act as if this bias holds no value whatsoever for the stated mission of the program committee.

Saying that people don't make statistical judgements just because they don't assign explicit probabilities doesn't seem right to me. The conference deliberation process has never been intended to confirm correctness, but reviewers are asked to give their "confidence" about the likelihood that a paper is correct. I would only argue that such confidence is legitimately increased by the author being an expert in the underlying area.

I would personally never advocate for accepting a paper at a theory conference without it being accompanied by full proofs, but that's because I don't think scientific progress needs to be so fast that people can't write their arguments down carefully. Even given full proofs, it is infeasible (and has never been a goal) for the PC to verify them, and so at the end of the day, one is trading off many factors, one of them being the likelihood that the underlying argument is correct.

Suresh:

I have no problem with the supposition that the utility of author identity is outweighed by its allowance for harmful bias. I just think that one should acknowledge the useful aspects openly (as Omer expressed much more succinctly than me).

Piotr said...

> The theory community hasn't historically taken "conflicts of interest" issues seriously, as I've written about before (I guess a rather long time ago!).

That was certainly true historically, but things have changed considerably. In particular, SODA had some forms of conflict of interest management over the last few years. The details varied, mostly due to the idiosyncrasies of the TCS program committee format, but the process seems to be converging. The same (I believe) holds for other major TCS conferences.

Michael Mitzenmacher said...

1) Piotr: Thank you, Piotr, for the reminder that there have in fact been some beneficial changes.
2) James: I was responding to your argument, rather than making one about bias, as my post was just a pointer. What I pointed to contains relevant arguments, and I believe the bias arguments have been well laid out in the past. In particular, I might point out:

From Suresh's blog:

"...there is now a large body of evidence suggesting that:

All people are susceptible to implicit biases, whether it be regarding institutional status, individual status, or demographic stereotyping. And what's worse that we are incredibly bad at assessing or detecting our own biases. At this point, a claim that a community is not susceptible to bias is the one that needs evidence."

To be clear, I think Suresh is talking about negative biases here (as in the Tomkins quote below), which would include giving the same paper a higher score based on knowledge that it was written by a "famous" author, and potentially biases against underrepresented groups, including women. This would also include potential for biases that might be seen as forms of conflict of interest.

From the paper by Tomkins et al:

"Our second point with respect to reviewing is that, whatever the process that resulted in the reviewers being assigned the paper, the single-blind reviewers with knowledge of the authors and affiliations are much more positive regarding papers from famous authors and top institutions. Again the implications are not cut and dried, but it is reasonable to raise the concern that authors who are not famous and not from a top institution may see lower likelihood for acceptance of exactly the same work."

They also discuss the (negative) effects (of single-blind reviewing) on female authors.

Suresh Venkatasubramanian said...

Omer, I'd be interested in hearing your critique of double blind review as well as your experiences from crypto. It's definitely good to learn from other experiences, especially since there aren't too many theory-adjacent communities that use double blind.

Sariel in a different thread points to the discussion that ACL had. That's an interesting case because they started off with double blind review, realized that the arxiv undermines this, and now have to decide how to deal with this - which they did by blocking arxiv submissions - a controversial position.

In a sense, because we've waited long enough to deal with double blind review, we're being forced to deal with the arxiv subversion of blinding and double-blind review all at the same time.

James, I fully agree that one should not undersell the potential useful signal from author identity. It is definitely about tradeoffs, and making them explicit perhaps makes the discussion and pressure points clearer.

Michael Mitzenmacher said...

Since Suresh just sort of agreed with James about tradeoffs, let me clarify my disagreements with James. (Again, I think these arguments are standard, but I'll relate them.) Some arguments relate to how we weight certain problems, and are perhaps more subjective. Some are less so.

My understanding is James is perfectly happy in a world where a paper that is some delta better by unknown author X is regularly rejected in favor of a less good paper by known, famous, trusted person Y, because he believes that's necessary for the system. Otherwise, the tradeoff is we'll regularly accept too many buggy or problematic papers by unknown authors, and that's a worse problem. I disagree, and I am unhappy with such a system.

First, I don't think this is really a worse problem, in terms of frequency, or in terms of how it is or can be dealt with by the community after the fact (once bugs are found, have a mechanism for dealing with them). But I don't think either of us have anything more than anecdotal evidence to back the questions of frequency and importance up, so let's chalk that up to subjective judgment. (Of course, it's my blog, so my bias is we should assume that my subjective judgment is correct.)

Second, I think it's clear how to fix this problem in terms of double-blinding. As a community, we need to decide that it's important to spend more time/effort reviewing to avoid the buggy papers. Meanwhile, it's not clear how to solve what I see as the bias problem in the other direction without double-blinding. We can't decide as a community to reduce our implicit biases that we don't understand (in particular if some people explicitly think those biases are a perfectly reasonable thing).

Third, I think James and others are ignoring the meta-issues that such biases close off the community, which is not healthy in the long term. Do you think young, bright students want to work in an area where the established incumbents have an inherent advantage in conference acceptances, or that you have to have the right advisor to have a better shot of getting your papers in? This is a variant of the "what works are we never going to see" argument, where we don't realize what we're missing out on because we don't recognize the long-term effects of bias on the community.

Fourth, going back to the back and forth James and I had on statistical thinking, I think he ignores the tendency people have to overweight these biases in decision-making, either implicitly or explicitly. We are not statistical machines, we are individuals with flawed judgments that overreact to past information. (See the works of Kahneman or any behavioral economics author you like.) I think people just naturally overweight the "I'm comfortable with Y, I don't know X" aspect in their judgments, in ways they don't recognize, and I believe here there's scientific evidence (at a general level) to back that up.

I'm sure there are other arguments that I'm forgetting.

Michael Mitzenmacher said...

Whoops, I can't seem to edit that last comment, but I realize that part of James's argument is possibly that we need to know author names because there are not only unknown authors but known authors X whose work, for whatever reason, we trust less than even an unknown author. I don't think this substantially changes my points, except perhaps the first one, where again it's a perhaps subjective and at least not well quantified question of how often that information is important.

Boaz Barak said...

Wrote my own post about this https://windowsontheory.org/2018/01/11/on-double-blind-reviews-in-theory-conferences/

Given my experiences in STOC/FOCS and CRYPTO, I don't think trusting proofs is the reason we can't use anonymous submissions, but there are several other problems with that model. In particular, CRYPTO moved to double-blind submissions before the age of arxiv and eprint, but I don't think that model really makes sense today.

--Boaz Barak

Anonymous said...

> imagine we had a numerical confidence score for each author, a rule that says that the confidence of a paper is some function of the confidence scores of the authors, and a well defined rule to update our prior confidence in the paper based on these scores. And that each and every one of us agreed on this scoring and also agreed on the same update model for our beliefs. And had the same beliefs.

> But of course such a scenario doesn't exist in reality. We have different ways in which we might update priors based on author identity and that lack of consistency is a real problem, because that's precisely where subjective implicit biases appear to kick in.

@Suresh, I'm in favor of double-blind (he said, anonymously), but this seems like a funny argument to give as the ultimate justification of it.

After all, there are lots of kinds of scenarios where multiple agents collaborate on a joint task, more or less successfully, despite their not sharing identical, formally specified understandings / decision-making methods: think of wisdom-of-crowds stuff like counting jelly beans, or even just of informal language (we don't share formally identical definitions of all the words in our heads, but we often can communicate reasonably well).

Indeed, think of reviewing more generally: nobody would insist that all reviewers share "a rule that says that the confidence of a paper is some function of [some fixed list of paper attributes], and a well defined rule to update our prior confidence in the paper based on these scores." Nobody cites the nonexistence of such a universal scoring function "that each and every one of us agreed on" as a knock-down argument against the idea of collaborative peer review itself.

Instead, isn't the argument against (inappropriate) bias going to have to rest on concerns like A) the concrete injustices done to the individuals involved in false negatives and (less concretely) B) arguments about the net harm done to the system as a whole?

Piotr said...

From my perspective, the issue is rather simple: a conference should move to the DB system if the majority of the conference community is willing to put up with (minor, but non-negligible) inconveniences resulting from the change. I am referring to (a) specifying conflicts with (sub)-reviewers before the submission (can take about 10-15 minutes for large conferences) and (b) anonymizing the paper (again, a few minutes). Essentially, if most of the members of the conference community believe that the existing process unfairly favors well-known authors and other groups (and there is some evidence to back this up, even if it was obtained for different conferences), then continuing the non-DB process is unsustainable. Ultimately, people will end up submitting their papers to venues where they expect a fair reviewing process, and there are many conferences to choose from these days.

That said, it is good to remember that DB reviewing is not a silver bullet, and there are limitations (notably arxiv) and side effects, as listed in other comments. Here are a few more points to keep in mind:

- regarding the issue of "untrusted authors": note that, in the DB system, this reasoning can still be applied, albeit only at the PC chair level. Which might be a good thing, for consistency reasons.

Also, from my experience, there are many telltale signs of an incorrect paper: exaggerated but vague claims, lack of complete proofs, etc. In fact, I am not sure that we are really losing that much information by anonymizing papers.

- it is also good to point out that the DB process puts an onus on the PC chair to ensure that any conflict of interest rules are followed. In the current system, any PC member can point out that a reviewer is obviously conflicted with the author(s) or that a group of papers is being reviewed by a "mutual admiration circle". In the DB system only PC chairs can identify such cases, which means either more work for the chair, or (if the chair drops the ball), no enforcement.

(This could also be viewed as an argument for having co-chairs, to share the workload).