Monday, June 03, 2013

NSF Reviewing Trial Run

Noam Nisan points to the NSF trying out some new rules for reviewing in its upcoming SSS program. 

There's a lot here to discuss.  First, I'm glad to see the NSF is willing to try out some new reviewing approaches.  They've been using the same approach for a long time now (1- or 2-day in-person meetings, a reviewer panel drawn according to who is available and willing);  I really haven't seen any discussion from the NSF as to why it's a good review system, and it's typically got some major cons (as well as, admittedly, some pros).  As far as I know -- and perhaps some people are more knowledgeable than I am on the topic -- it's not at all clear why it's become the stable equilibrium point as a reviewing method.

That being said, there are some clear pros and cons to this experiment.  Here are some features, with initial off-the-cuff commentary.

1.  No panel review.  Proposals will be split into groups of 25-40, and PIs in the group will have to review other proposals (they say 7 here) in that group.  [If there are multiple PIs on a proposal, one has to be the sacrificial lamb and take on the role of reviewer for the team.]   

I kind of like the idea that people submitting proposals have to review.  One of the big problems in the conference/journal system is that there's minimal "incentive" to review.  Good citizens pay back into the system.  Bad citizens don't.  This method handles the problem in a natural way -- you submit, you review.  There are many potential problems with this method to be sure (as we'll see in the proposed implementation below).

2.  A composite ranking will be determined, and then the "quality" of each PI's reviews will be judged against this composite;  then the ranking of each PI's own proposal may be adjusted according to the quality of their reviews.

Ugh.  Hunh?  I get the motivation here.  You've now forced people into doing reviews, who may not want to.  So you need an incentive to get them to do the reviews, and do them well.  One incentive is that if you're late in your reviews, your own proposal will be disqualified.  That seems fine to me.  But this seems --- off.  I should note, they have a whole subsection in the document labelled
Theoretical Basis:

The theoretical basis for the proposed review process lies in an area of mathematics referred to as mechanism design or, alternatively, reverse game theory.  In mathematics, a game is defined as any interaction among two or more people.  The purpose of mechanism design is to enable one to “design” the “mechanism,” namely the game, to obtain the desired result, in this case to efficiently obtain high-quality proposal review while providing the advantages noted above.  In mechanism design, this is done by formulating a set of incentives that drive behavior in the desired direction.  The mechanism presented here was devised by Michael Merrifield and Donald Saari [1].
I suppose I now have to go read the Merrifield and Saari paper to see if they can convince me this is a good idea.  But before reading that, there are multiple things I don't like about this.

a)   Why is "reviewer quality" now going to be part of how we make decisions about what gets funded?  I'm not sure to what extent, if any, I want "reviewer quality" determining who gets money to do research.  Here's what the document says:
To promote diligence and honesty in the ranking process, PIs are given a bonus for doing a good job.  The bonus consists of moving their proposals up in the ranking in accordance with the accuracy with which their ranking agrees with the global ranking.  This movement will be sufficient to provide a strong incentive to reviewers to do a good job, but not large enough to severely distort the ranking merely as a result of the review process.  Recognizing that, if all reviewers do an excellent job of ranking the proposals they review, all PIs’ proposals will be moved up equally, which means that the ranking will not be changed, the maximum incentive bonus will be a movement of two positions, that is, a proposal could be moved up in the ranking to a position above the next two higher proposals.
With funding ratios at about 15% (I don't know what the latest is, but that seems in the ballpark), a group of 25-40 proposals yields only about 4 to 6 funded proposals, so two places could be a big deal in the rankings.

b)   Why is there the assumption that the group ranking is the "right" score -- particularly with such small samples?  I should note I've been on NSF panels where I felt I knew much better than the other people in the room what were the best proposals.  (Others can judge their confidence in whether I was likely to have been right or not.)  One of the pluses of face-to-face meetings is that a lone dissenter has a chance to convince other reviewers that they were, well, initially wrong (and this happens non-trivially often).  I'm not sure why review quality is judged by "matching the global ranking".

c)   Indeed, this seems to me to create all sorts of game theoretic problems;  my goal in reviewing does not seem to be to present my actual opinion of a paper, but to present my belief about how other reviewers will opine about the paper.  My experience suggests that this does not lead to the best reviews.  The NSF document says:

Each PI will then review the assigned subset of m proposals, providing a detailed written review and score (Poor-to-Excellent) for each, and rank order the proposals in his/her subset, placing the proposals in the order which he/she thinks the group as a whole will rank them, not in the order of his/her personal preference.
But then it says:
Each individual PI’s rankings will be compared to the global ranking, and the PI’s ranking will be adjusted in accordance with the degree to which his/her ranking matches the global ranking.  This adjustment provides an incentive to each PI to make an honest and thorough assessment of the proposals to which they are assigned as failure to do so results in the PI placing himself/herself at a disadvantage compared to others in the group.
So I'm not clear how their incentive system -- based on the global ranking -- gives an incentive to make an honest and thorough assessment.  The document seems to contradict itself here.

d)  This methodology seems ripe for abuse via collusion -- which is of course against the rules:
The PIs are not permitted to communicate with each other regarding this process or a proposal’s content, and they are not informed of who is reviewing their proposals.
But offhand I see plenty of opportunities for gaming the system....

e)  This scheme is complicated.  You have to read the document to get all the details.  If it takes what seems to be a couple of pages to explain the rules of the assignment and scoring system, maybe the system is too complicated for its own good.
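
To make point (c) concrete, here is a minimal sketch of what "matching the global ranking" could look like. It assumes agreement is measured as the fraction of proposal pairs that a reviewer orders the same way as the group does (a Kendall-tau-style measure). The quoted parts of the NSF document do not say which measure is used, so the function name, the metric, and the proposal IDs are all assumptions made for illustration.

```python
from itertools import combinations

def agreement(reviewer_ranking, global_ranking):
    """Fraction of proposal pairs the reviewer orders the same way as the group.

    Both arguments are lists of proposal IDs, best first. The reviewer ranks
    only a subset (at least two proposals). Pairwise agreement is an assumed
    metric, not the NSF's stated rule.
    """
    position = {proposal: i for i, proposal in enumerate(global_ranking)}
    # Each pair (a, b) means the reviewer placed a above b.
    pairs = list(combinations(reviewer_ranking, 2))
    concordant = sum(1 for a, b in pairs if position[a] < position[b])
    return concordant / len(pairs)

# Hypothetical group consensus over five proposals.
global_ranking = ["P1", "P2", "P3", "P4", "P5"]

# A reviewer who simply predicts the consensus gets full marks...
conformist = agreement(["P1", "P3", "P5"], global_ranking)
# ...while one who honestly believes P5 is the best of the pile is marked down.
dissenter = agreement(["P5", "P1", "P3"], global_ranking)
print(conformist, dissenter)
```

Under this assumed metric, the dissenting reviewer scores lower whether or not they are right about P5, which is exactly the worry: the score rewards predicting the consensus rather than reporting an opinion.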

That came out pretty negative.  Again, I like the idea of experimenting with the review process.  I like the idea that submitters review.  I understand the concept that we somehow want to incentivize good reviews, and that's very difficult to incentivize.

This actual implementation... well, I'd love to hear other people argue why it's a good one.  And I'd certainly like to hear what people think of it after it's all done.  But it looks like the wrong way to go to me.  Maybe in the morning, with some time to think about it, and with some comments from people, it will look better to me.  Or maybe, after others' comments, it will seem even worse.  


JeffE said...

Oh, God. Really? _Really_? So researchers are going to be penalized for having more experience or specialized knowledge about well-written but flawed proposals?

Why does this remind me of the mechanism for SAT essay graders, which rewards "agreement" over sound judgement?

Suresh Venkatasubramanian said...

I read the linked paper. Two things concern me. Firstly, they're proposing the use of a modified Borda count to find the global ranking. It's one of the standard methods, but is not, in my understanding, the "best" method in any case. The Wikipedia article has a helpful list of ways to beat the method:

Secondly, the process by which they encourage good reviewers is very hacky, and not based (or claimed to be based) on any sound method for revealing honest preferences. In the original paper it's perfectly fine - after all, that paper is just a proposed idea for allocating telescope time. But to use the method essentially unchanged for decisions on proposals seems rather odd. The only hope is that the differences "wash" out so that everyone is sufficiently different from the global consensus.

Finally, the main impetus for this idea comes from the telescope reviewing process in which apparently you have to review over a hundred proposals in a matter of weeks. Needless to say, anything you can do to reduce reviewer load is welcome, and the whole point of the paper is to achieve that. But NSF proposal review loads are nowhere close: I'll review about 9-10 proposals in a typical panel, and I have two months to do so!

The main problem with NSF proposals is that if you submit, you can't review, and if you submit every year, then the only people reviewing are ones who don't submit OR aren't in the area. In theory this isn't a huge problem because people tend not to submit every year if they get funding. But in other areas the panels can look fairly crazy if your entire peer group can't review relevant proposals.

So in that sense, this idea has some merit, because it allows for PIs to review as well. But I think the claims that this is based on sound mechanism design are overblown.
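
For readers unfamiliar with the Borda count mentioned above, here is a minimal sketch. It implements a plain Borda count in which each reviewer ranks only a subset of proposals and each proposal's points are averaged over the reviewers who saw it. The "modified" variant in the paper may well differ, so the averaging step, the alphabetical tie-break, and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def borda_ranking(ballots):
    """Combine partial rankings into one global ranking with a plain Borda count.

    Each ballot is a list of proposal IDs, best first. A proposal in position i
    of a ballot of length m earns m - 1 - i points. Points are averaged over the
    ballots that include the proposal (an assumption, since reviewers see only
    subsets), and ties are broken alphabetically.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ballot in ballots:
        m = len(ballot)
        for i, proposal in enumerate(ballot):
            totals[proposal] += m - 1 - i
            counts[proposal] += 1
    average = {p: totals[p] / counts[p] for p in totals}
    return sorted(average, key=lambda p: (-average[p], p))

# Three hypothetical reviewers ranking the same three proposals.
ballots = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
ranking = borda_ranking(ballots)
print(ranking)
```

Because only positions matter, a proposal that half the reviewers put first and half put last ends up with a middling average, which is one way to see why controversial proposals may fare poorly under such a count.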

Anonymous said...

"[M]y goal in reviewing does not seem to be to present my actual opinion of a paper, but to present my belief about how other reviewers will opine about the paper. My experience suggests that this does lead to the best reviews." Do you perhaps mean "does not lead"?

Incidentally, this mechanism is known as Keynesian beauty contest.

Michael Mitzenmacher said...

Oops -- Thanks Anon 3. Fixed.

Unknown said...

The most interesting proposals are those that dive into contested areas, where there is disagreement on how to proceed. In my NSF panel experience, such proposals receive a diverse response from the reviewers, and the job of the panel is to sort out the arguments pro and con. The incentives here are for this not to happen. One can imagine that among the competent proposals all the reviews will look the same, and it will all come down to the program officers' preferences.

Magda said...

I agree with your point 2b. Discussion - and sometimes very heated discussion - in NSF panels usually leads people to change their opinions and scores. Not saying that's necessarily good, either: often, the paper that generates the least amount of controversy ends up at the top. On the other hand, there's also the 'championing' phenomenon where a panelist clearly knows the field better than others and manages to sway them to his/her pov.

I wonder what would happen if we tried that for conference reviewing?

Suresh Venkatasubramanian said...

Another issue is the role of reputation. If I'm trying to guess what other people think will rise to the top, I might very well weigh reputation heavily, since that's the component that everyone typically agrees on (as opposed to specifics of the proposal). This could easily lead to a situation where "the rich get richer".

Magda said...

Suresh: Isn't this already the case?

Suresh Venkatasubramanian said...

It's not a binary thing. There is definitely a tendency, but I'm arguing that the tendency will get accentuated.

D. Eppstein said...

Re your point "e) This scheme is complicated." I've encountered situations where people were deliberately pushing complicated voting schemes on the grounds that more complication makes things harder to game. But this attitude reminds me of the old lessons from Knuth on how to design random number generators: making it clean and easy to analyze works much better than piling on complications without understanding what they do.

Michael Mitzenmacher said...

To all: Nobody seems to have commented on one of the important high level issues: should reviewing quality (even if done "correctly" -- that is, not by what Anonymous points out is called the Keynesian beauty contest) matter at all in what gets funded? Is that a system we want or don't want? (If I review well, can I increase my chances of my next paper getting into STOC/SIGCOMM/SOSP/SODA...?)

Jeff -- agreed with you (and your comments on Google+). I also have issues with Borda count which I think you've pointed out; namely, it seems like it punishes controversial or high-risk proposals. (I thought NSF had been promoting high-risk research lately?) (Similarly, see Larry Blume's comment -- the incentives just seem off.)

Suresh -- Can I admit to having concerns that a proposal for telescope reviewing is applied to NSF reviewing as though the two were equivalent?

Magda -- "Championing" of papers is something I regularly see in program committees. I know of at least one committee that explicitly said that every PC member's top choice should be accepted.

Suresh/Magda : As your discussion says, it's not like the current system is perfect; things like reputation can have an (undue) effect (some would argue that reputation having an effect is a positive, not a negative!), and the system is open to manipulation. As Suresh suggests, the question is whether this is a step in the positive or negative direction. That's why I'm interested to hear what people think afterwards...

David -- Agreed, complexity doesn't do away with manipulation; in fact, it can make it easier to hide. Anyone have other thoughts on how complex the given system is?

Suresh Venkatasubramanian said...

I guess ultimately I'm confused about what problem is being solved here. Is it the undue load on reviewers? That can't be right, since the proposal reviewing load is usually in the 9-10 range. Is it the lack of quality in proposal reviews? I've always had very high quality discussions in my meetings: sometimes the summaries don't completely capture the sense of the room, but there's a lot of thoughtful discussion.

I'd like to understand what exactly the problem is: then we can sic our expert mechanism design folks on it to design a solution :).

Unknown said...

Is there anything in the rules that prevents a hostile takeover of an entire area's funding? It seems that submitting a proposal, no matter how tangential to the actual discipline as previously defined, makes you a member of the panel evaluating proposals. There is a limit on the number of proposals, but suppose we organize an effort to, e.g., have theory people all apply to a small program in systems, claiming to be doing relevant theoretical work, and we all give each other good reviews. Then not only do we get the immediate points for our good reviews of each other's proposals, but if we form a majority of proposals, we also get bonuses for our rankings agreeing with the global rankings. (And they get penalties for disagreeing.)

I think this is a real danger, not just hypothetical. We always get some proposals claiming to do theory from people who are not actually doing ToC -- thinly disguised proposals from mathematicians (without any real applications) or from engineers (without any theorems). Many of these people submit in a scattershot way, to as many programs as they can think of in as many ways as the rules allow. Right now, the low probability of success is a disincentive, but under the new system, tangential work could dominate "real theory".

Parinaz Naghizadeh said...

We have been looking at this new NSF review pilot in more detail. It turns out that the idea of rewarding applicants based on their review quality can prevent some dishonest behavior, but at the same time many of the concerns expressed here are indeed valid.

We’ve also come across other interesting observations. For example, even if all reviewers are honest and accurate, they won’t be equally rewarded based on their reviews. Suppose 6 of the reviewers of a lower quality proposal (say, the one that intrinsically deserves to be ranked 10th out of 25) have this proposal at (or close to) the top of their pile and give it an (almost) perfect score. Then the one reviewer who gets a more evenly distributed pile of proposals, in which this lower quality proposal falls somewhere in the middle, ends up losing his/her bonus points, because it will look as if he/she failed to do a good job reviewing.

Also, even if all reviewers honestly express their opinions about a high-risk, high-yield proposal, this review process itself will put such high-quality proposals at a disadvantage. More interestingly, a low quality controversial proposal ends up having better chances of being funded in this new process.

Here is a link to our work. I’d appreciate hearing your thoughts/comments…