A lot of attention has been focused on the NIPS experiment on its own reviewing system, Eric Price’s blog post on it, and unfortunately my flippant tweet pointing out some inaccuracies in the blog post. So I’ll try to clarify what I meant here.
Edit: folks looking for more info on the NIPS experiment should check out Neil Lawrence’s initial blog post on it and stay tuned, as he and Corinna Cortes will be eventually releasing a more thorough writeup on the results.
Eric Price’s Post
In the spirit of Eric’s post, here’s a TL;DR: the minor inaccuracies in Eric’s post should not detract from the main message that the results from the NIPS experiment are concerning, but we should be careful to get the details right before trying to explain the scary results with faulty information.
The three details that popped out at me as inaccurate were the idea that the program committee was split into two independent committees, that the committee only discussed papers with an average score between 6.0 and 6.5, and that area chairs could not see if a paper was a duplicate. In Eric’s defense, a lot of these details and more were clarified in the comments and discussion below his post, so what I’m writing here is somewhat redundant (e.g., he points out in the comments that NIPS does not have a fixed acceptance rate).
On the first point. I’m not completely sure how the program chairs implemented the duplication. But I don’t think the concept of splitting the program committee in half is correct or makes sense the way NIPS reviewing is organized. Most of the area chairing, fostering discussion, quality control of reviews, etc., is done independently by each area chair, so there no real concept of split independent committees. I’m not even sure what that would mean. The committee is in some sense already split 92 ways, for each area chair, but reviewers may be assigned papers with different chairs, so it’s not really independent. The way Eric’s post describes it invokes an image of two huge conference rooms of area chairs talking about the papers, which is a model that doesn’t scale to the absurd size of the NIPS conference nowadays.
As for the issue of which papers were discussed, I believe this was simply a recantation of a half-joking description of what the review process was, but the tongue-in-cheek tone is lost in writing. First, the reviewers discussed any paper that wasn’t immediately an obvious, high-confidence consensus. Then the area chairs examined all these reviews, papers, discussions, joining in the discussion in most cases. Then the area chairs met in pairs to go over all their papers together (except conflicted papers), making pair judgement calls on each and identifying controversial or tricky decisions that needed to be discussed at the larger meetings. After that, the chairs met in groups of four with one of the program chairs to go over these tricky decisions. Nowhere in this process does the average score even come up other than as a very rough way for area chairs to sort the papers, all of which they have to individually consider. But we also had much better heuristics to sort by, like controversiality, spread of scores, and low confidence scores. I’ll write more about my own personal experience with this process later.
Lastly, area chairs, who had access to author identities (because we are partially responsible for preventing conflicts of interest), could in fact see if a paper was a duplicate. To make the experiment work on CMT, the program chairs had to create a fake additional author, whose name was something like NIPS Dup1. So it was pretty obvious. It’s not clear how this affected the experiment. Different area chairs might have reacted differently to it. I did my best to ignore this bit of information, to preserve the experiment, but there’s no way that my awareness of this didn’t affect my behavior. Did I give these papers more attention because I knew my area chairing would be compared to someone else’s? Or did I give them less attention because I wanted to focus on my actual assigned papers? I wanted to do neither, but who knows how successful I was. This fact surely contaminates the experimental results and any conclusions we might be able to definitely make about it.
I think the reason I was irked by these nitpicky details was partially because the discussion in the comments seemed to suggest that they were the problem with NIPS and the reason for the inconsistency. But I was hasty to criticize on Twitter, because Eric’s post really brings up a few important points and interesting discussion and hypotheses. It is truly great that people are talking about it, and lots of non-machine-learning communities have been made aware of the NIPS experiment through Eric’s post. I’d like some more caution about overgeneralizing from the experimental results, like Eric’s TL;DR does, but I suppose that’s inevitable, so we might as well get right to it with the first public writeup on the results. Hopefully people read on past that, since the rest of his analysis is pretty level-headed and thoughtful.
My own experience as area chair
Since this was my first year on the senior program committee, I got a new perspective on the NIPS review process that might be helpful to share to others. It wasn’t a huge difference from what I’d seen as a reviewer in the past, but since I was responsible for chairing 21 papers, I got a somewhat larger sample size. But it was still only 21 papers, which is much less than the number of papers that the duplication experiment used, and my sample is super biased toward my subject areas. So take this with a HUGE grain of salt. It’s just an anecdote, not a scientific study.
My initial experience with the process was disappointing. I watched the reviewers enter their reviews, and more often than not, these reported opinions that were unsubstantiated, unsupported in the review writeups, late, and unproofread. As a rough guess of the ratio here, each paper had three reviewers initially, and I would say about 3 out of 4 papers had one thoughtful review. After that, the reviewers with crappy reviews became very hard to reach. They didn’t participate in the discussion without a ton of prodding, they didn’t respond to the author rebuttals, and when they did respond, they surprisingly tended to stick to their original, seemingly unsubstantiated opinions.
So that was terrible. The good news is all that happened afterwards. Like I mentioned above, we area chairs met in pairs. (Even though I have only good things to say about the area chairs I met with, I’ll leave names off to preserve the anonymity of the review process.) My AC partner met with me on Skype for (I think) about two hours, maybe three, and we talked through each of our papers. The easy decisions, where all reviewers agreed, reported high confidence, and demonstrated high confidence through their words, were very quick discussions, but most of our time was spent debating about the more difficult decisions. Throughout this discussion, it was clear to me how thoughtfully my partner had considered the reviews and the papers, understanding when reviewers were making valid arguments and when they were being unfair. In many ways, I was reassured that another area chair was trying as hard as I was to find the signal in the noise.
The next stage was a meeting with four area chairs and a program chair. Again, we met on Skype, and this time went through a filtered list of the papers our respective AC pairs had decided needed further discussion. This list mostly included papers on the borderline of acceptance or ones where the reviewers were unwilling to agree on a decision. Reading the reviews and looking at the papers, we did our best to make a decision during the meeting, and again, anytime we couldn’t reach a decision, it was marked for further discussion and consideration as a borderline paper by the next level of the hierarchy: the program chairs.
After that point, I’m in the dark as to what the PCs’ decision process was. I know at some point they had to cut off acceptances for reasons of physical venue space, but it’s my understanding that the acceptance rate before that point is already pretty close to the usual 25%-ish rate.
So what now?
So my experience with the whole process could be summarized by saying that I saw some really disappointing reviewers, but was rescued from losing faith in NIPS by the thoughtfulness of the area chairs I worked with and the program chairs. But even within the sea of bad reviews, there were some standout individuals who anchored (mixed maritime metaphor…) these discussions and decisions in reason, fairness, and perspective. So I think as NIPS continues expanding to accommodate the growing interest in its topics, we’ll have to figure out how to address the growing proportion of bad reviewers that we’ll need to recruit to handle its scale. Maybe the answer is better reviewer training, or maybe the answer is more work for the more competent reviewers, or maybe there is no answer.
One important aspect to consider is, when we talk about peer review being broken, unscalable, or any other complaint, that the primary purpose of these huge processes is not to assign credit for peoples’ work, it’s to decide what content to present in a conference, and despite all the noise, and all the flaws in the system, the quality of the conferences I’ve been attending has always been consistently high. More specifically, to not generalize beyond what I’m discussing in this post, NIPS 2014 was a great conference. Of course, in the real world, assigning credit is a super-important part of this whole deal. (Luckily, the process for journals and for funding decisions tends to happen at a smaller scale, so my guess is it’s less noisy and less crappy, but that certainly isn’t perfect either.) So something does have to be fixed, but it’s not as broken as it may feel at times.
(There are people experimenting with new models of publication to try to address these issues, e.g., see the talks from the ICML 13 Workshop on Peer Review, and the International Conference on Learning Representations (ICLR))
Lastly, I’ll conclude with two questions I’ve been asking myself and my colleagues lately. Of all the peer review experiences you’ve had as a reviewer, a chair, or an author, how often do all the reviewers understand the paper and make a valid decision based on this understanding? For me, the answer is nearly zero percent of the time. Reviewers almost never understand the papers I submit, whether they accept or reject, and when I’m a reviewer or chair, reviewers almost always have different interpretations of what’s going on in the paper, which means they can’t all be correct. So peer review is broken, right? Maybe, but as a second question, how often is the final decision the right decision? For me, the answer is pretty close to always. E.g., the papers I’ve had rejected for dumb reasons have always needed a lot of improvement in retrospect, or maybe the papers reviewers don’t get aren’t written or thought through well enough. Maybe despite reviewers not knowing what they’re reading, their B.S. detectors still work fine. But we shouldn’t just settle for that if it’s true. I dunno. Lots to think about…