On the NIPS Experiment and Review Process

A lot of attention has been focused on the NIPS experiment on its own reviewing system, on Eric Price’s blog post about it, and, unfortunately, on my flippant tweet pointing out some inaccuracies in that post. So I’ll try to clarify what I meant here.

Edit: folks looking for more info on the NIPS experiment should check out Neil Lawrence’s initial blog post on it and stay tuned, as he and Corinna Cortes will eventually release a more thorough writeup of the results.


Eric Price’s Post

In the spirit of Eric’s post, here’s a TL;DR: the minor inaccuracies in Eric’s post should not detract from the main message that the results from the NIPS experiment are concerning, but we should be careful to get the details right before trying to explain the scary results with faulty information.

The three details that popped out at me as inaccurate were the idea that the program committee was split into two independent committees, that the committee only discussed papers with an average score between 6.0 and 6.5, and that area chairs could not see if a paper was a duplicate. In Eric’s defense, a lot of these details and more were clarified in the comments and discussion below his post, so what I’m writing here is somewhat redundant (e.g., he points out in the comments that NIPS does not have a fixed acceptance rate).

On the first point: I’m not completely sure how the program chairs implemented the duplication, but I don’t think the concept of splitting the program committee in half is correct, or even makes sense, given the way NIPS reviewing is organized. Most of the area chairing, fostering of discussion, quality control of reviews, etc., is done independently by each area chair, so there is no real concept of split independent committees. I’m not even sure what that would mean. The committee is in some sense already split 92 ways, one for each area chair, but reviewers may be assigned papers under different chairs, so it’s not really independent. The way Eric’s post describes it evokes an image of two huge conference rooms of area chairs talking about the papers, which is a model that doesn’t scale to the absurd size of the NIPS conference nowadays.

As for the issue of which papers were discussed, I believe this was simply a recounting of a half-joking description of the review process, but the tongue-in-cheek tone is lost in writing. First, the reviewers discussed any paper that didn’t immediately have an obvious, high-confidence consensus. Then the area chairs examined all these reviews, papers, and discussions, joining in the discussion in most cases. Then the area chairs met in pairs to go over all their papers together (except conflicted papers), making pair judgment calls on each and identifying controversial or tricky decisions that needed to be discussed at the larger meetings. After that, the chairs met in groups of four with one of the program chairs to go over these tricky decisions. Nowhere in this process does the average score even come up, other than as a very rough way for area chairs to sort the papers, all of which they have to consider individually. But we also had much better heuristics to sort by, like controversiality, spread of scores, and low confidence scores. I’ll write more about my own personal experience with this process later.
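To make those sorting heuristics concrete, here’s a minimal sketch in Python of how an area chair might order papers by disagreement and reviewer confidence rather than by the raw average. The data structure, field names, and numbers are all invented for illustration; this is not CMT’s schema or any actual NIPS tooling.

```python
# Toy illustration only: the fields and numbers are invented, not CMT's schema.
papers = [
    {"id": "A", "scores": [7, 7, 8], "confidences": [4, 5, 4]},
    {"id": "B", "scores": [3, 6, 9], "confidences": [2, 3, 5]},
    {"id": "C", "scores": [5, 6, 6], "confidences": [1, 2, 2]},
]

def average(xs):
    return sum(xs) / len(xs)

def attention_key(paper):
    # Bigger spread = more reviewer disagreement; lower minimum confidence = shakier reviews.
    spread = max(paper["scores"]) - min(paper["scores"])
    low_confidence = -min(paper["confidences"])
    return (spread, low_confidence)

# Sorting by average score hides how much the reviewers actually disagree...
by_average = sorted(papers, key=lambda p: average(p["scores"]), reverse=True)
# ...whereas sorting by spread and low confidence surfaces the papers that need the chair's attention first.
by_attention = sorted(papers, key=attention_key, reverse=True)

print([p["id"] for p in by_average])    # ['A', 'B', 'C']
print([p["id"] for p in by_attention])  # ['B', 'C', 'A']
```

The paper with the widest spread of scores jumps to the top, even though its average looks unremarkable, which is roughly the kind of triage the heuristics above are meant to support.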

Lastly, area chairs, who had access to author identities (because we are partially responsible for preventing conflicts of interest), could in fact see if a paper was a duplicate. To make the experiment work on CMT, the program chairs had to create a fake additional author, whose name was something like NIPS Dup1. So it was pretty obvious. It’s not clear how this affected the experiment. Different area chairs might have reacted differently to it. I did my best to ignore this bit of information, to preserve the experiment, but there’s no way that my awareness of it didn’t affect my behavior. Did I give these papers more attention because I knew my area chairing would be compared to someone else’s? Or did I give them less attention because I wanted to focus on my actual assigned papers? I wanted to do neither, but who knows how successful I was. This fact surely contaminates the experimental results and any conclusions we might be able to draw definitively from them.

I think the reason I was irked by these nitpicky details was partially because the discussion in the comments seemed to suggest that they were the problem with NIPS and the reason for the inconsistency. But I was hasty to criticize on Twitter, because Eric’s post really does bring up a few important points, along with interesting discussion and hypotheses. It is truly great that people are talking about it, and lots of non-machine-learning communities have been made aware of the NIPS experiment through Eric’s post. I’d like to see a bit more caution about overgeneralizing from the experimental results, which Eric’s TL;DR does, but I suppose that’s inevitable, so we might as well get right to it with the first public writeup on the results. Hopefully people read on past that, since the rest of his analysis is pretty level-headed and thoughtful.


My own experience as area chair

Since this was my first year on the senior program committee, I got a new perspective on the NIPS review process that might be helpful to share with others. It wasn’t a huge difference from what I’d seen as a reviewer in the past, but since I was responsible for chairing 21 papers, I got a somewhat larger sample size. Still, it was only 21 papers, which is far fewer than the number of papers used in the duplication experiment, and my sample is super biased toward my subject areas. So take this with a HUGE grain of salt. It’s just an anecdote, not a scientific study.

My initial experience with the process was disappointing. I watched the reviewers enter their reviews, and more often than not, these were late, unproofread, and reported opinions that were unsubstantiated and unsupported in the writeups. As a rough guess of the ratio: each paper initially had three reviewers, and I would say about 3 out of 4 papers had one thoughtful review. After that, the reviewers with crappy reviews became very hard to reach. They didn’t participate in the discussion without a ton of prodding, they didn’t respond to the author rebuttals, and when they did respond, they surprisingly tended to stick to their original, seemingly unsubstantiated opinions.

So that was terrible. The good news is all that happened afterwards. Like I mentioned above, we area chairs met in pairs. (Even though I have only good things to say about the area chairs I met with, I’ll leave names off to preserve the anonymity of the review process.) My AC partner met with me on Skype for (I think) about two hours, maybe three, and we talked through each of our papers. The easy decisions, where all reviewers agreed, reported high confidence, and demonstrated high confidence through their words, were very quick discussions, but most of our time was spent debating about the more difficult decisions. Throughout this discussion, it was clear to me how thoughtfully my partner had considered the reviews and the papers, understanding when reviewers were making valid arguments and when they were being unfair. In many ways, I was reassured that another area chair was trying as hard as I was to find the signal in the noise.

The next stage was a meeting with four area chairs and a program chair. Again, we met on Skype, and this time went through a filtered list of the papers our respective AC pairs had decided needed further discussion. This list mostly included papers on the borderline of acceptance or ones where the reviewers were unwilling to agree on a decision. Reading the reviews and looking at the papers, we did our best to make a decision during the meeting, and again, anytime we couldn’t reach a decision, it was marked for further discussion and consideration as a borderline paper by the next level of the hierarchy: the program chairs.

After that point, I’m in the dark as to what the PCs’ decision process was. I know at some point they had to cut off acceptances for reasons of physical venue space, but it’s my understanding that the acceptance rate before that point is already pretty close to the usual 25%-ish rate.


So what now?

So my experience with the whole process could be summarized by saying that I saw some really disappointing reviewers, but was rescued from losing faith in NIPS by the thoughtfulness of the area chairs I worked with and the program chairs. But even within the sea of bad reviews, there were some standout individuals who anchored (mixed maritime metaphor…) these discussions and decisions in reason, fairness, and perspective. So I think as NIPS continues expanding to accommodate the growing interest in its topics, we’ll have to figure out how to address the growing proportion of bad reviewers that we’ll need to recruit to handle its scale. Maybe the answer is better reviewer training, or maybe the answer is more work for the more competent reviewers, or maybe there is no answer.

One important aspect to consider, when we talk about peer review being broken, unscalable, or any other complaint, is that the primary purpose of these huge processes is not to assign credit for people’s work; it’s to decide what content to present at a conference. And despite all the noise and all the flaws in the system, the quality of the conferences I’ve been attending has always been consistently high. More specifically, so as not to generalize beyond what I’m discussing in this post: NIPS 2014 was a great conference. Of course, in the real world, assigning credit is a super-important part of this whole deal. (Luckily, the processes for journals and for funding decisions tend to happen at a smaller scale, so my guess is they’re less noisy and less crappy, but they certainly aren’t perfect either.) So something does have to be fixed, but it’s not as broken as it may feel at times.

(There are people experimenting with new models of publication to try to address these issues; e.g., see the talks from the ICML 13 Workshop on Peer Review and the International Conference on Learning Representations (ICLR).)

Lastly, I’ll conclude with two questions I’ve been asking myself and my colleagues lately. Of all the peer review experiences you’ve had as a reviewer, a chair, or an author, how often do all the reviewers understand the paper and make a valid decision based on this understanding? For me, the answer is nearly zero percent of the time. Reviewers almost never understand the papers I submit, whether they accept or reject, and when I’m a reviewer or chair, reviewers almost always have different interpretations of what’s going on in the paper, which means they can’t all be correct. So peer review is broken, right? Maybe, but as a second question, how often is the final decision the right decision? For me, the answer is pretty close to always. E.g., the papers I’ve had rejected for dumb reasons have always needed a lot of improvement in retrospect, or maybe the papers reviewers don’t get aren’t written or thought through well enough. Maybe despite reviewers not knowing what they’re reading, their B.S. detectors still work fine. But we shouldn’t just settle for that if it’s true. I dunno. Lots to think about…
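As a purely hypothetical toy model (a sketch I’m adding here, not an analysis of the actual NIPS data or the experiment’s numbers), here’s a small Python simulation of that intuition: individually noisy reviewers can still produce aggregate decisions that match a noiseless decision most of the time, with the disagreements concentrated near the accept/reject borderline.

```python
import random

random.seed(0)

def simulate(n_papers=10000, n_reviewers=3, noise=2.0, threshold=6.0):
    """Toy model: each paper has an unobserved 'true quality'; each reviewer
    reports that quality plus Gaussian noise; the committee accepts if the
    mean reported score clears a threshold. Returns how often this noisy
    decision matches what a noiseless oracle would have decided."""
    agree = 0
    for _ in range(n_papers):
        quality = random.uniform(1, 10)
        scores = [quality + random.gauss(0, noise) for _ in range(n_reviewers)]
        noisy_decision = sum(scores) / n_reviewers >= threshold
        oracle_decision = quality >= threshold
        agree += noisy_decision == oracle_decision
    return agree / n_papers

print(simulate())  # roughly 0.9 with these made-up parameters
```

The ~0.9 figure is purely an artifact of the invented noise level and threshold; the only point is that, in a model like this, the noisy and noiseless decisions disagree mainly on borderline papers.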

14 comments

  1. Pingback: Quick comments on the NIPS experiment | Windows On Theory

  2. Pingback: the NIPS peer review experiment and more cyber zeitgeist | Turing Machine

  3. Thanks for the detailed comments, Bert, and for your perspective on the process. It’s so useful to hear the detailed perspective of the AC. I was pretty happy with the overall quality of reviewing (whilst agreeing there are *definitely* lots of problematic cases!). Do you think, in part, you feel there were so many because that’s where your attention had to be directed? If you think about the absolute percentage, what would your estimate be?

  4. That’s a great point. My memory of this from the summer is probably biased and filtered through the reviews I had to pay more attention to. Perhaps my initial estimate of reviews being more often bad than good is overly pessimistic. Maybe the number is on the other side of that line, so around 40% of the reviews that came in were unsatisfactory. Once we engaged the reviewers in discussions, the reviews improved somewhat (in some rare cases, I directly told some to add more justification to their opinions, including the ones your automated scripts flagged for being too short). The final review quality was reasonable, and perhaps it’s naive of me to expect better, but it took a lot of what felt like pulling teeth to get them to that quality.

  5. Pingback: NIPS 2014 main meeting | Memming

  6. Hi Bert, I wanted to comment on this one:
    <>

    This is certainly a problem. On the other hand, the cause is murky… It might be because the reviewers did not have the expertise, did not take the time to properly digest it, etc. But it could also be because the paper is not written and organized clearly enough. Science writing is tricky. And language is ambiguous anyway. It is also why some brilliant ideas have been largely ignored: they have been poorly presented in the current context… Communication is a two-way channel: both the writer and the reader have some responsibility…

    Some food for thought… 😉

    • Another good point! The concepts we study in our field are difficult to both convey and understand, and we ask reviewers to understand them really quickly for the fast conference review cycle, often, as you mentioned, when they are suboptimally presented in the first place. It’s interesting to think about the two axes of technical quality and presentation. If good presentation reduces the noisiness of reviews because all reviewers can understand what’s actually going on in the paper, then the best papers, which are clearly written and technically good, should consistently be accepted, but bad papers, which are unclear and not technically good, could have mixed reviews. Then good-bad papers, which are clearly written but are crappy science, would be unanimously rejected, and bad-good papers, which are good science but poorly written, would be somewhat indistinguishable from truly bad papers. So everyone should spend more time making their papers clearer so we can reject them more confidently!

      • Indeed. And then like everywhere in life, we have multiple tradeoffs:
        1) As an author, we can choose between spending more time to clarify the paper presentation, or to improve the science and/or do different science (though I would argue that presenting ideas clearly is part of the science: there is a famous quote in French from Nicolas Boileau (1674): “Ce que l’on conçoit bien s’énonce clairement,
        Et les mots pour le dire arrivent aisément.” i.e. “Whatever is well conceived is clearly said, And the words to say it flow with ease.”).
        2) As a reviewer, we can choose to spend more time to understand this murky paper, spend more time to make sure the decision is less noisy, spend more time to give suggestions to improve the paper (I tend to do too much of this myself I think — sometimes I feel as if I am doing the work instead of the authors and that they are abusing the system by submitting incomplete work); or spend this precious time to do our own science…

        Which actually makes me think of an important question: what is the reasonable number of hours that we can expect a reviewer to spend on, say, a NIPS reviewing load of 5 papers? I.e., once a reviewer agrees to review for NIPS, what is the reasonable time budget commitment that this person has signed up for? A suggested range might be helpful to include in the reviewing instructions…

  7. Pingback: Blogs on the NIPS Experiment | Inverse Probability

  8. Having been an author and a reviewer for NIPS in 2015, I have found the entire process rather shabby. I’ve noticed that even when up to 6 or 7 out of 7 reviewers were positive for some papers I reviewed, the area chairs often posted hasty questions that steered the discussion towards rejection. A lot of papers accepted by the reviewers were simply terminated by the area chairs at their arbitrary discretion, with rather poor meta-reviews that would not even provide the necessary citations for some of the ACs’ arguments. Lots of rushed comments suggested that some ACs were not even able to properly formulate their opinions based on the reviews and authors’ rebuttals, or were raising new issues after the rebuttals that authors could not respond to. Seriously, dear NIPS, what is the point of asking reviewers to spend their time and then letting ACs do as they please? The entire process feels biased and dubious. This is the first time I have started appreciating conferences such as CVPR, where ACs actually seem to respect reviewers’ scores.

    • And overall, the entire review process seems to be based on a petty desire to quickly reject papers based on personal ideas of what should or should not be done in the paper. Reviewers look for petty reasons to quickly reject works rather than considering whether a given idea has potential. I wonder how the reviews can ever be consistent if neither reviewers nor ACs are given any coaching to really understand what to look at and how to weigh various pros and cons. There should be clear and well-formulated criteria for the entire process, and both reviewers and ACs should go through some kind of training to leave behind their personal opinions, likes, dislikes, and views and follow a more unified and fair process. Being an expert in one’s field does not mean one is objective in any way. Imagine a school and exams where there are no clear criteria for how the assignments are scored… what would be the point of that?

  9. Pingback: The NIPS experiment | A bunch of data

  10. Pingback: The NIPS experiment | iTechFlare

