The NeurIPS 2021 Consistency Experiment
Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan
NeurIPS 2021 Program Chairs
In 2014, NeurIPS ran an experiment in which 10% of submissions were reviewed by two independent program committees to quantify the randomness in the review process. Since then, the number of annual NeurIPS submissions has increased more than fivefold. To check whether decision consistency has changed as the conference has grown, we ran a variant of this experiment again in 2021. This experiment was independently reviewed and approved by an Institutional Review Board (IRB).
For a more detailed discussion of the original 2014 experiment and results, please see this recent retrospective analysis of the results by 2014 Program Chairs, Corinna Cortes and Neil Lawrence, this talk by Neil, or this retrospective talk.
How was the 2021 experiment implemented?
- During the assignment phase of the review process, we chose 10% of papers uniformly at random—we’ll refer to these as the “duplicated papers.” We assigned two Area Chairs (ACs) and twice the normal number of reviewers to these papers. With the help and guidance of the team at OpenReview, we then created a copy of each of these papers and split the ACs and reviewers at random between the two copies. We made sure that the two ACs were assigned to two different Senior Area Chairs (SACs) so that no SAC handled both copies of the same paper. Any newly invited reviewer for one copy was automatically added as a conflict for the other copy. We’ll refer to the SAC, AC, and reviewers assigned to the same copy as the copy’s “committee.” The papers’ committees were not told about the experiment and were not aware the paper had been duplicated.
- The authors of duplicated papers were notified of the experiment right before initial reviews were released and instructed to respond to each set of reviews independently. They were also asked to keep the experiment confidential. At the time initial reviews were released, 8765 of the original 9122 submitted papers were still under review, and 882 of these were duplicated papers.
- As in 2014, duplicated papers were accepted if at least one of the two copies was recommended for acceptance and no “fatal flaw” was found. This resulted in 92 accepted papers that would not have been accepted had we not run the experiment. Four papers were accepted by one committee but were ultimately rejected due to what was considered a fatal flaw. In an additional two cases, the committees for the two papers disagreed about whether a flaw was “fatal.” In these cases, the papers were conditionally accepted with conditions determined jointly by the two committees; both were ultimately accepted.
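The duplication step in the first bullet above can be illustrated with a small sketch. It is not the OpenReview implementation: the function name `duplicate_and_split`, the `committee_for` callback, and the toy data are all hypothetical, and the constraint that the two ACs report to different SACs is not modeled.

```python
import random

def duplicate_and_split(paper_ids, committee_for, fraction=0.10, seed=0):
    """Pick a fraction of papers uniformly at random and split each paper's
    ACs and reviewers at random between two copies.

    committee_for(paper_id) is a hypothetical callback returning
    (two ACs, an even-sized list of reviewers) for a duplicated paper.
    """
    rng = random.Random(seed)
    n_dup = round(fraction * len(paper_ids))
    duplicated = rng.sample(paper_ids, n_dup)  # uniform, without replacement

    assignments = {}
    for pid in duplicated:
        acs, reviewers = committee_for(pid)
        acs, reviewers = list(acs), list(reviewers)
        rng.shuffle(acs)
        rng.shuffle(reviewers)
        half = len(reviewers) // 2
        assignments[pid] = {
            "original": {"ac": acs[0], "reviewers": reviewers[:half]},
            "duplicate": {"ac": acs[1], "reviewers": reviewers[half:]},
        }
    return assignments

# Toy data (hypothetical): 100 papers, each with 2 ACs and 8 reviewers.
toy_papers = list(range(100))
toy_committee = lambda pid: (
    [f"ac{pid}a", f"ac{pid}b"],
    [f"rev{pid}_{i}" for i in range(8)],
)
assignments = duplicate_and_split(toy_papers, toy_committee)
print(len(assignments), "papers duplicated")
```

Fixing the seed only makes the toy run reproducible; it says nothing about how the actual assignment was randomized.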
The table below summarizes the outcomes for the 882 duplicated papers:
As we can see from the table, there is especially high disagreement on which papers should be selected for orals and spotlights. More than half of all spotlights recommended by either committee were rejected by the other (13/25 and 13/23).
Note that 118 papers were withdrawn after initial reviews were released. We include these withdrawn papers in our analysis. Authors were most likely to withdraw their paper if both copies were headed for rejection. The withdrawal rate after seeing initial reviews was 45% higher for papers not in the experiment compared with duplicated papers, which we suspect is because authors of duplicated papers had two shots at acceptance.
There are a few ways to think about the results. First, we can measure the fraction of inconsistent outcomes—the fraction of duplicated papers that were accepted by only one of the two committees (as either a poster, a spotlight, or an oral). The number of papers with inconsistent outcomes was 203 out of 882, or 23.0%.
To put this number in context, we need a baseline. There were 206 papers accepted in the original set and 195 papers accepted in the duplicate set, for an average acceptance rate of 22.7%. If acceptance decisions were made at random with a 0.227 chance of accepting each paper, we would expect the fraction of inconsistent outcomes to be 35.1%. While the fraction of inconsistent outcomes is closer to the random baseline than it is to 0, many of these papers could genuinely have gone either way. When ACs entered recommendations, they were asked to note whether they were sure or whether the paper could be bumped up or down. If we do not count a pair of decisions as inconsistent when it is “Accept that can be bumped down” versus “Reject,” or “Reject that can be bumped up” versus “Poster Accept that shouldn’t be bumped up to Spotlight,” the fraction of inconsistent outcomes drops to only 16%.
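The random baseline above follows from a short calculation: if two committees each accept a paper independently with probability p, they disagree with probability 2p(1 − p). The sketch below reproduces the figures from the counts reported in this post; the function and variable names are ours, chosen for illustration.

```python
def random_baseline(p: float) -> float:
    """Expected fraction of inconsistent outcomes when two committees
    independently accept each paper with probability p."""
    return 2 * p * (1 - p)

# Counts reported in the post.
n_duplicated = 882
accepted_original = 206
accepted_duplicate = 195
n_inconsistent = 203

avg_rate = (accepted_original + accepted_duplicate) / (2 * n_duplicated)
observed = n_inconsistent / n_duplicated
baseline = random_baseline(avg_rate)

print(f"average acceptance rate: {avg_rate:.1%}")
print(f"observed inconsistency:  {observed:.1%}")
print(f"random baseline:         {baseline:.1%}")
```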
We can see how the fraction of inconsistent outcomes would have changed if we shifted the acceptance threshold in different ways. For example, if the conference were so selective as to accept only orals and spotlights, the committees would have accepted 29 and 25 of the duplicated papers respectively, agreeing on only 3 papers. To visualize the impact of shifting the threshold, the gray curve in the plot below extends the random baseline to other acceptance rates. Points on the gray curve correspond to the expected fraction of inconsistent decisions if both committees were making acceptance decisions at random with the corresponding acceptance probability. We have added the following points to the plot:
- Accepting only papers recommended as orals or spotlights (only a 3% relative improvement over the random baseline).
- Bumping down all posters marked as candidates for being bumped down (a 25% relative improvement over the random baseline).
- Decisions made by NeurIPS 2021 ACs (a 35% relative improvement over the random baseline).
- Bumping up all rejects that were marked as candidates for being bumped up (a 35% relative improvement over the random baseline).
For comparison, in 2014, of the 166 papers that were duplicated, the two committees disagreed on 43 (25.9%). The acceptance rate was 25% for duplicated papers—a bit higher than the overall 2014 acceptance rate. The random baseline for this acceptance rate is 37.5% disagreement, so this is a 31% relative improvement (with a fairly large confidence interval given the small sample size).
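As a rough check on the relative improvement figures for the AC decisions in 2021 and 2014, the sketch below assumes that "relative improvement" means one minus the ratio of observed disagreement to the random baseline. That definition is our reading of the text, not something stated explicitly, and the names are illustrative.

```python
def random_baseline(p: float) -> float:
    """Disagreement rate expected from two independent random committees."""
    return 2 * p * (1 - p)

def relative_improvement(observed: float, baseline: float) -> float:
    """Assumed definition: fractional reduction in disagreement vs. random."""
    return 1 - observed / baseline

# 2021: 203 inconsistent outcomes among 882 duplicated papers.
rate_2021 = (206 + 195) / (2 * 882)
imp_2021 = relative_improvement(203 / 882, random_baseline(rate_2021))

# 2014: 43 disagreements among 166 duplicated papers, 25% acceptance rate.
imp_2014 = relative_improvement(43 / 166, random_baseline(0.25))

print(f"2021 relative improvement: {imp_2021:.1%}")
print(f"2014 relative improvement: {imp_2014:.1%}")
```

The values should land close to the roughly 35% and 31% quoted above; small differences can arise from rounding intermediate quantities.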
Another way of measuring disagreement is to look at the fraction of accepted papers that would have changed if we reran the review process. This is also the probability that a randomly chosen accepted paper would have been rejected if it were re-reviewed, previously discussed as the “complement to 1 of accept precision” and “arbitrariness” in the context of the 2014 experiment, also discussed here.
In 2014, 49.5% of the papers accepted by the first committee were rejected by the second (with a fairly wide confidence interval as the experiment included only 166 papers). This year, this number was 50.6%. We can also look at the probability that a randomly chosen rejected paper would have been accepted if it were re-reviewed. This number was 14.9% this year, compared to 17.5% in 2014.
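The two 2021 percentages can be recovered from the counts given earlier, under the assumption that both committees' decisions are pooled: each paper with an inconsistent outcome contributes exactly one accept that the other committee rejected and one reject that the other committee accepted. A minimal sketch, with illustrative variable names:

```python
n_duplicated = 882
accepted_original = 206
accepted_duplicate = 195
n_inconsistent = 203  # papers accepted by exactly one committee

total_accepts = accepted_original + accepted_duplicate
total_rejects = 2 * n_duplicated - total_accepts

# Each inconsistent paper has one accept decision and one reject decision
# that the other committee did not share.
p_reject_given_accept = n_inconsistent / total_accepts
p_accept_given_reject = n_inconsistent / total_rejects

print(f"accepted, but rejected on re-review: {p_reject_given_accept:.1%}")
print(f"rejected, but accepted on re-review: {p_accept_given_reject:.1%}")
```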
Feedback from ACs and SACs
After the review period ended and decisions were released, we gave ACs and SACs who were assigned to duplicated papers (in which there had been disagreement) access to the reviews and discussion for the papers’ other copies. We asked them to complete a brief survey to provide feedback. Of the 203 papers that were recommended for acceptance by only one committee, we received feedback on 99. Unfortunately, we received feedback from both committees for only 18 papers, which limits the scope of our analysis.
Based on this feedback, the vast majority of cases fell into one of three categories. First, there is what one AC called “noise on the decision frontier.” In such cases, there was no real disagreement, but one committee may have been feeling a bit more generous or more excited about the work and willing to overlook the paper’s limitations. Indeed, 48% of multiple-choice responses were “This was a borderline paper that could have gone either way; there was no real disagreement between the committees.”
Second, there were genuine disagreements about the value of the contribution or the severity of limitations. We saw a spectrum here ranging from basically borderline cases to a few more difficult cases in which expert reviewers disagreed. In some of these cases, there was also disagreement within committees.
Third were cases in which one committee found a significant issue that the other did not. Such issues included, for example, close prior work, incorrect proofs, and methodological flaws.
45% of responses were “I still stand by our committee’s decision,” while the remaining 7% were “I believe the other committee made the right decision.” We can only speculate about why this may be the case. Part of this could be that, once formed, opinions are hard to change. Part of it is that many of these papers are borderline, and different borderline papers just fundamentally appeal to different people. Part of it could also be selection bias; the ACs and SACs who took the time to respond to our survey may have been more diligent and involved during the review process as well, leading to better decisions.
There are two caveats we would like to call out that may impact these results.
First, although we asked authors of duplicated papers to respond to the two sets of reviews independently, there is evidence that some authors put significantly more effort into their responses for the copy that they felt was more likely to be accepted. In fact, some authors told us directly that they were only going to spend the time to write a detailed response for the copy of their paper with the higher scores. Overall, there were 50 pairs of papers where authors only left comments on the copy with the higher average score.
To dig into this a bit more, we had 8,765 papers still under review at the time initial reviews were released. The acceptance rate for the 7,883 papers not in the experiment was 2036/7883 = 25.8%. (Note that the overall acceptance rate for the conference was 25.6%, but this overall rate also includes papers that were withdrawn or rejected for violations of the CFP prior to initial reviews being released; here we are looking only at papers still under review at this point.) As discussed above, the average acceptance rate for duplicated papers was 22.7% (206 papers recommended for acceptance in the original set and 195 papers recommended in the duplicate set, for 401 acceptances out of a total of 882 × 2 = 1,764 decisions). The 95% binomial confidence intervals for the two observed rates do not overlap. Authors changing their behavior may account for this difference. This confounder may have somewhat skewed the results of the experiment.
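The interval comparison can be sketched as follows. The post does not say which interval method was used, so this example assumes the normal-approximation (Wald) interval with z = 1.96, and it treats the two decisions on each duplicated paper as independent trials, which is a simplification.

```python
import math

def wald_interval(successes: int, n: int, z: float = 1.96):
    """Normal-approximation 95% interval for a binomial proportion.
    (Assumed method; other choices such as Wilson intervals exist.)"""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Papers not in the experiment: 2036 accepted out of 7883.
non_dup_lo, non_dup_hi = wald_interval(2036, 7883)

# Duplicated papers: 401 accepts out of 2 * 882 committee decisions.
dup_lo, dup_hi = wald_interval(206 + 195, 2 * 882)

intervals_overlap = dup_hi >= non_dup_lo and non_dup_hi >= dup_lo

print(f"not duplicated: [{non_dup_lo:.3f}, {non_dup_hi:.3f}]")
print(f"duplicated:     [{dup_lo:.3f}, {dup_hi:.3f}]")
print("overlap:", intervals_overlap)
```

Under these assumptions the gap between the two intervals is narrow, so a different interval method could give a slightly different picture.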
Second, when decisions shifted as part of the calibration process, ACs were often asked to edit their meta-reviews to move a paper from “poster” to “reject” or vice versa, or from “spotlight” to “poster” or vice versa. We observed several cases in which ACs made these changes without altering the field for whether a paper “can be bumped up” or “can be bumped down.” For example, there were nine cases in which it appears that a duplicated paper was initially marked “poster” and “can be bumped down” and later moved to “reject,” ending up marked as the nonsensical “reject” and “can be bumped down.” This could potentially introduce minor inaccuracies into our analysis of shifted thresholds.
The experimental results appear consistent with the 2014 experiment, when the conference received fewer than a fifth as many submissions. Thus there is no evidence that the decision process has become more or less noisy with increasing scale.
For program chairs, there is a perennial question: “How selective should the conference be?” With the current review process, it appears that being significantly more selective will significantly increase the arbitrariness (i.e., fraction of accepted papers with a different decision upon rereview). However, increasing the acceptance rate may not decrease the arbitrariness appreciably.
Finally, we would encourage authors to avoid excessive discouragement from rejections as there is a real possibility that the result says more about the review process than the paper.
We would like to thank the entire OpenReview team, especially Melisa Bok, for their support with the experiment. We also thank the reviewers, ACs, and SACs who contributed their time to the review process, and all of the authors who submitted to NeurIPS.