On Tuesday, December 14, the last day of NeurIPS 2021, we will have our second and final day of NeurIPS Workshops.
Each workshop follows its own schedule and format, so it’s important for you to visit a workshop’s page for accurate details.
On this NeurIPS virtual site page, you will find the list of all workshops. By clicking on a workshop, you’ll find its page along with the livestream and/or zoom room link for the event.
Also, for your convenience, here is the list of workshops scheduled on Tuesday:
We hope you enjoyed last week’s NeurIPS conference and had a lovely weekend. On Monday, December 13, we kick off our first day of NeurIPS Workshops.
Each workshop follows its own schedule and format, so it’s important for you to visit a workshop’s page for accurate details.
On this NeurIPS virtual site page, you will find the list of all workshops. By clicking on a workshop, you’ll find its page along with the livestream and/or zoom room link for the event.
Also, for your convenience, here is the list of workshops scheduled on Monday:
On day 5 (Friday) of NeurIPS 2021 we have the last day of the main program.
Session 1, starting at 7.00 UTC-0, will begin with our second Town Hall. Please remember to send us your feedback via email or in the rocketchat channel #townhall; this is a great opportunity to share your thoughts about the conference. The session will conclude with an oral session with two concurrent tracks: theory and vision applications.
Session 2, starting at 15.00 UTC-0, will begin with an interview with Daniel Kahneman, who won the Nobel Memorial Prize in Economic Sciences in 2002 and is well known for his research in behavioral economics. This session will close with another poster session.
Concurrent with the main program, there will also be presentations from the Datasets and Benchmarks Track (during session 2 and also during the intermission between sessions 2 and 3) and demos.
Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan, NeurIPS 2021 Program Chairs
In 2014, NeurIPS ran an experiment in which 10% of submissions were reviewed by two independent program committees to quantify the randomness in the review process. Since then, the number of annual NeurIPS submissions has increased more than fivefold. To check whether decision consistency has changed as the conference has grown, we ran a variant of this experiment again in 2021. This experiment was independently reviewed and approved by an Institutional Review Board (IRB).
For a more detailed discussion of the original 2014 experiment and results, please see this recent retrospective analysis of the results by 2014 Program Chairs, Corinna Cortes and Neil Lawrence, this talk by Neil, or this retrospective talk.
How was the 2021 experiment implemented?
During the assignment phase of the review process, we chose 10% of papers uniformly at random—we’ll refer to these as the “duplicated papers.” We assigned two Area Chairs (ACs) and twice the normal number of reviewers to these papers. With the help and guidance of the team at OpenReview, we then created a copy of each of these papers and split the ACs and reviewers at random between the two copies. We made sure that the two ACs were assigned to two different Senior Area Chairs (SACs) so that no SAC handled both copies of the same paper. Any newly invited reviewer for one copy was automatically added as a conflict for the other copy. We’ll refer to the SAC, AC, and reviewers assigned to the same copy as the copy’s “committee.” The papers’ committees were not told about the experiment and were not aware the paper had been duplicated.
The authors of duplicated papers were notified of the experiment right before initial reviews were released and instructed to respond to each set of reviews independently. They were also asked to keep the experiment confidential. At the time initial reviews were released, 8765 of the original 9122 submitted papers were still under review, and 882 of these were duplicated papers.
As in 2014, duplicated papers were accepted if at least one of the two copies was recommended for acceptance and no “fatal flaw” was found. This resulted in 92 accepted papers that would not have been accepted had we not run the experiment. Four papers were accepted by one committee but were ultimately rejected due to what was considered a fatal flaw. In an additional two cases, the committees for the two copies disagreed about whether a flaw was “fatal.” In these cases, the papers were conditionally accepted, with conditions determined jointly by the two committees; both were ultimately accepted.
Results
The table below summarizes the outcomes for the 882 duplicated papers:
As we can see from the table, there is especially high disagreement on which papers should be selected for orals and spotlights. More than half of all spotlights recommended by either committee were rejected by the other (13/25 and 13/23).
Note that 118 papers were withdrawn after initial reviews were released. We include these withdrawn papers in our analysis. Authors were most likely to withdraw their paper if both copies were headed for rejection. The withdrawal rate after seeing initial reviews was 45% higher for papers not in the experiment compared with duplicated papers, which we suspect is because authors of duplicated papers had two shots at acceptance.
Inconsistent Decisions
There are a few ways to think about the results. First, we can measure the fraction of inconsistent outcomes—the fraction of duplicated papers that were accepted by only one of the two committees (as either a poster, a spotlight, or an oral). The number of papers with inconsistent outcomes was 203 out of 882, or 23.0%.
To put this number in context, we need a baseline. There were 206 papers accepted in the original set and 195 papers accepted in the duplicate set, for an average acceptance rate of 22.7%. If acceptance decisions were made at random, with a 0.227 chance of accepting each paper, we would expect the fraction of inconsistent outcomes to be 35.1%. While the fraction of inconsistent outcomes is closer to the random baseline than it is to 0, many of these papers could genuinely have gone either way. When ACs entered recommendations, they were asked to note whether they were sure or whether the paper could be bumped up or down. If we treat the pairs “Accept that can be bumped down” vs. “Reject” and “Reject that can be bumped up” vs. “Poster Accept that shouldn’t be bumped up to Spotlight” as consistent rather than inconsistent, the fraction of inconsistent outcomes drops to 16%.
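The random baseline quoted above can be reproduced directly: with two independent committees each accepting with probability p, a paper receives inconsistent decisions exactly when one committee accepts and the other rejects, which happens with probability 2p(1 - p).

```python
def random_baseline(p):
    # A paper gets inconsistent decisions when exactly one of the two
    # independent committees accepts it: probability 2 * p * (1 - p).
    return 2 * p * (1 - p)

# 2021: average acceptance rate over both copies of the 882 duplicated papers.
p_2021 = (206 + 195) / (2 * 882)          # ≈ 0.227
observed_2021 = 203 / 882                  # ≈ 0.230

print(f"baseline: {random_baseline(p_2021):.1%}")   # baseline: 35.1%
print(f"observed: {observed_2021:.1%}")             # observed: 23.0%
```

The same formula gives the 37.5% baseline for 2014’s 25% acceptance rate, and the gray curve described below is just this function evaluated over a range of acceptance probabilities.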
We can see how the fraction of inconsistent outcomes would have changed if we shifted the acceptance threshold in different ways. For example, if the conference were so selective as to accept only orals and spotlights, the committees would have accepted 29 and 25 of the duplicated papers respectively, agreeing on only 3 papers. To visualize the impact of shifting the threshold, the gray curve in the plot below extends the random baseline to other acceptance rates. Points on the gray curve correspond to the expected fraction of inconsistent decisions if both committees were making acceptance decisions at random with the corresponding acceptance probability. We have added the following points to the plot:
Accepting only papers recommended as orals or spotlights (only a 3% relative improvement over the random baseline).
Bumping down all posters marked as candidates for being bumped down (a 25% relative improvement over the random baseline).
Decisions made by NeurIPS 2021 ACs (a 35% relative improvement over the random baseline).
Bumping up all rejects that were marked as candidates for being bumped up (a 35% relative improvement over the random baseline).
For comparison, in 2014, of the 166 papers that were duplicated, the two committees disagreed on 43 (25.9%). The acceptance rate was 25% for duplicated papers—a bit higher than the overall 2014 acceptance rate. The random baseline for this acceptance rate is 37.5% disagreement, so this is a 31% relative improvement (with a fairly large confidence interval given the small sample size).
Accept Precision
Another way of measuring disagreement is to look at the fraction of accepted papers that would have changed if we reran the review process. This is also the probability that a randomly chosen accepted paper would have been rejected if it were re-reviewed, previously discussed as the complement of “accept precision” and as “arbitrariness” in the context of the 2014 experiment, and also discussed here.
In 2014, 49.5% of the papers accepted by the first committee were rejected by the second (with a fairly wide confidence interval, as the experiment included only 166 papers). This year, this number was 50.6%. We can also look at the probability that a randomly chosen rejected paper would have been accepted if it were re-reviewed. This number was 14.9% this year, compared to 17.5% in 2014.
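Both of this year’s rates follow directly from the duplicated-paper counts reported earlier (203 inconsistent decisions, with 206 and 195 acceptances across the two sets):

```python
n_dup = 882
acc_original, acc_duplicate = 206, 195
inconsistent = 203                      # papers accepted by exactly one committee

total_acceptances = acc_original + acc_duplicate      # 401
total_rejections = 2 * n_dup - total_acceptances      # 1363

# Chance a randomly chosen acceptance is rejected by the other committee:
print(f"{inconsistent / total_acceptances:.1%}")      # 50.6%
# Chance a randomly chosen rejection is accepted by the other committee:
print(f"{inconsistent / total_rejections:.1%}")       # 14.9%
```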
Feedback from ACs and SACs
After the review period ended and decisions were released, we gave ACs and SACs who were assigned to duplicated papers (in which there had been disagreement) access to the reviews and discussion for the papers’ other copies. We asked them to complete a brief survey to provide feedback. Of the 203 papers that were recommended for acceptance by only one committee, we received feedback on 99. Unfortunately, we received feedback from both committees for only 18 papers, which limits the scope of our analysis.
Based on this feedback, the vast majority of cases fell into one of three categories. First, there is what one AC called “noise on the decision frontier.” In such cases, there was no real disagreement, but one committee may have been feeling a bit more generous or more excited about the work and willing to overlook the paper’s limitations. Indeed, 48% of multiple-choice responses were “This was a borderline paper that could have gone either way; there was no real disagreement between the committees.”
Second, there were genuine disagreements about the value of the contribution or the severity of limitations. We saw a spectrum here ranging from basically borderline cases to a few more difficult cases in which expert reviewers disagreed. In some of these cases, there was also disagreement within committees.
Third were cases in which one committee found a significant issue that the other did not. Such issues included, for example, close prior work, incorrect proofs, and methodological flaws.
Another 45% of responses were “I still stand by our committee’s decision,” while the remaining 7% were “I believe the other committee made the right decision.” We can only speculate about why this may be the case. Part of it could be that, once formed, opinions are hard to change. Part of it is that many of these papers are borderline, and different borderline papers simply appeal to different people. Part of it could also be selection bias: the ACs and SACs who took the time to respond to our survey may have been more diligent and involved during the review process as well, leading to better decisions.
Limitations
There are two caveats we would like to call out that may impact these results.
First, although we asked authors of duplicated papers to respond to the two sets of reviews independently, there is evidence that some authors put significantly more effort into their responses for the copy that they felt was more likely to be accepted. In fact, some authors told us directly that they were only going to spend the time to write a detailed response for the copy of their paper with the higher scores. Overall, there were 50 pairs of papers where authors only left comments on the copy with the higher average score.
To dig into this a bit more, we had 8,765 papers still under review at the time initial reviews were released. The acceptance rate for the 7,883 papers not in the experiment was 2036/7883 = 25.8%. (Note that the overall acceptance rate for the conference was 25.6%, but this overall rate also includes papers that were withdrawn or rejected for violations of the CFP prior to initial reviews being released—here we are looking only at papers still under review at this point.) As discussed above, the average acceptance rate for duplicated papers was 22.7% (206 papers recommended for acceptance in the original set and 195 papers recommended in the duplicate set, for 401 acceptances out of the total of 882*2 papers). The 95% binomial confidence intervals for the two observed rates do not overlap. Authors changing their behavior may account for this difference. This confounder may have somewhat skewed the results of the experiment.
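The claim that the two intervals do not overlap can be checked with a normal-approximation sketch. (Treating the 1764 duplicated-paper decisions as independent trials is a simplification, since the two copies of a paper are correlated, but it suffices to reproduce the comparison.)

```python
import math

def binom_ci95(k, n):
    """95% confidence interval for a binomial proportion (normal approximation)."""
    p = k / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

ci_rest = binom_ci95(2036, 7883)      # papers not in the experiment
ci_dup = binom_ci95(401, 2 * 882)     # decisions on duplicated-paper copies

print(f"not in experiment: ({ci_rest[0]:.3f}, {ci_rest[1]:.3f})")
print(f"duplicated copies: ({ci_dup[0]:.3f}, {ci_dup[1]:.3f})")
print("intervals overlap:", ci_dup[1] >= ci_rest[0])  # False
```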
Second, when decisions shifted as part of the calibration process, ACs were often asked to edit their meta-reviews to move a paper from “poster” to “reject” or vice versa, or from “spotlight” to “poster” or vice versa. We observed several cases in which ACs made these changes without altering the field for whether a paper “can be bumped up” or “can be bumped down.” For example, there were nine cases in which it appears that a duplicated paper was initially marked “poster” and “can be bumped down” and later moved to “reject,” ending up marked as the nonsensical “reject” and “can be bumped down.” This could potentially introduce minor inaccuracies into our analysis of shifted thresholds.
Key Takeaways
The experimental results appear consistent with the 2014 experiment when the conference was an order of magnitude smaller. Thus there is no evidence that the decision process has become more or less noisy with increasing scale.
For program chairs, there is a perennial question: “How selective should the conference be?” With the current review process, it appears that being substantially more selective would significantly increase the arbitrariness (i.e., the fraction of accepted papers that would receive a different decision upon re-review). However, increasing the acceptance rate may not decrease the arbitrariness appreciably.
Finally, we encourage authors not to be excessively discouraged by rejections, as there is a real possibility that the outcome says more about the review process than about the paper.
Acknowledgments
We would like to thank the entire OpenReview team, especially Melisa Bok, for their support with the experiment. We also thank the reviewers, ACs, and SACs who contributed their time to the review process, and all of the authors who submitted to NeurIPS.
On day 3 (Wednesday) we have a full agenda, starting with the main program, which will consist of three sessions.
Session 1, starting at 7.00 UTC-0, will showcase the invited lecture by Gabor Lugosi on “Do We Know How to Estimate the Mean?”. This lecture is in honor of Leo Breiman, a distinguished statistician who passed away in 2005 and whose seminal contributions include bagging and random forests, to name a few. The invited talk will be followed by a poster session.
Session 3, starting at 23.00 UTC-0, will begin with the lecture by Peter Bartlett on “Benign Overfitting”; this talk is in memory of Ed Posner, who founded NeurIPS. This last session will conclude with a poster session.
Datasets and Benchmarks Track, and Demos. Concurrent with the main program, there will also be presentations from the Datasets and Benchmarks Track (during sessions 1 and 2) and demos.
Town Hall . Lastly, there will also be the first of two Town Hall meetings at 17.00 UTC-0, which is an opportunity to provide feedback to the organizers and to learn about all the work that went on behind the scenes to prepare NeurIPS 2021. Consider sending feedback or questions prior to the two Town Hall meetings via email or in the rocketchat channel #townhall.
Opportunities. Remember to take advantage of our career website and mentorship opportunities, hang out at the café and send us your feedback via email or in the rocketchat channel #townhall.
This track will feature one of the Outstanding Paper Award winners, “On the Expressivity of Markov Reward” by David Abel, Will Dabney, Anna Harutyunyan, Mark K. Ho, Michael Littman, Doina Precup, and Satinder Singh.
Datasets and Benchmarks Track. Concurrent with the main program, there will also be presentations from the Datasets and Benchmarks Track (during sessions 1 and 2) and demos.
Opportunities. Remember to take advantage of our career website and mentorship opportunities, hang out at the café and send us your feedback via email or in the rocketchat channel #townhall.
Here are the highlights for the first day of NeurIPS 2021, which is dedicated to Tutorials!
There will be 10 tutorials in total. Each tutorial is four hours long, and two tutorials will run in parallel at any given time. Tutorials start on Monday, December 6 at 9.00 UTC-0 and end on Tuesday, December 7 at 5.00 UTC-0. The full list of tutorials and further details are available on the schedule page. Note that registration is not required to attend tutorials (only a login is required); however, registration is required to interact with tutorial presenters and to access most of the other content of this year’s virtual NeurIPS conference.
Socials. We also have our first social gathering today, ML in Korea.
Opportunities. Remember to take advantage of our career website and mentorship opportunities, hang out at the café and send us your feedback via email or in the rocketchat channel #townhall.
Tomorrow, Day 2. We start the main conference program, along with several more socials, demonstrations, competitions, and presentations from authors of the new track on Datasets and Benchmarks!
Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan, NeurIPS 2021 Program Chairs
As the impact of machine learning research grows, so does the risk that this research will lead to harmful outcomes. Several machine learning conferences—including ACL, CVPR, ICLR, and EMNLP—have taken steps to establish ethical expectations for the research community, including introducing ethics review processes. Last year, NeurIPS piloted its own ethics review process, chaired by Iason Gabriel. This year, we aimed to expand the ethics review process and ensure that it is in line with ongoing efforts to establish NeurIPS Ethics Guidelines that have been spearheaded by Marc’Aurelio Ranzato as General Chair.
At this early stage of adoption, ethics reviews are meant to be educational, not prohibitive. Our goal is not to police submissions, but instead to prompt reflection. The process that we implemented was intended to support this goal.
In some ways, the process has been a success. We were able to recruit qualified Ethics Reviewers with diverse areas of expertise; many Reviewers, ACs, and authors engaged constructively with these Ethics Reviewers; and authors improved their papers based on the feedback they received. However, there are several ongoing challenges with this new process, including how to surface the right set of papers to undergo ethics review; how to fit the ethics review process into the overall paper review timeline without overburdening Ethics Reviewers; and how to set clear expectations in order to achieve more alignment between Ethics Reviewers on what constitutes an ethical issue and when an ethical issue has been properly addressed. In this blog post, we discuss the ethics review process and share some of our learnings and recommendations going forward.
Overview of the Ethics Review Process
We view ethics reviews as a way of obtaining an additional expert perspective to inform paper decisions and provide feedback to authors. To implement this, we allowed Reviewers and Area Chairs (ACs) to flag papers for ethics review, just as ACs sometimes solicit external expert perspectives on technical issues. When a paper was flagged, Ethics Reviewers were added to the committee for the paper, included in discussions about the paper, and given the opportunity to interact with the authors during the rolling discussion period, just like other members of the committee. Ethics Reviewers were encouraged to provide constructive criticism that could lead to project improvement and maturity, similar to standard paper reviews.
Because this process was for educational purposes first, Ethics Reviewers were assigned to each flagged paper, regardless of how likely it was that the paper would be accepted on technical grounds. This was a change from last year and required us to recruit a larger pool of Ethics Reviewers. However, we believe that all authors can benefit from feedback on the ethical aspects of their research and the opportunity to integrate this feedback into future iterations of the work. While it would reduce the burden on Ethics Reviewers, only soliciting ethics reviews for papers likely to be accepted would counteract our goal of making this a constructive process for everyone in the community, rather than just operating as a filter to catch unethical content prior to publication.
Before the process began, we recruited 105 Ethics Reviewers with a wide range of disciplinary backgrounds. Ethics considerations in machine learning research are quite diverse—ranging from standard research ethics issues, like obtaining appropriate consent from human subjects, all the way to substantially thornier issues concerning the downstream negative societal impact of the work—so we understood the importance of working with a range of experts and allowing them to weigh in on issues aligned with their expertise. The breakdown of expertise is in the table below; some Ethics Reviewers had expertise in more than one area.
| Area of expertise | Number of Ethics Reviewers with this expertise | Number of papers flagged with issues |
| --- | --- | --- |
| Discrimination / Bias / Fairness Concerns | 92 | 34 |
| Inadequate Data and Algorithm Evaluation | 43 | 22 |
| Inappropriate Potential Applications & Impact (e.g., human rights concerns) | 47 | 52 |
| Legal Compliance (e.g., GDPR, copyright, terms of use) | 13 | 28 |
| Privacy and Security (e.g., consent) | 34 | 51 |
| Responsible Research Practice (e.g., IRB, documentation, research ethics) | 45 | 30 |
| Research Integrity Issues (e.g., plagiarism) | 24 | 47 |
During the review process, Reviewers had the chance to flag papers for ethics review by checking a box on the review form. For flagged papers, Reviewers could specify the areas of expertise required to properly assess the paper from the list in the table. (We note that the specific list of areas in the table was created based on common issues that arose last year and our own expectations about the types of ethics issues that might be flagged. However, it is not perfect, as we discuss more below.) In total, Reviewers flagged 265 papers out of 9122 submissions. The breakdown of issues flagged is in the last column of the table; note that some papers were flagged for more than one area of concern.
We designed an algorithm to assign papers to Ethics Reviewers with the right expertise while minimizing the maximum number of papers each Ethics Reviewer was assigned. Each flagged paper was reviewed by 2 Ethics Reviewers and each Ethics Reviewer had on average 4–5 papers to review.
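The post doesn’t specify the assignment algorithm in detail, but the objective can be sketched with a simple greedy rule; the function name, data shapes, and the least-loaded-first heuristic below are our illustration, not the actual NeurIPS implementation. Each flagged paper needs two Ethics Reviewers whose expertise covers its flagged areas, and giving each slot to the currently least-loaded qualified reviewer keeps the maximum load small.

```python
def assign_ethics_reviewers(papers, reviewers, per_paper=2):
    """papers: {paper_id: set of flagged areas};
    reviewers: {name: set of areas of expertise}."""
    load = {name: 0 for name in reviewers}
    assignment = {}
    for paper_id, areas in papers.items():
        # A reviewer is qualified if their expertise covers a flagged area.
        qualified = [n for n, exp in reviewers.items() if exp & areas]
        # Fill the paper's slots with the currently least-loaded reviewers.
        qualified.sort(key=lambda n: load[n])
        assignment[paper_id] = qualified[:per_paper]
        for n in assignment[paper_id]:
            load[n] += 1
    return assignment

# Toy example: three reviewers, two flagged papers.
papers = {"p1": {"privacy"}, "p2": {"privacy", "fairness"}}
reviewers = {"r1": {"privacy"}, "r2": {"privacy", "fairness"}, "r3": {"fairness"}}
print(assign_ethics_reviewers(papers, reviewers))
```

With 265 flagged papers, two reviews each, and 105 reviewers, a balanced assignment of this kind yields the 4–5 papers per reviewer reported above.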
Since Ethics Reviewers could not be assigned until initial reviews were submitted (and therefore papers flagged), the ethics review period coincided with the initial author response period. During this period, authors could see if their paper had been flagged or not, but ethics reviews only became available later during the rolling discussion period. Once available, authors, other Reviewers, and ACs were able to respond to the ethics reviews and engage in discussion with the assigned Ethics Reviewers.
In most cases, the issues raised in ethics reviews were not significant enough to prevent publication. In a small number of cases in which more serious issues were raised, papers were escalated to the Ethics Review Chairs and Program Chairs to deliberate and make the final decision on the paper, taking into consideration feedback from all parties as well as any responses from the authors. This resulted in a small number of papers being conditionally accepted and one paper being rejected on ethical grounds, as discussed below.
Ethical Issues Identified
The ethics review process brought to the surface a variety of ethical issues that regularly appear in research papers submitted to NeurIPS. The most common types of issues encountered at NeurIPS involve:
A lack of sufficient reflection around topics that involve thorny ethical considerations. For example:
Generative models that could be used to produce realistic fake content for misinformation campaigns and that may exhibit biases in their generated content. These include text generation and image generation (notably face generation) applications, as well as voice conversion.
Biometric surveillance projects that raise privacy concerns and could be used in sensitive contexts such as criminal justice. These include facial recognition and voice recognition projects.
The continued use of deprecated datasets that had already been explicitly removed from circulation by their authors. Such datasets include DukeMTMC, MS-Celeb-1M, and Tiny Images, whose use their authors have discouraged for ethical reasons.
Inappropriate communication and publication practices around identified security vulnerabilities. For example, adversarial attacks on publicly deployed systems that were not adequately disclosed to the affected organization before publication.
A lack of transparency on model or data details and decision-making, as it relates to ethical concerns. For example:
Failing to discuss potential biases arising from the use of a method or dataset, and a lack of documentation or acknowledgement of such issues, often manifesting as a lack of diverse examples in the paper or as evaluation on a homogeneous demographic.
Not providing adequate details on data provenance or distribution.
A lack of communication of the details of annotator work conditions.
Issues with appropriately handling or sourcing data involving humans. For example:
Collecting information about individuals with no concern for privacy and consent.
Violating copyright restrictions.
Indications of mistreatment of MTurk workers, or annotation and data collection practices that appeared exploitative.
Failing to submit the project to an Institutional Review Board (IRB) in situations clearly involving human subjects.
Uncritically emphasizing explicitly harmful applications, such as police profiling.
In many cases in which issues were identified, Ethics Reviewers simply recommended that authors reflect on the issues and include a discussion of them in the paper, either by expanding the discussion of potential negative societal impacts or being more explicit about limitations of the work. In other cases, Ethics Reviewers recommended more substantial modifications to the work, such as running additional experiments, the use of a different dataset, data/code distribution restrictions, or increased transparency measures like the inclusion of model or dataset documentation.
In some cases, the concerns raised were so critical that the acceptance of the paper was made conditional on the authors implementing the suggested mitigations. All such cases were discussed by the Program Chairs and Ethics Review Chairs, and the Ethics Reviewers were consulted in determining conditions for acceptance. Of eight papers conditionally accepted for ethical reasons, all were eventually accepted.
In a single case, the Program Chairs and Ethics Review Chairs jointly determined that the required mitigations would be so challenging to execute that they were beyond the scope of what the authors could realistically accomplish within the time frame for the camera-ready. In this case, the Program Chairs made the call to reject the paper on ethical grounds.
It should be noted that Ethics Reviewers were not always in agreement with each other. For 61% of submissions reviewed by two Ethics Reviewers, at least one Ethics Reviewer checked the box in their review form to indicate the paper had no ethical issue; in 42% of these cases, the Ethics Reviewers were split, with one indicating there was an issue and the other indicating there was not. Additionally, for 82% of submissions reviewed by two Ethics Reviewers, at least one Ethics Reviewer checked the box to indicate that the authors had not acknowledged the issue; in 43% of these cases, the other Ethics Reviewer indicated that the issue had been adequately acknowledged by the authors. We should not expect perfect agreement among Ethics Reviewers, but it is worth considering whether better guidance on what constitutes an ethical issue and how such issues should be addressed would be helpful.
Challenges Surfacing the Right Papers for Review
As implemented, the success of the ethics review process hinges on Reviewers and ACs appropriately flagging papers for ethics review. The biggest challenge that we faced—and one area that we as a community will need to work hard to improve if we want ethics review to be a success—is that there was a lot of uncertainty around which papers to flag, leading to inconsistency in which papers received ethics reviews.
97% of papers flagged for ethics review were flagged by only a single Reviewer; across 9122 submissions, only 8 papers were flagged by more than one Reviewer. Considering the 882 papers that were part of the broader consistency experiment (and therefore assigned to two independent committees for review), there were 23 papers for which the original copy was flagged and 22 for which the duplicate was flagged, but the overlap between these two sets was only 3 papers. (Another blog post containing the full results of the consistency experiment is coming soon!)
Still, there were some notable differences between papers that were flagged and those that were not. 29% of all flagged submissions were withdrawn compared with 20% of submissions overall. 16% of flagged papers were ultimately accepted compared with 25.6% of papers overall. And these differences are more stark than they appear since the 25.6% acceptance rate includes papers that were desk rejected for formatting violations or withdrawn before they received reviews.
Some of this inconsistency was due to “false positives”: papers that did not actually have issues of concern to Ethics Reviewers but were erroneously flagged anyway. As mentioned above, for 61% of flagged submissions with two ethics reviews, at least one Ethics Reviewer checked the box in their review form to indicate there was no ethical issue, with both Ethics Reviewers checking the box for 58% of these. False positives often involved:
Papers that Reviewers didn’t like. For example, some Reviewers flagged papers because the results were poorly presented.
Plagiarism and other serious Code of Conduct violations that were out of scope for the ethics review process and should instead have been escalated to the Program Chairs. We note that plagiarism was erroneously included as an example in the list of ethical issues that could be checked, which was likely the cause of this problem and easy to fix in future years.
In addition to this, there were “false negatives”—papers that were not flagged, even though they should have been. These are difficult to quantify since we don’t know what we missed. However, some false negatives were later surfaced through other means. These included:
Papers that made use of deprecated datasets that had been retracted for ethical reasons. These cases were surfaced by running a search of submissions for mentions of common deprecated datasets late in the review process.
Papers on biometric data generation (e.g., generating face or voice) or surveillance. These cases were again surfaced by running a keyword search late in the review process.
We recommend that in future years, NeurIPS should provide more extensive guidance and training for Reviewers on how to identify which papers to flag. We also recommend that future Program Chairs implement a systematic search for papers with common ethics issues so that these papers can be automatically included for ethics review without the need to rely on Reviewers to flag them.
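A systematic search of the kind recommended above could be as simple as a keyword scan over submission text. The dataset list and matching rule below are illustrative assumptions, not the tool the Program Chairs actually ran:

```python
# Datasets retracted by their authors for ethical reasons (illustrative list,
# based on the examples mentioned earlier in this post).
DEPRECATED_DATASETS = ("dukemtmc", "ms-celeb-1m", "tiny images")

def flag_deprecated_datasets(submissions):
    """submissions: {paper_id: full text}; returns {paper_id: matched names}."""
    flagged = {}
    for paper_id, text in submissions.items():
        lowered = text.lower()
        hits = [name for name in DEPRECATED_DATASETS if name in lowered]
        if hits:
            flagged[paper_id] = hits
    return flagged

# Toy example:
subs = {"a": "We train on MS-Celeb-1M faces.", "b": "We use CIFAR-10."}
print(flag_deprecated_datasets(subs))   # {'a': ['ms-celeb-1m']}
```

Running such a scan at submission time, rather than late in the review process, would let these papers enter ethics review on the normal schedule.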
Highlights and Lessons Learned
Overall, we consider the following to be highlights of the process this year:
We were able to recruit over 100 qualified Ethics Reviewers with diverse areas of expertise, including many from disciplines outside of machine learning. Ethics Reviewers took their role seriously and their engagement led to high-quality ethics reviews.
The scale of the operation this year gave us confidence that ethics review can be expanded to accommodate the growing number of flagged cases at a conference of this size. Of the 264 papers flagged, 250 received at least one ethics review: 202 received two ethics reviews and 48 received exactly one. This means that in total at least 452 ethics reviews were submitted. In reality, even more reviews were submitted due to additional reviews completed for papers flagged during the discussion period and as part of the Datasets and Benchmarks track.
Many Reviewers, ACs, and authors engaged constructively with Ethics Reviewers. Some authors indicated that they benefited from the feedback in their ethics reviews and made significant improvements to their papers as a result. Of the 452 submitted ethics reviews, 140 (31%) had responses from the authors. All eight papers that were conditionally accepted due to ethical issues were ultimately accepted.
As this process is still relatively new, there were also lessons learned, which suggest improvements to the process for future years:
As described above, there were challenges in surfacing the right set of papers to undergo ethics review, with many false positives and false negatives. This culminated in the Program Chairs running keyword searches over submissions late in the review process to identify papers with ethical issues that had been overlooked. We expect that reconsidering the set of ethical areas listed in the review form and providing better guidance on which papers to flag would help. We would additionally encourage future organizers to plan for more systematic automated flagging of papers early in the review process.
While Ethics Reviewers were not required to read the full details of the papers they were assigned, in practice it was difficult for them to assess whether or not the authors had appropriately reflected on ethical issues without reading the whole paper, which was burdensome given the short period of time allotted for ethics reviews. The pointers that authors included as part of the NeurIPS Paper Checklist were not enough. This is difficult to address; having designated sections on potential negative societal impact and limitations makes this easier, but to catch all potential issues, a full read-through may still be necessary.
There was a fair amount of disagreement among Ethics Reviewers about whether flagged papers had ethical issues and whether these issues were addressed, as discussed above. We should not expect perfect agreement among Ethics Reviewers, but better guidance on what constitutes an ethical issue may be valuable here too.
Since the review process for the new NeurIPS Datasets & Benchmarks track was entirely independent of the review process for the main track, this track was initially omitted from the ethics review process and only incorporated after the review process was underway; 10 papers from that track were then flagged for ethics review and one was rejected in part for ethical concerns. Since many ethical issues are related to data, this track should be included in the ethics review process going forward.
The ethics review process is still quite new, and both community norms and official conference guidelines are still evolving. We are grateful for the opportunity to contribute to this evolution as we all work to ensure that this community operates in the best interests of those impacted by our research. To learn more about how to incorporate ethical practices in your own research, attend the plenary panel on this topic Friday, December 10 at 11pm UTC (3pm PST).
NeurIPS 2021 will begin next week! As we prepare for the conference, we are delighted to take a moment to announce the recipients of the 2021 Outstanding Paper Awards, the Test of Time Award, and the new Datasets and Benchmarks Track Best Paper Awards.
First, we would like to say a huge thank you to the members of the community who led the award selection process. The Outstanding Paper Award committee consisted of Alice Oh, Daniel Hsu, Emma Brunskill, Kilian Weinberger, and Yisong Yue. The Test of Time Award committee consisted of Joelle Pineau, Léon Bottou, Max Welling, and Ulrike von Luxburg. We would also like to thank Nati Srebro, who helped set up the process for Outstanding Paper Award, and the members of the community who provided subject-matter expertise on specific papers and topics.
Outstanding Paper Awards
This year six papers were chosen as recipients of the Outstanding Paper Award. The committee selected these papers due to their excellent clarity, insight, creativity, and potential for lasting impact. Additional details about the paper selection process are provided below. While there is of course no perfect process for choosing award papers, we believe the NeurIPS community will appreciate the extremely strong contributions of these papers.
The award recipients are (in order of paper ID):
A Universal Law of Robustness via Isoperimetry By Sébastien Bubeck and Mark Sellke. This paper proposes a theoretical model to explain why many state-of-the-art deep networks require many more parameters than are necessary to smoothly fit the training data. In particular, under certain regularity conditions about the training distribution, the number of parameters needed for an O(1)-Lipschitz function to interpolate the training data below the label-noise level scales as nd, where n is the number of training examples and d is the dimensionality of the data. This result stands in stark contrast to conventional results stating that one needs n parameters for a function to interpolate the training data; the extra factor of d appears necessary in order to interpolate smoothly. The theory is simple and elegant, and consistent with some empirical observations about the size of models that have robust generalization on MNIST classification. This work also offers a testable prediction about the model sizes needed to develop robust models for ImageNet classification. This paper will be presented Tuesday, December 7 at 08:20 GMT (12:20 am PST) in the session on Deep Learning Theory and Causality.
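In symbols, the tradeoff described above can be paraphrased as follows (our informal restatement of the paper's result, with W denoting the number of parameters and Lip(f) the Lipschitz constant of the fitted function f):

```latex
% Law of robustness (informal paraphrase): any function f that interpolates
% n samples of d-dimensional data below the label-noise level satisfies
\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{W}},
% so an O(1)-Lipschitz interpolant requires W \gtrsim nd parameters,
% compared with the W \gtrsim n sufficient for mere (non-smooth) interpolation.
```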
On the Expressivity of Markov Reward By David Abel, Will Dabney, Anna Harutyunyan, Mark K. Ho, Michael Littman, Doina Precup, and Satinder Singh. Markov reward functions are the dominant framework for sequential decision making under uncertainty and reinforcement learning. This paper provides a careful, clear exposition of when Markov rewards are, or are not, sufficient to enable a system designer to specify a task, in terms of their preference for a particular behavior, preferences over behaviors, or preferences over state and action sequences. The authors demonstrate with simple, illustrative examples that there exist some tasks for which no Markov reward function can be specified that induces the desired task and result. Fortunately, they also show that it is possible in polynomial time to decide if a compatible Markov reward exists for a desired setting, and if it does, there also exists a polynomial time algorithm to construct such a Markov reward in the finite decision process setting. This work sheds light on the challenge of reward design and may open up future avenues of research into when and how the Markov framework is sufficient to achieve performance desired by human stakeholders. This paper will be presented Tuesday, December 7 at 09:20 GMT (1:20 am PST) in the session on Reinforcement Learning.
Deep Reinforcement Learning at the Edge of the Statistical Precipice By Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G. Bellemare. Rigorous comparison of methods can accelerate meaningful scientific advances. This paper suggests practical approaches to improve the rigor of deep reinforcement learning algorithm comparison: specifically, that the evaluation of new algorithms should provide stratified bootstrap confidence intervals, performance profiles across tasks and runs, and interquartile means. The paper highlights that standard approaches for reporting results in deep RL across many tasks and multiple runs can make it hard to assess if a new algorithm represents a consistent and sizable advance over past methods, and illustrates this with empirical examples. The proposed performance summaries are designed to be feasible to compute with a small number of runs per task, which may be necessary for many research labs with limited computational resources. This paper will be presented Wednesday, December 8 at 16:20 GMT (8:20 am PST) in the session on Deep Learning.
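As a rough illustration of the kind of aggregate metrics the paper advocates, here is a minimal sketch of an interquartile mean and a percentile bootstrap confidence interval. This is our own simplified example, not the authors' released tooling, and it omits the per-task stratification the paper recommends:

```python
import numpy as np

def interquartile_mean(scores):
    """Mean of the middle 50% of scores (discards the top and bottom 25%),
    a summary statistic that is more robust to outlier runs than the mean."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(s)
    return float(s[n // 4 : n - n // 4].mean())

def bootstrap_ci(scores, stat=interquartile_mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a summary statistic:
    resample the runs with replacement, recompute the statistic each time,
    and report the (alpha/2, 1 - alpha/2) quantiles of the resampled values."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    boots = [stat(rng.choice(scores, size=scores.size, replace=True))
             for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

For example, `interquartile_mean([0, 1, 2, 3, 4, 5, 6, 7])` is 3.5, the mean of the middle four values.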
MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers By Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. This paper presents MAUVE, a divergence measure to compare the distribution of model-generated text with the distribution of human-generated text. The idea is simple and elegant, and it basically uses a continuous family of (soft) KL divergence measures of quantized embeddings of the two texts being compared. The proposed MAUVE measure is essentially an integration over the continuous family of measures, and aims to capture both Type I error (generating unrealistic text) and Type II error (not capturing all possible human text). The empirical experiments demonstrate that MAUVE identifies the known patterns of model-generated text and correlates better with human judgements compared to previous divergence metrics. The paper is well-written, the research question is important in the context of rapid progress of open-ended text generation, and the results are clear. This paper will be presented Tuesday, December 7 at 8:00 GMT (midnight PST) in the session on Deep Learning.
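To make the construction above concrete, here is a toy sketch of a MAUVE-style score computed over two already-quantized distributions (histograms over embedding clusters). This is our own simplified illustration, not the official implementation: the quantization step is skipped, and the scaling constant `c`, the λ grid, and the added curve endpoints are all assumptions:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two histograms."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mauve_like(p, q, c=5.0, num_lambdas=99):
    """Area under a divergence frontier built from mixtures R = lam*P + (1-lam)*Q.
    Each mixture yields one frontier point: x shrinks with KL(Q || R)
    (Type I-style error) and y shrinks with KL(P || R) (Type II-style error)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    xs, ys = [], []
    for lam in np.linspace(0.01, 0.99, num_lambdas):
        r = lam * p + (1 - lam) * q
        xs.append(np.exp(-c * kl(q, r)))
        ys.append(np.exp(-c * kl(p, r)))
    # close the curve at the axes so identical distributions score exactly 1
    xs += [0.0, 1.0]
    ys += [1.0, 0.0]
    order = np.argsort(xs)
    x, y = np.asarray(xs)[order], np.asarray(ys)[order]
    # trapezoid rule for the area under the frontier
    return float(np.sum(0.5 * (y[:-1] + y[1:]) * np.diff(x)))
```

With this construction, identical distributions score 1, and the score decreases toward 0 as the two distributions diverge, capturing both error types in a single number.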
Continuized Accelerations of Deterministic and Stochastic Gradient Descents, and of Gossip Algorithms By Mathieu Even, Raphaël Berthier, Francis Bach, Nicolas Flammarion, Pierre Gaillard, Hadrien Hendrikx, Laurent Massoulié, and Adrien Taylor. This paper describes a “continuized” version of Nesterov’s accelerated gradient method in which the two separate vector variables evolve jointly in continuous-time—much like previous approaches that use differential equations to understand acceleration—but uses gradient updates that occur at random times determined by a Poisson point process. This new approach leads to a (randomized) discrete-time method that: (1) enjoys the same accelerated convergence as Nesterov’s method; (2) comes with a clean and transparent analysis that leverages continuous-time arguments, which is arguably easier to understand than prior analyses of accelerated gradient methods; and (3) avoids additional errors from discretizing a continuous-time process, which stands in stark contrast to several previous attempts to understand accelerated methods using continuous-time processes. This paper will be presented Wednesday, December 8 at 16:00 GMT (8:00 am PST) in the session on Optimization.
Moser Flow: Divergence-based Generative Modeling on Manifolds By Noam Rozen, Aditya Grover, Maximilian Nickel, and Yaron Lipman. This paper proposes a method for training continuous normalizing flow (CNF) generative models over Riemannian manifolds. The key idea is to leverage a result by Moser (1965) that characterizes the solution of a CNF (which Moser called an orientation preserving automorphism on manifolds) using a restricted class of ODEs that enjoys geometric regularity conditions, and is explicitly defined using the divergence of the target density function. The proposed Moser Flow method uses this solution concept to develop a CNF approach based on a parameterized target density estimator (which can be a neural network). Training amounts to simply optimizing the divergence of the density estimator, which side-steps running an ODE solver (required for standard backpropagation training). The experiments show faster training times and superior test performance compared to prior CNF work, as well as the ability to model densities on implicit surfaces with non-constant curvature such as the Stanford Bunny model. More generally, this concept of exploiting geometric regularity conditions to side-step expensive backpropagation training may be of broader interest. This paper will be presented Saturday, December 11 at 00:00 GMT (Friday, December 10 at 4:00 pm PST) in the session on Generative Modeling.
Selection Process:
The Outstanding Paper Committee determined a selection process with the goal of identifying an equivalence class of outstanding papers that represent some of the breadth of excellent research being conducted by the NeurIPS community.
The committee was given an initial batch of 62 papers including all papers that received an Oral slot and papers explicitly nominated by an Area Chair or Senior Area Chair. The committee used three phases of down-selection. In Phase 1, each paper in this initial batch was assigned one primary reader who determined if the paper should move on to Phase 2. In Phase 2, each paper was assigned an additional secondary reader. In Phase 3, all remaining papers were considered by the entire committee, and the primary and secondary readers were charged with articulating why a paper should be deserving of an award. In each subsequent phase, the committee made increasingly critical assessments and also made sharper comparisons across papers. In the later phases, the committee occasionally sought external input from subject matter experts. Thirty-two papers remained after Phase 1, thirteen after Phase 2, and the final six after Phase 3.
The committee identified two types of conflict of interest. Committee members with domain conflicts (e.g., authors from the same institution as the committee member), were not assigned as the primary or secondary readers on a paper. Committee members with personal conflicts (e.g., advisor/advisee relationships, previous co-authorship) were not assigned as readers and were additionally not allowed to provide input on whether the paper belonged in the final equivalence class.
Test of Time Award
Last but certainly not least, we are thrilled to announce that the recipient of the NeurIPS 2021 Test of Time Award is Online Learning for Latent Dirichlet Allocation by Matthew Hoffman, David Blei, and Francis Bach.
This paper introduces a stochastic variational gradient based inference procedure for training Latent Dirichlet Allocation (LDA) models on very large text corpora. On the theoretical side it is shown that the training procedure converges to a local optimum and that, surprisingly, the simple stochastic gradient updates correspond to a stochastic natural gradient of the evidence lower bound (ELBO) objective. On the empirical side the authors show that for the first time LDA can be comfortably trained on text corpora of several hundreds of thousands of documents, making it a practical technique for “big data” problems. The idea has made a large impact in the ML community because it represented the first stepping stone for general stochastic gradient variational inference procedures for a much broader class of models. After this paper, there would be no good reason to ever use full batch training procedures for variational inference anymore.
Selection Process:
Historically, the Test of Time Award has been awarded to a paper from the NeurIPS conference 10 years back. In 2020, the committee considered a broader range of papers and ended up selecting a recipient from 2011 instead of 2010. Because of this, this year, we gave the Test of Time Award Committee the option of choosing any paper from 2010 or 2011. After some discussion, the committee decided to focus specifically on 2010 since no paper published at that conference had previously been honored.
The committee first ranked all NeurIPS 2010 papers according to citation count. They defined a cutoff threshold at roughly 500 citations and considered all papers that achieved at least this citation count. This resulted in 16 papers. The committee took two weeks to read those papers (with each paper read by one or more committee members) and then met to discuss.
In this discussion, there was exactly one paper supported by all four members of the committee: Online Learning for Latent Dirichlet Allocation. Each of the committee members ranked this paper higher than all the other candidate papers and there was no strong runner-up, so the decision was easy and unanimous.
The Test of Time Award talk will take place in the final session of the conference, Saturday, December 11 at 01:00 GMT (Friday, December 10 at 5:00 pm PST).
Datasets & Benchmarks Best Paper Awards
This year NeurIPS launched the new Datasets & Benchmarks track to serve as a venue for data-oriented work. We are pleased to announce two best paper awards from this track. A shortlist of papers was selected based on reviewer scores. The final selected papers were chosen from this list based on a vote from all members of the advisory board. Both papers will be presented in the Datasets and Benchmarks Track 2 Session on Wednesday, December 8 at 16:00 GMT (8:00 am PST).
The award recipients are:
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research By Bernard Koch, Emily Denton, Alex Hanna, and Jacob Gates Foster. This paper analyzes thousands of papers and studies the evolution of dataset use within different machine learning subcommunities, as well as the interplay between dataset adoption and creation. It finds that in most communities, there is an evolution towards using fewer different datasets over time, and that these datasets come from a handful of elite institutions. This evolution is problematic, since benchmarks become less generalizable, biases that exist within the sources of these datasets may be amplified, and it becomes harder for new datasets to be accepted by the research community. This is an important ‘wake up call’ for the machine learning community as a whole, to think more critically about which datasets are used for benchmarking, and to put more emphasis on the creation of new and more varied datasets.
ATOM3D: Tasks on Molecules in Three Dimensions By Raphael John Lamarre Townshend, Martin Vögele, Patricia Adriana Suriana, Alexander Derry, Alexander Powers, Yianni Laloudakis, Sidhika Balachandar, Bowen Jing, Brandon M. Anderson, Stephan Eismann, Risi Kondor, Russ Altman, and Ron O. Dror. This paper introduces a collection of benchmark datasets with 3D representations of small molecules and/or biopolymers for solving a wide range of problems, spanning single molecular structure prediction and interactions between biomolecules as well as molecular functional and design/engineering tasks. Simple yet robust implementations of 3D models are then benchmarked against state-of-the-art models with 1D or 2D representations, and show better performance than their lower-dimensional counterparts. This work provides important insight about how to choose and design models for a given task. Not only does this work provide benchmarking datasets, it also provides baseline models and open-source tools to leverage these datasets and models, dramatically lowering the barrier to entry for machine learning researchers to get into computational biology and molecule design.