By Sanmi Koyejo and Shakir Mohamed, General Chairs
The planning for NeurIPS 2022 has already begun, and we are excited to have joined this year as co-General Chairs of the conference and to make our contributions to this important meeting in our field’s annual calendar.
NeurIPS is a large conference, and its organization continues to be driven largely by volunteers from our community committed to its success. As we start this process, we hope to have as wide a pool of people as possible from which to select the conference chairs. Please consider nominating yourself, or someone you know, as an organizer for one of the conference roles, or for a generalist role that you think might serve the conference better. Serving as an organizer is a great way to build experience in crafting large scientific meetings and balancing the many tradeoffs involved, to build new networks, and to give back to the community in a way that is different from reviewing or running workshops.
On day 5 (Friday) of NeurIPS 2021 we have the last day of the main program.
Session 1, starting at 7.00 UTC-0, will begin with our second Town Hall. Please remember to send us your feedback via email or in the rocketchat channel #townhall. This is a great opportunity to share your feedback about the conference. The session will conclude with an oral session with two concurrent tracks: theory and vision applications.
Session 2, starting at 15.00 UTC-0, will begin with an interview with Daniel Kahneman, who won the Nobel Memorial Prize in Economic Sciences in 2002 and is well known for his research in behavioral economics. This session will close with another poster session.
Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan, NeurIPS 2021 Program Chairs
In 2014, NeurIPS ran an experiment in which 10% of submissions were reviewed by two independent program committees to quantify the randomness in the review process. Since then, the number of annual NeurIPS submissions has increased more than fivefold. To check whether decision consistency has changed as the conference has grown, we ran a variant of this experiment again in 2021. This experiment was independently reviewed and approved by an Institutional Review Board (IRB).
For a more detailed discussion of the original 2014 experiment and results, please see this recent retrospective analysis of the results by 2014 Program Chairs, Corinna Cortes and Neil Lawrence, this talk by Neil, or this retrospective talk.
How was the 2021 experiment implemented?
During the assignment phase of the review process, we chose 10% of papers uniformly at random—we’ll refer to these as the “duplicated papers.” We assigned two Area Chairs (ACs) and twice the normal number of reviewers to these papers. With the help and guidance of the team at OpenReview, we then created a copy of each of these papers and split the ACs and reviewers at random between the two copies. We made sure that the two ACs were assigned to two different Senior Area Chairs (SACs) so that no SAC handled both copies of the same paper. Any newly invited reviewer for one copy was automatically added as a conflict for the other copy. We’ll refer to the SAC, AC, and reviewers assigned to the same copy as the copy’s “committee.” The papers’ committees were not told about the experiment and were not aware the paper had been duplicated.
The authors of duplicated papers were notified of the experiment right before initial reviews were released and instructed to respond to each set of reviews independently. They were also asked to keep the experiment confidential. At the time initial reviews were released, 8765 of the original 9122 submitted papers were still under review, and 882 of these were duplicated papers.
As in 2014, duplicated papers were accepted if at least one of the two copies was recommended for acceptance and no “fatal flaw” was found. This resulted in 92 accepted papers that would not have been accepted had we not run the experiment. Four papers were accepted by one committee but were ultimately rejected due to what was considered a fatal flaw. In an additional two cases, the committees for the two copies disagreed about whether a flaw was “fatal.” In these cases, the papers were conditionally accepted with conditions determined jointly by the two committees; both were ultimately accepted.
The table below summarizes the outcomes for the 882 duplicated papers:
As we can see from the table, there is especially high disagreement on which papers should be selected for orals and spotlights. More than half of all spotlights recommended by either committee were rejected by the other (13/25 and 13/23).
Note that 118 papers were withdrawn after initial reviews were released. We include these withdrawn papers in our analysis. Authors were most likely to withdraw their paper if both copies were headed for rejection. The withdrawal rate after seeing initial reviews was 45% higher for papers not in the experiment compared with duplicated papers, which we suspect is because authors of duplicated papers had two shots at acceptance.
There are a few ways to think about the results. First, we can measure the fraction of inconsistent outcomes—the fraction of duplicated papers that were accepted by only one of the two committees (as either a poster, a spotlight, or an oral). The number of papers with inconsistent outcomes was 203 out of 882, or 23.0%.
To put this number in context, we need a baseline. There were 206 papers accepted in the original set and 195 papers accepted in the duplicate set, for an average acceptance rate of 22.7%. If acceptance decisions were made at random with a 0.227 chance of accepting each paper, we would expect the fraction of inconsistent outcomes to be 35.1%. While the fraction of inconsistent outcomes is closer to the random baseline than it is to 0, many of these papers could genuinely have gone either way. When ACs entered recommendations, they were asked to note whether they were sure or whether the paper could be bumped up or down. If we treat the pairs “Accept that can be bumped down” vs. “Reject,” and “Reject that can be bumped up” vs. “Poster Accept that shouldn’t be bumped up to Spotlight,” as consistent rather than inconsistent, the fraction of inconsistent outcomes drops to only 16%.
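Under this random baseline, a paper gets an inconsistent outcome exactly when one committee accepts and the other rejects, which happens with probability 2p(1 − p) for acceptance probability p. A minimal sketch of the arithmetic (our reconstruction; the variable names are ours):

```python
# Reconstruction of the random-baseline arithmetic described above.

def inconsistency_baseline(p: float) -> float:
    """Expected fraction of inconsistent outcomes if two committees each
    accept independently with probability p: exactly one of them accepts."""
    return 2 * p * (1 - p)

n_duplicated = 882
accepted_original, accepted_duplicate = 206, 195
inconsistent = 203

# Average acceptance rate across both committees (~22.7%).
p_accept = (accepted_original + accepted_duplicate) / (2 * n_duplicated)

print(f"average acceptance rate: {p_accept:.1%}")                       # 22.7%
print(f"observed inconsistency:  {inconsistent / n_duplicated:.1%}")    # 23.0%
print(f"random baseline:         {inconsistency_baseline(p_accept):.1%}")  # 35.1%
```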
We can see how the fraction of inconsistent outcomes would have changed if we shifted the acceptance threshold in different ways. For example, if the conference were so selective as to accept only orals and spotlights, the committees would have accepted 29 and 25 of the duplicated papers respectively, agreeing on only 3 papers. To visualize the impact of shifting the threshold, the gray curve in the plot below extends the random baseline to other acceptance rates. Points on the gray curve correspond to the expected fraction of inconsistent decisions if both committees were making acceptance decisions at random with the corresponding acceptance probability. We have added the following points to the plot:
Accepting only papers recommended as orals or spotlights (only a 3% relative improvement over the random baseline).
Bumping down all posters marked as candidates for being bumped down (a 25% relative improvement over the random baseline).
Decisions made by NeurIPS 2021 ACs (a 35% relative improvement over the random baseline).
Bumping up all rejects that were marked as candidates for being bumped up (a 35% relative improvement over the random baseline).
For comparison, in 2014, of the 166 papers that were duplicated, the two committees disagreed on 43 (25.9%). The acceptance rate was 25% for duplicated papers—a bit higher than the overall 2014 acceptance rate. The random baseline for this acceptance rate is 37.5% disagreement, so this is a 31% relative improvement (with a fairly large confidence interval given the small sample size).
Another way of measuring disagreement is to look at the fraction of accepted papers that would have changed if we reran the review process. This is also the probability that a randomly chosen accepted paper would have been rejected if it were re-reviewed, previously discussed as the “complement to 1 of accept precision” and “arbitrariness” in the context of the 2014 experiment, also discussed here.
In 2014, 49.5% of the papers accepted by the first committee were rejected by the second (with a fairly wide confidence interval, as the experiment included only 166 papers). This year, this number was 50.6%. We can also look at the probability that a randomly chosen rejected paper would have been accepted if it were re-reviewed. This number was 14.9% this year, compared to 17.5% in 2014.
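Both of this year’s figures can be reproduced from the totals reported above (206 and 195 acceptances, 203 inconsistent outcomes, 882 duplicated papers), assuming each probability is averaged over the two committees; that averaging is our assumption, as the post does not spell out the calculation:

```python
# Reconstructing the re-review flip probabilities from this year's totals.
n_dup = 882
acc1, acc2 = 206, 195        # acceptances by the two committees
inconsistent = 203           # papers accepted by exactly one committee

both = (acc1 + acc2 - inconsistent) // 2   # accepted by both committees
only1, only2 = acc1 - both, acc2 - both    # accepted by one committee only

# P(randomly chosen accepted paper is rejected on re-review),
# averaged over the two committees:
p_flip_accept = (only1 / acc1 + only2 / acc2) / 2

# P(randomly chosen rejected paper is accepted on re-review):
rej1, rej2 = n_dup - acc1, n_dup - acc2
p_flip_reject = (only2 / rej1 + only1 / rej2) / 2

print(f"{p_flip_accept:.1%}")  # 50.6%
print(f"{p_flip_reject:.1%}")  # 14.9%
```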
Feedback from ACs and SACs
After the review period ended and decisions were released, we gave ACs and SACs who were assigned to duplicated papers (in which there had been disagreement) access to the reviews and discussion for the papers’ other copies. We asked them to complete a brief survey to provide feedback. Of the 203 papers that were recommended for acceptance by only one committee, we received feedback on 99. Unfortunately, we received feedback from both committees for only 18 papers, which limits the scope of our analysis.
Based on this feedback, the vast majority of cases fell into one of three categories. First, there is what one AC called “noise on the decision frontier.” In such cases, there was no real disagreement, but one committee may have been feeling a bit more generous or more excited about the work and willing to overlook the paper’s limitations. Indeed, 48% of multiple-choice responses were “This was a borderline paper that could have gone either way; there was no real disagreement between the committees.”
Second, there were genuine disagreements about the value of the contribution or the severity of limitations. We saw a spectrum here ranging from basically borderline cases to a few more difficult cases in which expert reviewers disagreed. In some of these cases, there was also disagreement within committees.
Third were cases in which one committee found a significant issue that the other did not. Such issues included, for example, close prior work, incorrect proofs, and methodological flaws.
Another 45% of responses were “I still stand by our committee’s decision,” while only the remaining 7% were “I believe the other committee made the right decision.” We can only speculate about why this may be the case. Part of this could be that, once formed, opinions are hard to change. Part of it is that many of these papers are borderline, and different borderline papers simply appeal to different people. Part of it could also be selection bias; the ACs and SACs who took the time to respond to our survey may have been more diligent and involved during the review process as well, leading to better decisions.
There are two caveats we would like to call out that may impact these results.
First, although we asked authors of duplicated papers to respond to the two sets of reviews independently, there is evidence that some authors put significantly more effort into their responses for the copy that they felt was more likely to be accepted. In fact, some authors told us directly that they were only going to spend the time to write a detailed response for the copy of their paper with the higher scores. Overall, there were 50 pairs of papers where authors only left comments on the copy with the higher average score.
To dig into this a bit more, we had 8,765 papers still under review at the time initial reviews were released. The acceptance rate for the 7,883 papers not in the experiment was 2036/7883 = 25.8%. (Note that the overall acceptance rate for the conference was 25.6%, but this overall rate also includes papers that were withdrawn or rejected for violations of the CFP prior to initial reviews being released—here we are looking only at papers still under review at this point.) As discussed above, the average acceptance rate for duplicated papers was 22.7% (206 papers recommended for acceptance in the original set and 195 papers recommended in the duplicate set, for 401 acceptances out of the total of 882*2 papers). The 95% binomial confidence intervals for the two observed rates do not overlap. Authors changing their behavior may account for this difference. This confounder may have somewhat skewed the results of the experiment.
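The non-overlap of the two intervals is easy to check with a normal-approximation (Wald) binomial interval; the post does not say which interval was used, so this choice is our assumption:

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% binomial confidence interval via the normal approximation."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Papers not in the experiment vs. duplicated papers (both copies pooled).
lo_rest, hi_rest = wald_ci(2036, 7883)   # acceptance rate ~25.8%
lo_dup, hi_dup = wald_ci(401, 2 * 882)   # acceptance rate ~22.7%

overlap = max(lo_rest, lo_dup) <= min(hi_rest, hi_dup)
print(f"not in experiment: [{lo_rest:.3f}, {hi_rest:.3f}]")
print(f"duplicated:        [{lo_dup:.3f}, {hi_dup:.3f}]")
print("intervals overlap:", overlap)  # False
```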
Second, when decisions shifted as part of the calibration process, ACs were often asked to edit their meta-reviews to move a paper from “poster” to “reject” or vice versa, or from “spotlight” to “poster” or vice versa. We observed several cases in which ACs made these changes without altering the field for whether a paper “can be bumped up” or “can be bumped down.” For example, there were nine cases in which it appears that a duplicated paper was initially marked “poster” and “can be bumped down” and later moved to “reject,” ending up marked as the nonsensical “reject” and “can be bumped down.” This could potentially introduce minor inaccuracies into our analysis of shifted thresholds.
The experimental results appear consistent with the 2014 experiment when the conference was an order of magnitude smaller. Thus there is no evidence that the decision process has become more or less noisy with increasing scale.
For program chairs, there is a perennial question: “How selective should the conference be?” With the current review process, it appears that being significantly more selective will significantly increase the arbitrariness (i.e., fraction of accepted papers with a different decision upon rereview). However, increasing the acceptance rate may not decrease the arbitrariness appreciably.
Finally, we encourage authors not to be excessively discouraged by rejections: there is a real possibility that the outcome says more about the review process than about the paper.
We would like to thank the entire OpenReview team, especially Melisa Bok, for their support with the experiment. We also thank the reviewers, ACs, and SACs who contributed their time to the review process, and all of the authors who submitted to NeurIPS.
On day 3 (Wednesday) we have a full agenda starting with the main program, which will consist of three sessions.
Session 1, starting at 7.00 UTC-0, will showcase the invited lecture by Gabor Lugosi on “Do We Know How to Estimate the Mean?”. This lecture is in honor of Leo Breiman, a distinguished statistician who passed away in 2005 and whose seminal contributions include bagging and random forests, to name a few. The invited talk will be followed by a poster session.
Session 3, starting at 23.00 UTC-0, will begin with the lecture by Peter Bartlett on “Benign Overfitting”; this talk is in memory of Ed Posner, who founded NeurIPS. This last session will conclude with a poster session.
Datasets and Benchmarks Track, and Demos. Concurrent with the main program, there will also be presentations from the Datasets and Benchmarks Track (during sessions 1 and 2) and demos.
Town Hall. Lastly, there will also be the first of two Town Hall meetings at 17.00 UTC-0, which is an opportunity to provide feedback to the organizers and to learn about all the work that went on behind the scenes to prepare NeurIPS 2021. Consider sending feedback or questions prior to the two Town Hall meetings via email or in the rocketchat channel #townhall.
This track will feature one of the Outstanding Paper Award winners, “On the Expressivity of Markov Reward” by David Abel, Will Dabney, Anna Harutyunyan, Mark K. Ho, Michael Littman, Doina Precup, and Satinder Singh.
Here are the highlights for the first day of NeurIPS 2021, which is dedicated to Tutorials!
There will be 10 tutorials in total. Each tutorial is four hours long, and two tutorials will run in parallel at any given time. Tutorials start on Monday 6 at 9.00 UTC-0 and end on Tuesday 7 at 5.00 UTC-0. The full list of tutorials and further details are available on the schedule page. Note that registration is not required to attend tutorials (only a login is required); however, registration is required to interact with tutorial presenters and to access most of the other content of this year’s virtual NeurIPS conference.
Socials. We also have our first social gathering today, ML in Korea.
Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan, NeurIPS 2021 Program Chairs
As the impact of machine learning research grows, so does the risk that this research will lead to harmful outcomes. Several machine learning conferences—including ACL, CVPR, ICLR, and EMNLP—have taken steps to establish ethical expectations for the research community, including introducing ethics review processes. Last year, NeurIPS piloted its own ethics review process, chaired by Iason Gabriel. This year, we aimed to expand the ethics review process and ensure that it is in line with ongoing efforts to establish NeurIPS Ethics Guidelines that have been spearheaded by Marc’Aurelio Ranzato as General Chair.
At this early stage of adoption, ethics reviews are meant to be educational, not prohibitive. Our goal is not to police submissions, but instead to prompt reflection. The process that we implemented was intended to support this goal.
In some ways, the process has been a success. We were able to recruit qualified Ethics Reviewers with diverse areas of expertise; many Reviewers, ACs, and authors engaged constructively with these Ethics Reviewers; and authors improved their papers based on the feedback they received. However, there are several ongoing challenges with this new process, including how to surface the right set of papers to undergo ethics review; how to fit the ethics review process into the overall paper review timeline without overburdening Ethics Reviewers; and how to set clear expectations so that Ethics Reviewers are better aligned on what constitutes an ethical issue and on when an ethical issue has been properly addressed. In this blog post, we discuss the ethics review process and share some of the lessons we learned and our recommendations going forward.
Overview of the Ethics Review Process
We view ethics reviews as a way of obtaining an additional expert perspective to inform paper decisions and provide feedback to authors. To implement this, we allowed Reviewers and Area Chairs (ACs) to flag papers for ethics review, just as ACs sometimes solicit external expert perspectives on technical issues. When a paper was flagged, Ethics Reviewers were added to the committee for the paper, included in discussions about the paper, and given the opportunity to interact with the authors during the rolling discussion period, just like other members of the committee. Ethics Reviewers were encouraged to provide constructive criticism that could lead to project improvement and maturity, similar to standard paper reviews.
Because this process was for educational purposes first, Ethics Reviewers were assigned to each flagged paper, regardless of how likely it was that the paper would be accepted on technical grounds. This was a change from last year and required us to recruit a larger pool of Ethics Reviewers. However, we believe that all authors can benefit from feedback on the ethical aspects of their research and the opportunity to integrate this feedback into future iterations of the work. While it would reduce the burden on Ethics Reviewers, only soliciting ethics reviews for papers likely to be accepted would counteract our goal of making this a constructive process for everyone in the community, rather than just operating as a filter to catch unethical content prior to publication.
Before the process began, we recruited 105 Ethics Reviewers with a wide range of disciplinary backgrounds. Ethics considerations in machine learning research are quite diverse—ranging from standard research ethics issues, like obtaining appropriate consent from human subjects, all the way to substantially thornier issues concerning the downstream negative societal impact of the work—so we understood the importance of working with a range of experts and allowing them to weigh in on issues aligned with their expertise. The breakdown of expertise is in the table below; some Ethics Reviewers had expertise in more than one area.
For each area of expertise, the table reported the number of Ethics Reviewers with that expertise and the number of papers flagged with issues in that area. The areas were:
Discrimination / Bias / Fairness Concerns
Inadequate Data and Algorithm Evaluation
Inappropriate Potential Applications & Impact (e.g., human rights concerns)
Privacy and Security (e.g., consent)
Responsible Research Practice (e.g., IRB, documentation, research ethics)
Research Integrity Issues (e.g., plagiarism)
During the review process, Reviewers had the chance to flag papers for ethics review by checking a box on the review form. For flagged papers, Reviewers could specify the areas of expertise required to properly assess the paper from the list in the table. (We note that the specific list of areas in the table was created based on common issues that arose last year and our own expectations about the types of ethics issues that might be flagged. However, it is not perfect, as we discuss more below.) In total, Reviewers flagged 265 papers out of 9122 submissions. The breakdown of issues flagged is in the last column of the table; note that some papers were flagged for more than one area of concern.
We designed an algorithm to assign papers to Ethics Reviewers with the right expertise while minimizing the maximum number of papers each Ethics Reviewer was assigned. Each flagged paper was reviewed by 2 Ethics Reviewers and each Ethics Reviewer had on average 4–5 papers to review.
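The assignment algorithm itself is not described in detail, but the objective (cover each flagged paper with matching expertise while keeping the maximum per-reviewer load small) can be illustrated with a simple greedy sketch; the reviewer names, areas, and helper function below are hypothetical:

```python
from collections import defaultdict

def assign_ethics_reviewers(papers, reviewers_by_area, k=2):
    """Greedy sketch: give each flagged paper k Ethics Reviewers with
    matching expertise, always picking the currently least-loaded ones.
    (Illustrative only; not the actual NeurIPS assignment algorithm.)"""
    load = defaultdict(int)
    assignment = {}
    for paper_id, areas in papers.items():
        # Candidates: reviewers with expertise in any of the flagged areas.
        candidates = {r for a in areas for r in reviewers_by_area.get(a, [])}
        # Break ties by name so the result is deterministic.
        chosen = sorted(candidates, key=lambda r: (load[r], r))[:k]
        for r in chosen:
            load[r] += 1
        assignment[paper_id] = chosen
    return assignment, load

# Hypothetical toy data.
reviewers_by_area = {
    "privacy": ["alice", "bob"],
    "fairness": ["bob", "carol"],
}
papers = {"p1": ["privacy"], "p2": ["privacy", "fairness"], "p3": ["fairness"]}
assignment, load = assign_ethics_reviewers(papers, reviewers_by_area)
print(assignment)   # each paper gets two reviewers; max load stays at 2
```

A production version would more likely solve this as an integer program or min-cost matching, but the greedy version conveys the load-balancing idea.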
Since Ethics Reviewers could not be assigned until initial reviews were submitted (and therefore papers flagged), the ethics review period coincided with the initial author response period. During this period, authors could see if their paper had been flagged or not, but ethics reviews only became available later during the rolling discussion period. Once available, authors, other Reviewers, and ACs were able to respond to the ethics reviews and engage in discussion with the assigned Ethics Reviewers.
In most cases, the issues raised in ethics reviews were not significant enough to prevent publication. In a small number of cases in which more serious issues were raised, papers were escalated to the Ethics Review Chairs and Program Chairs to deliberate and make the final decision on the paper, taking into consideration feedback from all parties as well as any responses from the authors. This resulted in a small number of papers being conditionally accepted and one paper being rejected on ethical grounds, as discussed below.
Ethical Issues Identified
The ethics review process brought to the surface a variety of ethical issues that regularly appear in research papers submitted to NeurIPS. The most common types of issues encountered at NeurIPS involve:
A lack of sufficient reflection around topics that involve thorny ethical considerations. For example:
Generative models that could be used to produce realistic fake content for misinformation campaigns and may exhibit bias in their generated content. These include text generation and image generation (notably including face generation) applications, as well as voice conversion.
Biometric surveillance projects that raise privacy concerns and could be used in sensitive contexts such as criminal justice. These include facial recognition and voice recognition projects.
The continued use of deprecated datasets that had already been explicitly removed from circulation by their authors. Such datasets include DukeMTMC, MS-Celeb-1M, and Tiny Images, whose use has been discouraged by their authors for ethical reasons.
Inappropriate communication and publication practices around identified security vulnerabilities. For example, adversarial attacks on publicly deployed systems that were not adequately disclosed to the affected organization before publication.
A lack of transparency on model or data details and decision-making, as it relates to ethical concerns. For example:
Failing to discuss potential biases arising from the use of a method or dataset, and a lack of documentation or acknowledgement of such issues, often manifesting in a lack of diverse examples featured in the paper or in evaluating performance only on a homogeneous demographic.
Not providing adequate details on data provenance or distribution.
A lack of communication of the details of annotator work conditions.
Issues appropriately handling or sourcing data involving humans. For example:
Collecting information about individuals with no concern for privacy and consent.
Violating copyright restrictions.
Indications of mistreatment of MTurk workers, or annotator and data collection practices that seem exploitative.
Failing to obtain Institutional Review Board (IRB) approval in situations clearly involving human subjects.
Uncritically emphasizing explicitly harmful applications, such as police profiling.
In many cases in which issues were identified, Ethics Reviewers simply recommended that authors reflect on the issues and include a discussion of them in the paper, either by expanding the discussion of potential negative societal impacts or being more explicit about limitations of the work. In other cases, Ethics Reviewers recommended more substantial modifications to the work, such as running additional experiments, the use of a different dataset, data/code distribution restrictions, or increased transparency measures like the inclusion of model or dataset documentation.
In some cases, the concerns raised were so critical that the acceptance of the paper was made conditional on the authors implementing the suggested mitigations. All such cases were discussed by the Program Chairs and Ethics Review Chairs, and the Ethics Reviewers were consulted in determining conditions for acceptance. Of eight papers conditionally accepted for ethical reasons, all were eventually accepted.
In a single case, the Program Chairs and Ethics Review Chairs jointly determined that the required mitigations would be so challenging to execute that they were beyond the scope of what the authors could realistically accomplish within the time frame for the camera-ready. In this case, the Program Chairs made the call to reject the paper on ethical grounds.
It should be noted that Ethics Reviewers were not always in agreement with each other. For 61% of submissions reviewed by two Ethics Reviewers, at least one Ethics Reviewer checked the box in their review form to indicate the paper had no ethical issue; in 42% of these cases, the Ethics Reviewers were split, with one saying there was an issue and the other saying there was not. Additionally, for 82% of submissions reviewed by two Ethics Reviewers, at least one Ethics Reviewer checked the box to indicate that the authors had not acknowledged the issue; in 43% of these cases, the other Ethics Reviewer indicated that the issue had been adequately acknowledged by the authors. We should not expect perfect agreement among Ethics Reviewers, but it is worth considering whether better guidance on what constitutes an ethical issue and on how to appropriately address one could be helpful.
Challenges Surfacing the Right Papers for Review
As implemented, the success of the ethics review process hinges on Reviewers and ACs appropriately flagging papers for ethics review. The biggest challenge that we faced—and one area that we as a community will need to work hard to improve if we want ethics review to be a success—is that there was a lot of uncertainty around which papers to flag, leading to inconsistency in which papers received ethics reviews.
97% of papers flagged for ethics review were flagged by only a single Reviewer; across 9122 submissions, only 8 papers were flagged by more than one Reviewer. Considering the 882 papers that were part of the broader consistency experiment (and therefore assigned to two independent committees for review), there were 23 papers in which the original copy was flagged and 22 papers for which the duplicate was flagged, but the overlap between these two sets was only 3 papers. (Another blog post containing the full results of the consistency experiment is coming soon!)
Still, there were some notable differences between papers that were flagged and those that were not. 29% of all flagged submissions were withdrawn compared with 20% of submissions overall. 16% of flagged papers were ultimately accepted compared with 25.6% of papers overall. And these differences are more stark than they appear since the 25.6% acceptance rate includes papers that were desk rejected for formatting violations or withdrawn before they received reviews.
Some of this inconsistency was due to “false positives”—papers that did not actually have issues of concern to Ethics Reviewers, but that were erroneously flagged anyways. As mentioned above, for 61% of flagged submissions with two ethics reviews, at least one Ethics Reviewer checked the box in their review form to indicate there was no ethical issue, with both Ethics Reviewers checking the box for 58% of these. False positives often involved:
Papers that Reviewers didn’t like. For example, some Reviewers flagged papers because the results were poorly presented.
Plagiarism and other serious Code of Conduct violations that were out of scope for the ethics review process and should instead have been escalated to the Program Chairs. We note that plagiarism was erroneously included as an example in the list of ethical issues that could be checked, which was likely the cause of this problem and easy to fix in future years.
In addition to this, there were “false negatives”—papers that were not flagged, even though they should have been. These are difficult to quantify since we don’t know what we missed. However, some false negatives were later surfaced through other means. These included:
Papers that made use of deprecated datasets that had been retracted for ethical reasons. These cases were surfaced by running a search of submissions for mentions of common deprecated datasets late in the review process.
Papers on biometric data generation (e.g., generating face or voice) or surveillance. These cases were again surfaced by running a keyword search late in the review process.
We recommend that in future years, NeurIPS should provide more extensive guidance and training for Reviewers on how to identify which papers to flag. We also recommend that future Program Chairs implement a systematic search for papers with common ethics issues so that these papers can be automatically included for ethics review without the need to rely on Reviewers to flag them.
Highlights and Lessons Learned
Overall, we consider the following to be highlights of the process this year:
We were able to recruit over 100 qualified Ethics Reviewers with diverse areas of expertise, including many from disciplines outside of machine learning. Ethics reviewers took their role seriously and their engagement led to high-quality ethics reviews.
The scale of the operation this year gave us confidence that ethics review can accommodate the growing number of flagged cases at a conference of this size. Of the 265 papers flagged, 250 received at least one ethics review: 202 received two ethics reviews and 48 received exactly one. This means that in total there were at least 452 submitted ethics reviews. In reality, even more reviews were submitted due to additional reviews completed for papers flagged during the discussion period and as part of the Datasets and Benchmarks track.
Many Reviewers, ACs, and authors engaged constructively with Ethics Reviewers. Some authors indicated that they benefited from the feedback in their ethics reviews and made significant improvements to their papers as a result. Of the 452 submitted ethics reviews, 140 (31%) had responses from the authors. All eight papers that were conditionally accepted due to ethical issues were ultimately accepted.
As this process is still relatively new, there were also lessons learned, which suggest improvements to the process for future years:
As described above, there were challenges in surfacing the right set of papers to undergo ethics review, with many false positives and false negatives. This culminated in the Program Chairs running keyword searches over submissions late in the review process to identify papers with ethical issues that had been overlooked. We expect that reconsidering the set of ethical areas listed in the review form and providing better guidance on which papers to flag would help. We would additionally encourage future organizers to plan for more systematic automated flagging of papers early in the review process.
While Ethics Reviewers were not required to read the full details of the papers they were assigned, in practice it was difficult for them to assess whether or not the authors had appropriately reflected on ethical issues without reading the whole paper, which was burdensome given the short period of time allotted for ethics reviews. The pointers that authors included as part of the NeurIPS Paper Checklist were not enough. This is difficult to address; having designated sections on potential negative societal impact and limitations makes this easier, but to catch all potential issues, a full read-through may still be necessary.
There was a fair amount of disagreement among Ethics Reviewers about whether flagged papers had ethical issues and whether these issues were addressed, as discussed above. We should not expect perfect agreement among Ethics Reviewers, but better guidance on what constitutes an ethical issue may be valuable here too.
Since the review process for the new NeurIPS Datasets and Benchmarks track was entirely independent of the review process for the main track, this track was initially omitted from the ethics review process and only incorporated after the review process was underway; 10 papers from that track were then flagged for ethics review and one was rejected in part for ethical concerns. Since many ethical issues are related to data, this track should be included in the ethics review process going forward.
The ethics review process is still quite new, and both community norms and official conference guidelines are still evolving. We are grateful for the opportunity to contribute to this evolution as we all work to ensure that this community operates in the best interests of those impacted by our research. To learn more about how to incorporate ethical practices in your own research, attend the plenary panel on this topic Friday, December 10 at 11pm UTC (3pm PST).