A Retrospective on the NeurIPS 2021 Ethics Review Process
Samy Bengio and Inioluwa Deborah Raji
NeurIPS 2021 Ethics Review Chairs
Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan
NeurIPS 2021 Program Chairs
As the impact of machine learning research grows, so does the risk that this research will lead to harmful outcomes. Several machine learning conferences—including ACL, CVPR, ICLR, and EMNLP—have taken steps to establish ethical expectations for the research community, including introducing ethics review processes. Last year, NeurIPS piloted its own ethics review process, chaired by Iason Gabriel. This year, we aimed to expand the ethics review process and ensure that it is in line with ongoing efforts to establish NeurIPS Ethics Guidelines that have been spearheaded by Marc’Aurelio Ranzato as General Chair.
At this early stage of adoption, ethics reviews are meant to be educational, not prohibitive. Our goal is not to police submissions, but instead to prompt reflection. The process that we implemented was intended to support this goal.
In some ways, the process has been a success. We were able to recruit qualified Ethics Reviewers with diverse areas of expertise; many Reviewers, ACs, and authors engaged constructively with these Ethics Reviewers; and authors improved their papers based on the feedback they received. However, there are several ongoing challenges with this new process, including how to surface the right set of papers to undergo ethics review; how to fit the ethics review process into the overall paper review timeline without overburdening Ethics Reviewers; and how to set clear expectations so that Ethics Reviewers are better aligned on what constitutes an ethical issue and on when an ethical issue has been properly addressed. In this blog post, we discuss the ethics review process and share some of our lessons learned and recommendations going forward.
Overview of the Ethics Review Process
We view ethics reviews as a way of obtaining an additional expert perspective to inform paper decisions and provide feedback to authors. To implement this, we allowed Reviewers and Area Chairs (ACs) to flag papers for ethics review, just as ACs sometimes solicit external expert perspectives on technical issues. When a paper was flagged, Ethics Reviewers were added to the committee for the paper, included in discussions about the paper, and given the opportunity to interact with the authors during the rolling discussion period, just like other members of the committee. Ethics Reviewers were encouraged to provide constructive criticism that could lead to project improvement and maturity, similar to standard paper reviews.
Because this process was intended to be educational first and foremost, Ethics Reviewers were assigned to every flagged paper, regardless of how likely it was that the paper would be accepted on technical grounds. This was a change from last year and required us to recruit a larger pool of Ethics Reviewers. However, we believe that all authors can benefit from feedback on the ethical aspects of their research and from the opportunity to integrate this feedback into future iterations of the work. While soliciting ethics reviews only for papers likely to be accepted would have reduced the burden on Ethics Reviewers, it would have counteracted our goal of making this a constructive process for everyone in the community, rather than merely a filter to catch unethical content prior to publication.
Before the process began, we recruited 105 Ethics Reviewers with a wide range of disciplinary backgrounds. Ethics considerations in machine learning research are quite diverse—ranging from standard research ethics issues, like obtaining appropriate consent from human subjects, all the way to substantially thornier issues concerning the downstream negative societal impact of the work—so we understood the importance of working with a range of experts and allowing them to weigh in on issues aligned with their expertise. The breakdown of expertise is in the table below; some Ethics Reviewers had expertise in more than one area.
| Area of concern | Number of Ethics Reviewers with this expertise | Number of papers flagged with issues |
| --- | --- | --- |
| Discrimination / Bias / Fairness Concerns | 92 | 34 |
| Inadequate Data and Algorithm Evaluation | 43 | 22 |
| Inappropriate Potential Applications & Impact (e.g., human rights concerns) | 47 | 52 |
| Legal Compliance (e.g., GDPR, copyright, terms of use) | 13 | 28 |
| Privacy and Security (e.g., consent) | 34 | 51 |
| Responsible Research Practice (e.g., IRB, documentation, research ethics) | 45 | 30 |
| Research Integrity Issues (e.g., plagiarism) | 24 | 47 |
During the review process, Reviewers had the chance to flag papers for ethics review by checking a box on the review form. For flagged papers, Reviewers could specify the areas of expertise required to properly assess the paper from the list in the table. (We note that the specific list of areas in the table was created based on common issues that arose last year and our own expectations about the types of ethics issues that might be flagged. However, it is not perfect, as we discuss more below.) In total, Reviewers flagged 265 papers out of 9122 submissions. The breakdown of issues flagged is in the last column of the table; note that some papers were flagged for more than one area of concern.
We designed an algorithm to assign papers to Ethics Reviewers with the right expertise while minimizing the maximum number of papers any Ethics Reviewer was assigned. Each flagged paper was reviewed by two Ethics Reviewers, and each Ethics Reviewer had on average 4–5 papers to review.
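To give a sense of the underlying idea, the assignment is essentially a load-balanced matching between flagged papers and qualified reviewers. The sketch below is only an illustration under our own assumptions (a greedy heuristic, toy input formats, and hypothetical IDs); it is not the exact algorithm we used.

```python
from collections import defaultdict

def assign_ethics_reviewers(papers, reviewers, reviews_per_paper=2):
    """Greedy sketch: give each flagged paper the required number of
    qualified Ethics Reviewers, always picking the currently least-loaded
    qualified reviewer so that the maximum load stays small.

    papers:    dict mapping paper_id -> set of flagged issue areas
    reviewers: dict mapping reviewer_id -> set of expertise areas
    """
    load = defaultdict(int)   # number of papers assigned to each reviewer so far
    assignment = {}           # paper_id -> list of reviewer_ids

    def qualified(pid):
        # Reviewers whose expertise overlaps the paper's flagged areas.
        return [r for r, areas in reviewers.items() if papers[pid] & areas]

    # Handle papers with the fewest qualified reviewers first, so that
    # scarce expertise is not used up by easier-to-match papers.
    for pid in sorted(papers, key=lambda p: len(qualified(p))):
        candidates = sorted(qualified(pid), key=lambda r: load[r])
        assignment[pid] = candidates[:reviews_per_paper]
        for rid in assignment[pid]:
            load[rid] += 1

    return assignment, max(load.values(), default=0)

# Toy example with hypothetical IDs and areas:
papers = {"p1": {"Privacy"}, "p2": {"Fairness", "Privacy"}}
reviewers = {"r1": {"Privacy"}, "r2": {"Fairness"}, "r3": {"Privacy", "Fairness"}}
print(assign_ethics_reviewers(papers, reviewers))
```

A greedy heuristic like this does not guarantee the smallest possible maximum load; an exact formulation (e.g., an integer program or min-cost flow) could minimize it directly.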
Since Ethics Reviewers could not be assigned until initial reviews were submitted (and therefore papers flagged), the ethics review period coincided with the initial author response period. During this period, authors could see if their paper had been flagged or not, but ethics reviews only became available later during the rolling discussion period. Once available, authors, other Reviewers, and ACs were able to respond to the ethics reviews and engage in discussion with the assigned Ethics Reviewers.
In most cases, the issues raised in ethics reviews were not significant enough to prevent publication. In a small number of cases in which more serious issues were raised, papers were escalated to the Ethics Review Chairs and Program Chairs to deliberate and make the final decision on the paper, taking into consideration feedback from all parties as well as any responses from the authors. This resulted in a small number of papers being conditionally accepted and one paper being rejected on ethical grounds, as discussed below.
Ethical Issues Identified
The ethics review process brought to the surface a variety of ethical issues that regularly appear in research papers submitted to NeurIPS. The most common types of issues encountered at NeurIPS involve:
- A lack of sufficient reflection around topics that involve thorny ethical considerations. For example:
- Generative models that could be used to generate realistic fake content for misinformation campaigns and that may exhibit bias in their generated content. These include text generation and image generation (notably face generation) applications, as well as voice conversion.
- Biometric surveillance projects that raise privacy concerns and could be used in sensitive contexts such as criminal justice. These include facial recognition and voice recognition projects.
- The continued use of deprecated datasets that had been explicitly removed from circulation by their authors for ethical reasons, such as DukeMTMC, MS-Celeb-1M, and Tiny Images.
- Inappropriate communication and release practices around identified security vulnerabilities. For example, adversarial attacks on publicly deployed systems that were not adequately disclosed to the affected organization before publication.
- A lack of transparency on model or data details and decision-making, as it relates to ethical concerns. For example:
- Failing to discuss potential biases arising from the use of a method or dataset, and a lack of documentation or acknowledgement of such issues, often manifesting as a lack of diverse examples in the paper or evaluation only on a homogeneous demographic.
- Not providing adequate details on data provenance or distribution.
- A lack of detail about annotator working conditions.
- Issues with appropriately handling or sourcing data involving humans. For example:
- Collecting information about individuals with no concern for privacy and consent.
- Violating copyright restrictions.
- Indications of mistreatment of Mechanical Turk workers, or annotation and data collection practices that appeared exploitative.
- Failing to put the project through Institutional Review Board (IRB) review in situations clearly involving human subjects.
- Uncritically emphasizing explicitly harmful applications, such as police profiling.
In many cases in which issues were identified, Ethics Reviewers simply recommended that authors reflect on the issues and include a discussion of them in the paper, either by expanding the discussion of potential negative societal impacts or by being more explicit about the limitations of the work. In other cases, Ethics Reviewers recommended more substantial modifications to the work, such as running additional experiments, using a different dataset, restricting data or code distribution, or adding transparency measures like model or dataset documentation.
In some cases, the concerns raised were so critical that the acceptance of the paper was made conditional on the authors implementing the suggested mitigations. All such cases were discussed by the Program Chairs and Ethics Review Chairs, and the Ethics Reviewers were consulted in determining conditions for acceptance. Of eight papers conditionally accepted for ethical reasons, all were eventually accepted.
In a single case, the Program Chairs and Ethics Review Chairs jointly determined that the required mitigations would be so challenging to execute that they were beyond the scope of what the authors could realistically accomplish within the time frame for the camera-ready. In this case, the Program Chairs made the call to reject the paper on ethical grounds.
It should be noted that Ethics Reviewers were not always in agreement with each other. For 61% of submissions reviewed by two Ethics Reviewers, at least one Ethics Reviewer checked the box in their review form to indicate the paper had no ethical issue; in 42% of these cases, the Ethics Reviewers were split, with one saying there was an issue and the other saying there was not. Additionally, for 82% of submissions reviewed by two Ethics Reviewers, at least one Ethics Reviewer checked the box to indicate that the authors had not acknowledged the issue; in 43% of these cases, the other Ethics Reviewer indicated that the issue had been adequately acknowledged by the authors. We should not expect perfect agreement among Ethics Reviewers, but it is worth considering whether better guidance on what constitutes an ethical issue and how to appropriately address one would be helpful.
Challenges Surfacing the Right Papers for Review
As implemented, the success of the ethics review process hinges on Reviewers and ACs appropriately flagging papers for ethics review. The biggest challenge that we faced—and one area that we as a community will need to work hard to improve if we want ethics review to be a success—is that there was a lot of uncertainty around which papers to flag, leading to inconsistency in which papers received ethics reviews.
97% of papers flagged for ethics review were flagged by only a single Reviewer; across 9122 submissions, only 8 papers were flagged by more than one Reviewer. Considering the 882 papers that were part of the broader consistency experiment (and therefore assigned to two independent committees for review), there were 23 papers for which the original copy was flagged and 22 papers for which the duplicate was flagged, but the overlap between these two sets was only 3 papers. (Another blog post containing the full results of the consistency experiment is coming soon!)
Still, there were some notable differences between papers that were flagged and those that were not. 29% of all flagged submissions were withdrawn compared with 20% of submissions overall. 16% of flagged papers were ultimately accepted compared with 25.6% of papers overall. And these differences are more stark than they appear since the 25.6% acceptance rate includes papers that were desk rejected for formatting violations or withdrawn before they received reviews.
Some of this inconsistency was due to “false positives”—papers that did not actually have issues of concern to Ethics Reviewers, but that were erroneously flagged anyway. As mentioned above, for 61% of flagged submissions with two ethics reviews, at least one Ethics Reviewer checked the box in their review form to indicate there was no ethical issue, with both Ethics Reviewers checking the box for 58% of these. False positives often involved:
- Papers that Reviewers didn’t like. For example, some Reviewers flagged papers because the results were poorly presented.
- Plagiarism and other serious Code of Conduct violations that were out of scope for the ethics review process and should instead have been escalated to the Program Chairs. We note that plagiarism was erroneously included as an example in the list of ethical issues that could be checked, which likely caused this problem and will be easy to fix in future years.
In addition to this, there were “false negatives”—papers that were not flagged, even though they should have been. These are difficult to quantify since we don’t know what we missed. However, some false negatives were later surfaced through other means. These included:
- Papers that made use of deprecated datasets that had been retracted for ethical reasons. These cases were surfaced by running a search of submissions for mentions of common deprecated datasets late in the review process.
- Papers on biometric data generation (e.g., generating faces or voices) or surveillance. These cases were again surfaced by running a keyword search late in the review process (a sketch of such a scan appears after this list).
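Both of these late-stage searches were simple keyword scans over the submission texts. The snippet below is a minimal sketch of such a scan; the keyword lists, file layout, and helper function are illustrative assumptions, not the actual tooling we used.

```python
import re
from pathlib import Path

# Hypothetical keyword lists; the actual search terms were chosen by the
# Program Chairs and are not reproduced here.
DEPRECATED_DATASETS = ["DukeMTMC", "MS-Celeb-1M", "Tiny Images"]
BIOMETRIC_TERMS = ["face generation", "face recognition", "voice conversion",
                   "speaker verification", "surveillance"]

def flag_submissions(text_dir, keywords):
    """Return (filename, matched keywords) for every submission whose
    pre-extracted plain text mentions at least one keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for path in Path(text_dir).glob("*.txt"):   # assumes text extracted from PDFs
        matches = set(pattern.findall(path.read_text(errors="ignore")))
        if matches:
            hits.append((path.name, sorted(matches)))
    return hits

# Example usage (hypothetical directory of extracted submission text):
# flag_submissions("submissions_text/", DEPRECATED_DATASETS + BIOMETRIC_TERMS)
```

Even a crude scan like this, run early rather than late in the review process, would let papers that mention known problem datasets or sensitive applications be routed to Ethics Reviewers automatically.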
We recommend that in future years, NeurIPS should provide more extensive guidance and training for Reviewers on how to identify which papers to flag. We also recommend that future Program Chairs implement a systematic search for papers with common ethics issues so that these papers can be automatically included for ethics review without the need to rely on Reviewers to flag them.
Highlights and Lessons Learned
Overall, we consider the following to be highlights of the process this year:
- We were able to recruit over 100 qualified Ethics Reviewers with diverse areas of expertise, including many from disciplines outside of machine learning. Ethics Reviewers took their role seriously, and their engagement led to high-quality ethics reviews.
- The scale of the operation this year gave us confidence that ethics review can be expanded to accommodate the growing number of flagged papers at a conference of this size. Of the 264 papers flagged, 250 received at least one ethics review: 202 received two ethics reviews and 48 received exactly one, for a total of at least 2 × 202 + 48 = 452 submitted ethics reviews. In reality, even more reviews were submitted due to additional reviews completed for papers flagged during the discussion period and as part of the Datasets and Benchmarks track.
- Many Reviewers, ACs, and authors engaged constructively with Ethics Reviewers. Some authors indicated that they benefited from the feedback in their ethics reviews and made significant improvements to their papers as a result. Of the 452 submitted ethics reviews, 140 (31%) received responses from the authors. All eight papers that were conditionally accepted due to ethical issues were ultimately accepted.
As this process is still relatively new, there were also lessons learned, which suggest improvements to the process for future years:
- As described above, there were challenges in surfacing the right set of papers to undergo ethics review, with many false positives and false negatives. This culminated in the Program Chairs running keyword searches over submissions late in the review process to identify papers with ethical issues that had been overlooked. We expect that reconsidering the set of ethical areas listed in the review form and providing better guidance on which papers to flag would help. We also encourage future organizers to plan for more systematic, automated flagging of papers early in the review process.
- While Ethics Reviewers were not required to read the full details of the papers they were assigned, in practice it was difficult for them to assess whether or not the authors had appropriately reflected on ethical issues without reading the whole paper, which was burdensome given the short period of time allotted for ethics reviews. The pointers that authors included as part of the NeurIPS Paper Checklist were not enough. This is difficult to address; having designated sections on potential negative societal impact and limitations makes this easier, but to catch all potential issues, a full read-through may still be necessary.
- There was a fair amount of disagreement among Ethics Reviewers about whether flagged papers had ethical issues and whether these issues were addressed, as discussed above. We should not expect perfect agreement among Ethics Reviewers, but better guidance on what constitutes an ethical issue may be valuable here too.
- Since the review process for the new NeurIPS Datasets and Benchmarks track was entirely independent of the review process for the main track, this track was initially omitted from the ethics review process and only incorporated after the review process was underway; 10 papers from that track were then flagged for ethics review, and one was rejected in part due to ethical concerns. Since many ethical issues are related to data, this track should be included in the ethics review process going forward.
The ethics review process is still quite new, and both community norms and official conference guidelines are still evolving. We are grateful for the opportunity to contribute to this evolution as we all work to ensure that this community operates in the best interests of those impacted by our research. To learn more about how to incorporate ethical practices in your own research, attend the plenary panel on this topic Friday, December 10 at 11pm UTC (3pm PST).