Welcome to the April edition of the NeurIPS monthly Newsletter!
The NeurIPS Newsletter aims to provide an easy way to keep up to date with NeurIPS events and planning progress, respond to requests for feedback and participation, and find information about new initiatives. This edition focuses on NeurIPS 2025, which will be held in San Diego from Tuesday, Dec 2 to Sunday, Dec 7, 2025.
You are receiving this newsletter as per your subscription preferences in your NeurIPS profile. As you prepare to attend NeurIPS, we hope that you will find the following information valuable. To unsubscribe from the NeurIPS newsletter, unselect the “Subscribe to Newsletter” checkbox in your profile: https://neurips.cc/Profile/subscribe. To update your email preferences, visit: https://neurips.cc/FAQ/EmailPreferences
Newsletter includes:
Call for Papers
Call for Papers for Datasets and Benchmarks
Call for Papers for Position Track
Call for Workshop Proposals
Call for Competitions Extended
Childcare Registration
Call for Papers
NeurIPS 2025 is now accepting submissions as of April 3, 2025. Abstracts are due by May 11, 2025, with full papers due on May 15, 2025. All authors must have an OpenReview profile when submitting. For more information, please read our call for papers here: https://neurips.cc/Conferences/2025/CallForPapers
Call for Papers for Datasets and Benchmarks
The NeurIPS Datasets and Benchmarks track is now accepting submissions. As with the main track, abstracts are due by May 11, 2025, with full papers due on May 15, 2025. This year, the Datasets and Benchmarks track includes changes to the submission and review guidelines that make them better suited for datasets and benchmarks. Authors are advised to carefully read the call for papers, as well as our companion blog post, which outlines the changes: https://blog.neurips.cc/2025/03/10/neurips-datasets-benchmarks-raising-the-bar-for-dataset-submissions/. Read our call for papers here: https://neurips.cc/Conferences/2025/CallForDatasetsBenchmarks
Call for Papers for Position Track
This year at NeurIPS, we are initiating a Position Paper Track, following the ICML 2024 and 2025 Position Paper Tracks. This is a time of accelerated development of methods designed in the NeurIPS community and of their deployment in high-stakes settings. Few people in the world have not, in the past year, interacted with AI in some way. We invite positions and discussions to ensure our community is fully aware of, and actively debating, the ways that our methods may impact the broader scientific community and the world.
Call for Workshop Proposals
NeurIPS 2025 is now soliciting workshop proposals. Workshops are one-day events on December 6 and 7, following the main conference program, and are intended to provide an informal, dynamic venue for discussion of work in progress and future directions. Workshop applications are due by May 30, 2025. For more information, read our call for proposals here: https://neurips.cc/Conferences/2025/CallForWorkshops
In addition, recognizing the rapid growth and interest in NeurIPS workshops, we have prepared guidelines for proposals, which can be found on our blog at: https://blog.neurips.cc/2025/04/12/guidance-for-neurips-workshop-proposals-2025/. Organizers of workshop proposals should take care to respect all guidance in this document, and to provide explicit answers to the questions throughout.
Call for Competitions Extended
The Call for Competitions for NeurIPS 2025 has been extended from April 13, 2025 to April 20, 2025. Notifications will now go out on May 18, 2025. Read our call here: https://neurips.cc/Conferences/2025/CallForCompetitions
Childcare Registration
NeurIPS is proud to offer free childcare for children between 4 months and 13 years old for our registered attendees, on a first-come, first-served basis. The deadline to make a reservation is November 4, 2025 at 5 PM PST. Information on how to reserve childcare can be found at: https://neurips.cc/Conferences/2025/Children
Minors between 14-17 years old can attend the conference with a guardian, as long as they remain with the guardian at all times in the venue, and only attend events where alcohol is not served. Guardians will need to sign a waiver when they pick up their badge.
Danielle Belgrave, Cheng Zhang, Alex Lu, Jean Kossaifi, and Mengye Ren
NeurIPS 2025 General Chairs and Communication Chairs
Authors: Manuel Gomez Rodriguez, Pascale Fung, Theodore Papamarkou
With the rapid growth of and interest in NeurIPS and its associated workshops, competition for workshop slots has increased, alongside logistical constraints. To facilitate the process, the workshop chairs have agreed on the following guidelines for proposals to hold NeurIPS workshops in 2025. This document highlights the requirements and expectations for proposals. Organizers of workshop proposals should take care to respect all guidance provided here, and each requirement should be explicitly addressed in the proposal.
Important Dates
Workshop Application Open: April 14, 2025
Workshop Application Deadline: May 30, 2025, AoE
Workshop Acceptance Notification: July 4, 2025, AoE
Suggested Submission Date for Workshop Contributions: August 22, 2025, AoE
Accept/Reject Notification Date: September 22, 2025, AoE
Workshop Date: December 6 or December 7, 2025.
If a workshop does not meet the above accept/reject notification deadline, its complimentary tickets will be withheld. Moreover, if the workshop organizers are unresponsive after this date, the workshop may be canceled.
Workshop Format
NeurIPS 2025 workshops will be one-day, in-person events spanning 7 to 9 hours. At most 1 hour of remote presentation will be allowed, and only in case of unforeseen emergencies.
Workshop Goals
Workshops provide an informal, dynamic venue for discussion of work in progress and future directions. Good workshops have helped to crystallize common problems, explicitly contrast competing frameworks, and clarify essential questions for a subfield or application area.
Workshops are a structured means of bringing together people with common interests to form communities. We expect the workshops to include some form of community building and stand apart from other parts of the NeurIPS program such as Tutorials or Competitions.
Selection Criteria
Importance of the topic and its relevance for the community: is the workshop focused on a clear and topical problem, and will the community find it interesting, exciting and useful?
The degree to which the proposed program offers an opportunity for discussion among participants and for community building.
Diversity and inclusion in all forms.
Invited speakers. Workshop organizers are encouraged to confirm tentative interest from proposed invited speakers and mention this in their proposal. Speakers are expected to be present at the workshop in person to give their talk unless there are exceptional circumstances.
Organizational experience, potential, and ability of the team.
Points of difference. What makes this workshop enticingly different from the hundreds of NeurIPS workshops held previously?
Details of logistics for the workshop. The proposal should clearly lay out the logistics for the workshop, both prior to the conference (calls for papers, confirmation of the speakers) and during the conference (schedule and organization during the day).
Workshop Proposal Format
Submissions for workshop organization should be no more than three pages of proposal, plus no more than two pages of organizer information, and unlimited references. The reviewers are not obligated to read anything beyond those. To simplify the preparation of the proposal, we strongly recommend utilizing this template as a foundation for every proposal.
The three pages (or fewer) for the main proposal must include:
A title and a brief description of the workshop topic and content.
A list of invited speakers, if applicable, with an indication of which ones have already agreed and which ones are tentative.
An account of the efforts made to ensure demographic diversity of the organizers and speakers, as well as of any efforts to include diverse participants (e.g., via mentoring, subsidies, or the wording and topics in the workshop’s call for papers).
An estimate of the number of attendees.
A description of special requirements and technical needs.
If the workshop has been held before, a note specifying how many submissions the workshop received, how many papers were accepted (extended abstract/long format), and how many attendees the workshop attracted.
Optionally, a URL for the workshop website.
The two pages (or fewer) for information about organizers must include:
The names, affiliations, and email addresses of the organizers, with one-paragraph statements of their research interests, areas of expertise, and experience in organizing workshops and related events. Please highlight how the organizers’ profiles can make the proposed workshop successful. Please also indicate what other workshops (if any) are concurrently being proposed by an organizer.
A list of Program Committee members, with an indication of which members have already agreed. Organizers should do their best to estimate the number of submissions (especially for recurring workshops) in order to (a) recruit enough reviewers so that each paper receives 3 reviews, and (b) ensure that no one is committed to reviewing more than 3 papers. This practice is likely to yield on-time, more comprehensive, and more thoughtful reviews.
Assessment Process
The workshop chairs will appoint a number of reviewers who will provide written assessments of the proposals against the criteria listed above. Reviewers’ reports will be considered by the workshop chairs who will jointly decide upon the selected workshops (subject to the notes on COIs listed below). The final decisions will be made by the workshop chairs via consensus and judgment; we will not simply add up scores assigned to all the criteria.
Hard Constraints/Workshop Requirements
Mandatory Accept/Reject Notification Deadline Before September 22, 2025: By submitting a workshop proposal, workshop organizers commit to notifying those who submit contributions (including talks and posters) to the workshop of their acceptance status before September 22, 2025. A timeline should be included in the proposal that will allow for this. This deadline of September 22 will be published on the NeurIPS main web page and cannot be extended under any circumstances.
Use CMT or OpenReview for Contributed Work
Workshops that accept contributions must use either CMT or OpenReview to manage their submission process. This ensures that accepted submissions can be efficiently uploaded to the NeurIPS.cc site so that the workshop schedule is announced clearly to NeurIPS attendees.
Managing Chair and Reviewer Conflicts of Interest
Workshop chairs and assistant chairs cannot be organizers or give invited talks at any workshop. However, they can submit papers and give contributed talks.
Workshop reviewers cannot review any proposal on which they are listed as an organizer or invited speaker, or with which they have conflicts of interest as defined by the NeurIPS Conflict of Interest policy. Moreover, they cannot accept invitations to speak at any workshop they have reviewed, once that workshop is accepted.
Workshop chairs and reviewers cannot review or shape acceptance decisions about workshops with organizers from within their organization. (For large corporations, this means anyone in the corporation worldwide).
Managing Organizer Conflicts of Interest
Workshop organizers cannot give talks at the workshops they organize. They can give a brief introduction to the workshop and/or act as a panel moderator.
Workshop organizers should state in their proposals how they will manage conflicts of interest in assessing submitted contributions. At a minimum, an organizer should not be involved in the assessment of a submission from someone within the same organization.
Other Guidance and Expectations for Workshop Proposals
We encourage, and expect, diversity in the organizing team and speakers. This includes diversity of viewpoints and thinking regarding the topics discussed at the workshop, as well as diversity of gender, race, affiliation, seniority, etc. If a workshop is part of a series, the organizer list should include people who have not organized it in the past. Organizers should articulate in their proposal how they have addressed diversity in each of these senses. Invited speakers should not give the same talk, or very similar talks, at multiple workshops.
Since the goal of the workshop is to generate discussion, sufficient time and structure need to be included in the program for this. Proposals should explicitly articulate how they will encourage broad discussion.
Workshop proposals should list explicitly the problems they would like to see solved or at least advances made as part of their workshop. They should explain why these are important problems and how holding their proposed workshop will contribute to their solution.
Workshops are not a venue for work that has been previously published in other conferences on machine learning or related fields. Work that is presented at the main NeurIPS conference should not appear in a workshop, including as part of an invited talk. Organizers should make this clear in their calls and explain in their proposal how they will discourage the presentation of already finalized machine learning work.
We encourage workshop submissions of varying lengths and scopes. Organizers should state whether their workshop is meant to be a large-attendance talk format or small-group presentations. Organizers should articulate what they hope to achieve through the proposed format beyond the talks listed.
Workshops should have a clear and well-communicated agenda or schedule that outlines the topics and speakers, so that attendees can choose which talks or sessions they want to attend based on the content being presented. Good workshops publish talk titles publicly ahead of time and note the archival status of their submissions. Organizers should articulate how they will do this.
Organizing a workshop is a complex task, and proposals should outline the organizational experience and skills of the proposed organizers (as a team). We encourage junior researchers to be involved in workshop organization but prefer some collective experience in organizing a complex event.
Frequently Asked Questions From Past Workshops
Workshop Series
Although we ask for statistics and information if the workshop has been held before, we neither encourage nor discourage workshops on topics that have appeared before. Membership in an existing sequence of workshops is irrelevant to the assessment of a workshop proposal (it neither helps nor hinders). Workshop proposals will be evaluated solely on their merits for this year’s conference. That said, proposals from a workshop series are encouraged to differentiate themselves from past editions with some freshness (content, organizers, etc.) to help the reviewers evaluate their merits.
Overlapping Proposals
We will not forcibly merge proposals. If multiple strong proposals are submitted on similar topics, we might accept 1-2 workshops in overlapping topics to curate the best workshop program.
Common Pitfalls From Past Workshops
Insufficient time for discussion
Too many invited speakers (some proposals listed a dozen or more) do not make for a great audience experience, and a workshop with nothing but long-form talks is unlikely to lead to new breakthroughs. We encourage organizers to allocate more time to contributed talks, posters, and open discussion.
Leaning too heavily on past success
Proposals for workshops that are part of a series sometimes lean too heavily on the declared popularity of previous workshops. In some cases, this led to proposals that were less creative and innovative than what we had hoped to see.
Unconfirmed or irrelevant speakers
The vast majority of proposals included lists of confirmed invited speakers. This made it hard to champion any workshop that didn’t have at least a few speakers confirmed, especially when many unconfirmed big-name speakers were listed (it’s unlikely all would say yes), or when the diversity statement centered on the assumed presence of unconfirmed speakers. There were also several proposals featuring long lists of “celebrity” speakers without clear relevance to the topic of the workshop.
Going too big
We saw only a few proposals that we felt were too narrow, but many we found too broad. There seems to be a tendency to overreach for the sake of going big, while we’d prefer to see more focused workshops.
Too many organizers or organizing too many workshops
Several proposals had remarkably large organizing committees. It’s not clear why more than five or six organizers would be necessary for a workshop. Similarly, one team organizing too many workshops can work against workshop quality.
Authors: Program Chairs: Razvan Pascanu, Nancy Chen, Marzyeh Ghassemi, Piotr Koniusz, Hsuan-Tien Lin; Assistant Program Chairs: Elena Burceanu, Junhao Dong, Zhengyuan Liu, Po-Yi Lu, Isha Puri
NeurIPS relies on the active participation of the research community to evaluate numerous submissions and uphold scientific quality. Being involved in the reviewing process provides an opportunity to engage with the NeurIPS community, while contributing to the advancement of research excellence. Starting this year, we are aiming to enhance transparency and fairness of the NeurIPS reviewing process by introducing a self-nomination system, in addition to the usual reviewer invitation procedures. The main track and the D&B track decided to recruit reviewers and area chairs together this year, and the selected ones will be allocated to either the main track or the D&B track afterwards.
Criteria
We encourage anyone who wants to be part of the NeurIPS community and believes they have the experience to act as a reviewer or area chair to apply. Good research cannot happen without fair and careful review, and being a reviewer or an area chair is an integral part of being a researcher; it will also help you hone vital skills for your career.
To help you gauge whether you are ready to be a reviewer or an area chair, we provide a few criteria below. These criteria should not be taken in a strict sense: you do not need to meet all of them; meeting a subset is enough to qualify.
Also, as our goal is to increase fairness, we understand that individuals come from different contexts, and given the large diversity of backgrounds in the field, many might qualify as strong reviewers without fitting the criteria below. For that reason, we provide a free-form text box that allows self-nominees to justify their readiness to be a reviewer if their background does not fit the criteria below.
As this initiative is intended to extend the list of reviewers from previous years, depending on the volume of self-nomination applications and the needs of the conference, we may invite only a random subset of eligible candidates. So please do not interpret not being selected as not being qualified for the role you applied for, and do consider re-applying in future years, as the pool of reviewers changes from year to year.
If you have accepted an invitation to be a reviewer but want to self-nominate as an AC, please indicate in the form that you previously accepted a reviewer invitation. This will allow us to remove the double role in case you are selected. Please understand that the most crucial role in the whole reviewing pipeline is that of the reviewer, and it is where NeurIPS needs the most support. So in case you are not selected as an AC, please consider staying on as a reviewer and providing your expertise in judging submitted work.
Self-nominating reviewers
For each self-nomination application for reviewing at NeurIPS 2025, we will consider the following criteria as relevant. You do not need to meet all of them, but should ideally meet several of them to qualify.
Hold a Ph.D. in a relevant field
Published at least 2 papers as first author in the past 5 years in any conference/journal from the following list: REVIEWER_VENUE_LIST
Received at least 10 citations on your first-authored papers (in relevant fields)
Served as a reviewer at any conference/journal in REVIEWER_VENUE_LIST, for at least 2 times in a row
Be an author on at least 4 peer-reviewed papers (this can include non-archival venues like workshops) on relevant topics
If you are willing to self-nominate to serve as a reviewer for NeurIPS 2025, please fill in this form.
Self-nominating ACs
For each self-nomination application for being an AC at NeurIPS 2025, we will consider the following criteria as relevant. You do not need to meet all of them, but should ideally meet several of them to qualify.
At least 5 last-authored papers accepted at AC_CONF_LIST in the past 5 years
At least 10 papers accepted at AC_CONF_LIST in the area of expertise
Served as an AC at any conference in the AC_CONF_LIST at least 2 times in a row
Served as a reviewer at AC_CONF_LIST at least 5 times in a row
Authors: DB Track chairs: Lora Aroyo, Francesco Locatello, Konstantina Palla; DB Track resource and metadata chairs: Meg Risdal, Joaquin Vanschoren
The NeurIPS Datasets & Benchmarks Track exists to highlight the crucial role that high-quality datasets and benchmarks play in advancing machine learning research. While algorithmic innovation often takes center stage, the progress of AI depends just as much on the quality, accessibility, and rigor of the datasets that fuel these models. Our goal is to ensure that impactful dataset contributions receive the recognition and scrutiny they deserve.
This blog post accompanies the release of the Call for Papers for the 2025 Datasets & Benchmarks Track (https://neurips.cc/Conferences/2025/CallForDatasetsBenchmarks), outlining key updates to submission requirements and best practices. Please note that this year the Datasets & Benchmarks Track will follow the NeurIPS 2025 Main Track Call for Papers, with the addition of three track-specific points: (1) single-blind submissions, (2) required dataset and benchmark code submission, and (3) a specific scope for dataset and benchmark paper submissions. See the Call for Papers for details.
The Challenge of Assessing High-Quality Datasets
Unlike traditional research papers, which have well-established peer review standards, dataset and benchmark papers require unique considerations in their review and evaluation. A high-quality dataset must be well-documented, reproducible, and accessible while adhering to best practices in data collection and ethical considerations. Without clear guidelines and automated validation, reviewers face inconsistencies in their assessments, and valuable contributions risk being overlooked.
To address these challenges, we have developed submission and review guidelines that align with widely recognized frameworks in the research community and the open-source movement. For instance, in 2024, we encouraged authors to use established documentation standards such as datasheets for datasets, dataset nutrition labels, data statements for NLP, data cards, and accountability frameworks. By promoting these frameworks, we aim to ensure that dataset contributions are well-documented and transparent, making it easier for researchers to assess their reliability and impact.
Raising the Bar: Machine-Readable Metadata with Croissant
A persistent challenge has been the lack of a standardized, reliable way for reviewers to assess datasets against industry best practices. Unlike the main track, which has commonly accepted standards for paper submissions, dataset reviews still have to mature in this respect.
In 2024, we took a significant step toward improving dataset review by encouraging authors to generate a Croissant machine-readable metadata file to document their datasets. Croissant is an open community effort created because existing standards for dataset metadata lack ML-specific support and lag behind AI’s dynamically evolving requirements. Croissant records ML-specific metadata that enables datasets to be loaded directly into ML frameworks and tools, streamlines usage and community sharing independent of hosting platforms, and includes responsible AI metadata. At that time, Croissant tooling was still in its early stages, and many authors found the process burdensome. Since then, Croissant has matured significantly and gained industry and community adoption. Platforms like Hugging Face, Kaggle, OpenML, and Dataverse now natively support Croissant, making metadata generation effortless.
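For authors who want to sanity-check their metadata before submission, here is a minimal sketch of how one might load a Croissant JSON-LD file and verify that a few commonly expected top-level fields are present. This is only an illustration, not official NeurIPS or Croissant validation tooling; the exact field list is an assumption, and the community-maintained mlcroissant Python package offers more thorough loading and validation.

```python
import json

# Illustrative subset of top-level fields one would expect in a Croissant
# JSON-LD description (not the authoritative specification).
EXPECTED_FIELDS = ["@context", "@type", "name", "description", "license",
                   "distribution", "recordSet"]

def check_croissant(path: str) -> list[str]:
    """Load a Croissant JSON-LD file from disk and return any missing fields."""
    with open(path, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return [field for field in EXPECTED_FIELDS if field not in metadata]

if __name__ == "__main__":
    # Assumes you have exported your dataset's Croissant description to
    # croissant.json, e.g., downloaded from the hosting repository.
    missing = check_croissant("croissant.json")
    print("Missing fields:", missing if missing else "none")
```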
Making High-Quality Dataset Submissions the Standard
With these improvements in tooling and ecosystem support, we are now requiring dataset authors to ensure that their datasets are properly hosted. That means they must release their datasets via a data repository (e.g., Hugging Face, Kaggle, OpenML, or Dataverse) or provide a custom hosting solution that supports and ensures long-term access and includes a Croissant description. We also provide detailed guidelines for authors to make the process as smooth as possible. This requirement ensures that:
Datasets are easily accessible and discoverable through widely used research platforms over long periods of time.
Standard interfaces (e.g., via Python client libraries) simplify dataset retrieval for both researchers and reviewers.
Metadata is automatically validated to streamline the review process.
By enforcing this requirement, we are lowering the barriers to high-quality dataset documentation while improving the overall transparency and reproducibility of dataset contributions.
Looking Ahead
The NeurIPS Datasets & Benchmarks Track is committed to evolving alongside the broader research community. By integrating best practices and leveraging industry standards like Croissant, we aim to enhance the visibility, impact, and reliability of dataset contributions. These changes will help ensure that machine learning research is built on a foundation of well-documented, high-quality datasets that drive meaningful progress.
If you are preparing a dataset submission for NeurIPS, we encourage you to explore Croissant-integrated repositories today and take advantage of the powerful tools available to streamline your metadata generation. Let’s work together to set a new standard for dataset contributions.
By Yixuan Even Xu, Fei Fang, Jakub Tomczak, Cheng Zhang, Zhenyu Sherry Xue, Ulrich Paquet, Danielle Belgrave
Overview
Paper assignment is crucial in conference peer review as we need to ensure that papers receive high-quality reviews and reviewers are assigned papers that they are willing and able to review. Moreover, it is essential that a paper matching process mitigates potential malicious behavior. The default paper assignment approach used in previous years of NeurIPS is to find a deterministic maximum-quality assignment using linear programming. This year, for NeurIPS 2024, as a collaborative effort between the organizing committee and researchers from Carnegie Mellon University, we experimented with a new assignment algorithm [1] that introduces randomness to improve robustness against potential malicious behavior, as well as enhance reviewer diversity and anonymity, while maintaining most of the assignment quality.
TLDR
How did the algorithm do compared to the default assignment algorithm? We compare the randomized assignment calculated by the new algorithm to that calculated by the default algorithm. We measure various randomness metrics [1], including the maximum assignment probability among all paper-reviewer pairs, the average over papers of the maximum assignment probability to any reviewer, the L2 norm, the entropy, and the support size, i.e., the number of paper-reviewer pairs that could be assigned with non-zero probability. As expected, the randomized algorithm was able to introduce a good amount of randomness while keeping the overall assignment quality at 98% of that of the default assignment. We show the computed metrics in the table below. Here, ↑ indicates that a higher value is better, and ↓ indicates that a lower value is better.
Metric              | Default | New
Quality (↑)         | 100%    | 98%
Max Probability (↓) | 1.0     | 0.90
Average Maxprob (↓) | 1.0     | 0.86
L2 Norm (↓)         | 250.83  | 199.50
Entropy (↑)         | 0       | 40,678.48
Support Size (↑)    | 62,916  | 191,266
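As a rough sketch of how such randomness metrics can be computed from a matrix of marginal assignment probabilities (this is not the code used for the official analysis), consider:

```python
import numpy as np

def randomness_metrics(x: np.ndarray) -> dict:
    """Compute randomness metrics for a matrix of marginal assignment
    probabilities x, where x[p, r] is the probability that paper p is
    assigned to reviewer r."""
    nonzero = x[x > 0]
    return {
        # Largest assignment probability over all paper-reviewer pairs.
        "max_probability": float(x.max()),
        # For each paper, the largest probability of being matched to any
        # single reviewer, averaged over papers.
        "average_maxprob": float(x.max(axis=1).mean()),
        # Frobenius (L2) norm of the probability matrix.
        "l2_norm": float(np.linalg.norm(x)),
        # Shannon entropy, treating 0 * log(0) as 0.
        "entropy": float(-(nonzero * np.log(nonzero)).sum()),
        # Number of paper-reviewer pairs with non-zero assignment probability.
        "support_size": int((x > 0).sum()),
    }

# Toy example: 3 papers, 4 reviewers, each paper spread over two reviewers.
x = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.5]])
print(randomness_metrics(x))
```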
Also, one key takeaway from our analysis is that it is important for all reviewers to complete their OpenReview profile and bid actively in order to get high-quality assignments. In fact, among all reviewers who bid “High” or “Very High” on at least one paper, most were assigned a paper that they bid “High” or “Very High” on.
In the rest of this post, we introduce the details of the algorithm, explain how we implemented it, and analyze the deployed assignment for NeurIPS 2024.
The Algorithm
The assignment algorithm we used is Perturbed Maximization (PM) [1], a work published at NeurIPS 2023. To introduce the algorithm, we first briefly review the problem setting of paper assignment in peer review as well as the default algorithm used in previous years.
Problem Setting and Default Algorithm
In a standard paper assignment setting, a set of papers $\mathcal{P}$ needs to be assigned to a set of reviewers $\mathcal{R}$. To ensure each paper receives enough reviewers and no reviewer is overloaded with papers, each paper in $\mathcal{P}$ should be assigned to $\ell_p$ reviewers and each reviewer in $\mathcal{R}$ should receive no more than $\ell_r$ papers. An assignment is represented as a binary matrix $x \in \{0,1\}^{|\mathcal{P}| \times |\mathcal{R}|}$, where $x_{p,r} = 1$ indicates that paper $p$ is assigned to reviewer $r$. The main objective of paper assignment is usually to maximize the predicted match quality between reviewers and papers [2]. To characterize the matching quality, a similarity matrix $S \in \mathbb{R}^{|\mathcal{P}| \times |\mathcal{R}|}$ is commonly used [2-8]. Here, $S_{p,r}$ represents the predicted quality of a review by reviewer $r$ for paper $p$ and is generally computed from various sources [9], e.g., reviewer-selected bids and textual similarity between the paper and the reviewer’s past work [2, 10-13]. Then, the quality of an assignment can be defined as the total similarity of all assigned paper-reviewer pairs, i.e., $\mathrm{Quality}(x) = \sum_{p \in \mathcal{P}} \sum_{r \in \mathcal{R}} S_{p,r}\, x_{p,r}$. One standard approach for computing a paper assignment is to maximize quality [2, 5-8, 14], i.e., to solve the following optimization problem:

$$\max_{x} \ \mathrm{Quality}(x) \quad \text{s.t.} \quad \sum_{r \in \mathcal{R}} x_{p,r} = \ell_p \ \ \forall p, \qquad \sum_{p \in \mathcal{P}} x_{p,r} \le \ell_r \ \ \forall r, \qquad x_{p,r} \in \{0,1\} \ \ \forall p, r.$$
The optimization above can be solved efficiently by linear programming and is widely used in practice. In fact, the default automatic assignment algorithm used in OpenReview is also based on this linear programming formulation and has been used for NeurIPS in past years.
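To make the formulation concrete, here is a small sketch of the maximum-quality assignment solved as a linear program with SciPy on toy data; it is an illustration, not the OpenReview implementation. Because the constraint matrix of this bipartite matching problem is totally unimodular, the LP relaxation returns an integral assignment.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
num_papers, num_reviewers = 4, 6
reviewers_per_paper = 2       # ell_p: reviewers required per paper
papers_per_reviewer_cap = 2   # ell_r: max papers per reviewer

S = rng.random((num_papers, num_reviewers))  # toy similarity matrix

# Decision variables x[p, r], flattened row-major; maximizing sum(S * x)
# is the same as minimizing -S . x.
c = -S.ravel()

# Equality constraints: each paper gets exactly `reviewers_per_paper` reviewers.
A_eq = np.zeros((num_papers, num_papers * num_reviewers))
for p in range(num_papers):
    A_eq[p, p * num_reviewers:(p + 1) * num_reviewers] = 1.0
b_eq = np.full(num_papers, reviewers_per_paper)

# Inequality constraints: each reviewer gets at most `papers_per_reviewer_cap` papers.
A_ub = np.zeros((num_reviewers, num_papers * num_reviewers))
for r in range(num_reviewers):
    A_ub[r, r::num_reviewers] = 1.0
b_ub = np.full(num_reviewers, papers_per_reviewer_cap)

result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=(0, 1), method="highs")
assignment = result.x.reshape(num_papers, num_reviewers).round().astype(int)
print("Assignment matrix:\n", assignment)
print("Total quality:", (assignment * S).sum())
```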
Perturbed Maximization
While the deterministic maximum-quality assignment is the most common, there are strong reasons [1] to introduce randomness into paper assignment, i.e., to determine a probability distribution over feasible deterministic assignments and sample one assignment from the distribution. For example, one important reason is that randomization can help mitigate potential malicious behavior in the paper assignment process. Several computer science conferences have uncovered “collusion rings” of reviewers and authors [15-16], in which reviewers aim to get assigned to the authors’ papers in order to give them good reviews without considering their merits. Randomization can help break such collusion rings by making it harder for the colluding reviewers to get assigned to the papers they want. The randomness will also naturally increase reviewer diversity and enhance reviewer anonymity.
Perturbed Maximization (PM) [1] is a simple and effective algorithm that introduces randomness into paper assignment. Mathematically, PM solves a perturbed version of the optimization problem above, parameterized by a number $Q \in (0, 1]$ and a concave perturbation function $f$:

$$\max_{x} \ \sum_{p \in \mathcal{P}} \sum_{r \in \mathcal{R}} S_{p,r}\, f(x_{p,r}) \quad \text{s.t.} \quad \sum_{r \in \mathcal{R}} x_{p,r} = \ell_p \ \ \forall p, \qquad \sum_{p \in \mathcal{P}} x_{p,r} \le \ell_r \ \ \forall r, \qquad x_{p,r} \in [0, Q] \ \ \forall p, r.$$
In this perturbed optimization, the variables are no longer binary but continuous in $[0, Q]$. This is because we changed the meaning of $x_{p,r}$ in the randomized assignment context: $x_{p,r}$ now represents the marginal probability that paper $p$ is assigned to reviewer $r$. By constraining $x_{p,r}$ to be in $[0, Q]$, we ensure that each paper-reviewer pair has a probability of at most $Q$ of being assigned to each other. This constraint is adopted from an earlier work on randomized paper assignment [17]. The perturbation function $f$ is a concave function that is used to penalize high values of $x_{p,r}$, so that the probability mass is spread more evenly among the paper-reviewer pairs. The perturbation function can be chosen in various ways; one simple option is a quadratic perturbation, e.g., $f(x) = x - \beta x^2$ for some $\beta > 0$, which makes the optimization concave and quadratic, allowing us to solve it efficiently.
After solving the optimization problem above, we obtain a probabilistic assignment matrix $x$. To get the final assignment, we then sample a deterministic assignment according to the method in [17-18] so that its marginal assignment probabilities match $x$. The method is based on the Birkhoff-von Neumann theorem.
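The snippet below is a minimal sketch of the perturbed optimization using CVXPY with a quadratic perturbation; the perturbation form and the values of Q and beta are illustrative assumptions, not the deployed configuration, and the subsequent Birkhoff-von Neumann sampling step is not shown.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
num_papers, num_reviewers = 4, 6
ell_p, ell_r = 2, 2           # reviewers per paper / max papers per reviewer
Q, beta = 0.9, 0.3            # illustrative parameter values

S = rng.random((num_papers, num_reviewers))      # toy similarity matrix
x = cp.Variable((num_papers, num_reviewers))     # marginal assignment probabilities

# Perturbed objective: sum over pairs of S[p, r] * f(x[p, r]), with f(t) = t - beta * t^2.
objective = cp.Maximize(cp.sum(cp.multiply(S, x - beta * cp.square(x))))

constraints = [
    cp.sum(x, axis=1) == ell_p,   # each paper assigned to exactly ell_p reviewers in expectation
    cp.sum(x, axis=0) <= ell_r,   # each reviewer receives at most ell_r papers in expectation
    x >= 0,
    x <= Q,                       # no pair is assigned with probability above Q
]

problem = cp.Problem(objective, constraints)
problem.solve()
print("Marginal assignment probabilities:\n", np.round(x.value, 2))
```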
Implementation
Since this is the first time we use a randomized algorithm for paper assignment at NeurIPS, the organizing committee decided to set the parameters so that the produced assignment is close in quality to the maximum-quality assignment, while introducing a moderate amount of randomness. Moreover, we introduced additional constraints to ensure that the randomization does not result in many low-quality assignments.
Similarity Computation of NeurIPS 2024
The similarity matrix of NeurIPS 2024 was computed from two sources: affinity scores and bids. The affinity scores were computed using a text similarity model comparing each paper’s text with the reviewer’s past work on OpenReview, and were normalized to a fixed range. The bids were collected from reviewers during the bidding phase, where reviewers could bid on papers they were interested in reviewing at five levels: “Very High”, “High”, “Neutral”, “Low”, and “Very Low”; each bid level was mapped to a corresponding numeric value. The final similarity matrix was computed as the sum of the normalized affinity scores and the mapped bids.
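As a purely illustrative sketch of this kind of aggregation (the actual normalization range and the numeric value assigned to each bid level at NeurIPS 2024 are not reproduced here; the mapping below is a placeholder), one could combine the two sources as follows.

```python
import numpy as np

# Hypothetical bid-to-number mapping; the real NeurIPS 2024 values may differ.
BID_VALUES = {"Very High": 1.0, "High": 0.5, "Neutral": 0.0, "Low": -0.5, "Very Low": -1.0}

def similarity(affinity: np.ndarray, bids: list[list[str]]) -> np.ndarray:
    """Combine text-based affinity scores with reviewer bids into a similarity matrix."""
    # Normalize affinity scores to [0, 1] (illustrative min-max normalization).
    norm = (affinity - affinity.min()) / (affinity.max() - affinity.min() + 1e-12)
    bid_matrix = np.array([[BID_VALUES[b] for b in row] for row in bids])
    return norm + bid_matrix

affinity = np.array([[0.2, 0.8], [0.5, 0.1]])          # papers x reviewers
bids = [["Neutral", "Very High"], ["High", "Low"]]     # missing bids default to "Neutral"
print(similarity(affinity, bids))
```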
Additional Constraints for Restricting Low-Quality Assignments
One of the main concerns of paper assignment in large conferences like NeurIPS is the occurrence of low-quality assignments, because the matching quality of any individual paper-reviewer pair significantly affects the relevance of both papers and reviewers. To mitigate this issue, we explicitly restrict the number of low-quality assignments. Specifically, we first solve another optimization problem without the perturbation function [17]:

$$\max_{x} \ \sum_{p \in \mathcal{P}} \sum_{r \in \mathcal{R}} S_{p,r}\, x_{p,r} \quad \text{s.t.} \quad \sum_{r \in \mathcal{R}} x_{p,r} = \ell_p \ \ \forall p, \qquad \sum_{p \in \mathcal{P}} x_{p,r} \le \ell_r \ \ \forall r, \qquad x_{p,r} \in [0, Q] \ \ \forall p, r.$$
Let the optimal solution of this problem be $x^*$. We want to ensure that adding the perturbation function does not introduce additional low-quality assignments compared to $x^*$. To achieve this, we choose a set of quality thresholds $T$. For each $t \in T$, we add a constraint that our perturbed assignment should have at least as many assignments with quality above $t$ as $x^*$, i.e.,

$$\sum_{(p,r):\, S_{p,r} \ge t} x_{p,r} \ \ge \ \sum_{(p,r):\, S_{p,r} \ge t} x^*_{p,r} \qquad \forall t \in T.$$
The thresholds were chosen to distinguish between different levels of quality. According to the similarity computation for NeurIPS 2024, matchings with quality above the highest threshold are “good” ones, in which either the reviewer has a high affinity score with the paper and bids positively on it, or the reviewer bids “Very High” on the paper; matchings with quality above the middle threshold are “moderate” ones, in which either the reviewer has a high affinity score with the paper or the reviewer bids positively on it; and matchings with quality above the lowest threshold are “acceptable” ones, in which the reviewer has a moderate affinity score with the paper and bids neutrally on it. By setting these thresholds, we limited the number of low-quality assignments introduced by the perturbation function.
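Continuing the earlier CVXPY sketch, such quality-floor constraints could be added as in the hedged helper below, where x is the CVXPY variable, S the similarity matrix, x_star the solution of the unperturbed problem, and the threshold values are placeholders rather than the ones used at NeurIPS 2024.

```python
import cvxpy as cp
import numpy as np

def quality_floor_constraints(x, S: np.ndarray, x_star: np.ndarray,
                              thresholds: list[float]) -> list:
    """For each threshold t, require the perturbed assignment to place at least
    as much probability mass on pairs with similarity >= t as the unperturbed
    solution x_star does."""
    constraints = []
    for t in thresholds:
        mask = (S >= t).astype(float)
        constraints.append(cp.sum(cp.multiply(mask, x)) >= float((mask * x_star).sum()))
    return constraints

# Example usage (placeholder thresholds for "good" / "moderate" / "acceptable"):
# constraints += quality_floor_constraints(x, S, x_star, thresholds=[1.5, 1.0, 0.5])
```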
Running the Algorithm
We integrated the Python implementation of PM into the OpenReview system, using Gurobi [19] as the solver for the concave optimization. However, since the number of papers and reviewers in NeurIPS 2024 is so large, we could not directly use OpenReview’s computing resources to solve the assignment in early 2024. Instead, we ran the algorithm on a local server with anonymized data. The assignment was then uploaded to OpenReview for further processing, such as manual adjustments by the program committee. We ran four different parameter settings of PM and sampled three assignments from each setting. Each parameter setting took around 4 hours to run on a server with 112 cores, using peak memory of around 350GB. The final deployed assignment was chosen by the program committee from among these parameter settings based on the computed statistics of the assignments, and the maximum number of papers each reviewer could review was capped at a fixed limit.
Analysis of The Deployed Assignment
How did the assignment turn out? We analyzed various statistics of the assignment, including the aggregate scores, affinity scores, reviewer bids, reviewer load, and reviewer confidence in the review process. We also compared the statistics across different subject areas. Here are the results.
Aggregate Scores
The deployed assignment achieved high average and median aggregate scores. Recalling the computation of the similarity matrix, this means the majority of the assignments are of very high quality, with high affinity scores and “Very High” bids. Additionally, we note that every single matched paper-reviewer pair has an aggregate score above the “moderate” threshold, which means that each assigned pair is at least a “moderate” match. In addition, we see no statistical difference in the aggregate score across different subject areas, despite the varying sizes of the areas.
Affinity Scores
Since the aggregate score is the sum of the affinity score based on text similarity and the converted reviewer bids, we also checked the distribution of these two key scores. The deployed assignment achieved high affinity scores both on average and in the median. Note that there are also some matched pairs with zero affinity scores. These pairs are matched because the reviewers bid “Very High” on the papers, which by itself results in a high aggregate score. Therefore, we still prioritize these pairs over those with positive affinity scores but neutral or negative bids.
Reviewer Bids
For reviewer bids, we see that most of the assigned pairs have “Very High” bids from the reviewers, with the majority of the rest having “High” bids. Moreover, not a single pair has a negative bid. This indicates that reviewers are generally interested in the papers they are assigned to. Note that although we default missing bids to “Neutral”, the number of matched pairs with “Missing” bids is larger than that of pairs with “Neutral” bids. This is because if a reviewer submitted their bids, they are most likely assigned to the papers they bid positively on. The matched pairs with “Missing” bids are usually those where reviewers did not submit their bids, and the assignment for them was purely based on the affinity scores.
Reviewer Load
If we distributed the reviewer load evenly, each reviewer would be assigned, on average, noticeably fewer papers than the per-reviewer limit. However, as the assignment algorithm aims for high-quality assignments, the majority of reviewers were assigned the maximum number of papers, i.e., the limit we set for reviewers.
Nevertheless, some reviewers in the pool are not assigned to any papers or are assigned to only one paper. After analyzing the data more carefully, we found that most of these reviewers either had no known affinity scores with the papers (mostly because they did not have any past work on OpenReview) or did not submit their bids. Moreover, there are even reviewers who had neither affinity scores nor bids. Therefore, it is hard for the algorithm to find good matches for them.
We suggest that reviewers submit their bids and provide more information about their past work to help the algorithm find better matches for them.
While the reviewer load distribution for each subject area generally follows the overall distribution, we note that some subject areas, like Bandits, have a notably higher number of papers assigned to each reviewer. In fact, most reviewers in the Bandits area were assigned the maximum number of papers. This indicates that for these areas, we will need to work harder to recruit more reviewers in future conferences.
Reviewer Confidence
In the review process, reviewers were asked to provide their confidence in their reviews on a numeric scale. The distribution of reviewer confidence is shown below; one special value indicates that the matched pair was adjusted manually by the area chairs, and another indicates that the reviewer did not submit their review. We can see that among the pairs where the reviewer completed the review, most matched pairs have high confidence. This indicates that reviewers are generally confident in their reviews.
On a side note, we found that reviewer confidence is generally lower for theoretical areas like Algorithmic Game Theory, Bandits, Causal Inference, and Learning Theory, while it is higher for other areas. It is hard to explain this phenomenon exactly, but we think this might be because the difficulty of reviewing papers in theoretical areas is generally higher, leading reviewers to be more cautious in their reviews.
Comparison with the Default Algorithm
Besides analyzing the deployed assignment, it is also natural to ask how the new algorithm PM compares to the default algorithm used in OpenReview. To answer this question, we ran the default algorithm on the same data and compared the resulting assignment with the deployed assignment. Below, we show the comparison with the default algorithm in aggregate scores, reviewer bids, and reviewer load.
Aggregate Scores
In terms of aggregate scores, the default algorithm achieved a slightly higher average than PM: PM’s average aggregate score is about 98% of that of the default algorithm. Note that the default algorithm is optimal in quality, so any other algorithm will have a lower quality, and this difference is expected.
Reviewer Bids
How do the sampled assignments resulting from the new algorithm differ from the default one? Here we show the distribution of reviewer bids in the default assignment, the overlap between the optimal deterministic assignment and the deployed assignment, and the overlap between the optimal deterministic assignment and three sampled assignments from PM. As seen in the following figure, a non-negligible number of matched pairs have changed from the default assignment to the deployed assignment, and over half of the matched pairs would be different in three samples from PM. This indicates that PM introduces a good amount of randomness into the assignment, increasing robustness against malicious behavior while incurring only a small loss in matching quality.
Reviewer Load
Another side benefit of PM is that it can help distribute the reviewer load more evenly. In the following figure, we show the distribution of reviewer load in the optimal deterministic assignment and the deployed assignment. We can see that the numbers of reviewers at both extremes of the load distribution are reduced in the deployed assignment compared to the optimal one. To ensure an even more balanced reviewer load, additional constraints on the minimum number of papers per reviewer could be added in the future.
Conclusion
In this post, we introduced the paper assignment algorithm used for NeurIPS 2024 and explained how we implemented it. We analyzed the results of the assignment and compared it with the default algorithm used in OpenReview. We found that the assignment produced by the new algorithm achieved high-quality matches, with a good amount of randomness introduced into the assignment, increasing robustness against malicious behavior as well as enhancing reviewer diversity and anonymity. In future conferences, we suggest that reviewers submit their bids and provide more information about their past work to help the algorithm find better matches for them.
References
[1] Xu, Yixuan Even, Steven Jecmen, Zimeng Song, and Fei Fang. “A One-Size-Fits-All Approach to Improving Randomness in Paper Assignment.” Advances in Neural Information Processing Systems 36 (2024).
[2] Charlin, Laurent, and Richard Zemel. “The Toronto paper matching system: an automated paper-reviewer assignment system.” (2013).
[3] Stelmakh, Ivan, Nihar Shah, and Aarti Singh. “PeerReview4All: Fair and accurate reviewer assignment in peer review.” Journal of Machine Learning Research 22.163 (2021): 1-66.
[4] Jecmen, Steven, Hanrui Zhang, Ryan Liu, Fei Fang, Vincent Conitzer, Nihar B. Shah. “Near-optimal reviewer splitting in two-phase paper reviewing and conference experiment design.” Proceedings of the AAAI Conference on Human Computation and Crowdsourcing. Vol. 10. 2022.
[5] Tang, Wenbin, Jie Tang, and Chenhao Tan. “Expertise matching via constraint-based optimization.” 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Vol. 1. IEEE, 2010.
[6] Flach, Peter A., Sebastian Spiegler, Bruno Golenia, Simon Price, John Guiver, Ralf Herbrich, Thore Graepel and Mohammed J. Zaki. “Novel tools to streamline the conference review process: Experiences from SIGKDD’09.” ACM SIGKDD Explorations Newsletter 11.2 (2010): 63-67.
[7] Taylor, Camillo J. “On the optimal assignment of conference papers to reviewers.” University of Pennsylvania Department of Computer and Information Science Technical Report 1.1 (2008): 3-1.
[8] Charlin, Laurent, Richard S. Zemel, and Craig Boutilier. “A Framework for Optimizing Paper Matching.” UAI. Vol. 11. 2011.
[9] Shah, Nihar B. “Challenges, experiments, and computational solutions in peer review.” Communications of the ACM 65.6 (2022): 76-87.
[10] Mimno, David, and Andrew McCallum. “Expertise modeling for matching papers with reviewers.” Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. 2007.
[11] Liu, Xiang, Torsten Suel, and Nasir Memon. “A robust model for paper reviewer assignment.” Proceedings of the 8th ACM Conference on Recommender systems. 2014.
[12] Rodriguez, Marko A., and Johan Bollen. “An algorithm to determine peer-reviewers.” Proceedings of the 17th ACM conference on Information and knowledge management. 2008.
[13] Tran, Hong Diep, Guillaume Cabanac, and Gilles Hubert. “Expert suggestion for conference program committees.” 2017 11th International Conference on Research Challenges in Information Science (RCIS). IEEE, 2017.
[14] Goldsmith, Judy, and Robert H. Sloan. “The AI conference paper assignment problem.” Proc. AAAI Workshop on Preference Handling for Artificial Intelligence, Vancouver. 2007.
[16] Littman, Michael L. “Collusion rings threaten the integrity of computer science research.” Communications of the ACM 64.6 (2021): 43-44.
[17] Jecmen, Steven, Hanrui Zhang, Ryan Liu, Nihar B. Shah, Vincent Conitzer and Fei Fang. “Mitigating manipulation in peer review via randomized reviewer assignments.” Advances in Neural Information Processing Systems 33 (2020): 12533-12545.
[18] Budish, Eric, Yeon-Koo Che, Fuhito Kojima and Paul Milgrom. “Implementing random assignments: A generalization of the Birkhoff-von Neumann theorem.” Cowles Summer Conference. Vol. 2. No. 2.1. 2009.
By Marco Cuturi, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub Tomczak, Cheng Zhang, Lora Aroyo, Francesco Locatello, Lingjuan Lyu
The search committees for the “Best Paper Award” were nominated by the program chairs and the respective track chairs, who selected leading researchers with diverse perspectives on machine learning topics. These nominations were approved by the general and DIA chairs.
The best paper award committees were tasked with selecting a handful of highly impactful papers from both tracks of the conference. The search committees considered all accepted NeurIPS papers equally, and made decisions independently based on the scientific merit of the papers, without separate considerations of authorship or other factors, in keeping with the NeurIPS blind review process.
With that, we are excited to share the news that the best and runner-up paper awards this year go to five ground-breaking papers (four from the main track and one from the datasets and benchmarks track) that highlight, respectively, a new autoregressive model for vision, new avenues for supervised learning using higher-order derivatives, improved training of LLMs, improved inference methods for text-to-image diffusion, and a novel, diverse benchmark dataset for LLM alignment.
This paper introduces a novel visual autoregressive (VAR) model that iteratively predicts the image at the next higher resolution, rather than the next patch of the image in an arbitrary ordering. The VAR model shows strong results in image generation, outperforming existing autoregressive models in efficiency and achieving results competitive with diffusion-based methods. At the core of this contribution lies an innovative multiscale VQ-VAE implementation. The overall quality of the paper’s presentation, experimental validation, and insights (scaling laws) gives compelling reasons to experiment with this model.
This paper proposes a tractable approach to train neural networks (NNs) using supervision that incorporates higher-order derivatives. Such problems arise when training physics-informed NNs to fit certain PDEs. Naive application of automatic differentiation rules is both inefficient and intractable in practice for higher orders k and high dimensions d. While these costs can be mitigated independently (e.g., for large k but small d, or large d but small k using subsampling), this paper proposes a method, the Stochastic Taylor Derivative Estimator (STDE), that can address both. This work opens up possibilities in scientific applications of NNs and, more generally, in supervised training of NNs using higher-order derivatives.
This paper presents a simple method to filter pre-training data when training large language models (LLMs). The method builds on the availability of a high-quality reference dataset on which a reference language model is trained. That model is then used to assign a quality score to tokens that come from a larger pre-training corpus. Tokens whose scores have the highest rank are then used to guide the final LLM training, while the others are discarded. This ensures that the final LLM is trained on a higher-quality dataset that is well aligned with the reference dataset.
This paper proposes an alternative to classifier-free guidance (CFG) in the context of text-to-image (T2I) models. CFG is a guidance technique (a correction in diffusion trajectories) that is extensively used by practitioners to obtain better prompt alignment and higher-quality images. However, because CFG uses an unconditional term that is independent of the text prompt, it has been empirically observed to reduce the diversity of image generation. The paper proposes to replace CFG with Autoguidance, which uses a noisier, less well-trained T2I diffusion model. This change leads to notable improvements in diversity and image quality.
Alignment of LLMs with human feedback is one of the most impactful research areas of today, with key challenges such as confounding by different preferences, values, or beliefs. This paper introduces the PRISM dataset, providing a unique perspective on human interactions with LLMs. The authors collected data from 75 countries with diverse demographics, sourced both subjective and multicultural perspectives, and benchmarked over 20 current state-of-the-art models. The paper has high societal value and enables research on pluralism and disagreements in RLHF.
Best Paper Award committee for main track: Marco Cuturi (Committee Lead), Zeynep Akata, Kim Branson, Shakir Mohamed, Remi Munos, Jie Tang, Richard Zemel, Luke Zettlemoyer
Best Paper Award committee for dataset and benchmark track: Yulia Gel, Ludwig Schmidt, Elena Simperl, Joaquin Vanschoren, Xing Xie.
Large language models (LLMs) represent a promising but controversial aide in the process of preparing and reviewing scientific papers. Despite risks like inaccuracy and bias, LLMs are already being used in the review of scientific papers. [1,2] Their use raises the pressing question: “How can we harness LLMs responsibly and effectively in the application of conference peer review?”
In an experiment at this year’s NeurIPS, we took an initial step towards answering this question. We evaluated a relatively clear-cut and low-risk use case: vetting paper submissions against submission standards, with results shown only to paper authors. We deployed an optional LLM-based “Checklist Assistant” to authors at NeurIPS 2024 to help check compliance with the NeurIPS Paper Checklist. We then systematically evaluated the benefits and risks of the LLM Checklist Assistant, focusing on two main questions:
(1) Do authors perceive an LLM Author Checklist Assistant as a valuable enhancement to the paper submission process?
(2) Does the use of an Author Checklist Assistant meaningfully help authors to improve their paper submissions?
While there are nuances to our results, the main takeaway is that an LLM Checklist Assistant can effectively aid authors in ensuring scientific rigor, but should likely not be used as a fully automated review tool that replaces human review.
Example of checklist questions, answers, and LLM-provided review from the Checklist Assistant.
(1) Did authors find the Checklist Assistant useful?
We administered surveys both before and after use of the Checklist Assistant asking authors about their expectations for and perceptions of the tool. We received 539 responses to the pre-usage survey, 234 submissions to the Checklist Assistant and 78 responses to the post-usage survey.
Authors felt the Checklist Assistant was a valuable enhancement to the paper submission process. The majority of surveyed authors reported a positive experience using the LLM Checklist Assistant: >70% of authors found it useful and >70% said they would modify their paper in response to feedback.
Interestingly, authors’ expectations of the assistant’s effectiveness were even more positive before using it than their assessments after actually using it. Comparing pre- and post-usage responses, there was a statistically significant drop in positive feedback on the “Useful” and “Excited to Use” questions.
Responses to survey questions before and after using checklist verification (n=63 unique responses.)
(2) What were the main issues authors had with the Checklist Assistant?
We also solicited freeform feedback on issues that the authors experienced using the Checklist Assistant, with responses grouped below.
Among the main issues reported by authors in qualitative feedback, the most frequently cited problems were inaccuracy (20/52 respondents) and that the LLM was too strict in its requirements (14/52 respondents).
Reported issues using checklist verification from freeform feedback on post-usage survey (n=52 out of 78 total survey responses.)
(3) What kinds of feedback did the Checklist Assistant provide?
We used another LLM to extract key points from the Checklist Assistant’s responses for each question on the paper checklist and to cluster these points into overarching categories. Below we show the most frequent categories of feedback given by the Author Checklist Assistant on four questions of the checklist:
Clustering of most common types of feedback given by the LLM Checklist Assistant on four checklist questions.
The LLM was able to give concrete feedback to authors grounded in the content of their paper and checklist. The LLM tended to provide 4-6 distinct and specific points of feedback per question across the 15 questions. While it tended to give some generic boilerplate as part of its responses and to expand the scope of questions, it was also capable of giving concrete and specific feedback for many questions.
(4) Did authors actually modify their submissions?
Authors’ freeform survey responses reflect that they planned to make meaningful changes to their submissions: 35/78 survey respondents described specific modifications they would make to their submissions in response to the Checklist Assistant. These included improving justifications for checklist answers and adding more details to the paper about experiments, datasets, or compute resources.
In 40 instances, authors submitted their paper twice to the checklist verifier (accounting for 80 total paper submissions). Of these 40 pairs of papers, in 22 instances authors changed at least one answer in their checklist (e.g., ‘NA’ to ‘Yes’) between the first and second submission, and in 39 instances they changed at least one justification for a checklist answer. Of the authors who changed justifications on their paper checklist, many made a large number of changes, with 35/39 changing more than 6 of the 15 justifications on the checklist. While we cannot causally attribute these changes to the Checklist Assistant, they suggest that authors may have incorporated feedback from the assistant in between submissions.
Below, we show the (multiplicative) increase in word count between the initial and final submission for questions where authors changed justifications (a value of 2 corresponds to a doubling of the length of an answer). We find that over half the time when authors changed a checklist answer, they more than doubled the length of their justification.
Change in word count of authors’ checklist responses between first and second submission to the Checklist Assistant. Over half the time, authors more than doubled the length of their checklist response.
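For readers interested in reproducing this kind of analysis, here is a minimal sketch of how the multiplicative change in justification length could be computed, assuming each submission's checklist is available as a dictionary from question id to justification text (the data format and field names here are hypothetical, not the format used in our deployment):

```python
# Sketch: multiplicative change in justification length between two submissions.
# Assumes each submission's checklist is a dict mapping question id -> justification text.

def word_count(text: str) -> int:
    return len(text.split())

def length_ratios(first: dict, second: dict) -> dict:
    """Return, per question, the ratio of justification word counts (second / first)
    for questions whose justification text changed between submissions."""
    ratios = {}
    for qid, old_text in first.items():
        new_text = second.get(qid, "")
        if new_text and new_text != old_text and word_count(old_text) > 0:
            ratios[qid] = word_count(new_text) / word_count(old_text)
    return ratios

# Toy usage example:
first_submission = {"Q1": "Code is released.", "Q2": "We report error bars."}
second_submission = {"Q1": "Code is released at the repository linked in the abstract, "
                           "with scripts to reproduce all experiments.",
                     "Q2": "We report error bars."}
print(length_ratios(first_submission, second_submission))  # -> {'Q1': 5.33} (approximately)
```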
In summary:
When authors submitted to the Checklist Assistant multiple times they almost always made changes to their checklists between submissions and significantly lengthened their answers, suggesting that they may have added content in response to LLM feedback.
(5) Can the Checklist Assistant be gamed?
The intended use of our Checklist Assistant was to help authors improve their papers, not to serve as a tool for reviewers to verify the accuracy of authors' responses. If the system were used as an automated verification step in a review process, it could create an incentive for authors to “game” the system, motivating the following question: could authors automatically improve the evaluations of their checklist responses with the help of AI, without making actual changes to their paper? If such gaming were possible, authors could give a conference a false impression of compliance without (much) additional effort and without actually improving their papers.
To assess whether our system is vulnerable to such gaming, we employed another LLM as an attack agent to iteratively manipulate the checklist justifications, aiming to deceive the Checklist Assistant. In this iterative process, the attack agent receives feedback from the system after each round and uses it to refine its justifications. We provided GPT-4o with the initial checklist responses and instructed it to revise the justifications based solely on feedback, without altering the underlying content of the paper. We allowed the attack agent to do this for three iterations (reflecting the submission limit on our deployed assistant), with the agent selecting the highest scored response for each checklist question over the iterations.
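The attack loop itself is simple. Below is a rough sketch under stated assumptions: the callables get_feedback, revise, and score are hypothetical stand-ins for the deployed Checklist Assistant and the GPT-4o attack agent, which we do not reproduce here:

```python
# Sketch of the iterative "gaming" attack: an attacker LLM revises checklist
# justifications based only on Checklist Assistant feedback, keeping the
# best-scoring revision per question. The functions passed in are hypothetical
# stand-ins for the deployed assistant and for an attacker LLM call.

def run_attack(justifications: dict,
               get_feedback,       # hypothetical: (question id, justification) -> assistant feedback text
               revise,             # hypothetical: (justification, feedback) -> revised justification (attacker LLM)
               score,              # hypothetical: (question id, justification) -> 1 if "No Issues" else 0
               n_iterations: int = 3) -> dict:
    best = {}
    for qid, text in justifications.items():
        candidates = [text]
        current = text
        for _ in range(n_iterations):
            feedback = get_feedback(qid, current)
            current = revise(current, feedback)   # revise the text only; the paper itself is unchanged
            candidates.append(current)
        # Keep the highest-scoring justification found across iterations.
        best[qid] = max(candidates, key=lambda t: score(qid, t))
    return best
```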
To statistically quantify attack success, we submitted the selected justifications to our Checklist Assistant for an evaluation “Score” (1 when the Checklist Assistant reported “No Issues” for the checklist question and 0 when it identified issues). Below we show the results of this attack:
Authors could adversarially improve their chance of passing an automated LLM Checklist review by modifying their checklist without modifying their paper: on 14 out of 15 questions our simple adversarial attack shows a score increase when comparing the unchanged justifications with those refined through three iterations of automated improvement.
In a manual examination of the changes made by the (attacker) LLM to the justifications, we found that the attacker LLM employed several illegitimate strategies, such as adding a hardware description that was not present in the paper or original justifications and generating a placeholder URL as a purported repository for the code. These illegitimate justifications were evaluated as adequate by the Checklist Assistant.
Conclusions
Our deployment of an LLM-based paper Checklist Assistant at NeurIPS 2024 demonstrated that LLMs hold potential in enhancing the quality of scientific submissions by assisting authors in validating whether their papers meet submission standards. However, our study points to notable limitations in deploying LLMs within the scientific peer review process that need to be addressed, in particular accuracy and alignment issues. Further, our system was not robust to gaming by authors, suggesting that while a Checklist Assistant could be useful as an aid to authors it may be a poor substitute for human review. NeurIPS will continue to build on its LLM Policy Reviews for 2025.
Windfall Films will be filming at NeurIPS 2024 for a feature documentary about AI. This filming is a first-time experiment at NeurIPS, and any other filming is not permitted.
The team will be filming the poster sessions on the afternoon of Wednesday Dec 11, and they may also film a small number of other events (which we will also announce ahead of filming). The areas will be clearly marked with signs that contain the following info:
It is possible that you will be included in general shots of the conference. If you don’t want to appear in the programme, please contact Zara Powell at zara.powell@windfallfilms.com or +44 7557771061 immediately.
You can also find this info here: https://neurips.cc/Conferences/2024/FilmingNotice.
For any feedback about this filming, please contact the NeurIPS Communication Chairs at communication-chairs@neurips.cc.
By Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub Tomczak, Cheng Zhang
We are honored to announce the Test of Time Paper Awards for NeurIPS 2024. This award is intended to recognize papers published 10 years ago at NeurIPS 2014 that have significantly shaped the research field since then, standing the test of time.
This year, we are making an exception to award two Test of Time papers given the undeniable influence of these two papers on the entire field. The awarded papers are:
Generative Adversarial Nets Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
Sequence to Sequence Learning with Neural Networks Ilya Sutskever, Oriol Vinyals, Quoc V. Le
The Generative Adversarial Nets paper has been cited more than 85,000 times as of this blog post. It is one of the foundational pieces of generative modeling and has inspired numerous research advances over the past 10 years. Beyond research, it has also enabled generative modeling to make an impact across a diverse range of applications, spanning vision and many other data domains.
Sequence to Sequence Learning with Neural Networks has been cited more than 27,000 times as of this blog post. With the rapid advances of large language models and foundation models in general driving a paradigm shift in AI and its applications, the field has benefited greatly from the foundation laid by this work. It is the cornerstone work that established the encoder-decoder architecture, inspiring the later attention-based improvements that led to today’s foundation model research.
There is no doubt regarding the significance of these two works.
At NeurIPS, you will see both papers presented by the authors in person, followed by a short Q&A on Friday, December 13th, 2024. We look forward to announcing other NeurIPS awards at the conference!
Welcome to the November edition of the NeurIPS monthly Newsletter! NeurIPS 2024 is around the corner. It will be held in Vancouver in less than a month, from Tuesday, Dec 10 to Sunday, Dec 15, 2024.
The NeurIPS Newsletter aims to provide an easy way to keep up to date with NeurIPS events and planning progress, respond to requests for feedback and participation, and find information about new initiatives.
You are receiving this newsletter as per your subscription preferences in your NeurIPS profile. As you prepare to attend NeurIPS, we hope that you will find the following information valuable. To unsubscribe from the NeurIPS newsletter, unselect the “Subscribe to Newsletter” checkbox in your profile:https://neurips.cc/Profile/subscribe. To update your email preference, visit: https://neurips.cc/FAQ/EmailPreferences
NeurIPS 2024 workshops will take place on Dec. 14 & 15. We received 204 total submissions — a significant increase from last year. From this great batch of submissions, we have accepted 56 workshops. Given the exceptional quality of submissions this year, we wish we could have accepted many more, but we could not due to logistical constraints. We want to thank everyone who put in tremendous effort in submitting a workshop proposal. For a list of accepted workshops, refer to our blog post here.
3) NeurIPS 2024 Tutorials
The NeurIPS 2024 tutorials will be held on Tuesday, Dec 10. There will be 14 tutorials this year. All of them will be conducted in person to encourage active participation and some of them include panels to allow for a diverse range of discussion. For a list of accepted tutorials and their speakers, refer to our blog post here.
4) NeurIPS 2024 Affinity Events
We are excited to announce this year’s affinity events co-located with NeurIPS. At NeurIPS, affinity groups play a crucial role in promoting and supporting the ideas and voices of various communities that are defined by some axis of joint identity and raise awareness of issues that affect their members. In addition, they provide members of these affinity groups with increased opportunities to showcase their work, engage in discussions, and build connections during NeurIPS events, promoting diversity and inclusion at NeurIPS. For more information, please visit the Affinity Events Blog Post.
5) Bridging the Future
In this event, we will cover recent activities toward broadening participation in Artificial Intelligence and Machine Learning. NeurIPS has recently provided support to several groups active in this space, and in this event, they will describe their efforts and results. The event will take place on Thursday, December 12 at 7:30 pm in room East MR 18. Join us to learn about their ongoing projects to better support their communities and the world. Light snacks and drinks will be served. See details here.
6) NeurIPS Town Hall Details
NeurIPS invites all attendees to our annual Town Hall, which will occur in person at the conference on Friday, December 13th at 7 PM this year. The NeurIPS Town Hall provides community members with an opportunity to hear updates and ask questions about the conference. The town hall lasts for an hour, with the first 30 minutes dedicated to presentations from various chairs and the last 30 minutes dedicated to an open Q&A from the community.
7) Announcing the NeurIPS High School Projects Results
We are thrilled to announce the results of the first call for NeurIPS High School Projects. With a theme of machine learning for social impact, this track was launched to get the next generation excited and thinking about how ML can benefit society, to encourage those already pursuing research on this topic, and to amplify that work through interaction with the NeurIPS community.
In total, we received 330 project submissions from high schoolers around the globe. Among those, 21 projects were chosen to be spotlighted and 4 were chosen as award winners. We congratulate all of the students and encourage community members to attend the joint poster session on Tuesday, December 10, where representatives of the four award-winning projects will present their work. For the details refer to this blog post.
8) Poster Printing Service
Read the NeurIPS poster printing information page for additional insight and information about templates, virtual poster and paper thumbnails, poster sizes, and printing.
You can use any service you want to print your poster. We offer an optional poster printing service.
9) Childcare and Other Amenities
NeurIPS is proud to provide free on-site child care. The deadline for registration has passed and it is now at capacity. We invite everyone to review the Child attendance policy here. We kindly request that all attendees adhere to our child attendance policy by refraining from bringing children under the age of 14 to the venue unless they are registered in the childcare program.
Other amenities will include a nursing room and first aid.
10) Reminder About the Conference Dates
As already announced multiple times, the conference start date has been changed to Tuesday, December 10 in order to support delegates arriving on Monday for the Tuesday morning sessions. Registration will now open on Monday from 1 pm to 6 pm. Tutorials remain on Tuesday as originally scheduled. We encourage all attendees to verify dates directly on our website (https://neurips.cc/Conferences/2024) to avoid confusion.
11) NeurIPS EXPO
This year’s NeurIPS features an exciting lineup of 56 Expo events across December 10th-11th. This includes 18 interactive demonstrations on Tuesday, December 10th, along with 26 talk panels and 12 workshops spread across both days. Grab a coffee or box lunch and engage in an EXPO talk or workshop. Demos will be available on the exhibit sponsor hall floor.
Thanks and looking forward to seeing you at the conference.
By Yixuan Even Xu, Fei Fang, Jakub Tomczak, Cheng Zhang, Zhenyu Sherry Xue, Ulrich Paquet, Danielle Belgrave
Overview
Paper assignment is crucial in conference peer review as we need to ensure that papers receive high-quality reviews and reviewers are assigned papers that they are willing and able to review. Moreover, it is essential that a paper matching process mitigates potential malicious behavior. The default paper assignment approach used in previous years of NeurIPS is to find a deterministic maximum-quality assignment using linear programming. This year, for NeurIPS 2024, as a collaborative effort between the organizing committee and researchers from Carnegie Mellon University, we experimented with a new assignment algorithm [1] that introduces randomness to improve robustness against potential malicious behavior, as well as enhance reviewer diversity and anonymity, while maintaining most of the assignment quality.
TLDR
How did the algorithm do compared to the default assignment algorithm? We compare the randomized assignment computed by the new algorithm to the one computed by the default algorithm. We measure various randomness metrics [1], including the maximum assignment probability among all paper-reviewer pairs, the average (over papers) of the maximum assignment probability to any reviewer, the L2 norm, the entropy, and the support size, i.e., the number of paper-reviewer pairs that could be assigned with non-zero probability. As expected, the randomized algorithm was able to introduce a good amount of randomness while keeping the overall assignment quality close to that of the default assignment. We show the computed metrics in the table below. Here, ↑ indicates that a higher value is better, and ↓ indicates that a lower value is better.
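As a rough illustration, the sketch below computes such randomness metrics from a matrix of marginal assignment probabilities; the exact definitions used for the table follow [1], and this toy version only approximates them:

```python
import numpy as np

# Sketch: randomness metrics of a marginal assignment matrix x (papers x reviewers),
# where x[p, r] is the probability that paper p is assigned to reviewer r.
# These definitions follow the descriptions in [1] only loosely.

def randomness_metrics(x: np.ndarray) -> dict:
    nonzero = x[x > 0]
    return {
        "max_probability": float(x.max()),                        # max over all paper-reviewer pairs
        "avg_max_probability_per_paper": float(x.max(axis=1).mean()),
        "l2_norm": float(np.linalg.norm(x)),                      # Frobenius norm of the matrix
        "entropy": float(-(nonzero * np.log(nonzero)).sum()),
        "support_size": int((x > 0).sum()),                       # pairs with non-zero probability
    }

# Toy example: 3 papers, 4 reviewers, each paper assigned to 2 reviewers in expectation.
x = np.array([[0.5, 0.5, 0.5, 0.5],
              [1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5]])
print(randomness_metrics(x))
```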
Also, one key takeaway from our analysis is that it is important for all reviewers to complete their OpenReview profile and bid actively in order to get high-quality assignments. In fact, among all reviewers who bid “High” or “Very High” on at least one paper, most were assigned a paper that they bid “High” or “Very High” on.
In the rest of this post, we introduce the details of the algorithm, explain how we implemented it, and analyze the deployed assignment for NeurIPS 2024.
The Algorithm
The assignment algorithm we used is Perturbed Maximization (PM) [1], a work published at NeurIPS 2023. To introduce the algorithm, we first briefly review the problem setting of paper assignment in peer review as well as the default algorithm used in previous years.
Problem Setting and Default Algorithm
In a standard paper assignment setting, a set of papers $\mathcal{P}$ needs to be assigned to a set of reviewers $\mathcal{R}$. To ensure each paper receives enough reviewers and no reviewer is overloaded with papers, each paper in $\mathcal{P}$ should be assigned to $\ell_p$ reviewers and each reviewer in $\mathcal{R}$ should receive no more than $\ell_r$ papers. An assignment is represented as a binary matrix $x \in \{0, 1\}^{|\mathcal{P}| \times |\mathcal{R}|}$, where $x_{p,r} = 1$ indicates that paper $p$ is assigned to reviewer $r$. The main objective of paper assignment is usually to maximize the predicted match quality between reviewers and papers [2]. To characterize the matching quality, a similarity matrix $S \in [0, 1]^{|\mathcal{P}| \times |\mathcal{R}|}$ is commonly used [2-8]. Here, $S_{p,r}$ represents the predicted quality of a review by reviewer $r$ for paper $p$ and is generally computed from various sources [9], e.g., reviewer-selected bids and textual similarity between the paper and the reviewer’s past work [2, 10-13]. Then, the quality of an assignment $x$ can be defined as the total similarity of all assigned paper-reviewer pairs, i.e., $\mathrm{Quality}(x) = \sum_{p \in \mathcal{P}} \sum_{r \in \mathcal{R}} S_{p,r}\, x_{p,r}$. One standard approach for computing a paper assignment is to maximize quality [2, 5-8, 14], i.e., to solve the following optimization problem:

$$\max_{x} \ \sum_{p \in \mathcal{P}} \sum_{r \in \mathcal{R}} S_{p,r}\, x_{p,r} \quad \text{s.t.} \quad \sum_{r \in \mathcal{R}} x_{p,r} = \ell_p \ \ \forall p \in \mathcal{P}, \qquad \sum_{p \in \mathcal{P}} x_{p,r} \le \ell_r \ \ \forall r \in \mathcal{R}, \qquad x_{p,r} \in \{0, 1\}.$$
The optimization above can be solved efficiently by linear programming and is widely used in practice. In fact, the default automatic assignment algorithm used in OpenReview is also based on this linear programming formulation and has been used for NeurIPS in past years.
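For intuition, here is a minimal sketch of this maximum-quality linear program on a toy instance using scipy; it is a sketch for illustration, not the OpenReview implementation. Because the constraint matrix is totally unimodular, the LP relaxation returns an integral assignment:

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of the default maximum-quality assignment as a linear program, assuming a
# small similarity matrix S (papers x reviewers), each paper needing l_p reviewers
# and each reviewer taking at most l_r papers.

def max_quality_assignment(S: np.ndarray, l_p: int, l_r: int) -> np.ndarray:
    n_papers, n_reviewers = S.shape
    n_vars = n_papers * n_reviewers   # variables x[p, r], flattened row-major

    # Each paper gets exactly l_p reviewers.
    A_eq = np.zeros((n_papers, n_vars))
    for p in range(n_papers):
        A_eq[p, p * n_reviewers:(p + 1) * n_reviewers] = 1.0
    b_eq = np.full(n_papers, l_p)

    # Each reviewer gets at most l_r papers.
    A_ub = np.zeros((n_reviewers, n_vars))
    for r in range(n_reviewers):
        A_ub[r, r::n_reviewers] = 1.0
    b_ub = np.full(n_reviewers, l_r)

    # Maximize total similarity = minimize its negation.
    res = linprog(c=-S.flatten(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    # round() guards against small numerical error; the optimum itself is integral.
    return res.x.reshape(n_papers, n_reviewers).round()

S = np.array([[0.9, 0.2, 0.4],
              [0.1, 0.8, 0.7],
              [0.6, 0.5, 0.3]])
print(max_quality_assignment(S, l_p=2, l_r=2))
```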
Perturbed Maximization
While the deterministic maximum-quality assignment is the most common, there are strong reasons [1] to introduce randomness into paper assignment, i.e., to determine a probability distribution over feasible deterministic assignments and sample one assignment from the distribution. For example, one important reason is that randomization can help mitigate potential malicious behavior in the paper assignment process. Several computer science conferences have uncovered “collusion rings” of reviewers and authors [15-16], in which reviewers aim to get assigned to the authors’ papers in order to give them good reviews without considering their merits. Randomization can help break such collusion rings by making it harder for the colluding reviewers to get assigned to the papers they want. The randomness will also naturally increase reviewer diversity and enhance reviewer anonymity.
Perturbed Maximization (PM) [1] is a simple and effective algorithm that introduces randomness into paper assignment. Mathematically, PM solves a perturbed version of the optimization problem above, parameterized by a number $Q \in (0, 1]$ and a perturbation function $f$:

$$\max_{x} \ \sum_{p \in \mathcal{P}} \sum_{r \in \mathcal{R}} S_{p,r}\, f(x_{p,r}) \quad \text{s.t.} \quad \sum_{r \in \mathcal{R}} x_{p,r} = \ell_p \ \ \forall p \in \mathcal{P}, \qquad \sum_{p \in \mathcal{P}} x_{p,r} \le \ell_r \ \ \forall r \in \mathcal{R}, \qquad 0 \le x_{p,r} \le Q.$$
In this perturbed optimization, the variables are no longer binary but continuous in $[0, 1]$. This is because we changed the meaning of $x_{p,r}$ in the randomized assignment context: $x_{p,r}$ now represents the marginal probability that paper $p$ is assigned to reviewer $r$. By constraining $x_{p,r}$ to be in $[0, Q]$, we ensure that each paper-reviewer pair has a probability of at most $Q$ of being assigned to each other. This constraint is adopted from an earlier work on randomized paper assignment [17]. The perturbation function $f$ is a concave function used to penalize high values of $x_{p,r}$, so that the probability mass is spread more evenly among the paper-reviewer pairs. The perturbation function can be chosen in various ways; one simple option is $f(x) = x - \beta x^2$ with $\beta > 0$, which makes the optimization concave and quadratic, allowing us to solve it efficiently.
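To make the formulation concrete, here is a toy sketch of PM with the quadratic perturbation using cvxpy; the values of Q and beta below are illustrative placeholders, not the parameters used for NeurIPS 2024, and this is not the deployed implementation:

```python
import cvxpy as cp
import numpy as np

# Sketch of Perturbed Maximization with a quadratic perturbation f(x) = x - beta * x^2,
# under the marginal-probability cap Q. Assumes S >= 0 so the objective is concave (DCP).

def perturbed_maximization(S: np.ndarray, l_p: int, l_r: int,
                           Q: float = 0.8, beta: float = 0.5) -> np.ndarray:
    n_papers, n_reviewers = S.shape
    x = cp.Variable((n_papers, n_reviewers))
    objective = cp.Maximize(cp.sum(cp.multiply(S, x - beta * cp.square(x))))
    constraints = [
        cp.sum(x, axis=1) == l_p,   # each paper assigned to l_p reviewers in expectation
        cp.sum(x, axis=0) <= l_r,   # each reviewer receives at most l_r papers in expectation
        x >= 0,
        x <= Q,                     # cap on each pair's assignment probability
    ]
    cp.Problem(objective, constraints).solve()
    return x.value                  # marginal probabilities; a deterministic assignment is sampled afterwards

S = np.array([[0.9, 0.2, 0.4],
              [0.1, 0.8, 0.7],
              [0.6, 0.5, 0.3]])
print(perturbed_maximization(S, l_p=2, l_r=2))
```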
After solving the optimization problem above, we obtain a probabilistic assignment matrix $x$. To get the final assignment, we then sample a deterministic assignment according to the method in [17-18] so that it matches the marginal probabilities $x_{p,r}$. The method is based on the Birkhoff-von Neumann theorem.
Implementation
Since this is the first time we use a randomized algorithm for paper assignment at NeurIPS, the organizing committee decided to set the parameters so that the produced assignment is close in quality to the maximum-quality assignment, while introducing a moderate amount of randomness. Moreover, we introduced additional constraints to ensure that the randomization does not result in many low-quality assignments.
Similarity Computation of NeurIPS 2024
The similarity matrix of NeurIPS 2024 was computed from two sources: affinity scores and bids. The affinity scores were computed using a text similarity model comparing each paper’s text with the reviewer’s past work on OpenReview, and the resulting scores were normalized to a fixed range. The bids were collected from reviewers during the bidding phase, where reviewers could bid on papers they were interested in reviewing at five levels: “Very High”, “High”, “Neutral”, “Low”, and “Very Low”. Each bid level was mapped to a corresponding numerical value, with higher bids mapped to larger values. The final similarity matrix was computed as the sum of the normalized affinity scores and the mapped bid values.
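A minimal sketch of this combination is shown below; the specific bid-to-number mapping is a placeholder assumption for illustration, not the actual values used for NeurIPS 2024:

```python
import numpy as np

# Sketch of combining normalized affinity scores with mapped bids.
# The bid-to-number mapping below is illustrative only.

BID_VALUES = {"Very High": 1.0, "High": 0.5, "Neutral": 0.0, "Low": -0.5, "Very Low": -1.0}

def similarity(affinity: np.ndarray, bids: list[list[str]]) -> np.ndarray:
    """affinity: papers x reviewers array, already normalized; bids: same shape, bid labels."""
    # Unknown or missing bid labels default to the "Neutral" value.
    mapped = np.array([[BID_VALUES.get(b, 0.0) for b in row] for row in bids])
    return affinity + mapped

aff = np.array([[0.8, 0.1],
                [0.3, 0.6]])
bids = [["Very High", "Neutral"],
        ["Missing", "High"]]
print(similarity(aff, bids))
```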
Additional Constraints for Restricting Low-Quality Assignments
One of the main concerns for paper assignment in large conferences like NeurIPS is the occurrence of low-quality assignments, because the quality of each individual paper-reviewer match matters greatly to the papers and reviewers involved. To mitigate this issue, we explicitly restrict the number of low-quality assignments. Specifically, we first solve another optimization problem without the perturbation function [17]:

$$\max_{x} \ \sum_{p \in \mathcal{P}} \sum_{r \in \mathcal{R}} S_{p,r}\, x_{p,r} \quad \text{s.t.} \quad \sum_{r \in \mathcal{R}} x_{p,r} = \ell_p \ \ \forall p \in \mathcal{P}, \qquad \sum_{p \in \mathcal{P}} x_{p,r} \le \ell_r \ \ \forall r \in \mathcal{R}, \qquad 0 \le x_{p,r} \le Q.$$
Let the optimal solution of this problem be $x^*$. We want to ensure that adding the perturbation function does not introduce additional low-quality assignments compared to $x^*$. To achieve this, we set a number of quality thresholds. For each threshold $t$, we add a constraint that our perturbed assignment $x$ should contain at least as many assignments with quality above $t$ as $x^*$ does, i.e.,

$$\sum_{(p, r):\, S_{p,r} \ge t} x_{p,r} \ \ge \ \sum_{(p, r):\, S_{p,r} \ge t} x^*_{p,r}.$$
The thresholds were chosen to distinguish between different levels of match quality. According to the similarity computation for NeurIPS 2024, matchings above the highest threshold are “good” ones, in which either the reviewer has a high affinity score with the paper and bids positively on it, or the reviewer bids “Very High” on the paper; matchings above the middle threshold are “moderate” ones, in which either the reviewer has a high affinity score with the paper or the reviewer bids positively on it; and matchings above the lowest threshold are “acceptable” ones, in which the reviewer has a moderate affinity score with the paper and bids neutrally on it. By setting these thresholds, we limited the number of low-quality assignments introduced by the perturbation function.
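In a cvxpy-style formulation such as the earlier sketch, these constraints can be expressed with indicator masks over the similarity matrix; the threshold values are placeholders here, not the actual NeurIPS 2024 thresholds:

```python
import cvxpy as cp
import numpy as np

# Sketch: low-quality-restriction constraints added to the perturbed problem.
# x is the cvxpy variable from the PM sketch, S the similarity matrix, and
# x_star the (numpy) optimal solution of the unperturbed randomized problem.

def quality_floor_constraints(x, S: np.ndarray, x_star: np.ndarray, thresholds) -> list:
    constraints = []
    for t in thresholds:
        mask = (S >= t).astype(float)
        # At least as many assignments with quality >= t as in x_star (in expectation).
        constraints.append(cp.sum(cp.multiply(mask, x)) >= float((mask * x_star).sum()))
    return constraints
```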
Running the Algorithm
We integrated the Python implementation of PM into the OpenReview system, using Gurobi [19] as the solver for the concave optimization. However, because the number of papers and reviewers at NeurIPS 2024 was too large, we could not directly use OpenReview’s computing resources to solve the assignment in early 2024. Instead, we ran the algorithm on a local server with anonymized data. The assignment was then uploaded to OpenReview for further processing, such as manual adjustments by the program committee. We ran four different parameter settings of PM and sampled three assignments from each setting. Each parameter setting took around 4 hours to run on a server with 112 cores, with peak memory usage of around 350 GB. The final deployed assignment was chosen by the program committee based on the computed statistics of the assignments, corresponding to one particular choice of $Q$ and the perturbation parameter $\beta$, together with a cap on the maximum number of papers each reviewer could review.
Analysis of The Deployed Assignment
How did the assignment turn out? We analyzed various statistics of the assignment, including the aggregate scores, affinity scores, reviewer bids, reviewer load, and reviewer confidence in the review process. We also compared the statistics across different subject areas. Here are the results.
Aggregate Scores
The deployed assignment achieved high aggregate scores, both on average and at the median. Recalling the computation of the similarity matrix, this means the majority of the assignments are of very high quality, with high affinity scores and “Very High” bids. Additionally, we note that every single matched paper-reviewer pair has an aggregate score above the “moderate” threshold, which means that each assigned pair is at least a moderate match. In addition, we see no statistical difference in aggregate scores across different subject areas, despite their varying sizes.
Affinity Scores
Since the aggregate score is the sum of the text-similarity-based affinity score and the converted reviewer bid, we also checked the distribution of these two components. The deployed assignment achieved high affinity scores, both on average and at the median. Note that there are also some matched pairs with zero affinity scores. These pairs are matched because the reviewers bid “Very High” on the papers, which still yields a high aggregate score; we therefore prioritize these pairs over those with positive affinity scores but neutral or negative bids.
Reviewer Bids
For reviewer bids, we see that most of the assigned pairs have “Very High” bids from the reviewers, with the majority of the rest having “High” bids. Moreover, not a single pair has a negative bid. This indicates that reviewers are generally interested in the papers they are assigned to. Note that although we default missing bids to “Neutral”, the number of matched pairs with “Missing” bids is larger than that of pairs with “Neutral” bids. This is because if a reviewer submitted their bids, they are most likely assigned to the papers they bid positively on. The matched pairs with “Missing” bids are usually those where reviewers did not submit their bids, and the assignment for them was purely based on the affinity scores.
Reviewer Load
If the reviewer load were distributed evenly, the average number of papers per reviewer would be below the per-reviewer limit. However, since the assignment algorithm aims for high-quality assignments, the majority of reviewers were assigned the maximum number of papers we allowed per reviewer.
Nevertheless, some reviewers in the pool are not assigned to any papers or are assigned to only one paper. After analyzing the data more carefully, we found that most of these reviewers either had no known affinity scores with the papers (mostly because they did not have any past work on OpenReview) or did not submit their bids. Moreover, there are even reviewers who had neither affinity scores nor bids. Therefore, it is hard for the algorithm to find good matches for them.
We suggest that reviewers submit their bids and provide more information about their past work to help the algorithm find better matches for them.
While the reviewer load distribution for each subject area generally follows the overall distribution, we note that some subject areas, like Bandits, have a notably higher number of papers assigned to each reviewer. In fact, most reviewers in the Bandits area were assigned the maximum number of papers allowed. This indicates that for these areas, we will need to work harder to recruit more reviewers in future conferences.
Reviewer Confidence
In the review process, reviewers were asked to provide their confidence in their reviews on a numerical scale. The distribution of reviewer confidence is shown below; the two special values in the plot mark pairs that were adjusted manually by the area chairs and pairs for which the reviewer did not submit a review. We can see that, among the pairs where the reviewer completed the review, most matched pairs have high confidence. This indicates that reviewers are generally confident in their reviews.
On a side note, we found that reviewer confidence is generally lower for theoretical areas like Algorithmic Game Theory, Bandits, Causal Inference, and Learning Theory, while it is higher for other areas. It is hard to explain this phenomenon exactly, but we think this might be because the difficulty of reviewing papers in theoretical areas is generally higher, leading reviewers to be more cautious in their reviews.
Comparison with the Default Algorithm
Besides analyzing the deployed assignment, it is also natural to ask how the new algorithm PM compares to the default algorithm used in OpenReview. To answer this question, we ran the default algorithm on the same data and compared the resulting assignment with the deployed assignment. Below, we show the comparison with the default algorithm in aggregate scores, reviewer bids, and reviewer load.
Aggregate Scores
In terms of aggregate scores, PM achieved an average only slightly below that of the default algorithm. Note that the default algorithm is optimal in quality, so any other algorithm will have a lower quality, and this small difference is expected.
Reviewer Bids
How do the sampled assignments resulting from the new algorithm differ from the default one? Here we show the distribution of reviewer bids in the default assignment, the overlap between the optimal deterministic assignment and the deployed assignment, and the overlap between the optimal deterministic assignment and three sampled assignments from PM. As seen in the following figure, a non-negligible number of matched pairs have changed from the default assignment to the deployed assignment, and over half of the matched pairs would be different in three samples from PM. This indicates that PM introduces a good amount of randomness into the assignment, increasing robustness against malicious behavior while incurring only a small loss in matching quality.
Reviewer Load
Another side benefit of PM is that it can help distribute the reviewer load more evenly. In the following figure, we show the distribution of reviewer load in the optimal deterministic assignment and in the deployed assignment. We can see that the numbers of reviewers at the extremes of the load distribution are reduced in the deployed assignment compared to the optimal one. To ensure an even more balanced reviewer load, additional constraints on the minimum number of papers per reviewer could be added in the future.
Conclusion
In this post, we introduced the paper assignment algorithm used for NeurIPS 2024 and explained how we implemented it. We analyzed the results of the assignment and compared it with the default algorithm used in OpenReview. We found that the assignment produced by the new algorithm achieved high-quality matches, with a good amount of randomness introduced into the assignment, increasing robustness against malicious behavior as well as enhancing reviewer diversity and anonymity. In future conferences, we suggest that reviewers submit their bids and provide more information about their past work to help the algorithm find better matches for them.
References
[1] Xu, Yixuan Even, Steven Jecmen, Zimeng Song, and Fei Fang. “A One-Size-Fits-All Approach to Improving Randomness in Paper Assignment.” Advances in Neural Information Processing Systems 36 (2024).
[2] Charlin, Laurent, and Richard Zemel. “The Toronto paper matching system: an automated paper-reviewer assignment system.” (2013).
[3] Stelmakh, Ivan, Nihar Shah, and Aarti Singh. “PeerReview4All: Fair and accurate reviewer assignment in peer review.” Journal of Machine Learning Research 22.163 (2021): 1-66.
[4] Jecmen, Steven, Hanrui Zhang, Ryan Liu, Fei Fang, Vincent Conitzer, Nihar B. Shah. “Near-optimal reviewer splitting in two-phase paper reviewing and conference experiment design.” Proceedings of the AAAI Conference on Human Computation and Crowdsourcing. Vol. 10. 2022.
[5] Tang, Wenbin, Jie Tang, and Chenhao Tan. “Expertise matching via constraint-based optimization.” 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Vol. 1. IEEE, 2010.
[6] Flach, Peter A., Sebastian Spiegler, Bruno Golenia, Simon Price, John Guiver, Ralf Herbrich, Thore Graepel and Mohammed J. Zaki. “Novel tools to streamline the conference review process: Experiences from SIGKDD’09.” ACM SIGKDD Explorations Newsletter 11.2 (2010): 63-67.
[7] Taylor, Camillo J. “On the optimal assignment of conference papers to reviewers.” University of Pennsylvania Department of Computer and Information Science Technical Report 1.1 (2008): 3-1.
[8] Charlin, Laurent, Richard S. Zemel, and Craig Boutilier. “A Framework for Optimizing Paper Matching.” UAI. Vol. 11. 2011.
[9] Shah, Nihar B. “Challenges, experiments, and computational solutions in peer review.” Communications of the ACM 65.6 (2022): 76-87.
[10] Mimno, David, and Andrew McCallum. “Expertise modeling for matching papers with reviewers.” Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. 2007.
[11] Liu, Xiang, Torsten Suel, and Nasir Memon. “A robust model for paper reviewer assignment.” Proceedings of the 8th ACM Conference on Recommender systems. 2014.
[12] Rodriguez, Marko A., and Johan Bollen. “An algorithm to determine peer-reviewers.” Proceedings of the 17th ACM conference on Information and knowledge management. 2008.
[13] Tran, Hong Diep, Guillaume Cabanac, and Gilles Hubert. “Expert suggestion for conference program committees.” 2017 11th International Conference on Research Challenges in Information Science (RCIS). IEEE, 2017.
[14] Goldsmith, Judy, and Robert H. Sloan. “The AI conference paper assignment problem.” Proc. AAAI Workshop on Preference Handling for Artificial Intelligence, Vancouver. 2007.
[15] Vijaykumar, T. N. “Potential organized fraud in ACM/IEEE computer architecture conferences.” Medium, 2020. https://medium.com/@tnvijayk/potential-organized-fraud-in-acm-ieee-computer-architecture-conferences-ccd61169370d
[16] Littman, Michael L. “Collusion rings threaten the integrity of computer science research.” Communications of the ACM 64.6 (2021): 43-44.
[17] Jecmen, Steven, Hanrui Zhang, Ryan Liu, Nihar B. Shah, Vincent Conitzer and Fei Fang. “Mitigating manipulation in peer review via randomized reviewer assignments.” Advances in Neural Information Processing Systems 33 (2020): 12533-12545.
[18] Budish, Eric, Yeon-Koo Che, Fuhito Kojima and Paul Milgrom. “Implementing random assignments: A generalization of the Birkhoff-von Neumann theorem.” Cowles Summer Conference. Vol. 2. No. 2.1. 2009.
[19] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023.