Reflecting on the 2025 Review Process from the Datasets and Benchmarks Chairs
Since its introduction in 2021, the Datasets and Benchmarks (DB) track has played a pivotal role in raising the profile of datasets and benchmarks, which are foundational to machine learning, within the NeurIPS community. As the track continues to grow and refine its processes from year to year, we, the DB Chairs, decided to write this blog post to make this year’s process transparent. This post focuses specifically on the DB track, but the Program Committee chairs have also written a blog post, as have the Position Track chairs.
Raising the standards for dataset and benchmark submissions
As the NeurIPS Datasets and Benchmarks (DB) track continues to mature, its growth in submissions is beginning to stabilize, as is its standing within the community. After three years of exponential growth, the track received 1,995 submissions this year. While this is a large number, the increase from the 1,820 submissions received last year is smaller than in previous years (for comparison, there were 987 submissions in 2023), suggesting that the track has begun to stabilize. In light of this maturity, we took significant steps, for the second consecutive year, to align the track’s standards and processes with those of the main track. This strategic alignment aims to ensure that papers involving datasets and benchmarks are held to the same rigorous criteria as main track papers and uphold the reputation for quality that NeurIPS proceedings papers have. This is particularly relevant for papers that blur the lines across tracks (most frequently benchmarks and evaluations).

A first step toward alignment with the main track came in 2024, when the DB track saw an acceptance rate of 25.3%, closely mirroring the main program’s 25.8%. For 2025, our goal was to align even more closely by working together with the main track on everything from recruiting reviewers and ACs, to adopting the majority of the main track’s reviewing processes, to jointly implementing the responsible reviewing initiative. This initiative is designed to safeguard the quality of reviews, addressing issues of late or low-quality feedback that can hinder the peer-review process. All of this was aimed at ensuring a consistent and high level of scrutiny for all submissions.
Building on the lessons learned from previous years, we also focused on streamlining the submission and review process for datasets. The objective was to create a more standardized submission process for authors and a more efficient evaluation method for reviewers, and, in doing so, to allow for easier access, comparison, and assessment of datasets. We elaborated on these changes in our blog post titled NeurIPS Datasets & Benchmarks: Raising the Bar for Dataset Submissions. There, we highlighted the evolving nature of AI research, where the distinction between a “dataset paper” and a “main track” paper can be nuanced. To address this, the updated best practices and requirements for 2025 include more stringent criteria for dataset submission, hosting, and reproducibility. These measures are intended to ensure that datasets are not only useful and accessible at the time of publication but also remain so over time, thereby improving their accessibility for the review process.
Many of these enhancements are reflected in the updated Call for Papers for the DB track, which this year directly references the main track’s call for papers while providing specific guidelines for dataset and benchmark submissions. This change sends a clear message about our intent to hold DB track submissions to the well-established standards of the main conference.
DB Track calibration process for ensuring fairness and consistency
A critical component of our commitment to maturing the DB track is the careful calibration of our decision-making process. Like all tracks at NeurIPS, we must ensure that reviewing standards are applied consistently and fairly across all submissions, accounting for the natural variance that exists among reviewers, Area Chairs (ACs), and Senior Area Chairs (SACs). As the Program Committee Chairs have noted for the main track in their blog post, many factors can introduce noise into reviewer scores and feedback. It is our responsibility as the DB track chairs to mitigate this and ensure that every paper receives a fair evaluation.
In line with our primary goal of raising standards and streamlining processes for the DB track, this year we collaborated closely with the PC chairs to mirror many of the main track’s reviewing processes. This marks a significant step forward from previous years, when the DB track largely defined its own standards. While our processes were not identical, as DB papers have unique considerations, we adopted protocols similar to the main track’s for resolving disagreements, such as supporting ACs and SACs throughout the calibration process for final decision making.
This year, we implemented two significant changes to the review process, hand in hand with the main track. First, we introduced a revised scoring system. Second, we rolled out the responsible reviewing initiative, which focused on safeguarding review quality and timeliness.
The new scoring system may have influenced how reviewers assessed submissions. In particular, we have observed two distinct trends in the DB track during this process:
- Increased average scores: Unlike papers in the main track, which one can expect to be more method- and algorithm-oriented, dataset submissions are less likely to be fundamentally “technically incorrect.”
- Subjective nature of contributions: Evaluating the merit of a dataset or benchmark can be highly subjective. A dataset that fills a critical gap for a smaller, long-tail research area might be just as valuable or novel as one that targets a well-established “head” problem.
We think that these factors can lead to reviewer scores that skew higher and cluster more tightly than those in the main track. As a result, it can be difficult for ACs to differentiate clearly between two papers that have the same average score but vastly different merits and trade-offs.
To address this, at the end of this year’s calibration process, which mirrored the main track’s, we implemented an additional step that gave us a more nuanced evaluation signal rather than relying solely on numerical scores: we asked our SACs, based on their discussions with the ACs, to produce a relative ranking of the papers within their stack. The same procedure was used last year; the main difference is that last year we did not have a template for the rankings, and SACs were asked to explain their ranking in a live meeting. To be mindful of the increased effort, this year we structured this interaction using a ranking form. For any paper falling below the track-wide average score of 4.25, SACs were required to provide a detailed description of its merits and motivation. This combination of relative ranking and qualitative justification provided a much richer signal.
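For concreteness, the logic behind the ranking form can be sketched as in the short Python snippet below. This is a simplified illustration rather than the actual form implementation: the data layout, field names, and the build_ranking_form function are hypothetical, and only the 4.25 threshold comes from the process described above.

```python
# Simplified illustration of the ranking-form logic; not the actual form.
# The data layout and function name here are hypothetical.

TRACK_AVERAGE = 4.25  # track-wide average reviewer score this year

def build_ranking_form(sac_stack):
    """Given a SAC's stack of papers (each with an average reviewer score and
    the SAC's relative rank), flag the papers that need a written justification."""
    form = []
    for paper in sorted(sac_stack, key=lambda p: p["sac_rank"]):
        form.append({
            "paper_id": paper["id"],
            "sac_rank": paper["sac_rank"],
            "avg_score": paper["avg_score"],
            # Papers below the track-wide average require a detailed
            # description of their merits and motivation from the SAC.
            "justification_required": paper["avg_score"] < TRACK_AVERAGE,
        })
    return form

# Toy example of a three-paper stack:
stack = [
    {"id": "A", "avg_score": 4.6, "sac_rank": 1},
    {"id": "B", "avg_score": 4.1, "sac_rank": 2},
    {"id": "C", "avg_score": 3.9, "sac_rank": 3},
]
for entry in build_ranking_form(stack):
    print(entry)
```

In practice, the written justification, not the flag itself, is what carried the weight: the form simply made explicit which papers needed that qualitative case from the SAC.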

Provide us with feedback
As with all tracks, community feedback is essential for us to continue improving our processes: please reach out if you have any feedback. This year, we made many changes to improve the quality of datasets and benchmarks papers at NeurIPS, guided by the principle that a NeurIPS DB paper should not just be correct, but should also meet standards for impact and scientific relevance. In future years, we would like to describe these criteria to reviewers, ACs, and SACs more clearly, and we especially invite feedback on this point. In particular, we encourage community members to attend our Town Hall at the conference venue, where they can ask questions and provide comments to the organizers in real time.