NeurIPS Datasets & Benchmarks Track: From Art to Science in AI Evaluations
This post provides an update on the 2025 Datasets and Benchmarks Track, reflecting on how the new hosting and metadata requirements affected submissions and the review process. We present submission statistics, survey findings from 851 authors and 155 reviewers, and identify areas requiring continued development. This differs from our earlier post on the review process by focusing on empirical outcomes rather than procedural changes.
Background: The D&B Track
The Datasets and Benchmarks Track was established in 2021 to provide a venue for work on datasets, benchmarks and evaluation methodologies that often fell outside traditional algorithmic research papers. The track has experienced consistent growth, roughly doubling submissions annually before reaching 1,820 submissions in 2024 and 1,995 in 2025. This past year, the track operated with 41 senior area chairs, 281 area chairs and 2,680 reviewers.
For 2025, the organizers also implemented two major changes to standardize quality evaluations and transform reproducibility from aspiration to expectation. First, paper submission requirements were aligned with the NeurIPS main track, while retaining dataset-specific elements like optional single-blind submission. Second, the track introduced rigorous requirements for hosting on persistent public repositories with mandatory Croissant metadata. These standards enabled automated checklists and standardized dataset summaries within OpenReview, streamlining the process to reduce reviewer effort and ensure datasets remain verifiable, accessible, and scientifically impactful over time.
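For readers unfamiliar with Croissant, the sketch below shows roughly what a minimal metadata file looks like. The dataset name and URLs are placeholders, and the field selection reflects our reading of the Croissant 1.0 vocabulary rather than an authoritative example; consult the official MLCommons specification for the complete schema.

```python
import json

# Illustrative sketch of a minimal Croissant metadata file (JSON-LD).
# Field names follow the Croissant 1.0 vocabulary as we understand it;
# the dataset name, URLs, and file listing below are placeholders.
croissant_metadata = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-benchmark",  # hypothetical dataset name
    "description": "A small example dataset used for illustration only.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/example-benchmark",  # placeholder URL
    "distribution": [
        {
            "@type": "FileObject",
            "name": "data.csv",
            "contentUrl": "https://example.org/data.csv",
            "encodingFormat": "text/csv",
        }
    ],
}

print(json.dumps(croissant_metadata, indent=2))
```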
Submission Statistics
Dataset Hosting Patterns
When we looked at where authors chose to host their datasets, a clear pattern emerged – over 80% of the accepted papers used a handful of widely adopted platforms: Hugging Face, Kaggle, Dataverse, and OpenML. Another 13% relied on self-hosted or bespoke solutions, with the rest distributed across smaller repositories like Zenodo and the Open Science Framework.
Research Focus Areas
The distribution of accepted papers reflected broader trends in machine learning research. Eighty-four percent of accepted papers introduced new datasets as part of benchmark or evaluation contributions. The track saw alignment with main track trends, particularly increased focus on large language model evaluation, alongside continued activity in AI for science, domain-specific applications and socially beneficial AI.

Figure 1: Overview of papers with author-provided keywords across accepted papers
Metadata Compliance
The majority of accepted papers included the required Croissant metadata, though gaps appeared in initial submissions. Missing fields included license information (11.9 percent), dataset descriptions (4.9 percent) and URLs (3.5 percent); less than one percent failed to include dataset names. Where licensing information was provided, authors predominantly selected open and permissive terms, particularly Creative Commons BY 4.0 and CC0 1.0.
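As a rough illustration of the kind of check behind these numbers, the snippet below flags the same fields that were most often missing. It is our own sketch, not the validation code used by the track, and the file name in the usage note is hypothetical.

```python
import json

# Fields discussed above that were missing in some initial submissions.
REQUIRED_FIELDS = ["name", "description", "license", "url"]

def missing_fields(path: str) -> list[str]:
    """Return the required Croissant fields absent from a metadata file.

    Sketch only: the track's actual validation is more thorough (it also
    covers distributions, record sets, and field-level consistency).
    """
    with open(path, encoding="utf-8") as f:
        metadata = json.load(f)
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]

# Example usage with a hypothetical file name:
# print(missing_fields("croissant_metadata.json"))
```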
Adoption of the Croissant Responsible AI extension remained minimal. This extension captures data collection practices, biases and sensitive content, but few submissions included RAI-compliant documentation.
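To make the RAI extension more concrete, the fragment below sketches how responsible-AI fields sit alongside the core metadata. The `rai:` property names and namespace reflect our reading of the extension and should be verified against the official Croissant RAI documentation before use.

```python
import json

# Sketch of Croissant metadata augmented with Responsible AI (RAI) fields.
# The "rai:" property names and namespace below are our best understanding
# of the RAI extension and are shown for illustration only.
metadata_with_rai = {
    "@context": {
        "@vocab": "https://schema.org/",
        "rai": "http://mlcommons.org/croissant/RAI/",
    },
    "@type": "Dataset",
    "name": "example-benchmark",  # hypothetical dataset
    "rai:dataCollection": "Web pages crawled in 2024 and filtered for quality.",
    "rai:dataBiases": "Overrepresents English-language sources.",
    "rai:dataLimitations": "Not suitable for clinical decision making.",
    "rai:personalSensitiveInformation": "No personally identifiable information retained.",
}

print(json.dumps(metadata_with_rai, indent=2))
```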
Survey Results
Author Experience
After the acceptance notification, we sent an anonymous survey to both authors and reviewers; 851 authors and 155 reviewers responded. In response to questions about the hosting process, 82 percent of authors reported smooth experiences, while 16 percent encountered difficulties. Common issues involved very large datasets (one terabyte or larger), platform rate limits and occasional instability near submission deadlines. Authors noted that automated Croissant generation sometimes failed for complex datasets.
When asked about the review process, 58 percent of authors agreed that the new requirements led to fairer or more thorough reviews. 15 percent reported little effect, and 25 percent indicated review quality needed improvement. Concerns among the latter group included limited reviewer engagement in rebuttals, reliance on AI-generated feedback and emphasis on methodological novelty over real-world impact.
For hosting and metadata, 63 percent of authors rated the requirements as effective or very effective in improving quality standards, with 16 percent neutral.

Figure 2: Responses of authors to the question “Do you think it was effective in improving the review process?”
Reviewer Feedback
Of the 155 reviewers who responded to the anonymous survey, 77 percent reported that datasets were easy to access. Around 10 percent encountered difficulties, most commonly due to missing or broken links and very large files. Eleven percent did not directly inspect datasets and instead based their evaluations solely on the accompanying papers.
Automated metadata reports were first introduced in 2025 with the goal of supporting more consistent and efficient review. In this first year of use, 69 percent of reviewers found them useful or very useful, and 70 percent indicated that the compliance checklist helped them assess submissions more efficiently. Several reviewers recommended that future iterations of the reports be shorter and more focused.

Figure 3: Responses of reviewers to the question “How did the requirement for all datasets to be hosted affect your review process?”
Looking Forward
Identified Challenges and Areas for Development
Several patterns emerged from the 2025 cycle that require continued attention:
Metadata Documentation: While most submissions included the required fields, gaps in the first submission round revealed a learning curve as authors adapt to structured format requirements. Current guidance leaves room for interpretation, particularly regarding licensing documentation and descriptive context in machine-readable form. This could be improved through clearer documentation for metadata submission, and by refining automated validation tools to provide concise, targeted reports to authors and reviewers.
Responsible AI Documentation: Low adoption of RAI metadata indicates a gap between available standards and practical implementation. Authors need clearer instructions for documenting data provenance, biases, limitations and societal impacts. Moreover, platform support for RAI-compliant exports would reduce the documentation burden, and more extensive validation checks at submission time could ensure that the most needed information is provided.
Reviewer Expertise: Submissions increasingly span specialized domains including AI for science, medicine, multimodal data and LLM evaluation. The reviewer pool shows limitations in diversity and domain coverage: each paper requires reviewers with expertise in data-centric machine learning as well as in the relevant domain. Future iterations would benefit from expanding the reviewer pool to include broader domain expertise.
Impact Assessment: Unlike algorithmic work with performance metrics, dataset and benchmark impact depends on enabling future research by broadening applicability, surfacing underexplored problems or challenging dominant evaluation paradigms. The track needs shared, community-aligned frameworks for assessing data coverage, representativeness and innovation. Future iterations could consider requiring a “demonstrated impact” section in papers or in the review form that maps dataset characteristics to evaluation results.
Large Dataset Handling and New Standards: Handling of very large datasets presented challenges, underscoring the need for better platform support. Community feedback highlighted priorities for further development: clearer self-hosting guidelines and platform partnerships for large datasets, streamlined automated metadata reports for reviewer efficiency, and stronger adoption of Responsible AI documentation through improved guidance and platform support.
The track continues to operate in a learning phase as the community establishes norms for data-centric research evaluation. The shift toward standardized hosting platforms, introduction of machine-readable metadata and implementation of automated review tools represent steps in developing infrastructure for reproducible, transparent dataset research.