Updates on the NeurIPS 2021 Datasets and Benchmarks Track
Joaquin Vanschoren, Maria Xenochristou, and Serena Yeung, Datasets and Benchmarks Chairs
The second round of submissions for the Neural Information Processing Systems (NeurIPS) 2021 Datasets and Benchmarks Track is now open. If you have exciting datasets, benchmarks, or ideas to share, we warmly welcome you to submit your work by August 27th (abstracts are due by August 20th).
As announced in our earlier blog post, NeurIPS launched the new Datasets and Benchmarks Track to serve as a venue for exceptional work in creating high-quality datasets, insightful benchmarks, and discussions on how to improve dataset development and data-oriented work more broadly. Submissions are reviewed through OpenReview to facilitate additional public discussion. For more information on the track and submission instructions, please see https://neurips.cc/Conferences/2021/CallForDatasetsBenchmarks.
The accepted papers for the first round are in, and we are overwhelmed by the number and quality of the submissions we received. We would like to once again thank all the authors who have already submitted their work to this track, as well as the reviewers and area chairs (ACs) who contributed to the success of the first round. Below is an overview of the accepted submissions, organized by topic. We hope it serves as inspiration for further amazing work.
For any questions, ideas, or remarks, please contact us at neurips-2021-datasets-benchmarks@googlegroups.com.
We look forward to receiving your submissions!
Datasets (General)
- RadGraph: Extracting Clinical Entities and Relations from Radiology Reports S Jain et al.
- ATOM3D: Tasks on Molecules in Three Dimensions RJL Townshend et al.
- EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction A Kastrati et al.
- Programming Puzzles T Schuster et al.
- CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer M Sorkhei et al.
- Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development K Huang et al.
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers L Zheng et al.
- The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions JJ Sun et al.
- One Million Scenes for Autonomous Driving: ONCE Dataset J Mao et al.
- The PAIR-R24M Dataset for Multi-animal 3D Pose Estimation JD Marshall et al.
- ARKitScenes – A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data A Dehghan et al.
Datasets (Text, Language and Speech)
- Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management C Logé et al.
- NaturalProofs: Mathematical Theorem Proving in Natural Language S Welleck et al.
- FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information R Aly et al.
- Modeling Worlds in Text P Ammanabrolu and M Riedl
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation S Lu et al.
- A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning GY Kebe et al.
- EventNarrative: A large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation A Colas et al.
- ReaSCAN: Compositional Reasoning in Language Grounding Z Wu et al.
- The People’s Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage D Galvez et al.
- CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription N Pavlichenko et al.
- Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers L Lugosch et al.
- BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling Z Lin et al.
- PROCAT: Product Catalogue Dataset for Implicit Clustering, Permutation Learning and Structure Prediction MM Jurewicz and L Derczynski
Simulation environments and data generators
- ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation C Gan et al.
- An Extensible Benchmark Suite for Learning to Simulate Physical Systems K Otness et al.
- A Procedural World Generation Framework for Systematic Evaluation of Continual Learning T Hess et al.
- The Neural MMO Platform for Massively Multiagent Research J Suarez et al.
- Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks G Papoudakis et al.
- OmniPrint: A Configurable Printed Character Synthesizer H Sun et al.
- Revisiting Time Series Outlier Detection: Definitions and Benchmarks KH Lai et al.
- Brax – A Differentiable Physics Engine for Large Scale Rigid Body Simulation CD Freeman et al.
- Generating Datasets of 3D Garments with Sewing Patterns M Korosteleva and SH Lee
- The Caltech Off-Policy Policy Evaluation Benchmarking Suite C Voloshin et al.
- MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research M Samvelyan et al.
Meta-analysis and AI Fairness
- Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks CG Northcutt et al.
- It’s COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks M Bao et al.
- Teach Me to Explain: A Review of Datasets for Explainable Natural Language Processing S Wiegreffe and A Marasovic
- PASS: An ImageNet replacement for self-supervised pretraining without humans Y Asano et al.
- Benchmarking Bias Mitigation Algorithms in Representation Learning through Fairness Metrics C Reddy et al.
- Addressing “Documentation Debt” in Machine Learning: A Retrospective Datasheet for BookCorpus J Bandy and N Vincent
- ImageNet-21K Pretraining for the Masses T Ridnik et al.
Benchmarks
- CCNLab: A Benchmarking Framework for Computational Cognitive Neuroscience NX Bhattasali et al.
- CommonsenseQA 2.0: Exposing the Limits of AI through Gamification A Talmor et al.
- Which priors matter? Benchmarking models for learning latent dynamics A Botev et al.
- Variance-Aware Machine Translation Test Sets R Zhan et al.
- B-Pref: Benchmarking Preference-Based Reinforcement Learning K Lee et al.
- CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms M Pawelczyk et al.
- Physion: Evaluating Physical Prediction from Vision in Humans and Machines D Bear et al.
- A Unified Few-Shot Classification Benchmark to Compare Transfer and Meta Learning Approaches V Dumoulin et al.
- Automatic Construction of Evaluation Suites for Natural Language Generation Datasets S Mille et al.
- LiRo: Benchmark and leaderboard for Romanian language tasks SD Dumitrescu et al.
- Personalized Benchmarking with the Ludwig Benchmarking Toolkit A Narayan et al.
- Reinforcement Learning Benchmarks for Traffic Signal Control J Ault and G Sharon
- HiRID-ICU-Benchmark — A Comprehensive Machine Learning Benchmark on High-resolution ICU Data H Yèche et al.
- MultiBench: Multiscale Benchmarks for Multimodal Representation Learning PP Liang et al.
- Contemporary Symbolic Regression Methods and their Relative Performance W La Cava et al.
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation L Li et al.
- DABS: a Domain-Agnostic Benchmark for Self-Supervised Learning A Tamkin et al.
- MLPerf Tiny Benchmark C Banbury et al.
- Benchmark for Compositional Text-to-Image Synthesis D H Park et al.
- Towards a robust experimental framework and benchmark for lifelong language learning A Hussain et al.
- MQBench: Towards Reproducible and Deployable Model Quantization Benchmark Y Li et al.