Benchmarking is Broken -- Don't Let AI be its Own Judge
Cheng, Wohnig, Gupta et al.
The meteoric rise of AI, with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today's AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench (with its prototype implementation at https://www.peerbench.ai/), a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.
Title: Benchmarking is Broken -- Don't Let AI be its Own Judge
Authors: Zerui Cheng, Stella Wohnig, Ruchika Gupta, Samiul Alam, Tassallah Abdullahi, João Alves Ribeiro, Christian Nielsen-Garcia, Saif Mir, Siran Li, Jason Orender, Seyed Ali Bahrainian, Daniel Kirste, Aaron Gokaslan, Mikołaj Glinka, Carsten Eickhoff, Ruben Wolff
Classification: cs.AI cs.LG
Publication Venue/Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
With AI technology advancing rapidly and market valuations accelerating, AI evaluation faces critical challenges. Current benchmarking practices expose serious vulnerabilities: data contamination and selective reporting by model developers fuel hype, while insufficient data quality control can produce biased assessments. As new participants flood into the AI field, this "Wild West" of evaluation makes it exceptionally difficult to distinguish genuine progress from inflated claims. The paper argues that the current laissez-faire approach is unsustainable and that genuine AI progress requires a unified, live, quality-controlled benchmarking framework. To this end, it dissects systemic flaws in current AI evaluation, distills the essential requirements for next-generation assessment, and introduces PeerBench, a community-governed, proctored evaluation blueprint.
Systematic Critique: Comprehensive analysis of structural deficiencies in current benchmarking, including contamination, fragmentation, and monopolization issues
Position Statement: Proposes repositioning AI evaluation as secure, standardized examinations, with design principles balancing openness and rigor
Prototype Architecture: Designs the PeerBench system, including a concrete ten-step workflow, cryptographic signature artifacts, lightweight reputation mechanisms, and score normalization methods
Practical Implementation: Provides a prototype implementation of PeerBench (https://peerbench.ai), demonstrating concept feasibility
T1. Test Submission and Commitment: A contributor submits a test T^(c) and a scoring function F^(c); the system records the binding commitment h = Com(T^(c), F^(c))
T2. Model Evaluation: The server immediately schedules queries against all currently registered models
T3. Review Process: The submission is randomly assigned to reviewers and requires at least three valid reviews
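To make steps T1 and T3 concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the PeerBench implementation: the summary does not say how Com is instantiated, so a salted SHA-256 hash commitment is assumed here, and the names `commit`, `verify`, `assign_reviewers`, and `has_quorum` are hypothetical. Excluding the contributor from reviewing their own test and modelling a "valid" review as a boolean flag are also assumptions.

```python
import hashlib
import hmac
import random
import secrets
from typing import Dict, List, Optional, Tuple


def commit(test: bytes, scorer: bytes, nonce: Optional[bytes] = None) -> Tuple[str, bytes]:
    """Return (h, nonce) for h = Com(T, F), assuming a salted SHA-256 commitment."""
    if nonce is None:
        nonce = secrets.token_bytes(32)  # random salt keeps the commitment hiding
    digest = hashlib.sha256()
    # Length-prefix each field so (b"ab", b"c") and (b"a", b"bc") cannot collide.
    for part in (nonce, test, scorer):
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest(), nonce


def verify(h: str, test: bytes, scorer: bytes, nonce: bytes) -> bool:
    """Check that a later-revealed (test, scorer, nonce) matches the recorded h."""
    recomputed, _ = commit(test, scorer, nonce)
    return hmac.compare_digest(recomputed, h)


def assign_reviewers(reviewers: List[str], author: str, k: int = 3,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Pick k distinct reviewers at random, excluding the author (an assumption)."""
    rng = rng or random.Random()
    eligible = [r for r in reviewers if r != author]
    if len(eligible) < k:
        raise ValueError("not enough eligible reviewers")
    return rng.sample(eligible, k)


def has_quorum(reviews: Dict[str, bool], minimum: int = 3) -> bool:
    """True once at least `minimum` reviews are flagged valid (three, per T3)."""
    return sum(1 for valid in reviews.values() if valid) >= minimum


if __name__ == "__main__":
    h, nonce = commit(b"test items", b"scoring function source")
    print("commitment matches:", verify(h, b"test items", b"scoring function source", nonce))
    print("assigned:", assign_reviewers(["ann", "bo", "cy", "di", "ed"], author="ann"))
```

The commitment lets the system publish h at submission time and reveal the test later, so anyone can confirm the test and scoring function were not altered after models were evaluated against them.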
The paper cites 56 references spanning AI evaluation, benchmarking, data contamination, and reputation systems, which ground its positions.
Overall Assessment: This is a significant position paper that both analyzes the problems of current AI evaluation in depth and proposes a concrete solution. The PeerBench design reflects careful thinking about the future of AI evaluation, and its prototype implementation demonstrates that the concept is feasible. Challenges remain for large-scale deployment, but the paper gives the field a clear direction.