2025-11-17T00:55:12.821885

Benchmarking is Broken -- Don't Let AI be its Own Judge

Cheng, Wohnig, Gupta et al.
The meteoric rise of AI, with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today's AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench (with its prototype implementation at https://www.peerbench.ai/), a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.

Basic Information

  • Paper ID: 2510.07575
  • Title: Benchmarking is Broken -- Don't Let AI be its Own Judge
  • Authors: Zerui Cheng, Stella Wohnig, Ruchika Gupta, Samiul Alam, Tassallah Abdullahi, João Alves Ribeiro, Christian Nielsen-Garcia, Saif Mir, Siran Li, Jason Orender, Seyed Ali Bahrainian, Daniel Kirste, Aaron Gokaslan, Mikołaj Glinka, Carsten Eickhoff, Ruben Wolff
  • Classification: cs.AI cs.LG
  • Publication Venue/Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
  • Paper Link: https://arxiv.org/abs/2510.07575

Abstract

With AI capabilities advancing rapidly and market valuations accelerating, AI evaluation faces critical challenges. Current benchmarking practices expose serious vulnerabilities: data contamination and selective reporting by model developers fuel hype, while insufficient data quality control can lead to biased assessments. As new participants flood into the field, this "Wild West" of evaluation makes it exceptionally difficult to distinguish genuine progress from inflated claims. The paper argues that the current laissez-faire approach is unsustainable and that genuine AI progress requires a unified, live, quality-controlled benchmarking framework. To this end, it dissects the systemic flaws in current AI evaluation, distills the essential requirements for next-generation assessments, and introduces PeerBench, a community-governed, proctored evaluation blueprint built on sealed execution, item banking with rolling renewal, and delayed transparency.

Research Background and Motivation

Core Problems

This research addresses systemic issues in AI benchmarking:

  1. Data Contamination: Public benchmarks may leak into training sets, resulting in test set memorization and inflated scores
  2. Selective Reporting: Model creators may report results only for favorable task subsets
  3. Evaluation Fragmentation: Lack of unified evaluation standards and interfaces
  4. Absence of Fairness Guarantees: Unlike high-stakes human examinations, AI evaluation lacks proctoring and identity verification

Problem Significance

  • The societal impact of AI technology is increasingly profound, necessitating trustworthy evaluation mechanisms
  • Deficiencies in the current evaluation ecosystem obscure scientific signals and erode public confidence
  • Just as financial markets require trustworthy regulatory institutions, the AI field similarly requires credible evaluation standards

Limitations of Existing Approaches

  1. Static Benchmarks: Benchmarks such as MMLU and GSM8K saturate rapidly and are susceptible to memorization
  2. Dynamic Benchmarks: LiveBench, while continuously updated, relies on a single team and has limited scale
  3. Private Benchmarks: Reduce contamination but lack transparency, with inherent bias risks
  4. Crowdsourced Evaluation: Platforms like Chatbot Arena lack identity verification and are susceptible to manipulation

Core Contributions

  1. Systematic Critique: Comprehensive analysis of structural deficiencies in current benchmarking, including contamination, fragmentation, and monopolization issues
  2. Position Statement: Proposes repositioning AI evaluation as secure, standardized examinations, with design principles balancing openness and rigor
  3. Prototype Architecture: Designs the PeerBench system, including a concrete ten-step workflow, cryptographic signature artifacts, lightweight reputation mechanisms, and score normalization methods
  4. Practical Implementation: Provides a prototype implementation of PeerBench (https://peerbench.ai), demonstrating concept feasibility

Methodology Details

Seven Principles of the New Paradigm

  1. Secret Test Sets: Evaluation items remain undisclosed until runtime
  2. Proctored Execution: Models are evaluated in a uniform sealed sandbox, with all inputs and outputs recorded and cryptographically signed
  3. Community Governance: Multi-stakeholder validator networks enforce rules and governance
  4. Continuous Updates and Vitality: A fixed proportion of questions are retired and replaced in each evaluation round
  5. Auditability and Integrity: Validators pre-commit test and answer hashes before publication
  6. Fair Access: Any legitimate team can submit models, paying only a fee that covers compute costs
  7. Multi-Metric Reporting: Provides domain-specific subscores and percentile rankings
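
Principle 5 (pre-committing test and answer hashes) can be illustrated with a small hash-commitment sketch. The summary does not say which commitment scheme PeerBench uses, so the choice of SHA-256, the random nonce, the length-prefixed encoding, and the function names `commit` and `verify` are all assumptions made for illustration.

```python
import hashlib
import secrets


def commit(test_items: bytes, scoring_fn: bytes, nonce: bytes) -> str:
    """Return a SHA-256 commitment over (nonce, test items, scoring function).

    Assumption: SHA-256 with a secret nonce stands in for the unspecified
    commitment scheme. Each field is length-prefixed so that different
    splits of the same bytes cannot produce the same digest.
    """
    h = hashlib.sha256()
    for field in (nonce, test_items, scoring_fn):
        h.update(len(field).to_bytes(8, "big"))
        h.update(field)
    return h.hexdigest()


def verify(commitment: str, test_items: bytes, scoring_fn: bytes, nonce: bytes) -> bool:
    """Check revealed material against a previously published commitment."""
    return secrets.compare_digest(commitment, commit(test_items, scoring_fn, nonce))


if __name__ == "__main__":
    nonce = secrets.token_bytes(32)          # kept secret until the test is retired
    published = commit(b"Q: 2+2?", b"exact_match", nonce)
    # After retirement the contributor reveals items, scorer, and nonce:
    print(verify(published, b"Q: 2+2?", b"exact_match", nonce))   # matches
    print(verify(published, b"Q: 2+3?", b"exact_match", nonce))   # altered item
```

Publishing only the digest lets anyone later confirm that the retired test is the one that was actually scored, without exposing its contents during the round.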

PeerBench Architecture Design

Participant Roles

  • Data Contributors: Create private test suites and executable scoring functions
  • Reviewers: Assess quality of submitted tests, producing ordinal ratings
  • Model Creators: Expose inference endpoints and register specific streams
  • Coordination Server: Authenticates uploads, manages the active test repository, and schedules peer reviews
  • End Users: Researchers, journalists, and others consulting real-time leaderboards

Three Leaderboard Systems

  1. Data Contributor Leaderboard:
    ContributorScore(c) = Σ quality(T_i^(c)) + bonuses
    
  2. Reviewer Leaderboard:
    ReviewerScore(r) = Pearson({q_r^(i)}, {q^(i)})
    
  3. Model Leaderboard:
    ModelScore(m) = (Σ w(T_i) s_i^(m)) / (Σ w(T_i))
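
The three leaderboard formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the PeerBench implementation: the function names are hypothetical, q^(i) is assumed to be an aggregate quality rating for test i, and returning 0.0 when a correlation is undefined (constant ratings) is an assumption the summary does not address.

```python
from math import sqrt


def contributor_score(qualities, bonuses=0.0):
    """ContributorScore(c): summed quality of the contributor's tests plus bonuses."""
    return sum(qualities) + bonuses


def reviewer_score(reviewer_ratings, reference_ratings):
    """ReviewerScore(r): Pearson correlation between a reviewer's ratings
    and the reference ratings for the same tests."""
    n = len(reviewer_ratings)
    if n != len(reference_ratings) or n < 2:
        raise ValueError("need two equally long rating lists with at least 2 entries")
    mx = sum(reviewer_ratings) / n
    my = sum(reference_ratings) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(reviewer_ratings, reference_ratings))
    vx = sum((x - mx) ** 2 for x in reviewer_ratings)
    vy = sum((y - my) ** 2 for y in reference_ratings)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: undefined correlation is treated as no agreement
    return cov / sqrt(vx * vy)


def model_score(weights, scores):
    """ModelScore(m): weighted mean of per-test scores s_i with weights w(T_i)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("no test has positive weight")
    return sum(w * s for w, s in zip(weights, scores)) / total


if __name__ == "__main__":
    print(model_score([1.0, 3.0], [1.0, 0.0]))      # 0.25
    print(reviewer_score([1, 2, 3], [2, 4, 6]))     # perfectly correlated ratings
```

Because the model score is normalized by the total weight, a zero-weight test has no influence on the ranking.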
    

End-to-End Workflow

Setup Phase

  • Participants register using verifiable credentials
  • Participants generate public-key signing key pairs
  • Contributors and reviewers post collateral bonds

Continuous Evaluation Process

T1. Test Submission and Commitment: Contributors submit test T^(c) and scoring function F^(c); the system records binding commitment h = Com(T^(c), F^(c))

T2. Model Evaluation: Server immediately schedules queries against all currently registered models

T3. Review Process: Each test is randomly assigned to reviewers and requires at least three valid reviews

T4. Weight Calculation:

w(T^(c)) = max{0, 0.7 * quality(T^(c)) + 0.3 * min(2, ρ_c/100)}

T5. Repository Management: New tests enter the active repository; zero-weight tests are prioritized for retirement

T6. Reputation Updates: Reputation of all relevant participants updated after each round
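
Steps T4 and T5 can be sketched as follows. The weight function transcribes the formula in T4 directly. The summary does not define ρ_c, so the parameter `rho_c` is left as an opaque contributor-level quantity, and the helper `retirement_order` is a hypothetical illustration of prioritizing zero-weight tests for retirement.

```python
def item_weight(quality: float, rho_c: float) -> float:
    """w(T) = max{0, 0.7 * quality(T) + 0.3 * min(2, rho_c / 100)}.

    rho_c is taken as given; its meaning is not defined in this summary.
    """
    return max(0.0, 0.7 * quality + 0.3 * min(2.0, rho_c / 100.0))


def retirement_order(weights: dict) -> list:
    """Return test ids sorted so the lowest-weight tests come first.

    Assumption: 'prioritized for retirement' is modelled as a simple
    ascending sort by weight, which places zero-weight tests at the front.
    """
    return sorted(weights, key=lambda test_id: weights[test_id])


if __name__ == "__main__":
    weights = {
        "t1": item_weight(quality=0.5, rho_c=100),
        "t2": item_weight(quality=-1.0, rho_c=0),    # clamped to zero
        "t3": item_weight(quality=1.0, rho_c=500),   # rho term capped at 2
    }
    print(retirement_order(weights))
```

The `min(2, ...)` cap bounds the contribution of the ρ_c term at 0.6, and the outer `max` keeps a poorly rated test from receiving a negative weight.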

Experimental Setup

Temporal Fairness Dilemma

The paper identifies two design choices:

  • Option A: On-Demand Immediate Scoring: Models are scored immediately upon request, maximizing responsiveness
  • Option B: Periodic Synchronized Evaluation: Models register for scheduled evaluation windows, ensuring the strongest form of fairness

PeerBench adopts a hybrid approach supporting both paradigms, prioritizing immediate scoring flexibility in the prototype.

Security and Audit Mechanisms

  • Partial Disclosure: Display small random portions of tests to reviewers in read-only, non-copyable format
  • Complete Publication: Release tests, logs, and model responses after retirement
  • Slashing Mechanisms: Remove participants with reputation below threshold; malicious behavior triggers bond slashing
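
Partial disclosure needs a way to choose which items reviewers see that no single party controls. A minimal sketch, assuming a publicly announced seed (for example, a random beacon value fixed after tests are committed): each item is ranked by a hash of the seed and its id, and the lowest-ranked items are disclosed. The function name `audit_sample`, the seed format, and the 10% default fraction are hypothetical.

```python
import hashlib


def audit_sample(item_ids, public_seed: bytes, fraction: float = 0.1):
    """Deterministically select a subset of items to disclose.

    Anyone holding the same item ids and public seed derives the same
    subset, so the selection can be checked after the fact.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    def rank(item_id: str) -> bytes:
        # Rank depends only on the seed and the id, not on input order.
        return hashlib.sha256(public_seed + b"|" + item_id.encode()).digest()

    unique_ids = set(item_ids)
    k = max(1, round(len(unique_ids) * fraction))
    return sorted(unique_ids, key=rank)[:k]


if __name__ == "__main__":
    ids = [f"item-{i}" for i in range(20)]
    print(audit_sample(ids, b"beacon-round-1", fraction=0.25))
```

Since the seed is unknown when the test is committed, a contributor cannot tailor the committed test so that only well-behaved items are inspected.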

Experimental Results

Prototype Implementation

The paper provides a practical prototype implementation of PeerBench (https://peerbench.ai), demonstrating:

  • Complete workflow implementation
  • Operational reputation system mechanisms
  • Multi-stream evaluation support (mathematics, code generation, translation, etc.)

Validity of Design Choices

The paper addresses common issues through architectural design:

  • Data Contamination and Cherry-Picking: Validators pre-commit to test sets, maintaining privacy until round completion
  • Private Data Cheating: Public random sources determine disclosed queries, preventing validators from anticipating audit items
  • Test Quality: Each test receives multiple independent reviews; data quality determines its weight in final scores
  • Accessibility: Registration for all roles is lightweight, supporting broad participation

Related Work

Static Benchmarks and Leaderboards

  • MMLU, GSM8K, SuperGLUE provide clear progress snapshots but rapidly saturate and leak into training corpora
  • BIG-Bench expands task coverage but tasks become public upon release
  • HELM adds multiple metrics but remains static between publication intervals

Dynamic or Contamination-Resistant Benchmarks

  • LiveBench continuously refreshes tasks but relies on a single centralized team
  • Dynabench explores human-in-the-loop adversarial data collection
  • Adversarial "model-breaking" competitions expose weaknesses but lack systematic score aggregation

Human Preference and Open Evaluation Platforms

  • Chatbot Arena's Elo ladder and OpenAI Evals promote openness
  • HuggingFace Open LLM Leaderboard allows users to upload test scripts
  • However, these platforms are vulnerable to spam, bot voting, and untracked contamination

Conclusions and Discussion

Main Conclusions

  1. Current AI benchmarking exhibits systemic deficiencies that call for a paradigm shift
  2. A proctored evaluation paradigm modeled on human standardized testing is a viable solution
  3. PeerBench demonstrates the practicality of community-governed, contamination-resistant evaluation
  4. Balance between openness and rigor must be achieved

Limitations

  1. Temporal Fairness: Fundamental tension exists between immediate and synchronized evaluation
  2. Implementation Costs: Requires sustained high-quality test creation and infrastructure maintenance
  3. Participation Incentives: Appropriate economic incentives needed to sustain reviewer engagement
  4. Governance Complexity: Multi-stakeholder governance may face coordination challenges

Future Directions

  1. Mechanism Design: Further research into game-theoretic security analysis to strengthen economic and adversarial robustness
  2. Governance Optimization: Improve multi-institutional governance structures and rotating membership systems
  3. Cost Optimization: Explore methods to reduce operational costs, such as containerized inference submissions
  4. Standardization: Promote collaboration with neutral organizations such as NIST or MLCommons

In-Depth Evaluation

Strengths

  1. Accurate Problem Identification: Precisely identifies core issues in the current AI evaluation ecosystem
  2. Innovative Solutions: Proposes a paradigm shift from static leaderboards to proctored examinations
  3. Strong Practicality: Provides concrete implementation prototypes and detailed workflows
  4. Solid Theoretical Foundation: Draws on mature experience from human standardized testing
  5. Community-Oriented: Emphasizes community governance and decentralization, avoiding single points of failure

Weaknesses

  1. Scalability Challenges: Large-scale implementation may face participant coordination and incentive issues
  2. Cold Start Problem: New systems require sufficient initial participants to establish credibility
  3. Incomplete Economic Model: While slashing mechanisms are mentioned, economic incentive details require further refinement
  4. Technical Implementation Complexity: High complexity in implementing technical components such as cryptographic signatures and reputation systems

Impact

  1. Academic Contribution: Provides new theoretical framework and practical direction for AI evaluation research
  2. Industry Impact: May drive establishment of more equitable and trustworthy AI evaluation standards
  3. Policy Significance: Provides technical foundation for AI regulation and standard-setting
  4. Long-Term Value: Establishes blueprint for sustainable AI evaluation ecosystem

Applicable Scenarios

  1. High-Risk AI Application Evaluation: Particularly suitable for AI systems requiring high credibility
  2. Academic Research: Provides fair model comparison platform for research community
  3. Industry Standard Development: Can serve as foundation for industry standard evaluation frameworks
  4. Regulatory Compliance: Provides technical support for regulatory evaluation of AI systems

References

The paper cites 56 relevant references spanning multiple domains including AI evaluation, benchmarking, data contamination, and reputation systems, providing substantial theoretical support for its positions.


Overall Assessment: This position paper pairs a systematic analysis of the failures of current AI evaluation with a concrete proposal. The PeerBench design and its prototype indicate that the concept is feasible, although challenges remain for large-scale deployment, notably participation incentives, governance, and operating costs. The paper sets a clear direction for work on trustworthy AI evaluation.