Benchmarking is Broken -- Don't Let AI be its Own Judge

Cheng, Wohnig, Gupta et al.

The meteoric rise of AI, with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today's AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench (with its prototype implementation at https://www.peerbench.ai/), a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.

Basic Information

Paper ID: 2510.07575
Title: Benchmarking is Broken -- Don't Let AI be its Own Judge
Authors: Zerui Cheng, Stella Wohnig, Ruchika Gupta, Samiul Alam, Tassallah Abdullahi, João Alves Ribeiro, Christian Nielsen-Garcia, Saif Mir, Siran Li, Jason Orender, Seyed Ali Bahrainian, Daniel Kirste, Aaron Gokaslan, Mikołaj Glinka, Carsten Eickhoff, Ruben Wolff
Classification: cs.AI cs.LG
Publication Venue/Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Paper Link: https://arxiv.org/abs/2510.07575

Abstract

As AI technology advances rapidly and market valuations accelerate, AI evaluation faces critical challenges. Current benchmarking practice has serious vulnerabilities: data contamination and selective reporting by model developers fuel hype, and insufficient data quality control can bias assessments. With a flood of new participants entering the field, this "Wild West" of evaluation makes it exceptionally difficult to distinguish genuine progress from inflated claims. The paper argues that the current laissez-faire approach is unsustainable and that genuine AI progress requires a unified, live, quality-controlled benchmarking framework. To this end, it dissects the systemic flaws in current AI evaluation, distills the essential requirements for next-generation assessments, and introduces PeerBench, a community-governed, proctored evaluation blueprint.

Research Background and Motivation

Core Problems

This research addresses systemic issues in AI benchmarking:

Data Contamination: Public benchmarks may leak into training sets, resulting in test set memorization and inflated scores
Selective Reporting: Model creators may report results only for favorable task subsets
Evaluation Fragmentation: Lack of unified evaluation standards and interfaces
Absence of Fairness Guarantees: Unlike high-stakes human examinations, AI evaluation lacks proctoring and identity verification

Problem Significance

The societal impact of AI technology is increasingly profound, necessitating trustworthy evaluation mechanisms
Deficiencies in the current evaluation ecosystem obscure scientific signals and erode public confidence
Just as financial markets require trustworthy regulatory institutions, the AI field similarly requires credible evaluation standards

Limitations of Existing Approaches

Static Benchmarks: Benchmarks such as MMLU and GSM8K saturate rapidly and are susceptible to memorization
Dynamic Benchmarks: LiveBench, while continuously updated, relies on a single team and has limited scale
Private Benchmarks: Reduce contamination but lack transparency, with inherent bias risks
Crowdsourced Evaluation: Platforms like Chatbot Arena lack identity verification and are susceptible to manipulation

Core Contributions

Systematic Critique: Comprehensive analysis of structural deficiencies in current benchmarking, including contamination, fragmentation, and monopolization issues
Position Statement: Proposes repositioning AI evaluation as secure, standardized examinations, with design principles balancing openness and rigor
Prototype Architecture: Designs the PeerBench system, including a concrete ten-step workflow, cryptographically signed artifacts, a lightweight reputation mechanism, and a score normalization method
Practical Implementation: Provides a prototype implementation of PeerBench (https://peerbench.ai), demonstrating concept feasibility

Methodology Details

Seven Principles of the New Paradigm

Secret Test Sets: Evaluation items remain undisclosed until runtime
Sealed Execution: Models are evaluated in a uniform sealed sandbox, with all inputs and outputs logged and cryptographically signed
Community Governance: Multi-stakeholder validator networks enforce rules and governance
Rolling Renewal: A fixed proportion of items is retired and replaced in each evaluation round
Auditability and Integrity: Validators pre-commit test and answer hashes before publication
Fair Access: Any legitimate team can submit models, paying only a fee that covers compute costs
Multi-Metric Reporting: Provides domain-specific subscores and percentile rankings
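
The pre-commitment idea in the Auditability and Integrity principle can be illustrated with a small hash-based sketch. The summary does not say which commitment scheme PeerBench uses, so the choice of SHA-256 with a random nonce, and the function names `commit` and `verify`, are assumptions made for illustration only.

```python
import hashlib
import secrets


def _digest(nonce: bytes, test_bytes: bytes, answer_bytes: bytes) -> str:
    """Hash the nonce, test, and answers with length prefixes (assumed encoding)."""
    h = hashlib.sha256()
    for part in (nonce, test_bytes, answer_bytes):
        # Length-prefix each part so that (b"ab", b"c") and (b"a", b"bc") differ.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


def commit(test_bytes: bytes, answer_bytes: bytes) -> tuple[str, bytes]:
    """Return (commitment, nonce). Publish the commitment; keep the nonce secret."""
    nonce = secrets.token_bytes(32)  # hides low-entropy tests from guessing
    return _digest(nonce, test_bytes, answer_bytes), nonce


def verify(commitment: str, nonce: bytes, test_bytes: bytes, answer_bytes: bytes) -> bool:
    """Check, after disclosure, that the revealed test matches the commitment."""
    return secrets.compare_digest(_digest(nonce, test_bytes, answer_bytes), commitment)
```

Under these assumptions, a validator publishes only the hex commitment before the round; once the test is retired, revealing the nonce, test, and answers lets anyone recompute the hash and confirm nothing was altered after the fact.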

PeerBench Architecture Design

Participant Roles

Data Contributors: Create private test suites and executable scoring functions
Reviewers: Assess quality of submitted tests, producing ordinal ratings
Model Creators: Expose inference endpoints and register for specific evaluation streams
Coordination Server: Authenticates uploads, manages the active repository, and schedules peer reviews
End Users: Researchers, journalists, and others consulting real-time leaderboards

Three Leaderboard Systems

Data Contributor Leaderboard:

ContributorScore(c) = Σ quality(T_i^(c)) + bonuses

Reviewer Leaderboard:

ReviewerScore(r) = Pearson({q_r^(i)}, {q^(i)})

Model Leaderboard:

ModelScore(m) = (Σ w(T_i) s_i^(m)) / (Σ w(T_i))
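
The three leaderboard formulas above can be read as the following minimal sketch. It assumes that q^(i) denotes the consensus quality rating of test i, that each reviewer is compared only on the tests they rated, and that a constant rating vector yields a correlation of 0; none of these details is stated in the summary, and the function names are hypothetical.

```python
from math import sqrt


def contributor_score(qualities: list[float], bonuses: float = 0.0) -> float:
    """ContributorScore(c): sum of quality over the contributor's tests, plus bonuses."""
    return sum(qualities) + bonuses


def reviewer_score(reviewer_ratings: list[float], consensus_ratings: list[float]) -> float:
    """ReviewerScore(r): Pearson correlation between a reviewer's ratings and consensus."""
    n = len(reviewer_ratings)
    if n != len(consensus_ratings) or n < 2:
        raise ValueError("need two equal-length rating lists with at least 2 entries")
    mean_r = sum(reviewer_ratings) / n
    mean_c = sum(consensus_ratings) / n
    cov = sum((r - mean_r) * (c - mean_c)
              for r, c in zip(reviewer_ratings, consensus_ratings))
    var_r = sum((r - mean_r) ** 2 for r in reviewer_ratings)
    var_c = sum((c - mean_c) ** 2 for c in consensus_ratings)
    if var_r == 0 or var_c == 0:
        return 0.0  # assumption: correlation is undefined here, so report 0
    return cov / sqrt(var_r * var_c)


def model_score(weights: list[float], scores: list[float]) -> float:
    """ModelScore(m): weighted mean of per-test scores s_i with weights w(T_i)."""
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one test must have non-zero weight")
    return sum(w * s for w, s in zip(weights, scores)) / total
```

For example, `model_score([1.0, 3.0], [1.0, 0.0])` gives 0.25: the failed test carries three times the weight of the passed one, so it dominates the average.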

End-to-End Workflow

Setup Phase

Participants register using verifiable credentials
Participants generate public/private key pairs for signing
Contributors and reviewers post collateral bonds

Continuous Evaluation Process

T1. Test Submission and Commitment: Contributors submit test T^(c) and scoring function F^(c); the system records binding commitment h = Com(T^(c), F^(c))

T2. Model Evaluation: The server immediately schedules queries against all currently registered models

T3. Review Process: Each test is randomly assigned to reviewers and requires at least three valid reviews

T4. Weight Calculation:

w(T^(c)) = max{0, 0.7 * quality(T^(c)) + 0.3 * min(2, ρ_c/100)}

T5. Repository Management: New tests enter the active repository; zero-weight tests are prioritized for retirement

T6. Reputation Updates: Reputation of all relevant participants updated after each round
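
Steps T4 and T5 can be sketched as follows. The weight formula is taken directly from T4; reading ρ_c as contributor c's reputation is an assumption, as are the function names, the `fraction` parameter, and the tie-breaking rule (tests keep their insertion order within each group), which the summary does not specify.

```python
def item_weight(quality: float, rho_c: float) -> float:
    """w(T) = max{0, 0.7 * quality(T) + 0.3 * min(2, rho_c / 100)} from step T4."""
    return max(0.0, 0.7 * quality + 0.3 * min(2.0, rho_c / 100.0))


def select_for_retirement(weights: dict[str, float], fraction: float) -> list[str]:
    """Pick a fixed fraction of active tests to retire, zero-weight tests first.

    `weights` maps a test id to its current weight. Beyond "zero-weight first",
    the ordering is an assumption: insertion order is kept within each group.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    quota = int(len(weights) * fraction)
    # sorted() is stable, so False (zero weight) sorts ahead of True (positive).
    ordered = sorted(weights, key=lambda test_id: weights[test_id] > 0.0)
    return ordered[:quota]
```

With a quality of 0.8 and ρ_c = 150, the weight is 0.7 × 0.8 + 0.3 × 1.5 = 1.01. The `min(2, ·)` term caps the second component at 0.6, so it cannot grow without bound, while the outer `max{0, ·}` keeps a sufficiently negative quality rating from producing a negative weight.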

Experimental Setup

Temporal Fairness Dilemma

The paper identifies two design choices:

Option A: On-Demand Immediate Scoring: Models are scored immediately upon request, maximizing responsiveness
Option B: Periodic Synchronized Evaluation: Models register for scheduled evaluation windows, ensuring the strongest form of fairness

PeerBench adopts a hybrid approach supporting both paradigms, prioritizing immediate scoring flexibility in the prototype.

Security and Audit Mechanisms

Partial Disclosure: Display small random portions of tests to reviewers in read-only, non-copyable format
Complete Publication: Release tests, logs, and model responses after retirement
Slashing Mechanisms: Remove participants with reputation below threshold; malicious behavior triggers bond slashing

Experimental Results

Prototype Implementation

The paper provides a practical prototype implementation of PeerBench (https://peerbench.ai), demonstrating:

Complete workflow implementation
Operational reputation system mechanisms
Multi-stream evaluation support (mathematics, code generation, translation, etc.)

Validity of Design Choices

The paper addresses common issues through architectural design:

Data Contamination and Cherry-Picking: Validators pre-commit to test sets, maintaining privacy until round completion
Private Data Cheating: Public random sources determine disclosed queries, preventing validators from anticipating audit items
Test Quality: Each test receives multiple independent reviews; data quality determines its weight in final scores
Accessibility: Registration for all roles is lightweight, supporting broad participation
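
The use of public randomness to choose which queries are disclosed can be illustrated with a deterministic sampling sketch. The summary does not describe the actual randomness source or selection rule, so everything here is an assumption: `public_seed` stands for a value published after tests are committed, `commitment` is the test's pre-committed hash, and items are ranked by a SHA-256 digest.

```python
import hashlib


def disclosed_indices(public_seed: bytes, commitment: str, n_items: int, k: int) -> list[int]:
    """Derive k distinct item indices to disclose from public randomness.

    The seed is assumed to become known only after the test is committed, so
    the test's author cannot predict which items will be audited. Anyone can
    rerun this function to check that the disclosed subset was not hand-picked.
    """
    if not 0 <= k <= n_items:
        raise ValueError("k must be between 0 and n_items")

    def rank(index: int) -> bytes:
        # Each item gets a pseudorandom rank bound to the seed and the commitment.
        data = public_seed + commitment.encode("utf-8") + index.to_bytes(8, "big")
        return hashlib.sha256(data).digest()

    chosen = sorted(range(n_items), key=rank)[:k]
    return sorted(chosen)
```

Binding the rank to the commitment means the same public seed selects different items for different tests, so one disclosed subset reveals nothing about another test's audit set.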

Related Work

Static Benchmarks and Leaderboards

MMLU, GSM8K, SuperGLUE provide clear progress snapshots but rapidly saturate and leak into training corpora
BIG-Bench expands task coverage but tasks become public upon release
HELM adds multiple metrics but remains static between publication intervals

Dynamic or Contamination-Resistant Benchmarks

LiveBench continuously refreshes tasks but relies on a single centralized team
Dynabench explores human-in-the-loop adversarial data collection
Adversarial "model-breaking" competitions expose weaknesses but lack systematic score aggregation

Human Preference and Open Evaluation Platforms

Chatbot Arena's Elo ladder and OpenAI Evals promote openness
HuggingFace Open LLM Leaderboard allows users to upload test scripts
However, these platforms are vulnerable to spam, bot voting, and untracked contamination

Conclusions and Discussion

Main Conclusions

Current AI benchmarking systems exhibit systemic deficiencies that require a paradigm shift
A proctored evaluation paradigm inspired by human standardized testing is a viable solution
PeerBench demonstrates the practicality of community-governed, contamination-resistant evaluation
Evaluation must balance openness and rigor

Limitations

Temporal Fairness: Fundamental tension exists between immediate and synchronized evaluation
Implementation Costs: Requires sustained high-quality test creation and infrastructure maintenance
Participation Incentives: Appropriate economic incentives needed to sustain reviewer engagement
Governance Complexity: Multi-stakeholder governance may face coordination challenges

Future Directions

Mechanism Design: Further research into game-theoretic security analysis to strengthen economic and adversarial robustness
Governance Optimization: Improve multi-institutional governance structures and rotating membership systems
Cost Optimization: Explore methods to reduce operational costs, such as containerized inference submissions
Standardization: Promote collaboration with neutral organizations such as NIST or MLCommons

In-Depth Evaluation

Strengths

Accurate Problem Identification: Pinpoints contamination, selective reporting, and fragmentation as core issues in the current AI evaluation ecosystem
Innovative Solutions: Proposes a paradigm shift from static leaderboards to proctored examinations
Strong Practicality: Provides concrete implementation prototypes and detailed workflows
Solid Theoretical Foundation: Draws on mature experience from human standardized testing
Community-Oriented: Emphasizes community governance and decentralization, avoiding single points of failure

Weaknesses

Scalability Challenges: Large-scale implementation may face participant coordination and incentive issues
Cold Start Problem: New systems require sufficient initial participants to establish credibility
Incomplete Economic Model: While slashing mechanisms are mentioned, economic incentive details require further refinement
Implementation Complexity: Components such as cryptographic signing and reputation tracking are complex to implement

Impact

Academic Contribution: Provides new theoretical framework and practical direction for AI evaluation research
Industry Impact: May drive establishment of more equitable and trustworthy AI evaluation standards
Policy Significance: Provides technical foundation for AI regulation and standard-setting
Long-Term Value: Establishes blueprint for sustainable AI evaluation ecosystem

Applicable Scenarios

High-Risk AI Application Evaluation: Particularly suitable for AI systems requiring high credibility
Academic Research: Provides fair model comparison platform for research community
Industry Standard Development: Can serve as foundation for industry standard evaluation frameworks
Regulatory Compliance: Provides technical support for regulatory evaluation of AI systems

References

The paper cites 56 relevant references spanning multiple domains including AI evaluation, benchmarking, data contamination, and reputation systems, providing substantial theoretical support for its positions.

Overall Assessment: This position paper analyzes the problems of current AI evaluation in depth and proposes a concrete, feasible solution. The PeerBench prototype demonstrates that the concept is workable. Challenges remain for large-scale deployment, but the paper gives the AI evaluation field a clear direction.