Active Model Selection for Large Language Models

Durmazkeser, Okanovic, Kirsch et al.
We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.

Basic Information

  • Paper ID: 2510.09418
  • Title: Active Model Selection for Large Language Models
  • Authors: Yavuz Durmazkeser (TU Delft), Patrik Okanovic (ETH Zurich), Andreas Kirsch, Torsten Hoefler (ETH Zurich), Nezihe Merve Gürel (TU Delft)
  • Classification: cs.CL cs.LG
  • Publication Date/Venue: arXiv preprint, October 2025
  • Paper Link: https://arxiv.org/abs/2510.09418

Abstract

This paper introduces LLM SELECTOR, the first active model selection framework for large language models (LLMs). Unlike traditional evaluation and benchmarking methods that rely on fully annotated datasets, LLM SELECTOR identifies the best LLM for a given task from a limited number of annotations. For any given task, it adaptively selects a small set of queries to annotate, choosing those that are most informative about which model is best. To further reduce annotation costs, the method uses a judge-based oracle annotation model. In experiments with 151 LLMs across 6 benchmarks, LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best or a near-best LLM.

Research Background and Motivation

1. Core Problem

With the rapid proliferation of large language models, selecting the optimal LLM for specific applications or data distributions without retraining has become increasingly challenging. Existing model selection methods face the following challenges:

  • The number of available models is growing rapidly and includes diverse pre-trained models on academic and commercial platforms
  • Different LLMs exhibit significant performance variations across domains, tasks, and languages
  • Existing benchmarks struggle to keep pace with rapid model releases and often focus on standardized tasks

2. Problem Significance

Model selection is critical for practical deployment because:

  • Performance differences can be substantial, particularly in domain-specific applications
  • Annotation costs are high, necessitating efficient selection strategies
  • Traditional random or heuristic selection methods often lead to resource waste

3. Limitations of Existing Methods

  • Full Annotation Requirement: Traditional evaluation methods require annotation of entire datasets
  • Static Benchmarks: Cannot adapt to new models or application-specific requirements
  • Classification Task Limitation: Existing active model selection primarily targets classification tasks and is unsuitable for generation settings
  • Scalability Issues: Existing methods are typically limited to two candidate models or to single-model testing scenarios

Core Contributions

  1. Novel Framework: Proposes the first active model selection framework for LLMs, LLM SELECTOR
  2. Information-Theoretic Approach: Selects queries by an information gain criterion, using a dual-parameter model to quantify how informative each query is
  3. Judge Mechanism: Uses a judge-based oracle for annotation, which further reduces annotation costs
  4. Model Agnostic: Completely model-agnostic approach applicable to black-box or API-only access scenarios
  5. Experimental Validation: Comprehensive evaluation on 151 LLMs across 6 benchmarks, demonstrating significant cost reduction

Methodology Details

Task Definition

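Given a set of n unannotated queries Q = {q_i | i ∈ [n]} drawn from a query space 𝒬 and m pre-trained language models M = {f_j : 𝒬 → ℛ | j ∈ [m]} that map queries to responses, the objective is to identify the best model f*, the one that produces the highest-quality responses on Q, under a limited annotation budget b ≪ n.

Writing r_i for the annotation of query q_i and F for the random variable denoting the unknown best model, the problem is formalized as maximizing the mutual information between F and the annotated set A: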

A_opt[b] = argmax_{A⊆{(qi,ri)|i∈[n]}, |A|≤b} I(F; A)
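
One way to read this objective uses only the standard identity relating mutual information and entropy. Because H(F) is fixed by the prior over which model is best, maximizing I(F; A) amounts to minimizing the remaining uncertainty about F after annotation, which is the quantity the greedy rule in the Sequential Information Maximization step targets one query at a time. The restatement below is a sketch of that reasoning, not a derivation taken from the paper.

```latex
\begin{aligned}
I(F; A) &= H(F) - H(F \mid A), \\
\operatorname*{arg\,max}_{|A| \le b} \; I(F; A) &= \operatorname*{arg\,min}_{|A| \le b} \; H(F \mid A).
\end{aligned}
```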

Model Architecture

1. Preference Judgment-Based Annotation Framework

Employs direct preference judgments rather than reference answer comparisons:

  • Pairwise Comparison: For query q_i, the oracle judge compares the responses of models f_j and f_k
  • Judgment Results: f_j(q_i) > f_k(q_i) means the judge prefers the response of f_j, f_j(q_i) < f_k(q_i) means it prefers the response of f_k, and f_j(q_i) = f_k(q_i) denotes a draw
  • Win Rate Calculation: WR_Q(f_j, f_k) = (1/n) ∑_{i=1}^{n} OracleJudge(q_i, f_j(q_i), f_k(q_i))
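
The win rate above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `judge` callable, the toy models, and the numeric scoring of outcomes (1 for a win, 0.5 for a draw, 0 for a loss) are assumptions, since the summary does not state how draws are counted.

```python
from typing import Callable, Sequence

# Assumed scoring of judge outcomes; the 0.5 credit for a draw is hypothetical.
SCORE = {">": 1.0, "=": 0.5, "<": 0.0}

def win_rate(
    queries: Sequence[str],
    model_j: Callable[[str], str],
    model_k: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> float:
    """Average judge score of model_j against model_k over all queries."""
    total = 0.0
    for q in queries:
        total += SCORE[judge(q, model_j(q), model_k(q))]
    return total / len(queries)

# Toy stand-ins (hypothetical): the "judge" simply prefers the longer response.
def length_judge(query: str, a: str, b: str) -> str:
    if len(a) == len(b):
        return "="
    return ">" if len(a) > len(b) else "<"

queries = ["hi", "hello", "hey", "yo"]
wr = win_rate(queries, lambda q: q * 2, lambda q: "hello!", length_judge)
print(f"win rate of model j vs. model k: {wr:.3f}")
```

In practice the judge would be an LLM call returning one of the three outcomes; only the averaging logic is shown here.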

2. Dual-Parameter Model

Introduces a dual-parameter model describing how the best language model behaves relative to a baseline model f̄, where F is the random variable denoting the unknown best model:

P(F(q) < f̄(q)|F = f*) = ε_loss
P(F(q) = f̄(q)|F = f*) = ε_draw  
P(F(q) > f̄(q)|F = f*) = 1 - ε_loss - ε_draw
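
A worked sketch of how these three probabilities could act as a likelihood when updating beliefs about which model is best. It rests on assumptions not stated in this summary: annotations on different queries are conditionally independent given F, the prior over candidates is uniform, and w_j, d_j, l_j (hypothetical symbols) count the wins, draws, and losses of f_j against f̄ on the annotated queries.

```latex
P(F = f_j \mid A) \;\propto\;
  (1 - \epsilon_{\text{loss}} - \epsilon_{\text{draw}})^{w_j}\,
  \epsilon_{\text{draw}}^{d_j}\,
  \epsilon_{\text{loss}}^{l_j}

% Example: \epsilon_{\text{loss}} = 0.15, \epsilon_{\text{draw}} = 0.35, two candidates,
% one annotated query on which f_1 wins and f_2 loses against the baseline:
P(F = f_1 \mid A) = \frac{0.5}{0.5 + 0.15} \approx 0.769, \qquad
P(F = f_2 \mid A) = \frac{0.15}{0.5 + 0.15} \approx 0.231
```

A single annotation on which the candidates disagree already shifts most of the probability mass toward the winner, which is why queries with divergent outcomes are the informative ones.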

3. Sequential Information Maximization Algorithm

Employs a greedy strategy to progressively select queries:

q_t = argmin_{q ∈ U_t} E_R[H(F | A_t ∪ {(q, R)})]
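
The greedy rule can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the expectation over the unknown annotation R is approximated by averaging over outcome vectors guessed by weak judges, annotations are treated as independent given F, and all names (`update`, `select_query`, `weak_votes`) are hypothetical.

```python
import math
from typing import Dict, List, Sequence

EPS_LOSS, EPS_DRAW = 0.15, 0.35  # illustrative values only

def likelihood(outcome: str) -> float:
    """P(outcome vs. baseline | this model is the best one), dual-parameter model."""
    if outcome == "<":
        return EPS_LOSS
    if outcome == "=":
        return EPS_DRAW
    return 1.0 - EPS_LOSS - EPS_DRAW

def update(posterior: Sequence[float], outcomes: Sequence[str]) -> List[float]:
    """Bayes update of P(F = f_j) given each model's outcome against the baseline."""
    unnorm = [p * likelihood(o) for p, o in zip(posterior, outcomes)]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]

def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def select_query(
    posterior: Sequence[float],
    unlabeled: Sequence[str],
    weak_votes: Dict[str, List[List[str]]],
) -> str:
    """Pick the query whose hypothetical annotation minimizes expected posterior entropy.

    weak_votes[q] holds one guessed outcome vector per weak judge (assumed input).
    """
    def expected_entropy(q: str) -> float:
        guesses = weak_votes[q]
        return sum(entropy(update(posterior, g)) for g in guesses) / len(guesses)
    return min(unlabeled, key=expected_entropy)

# Toy example with three candidate models and a uniform prior.
prior = [1 / 3] * 3
weak_votes = {
    "q1": [[">", ">", ">"], [">", ">", ">"]],  # all models win: uninformative
    "q2": [[">", "<", "<"], ["<", ">", "<"]],  # models disagree: informative
}
chosen = select_query(prior, ["q1", "q2"], weak_votes)
print("query to annotate next:", chosen)
```

After the chosen query is annotated by the oracle, the same `update` call would be applied with the true outcomes and the loop repeated until the budget b is spent.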

4. Weak Judge Mechanism

Uses k-gram language models as weak judges:

  • Constructs k-gram models based on candidate model responses
  • Compares response quality through average sequence likelihood ratios
  • Employs ensemble results from multiple weak judges (z=10)
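
A k-gram weak judge along these lines might look like the sketch below. Several details are assumptions made for illustration: add-one smoothing, whitespace tokenization, declaring a draw when the likelihood difference falls within a tolerance, varying k across ensemble members, and majority voting. The summary does not specify how the ensemble members differ or how their results are combined.

```python
import math
from collections import Counter
from typing import List, Sequence

class KGramJudge:
    """Add-one smoothed k-gram language model used as a weak pairwise judge."""

    def __init__(self, corpus: Sequence[str], k: int = 2):
        self.k = k
        self.context_counts: Counter = Counter()
        self.gram_counts: Counter = Counter()
        vocab = set()
        for text in corpus:
            tokens = self._pad(text)
            vocab.update(tokens)
            for i in range(k - 1, len(tokens)):
                ctx = tuple(tokens[i - k + 1:i])
                self.context_counts[ctx] += 1
                self.gram_counts[ctx + (tokens[i],)] += 1
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def _pad(self, text: str) -> List[str]:
        return ["<s>"] * (self.k - 1) + text.lower().split() + ["</s>"]

    def avg_log_likelihood(self, text: str) -> float:
        """Mean per-token log-probability of the text under the k-gram model."""
        tokens = self._pad(text)
        total = 0.0
        for i in range(self.k - 1, len(tokens)):
            ctx = tuple(tokens[i - self.k + 1:i])
            num = self.gram_counts[ctx + (tokens[i],)] + 1
            den = self.context_counts[ctx] + self.vocab_size
            total += math.log(num / den)
        return total / (len(tokens) - self.k + 1)

    def compare(self, a: str, b: str, tol: float = 1e-9) -> str:
        """Return '>', '<' or '=' from the difference in average log-likelihood."""
        diff = self.avg_log_likelihood(a) - self.avg_log_likelihood(b)
        if abs(diff) <= tol:
            return "="
        return ">" if diff > 0 else "<"

def ensemble_vote(judges: Sequence[KGramJudge], a: str, b: str) -> str:
    """Majority vote over the weak judges (assumed aggregation rule)."""
    votes = Counter(j.compare(a, b) for j in judges)
    return votes.most_common(1)[0][0]

# Toy corpus standing in for the pooled candidate-model responses.
corpus = ["the cat sat on the mat", "the dog sat on the rug", "a cat lay on the mat"]
judges = [KGramJudge(corpus, k) for k in (1, 2, 3)]
fluent, scrambled = "the cat sat on the mat", "mat the on sat cat the"
print(ensemble_vote(judges, fluent, scrambled))
```

The unigram judge cannot tell the two strings apart because they contain the same words, while the bigram and trigram judges favor the fluent ordering; this kind of disagreement is what makes the ensemble a noisy but useful oracle.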

Technical Innovations

  1. Information-Theoretic Driven Selection: First application of Shannon mutual information to LLM selection with solid theoretical foundation
  2. Weak Judge Ensemble: Innovatively uses k-gram model ensemble as a noisy oracle, enabling parameter optimization without requiring true annotations
  3. Baseline Comparison Strategy: Reduces the number of pairwise comparisons per query from O(m²) to O(m) by comparing every candidate against a single baseline model
  4. Adaptive Parameter Selection: Automatically determines ε_loss and ε_draw parameters through weak judge ensemble
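
To make the O(m²) versus O(m) point concrete, the short calculation below counts judge calls per query for the 68 Arena-Hard candidates. It assumes the baseline is itself one of the m candidates, so m − 1 comparisons suffice; if the baseline were external, the count would be m.

```python
def all_pairs(m: int) -> int:
    """Judge calls per query when every pair of models is compared."""
    return m * (m - 1) // 2

def vs_baseline(m: int) -> int:
    """Judge calls per query when each model is compared to one baseline (assumed to be a candidate)."""
    return m - 1

m = 68  # number of LLMs listed for Arena-Hard
print(all_pairs(m), vs_baseline(m))
```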

Experimental Setup

Datasets

Experiments cover 6 benchmarks with 151 LLMs:

Dataset      Queries  LLMs  Category          Win Rate Range
AlpacaEval   805      53    General Dialogue  15.22%-97.64%
Arena-Hard   500      68    General Dialogue  5.20%-84.70%
MT-Bench     80       6     General Dialogue  5.63%-81.88%
Flickr30k    1000     51    Vision-Language   17.25%-64.85%
Bingo        762      31    Vision-Language   0.13%-55.91%
MediQA       150      9     Medical QA        33.67%-51.00%

Evaluation Metrics

  1. Identification Probability: Proportion of experiments correctly identifying the best model
  2. Annotation Efficiency: Percentage reduction in required annotations compared to the best baseline method
  3. 95th Percentile Win Rate Gap: 95th percentile of win rate differences between selected and absolute best models
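
The three metrics can be computed as in the sketch below. The function names, the nearest-rank percentile convention, and the toy numbers are assumptions made for illustration; none of the values are results from the paper.

```python
import math
from typing import Dict, Sequence

def identification_probability(selected: Sequence[str], best: str) -> float:
    """Fraction of runs in which the selected model is the true best model."""
    return sum(s == best for s in selected) / len(selected)

def annotation_reduction(baseline_budget: int, method_budget: int) -> float:
    """Percentage reduction in annotations relative to the best baseline."""
    return 100.0 * (baseline_budget - method_budget) / baseline_budget

def percentile_gap(
    selected: Sequence[str], win_rates: Dict[str, float], best: str, pct: float = 95.0
) -> float:
    """Nearest-rank percentile of the win rate gap between best and selected model."""
    gaps = sorted(win_rates[best] - win_rates[s] for s in selected)
    rank = max(1, math.ceil(pct * len(gaps) / 100))
    return gaps[rank - 1]

# Hypothetical outcome of 10 selection runs over three models.
win_rates = {"A": 0.80, "B": 0.78, "C": 0.60}
selected = ["A"] * 7 + ["B"] * 2 + ["C"]

print(identification_probability(selected, "A"))
print(annotation_reduction(baseline_budget=100, method_budget=60))
print(percentile_gap(selected, win_rates, "A"))
```

The percentile gap is dominated by the worst runs, so it complements identification probability: a method can miss the best model often yet still score well if its misses are near-best models.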

Baseline Methods

  • Random: Random query selection
  • Bradley-Terry: Based on Bradley-Terry coefficient posterior distribution
  • Most Draws: Selects queries with most draws against baseline
  • Uncertainty: Uncertainty sampling-based approach
  • Confidence: Confidence sampling-based approach

Implementation Details

  • Oracle Judge: GPT-4 for text tasks, Prometheus-Vision for vision-language tasks
  • Number of Weak Judges: z=10
  • Parameter Optimization: Grid search for determining ε_loss and ε_draw
  • Experimental Setup: Multiple runs per configuration for performance estimation
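
The grid search over ε_loss and ε_draw can be sketched generically. The scoring function here is a placeholder: in the setting described above it would measure how well selection performs when weak-judge outputs stand in for true annotations, but that procedure is not detailed in this summary, so `toy_score` is purely hypothetical and is peaked at an arbitrary point so the example has a known answer.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

def grid_search(
    score_fn: Callable[[float, float], float],
    loss_grid: Sequence[float],
    draw_grid: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (best_score, eps_loss, eps_draw) over all valid grid points."""
    best = None
    for eps_loss, eps_draw in product(loss_grid, draw_grid):
        if eps_loss + eps_draw >= 1.0:
            continue  # the win probability 1 - eps_loss - eps_draw must stay positive
        score = score_fn(eps_loss, eps_draw)
        if best is None or score > best[0]:
            best = (score, eps_loss, eps_draw)
    if best is None:
        raise ValueError("no valid (eps_loss, eps_draw) pair in the grid")
    return best

def toy_score(eps_loss: float, eps_draw: float) -> float:
    """Placeholder objective; a real one would evaluate selection with weak-judge labels."""
    return -((eps_loss - 0.20) ** 2 + (eps_draw - 0.40) ** 2)

best_score, best_loss, best_draw = grid_search(
    toy_score, [0.10, 0.15, 0.20, 0.25], [0.30, 0.35, 0.40, 0.45]
)
print(best_loss, best_draw)
```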

Experimental Results

Main Results

1. Identification Probability Performance

LLM SELECTOR significantly outperforms baseline methods across multiple datasets:

  • Arena-Hard: Achieves 100% identification probability with 58.33% annotation reduction
  • MediQA: Reduces annotations by 50.40%
  • MT-Bench: Reduces annotations by 40.00%
  • Comparable or superior to strongest baselines on other benchmarks

2. Annotation Efficiency (Near-Optimal Models)

Efficiency gains when selecting near-optimal models within win rate gap δ:

Dataset      δ=1%     δ=2.5%   δ=5%
Arena-Hard   ↓59.62%  ↓59.62%  ↓58.42%
AlpacaEval   ↑7.06%   ↓30.99%  ↓35.85%
MT-Bench     ↓40.00%  ↓40.00%  ↓42.68%
Flickr30k    ↓3.39%   ↓6.25%   ↓36.47%

Ablation Studies

1. Parameter Sensitivity Analysis

Optimal parameters determined through 1000 runs:

  • Arena-Hard: ε_loss=0.20, ε_draw=0.40
  • AlpacaEval: ε_loss=0.20, ε_draw=0.40
  • MT-Bench: ε_loss=0.15, ε_draw=0.35

2. Impact of Weak Judge Quantity

z=10 was found to be the best choice; weak judges beyond this number provide little new information.

Robustness Analysis

The 95th percentile win rate gap analysis shows that LLM SELECTOR keeps the win rate gap to the best model small across different budgets, achieving the best or near-best performance in most cases.

Related Work

1. LLM Evaluation Methods

  • Traditional Benchmarks: Multiple-choice and short-answer benchmarks (MMLU, HellaSwag, etc.)
  • Reference-Based Benchmarks: BLEU, ROUGE evaluation for summarization and translation tasks
  • Judge-Based Benchmarks: LMArena, Arena-Hard, AlpacaEval based on LLM-as-a-Judge

2. Active Model Selection

Existing work primarily focuses on:

  • Classification Tasks: Application of traditional active learning in classification scenarios
  • Online Settings: Scenarios where data arrives in streams
  • Dual-Model Comparison: Limited to two candidate models

3. Advantages of This Work

  • First active model selection for LLM generation tasks
  • Supports arbitrary number of candidate models
  • Data-centric perspective prioritizing annotation samples over model pairs

Conclusions and Discussion

Main Conclusions

  1. Effectiveness Validation: LLM SELECTOR significantly reduces annotation costs across multiple benchmarks
  2. Consistent Performance: Remains consistently competitive across benchmarks, whereas the performance of the baselines is unstable
  3. Practical Value: Completely model-agnostic design makes it suitable for real-world deployment scenarios

Limitations

  1. Baseline Dependency: Method performance partially depends on baseline model selection quality
  2. Parameter Tuning: Requires pre-determining ε_loss and ε_draw parameters
  3. Judge Quality: Depends on oracle judge quality and consistency
  4. Computational Overhead: Weak judge computation may become a bottleneck in large-scale scenarios

Future Directions

  1. Adaptive Parameters: Develop adaptive versions without preset parameters
  2. Multi-Task Extension: Extend to joint multi-task selection scenarios
  3. Online Learning: Incorporate online learning for dynamic model collections
  4. Theoretical Analysis: Provide deeper theoretical guarantees and convergence analysis

In-Depth Evaluation

Strengths

  1. Problem Importance: Addresses an important practical problem in the LLM era
  2. Methodological Innovation: First systematic application of active learning ideas to LLM selection
  3. Theoretical Foundation: Solid information-theoretic basis
  4. Comprehensive Experiments: Extensive validation across multiple domains and 151 models
  5. Practical Design: Model-agnostic design applicable to API scenarios

Weaknesses

  1. Judge Dependency: Method effectiveness strongly depends on oracle judge quality
  2. Parameter Sensitivity: Requires per-dataset parameter tuning, potentially limiting generalization
  3. Insufficient Theoretical Analysis: Lacks theoretical guarantees on convergence and sample complexity
  4. Computational Complexity: Insufficient analysis of weak judge computational overhead

Impact

  1. Academic Contribution: Opens new research direction in active LLM selection
  2. Practical Value: Provides effective tools for real-world LLM deployment
  3. Reproducibility: Provides complete open-source implementation
  4. Extensibility: Establishes foundational framework for subsequent research

Applicable Scenarios

  1. Resource-Constrained Environments: Real-world applications with limited annotation budgets
  2. Domain-Specific Applications: Scenarios requiring model selection for specific data distributions
  3. API Service Selection: Choosing among multiple commercial API services
  4. Continuous Evaluation: Dynamic environments requiring periodic model selection updates

References

The paper cites extensive related work including:

  • LLM Evaluation Benchmarks: HELM (Liang et al., 2023), OpenCompass (2023)
  • Active Learning: Chen et al. (2015), Okanovic et al. (2025)
  • LLM-as-a-Judge: Zheng et al. (2023), Li et al. (2024)
  • Preference Learning: Rafailov et al. (2023), Ouyang et al. (2022)

Overall Assessment: This is a high-quality paper addressing an important practical problem, proposing the first active model selection framework for LLMs with significant contributions in methodological innovation, experimental validation, and practical value. While there remains room for improvement in theoretical analysis and parameter adaptation, it opens new research directions in the LLM selection field with substantial academic and practical significance.