Active Model Selection for Large Language Models

Durmazkeser, Okanovic, Kirsch et al.

We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.

Basic Information

Paper ID: 2510.09418
Title: Active Model Selection for Large Language Models
Authors: Yavuz Durmazkeser (TU Delft), Patrik Okanovic (ETH Zurich), Andreas Kirsch, Torsten Hoefler (ETH Zurich), Nezihe Merve Gürel (TU Delft)
Classification: cs.CL cs.LG
Publication Date/Venue: arXiv preprint, October 2025
Paper Link: https://arxiv.org/abs/2510.09418

Abstract

This paper introduces LLM SELECTOR, the first active model selection framework for large language models (LLMs). Unlike traditional evaluation and benchmarking methods that rely on fully annotated datasets, LLM SELECTOR identifies the best LLM for a given task from a limited number of annotations. For any given task, it adaptively selects a small set of queries to annotate, choosing those that are most informative about which model is best. To further reduce annotation costs, the method uses a judge-based oracle annotation model. In experiments with 151 LLMs across 6 benchmarks, LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM.

Research Background and Motivation

1. Core Problem

With the rapid proliferation of large language models, selecting the optimal LLM for specific applications or data distributions without retraining has become increasingly challenging. Existing model selection methods face the following challenges:

The number of available models is growing rapidly and includes diverse pre-trained models on academic and commercial platforms
Different LLMs exhibit significant performance variations across domains, tasks, and languages
Existing benchmarks struggle to keep pace with rapid model releases and often focus on standardized tasks

2. Problem Significance

Model selection is critical for practical deployment because:

Performance differences can be substantial, particularly in domain-specific applications
Annotation costs are high, necessitating efficient selection strategies
Traditional random or heuristic selection methods often lead to resource waste

3. Limitations of Existing Methods

Full Annotation Requirement: Traditional evaluation methods require annotation of entire datasets
Static Benchmarks: Cannot adapt to new models or application-specific requirements
Classification Task Limitation: Existing active model selection primarily targets classification tasks and is unsuitable for generation settings
Scalability Issues: Existing methods are typically limited to two candidate models or to single-model testing scenarios

Core Contributions

Novel Framework: Proposes the first active model selection framework for LLMs, LLM SELECTOR
Information-Theoretic Approach: Selects queries with an information gain criterion, using a dual-parameter model to quantify how informative each query is
Judge Mechanism: Employs a judge-based oracle annotation process that significantly reduces annotation costs
Model Agnostic: Completely model-agnostic approach applicable to black-box or API-only access scenarios
Experimental Validation: Comprehensive evaluation on 151 LLMs across 6 benchmarks, demonstrating significant cost reduction

Methodology Details

Task Definition

Given a set of n unannotated queries Q = {qi ∈ 𝒬 | i ∈ [n]} and m pre-trained language models M = {fj : 𝒬 → ℛ | j ∈ [m]}, where 𝒬 and ℛ denote the query and response spaces and [n] = {1, ..., n}, the objective is to identify the optimal model f* that produces the highest quality responses for the queries in Q under a limited annotation budget b ≪ n.

The problem is formalized as maximizing mutual information:

A_opt[b] = argmax_{A⊆{(qi,ri)|i∈[n]}, |A|≤b} I(F; A)

Model Architecture

1. Preference Judgment-Based Annotation Framework

Employs direct preference judgments rather than reference answer comparisons:

Pairwise Comparison: For query qi, the oracle judge compares responses from models fj and fk
Judgment Results: >, <, = denote that the response of fj is preferred, that the response of fk is preferred, and that the two are tied, respectively
Win Rate Calculation: WR_Q(fj, fk) = (1/n) ∑_{i∈[n]} OracleJudge(qi, fj(·), fk(·))
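
As a minimal sketch of the win rate calculation above, the snippet below averages oracle judgments over the annotated queries. The numeric scoring (a tie counted as half a win) and the names `SCORES` and `win_rate` are assumptions made for illustration; the summary does not state how OracleJudge maps draws to numbers.

```python
from typing import Sequence

# Outcome symbols as in the text: ">" first response preferred,
# "<" second response preferred, "=" tie.
# Assumption: a tie is scored as half a win.
SCORES = {">": 1.0, "=": 0.5, "<": 0.0}


def win_rate(judgments: Sequence[str]) -> float:
    """Average score of model fj against model fk over the judged queries."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(SCORES[j] for j in judgments) / len(judgments)


# Four judged queries: two wins, one tie, one loss.
print(win_rate([">", ">", "=", "<"]))  # 0.625
```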

2. Dual-Parameter Model

Introduces a dual-parameter model describing the behavior of the optimal language model relative to a baseline:

P(F(q) < f̄(q)|F = f*) = ε_loss
P(F(q) = f̄(q)|F = f*) = ε_draw  
P(F(q) > f̄(q)|F = f*) = 1 - ε_loss - ε_draw
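
One way to read the dual-parameter model is as a likelihood for a Bayesian update over the hypothesis "model j is the best". The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes each annotated query yields one judged outcome per candidate against the baseline, and that under the hypothesis F = fj only the outcome of fj is informative. The function names are hypothetical.

```python
from typing import List


def outcome_likelihood(outcome: str, eps_loss: float, eps_draw: float) -> float:
    """P(outcome against the baseline | the model is the best one)."""
    if outcome == "<":
        return eps_loss
    if outcome == "=":
        return eps_draw
    if outcome == ">":
        return 1.0 - eps_loss - eps_draw
    raise ValueError(f"unknown outcome: {outcome!r}")


def update_posterior(prior: List[float], outcomes: List[str],
                     eps_loss: float, eps_draw: float) -> List[float]:
    """Bayes update after one annotated query.

    outcomes[j] is the judged result of model j against the baseline.
    Assumption: outcomes of the other models carry no information
    about the hypothesis "model j is best".
    """
    unnormalized = [p * outcome_likelihood(o, eps_loss, eps_draw)
                    for p, o in zip(prior, outcomes)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Three candidates, uniform prior; model 0 wins, model 1 draws, model 2 loses.
posterior = update_posterior([1 / 3] * 3, [">", "=", "<"],
                             eps_loss=0.15, eps_draw=0.35)
print([round(p, 3) for p in posterior])  # [0.5, 0.35, 0.15]
```

With small ε_loss, a single observed loss sharply lowers the belief that a model is the best, which is what makes some queries more informative than others.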

3. Sequential Information Maximization Algorithm

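Employs a greedy strategy to progressively select queries. Because I(F; A) = H(F) − H(F | A) and H(F) does not depend on the chosen query, maximizing information gain at each step is equivalent to minimizing the expected posterior entropy: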

qt = argmin_{q∈Ut} E_R[H(F | At ∪ {(q,R)})]
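
The greedy rule can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes the expectation over the unknown annotation R is approximated by averaging over a few guessed outcome vectors per query (for example, one per weak judge), and it reuses the simple per-model likelihood from the dual-parameter model. All function names and the toy data are hypothetical.

```python
import math
from typing import Dict, List


def entropy(p: List[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def likelihood(outcome: str, eps_loss: float, eps_draw: float) -> float:
    return {"<": eps_loss, "=": eps_draw,
            ">": 1.0 - eps_loss - eps_draw}[outcome]


def posterior_after(prior: List[float], outcomes: List[str],
                    eps_loss: float, eps_draw: float) -> List[float]:
    unnorm = [p * likelihood(o, eps_loss, eps_draw)
              for p, o in zip(prior, outcomes)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def expected_entropy(prior: List[float], guesses: List[List[str]],
                     eps_loss: float, eps_draw: float) -> float:
    """Average posterior entropy over guessed outcome vectors (assumed equally likely)."""
    return sum(entropy(posterior_after(prior, g, eps_loss, eps_draw))
               for g in guesses) / len(guesses)


def select_query(prior: List[float], unlabeled: Dict[str, List[List[str]]],
                 eps_loss: float, eps_draw: float) -> str:
    """Pick the unlabeled query with the lowest expected posterior entropy."""
    return min(unlabeled,
               key=lambda q: expected_entropy(prior, unlabeled[q],
                                              eps_loss, eps_draw))


prior = [1 / 3] * 3
unlabeled = {
    "q1": [[">", ">", ">"], [">", ">", ">"]],  # every model beats the baseline
    "q2": [[">", "<", "<"], [">", "<", "="]],  # the models disagree
}
print(select_query(prior, unlabeled, eps_loss=0.2, eps_draw=0.4))  # q2
```

A query on which all candidates are expected to behave the same leaves the posterior unchanged, so the rule prefers queries where the candidates are expected to differ.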

4. Weak Judge Mechanism

Uses k-gram language models as weak judges:

Constructs k-gram models based on candidate model responses
Compares response quality through average sequence likelihood ratios
Employs ensemble results from multiple weak judges (z=10)
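
A single weak judge of this kind might look like the sketch below. The details are assumptions chosen for illustration: whitespace tokenization, add-one smoothing, a fixed tie margin, and the names `KGramModel` and `weak_judge` are not taken from the paper. The difference of average log-likelihoods used here is the logarithm of the ratio of per-token (geometric mean) likelihoods.

```python
import math
from collections import Counter
from typing import List, Tuple


def kgrams(tokens: List[str], k: int) -> List[Tuple[str, ...]]:
    """One k-gram per token, left-padded with start symbols."""
    padded = ["<s>"] * (k - 1) + tokens
    return [tuple(padded[i:i + k]) for i in range(len(tokens))]


class KGramModel:
    """Add-one smoothed k-gram model built from candidate responses."""

    def __init__(self, corpus: List[str], k: int = 2) -> None:
        self.k = k
        self.counts: Counter = Counter()
        self.context_counts: Counter = Counter()
        self.vocab = set()
        for text in corpus:
            tokens = text.split()  # assumption: whitespace tokenization
            self.vocab.update(tokens)
            for gram in kgrams(tokens, k):
                self.counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def avg_log_likelihood(self, text: str) -> float:
        tokens = text.split()
        if not tokens:
            return float("-inf")
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen tokens
        total = 0.0
        for gram in kgrams(tokens, self.k):
            numerator = self.counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total / len(tokens)


def weak_judge(model: KGramModel, resp_a: str, resp_b: str,
               tie_margin: float = 0.05) -> str:
    """Return ">", "<" or "=" from the average log-likelihood difference."""
    diff = model.avg_log_likelihood(resp_a) - model.avg_log_likelihood(resp_b)
    if abs(diff) <= tie_margin:
        return "="
    return ">" if diff > 0 else "<"


corpus = ["the cat sat on the mat", "the cat sat on the rug", "a dog ran"]
model = KGramModel(corpus, k=2)
print(weak_judge(model, "the cat sat on the mat", "mat the on sat cat the"))  # >
```

How the z=10 judges in the ensemble differ from one another is not described in this summary, so the sketch stops at a single judge.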

Technical Innovations

Information-Theoretic Driven Selection: First application of Shannon mutual information to LLM selection with solid theoretical foundation
Weak Judge Ensemble: Uses an ensemble of k-gram models as a noisy oracle, which allows the parameters to be optimized without ground-truth annotations
Baseline Comparison Strategy: Reduces complexity from O(m²) to O(m) by comparing against a single baseline model
Adaptive Parameter Selection: Automatically determines ε_loss and ε_draw parameters through weak judge ensemble
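
To make the O(m²) versus O(m) difference concrete, the short calculation below counts judge calls per annotated query for two of the benchmark sizes listed later in this summary. It is only an arithmetic illustration; whether the baseline is itself one of the m candidates is not stated here, so the baseline count is an upper bound.

```python
def pairwise_comparisons(m: int) -> int:
    """Judge calls per query when every pair of models is compared."""
    return m * (m - 1) // 2


def baseline_comparisons(m: int) -> int:
    """Judge calls per query when each model is compared to one baseline (upper bound)."""
    return m


for m in (6, 68):  # MT-Bench and Arena-Hard candidate counts
    print(m, pairwise_comparisons(m), baseline_comparisons(m))
# 6 15 6
# 68 2278 68
```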

Experimental Setup

Datasets

Experiments cover 6 benchmarks with 151 LLMs:

Dataset	Queries	LLMs	Category	Win Rate Range
AlpacaEval	805	53	General Dialogue	15.22%-97.64%
Arena-Hard	500	68	General Dialogue	5.20%-84.70%
MT-Bench	80	6	General Dialogue	5.63%-81.88%
Flickr30k	1000	51	Vision-Language	17.25%-64.85%
Bingo	762	31	Vision-Language	0.13%-55.91%
MediQA	150	9	Medical QA	33.67%-51.00%

Evaluation Metrics

Identification Probability: Proportion of experiments correctly identifying the best model
Annotation Efficiency: Percentage reduction in required annotations compared to the best baseline method
95th Percentile Win Rate Gap: 95th percentile of win rate differences between selected and absolute best models
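
The first and third metrics can be computed from repeated runs as in the sketch below. The nearest-rank percentile definition, the function names, and the toy win rates are assumptions for illustration; the paper may use a different percentile convention.

```python
from typing import Dict, List, Sequence


def identification_probability(selected: Sequence[str], best: str) -> float:
    """Fraction of runs whose selected model is the true best model."""
    return sum(s == best for s in selected) / len(selected)


def win_rate_gaps(selected: Sequence[str], win_rates: Dict[str, float],
                  best: str) -> List[float]:
    """Per-run gap between the best model's win rate and the selected model's."""
    return [win_rates[best] - win_rates[s] for s in selected]


def percentile_95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, computed with integer arithmetic."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


# Toy data: win rates in percent and the model picked in each of 20 runs.
win_rates = {"A": 80.0, "B": 78.0, "C": 60.0}
selected = ["A"] * 17 + ["B"] * 2 + ["C"]

print(identification_probability(selected, best="A"))          # 0.85
print(percentile_95(win_rate_gaps(selected, win_rates, "A")))  # 2.0
```

The example shows why the gap metric complements identification probability: runs that miss the best model but pick a close second still score well on the gap.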

Baseline Methods

Random: Random query selection
Bradley-Terry: Based on Bradley-Terry coefficient posterior distribution
Most Draws: Selects queries with most draws against baseline
Uncertainty: Uncertainty sampling-based approach
Confidence: Confidence sampling-based approach

Implementation Details

Oracle Judge: GPT-4 for text tasks, Prometheus-Vision for vision-language tasks
Number of Weak Judges: z=10
Parameter Optimization: Grid search for determining ε_loss and ε_draw
Experimental Setup: Multiple runs per configuration for performance estimation
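
The grid search over ε_loss and ε_draw can be sketched as below. The 0.05 step is an assumption (it is consistent with the parameter values reported later in this summary), and the scoring function is a hypothetical stand-in: in practice it would measure selection quality under weak-judge pseudo-annotations, which this sketch does not implement.

```python
from typing import Callable, List, Tuple


def epsilon_grid(step_percent: int = 5) -> List[Tuple[float, float]]:
    """All (eps_loss, eps_draw) pairs on the grid with eps_loss + eps_draw < 1."""
    grid = []
    for loss in range(step_percent, 100, step_percent):
        for draw in range(step_percent, 100 - loss, step_percent):
            grid.append((loss / 100, draw / 100))
    return grid


def grid_search(score: Callable[[float, float], float]) -> Tuple[float, float]:
    """Return the grid point with the highest score."""
    return max(epsilon_grid(), key=lambda pair: score(*pair))


def toy_score(eps_loss: float, eps_draw: float) -> float:
    """Hypothetical placeholder score that peaks at (0.20, 0.40)."""
    return -((eps_loss - 0.20) ** 2 + (eps_draw - 0.40) ** 2)


print(grid_search(toy_score))  # (0.2, 0.4)
```

Restricting the grid to ε_loss + ε_draw < 1 keeps the win probability 1 − ε_loss − ε_draw strictly positive.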

Experimental Results

Main Results

1. Identification Probability Performance

LLM SELECTOR significantly outperforms baseline methods across multiple datasets:

Arena-Hard: Achieves 100% identification probability with 58.33% annotation reduction
MediQA: Reduces annotations by 50.40%
MT-Bench: Reduces annotations by 40.00%
Comparable or superior to strongest baselines on other benchmarks

2. Annotation Efficiency (Near-Optimal Models)

Change in required annotations when selecting near-optimal models within win rate gap δ (↓ denotes a reduction, ↑ an increase):

Dataset	δ=1%	δ=2.5%	δ=5%
Arena-Hard	↓59.62%	↓59.62%	↓58.42%
AlpacaEval	↑7.06%	↓30.99%	↓35.85%
MT-Bench	↓40.00%	↓40.00%	↓42.68%
Flickr30k	↓3.39%	↓6.25%	↓36.47%

Ablation Studies

1. Parameter Sensitivity Analysis

Optimal parameters determined through 1000 runs:

Arena-Hard: ε_loss=0.20, ε_draw=0.40
AlpacaEval: ε_loss=0.20, ε_draw=0.40
MT-Bench: ε_loss=0.15, ε_draw=0.35

2. Impact of Weak Judge Quantity

z=10 was determined to be the optimal choice; weak judges beyond this number provide limited new information.

Robustness Analysis

95th percentile win rate gap analysis demonstrates that LLM SELECTOR maintains small accuracy gaps across different budgets, achieving best or near-best performance in most cases.

Related Work

1. LLM Evaluation Methods

Traditional Benchmarks: Multiple-choice and short-answer benchmarks (MMLU, HellaSwag, etc.)
Reference-Based Benchmarks: BLEU, ROUGE evaluation for summarization and translation tasks
Judge-Based Benchmarks: LMArena, Arena-Hard, AlpacaEval based on LLM-as-a-Judge

2. Active Model Selection

Existing work primarily focuses on:

Classification Tasks: Application of traditional active learning in classification scenarios
Online Settings: Scenarios where data arrives in streams
Dual-Model Comparison: Limited to two candidate models

3. Advantages of This Work

First active model selection for LLM generation tasks
Supports arbitrary number of candidate models
Data-centric perspective prioritizing annotation samples over model pairs

Conclusions and Discussion

Main Conclusions

Effectiveness Validation: LLM SELECTOR significantly reduces annotation costs across multiple benchmarks
Consistent Performance: Remains competitive across benchmarks, whereas the performance of the baselines is unstable
Practical Value: Completely model-agnostic design makes it suitable for real-world deployment scenarios

Limitations

Baseline Dependency: Method performance partially depends on baseline model selection quality
Parameter Tuning: Requires pre-determining ε_loss and ε_draw parameters
Judge Quality: Depends on oracle judge quality and consistency
Computational Overhead: Weak judge computation may become a bottleneck in large-scale scenarios

Future Directions

Adaptive Parameters: Develop adaptive versions without preset parameters
Multi-Task Extension: Extend to joint multi-task selection scenarios
Online Learning: Incorporate online learning for dynamic model collections
Theoretical Analysis: Provide deeper theoretical guarantees and convergence analysis

In-Depth Evaluation

Strengths

Problem Importance: Addresses an important practical problem in the LLM era
Methodological Innovation: First systematic application of active learning ideas to LLM selection
Theoretical Foundation: Solid information-theoretic basis
Comprehensive Experiments: Extensive validation across multiple domains and 151 models
Practical Design: Model-agnostic design applicable to API scenarios

Weaknesses

Judge Dependency: Method effectiveness strongly depends on oracle judge quality
Parameter Sensitivity: Requires per-dataset parameter tuning, potentially limiting generalization
Insufficient Theoretical Analysis: Lacks theoretical guarantees on convergence and sample complexity
Computational Complexity: Insufficient analysis of weak judge computational overhead

Impact

Academic Contribution: Opens new research direction in active LLM selection
Practical Value: Provides effective tools for real-world LLM deployment
Reproducibility: Provides complete open-source implementation
Extensibility: Establishes foundational framework for subsequent research

Applicable Scenarios

Resource-Constrained Environments: Real-world applications with limited annotation budgets
Domain-Specific Applications: Scenarios requiring model selection for specific data distributions
API Service Selection: Choosing among multiple commercial API services
Continuous Evaluation: Dynamic environments requiring periodic model selection updates

References

The paper cites extensive related work including:

LLM Evaluation Benchmarks: HELM (Liang et al., 2023), OpenCompass (2023)
Active Learning: Chen et al. (2015), Okanovic et al. (2025)
LLM-as-a-Judge: Zheng et al. (2023), Li et al. (2024)
Preference Learning: Rafailov et al. (2023), Ouyang et al. (2022)

Overall Assessment: This is a high-quality paper addressing an important practical problem, proposing the first active model selection framework for LLMs with significant contributions in methodological innovation, experimental validation, and practical value. While there remains room for improvement in theoretical analysis and parameter adaptation, it opens new research directions in the LLM selection field with substantial academic and practical significance.