We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.
This paper introduces LLM SELECTOR, the first active model selection framework for large language models (LLMs). Unlike traditional evaluation and benchmarking methods that rely on fully annotated datasets, LLM SELECTOR identifies the best LLM for a given task with only a limited number of annotations. For any given task, it adaptively selects a small set of queries to annotate, choosing those that are most informative about which model is best. To further reduce annotation costs, the method employs a judge-based oracle annotation model. In extensive experiments with 151 LLMs across 6 benchmarks, LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best or a near-best LLM.
With the rapid proliferation of large language models, selecting the optimal LLM for specific applications or data distributions without retraining has become increasingly challenging. Existing model selection methods face the following challenges:
The number of available models is growing exponentially and spans diverse pre-trained models on academic and commercial platforms
Different LLMs exhibit significant performance variations across domains, tasks, and languages
Existing benchmarks struggle to keep pace with rapid model releases and often focus on standardized tasks
Given a set of n unannotated queries Q = {q_i | i ∈ [n]} and m pre-trained language models M = {f_j : Q → R | j ∈ [m]}, where R denotes the space of responses, the objective is to identify the optimal model f* that produces the highest quality responses for the queries in Q under a limited annotation budget b ≪ n.
The problem is formalized as maximizing mutual information:
A_opt[b] = argmax_{A⊆{(qi,ri)|i∈[n]}, |A|≤b} I(F; A)
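The objective above can be illustrated with a minimal sketch of greedy, adaptive information-gain selection. This is a simplified illustration, not the paper's exact procedure, and it rests on several assumptions: F is read as the random variable for the unknown best model; each model gives a discrete answer to each query (the hypothetical `answers` matrix); the best model's answer matches the oracle annotation with probability 1 - eps (the hypothetical noise parameter `eps`); and `oracle` is a hypothetical callable that returns the annotation for a query index. At each step the sketch annotates the query whose label is expected to reduce the entropy of the posterior over F the most.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def likelihood(answers_i, label, eps, n_classes):
    """P(label | F = j) for every model j (assumed noise model)."""
    return np.where(answers_i == label, 1.0 - eps, eps / (n_classes - 1))

def expected_info_gain(posterior, answers_i, eps, n_classes):
    """One-step mutual information between F and the label of one query."""
    gain = entropy(posterior)
    for label in range(n_classes):
        lik = likelihood(answers_i, label, eps, n_classes)
        p_label = float((posterior * lik).sum())
        if p_label > 0:
            gain -= p_label * entropy(posterior * lik / p_label)
    return gain

def select_queries(answers, oracle, budget, eps=0.2):
    """Greedily annotate up to `budget` queries; return them and the posterior over F."""
    m, n = answers.shape
    n_classes = int(answers.max()) + 1
    posterior = np.full(m, 1.0 / m)       # uniform prior over "model j is best"
    remaining = set(range(n))
    chosen = []
    for _ in range(min(budget, n)):
        best = max(
            sorted(remaining),
            key=lambda i: expected_info_gain(posterior, answers[:, i], eps, n_classes),
        )
        label = oracle(best)              # spend one unit of annotation budget
        posterior = posterior * likelihood(answers[:, best], label, eps, n_classes)
        posterior /= posterior.sum()      # Bayesian update
        remaining.remove(best)
        chosen.append(best)
    return chosen, posterior

# Toy usage: 3 models, 4 queries, binary answers; model 0 always matches the oracle.
answers = np.array([[0, 1, 0, 1],
                    [0, 0, 1, 1],
                    [0, 1, 1, 0]])
truth = answers[0]
chosen, posterior = select_queries(answers, lambda i: int(truth[i]), budget=3)
```

In this toy setup all models agree on the first query, so annotating it carries no information about F, and the greedy rule spends the budget on the queries where the models disagree.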
Analysis of the 95th-percentile win-rate gap shows that LLM SELECTOR keeps the gap small across different annotation budgets, achieving best or near-best performance in most cases.
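As a rough illustration of how such a metric could be computed, the sketch below assumes one plausible reading: over repeated selection runs, the gap is the true best model's win rate minus the win rate of the model selected in that run, and the reported number is the 95th percentile of these gaps. The function name `percentile_gap` and its inputs are hypothetical; the paper's exact definition may differ.

```python
import numpy as np

def percentile_gap(win_rates, selected, q=95.0):
    """q-th percentile of (best win rate - selected model's win rate) over runs.

    win_rates: true win rate of each candidate model (assumed known for evaluation).
    selected:  index of the model chosen in each repeated run.
    """
    win_rates = np.asarray(win_rates, dtype=float)
    gaps = win_rates.max() - win_rates[np.asarray(selected)]
    return float(np.percentile(gaps, q))

# Toy usage with made-up win rates for three models.
win_rates = [0.50, 0.70, 0.65]
always_best = percentile_gap(win_rates, [1, 1, 1])    # every run picks the best model
always_worst = percentile_gap(win_rates, [0, 0, 0])   # every run picks model 0
```

A value near zero means that in at least 95% of runs the selected model is the best one or very close to it.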
LLM Evaluation Benchmarks: HELM (Liang et al., 2023), OpenCompass (2023)
Active Learning: Chen et al. (2015), Okanovic et al. (2025)
LLM-as-a-Judge: Zheng et al. (2023), Li et al. (2024)
Preference Learning: Rafailov et al. (2023), Ouyang et al. (2022)
Overall Assessment: This is a high-quality paper addressing an important practical problem, proposing the first active model selection framework for LLMs with significant contributions in methodological innovation, experimental validation, and practical value. While there remains room for improvement in theoretical analysis and parameter adaptation, it opens new research directions in the LLM selection field with substantial academic and practical significance.