
Harnessing Consistency for Robust Test-Time LLM Ensemble

Zeng, Yu, Lin et al.

Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. Model-level consistency models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.


Harnessing Consistency for Robust Test-Time LLM Ensemble

Basic Information

Paper ID: 2510.13855
Title: Harnessing Consistency for Robust Test-Time LLM Ensemble
Authors: Zhichen Zeng, Qi Yu, Xiao Lin, Ruizhong Qiu, Xuying Ning, Tianxin Wei, Yuchen Yan, Jingrui He, Hanghang Tong (University of Illinois Urbana-Champaign)
Categories: cs.CL, cs.AI
Published: October 12, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.13855

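Summary

Different large language models (LLMs) exhibit different strengths and weaknesses, and LLM ensembling is a promising way to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, little attention has been paid to how robust ensembles are against potentially erroneous signals arising from heterogeneous tokenization schemes and varying model expertise. The paper's analysis shows that ensemble failures typically arise at two levels, the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. Building on this, the authors propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensembling and integrates seamlessly with a variety of ensemble methods.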

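Research Background and Motivation

Problem Definition

Existing LLM ensemble methods focus mainly on improving ensemble quality, but they lack robustness when facing the following challenges:

Heterogeneous tokenization schemes: different LLMs use different tokenizers, which leads to mismatched token spaces
Differences in model expertise: different models show marked performance gaps across domains
Propagation of erroneous signals: token alignment errors and model prediction errors degrade the accuracy of the ensemble output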

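Significance of the Research

Robustness of LLM ensembles matters for practical applications, for the following reasons:

Incorrect token alignment can lead to errors in probability fusion
Model prediction errors can further degrade the accuracy of the ensemble output
A lack of robustness leads to the "negative ensemble" phenomenon, in which the ensemble performs worse than the best single model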

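Limitations of Existing Methods

Existing ensemble methods fall into two categories:

Token-level ensemble: aligns and fuses the token probabilities of different LLMs at each decoding step, but is vulnerable to token alignment errors
Response-level ensemble: selects complete responses or spans, but ignores fine-grained token-level consistency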
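
To make the token-level failure mode concrete, here is a minimal sketch of token-level fusion under stated assumptions. It is not the paper's implementation: the helper names (`project_to_main`, `fuse`, `argmax`), the toy vocabularies, the equal fusion weights, and the hand-written alignment matrices are all hypothetical. Each alignment matrix has one row per assistant token and one column per main-model token, and the assistant distribution is projected into the main vocabulary before a weighted average is taken.

```python
from typing import List


def project_to_main(p_assist: List[float], align: List[List[float]]) -> List[float]:
    """Map an assistant distribution into the main vocabulary.

    `align` has one row per assistant token and one column per main token;
    row j says how assistant token j spreads its mass over main tokens.
    """
    n_main = len(align[0])
    projected = [0.0] * n_main
    for j, p in enumerate(p_assist):
        for k in range(n_main):
            projected[k] += p * align[j][k]
    return projected


def fuse(p_main: List[float], projected: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of the main distribution and the projected ones.

    weights[0] belongs to the main model, weights[1:] to the assistants.
    The weights are assumed to sum to 1 (an assumption of this sketch).
    """
    dists = [p_main] + projected
    return [sum(w * d[k] for w, d in zip(weights, dists)) for k in range(len(p_main))]


def argmax(xs: List[float]) -> int:
    """Index of the largest entry."""
    return max(range(len(xs)), key=xs.__getitem__)


# Toy setup: the main vocabulary has 3 tokens, the assistant vocabulary has 2.
p_main = [0.5, 0.4, 0.1]
p_assist = [0.9, 0.1]

# Correct alignment: assistant token 0 corresponds to main token 1.
good_align = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Misalignment: assistant token 0 is wrongly mapped to main token 2.
bad_align = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

fused_good = fuse(p_main, [project_to_main(p_assist, good_align)], [0.5, 0.5])
fused_bad = fuse(p_main, [project_to_main(p_assist, bad_align)], [0.5, 0.5])

print("good alignment picks main token", argmax(fused_good))
print("bad alignment picks main token", argmax(fused_bad))
```

With the correct alignment the assistant's confident vote lands on main token 1 and the fused distribution selects it. With the misaligned matrix the same vote is moved onto main token 2, so the fused output changes even though neither model's prediction did. Errors of this kind are what a robustness mechanism at the token level would need to downweight.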

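Core Contributions

The first systematic study of robustness in LLM ensembles, filling an important gap in the field
The CoRE framework, which evaluates consistency at both the token level and the model level to improve ensemble performance and robustness
A plug-and-play design that integrates seamlessly into a variety of LLM ensemble strategies at no additional inference cost
Comprehensive experimental validation showing consistent improvements across multiple benchmark tasks, model combinations, and ensemble methods, with average performance gains of 1.3% and 2.8% for Top-2 and Top-3 model ensembles, respectively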

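Method Details

Task Definition

Given a main model (vocabulary $V_{main}$) and N assistant models (vocabularies $V_{assist_i}$), the goal is to learn a token alignment matrix $A_i \in \mathbb{R}^{|V_{assist_i}| \times |V_{main}|}$ for each assistant model and to produce the ensemble probability distribution through weighted fusion: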