
Harnessing Consistency for Robust Test-Time LLM Ensemble

Zeng, Yu, Lin et al.

Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. Model-level consistency models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.


Harnessing Consistency for Robust Test-Time LLM Ensemble

Basic Information

Paper ID: 2510.13855
Title: Harnessing Consistency for Robust Test-Time LLM Ensemble
Authors: Zhichen Zeng, Qi Yu, Xiao Lin, Ruizhong Qiu, Xuying Ning, Tianxin Wei, Yuchen Yan, Jingrui He, Hanghang Tong (University of Illinois Urbana-Champaign)
Categories: cs.CL, cs.AI
Published: October 12, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.13855

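Abstract

Different large language models (LLMs) exhibit distinct strengths and weaknesses, and LLM ensembling is a promising way to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to ensemble robustness against potentially erroneous signals arising from heterogeneous tokenization schemes and varying model expertise. The paper's analysis shows that ensemble failures arise at two levels, the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. Building on this, the authors propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensembling and can be seamlessly integrated with diverse ensemble methods.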

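Research Background and Motivation

Problem Definition

Existing LLM ensemble methods focus mainly on improving ensemble quality, but they lack robustness when faced with the following challenges:

Heterogeneous tokenization schemes: different LLMs use different tokenizers, which leads to mismatched token spaces
Differences in model expertise: different models show marked performance gaps across domains
Error signal propagation: token alignment errors and model prediction errors degrade the accuracy of the ensemble output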

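Importance of the Research

The robustness of LLM ensembles matters in practical applications:

Incorrect token alignment can lead to erroneous probability fusion
Errors in model predictions further degrade the accuracy of the ensemble output
Lack of robustness leads to the "negative ensemble" phenomenon, in which the ensemble performs worse than the best single model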

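Limitations of Existing Methods

Existing ensemble methods fall into two categories:

Token-level ensemble: aligns and fuses the token probabilities of different LLMs at each decoding step, but is vulnerable to token alignment errors
Response-level ensemble: selects complete responses or spans, but ignores fine-grained token-level consistency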
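
To make the token-level case concrete, the following is a minimal sketch of probability fusion through an alignment matrix and of how a faulty alignment can change the fused prediction. It is not the paper's method or any library's API: the function names (`project_to_main`, `fuse`, `argmax`), the toy vocabularies, the row-stochastic alignment matrices, and the uniform fusion weights are all assumptions chosen for illustration.

```python
# Toy sketch of token-level probability fusion; all names and numbers are illustrative.

def project_to_main(assist_probs, alignment):
    """Map an assistant distribution onto the main vocabulary.

    alignment[j][k] is the weight linking assistant token j to main token k;
    each row is assumed to sum to 1 so probability mass is preserved.
    """
    n_main = len(alignment[0])
    projected = [0.0] * n_main
    for j, p in enumerate(assist_probs):
        for k in range(n_main):
            projected[k] += p * alignment[j][k]
    return projected


def fuse(main_probs, projected_list, weights):
    """Weighted average of the main distribution and the projected assistants."""
    dists = [main_probs] + projected_list
    fused = [sum(w * d[k] for w, d in zip(weights, dists))
             for k in range(len(main_probs))]
    total = sum(fused)
    return [x / total for x in fused]  # renormalize in case weights do not sum to 1


def argmax(xs):
    """Index of the largest entry."""
    return max(range(len(xs)), key=lambda i: xs[i])


main_probs = [0.5, 0.3, 0.2]           # main vocabulary: 3 hypothetical tokens
assist_probs = [0.1, 0.2, 0.3, 0.4]    # assistant vocabulary: 4 hypothetical tokens

# Correct alignment: assistant tokens 2 and 3 both map to main token 2.
good_alignment = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
# Faulty alignment: the same two tokens are wrongly mapped to main token 0.
bad_alignment = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]]

weights = [0.5, 0.5]  # uniform weights, an assumption for this sketch
fused_good = fuse(main_probs, [project_to_main(assist_probs, good_alignment)], weights)
fused_bad = fuse(main_probs, [project_to_main(assist_probs, bad_alignment)], weights)

print(argmax(fused_good), argmax(fused_bad))
```

In this toy setup the correctly aligned fusion selects main token 2, while the faulty alignment shifts the assistant's probability mass and the fused prediction flips to main token 0. This is the kind of erroneous signal that a token-level ensemble is exposed to.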

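Core Contributions

The first systematic study of robustness in LLM ensembles, filling an important gap in the field
The CoRE framework, which assesses consistency at both the token level and the model level to strengthen ensemble performance and robustness
A plug-and-play design that integrates seamlessly with diverse LLM ensemble strategies and adds no extra inference cost
Comprehensive experimental validation, with consistent improvements across multiple benchmark tasks, model combinations, and ensemble methods, including average performance gains of 1.3% and 2.8% for Top-2 and Top-3 model ensembles, respectively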

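Detailed Method Description

Task Definition

Given a main model (vocabulary $V_{main}$) and N assistant models (vocabulary $V_{assist_i}$), the goal is to learn a token alignment matrix $A_i \in \mathbb{R}^{|V_{assist_i}| \times |V_{main}|}$ for each assistant model and to generate the ensemble probability distribution through weighted fusion: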