2025-11-20T00:19:14.561040

MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

Zibakhsh, Samragh, Nishu et al.

The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.

academic

MoEは思っているより強い：RoEによるハイパー並列推論スケーリング

基本情報

論文ID: 2509.17238
タイトル: MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE
著者: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho (Apple & UCSD)
分類: cs.AI, cs.CL, cs.LG
発表状況: プレプリント。査読中
論文リンク: https://arxiv.org/abs/2509.17238v2

要約

本論文は、トークンレベルで複数の出力提案を計算・集約することにより予測品質を向上させる新しい推論パラダイムであるハイパー並列スケーリング(hyper-parallel scaling)を提案している。具体的な実装は専門家名簿(Roster of Experts, RoE)方法であり、これは訓練不要な推論アルゴリズムであり、単一のMoEモデルを動的なMoEアンサンブルに変換する。RoEは専門家ルーティング機構に制御された確率性を注入することで、各トークンに対して複数の異なる専門家をサンプリングし、その出力を集約してより正確な最終予測を得る。効率的なバッチ処理戦略と専用のKVキャッシュ機構により、RoEは7B MoEモデルが10.5B MoEモデルの性能を達成することを可能にし、同時に推論計算量を30%削減する。

研究背景と動機

問題定義

従来の推論時スケーリング方法は主に2つのカテゴリに分類される：

逐次スケーリング(Sequential Scaling): 思考の連鎖(Chain-of-Thought)など、より長く構造化された出力を生成することでパフォーマンスを向上させる
並列スケーリング(Parallel Scaling): 自己一貫性(Self-Consistency)など、複数の独立した配列を生成し結果を集約する