
Test-Time Alignment for Large Language Models via Textual Model Predictive Control

Wang, Chen, Hung et al.

Aligning Large Language Models (LLMs) with human preferences through finetuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory to propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, where TMPC analyzes generation rollouts to retrospectively identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation). (2) Subgoal-Conditioned Re-Generation, where these identified subgoals are used to guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC ensures stable improvement by building upon previously validated successes. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting its generality.


Basic Information

Paper ID: 2502.20795
Title: Test-Time Alignment for Large Language Models via Textual Model Predictive Control
Authors: Kuang-Da Wang, Teng-Ruei Chen, Yu-Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
Affiliations: National Yang Ming Chiao Tung University, NVIDIA
Category: cs.CL (Computation and Language)
Published: February 2025
Paper link: https://arxiv.org/abs/2502.20795v3

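Abstract

Aligning large language models with human preferences usually requires finetuning, which is resource-intensive and motivates lightweight test-time alternatives. The paper approaches test-time alignment from the perspective of sequential decision making, which reveals two fundamental challenges: when actions are defined at the token level (as in guided decoding), alignment suffers from the curse of horizon; when actions are defined at the response level (as in traditional iterative refinement), it suffers from the curse of dimensionality. To resolve this trade-off, the authors draw on Model Predictive Control (MPC) from control theory and propose Textual Model Predictive Control (TMPC), a new predictive planning framework for aligning LLMs at inference time.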

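Background and Motivation

Problem Background

Importance of alignment: although LLMs perform well on a wide range of NLP tasks, aligning their outputs with human preferences remains a key challenge, especially for smaller LLMs (e.g., under 10B parameters).
Limitations of existing methods:
- Training-time alignment methods (e.g., RLHF, DPO) are resource-intensive and require expensive retraining
- Test-time alignment methods face a fundamental trade-off:
  - Token-level guided decoding suffers from the curse of horizon
  - Response-level iterative refinement suffers from the curse of dimensionality
Motivation: a test-time alignment method is needed that avoids expensive model retraining while balancing the planning horizon against the size of the search space.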

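Core Contributions

Novel problem formulation: test-time alignment is formulated, for the first time, as a sequential decision-making problem, which unifies existing methods and exposes their fundamental trade-off.
TMPC framework: the paper proposes Textual Model Predictive Control, which adapts concepts from control theory to language generation.
Two core principles:
- Hindsight Subgoal Identification: discovers meaningful planning steps from rollouts
- Subgoal-Conditioned Re-Generation: improves iteratively by building on validated subgoals
Broad experimental validation: the method's effectiveness and generality are evaluated on three tasks with different characteristics.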

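Method Details

Task Definition

Text generation is modeled as a finite-horizon Markov Decision Process (MDP):

State space S: all possible text prefixes
Action space A: all possible generation units
Transition function P: deterministic transitions
Reward function R: scalar feedback that evaluates alignment quality
Objective: find the optimal action sequence $a^* = \arg\max_{a_{0:T-1}} \sum_{t=0}^{T-1} R(s_t, a_t)$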
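
The MDP above can be made concrete with a small sketch. It assumes the deterministic transition is plain string concatenation of the prefix and the chosen generation unit, which the source does not state explicitly, and it uses a hypothetical toy reward in place of a real alignment reward model. The names `transition`, `cumulative_reward`, and `toy_reward` are illustrative only.

```python
from typing import Callable, Sequence


def transition(state: str, action: str) -> str:
    """Deterministic transition (assumed): append the generation unit to the prefix."""
    return state + action


def cumulative_reward(
    s0: str,
    actions: Sequence[str],
    reward: Callable[[str, str], float],
) -> float:
    """Sum R(s_t, a_t) over t = 0..T-1, stepping the state deterministically."""
    state, total = s0, 0.0
    for a in actions:
        total += reward(state, a)
        state = transition(state, a)
    return total


def toy_reward(state: str, action: str) -> float:
    """Hypothetical reward: 1 if the unit ends with a period, else 0."""
    return 1.0 if action.strip().endswith(".") else 0.0


total = cumulative_reward("", ["Hello world. ", "No stop ", "Done."], toy_reward)
```

The objective in the task definition corresponds to searching over `actions` for the sequence that maximizes `cumulative_reward`.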

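TMPC Framework Architecture

1. Basic MPC Adaptation

TMPC adapts conventional MPC to text generation:

$a^{\mathrm{TMPC}}(s) \leftarrow G\left(\{\tau^{(i)}\}_{i=1}^{K}, \{J(\tau^{(i)})\}_{i=1}^{K}; s\right)$

where $G$ is the aggregation function, $\tau^{(i)}$ is the $i$-th trajectory, and $J$ is its cumulative reward.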
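
As a rough illustration of this MPC step, the sketch below samples K rollouts from the current state, scores each with J, and aggregates them. The aggregation used here is the classic MPC choice of returning the first action of the highest-return rollout; that is an assumption for illustration, not the aggregation TMPC itself uses. `propose` and `score` are hypothetical stand-ins for the LLM rollout sampler and the reward model.

```python
from typing import Callable, List

Trajectory = List[str]  # a rollout, represented as a list of generation units


def mpc_action(
    state: str,
    propose: Callable[[str, int], Trajectory],
    score: Callable[[str, Trajectory], float],
    k: int,
) -> str:
    """Sample k rollouts, score them, and return the first action of the best one."""
    rollouts = [propose(state, i) for i in range(k)]
    returns = [score(state, tau) for tau in rollouts]
    best = max(range(k), key=lambda i: returns[i])
    return rollouts[best][0]


# Toy stand-ins (hypothetical): fixed candidates, reward = number of "good" units.
CANDIDATES = [["bad", "end"], ["good", "good"], ["good", "bad"]]


def toy_propose(state: str, i: int) -> Trajectory:
    return CANDIDATES[i % len(CANDIDATES)]


def toy_return(state: str, tau: Trajectory) -> float:
    return float(sum(unit == "good" for unit in tau))


action = mpc_action("", toy_propose, toy_return, k=3)
```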

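2. Implementing the Core Principles

Hindsight Subgoal Identification:

After generating multiple candidate responses, TMPC analyzes them retrospectively and identifies high-quality intermediate points as subgoals.
The subgoal buffer $B$ is updated by the rule:

$$
B \leftarrow
\begin{cases}
B \cup \tilde{a}^{\mathrm{TMPC}}_t(s), & \text{if } |B| < \text{capacity}, \\
B \setminus \{a \in B \mid R(s,a) < R(s,a')\} \cup \{a'\}, & \text{otherwise}
\end{cases}
$$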
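
One possible reading of this update rule is sketched below. It processes new candidates one at a time: while the buffer has room a candidate is simply added; once it is full, every stored subgoal with a strictly lower reward than the candidate is evicted and the candidate takes its place. Skipping a candidate that dominates nothing is an added assumption to keep the buffer within capacity, and the names `update_buffer` and `toy_reward` are hypothetical.

```python
from typing import Callable, Iterable, Set


def update_buffer(
    buffer: Set[str],
    candidates: Iterable[str],
    reward: Callable[[str, str], float],
    state: str,
    capacity: int,
) -> Set[str]:
    """Return an updated copy of the subgoal buffer B."""
    updated = set(buffer)
    for a_new in candidates:
        if len(updated) < capacity:
            # Room left: B <- B ∪ {a'}
            updated.add(a_new)
            continue
        # Full: drop stored subgoals with lower reward than a', then add a'.
        dominated = {a for a in updated if reward(state, a) < reward(state, a_new)}
        if dominated:  # assumption: ignore a' if it beats nothing in B
            updated = (updated - dominated) | {a_new}
    return updated


# Toy rewards (hypothetical) keyed by subgoal text.
TOY_REWARDS = {"a": 0.5, "b": 0.7, "c": 0.9, "d": 0.1}


def toy_reward(state: str, action: str) -> float:
    return TOY_REWARDS[action]


buf = update_buffer(set(), ["a", "b", "c"], toy_reward, state="", capacity=2)
buf_after_weak = update_buffer({"b", "c"}, ["d"], toy_reward, state="", capacity=2)
```

Note that, read literally, the rule can evict several subgoals at once, so the buffer may shrink below capacity after a strong candidate arrives.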

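Subgoal-Conditioned Re-Generation:

Aggregation function:

$\tilde{a}^{\mathrm{TMPC}}_t(s) \leftarrow G\left(\{\tau^{(i)}_t\}_{i=1}^{K}, R(\cdot) \mid s, B\right) := \{a \mid R(s,a) \ge \alpha \text{ and } a \in \{\tau^{(i)}_t\}_{i=1}^{K}\}$

New rollouts are generated by explicitly using the high-reward subgoals in buffer $B$ as conditioning signals.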
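
The aggregation and conditioning steps might look like the following sketch. `aggregate` keeps the step-t segments whose reward reaches the threshold α, as in the set definition above. How the buffer is injected into the next rollout is not specified in this summary, so `conditioned_prompt` and its prompt wording are purely hypothetical.

```python
from typing import Callable, Iterable, Set


def aggregate(
    state: str,
    segments: Iterable[str],
    reward: Callable[[str, str], float],
    alpha: float,
) -> Set[str]:
    """Keep rollout segments a with R(s, a) >= alpha."""
    return {a for a in segments if reward(state, a) >= alpha}


def conditioned_prompt(task: str, buffer: Iterable[str]) -> str:
    """Build a prompt that conditions the next rollout on buffered subgoals (assumed format)."""
    lines = [task, "Reuse these validated segments where appropriate:"]
    lines += [f"- {goal}" for goal in sorted(buffer)]
    return "\n".join(lines)


# Toy rewards (hypothetical) for three step-t segments from K = 3 rollouts.
TOY_REWARDS = {"x": 0.2, "y": 0.8, "z": 0.5}


def toy_reward(state: str, action: str) -> float:
    return TOY_REWARDS[action]


kept = aggregate("", ["x", "y", "z"], toy_reward, alpha=0.5)
prompt = conditioned_prompt("Translate the passage.", kept)
```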

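Technical Innovations

Dynamic boundary discovery: does not rely on predefined hard segment boundaries and can discover meaningful, task-specific planning steps
Inspired by hierarchical reinforcement learning: borrows ideas from hierarchical RL, decomposing long-horizon planning through subgoals
Stable cumulative progress: builds on validated subgoals to ensure stable performance improvement
No additional training: uses the pretrained LLM as both the dynamics model and the proposal distribution, with no finetuning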

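Experimental Setup

Datasets

Paragraph-level machine translation:
- WMT'24 Discourse-Level Literary Translation benchmark
- Language pairs: Chinese→English, Chinese→German, Chinese→Russian
- Each instance is split into segments of at most 1,024 tokens
Long-form response generation:
- Dahoas/full-hh-rlhf dataset
- The 6K samples with the longest responses are selected for training, and 1,024 for testing
Program synthesis:
- Official test split of the MBPP dataset
- 500 problems (Task IDs 11-510)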
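
As a quick sanity check on the program-synthesis split, the inclusive ID range 11-510 does contain 510 - 11 + 1 = 500 problems. The sketch below filters a list of problem records by that range; it assumes each record carries an integer `task_id` field, and the toy records stand in for the real dataset.

```python
from typing import Dict, List


def select_test_split(problems: List[Dict], lo: int = 11, hi: int = 510) -> List[Dict]:
    """Keep problems whose task_id lies in the inclusive range [lo, hi]."""
    return [p for p in problems if lo <= p["task_id"] <= hi]


# Toy stand-in records (hypothetical); real records would also hold the prompt and tests.
toy_problems = [{"task_id": i} for i in range(1, 975)]
subset = select_test_split(toy_problems)
```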