
Test-Time Alignment for Large Language Models via Textual Model Predictive Control

Wang, Chen, Hung et al.

Aligning Large Language Models (LLMs) with human preferences through finetuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory to propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, where TMPC analyzes generation subgoals to retrospectively identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation). (2) Subgoal-Conditioned Re-Generation, where these identified subgoals are used to guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC ensures stable improvement by building upon previously validated successes. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting its generality.

Basic Information

Paper ID: 2502.20795
Title: Test-Time Alignment for Large Language Models via Textual Model Predictive Control
Authors: Kuang-Da Wang, Teng-Ruei Chen, Yu-Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
Institutions: National Yang Ming Chiao Tung University, NVIDIA
Classification: cs.CL (Computational Linguistics)
Publication Date: February 2025
Paper Link: https://arxiv.org/abs/2502.20795v3

Abstract

Aligning large language models with human preferences typically requires fine-tuning, which is resource-intensive. This paper addresses test-time alignment from a sequential decision-making perspective, revealing two fundamental challenges: when actions are defined at the token level (e.g., guided decoding), alignment faces the "curse of horizon"; when actions are defined at the response level (e.g., traditional iterative refinement), it faces the "curse of dimensionality." To address this trade-off, the authors draw inspiration from Model Predictive Control (MPC) in control theory and propose Textual Model Predictive Control (TMPC), a novel predictive planning framework for inference-time LLM alignment.

Research Background and Motivation

Problem Background

Importance of Alignment: While large language models demonstrate excellent performance on various NLP tasks, aligning their outputs with human preferences remains a critical challenge, particularly for smaller-scale LLMs (e.g., under 10B parameters).
Limitations of Traditional Methods:
- Training-time alignment methods (e.g., RLHF, DPO) are resource-intensive, requiring expensive retraining
- Test-time alignment methods face fundamental trade-offs:
  - Token-level guided decoding suffers from the "curse of horizon"
  - Response-level iterative optimization suffers from the "curse of dimensionality"
Research Motivation: There is a need for a test-time alignment method that avoids expensive model retraining while effectively balancing temporal and search space complexity.

Core Contributions

Novel Problem Formulation: First to model test-time alignment as a sequential decision-making problem, unifying existing methods and revealing their fundamental trade-offs.
TMPC Framework: Proposes a Textual Model Predictive Control framework that adapts control-theoretic concepts to language generation tasks.
Two Core Principles:
- Hindsight Subgoal Identification: Discovering meaningful planning steps from rollouts
- Subgoal-Conditioned Re-Generation: Iterative refinement based on verified subgoals
Comprehensive Experimental Validation: Validates the method's effectiveness and generality across three tasks with different characteristics.

Methodology Details

Task Definition

Text generation is modeled as a finite-horizon Markov Decision Process (MDP):

State Space S: All possible text prefixes
Action Space A: All possible generation units
Transition Function P: Deterministic transitions
Reward Function R: Scalar feedback evaluating alignment quality
Objective: Find the optimal action sequence $a^* = \arg\max_{a_{0:T-1}} \sum_{t=0}^{T-1} R(s_t, a_t)$
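
To make the formulation concrete, here is a minimal Python sketch of the MDP above. A state is a text prefix, an action is an appended generation unit, and the transition is plain concatenation. The names `transition`, `cumulative_reward`, and `toy_reward` are illustrative assumptions, and `toy_reward` is only a stand-in for a real reward model.

```python
from typing import Callable, Sequence


def transition(state: str, action: str) -> str:
    """Deterministic transition: the next state is the prefix plus the new unit."""
    return state + action


def cumulative_reward(
    s0: str,
    actions: Sequence[str],
    reward: Callable[[str, str], float],
) -> float:
    """Sum R(s_t, a_t) along the trajectory that `actions` induces from `s0`."""
    total, state = 0.0, s0
    for action in actions:
        total += reward(state, action)
        state = transition(state, action)
    return total


def toy_reward(state: str, action: str) -> float:
    # Hypothetical reward (assumption): 1.0 for a unit that ends a sentence.
    return 1.0 if action.strip().endswith(".") else 0.0
```

The objective in the text corresponds to searching over `actions` for the sequence that maximizes `cumulative_reward`; the sketch only evaluates a given sequence and does not perform that search.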

TMPC Framework Architecture

1. Basic MPC Adaptation

TMPC adapts traditional MPC to text generation:

$a^{\text{TMPC}}(s) \leftarrow G(\{\tau^{(i)}\}_{i=1}^{K}, \{J(\tau^{(i)})\}_{i=1}^{K}; s)$

where G is an aggregation function, τ represents trajectories, and J represents cumulative rewards.
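
One planning step of this kind can be sketched as follows. The sketch samples K rollouts, scores each with J, and instantiates G as "take the first action of the highest-scoring rollout". That choice is the classical MPC one and is an assumption here, not necessarily the aggregation used in the paper. `propose` and `score` are hypothetical callables standing in for the LLM proposal distribution and the reward model.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[str]


def mpc_step(
    state: str,
    propose: Callable[[str, random.Random], Trajectory],
    score: Callable[[str, Trajectory], float],
    k: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample k rollouts from `state` and return (first action, return) of the best one."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    rollouts = [propose(state, rng) for _ in range(k)]
    returns = [score(state, tau) for tau in rollouts]
    best = max(range(k), key=lambda i: returns[i])
    # Assumed aggregation G: commit only to the first action of the best rollout.
    return rollouts[best][0], returns[best]
```

In a receding-horizon loop, the caller would append the returned action to the state and call `mpc_step` again from the new prefix.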

2. Core Principle Implementation

Hindsight Subgoal Identification:

After generating multiple candidate responses, TMPC retrospectively identifies high-quality intermediate outputs as subgoals and stores them in a buffer B.
Buffer update rule:

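$B \leftarrow \begin{cases} B \cup \tilde{a}^{\text{TMPC}}_t(s), & \text{if } |B| < \text{capacity}, \\ B \setminus \{a \in B \mid R(s,a) < R(s,a')\} \cup \{a'\}, & \text{otherwise} \end{cases}$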
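
A bounded buffer of this kind can be sketched in Python as below. The sketch follows one reading of the update rule: when the buffer is full, a new subgoal replaces the single lowest-reward entry if it scores higher, so the buffer never exceeds its capacity. That eviction policy and the `(reward, text)` representation are assumptions made for illustration.

```python
from typing import List, Tuple

Subgoal = Tuple[float, str]  # (reward, segment text); an assumed representation


def update_buffer(
    buffer: List[Subgoal], candidate: Subgoal, capacity: int
) -> List[Subgoal]:
    """Return a new buffer after offering `candidate` to it."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if len(buffer) < capacity:
        # Room left: simply add the new subgoal.
        return buffer + [candidate]
    worst = min(range(len(buffer)), key=lambda i: buffer[i][0])
    if buffer[worst][0] < candidate[0]:
        # Full buffer: evict the weakest entry in favor of the better candidate.
        return buffer[:worst] + buffer[worst + 1:] + [candidate]
    # Candidate is no better than anything stored: keep the buffer unchanged.
    return list(buffer)
```

Returning a new list instead of mutating in place keeps each planning iteration's buffer easy to inspect and compare.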

Subgoal-Conditioned Re-Generation:

Aggregation function:

$\tilde{a}^{\text{TMPC}}_t(s) \leftarrow G(\{\tau^{(i)}_t\}_{i=1}^{K}, R(\cdot) \mid s, B) := \{a \mid R(s,a) \ge \alpha \text{ and } a \in \{\tau^{(i)}_t\}_{i=1}^{K}\}$

New rollouts are then generated by explicitly conditioning on the high-reward subgoals stored in buffer B.
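
The threshold filter and the conditioning step can be sketched as follows. `identify_subgoals` mirrors the set definition above (keep segments with reward at least α). `build_conditioned_prompt` is a hypothetical prompt template showing one way buffered subgoals could be fed back to the model; the actual prompts used in the paper are not given in this summary.

```python
from typing import Callable, List, Sequence


def identify_subgoals(
    state: str,
    candidates: Sequence[str],
    reward: Callable[[str, str], float],
    alpha: float,
) -> List[str]:
    """Keep the candidate segments whose reward R(s, a) is at least alpha."""
    return [a for a in candidates if reward(state, a) >= alpha]


def build_conditioned_prompt(task_input: str, subgoals: Sequence[str]) -> str:
    """Assemble a prompt that asks the model to build on validated segments."""
    lines = [task_input]
    if subgoals:
        # Hypothetical wording (assumption): any instruction format could be used.
        lines.append("Reuse these previously validated segments:")
        lines.extend(f"- {g}" for g in subgoals)
    lines.append("Write an improved full response.")
    return "\n".join(lines)
```

With an empty buffer the prompt reduces to the plain task input plus the final instruction, which corresponds to an unconditioned first iteration.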

Technical Innovations

Dynamic Boundary Discovery: Does not rely on predefined hard segmentation boundaries; discovers task-specific meaningful planning steps.
Hierarchical Reinforcement Learning Inspiration: Incorporates ideas from hierarchical RL through subgoal decomposition of long-horizon planning tasks.
Stable Cumulative Progress: Ensures stable performance improvements by building upon verified subgoals.
Training-Free: Leverages pretrained LLMs as dynamics models and proposal distributions without fine-tuning.

Experimental Setup

Datasets

Discourse-Level Machine Translation:
- WMT'24 Discourse-Level Literary Translation benchmark
- Language pairs: Chinese→English, Chinese→German, Chinese→Russian
- Each instance segmented to at most 1024 tokens
Long-Form Response Generation:
- Dahoas/full-hh-rlhf dataset
- 6K longest response samples for training, 1024 for testing
Program Synthesis:
- MBPP official test set
- 500 problems (Task IDs 11-510)

Evaluation Metrics

Machine Translation: SEGALEcomet score, Null Alignment (NA) Ratio
Long-Form Response: Average reward score, GPT-4 win rate
Program Synthesis: Pass Rate

Baseline Methods

Test-Time Alignment Methods:

ARGS: Token-level guided decoding
RAIN: Tree-structured self-evaluation
RE-Control: Gradient optimization modifying internal representations
GenARM: Autoregressive reward model
TPO: Text optimization method
Best-of-N sampling
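
For reference, the Best-of-N baseline can be sketched in a few lines of Python: draw N independent responses and keep the one the reward model scores highest. `generate` and `reward` are hypothetical callables standing in for the LLM sampler and the reward model.

```python
import random
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> str:
    """Sample n full responses and return the highest-reward one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```

Unlike the iterative methods in the list, every sample here is drawn independently, so nothing learned from one candidate informs the next.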

Training-Time Alignment Methods:

Supervised Fine-Tuning (SFT)
Direct Preference Optimization (DPO)
SimPO

Implementation Details

Backbone Model: LLaMA-3.1-8B-Instruct
Number of Iterations: 3-5
Rollouts per Iteration: 2-3
Quality Threshold α: Task-specific settings
Buffer Capacity: 3-6 subgoals

Experimental Results

Main Results

Discourse-Level Machine Translation

On WMT'24 literary translation tasks, TMPC achieves the best performance among all test-time alignment baselines:

Direction	TMPC SEGALEcomet	Best-of-60	TPO	NA Ratio
zh→en	94.62	90.97	88.81	0.00
zh→ru	91.53	84.86	92.63	1.19
zh→de	91.73	82.74	87.67	2.40

TMPC even surpasses GPT-4o (94.58) on the zh→en direction
Significantly outperforms strong baseline Best-of-60 with lower computational cost

Long-Form Response Generation

Average Reward: 4.60 (TMPC) vs 4.18 (Best-of-20) vs 3.95 (DPO)
GPT-4 Win Rate: Wins against both DPO and Best-of-20
Requires only 10 generations (3 iterations × 3 rollouts + 1 initial generation)
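
The generation budget quoted above follows from simple arithmetic, sketched here under the assumption that each rollout counts as one full generation. The function name is illustrative.

```python
def tmpc_generation_count(
    iterations: int, rollouts_per_iteration: int, initial: int = 1
) -> int:
    """Total generations: one initial draft plus the rollouts of every iteration."""
    return iterations * rollouts_per_iteration + initial


budget = tmpc_generation_count(iterations=3, rollouts_per_iteration=3)
# Fraction of the Best-of-20 sampling budget used by this configuration.
fraction_of_best_of_20 = budget / 20
```

Under that assumption, the reported configuration uses half as many generations as the Best-of-20 baseline it is compared against.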

Program Synthesis

Pass Rate: 61% (TMPC) vs 50% (Best-of-35) vs 48% (TPO)
Systematically explores solution paths by building upon partial correctness

Ablation Studies

Hyperparameter Robustness: Varying the buffer size and segment length changes performance by less than 0.1 points.
Reward Model Sensitivity:
- Maintains good performance even with weaker reward models
- Limited impact from noise injection, demonstrating the filtering effect of the subgoal buffer
Iteration Analysis: Performance improves steadily over the first 3 iterations and declines slightly thereafter.

Case Analysis

The paper demonstrates how TMPC discovers and leverages subgoals across different tasks:

Machine Translation: Sentence-level alignment
Response Generation: Semantically coherent text chunks
Program Synthesis: Functional milestones passing unit tests

Related Work

Preference Alignment Methods

Training-Time Methods: RLHF, DPO, SimPO, CPO, etc., computationally expensive but effective
Test-Time Methods: Guided decoding, iterative optimization, tree search, etc., lightweight but with inherent limitations

Control Theory Applications in NLP

The authors position TMPC as the first systematic application of Model Predictive Control to preference alignment in language generation, addressing a gap at the intersection of control theory and NLP.

Hierarchical Reinforcement Learning

Borrows ideas of subgoal discovery and hierarchical planning from HRL, but adapts them to discrete text generation scenarios.

Conclusions and Discussion

Main Conclusions

Unified Framework: Successfully unifies test-time alignment as a sequential decision-making problem, revealing fundamental trade-offs in existing methods
Effective Balance: TMPC effectively balances the curse of horizon and curse of dimensionality
Broad Applicability: Achieves consistent improvements across three tasks with different characteristics

Limitations

Model Capacity Constraints: Limited by the expressive capacity of the underlying language model
Distribution Shift: May perform poorly when expected outputs deviate significantly from the model's original distribution
Reward Signal Dependency: Performance largely depends on the quality of the reward model

Future Directions

Integration with Training-Time Methods: Explore lightweight fine-tuning or collaborative reward model optimization
Enhanced Distribution Adaptation: Improve robustness under distribution shift
Automatic Subgoal Discovery: Develop more intelligent subgoal identification mechanisms

In-Depth Evaluation

Strengths

Significant Theoretical Contribution: First systematic analysis of fundamental challenges in test-time alignment, providing a unified theoretical framework
Strong Method Innovation: Successfully adapts MPC to text generation with clear principles and elegant design
Comprehensive Experiments: Validation across three tasks with different characteristics, including detailed ablation studies and robustness analysis
High Practical Value: No retraining required, computationally efficient, easy to deploy

Weaknesses

Heuristic Nature of Subgoal Discovery: While effective, subgoal identification still relies on heuristic methods
Task-Specific Tuning: Different tasks require specific prompt design and parameter adjustment
Long-Sequence Handling: Capability for processing extremely long sequences remains to be verified
Lack of Theoretical Guarantees: Absence of convergence or optimality guarantees

Impact

Academic Value: Provides a new research paradigm for test-time alignment, potentially inspiring future work
Practical Significance: Offers a viable solution for LLM alignment in resource-constrained environments
Cross-Disciplinary Contribution: Promotes integration between control theory and NLP

Applicable Scenarios

Resource-Constrained Deployment: Scenarios where large-scale fine-tuning is infeasible
Dynamic Preference Adjustment: Applications requiring rapid adaptation to different preferences
Multi-Task Systems: Systems needing flexible alignment strategy switching across tasks
Safety-Critical Applications: Scenarios requiring additional safety checks at inference time

References

The paper cites extensive related work, primarily including:

Foundational LLM research (GPT series, LLaMA, Gemma, etc.)
Preference alignment methods (RLHF, DPO, SimPO, etc.)
Test-time alignment techniques (ARGS, RAIN, RE-Control, etc.)
Control theory foundations (MPC, MPPI, etc.)
Reinforcement learning theory (hierarchical RL, trajectory optimization, etc.)

Summary: This is a high-quality paper with significant contributions in both theoretical innovation and practical application. The authors successfully adapt the MPC framework from control theory to the preference alignment problem in language generation, proposing the innovative TMPC method, and comprehensively validate its effectiveness through extensive experiments. This work provides a new research direction for test-time alignment with important academic value and practical significance.