Test-Time Alignment for Large Language Models via Textual Model Predictive Control

Wang, Chen, Hung et al.
Aligning Large Language Models (LLMs) with human preferences through finetuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory to propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, where TMPC analyzes generation subgoals to retrospectively identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation). (2) Subgoal-Conditioned Re-Generation, where these identified subgoals are used to guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC ensures stable improvement by building upon previously validated successes. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting its generality.

Basic Information

  • Paper ID: 2502.20795
  • Title: Test-Time Alignment for Large Language Models via Textual Model Predictive Control
  • Authors: Kuang-Da Wang, Teng-Ruei Chen, Yu-Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
  • Institutions: National Yang Ming Chiao Tung University, NVIDIA
  • Classification: cs.CL (Computation and Language)
  • Publication Date: February 2025
  • Paper Link: https://arxiv.org/abs/2502.20795v3

Abstract

Aligning large language models with human preferences typically requires fine-tuning, which is resource-intensive. This paper addresses test-time alignment from a sequential decision-making perspective, revealing two fundamental challenges: when actions are defined at the token level (e.g., guided decoding), alignment faces the "curse of horizon"; when actions are defined at the response level (e.g., traditional iterative refinement), it faces the "curse of dimensionality." To address this trade-off, the authors draw inspiration from Model Predictive Control (MPC) in control theory and propose Textual Model Predictive Control (TMPC), a novel predictive planning framework for aligning LLMs at inference time.

Research Background and Motivation

Problem Background

  1. Importance of Alignment: While large language models demonstrate excellent performance on various NLP tasks, aligning their outputs with human preferences remains a critical challenge, particularly for smaller-scale LLMs (e.g., under 10B parameters).
  2. Limitations of Traditional Methods:
    • Training-time alignment methods (e.g., RLHF, DPO) are resource-intensive, requiring expensive retraining
    • Test-time alignment methods face fundamental trade-offs:
      • Token-level guided decoding suffers from the "curse of horizon"
      • Response-level iterative optimization suffers from the "curse of dimensionality"
  3. Research Motivation: There is a need for a test-time alignment method that avoids expensive model retraining while effectively balancing temporal and search space complexity.

Core Contributions

  1. Novel Problem Formulation: First to model test-time alignment as a sequential decision-making problem, unifying existing methods and revealing their fundamental trade-offs.
  2. TMPC Framework: Proposes a Textual Model Predictive Control framework that adapts control-theoretic concepts to language generation tasks.
  3. Two Core Principles:
    • Hindsight Subgoal Identification: Discovering meaningful planning steps from rollouts
    • Subgoal-Conditioned Re-Generation: Iterative refinement based on verified subgoals
  4. Comprehensive Experimental Validation: Validates the method's effectiveness and generality across three tasks with different characteristics.

Methodology Details

Task Definition

Text generation is modeled as a finite-horizon Markov Decision Process (MDP):

  • State Space S: All possible text prefixes
  • Action Space A: All possible generation units
  • Transition Function P: Deterministic transitions
  • Reward Function R: Scalar feedback evaluating alignment quality
  • Objective: Find the optimal action sequence a^* = \arg\max_{a_{0:T-1}} \sum_{t=0}^{T-1} R(s_t, a_t)
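
To make the MDP view concrete, the following is a minimal sketch of how the cumulative reward of one action sequence could be computed when transitions are deterministic (the next state is the prefix with the chosen unit appended). The names `rollout_return` and `toy_reward` are hypothetical, and the toy reward is a stand-in for a real reward model; neither comes from the paper.

```python
from typing import Callable, Sequence


def rollout_return(
    prompt: str,
    actions: Sequence[str],
    reward: Callable[[str, str], float],
) -> float:
    """Sum R(s_t, a_t) along one trajectory with deterministic transitions."""
    state = prompt
    total = 0.0
    for action in actions:
        total += reward(state, action)
        state = state + action  # deterministic transition: append the unit
    return total


def toy_reward(state: str, action: str) -> float:
    # Hypothetical reward used only for illustration: favors short units.
    return 1.0 if len(action) <= 6 else 0.0


# Brute-force argmax over a tiny, hand-written candidate set.
candidates = [["Hello", " world"], ["Greetings", " everyone"]]
best = max(candidates, key=lambda acts: rollout_return("", acts, toy_reward))
```

Exhaustive search like this is only feasible for toy candidate sets, which is why the summary goes on to describe sampling-based planning.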

TMPC Framework Architecture

1. Basic MPC Adaptation

TMPC adapts traditional MPC to text generation:

a^{TMPC}(s) ← G({τ^{(i)}}_{i=1}^K, {J(τ^{(i)})}_{i=1}^K; s)

where G is an aggregation function, τ represents trajectories, and J represents cumulative rewards.
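
As a rough illustration of the formula above, the sketch below samples K rollouts from the current state, scores each with a cumulative-reward function J, and aggregates them. It assumes the simplest possible aggregation G (take the first action of the highest-scoring rollout, as in classic receding-horizon MPC); TMPC's own aggregation is different and is described in the next subsection. The names `mpc_step`, `propose`, and `score` are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[str]


def mpc_step(
    state: str,
    propose: Callable[[str, random.Random], Trajectory],
    score: Callable[[str, Trajectory], float],
    k: int,
    rng: random.Random,
) -> Tuple[str, float]:
    """Sample k rollouts from `state` and return the first action and
    return value of the best-scoring one (argmax aggregation, an assumption)."""
    rollouts = [propose(state, rng) for _ in range(k)]
    returns = [score(state, tau) for tau in rollouts]
    best_idx = max(range(k), key=lambda i: returns[i])
    return rollouts[best_idx][0], returns[best_idx]


# Toy stand-ins for the LLM proposal distribution and the reward model.
WORDS = ["good", "bad", "fine"]


def toy_propose(state: str, rng: random.Random) -> Trajectory:
    return [rng.choice(WORDS) for _ in range(3)]


def toy_score(state: str, tau: Trajectory) -> float:
    return float(sum(word == "good" for word in tau))


action, value = mpc_step("", toy_propose, toy_score, k=4, rng=random.Random(0))
```

In a real setting `propose` would call the LLM and `score` would call the reward model; only the first action is committed before replanning from the new state.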

2. Core Principle Implementation

Hindsight Subgoal Identification:

  • After generating multiple candidate responses, retrospectively identifies high-quality intermediate points as subgoals
  • Update rule:
B ← {
  B ∪ ã^{TMPC}_t(s), if |B| < capacity,
  B \ {a ∈ B | R(s,a) < R(s,a')} ∪ {a'}, otherwise
}
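
One possible reading of this update rule is sketched below. It assumes a' denotes a newly identified candidate subgoal and that each buffer entry stores its reward alongside its text. When the buffer is full, entries scoring below the candidate are dropped and the candidate is added; if nothing scores lower, the buffer is left unchanged so that capacity is respected (this guard is an assumption, not stated in the rule). `update_buffer` is a hypothetical helper.

```python
from typing import List, Tuple

Subgoal = Tuple[str, float]  # (segment text, reward)


def update_buffer(
    buffer: List[Subgoal], candidate: Subgoal, capacity: int
) -> List[Subgoal]:
    """Return a new buffer after offering `candidate` to it."""
    if len(buffer) < capacity:
        # Room left: simply add the candidate.
        return buffer + [candidate]
    # Full: keep only entries that are at least as good as the candidate.
    kept = [item for item in buffer if item[1] >= candidate[1]]
    if len(kept) == len(buffer):
        # Candidate beats nothing; leave the buffer unchanged (assumed guard).
        return buffer
    return kept + [candidate]


buf: List[Subgoal] = []
for cand in [("a", 0.5), ("b", 0.7), ("c", 0.6), ("d", 0.1)]:
    buf = update_buffer(buf, cand, capacity=2)
```

With capacity 2, the low-reward entry "a" is displaced by "c", while "d" is rejected because it does not beat any stored subgoal.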

Subgoal-Conditioned Re-Generation:

  • Aggregation function:
ã^{TMPC}_t(s) ← G({τ^{(i)}_t}_{i=1}^K, R(·) | s, B) := {a | R(s,a) ≥ α and a ∈ {τ^{(i)}_t}_{i=1}^K}
  • New rollouts are generated by explicitly conditioning on the high-reward subgoals stored in buffer B
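
The two steps above can be sketched as a threshold filter followed by prompt construction. This is a minimal illustration under stated assumptions: the reward function, the threshold value, and the prompt wording are all hypothetical, and the paper's actual prompts are task-specific and not reproduced here.

```python
from typing import Callable, List


def select_subgoals(
    segments: List[str],
    reward: Callable[[str, str], float],
    state: str,
    alpha: float,
) -> List[str]:
    """Keep rollout segments whose reward meets the quality threshold alpha."""
    return [a for a in segments if reward(state, a) >= alpha]


def build_conditioned_prompt(task_prompt: str, subgoals: List[str]) -> str:
    """Append buffered subgoals to the task prompt as conditioning signals.
    The instruction wording is an assumption made for illustration."""
    if not subgoals:
        return task_prompt
    listed = "\n".join(f"- {goal}" for goal in subgoals)
    return (
        f"{task_prompt}\n\n"
        f"Reuse these previously validated segments where they fit:\n{listed}"
    )


def toy_reward(state: str, action: str) -> float:
    # Hypothetical reward: longer segments score higher.
    return len(action) / 10


segments = ["short", "a longer segment"]
chosen = select_subgoals(segments, toy_reward, state="", alpha=1.0)
prompt = build_conditioned_prompt("Translate the passage.", chosen)
```

The resulting prompt would then be passed to the LLM to produce the next round of rollouts.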

Technical Innovations

  1. Dynamic Boundary Discovery: Does not rely on predefined hard segmentation boundaries; discovers task-specific meaningful planning steps.
  2. Hierarchical Reinforcement Learning Inspiration: Incorporates ideas from hierarchical RL through subgoal decomposition of long-horizon planning tasks.
  3. Stable Cumulative Progress: Ensures stable performance improvements by building upon verified subgoals.
  4. Training-Free: Leverages pretrained LLMs as dynamics models and proposal distributions without fine-tuning.

Experimental Setup

Datasets

  1. Discourse-Level Machine Translation:
    • WMT'24 Discourse-Level Literary Translation benchmark
    • Language pairs: Chinese→English, Chinese→German, Chinese→Russian
    • Each instance segmented to at most 1024 tokens
  2. Long-Form Response Generation:
    • Dahoas/full-hh-rlhf dataset
    • 6K longest response samples for training, 1024 for testing
  3. Program Synthesis:
    • MBPP official test set
    • 500 problems (Task IDs 11-510)

Evaluation Metrics

  • Machine Translation: SEGALEcomet score, Null Alignment (NA) Ratio
  • Long-Form Response: Average reward score, GPT-4 win rate
  • Program Synthesis: Pass Rate

Baseline Methods

Test-Time Alignment Methods:

  • ARGS: Token-level guided decoding
  • RAIN: Tree-structured self-evaluation
  • RE-Control: Gradient optimization modifying internal representations
  • GenARM: Autoregressive reward model
  • TPO: Text optimization method
  • Best-of-N sampling

Training-Time Alignment Methods:

  • Supervised Fine-Tuning (SFT)
  • Direct Preference Optimization (DPO)
  • SimPO

Implementation Details

  • Backbone Model: LLaMA-3.1-8B-Instruct
  • Number of Iterations: 3-5
  • Rollouts per Iteration: 2-3
  • Quality Threshold α: Task-specific settings
  • Buffer Capacity: 3-6 subgoals
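
For readers who want to experiment, the settings above could be collected in a small configuration object. This is a hypothetical sketch: the class name `TMPCConfig` and its field names are not from the paper, the defaults are arbitrary picks from within the reported ranges, and `alpha` has no default because only "task-specific" values are reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TMPCConfig:
    alpha: float                          # quality threshold; task-specific
    backbone: str = "LLaMA-3.1-8B-Instruct"
    num_iterations: int = 3               # reported range: 3-5
    rollouts_per_iteration: int = 3       # reported range: 2-3
    buffer_capacity: int = 3              # reported range: 3-6 subgoals

    def __post_init__(self) -> None:
        # Basic sanity checks only; the reported ranges are not enforced.
        counts = (
            self.num_iterations,
            self.rollouts_per_iteration,
            self.buffer_capacity,
        )
        if any(value < 1 for value in counts):
            raise ValueError("iteration, rollout, and buffer counts must be >= 1")


config = TMPCConfig(alpha=0.5)  # 0.5 is a placeholder, not a reported value
```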

Experimental Results

Main Results

Discourse-Level Machine Translation

On WMT'24 literary translation tasks, TMPC achieves the best performance among all test-time alignment baselines:

Direction   TMPC SEGALEcomet   Best-of-60   TPO     NA Ratio
zh→en       94.62              90.97        88.81   0.00
zh→ru       91.53              84.86        92.63   1.19
zh→de       91.73              82.74        87.67   2.40
  • TMPC even surpasses GPT-4o (94.58) on the zh→en direction
  • Significantly outperforms strong baseline Best-of-60 with lower computational cost

Long-Form Response Generation

  • Average Reward: 4.60 (TMPC) vs 4.18 (Best-of-20) vs 3.95 (DPO)
  • GPT-4 Win Rate: Wins against both DPO and Best-of-20
  • Requires only 10 generations (3 iterations × 3 rollouts + 1 initial generation)
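
The generation budget quoted above follows from simple arithmetic, reproduced in the sketch below and compared with the Best-of-20 baseline. The helper name `total_generations` is hypothetical, and counting one LLM call per rollout is an assumption about how the budget is measured.

```python
def total_generations(
    iterations: int, rollouts_per_iteration: int, initial: int = 1
) -> int:
    """Number of LLM generations: one initial draft plus all rollouts."""
    return iterations * rollouts_per_iteration + initial


tmpc_budget = total_generations(iterations=3, rollouts_per_iteration=3)
best_of_n_budget = 20  # Best-of-20 samples 20 full responses
budget_ratio = tmpc_budget / best_of_n_budget
```

Under these assumptions TMPC uses 10 generations, half the budget of Best-of-20, while reporting a higher average reward.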

Program Synthesis

  • Pass Rate: 61% (TMPC) vs 50% (Best-of-35) vs 48% (TPO)
  • Systematically explores solution paths by building upon partial correctness

Ablation Studies

  1. Hyperparameter Robustness: Varying the buffer size and segment length changes performance by less than 0.1 points.
  2. Reward Model Sensitivity:
    • Maintains good performance even with weaker reward models
    • Limited impact from noise injection, demonstrating the filtering effect of the subgoal buffer
  3. Iteration Analysis: Performance steadily improves over the first 3 iterations, with slight decline thereafter.

Case Analysis

The paper demonstrates how TMPC discovers and leverages subgoals across different tasks:

  • Machine Translation: Sentence-level alignment
  • Response Generation: Semantically coherent text chunks
  • Program Synthesis: Functional milestones passing unit tests

Related Work

Preference Alignment Methods

  1. Training-Time Methods: RLHF, DPO, SimPO, CPO, etc., computationally expensive but effective
  2. Test-Time Methods: Guided decoding, iterative optimization, tree search, etc., lightweight but with inherent limitations

Control Theory Applications in NLP

TMPC is the first to systematically apply Model Predictive Control to preference alignment in language generation, filling a gap in the intersection of control theory and NLP.

Hierarchical Reinforcement Learning

Borrows ideas of subgoal discovery and hierarchical planning from HRL, but adapts them to discrete text generation scenarios.

Conclusions and Discussion

Main Conclusions

  1. Unified Framework: Successfully unifies test-time alignment as a sequential decision-making problem, revealing fundamental trade-offs in existing methods
  2. Effective Balance: TMPC effectively balances the curse of horizon and curse of dimensionality
  3. Broad Applicability: Achieves consistent improvements across three tasks with different characteristics

Limitations

  1. Model Capacity Constraints: Limited by the expressive capacity of the underlying language model
  2. Distribution Shift: May perform poorly when expected outputs deviate significantly from the model's original distribution
  3. Reward Signal Dependency: Performance largely depends on the quality of the reward model

Future Directions

  1. Integration with Training-Time Methods: Explore lightweight fine-tuning or collaborative reward model optimization
  2. Enhanced Distribution Adaptation: Improve robustness under distribution shift
  3. Automatic Subgoal Discovery: Develop more intelligent subgoal identification mechanisms

In-Depth Evaluation

Strengths

  1. Significant Theoretical Contribution: First systematic analysis of fundamental challenges in test-time alignment, providing a unified theoretical framework
  2. Strong Method Innovation: Successfully adapts MPC to text generation with clear principles and elegant design
  3. Comprehensive Experiments: Validation across three tasks with different characteristics, including detailed ablation studies and robustness analysis
  4. High Practical Value: No retraining required, computationally efficient, easy to deploy

Weaknesses

  1. Heuristic Nature of Subgoal Discovery: While effective, subgoal identification still relies on heuristic methods
  2. Task-Specific Tuning: Different tasks require specific prompt design and parameter adjustment
  3. Long-Sequence Handling: Capability for processing extremely long sequences remains to be verified
  4. Lack of Theoretical Guarantees: Absence of convergence or optimality guarantees

Impact

  1. Academic Value: Provides a new research paradigm for test-time alignment, potentially inspiring future work
  2. Practical Significance: Offers a viable solution for LLM alignment in resource-constrained environments
  3. Cross-Disciplinary Contribution: Promotes integration between control theory and NLP

Applicable Scenarios

  1. Resource-Constrained Deployment: Scenarios where large-scale fine-tuning is infeasible
  2. Dynamic Preference Adjustment: Applications requiring rapid adaptation to different preferences
  3. Multi-Task Systems: Systems needing flexible alignment strategy switching across tasks
  4. Safety-Critical Applications: Scenarios requiring additional safety checks at inference time

References

The paper cites extensive related work, primarily including:

  • Foundational LLM research (GPT series, LLaMA, Gemma, etc.)
  • Preference alignment methods (RLHF, DPO, SimPO, etc.)
  • Test-time alignment techniques (ARGS, RAIN, RE-Control, etc.)
  • Control theory foundations (MPC, MPPI, etc.)
  • Reinforcement learning theory (hierarchical RL, trajectory optimization, etc.)

Summary: This is a high-quality paper with significant contributions in both theoretical innovation and practical application. The authors successfully adapt the MPC framework from control theory to the preference alignment problem in language generation, proposing the innovative TMPC method, and comprehensively validate its effectiveness through extensive experiments. This work provides a new research direction for test-time alignment with important academic value and practical significance.