Test-time alignment of large language models (LLMs) has gained attention because it avoids the cost of fine-tuning. This paper proposes a novel test-time alignment method, Adaptive Importance Sampling in Pre-logit space (AISP), based on sampling-based model predictive control with stochastic control inputs. AISP applies Gaussian perturbations to pre-logit outputs (penultimate-layer activations) and achieves alignment by maximizing the expected reward with respect to the mean of the perturbation. The paper proves that the optimal mean can be obtained through importance sampling with sampled rewards. AISP outperforms best-of-n (BoN) sampling in sample efficiency and achieves higher rewards than other reward-based test-time alignment methods.
Alignment of large language models is a critical technology for ensuring LLM safety and broad applicability. Traditional reinforcement learning from human feedback (RLHF) requires fine-tuning LLM parameters, incurring substantial computational costs. Test-time alignment aims to generate human-preference-aligned responses without updating model parameters.
Can we control LLMs to explore optimal responses through training-free methods? This paper approaches the problem from control theory, adopting sampling-based model predictive path integral (MPPI) control to propose a training-free test-time alignment method.
Given an input prompt $x$, an LLM generates a response $y$. The objective is to maximize the expected reward under a reward model $r(x, y)$ while constraining the KL divergence from the base LLM.
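For reference, a KL-constrained objective of this kind is often written in the form below. This is a sketch under assumptions rather than the paper's exact equation: the policy symbol $\pi$, the base policy $\pi_{LLM}$, and the trade-off coefficient $\beta$ are notation introduced here for illustration only.

$$\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[r(x, y)\right] - \beta \, D_{\mathrm{KL}}\left(\pi(\cdot \mid x) \,\|\, \pi_{LLM}(\cdot \mid x)\right)$$

In this form, a larger $\beta$ keeps the response distribution closer to the base LLM, while a smaller $\beta$ favors reward.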
Unlike RE-Control, which uses deterministic control inputs, AISP employs stochastic control inputs $v_t$:
$$P(y_t \mid y_{<t}) = \begin{cases} \text{softmax}(W_{LLM}(z_t + v_t) + b_{LLM}), & v_t \sim \mathcal{N}(u_t, \sigma^2I), \text{ for } 1 \leq t \leq \tau \\ \text{softmax}(W_{LLM}z_t + b_{LLM}), & \text{for } \tau < t \end{cases}$$

Where:

- $z_t = \phi_{LLM}(y_{<t})$ is the pre-logit (penultimate layer output)
- $u_t$ is the perturbation mean to be optimized
- $\sigma^2I$ is the fixed covariance matrix
- $\tau$ is the control time window

#### 2. Input Trajectory Distribution

Input trajectory $V = [v_1, ..., v_\tau]$ follows a joint Gaussian distribution:

$$q(V|U, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{d\tau/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{t=1}^\tau (v_t - u_t)^\top(v_t - u_t)\right)$$

Base distribution is zero-mean Gaussian: $p(V|0, \sigma^2)$

#### 3. Optimal Distribution Derivation

Through free energy:

$$F(r, p, x, \lambda) = \log\left(\mathbb{E}_{V\sim P}\left[\exp\left(\frac{1}{\lambda}r(x,y(V))\right)\right]\right)$$

**Theorem 3.1** proves the optimal density function is:

$$q^*(V) = \frac{1}{\eta}\exp\left(\frac{1}{\lambda}r(x,y(V))\right)p(V)$$

where $\eta$ is the normalization constant.

#### 4. Adaptive Importance Sampling

Since the optimal distribution is difficult to compute directly, importance sampling approximation is used. **Theorem 3.2** proves the optimal mean is:

$$u_t^* = \mathbb{E}_{V\sim Q^*}[v_t] = \mathbb{E}_{V\sim Q_{\hat{U},\sigma^2}}[w(V)v_t]$$

Weight function is:

$$\tilde{w}^i = \frac{\exp\left(\frac{1}{\lambda}r(x,y(V^i)) - \frac{1-\alpha}{\sigma^2}\sum_{t=1}^\tau \hat{u}_t^\top v_t^i\right)}{\sum_j \exp\left(\frac{1}{\lambda}r(x,y(V^j)) - \frac{1-\alpha}{\sigma^2}\sum_{t=1}^\tau \hat{u}_t^\top v_t^j\right)}$$

where relaxation parameter $\alpha \in (0,1)$ is introduced to enhance numerical stability.

#### 5. Iterative Update

Through $\kappa$ iterations, generating $n$ samples per iteration:

$$\hat{u}_t^{k+1} = \sum_{i=1}^n \tilde{w}^i v_t^{i,k}, \quad v_t^{i,k} \sim \mathcal{N}(\hat{u}_t^k, \sigma^2I)$$

Finally, the response with highest reward among all samples is selected.

### Technical Innovations

#### 1. Pre-logit Space vs Token Space

- **Advantages**: Pre-logit distribution can be expressed with closed-form Gaussian distribution, while token sequence distribution is difficult to model
- **Computability**: Weight function is easy to compute without requiring complex techniques like normalizing flows

#### 2. Justification of Gaussian Assumption

The paper theoretically analyzes the connection between Gaussian assumption and softmax layer:

If $p(z_t|y_t=y_i) = \mathcal{N}(\mu_{y_i}, \Sigma)$, then by Bayes' theorem:

$$P(y_t=y_i|z_t) = \frac{\exp\left(\mu_{y_i}^\top\Sigma^{-1}z_t - \frac{1}{2}\mu_{y_i}^\top\Sigma^{-1}\mu_{y_i} + \ln P(y_t=y_i)\right)}{\sum_j \exp\left(\mu_{y_j}^\top\Sigma^{-1}z_t - \frac{1}{2}\mu_{y_j}^\top\Sigma^{-1}\mu_{y_j} + \ln P(y_t=y_j)\right)}$$

This corresponds exactly to the softmax function form, demonstrating that the Gaussian assumption aligns with implicit assumptions of neural language models.

#### 3. Theoretical Connection with BoN

**Theorem 3.3** proves: when $\lambda \to 0^+$ and $\kappa=1$, AISP degenerates to BoN. This shows AISP is a continuous approximation and generalization of BoN, providing a more flexible optimization framework.

#### 4. Fixed Control Window

Unlike MPPI's sliding window, AISP uses a fixed window $t \in [1, \tau]$, avoiding diversity loss from fixed prefix tokens.

## Experimental Setup

### Datasets

1. **Anthropic HH-RLHF**: For aligning LLM helpfulness and harmlessness
2. **Stanford Human Preferences (SHP)**: Human preference dataset
3. **Scale**: 1000 randomly selected samples from test sets (limited by computational resources)

### Base Models

- **LLMs**: Llama-3-8B, Vicuna-7B-v1.5, Gemma3-4B
- **Reward Models**: UltraRM-13b, Eurus-RM-7b

### Evaluation Metrics

1. **Reward Value**: Evaluated using UltraRM $r(x,y)$
2. **Diversity**: $\sum_{n=2}^4 \frac{\text{unique n-gram}(y)}{\text{total n-gram}(y)}$, measuring repetition in responses
3. **Coherence**: Cosine similarity between prompt and response embeddings using SimCSE
4. **Win Rate**: GPT-4 evaluation of AISP's win rate relative to BoN

### Comparison Methods

1. **BoN (top-p)**: Best-of-N with nucleus sampling, N=1024 (= κn)
2. **RE-Control**: Control method based on trained value functions
3. **ARGS-greedy**: Method adding weighted rewards to logits

### Implementation Details

- **AISP Parameters**: $n=32$, $\kappa=32$, total samples 1024
- **Hyperparameter Tuning**: Grid search on 10 training samples
  - $\lambda \in [0.1, 0.3, 0.5, 0.7]$ (UltraRM), $[60, 120, 240, 480]$ (Eurus)
  - $\sigma^2 \in [0.1, 0.3, 0.5, 0.7]$
  - $\alpha \in [0.99, 0.999, 0.9999, 0.99999]$
- **Generation Settings**: Maximum new token length 128, half-precision (bfloat16)
- **Hardware**: NVIDIA A100 (40GB) and H100 (80GB)

## Experimental Results

### Main Results

#### Average Reward Comparison (Table 1)

Results across 6 model-reward model combinations and 2 datasets show:

**SHP Dataset**:

- **Llama3 & UltraRM**: AISP (-1.39) vs BoN (-2.38), **41.6% improvement**
- **Vicuna & UltraRM**: AISP (-1.46) vs BoN (-1.78), 18.0% improvement
- **Gemma3 & UltraRM**: AISP (-2.39) vs BoN (-3.43), 30.3% improvement

**HH-RLHF Dataset**:

- **Llama3 & UltraRM**: AISP (-5.02) vs BoN (-5.074), 1.1% improvement
- **Vicuna & UltraRM**: AISP (-4.73) vs BoN (-4.85), 2.5% improvement

**Key Findings**:

- AISP achieves or exceeds BoN's average reward across all settings
- Compared to training-required RE-Control, AISP performs better in most cases (e.g., Llama3 & UltraRM: -1.39 vs -9.28)
- ARGS performs poorly in this experiment, possibly because trajectory-level reward models are unsuitable for token-level evaluation

#### Win Rate Analysis (Table 2)

GPT-4 evaluation of 100 sample pairs:

**SHP Dataset**:

- Llama3 & UltraRM: AISP 51.3% vs BoN 42.0%
- Gemma3 & UltraRM: AISP 53.0% vs BoN 41.3%
- Average win rate significantly higher than BoN

**HH-RLHF Dataset**:

- More balanced results, but AISP maintains advantage in most settings
- Some settings (e.g., Vicuna) show higher tie rates (27.7%-36.0%)

### Sample Efficiency Analysis (Figure 3)

**Convergence curves** demonstrate AISP's key advantages:

- **Early Stage**: BoN performs better initially (due to high diversity from direct sampling)
- **Middle Stage**: AISP quickly catches up, surpassing BoN around k=10-15 iterations
- **Late Stage**: AISP continues improving, significantly outperforming BoN eventually

**Three Curve Analysis**:

1. **AISP (Mean at k)**: $\frac{1}{n}\sum_i r(x,y(V^{i,k}))$, steadily increases with iterations
2. **AISP (Best at k)**: $\max_i r(x,y(V^{i,k}))$, best single iteration
3. **AISP (Best so far)**: $\max_{i,1\leq j\leq k} r(x,y(V^{i,j}))$, global best

**Important Insight**: AISP not only optimizes individual responses but also the response distribution, with the Mean curve's rise proving effective distribution optimization.
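The iterative update and the three curves described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_reward` is a hypothetical stand-in for $r(x, y(V))$ (which in practice requires decoding with the LLM and scoring with a reward model), the trajectory $V$ is flattened into a single vector of length `dim`, and the function and parameter names are invented for this example.

```python
import math
import random


def toy_reward(v):
    # Hypothetical stand-in for r(x, y(V)): highest when every coordinate is 1.0.
    return -sum((x - 1.0) ** 2 for x in v)


def aisp_toy(reward, dim, n=32, kappa=32, lam=0.5, sigma2=0.5, alpha=0.999, seed=0):
    """Adaptive importance sampling over a Gaussian perturbation mean (toy version)."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    u_hat = [0.0] * dim  # start from the zero-mean base distribution
    mean_at_k, best_at_k, best_so_far = [], [], []
    best = -math.inf
    for _ in range(kappa):
        # Draw n perturbations v ~ N(u_hat, sigma2 * I).
        samples = [[u + sigma * rng.gauss(0.0, 1.0) for u in u_hat] for _ in range(n)]
        rewards = [reward(v) for v in samples]
        # Unnormalized log-weights: r / lambda - (1 - alpha) / sigma2 * <u_hat, v>.
        logits = [
            r / lam - (1.0 - alpha) / sigma2 * sum(u * x for u, x in zip(u_hat, v))
            for r, v in zip(rewards, samples)
        ]
        # Softmax with max-subtraction for numerical stability.
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # New mean is the weighted average of the sampled perturbations.
        u_hat = [sum(w * v[j] for w, v in zip(weights, samples)) for j in range(dim)]
        # Track the three curves: mean at k, best at k, best so far.
        best = max(best, max(rewards))
        mean_at_k.append(sum(rewards) / n)
        best_at_k.append(max(rewards))
        best_so_far.append(best)
    return u_hat, mean_at_k, best_at_k, best_so_far


if __name__ == "__main__":
    u_hat, mean_k, best_k, best_sf = aisp_toy(toy_reward, dim=4)
    print(mean_k[0], mean_k[-1], best_sf[-1])
```

On this toy reward the "Mean at k" curve rises as the perturbation mean moves toward the optimum, while "Best so far" is non-decreasing by construction. The selected response in AISP corresponds to the sample behind the final "Best so far" value.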
### Batched AISP Experiment (Figure 4)

Comparison under same iteration count (BoN N=128 vs AISP κ=b, n=N/b):

**Setting Comparison**:

- AISP1: (b=8, n=16)
- AISP2: (b=16, n=8)
- AISP3: (b=32, n=4)
- AISP4: (b=64, n=2)

**Results**:

- All AISP settings outperform BoN (-4.2 to -4.4 vs BoN ~-4.7)
- AISP surpasses BoN as long as each iteration has at least 4 samples
- Demonstrates AISP's practicality under time constraints

### KL Divergence Analysis (Table 3)

**KL divergence under different hyperparameters**:

- AISP (λ=0.1, α=0.9999): KL=140.9, Reward=-2.15
- AISP (λ=10.0, α=0.99): KL=2.98, Reward=-3.37
- RE-Control: KL=0.172, Reward=-9.30
- ARGS: KL=78.8, Reward=-5.11

**Key Findings**:

- AISP can flexibly control deviation from base LLM by adjusting λ and α
- Even with smaller KL divergence than ARGS (18.9 vs 78.8), AISP achieves higher reward (-2.75 vs -5.11)
- Demonstrates good balance between reward improvement and base LLM fidelity

### Ablation Studies

#### Hyperparameter Sensitivity (Appendix D.1, Figures 6-7)

**Effect of λ**:

- Small λ (0.1): Mean does not grow, optimization fails
- Large λ (0.7): Increased mean growth rate, but numerical stability must be maintained
- Final reward outperforms BoN across λ∈[0.1, 0.7]

**Effect of σ**:

- Small σ (0.1): Limited exploration space, reward saturates early
- Large σ (0.7): Sufficient exploration but slight instability
- Optimal value approximately σ=0.5

**Effect of α**:

- Small α (0.5-0.8): Over-penalizes deviation, limited reward improvement
- Large α (0.999-0.9999): Allows sufficient exploration, steady reward improvement

**Overall Assessment**: Hyperparameter behavior is intuitive, and tuning is relatively straightforward

### Experimental Findings

1. **Sample Efficiency**: AISP achieves higher rewards with the same sample count, improving faster over iterations
2. **Training-Free Advantage**: Surpasses RE-Control without pre-collecting datasets or training value functions
3. **Distribution Optimization**: Optimizes not just individual responses but overall response distribution
4. **Flexibility**: Hyperparameters enable control over reward improvement vs. base LLM fidelity trade-off
5. **Parallelization Potential**: Batched AISP maintains performance advantage under time constraints
6. **Cross-Model Generalization**: Effective across multiple LLMs (Llama3, Vicuna, Gemma3) and reward models

## Related Work

### Classification of Test-Time Alignment Methods

#### 1. Training-Based Methods

- **RE-Control** (Kong et al., 2024): Trains value function to optimize pre-logits
- **Critic-Guide Decoding** (Kim et al., 2023): Trains critic network to predict state values
- **Controlled Decoding** (Mudgal et al., 2024): Trains value function for chunk-level generation
- **Limitations**: Requires large-scale datasets (e.g., RE-Control uses 349,000 samples) and training cost

#### 2. Sampling-Based Methods

- **Best-of-N (BoN)**: Simple and effective, but low sample efficiency
  - Yang et al. (2024) prove BoN asymptotically optimizes KL-constrained RL objectives
  - Beirami et al. (2024) prove BoN win rate upper bound is N/(N+1)
- **Soft Reasoning** (Zhu et al., 2025): Bayesian optimization-based, but only perturbs initial token embeddings
- **Importance Sampling Methods** (Loula et al., 2025): Uses importance sampling in token space, requires task-specific potential functions

#### 3. Logit Manipulation Methods

- **ARGS** (Khanov et al., 2024): Adds weighted rewards to logits
- **Limitations**: Requires token-level reward models

### Advantages of This Work

1. **vs BoN**: Actively explores optimal responses, higher sample efficiency
2. **vs RE-Control**: Training-free, avoids data collection and training costs
3. **vs Soft Reasoning**: Optimizes complete pre-logit sequence, not just initial embeddings
4. **vs Loula et al.**: Uses tractable Gaussian distribution in pre-logit space

### Theoretical Foundation

**Control Theory Perspective**:

- Traditional optimal control (e.g., Pontryagin's maximum principle) is unsuitable for nonlinear large-scale LLMs
- **MPPI** (Williams et al., 2017, 2018): Sampling-based model predictive control, leveraging GPU parallelization
- AISP applies MPPI to LLM alignment, introducing adaptive importance sampling

## Conclusions and Discussion

### Main Conclusions

1. **Method Effectiveness**: AISP as a training-free test-time alignment method significantly outperforms BoN and RE-Control in reward optimization
2. **Theoretical Contributions**: Establishes stochastic control framework in pre-logit space, proves optimal distribution approximable via adaptive importance sampling
3. **Sample Efficiency**: AISP outperforms BoN in sample efficiency, achieving higher rewards with same sample count
4. **Practicality**: Batched AISP maintains performance under time constraints, suitable for practical deployment
5. **Controllability**: Hyperparameters enable flexible adjustment of reward improvement vs. base LLM fidelity trade-off

### Limitations

#### 1. Computational Complexity

- **Sequential Iteration**: Requires κ sequential iterations, time complexity O(κ)
- **Additional Computation**: Weight function requires computing $\sum_{t=1}^\tau \hat{u}_t^\top v_t^i$; the O(τd) overhead is relatively negligible

#### 2. Gaussian Assumption

- **Assumption Limitation**: Gaussian assumption on pre-logit distribution may not be perfectly accurate
- **Simplification Cost**: Trade-off for tractable closed-form solution

#### 3. Hyperparameter Tuning

- **Three Hyperparameters**: λ, σ², α require tuning
- **Reward Model Dependence**: Different reward models (UltraRM vs Eurus) require different λ ranges

#### 4. Experimental Scale

- **Sample Limitation**: Only 1000 test samples due to computational constraints
- **Model Scale**: Tested only on 4B-8B parameter LLMs (Gemma3-4B, Vicuna-7B-v1.5, Llama-3-8B); performance on larger models unknown

#### 5. Diversity and Coherence

- In some settings, AISP's diversity and coherence are inferior to BoN's
- Possibly because reward models don't prioritize these dimensions

### Future Directions

1. **Combining with Fine-tuning**: Explore combination of AISP with parameter-efficient fine-tuning (e.g., LoRA)
2. **Alternative Sampling Techniques**: Investigate other importance sampling variants (e.g., sequential Monte Carlo)
3. **More Complex Distributions**: Model more complex pre-logit distributions using normalizing flows
4. **Multi-Objective Optimization**: Simultaneously optimize reward, diversity, and coherence
5. **Larger-Scale Models**: Validate method on larger-scale LLMs (70B+)
6. **Theoretical Analysis**: Provide convergence rate and sample complexity guarantees

## In-Depth Evaluation

### Strengths

#### 1. Novelty

- **Interdisciplinary Fusion**: First application of MPPI control theory to LLM alignment, opening new research directions
- **Pre-logit Space**: Operating in pre-logit rather than token space, leveraging Gaussian distribution tractability
- **Theoretical Completeness**: Provides complete theoretical derivation (Theorems 3.1-3.3) and closed-form solutions

#### 2. Practicality

- **Training-Free**: Saves substantial data collection and training costs compared to RE-Control
- **Plug-and-Play**: Directly applicable to pre-trained LLMs without model structure modification
- **Batched Version**: Provides parallelization scheme suitable for practical deployment

#### 3. Experimental Sufficiency

- **Multi-Dimensional Evaluation**: Reward, diversity, coherence, win rate, KL divergence
- **Multiple Settings**: 3 LLMs × 2 reward models × 2 datasets = 12 combinations
- **Ablation Studies**: Detailed hyperparameter sensitivity analysis (appendix)
- **Convergence Analysis**: Demonstrates dynamic sample efficiency advantages

#### 4. Theoretical Insights

- **Gaussian Assumption Justification**: Derives pre-logit Gaussian distribution rationality from softmax layer
- **Connection with BoN**: Proves AISP is BoN generalization, providing unified framework
- **Free Energy Bound**: Uses variational inference ideas, establishing elegant theoretical framework

#### 5. Writing Quality

- Clear structure, progressing from problem definition through theoretical derivation to experimental validation
- Provides detailed algorithm pseudocode (Algorithm 1) and implementation details
- Appendix contains complete proofs and additional experiments

### Weaknesses

#### 1. Method Limitations

- **Computational Overhead**: While training-free, inference requires κn forward passes; with κ=32, n=32, total 1024 forward passes
- **Sequential Dependency**: κ iterations must execute sequentially, limiting parallelization potential
- **Memory Requirements**: Must store n samples' pre-logit trajectories, space complexity O(nτd)

#### 2. Experimental Design

- **Sample Scale**: Only 1000 test samples, statistical significance may be insufficient
- **Token Length Limitation**: Strict constraints on prompt and generation length (128 tokens) due to memory
- **Missing Large-Model Experiments**: No validation on larger-scale models (e.g., Llama-70B)

#### 3. Comparison Fairness

- **BoN Setting**: BoN uses top-p sampling while AISP uses greedy decoding internally, potentially unfair
- **RE-Control Training**: RE-Control trains value function on test set, possible overfitting

#### 4. Insufficient Theoretical Analysis

- **Convergence Guarantees**: Lacks convergence rate analysis for adaptive importance sampling
- **Effective Sample Size**: Doesn't analyze importance sampling's effective sample size (ESS)
- **Gaussian Assumption Verification**: Lacks empirical verification of actual pre-logit distribution

#### 5. Diversity Issues

- In some settings, AISP's diversity and coherence are inferior to BoN's
- Lacks in-depth analysis and solutions for this phenomenon

### Impact

#### 1. Academic Contribution

- **New Paradigm**: Provides control theory perspective for test-time alignment, potentially inspiring follow-up research
- **Theoretical Bridge**: Connects control theory, variational inference, and LLM alignment
- **Methodology**: Successful application of adaptive importance sampling in pre-logit space, potentially generalizable to other generation tasks

#### 2. Practical Value

- **Cost-Benefit**: Training-free characteristic valuable in resource-constrained scenarios
- **Flexibility**: Combines with different LLMs and reward models, strong adaptability
- **Scalability**: Batched AISP provides practical deployment path

#### 3. Reproducibility

- **Code Availability**: Paper doesn't explicitly mention code release, but provides detailed algorithms and hyperparameters
- **Implementation Complexity**: Algorithm relatively simple, based on standard importance sampling, easy to reproduce
- **Computational Requirements**: Requires GPU resources (H100 80GB or A100 40GB), a barrier for individual researchers

#### 4. Limitations

- **Applicable Scenarios**: Primarily suitable for scenarios with explicit reward models
- **Extensibility**: Performance on larger models or longer sequences unknown
- **Industrial Application**: The inference cost of 1024 forward passes may be unacceptable in production

### Applicable Scenarios

#### Most Suitable Scenarios

1. **Explicit Reward Models**: Such as safety detection, factual accuracy assessment
2. **Medium-Scale Models**: LLMs around the 4B-8B parameter scale tested in the paper
3. **Offline Batch Processing**: Can tolerate κ sequential iteration latency
4. **Resource-Constrained**: Cannot afford fine-tuning costs but have inference resources

#### Less Suitable Scenarios

1. **Real-Time Interaction**: Dialogue systems requiring low-latency responses
2. **Ultra-Large Models**: Memory and computational costs potentially prohibitive
3. **No Reward Model**: Depends on explicit reward signals
4. **Extreme-Length Sequences**: Large control window τ significantly increases computation

#### Potential Extensions

1. **Multimodal Generation**: Extend method to image-text generation
2. **Reinforcement Learning**: Use as exploration strategy
3. **Active Learning**: For uncertainty sampling
4. **Adversarial Robustness**: Explore worst-case responses

## References

### Core Citations

1. **Williams et al. (2017, 2018)**: Model Predictive Path Integral Control - AISP's theoretical foundation
2. **Kong et al. (2024)**: RE-Control - primary comparison method
3. **Yang et al. (2024)**: Theoretical analysis of BoN
4. **Lee et al. (2018)**: Gaussian assumption applications in neural networks

### Related Work

5. **Ouyang et al. (2022)**: Original RLHF paper
6. **Snell et al. (2024)**: Optimal allocation of test-time computation
7. **Beirami et al. (2024)**: Theoretical guarantees for BoN
8. **Khanov et al. (2024)**: ARGS method

---

## Summary

This paper proposes the AISP method, which introduces control theory into LLM alignment, providing a theoretically elegant and practically effective test-time alignment solution. Its core innovation lies in applying Gaussian perturbations in pre-logit space and optimizing the perturbation distribution through adaptive importance sampling, achieving higher rewards than existing methods without any training.
**Main Advantages** include high sample efficiency, training-free operation, and theoretical completeness; **Main Limitations** include higher inference cost, the sequential iteration requirement, and unknown scalability to ultra-large models. The method provides a new research direction for test-time alignment and is particularly valuable in resource-constrained scenarios with explicit reward models. Future research could reduce inference cost, extend the method to larger models, and combine it with fine-tuning methods. Overall, this is high-quality research that combines theoretical depth with practical value.