Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space

Kanai, Yoshida, Takahashi et al.

Test-time alignment of large language models (LLMs) attracts attention because fine-tuning LLMs requires high computational costs. In this paper, we propose a new test-time alignment method called adaptive importance sampling on pre-logits (AISP) on the basis of the sampling-based model predictive control with the stochastic control input. AISP applies the Gaussian perturbation into pre-logits, which are outputs of the penultimate layer, so as to maximize expected rewards with respect to the mean of the perturbation. We demonstrate that the optimal mean is obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in terms of rewards over the number of used samples and achieves higher rewards than other reward-based test-time alignment methods.

Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space

Basic Information

Paper ID: 2510.26219
Title: Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space
Authors: Sekitoshi Kanai, Tsukasa Yoshida, Hiroshi Takahashi (NTT, Inc.), Haru Kuroki, Kazumune Hashimoto (The University of Osaka)
Classification: cs.LG cs.AI
Publication Date: October 30, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.26219v1

Abstract

Test-time alignment of large language models (LLMs) has gained attention because it avoids the high computational cost of fine-tuning. This paper proposes a new test-time alignment method, adaptive importance sampling on pre-logits (AISP), based on sampling-based model predictive control with stochastic control inputs. AISP applies a Gaussian perturbation to the pre-logits (the outputs of the penultimate layer) and maximizes the expected reward with respect to the mean of that perturbation. The paper shows that the optimal mean can be obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in reward obtained per number of samples used, and achieves higher rewards than other reward-based test-time alignment methods.
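
As a rough illustration of the update described above, the generic MPPI-style form of adaptive importance sampling can be written as follows. The symbols (samples $v_i$, rewards $r_i$, temperature $\lambda$, standard deviation $\sigma$, iteration index $k$, sample count $n$) are assumptions introduced here for illustration; the paper's exact notation and estimator may differ.

```latex
\begin{aligned}
v_i &\sim \mathcal{N}(\mu_k, \sigma^2 I), \quad i = 1, \dots, n, \\
w_i &= \frac{\exp(r_i / \lambda)}{\sum_{j=1}^{n} \exp(r_j / \lambda)}, \\
\mu_{k+1} &= \sum_{i=1}^{n} w_i \, v_i .
\end{aligned}
```

In this generic form, as $\lambda \to 0$ the weights concentrate on the single highest-reward sample, so the update reduces to selecting the best of the $n$ draws.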

Research Background and Motivation

Problem Statement

Alignment of large language models is a critical technology for ensuring LLM safety and broad applicability. Traditional reinforcement learning from human feedback (RLHF) requires fine-tuning LLM parameters, incurring substantial computational costs. Test-time alignment aims to generate human-preference-aligned responses without updating model parameters.

Problem Significance

Computational Cost: Fine-tuning large-scale LLMs requires substantial GPU resources and training time
Flexibility: Test-time alignment enables dynamic adjustment of model behavior during inference
Practicality: Eliminates the need for model retraining for each specific task

Limitations of Existing Methods

Best-of-N (BoN) Sampling: While simple and effective, it does not actively explore optimal responses, resulting in low sample efficiency
RE-Control: Requires training a value function on a large dataset (e.g., 349,000 training samples) and incurs additional storage overhead
Traditional Optimal Control: Unsuitable for nonlinear, large-scale LLM systems
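
For reference, the BoN baseline mentioned above can be sketched in a few lines. This is a minimal toy version, not the paper's setup: `toy_generate` and `toy_reward` are hypothetical stand-ins for an LLM sampler and a reward model, and a "response" is just a float.

```python
import random


def best_of_n(generate, reward, n, rng):
    """Draw n candidates independently and return the highest-reward one."""
    candidates = [generate(rng) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Hypothetical stand-ins: a "response" is a float drawn from a fixed
# distribution, and the reward peaks at 3.0.
def toy_generate(rng):
    return rng.gauss(0.0, 1.0)


def toy_reward(x):
    return -(x - 3.0) ** 2


rng = random.Random(0)
resp, score = best_of_n(toy_generate, toy_reward, 16, rng)
print(resp, score)
```

Every candidate is drawn from the same fixed distribution, so the rewards observed so far never influence where later samples land. This is the sense in which BoN does not actively explore.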

Research Motivation

Can we control LLMs to explore optimal responses through training-free methods? This paper approaches the problem from control theory, adopting sampling-based model predictive path integral (MPPI) control to propose a training-free test-time alignment method.

Core Contributions

Proposes AISP Method: First application of sampling-based model predictive path integral (MPPI) control to LLM alignment, achieving training-free test-time alignment through Gaussian perturbations in pre-logit space
Theoretical Contributions:
- Proves that the optimal pre-logit distribution can be obtained via free energy bounds
- Derives closed-form solutions based on adaptive importance sampling
- Reveals theoretical connection between AISP and BoN (AISP degenerates to BoN under specific parameters)
Justification of Gaussian Assumption: Argues that the Gaussian assumption on pre-logits is theoretically connected to the assumptions inherent in a neural network's softmax layer
Performance Improvements:
- Significantly outperforms BoN in sample efficiency (achieves higher rewards with the same sample count)
- Surpasses RE-Control without requiring training
- Proposes Batched AISP for parallel acceleration
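
The core idea in the contributions above, iteratively re-centering a Gaussian perturbation using reward-weighted samples, can be sketched on a toy problem. This is a sketch under stated assumptions, not the authors' implementation: the perturbation vector is scored directly by a hypothetical `toy_reward` instead of being added to an LLM's pre-logits and scored by a reward model, and the names `softmax_weights`, `adaptive_importance_sampling`, `sigma`, and `lam` are illustrative.

```python
import math
import random


def softmax_weights(rewards, lam):
    """Importance weights proportional to exp(r / lam), computed stably."""
    m = max(rewards)
    exps = [math.exp((r - m) / lam) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_importance_sampling(reward, dim, n, iters, sigma, lam, rng):
    """Move the perturbation mean toward high-reward samples each iteration."""
    mean = [0.0] * dim
    best_v, best_r = None, -math.inf
    for _ in range(iters):
        # Sample n Gaussian perturbations around the current mean.
        samples = [[rng.gauss(mu, sigma) for mu in mean] for _ in range(n)]
        rewards = [reward(v) for v in samples]
        # Track the best sample seen so far across all iterations.
        for v, r in zip(samples, rewards):
            if r > best_r:
                best_v, best_r = v, r
        # Update the mean as the reward-weighted average of the samples.
        w = softmax_weights(rewards, lam)
        mean = [sum(wi * v[d] for wi, v in zip(w, samples)) for d in range(dim)]
    return mean, best_v, best_r


# Hypothetical reward: higher when the perturbation is close to a target.
def toy_reward(v):
    target = [2.0, -1.0]
    return -sum((a - b) ** 2 for a, b in zip(v, target))


rng = random.Random(0)
mean, best_v, best_r = adaptive_importance_sampling(
    toy_reward, dim=2, n=32, iters=10, sigma=1.0, lam=0.5, rng=rng
)
print(mean, best_r)
```

With a very small `lam`, the weights collapse onto the single best sample, which gives an intuition for the stated connection to BoN; the exact conditions under which AISP reduces to BoN are those given in the paper, not this toy.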

Method Details

Task Definition

Model Architecture

1. Stochastic Control Input Design