2025-11-16T11:43:12.671286

Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space

Kanai, Yoshida, Takahashi et al.

Test-time alignment of large language models (LLMs) has attracted attention because fine-tuning LLMs is computationally expensive. In this paper, we propose a new test-time alignment method, adaptive importance sampling on pre-logits (AISP), based on sampling-based model predictive control with stochastic control inputs. AISP applies a Gaussian perturbation to the pre-logits, which are the outputs of the penultimate layer, and maximizes the expected reward with respect to the mean of the perturbation. We demonstrate that the optimal mean can be obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in reward per number of samples used and achieves higher rewards than other reward-based test-time alignment methods.

academic

Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space

Basic Information

Paper ID: 2510.26219
Title: Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space
Authors: Sekitoshi Kanai, Tsukasa Yoshida, Hiroshi Takahashi (NTT, Inc.), Haru Kuroki, Kazumune Hashimoto (The University of Osaka)
Categories: cs.LG cs.AI
Published: October 30, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.26219v1

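Abstract

Test-time alignment of large language models (LLMs) has attracted attention because it avoids the high cost of fine-tuning. This paper proposes a new test-time alignment method, adaptive importance sampling on pre-logits (AISP), built on sampling-based model predictive control with stochastic control inputs. AISP applies a Gaussian perturbation to the outputs of the penultimate layer (the pre-logits) and achieves alignment by maximizing the expected reward with respect to the mean of the perturbation. The paper shows that the optimal mean can be obtained by importance sampling with sampled rewards. AISP is more sample-efficient than best-of-n sampling and achieves higher rewards than other reward-based test-time alignment methods.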

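Background and Motivation

Problem Addressed

Alignment is a key technique for making LLMs safe and widely usable. Conventional reinforcement learning from human feedback (RLHF) fine-tunes the LLM's parameters, which incurs a large computational cost. Test-time alignment aims to make an LLM generate responses that match human preferences without updating the model parameters.

Why the Problem Matters

Computational cost: fine-tuning a large-scale LLM requires substantial GPU resources and training time
Flexibility: test-time alignment allows model behavior to be adjusted dynamically at inference time
Practicality: the model does not need to be retrained for each specific task

Limitations of Existing Methods

Best-of-N (BoN) sampling: simple and effective, but it does not actively search for the optimal response, so its sample efficiency is low
RE-Control: requires training a value function, which needs a large dataset (e.g., 349,000 training samples) and incurs storage costs
Classical optimal control: not applicable to nonlinear, large-scale systems such as LLMs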
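The BoN baseline discussed in the list above can be sketched in a few lines. This is a minimal illustration, not code from the paper: `toy_generate` and `toy_reward` are hypothetical stand-ins for an LLM sampler and a reward model. It shows why BoN is passive: every candidate is drawn from the same fixed distribution, and the rewards are used only to pick a winner afterwards.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[random.Random], str],
    reward: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n independent candidates and return the highest-reward one."""
    rng = random.Random(seed)
    candidates: List[str] = [generate(rng) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]


# Hypothetical stand-ins: a "model" that emits random words and a
# "reward model" that counts occurrences of the letter "a".
VOCAB = ["alpha", "beta", "gamma", "delta", "banana"]


def toy_generate(rng: random.Random) -> str:
    return " ".join(rng.choice(VOCAB) for _ in range(3))


def toy_reward(text: str) -> float:
    return float(text.count("a"))


response, score = best_of_n(toy_generate, toy_reward, n=16)
print(response, score)
```

Increasing `n` can only raise the best score found, but the sampling distribution itself never changes between draws, which is the inefficiency the summary attributes to BoN.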

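Research Motivation

Can an LLM be steered to explore optimal responses without any training? Taking a control-theoretic view, the paper adopts sampling-based model predictive control, specifically model predictive path integral control (MPPI), and proposes a training-free test-time alignment method.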

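Core Contributions

AISP method: the first application of sampling-based model predictive control (MPPI) to LLM alignment, achieving training-free test-time alignment by applying Gaussian perturbations in pre-logit space
Theoretical contributions:
- Shows that the optimal pre-logit distribution can be obtained from a bound on the free energy
- Derives a closed-form solution based on adaptive importance sampling
- Reveals the theoretical connection between AISP and BoN (AISP reduces to BoN under specific parameter settings)
Justification of the Gaussian assumption: argues that assuming a Gaussian distribution over pre-logits is intrinsically connected to the softmax layer of neural networks
Performance gains:
- Substantially more sample-efficient than BoN (higher reward for the same number of samples)
- Outperforms RE-Control without requiring any training
- Proposes Batched AISP for parallel speedup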
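The importance-sampling idea behind these contributions can be illustrated with a toy sketch. It is not the paper's algorithm: the softmax-of-rewards weighting is the usual MPPI-style rule and is assumed here, the perturbed quantity is a plain vector rather than an LLM's pre-logits, and `toy_reward`, `target`, `sigma`, and `temperature` are all hypothetical. The sketch shows a Gaussian mean being re-estimated from reward-weighted samples, and how a very small temperature collapses the weights onto the single best sample, which is the sense in which such an update can behave like BoN.

```python
import numpy as np


def importance_weighted_mean(perturbations, rewards, temperature):
    """Return the reward-weighted average of sampled perturbations.

    Weights are a softmax of rewards / temperature (assumed MPPI-style
    rule, not copied from the paper).
    """
    perturbations = np.asarray(perturbations, dtype=float)  # shape (n, d)
    rewards = np.asarray(rewards, dtype=float)              # shape (n,)
    # Subtract the max reward before exponentiating for numerical stability.
    logits = (rewards - rewards.max()) / temperature
    weights = np.exp(logits)
    weights /= weights.sum()
    return weights @ perturbations, weights


def toy_reward(v, target):
    # Hypothetical reward: higher when the sampled vector is near `target`.
    return -np.sum((v - target) ** 2, axis=-1)


rng = np.random.default_rng(0)
d, n, sigma = 4, 64, 0.5
target = np.ones(d)   # hypothetical optimum in the perturbed space
mean = np.zeros(d)    # initial mean of the Gaussian perturbation

for step in range(10):
    # Sample around the current mean, score, then re-estimate the mean.
    samples = mean + sigma * rng.standard_normal((n, d))
    rewards = toy_reward(samples, target)
    mean, weights = importance_weighted_mean(samples, rewards, temperature=0.5)

print("distance to target:", np.linalg.norm(mean - target))

# With a very small temperature the weights concentrate on the best sample,
# so the "update" just selects the highest-reward draw.
_, sharp = importance_weighted_mean(samples, rewards, temperature=1e-6)
print("largest weight:", sharp.max())
```

Unlike plain BoN, each round here samples from a distribution that has already been shifted toward high-reward regions, which is the intuition behind the sample-efficiency claim.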

Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space

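Basic Information

Abstract

Background and Motivation

Problem Addressed

Why the Problem Matters

Limitations of Existing Methods

Research Motivation

Core Contributions

Method Details

Task Definition

Model Architecture

1. Stochastic Control Input Design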