Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap

Sun, Liang, Zhang et al.

Self-improvement is among the most prominent techniques for large language models (LLMs), aiming to enhance LLM performance without relying on external data. Despite its significance, how LLM performance evolves during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of the solver-verifier gap. This is inspired by the conjecture that the performance gains of self-improvement stem from the gap between an LLM's solver capability and its verifier capability. Based on the theoretical framework, we further show how to model the entire training trajectory. This framework allows quantifying the capability limit of self-improvement by fitting the theoretical model to experimental results. We empirically validate the effectiveness of the theoretical framework on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performance, which accords with empirical observations.

Basic Information

Paper ID: 2507.00075
Title: Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap
Authors: Yifan Sun*, Yushan Liang*, Zhen Zhang, Jiaye Teng (School of Statistics and Data Science, Shanghai University of Finance and Economics)
Categories: cs.LG, cs.AI
Published: arXiv:2507.00075v3 [cs.LG], 10 Oct 2025
Paper link: https://arxiv.org/abs/2507.00075v3

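Abstract

Self-improvement is one of the most important current techniques for large language models, aiming to improve LLM performance without relying on external data. Despite its importance, how LLM performance evolves during self-improvement has not been fully explored. This paper theoretically models the training dynamics of self-improvement through the concept of the solver-verifier gap. The study rests on a conjecture: the performance gains of self-improvement stem from the gap between an LLM's solver capability and its verifier capability. Building on this framework, the authors show how to model the entire training trajectory and how to quantify the capability limit of self-improvement by fitting the theoretical model to experimental results. They validate the framework on multiple LLMs and datasets and extend the analysis to how external data affects these dynamics.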

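Research Background and Motivation

Problem Definition

Core problem: There is no theoretical understanding of how performance evolves during LLM self-improvement, and in particular no mathematical model of the training dynamics.
Importance:
- Data bottleneck: large-scale data collection is difficult, and available data may eventually be exhausted
- Need for autonomous learning: models need to adapt and evolve on their own
- Theoretical gap: existing work focuses mainly on whether methods are effective and lacks a deeper understanding of the mechanism

Limitations of Existing Methods

Insufficient theory: no theoretical model of self-improvement dynamics
Unclear mechanism: limited understanding of what drives the performance gains
Weak predictive power: no way to predict the training trajectory or the performance limit

Research Motivation

Building on the work of Song et al. (2025) and Huang et al. (2025), the authors propose that the solver-verifier gap is the key driving force of self-improvement and build a mathematical framework to describe the process.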

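Core Contributions

Theoretical framework: proposes a theoretical model of self-improvement dynamics based on the gap between solver and verifier capability, and derives an exponential convergence law
Mathematical modeling: sets up a system of coupled differential equations describing the training dynamics and obtains an analytical solution
Experimental validation: verifies the theoretical predictions on multiple models (Phi series, Llama series) and datasets (Math, GSM8k)
Cross-improvement analysis: extends the framework to analyze the influence of external data, finding that when external data is limited, the timing of its use has little effect on final performance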

Method Details

Task Definition

Solver: the model's ability to generate a response directly, measured by its uncertainty: $U_s(t) = -\frac{1}{n}\sum_{i=1}^n \log \pi_f(\hat{y}_i(t)|x_i)$

Verifier: the model's ability to evaluate and select the best response, based on a Best-of-N strategy: $\hat{y}_i^{BoN} = \arg\min_{\{\hat{y}_{i,j}: s(\hat{y}_{i,j}) \geq \sigma\}} \frac{1}{L(\hat{y}_{i,j})} U_f(\hat{y}_{i,j}|x_i)$

Verifier uncertainty: $U_v(t) = -\frac{1}{n}\sum_{i=1}^n \log \pi_f(\hat{y}_i^{BoN}(t)|x_i)$
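The three definitions above can be sketched numerically. This is a minimal illustration, not the authors' implementation: it assumes the sequence log-probabilities $\log \pi_f(\hat{y}_{i,j}|x_i)$, response lengths $L(\hat{y}_{i,j})$, and scores $s(\hat{y}_{i,j})$ have already been computed, that $U_f = -\log \pi_f$, that candidate 0 stands in for the plain solver response, and that prompts with no candidate above $\sigma$ fall back to all candidates. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def solver_verifier_uncertainty(logp, lengths, scores, sigma):
    """Return (U_s, U_v, U_s - U_v) for n prompts with N candidates each.

    logp[i, j]    : log pi_f(y_ij | x_i), sequence log-probability
    lengths[i, j] : L(y_ij), response length in tokens
    scores[i, j]  : s(y_ij), score compared against the threshold sigma
    """
    logp = np.asarray(logp, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Assumption: candidate 0 plays the role of the solver response y_i(t).
    u_s = -logp[:, 0].mean()

    # Length-normalised uncertainty (1 / L) * U_f, assuming U_f = -log pi_f.
    norm_unc = -logp / lengths

    # Only candidates with s >= sigma take part in the arg min.
    eligible = scores >= sigma
    # Assumption: if no candidate passes for a prompt, consider all of them.
    eligible[~eligible.any(axis=1)] = True
    best = np.where(eligible, norm_unc, np.inf).argmin(axis=1)

    # Verifier uncertainty uses the log-probability of the selected response.
    u_v = -logp[np.arange(logp.shape[0]), best].mean()
    return u_s, u_v, u_s - u_v

# Toy inputs: 2 prompts, 3 candidates each (made-up numbers).
logp = [[-6.0, -4.0, -9.0], [-5.0, -8.0, -3.0]]
lengths = [[3, 2, 3], [5, 4, 6]]
scores = [[0.2, 0.9, 0.8], [0.7, 0.1, 0.6]]
u_s, u_v, gap = solver_verifier_uncertainty(logp, lengths, scores, sigma=0.5)
```

Masking ineligible candidates with infinity keeps the selection vectorised across prompts; the fallback rule is only one possible convention for the empty-set case, which the formulas above do not specify.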

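Theoretical Framework

1. Capability Gap Definition

$G(t) = U_s(t) - U_v(t) = -\frac{1}{n}\sum_{i=1}^n \log \frac{\pi_f(\hat{y}_i(t)|x_i)}{\pi_f(\hat{y}_i^{BoN}(t)|x_i)}$

2. Dynamics Equations

Inspired by the concept of potential energy in physics, the authors set up coupled differential equations: $\frac{dU_s(t)}{dt} = -\alpha E(t), \quad \frac{dU_v(t)}{dt} = -\beta E(t)$

where $E(t)$ is the "gap potential energy" and $\alpha > \beta > 0$ are coefficients.

3. Linear Approximation

A first-order Taylor expansion of the potential energy function gives: $E(t) \approx kG(t) - b$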
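
As a short worked step using only the symbols already defined, subtracting the two dynamics equations and substituting the linear approximation reduces the coupled system to a single linear ODE for the gap:

$\frac{dG(t)}{dt} = \frac{dU_s(t)}{dt} - \frac{dU_v(t)}{dt} = -(\alpha-\beta)E(t) \approx -k(\alpha-\beta)\left(G(t) - \frac{b}{k}\right)$

When $k(\alpha-\beta) > 0$, this equation relaxes exponentially toward the fixed point $G = \frac{b}{k}$ at rate $k(\alpha-\beta)$. Substituting the resulting $G(t)$ back into $\frac{dU_s(t)}{dt} = -\alpha(kG(t) - b)$ and $\frac{dU_v(t)}{dt} = -\beta(kG(t) - b)$ and integrating gives exponential trajectories for $U_s(t)$ and $U_v(t)$ with the same rate.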

4. Analytical Solution

Proposition 3.1: Under the condition $k(\alpha-\beta) > 0$, the capability dynamics follow exponential decay:

$U_s(t) \approx \alpha' e^{-k(\alpha-\beta)t} + U_{s,\infty}$

$U_v(t) \approx \beta' e^{-k(\alpha-\beta)t} + U_{v,\infty}$

$G(t) \approx \delta e^{-k(\alpha-\beta)t} + G_\infty$

where:

- $\alpha' = \frac{\alpha\delta}{\alpha-\beta}$ , $\beta' = \frac{\beta\delta}{\alpha-\beta}$
- $\delta = U_{s,0} - U_{v,0} - \frac{b}{k}$
- $U_{s,\infty} = U_{s,0} - \alpha'$ , $U_{v,\infty} = U_{v,0} - \beta'$
- $G_\infty = U_{s,\infty} - U_{v,\infty} = \frac{b}{k}$
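
The closed form can be checked numerically and used to estimate a capability limit from a trajectory. The sketch below is illustrative only: the parameter values are made up, the Euler integrator is a generic check of the linearised dynamics, and the grid-search-plus-least-squares fit is one simple fitting choice, not necessarily the procedure used in the paper. All function names are hypothetical.

```python
import numpy as np

def closed_form(t, u_s0, u_v0, alpha, beta, k, b):
    """Evaluate the exponential solutions for U_s(t) and U_v(t)."""
    rate = k * (alpha - beta)
    delta = u_s0 - u_v0 - b / k
    a_p = alpha * delta / (alpha - beta)   # alpha'
    b_p = beta * delta / (alpha - beta)    # beta'
    decay = np.exp(-rate * np.asarray(t, dtype=float))
    return a_p * decay + (u_s0 - a_p), b_p * decay + (u_v0 - b_p)

def euler(t_end, steps, u_s0, u_v0, alpha, beta, k, b):
    """Integrate dU_s/dt = -alpha*E, dU_v/dt = -beta*E with E = k*G - b."""
    dt = t_end / steps
    u_s, u_v = u_s0, u_v0
    for _ in range(steps):
        e = k * (u_s - u_v) - b
        u_s, u_v = u_s - alpha * e * dt, u_v - beta * e * dt
    return u_s, u_v

def fit_limit(t, u, rates):
    """Fit u(t) ~ a * exp(-r * t) + c; return (r, a, c).

    For each candidate rate r the model is linear in (a, c), so those two
    are solved by least squares and the best r is kept. c is the estimated
    limit (U_{s,inf} or U_{v,inf}, depending on the curve supplied).
    """
    t = np.asarray(t, dtype=float)
    u = np.asarray(u, dtype=float)
    best = None
    for r in rates:
        design = np.column_stack([np.exp(-r * t), np.ones_like(t)])
        coef, *_ = np.linalg.lstsq(design, u, rcond=None)
        sse = float(np.sum((design @ coef - u) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(r), float(coef[0]), float(coef[1]))
    return best[1:]

# Made-up parameters with k * (alpha - beta) > 0.
params = dict(u_s0=2.0, u_v0=1.2, alpha=1.0, beta=0.4, k=0.5, b=0.1)
t = np.arange(8.0)
u_s_curve, u_v_curve = closed_form(t, **params)

# Recover the rate and the solver limit from the synthetic trajectory.
rate_hat, amp_hat, limit_hat = fit_limit(t, u_s_curve, np.linspace(0.05, 1.0, 96))
```

With these parameters the formulas give $\delta = 0.6$, $\alpha' = 1.0$, and $U_{s,\infty} = 1.0$, so a fit on noise-free synthetic data should return a limit close to 1.0; on real measurements the estimate would depend on noise and on the rate grid.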