
Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error

Tang, Huang, Liu et al.

Reinforcement learning with verifiable rewards (RLVR) has recently boosted the reasoning capability of large language models (LLMs) significantly. However, existing RLVR approaches train LLMs only on their own generated responses and are therefore constrained by the models' initial capability. This makes them prone to exploration stagnation, in which LLMs fail to solve any further training problems and can no longer learn from the training data. Some work addresses this by leveraging off-policy solutions to the training problems, but it requires external guidance from experts, whose availability is limited. In this work, we propose LTE (Learning to reason from Trial and Error), an approach that hints LLMs with their own previously generated incorrect answers and with the problem of overlong responses, and that requires no external expert guidance. Experiments validate the effectiveness of LTE, which outperforms standard group relative policy optimization (GRPO) by 6.38 in Pass@1 and 9.00 in Pass@k on average across six mathematics benchmarks for Qwen3-4B-Base. Further analysis confirms that LTE mitigates exploration stagnation and enhances both exploitation and exploration during training.

Basic information

Paper ID: 2510.26109
Title: Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error
Authors: Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Saiyong Yang, Yunfang Wu (Peking University & Tencent)
Category: cs.LG (Machine Learning)
Published: October 30, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.26109v1

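Summary

The paper proposes LTE (Learning to reason from Trial and Error) to address exploration stagnation of large language models (LLMs) in reinforcement learning with verifiable rewards (RLVR). Existing RLVR methods train only on responses the model generates itself, so they are bounded by its initial capability and struggle with problems beyond that upper bound. LTE uses the model's own previously generated wrong answers as hints and breaks through this bottleneck without external expert guidance. On Qwen3-4B-Base, LTE outperforms standard GRPO by 6.38 (Pass@1) and 9.00 (Pass@k) on average across six math benchmarks.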

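Background and motivation

Core problem

The paper targets **exploration stagnation** in reinforcement learning training of LLMs. When a training sample is harder than the model's current capability upper bound, none of the sampled responses passes verification (a none-pass sample). Because GRPO measures each response's advantage relative to the other responses in its group, identical rewards make every advantage zero, and the model learns nothing from that sample.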
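The zero-advantage effect can be reproduced in a few lines. The sketch below assumes the usual GRPO normalization (reward minus group mean, divided by group standard deviation); the function name and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize binary rewards within one rollout group (GRPO-style).

    eps is an assumed stabilizer that avoids division by zero.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One correct rollout out of eight: non-zero learning signal.
some_pass = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# None-pass sample: every reward is 0, so every advantage is 0.
none_pass = group_relative_advantages([0] * 8)

print(some_pass)
print(none_pass)
```

With all rewards equal, the numerator is zero for every rollout, so the policy gradient contribution of that prompt vanishes regardless of how many tokens were generated.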

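Why the problem matters

Capability bottleneck: existing RLVR methods keep the model trapped within its initial capability range, unable to exceed its own upper bound
Training efficiency: a large share of training samples yields no useful learning signal because of exploration stagnation
Reasoning ability: performance gains are limited on tasks that need deep reasoning, such as mathematics

Limitations of existing methods

Existing remedies mostly rely on external guidance, or fall back on brute-force sampling:

Human-annotated reference solutions: expensive and hard to scale
Reasoning chains generated by a stronger model: unavailable when the model being trained is itself the flagship
Simply sampling more: ignores the information in the rollouts already collected, and is inefficient

Research motivation

Propose a self-directed method that uses only the model's own trial-and-error experience and breaks the exploration bottleneck without any external expert guidance.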

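Core contributions

LTE method: presented as the first method that uses an LLM's own trial-and-error experience (its wrong answers) as hints to address exploration stagnation, with no external expert guidance
Mixed-policy optimization: a training framework that combines on-policy and off-policy samples and handles hint-generated correct solutions through regularized importance sampling
Comprehensive experiments: effectiveness is validated on two LLMs (4B and 8B) and six math benchmarks, with clear gains in Pass@1 and Pass@k
In-depth analysis of the mechanism:
- A theoretical argument that LTE raises the probability of reaching a correct answer
- Empirical analysis confirming that LTE mitigates exploration stagnation
- Evidence that LTE strengthens both exploitation and exploration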

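Method details

Task definition

Input: a math problem query $q \sim D$
Output: a reasoning chain and final answer $o$
Objective: maximize the probability of generating a correct answer through RLVR while exceeding the model's initial capability upper bound

Overall framework

The core LTE pipeline has three stages:

1. Initial Rollouts

For each training problem $q$, sample $G$ responses $\{o_1, o_2, \ldots, o_G\}$ and verify each one for correctness.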

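2. Hinted Extra Rollouts (the key innovation)

For none-pass samples (every initial rollout failed), the hinting strategy is chosen according to the truncation status:

a) All-truncated (every response was truncated)

Hint template: "Let's think concisely and output the final answer within \boxed{}."

The failure is attributed to overlong responses, so the hint asks the model to think concisely.

b) Some-truncated (some responses were truncated)

Hint template: "Hint: possible incorrect answers include [a1, a2, ...]
Do not use or mention the hint explicitly. Let's think concisely..."

Wrong answers from the untruncated responses are collected as the hint, and concision is requested as well.

c) None-truncated (no response was truncated)

Hint template: "Hint: possible incorrect answers include [a1, a2, ...]
Do not use or mention the hint explicitly. Let's think step by step..."

Only the wrong-answer hint is given, and reasoning of normal length is allowed.

With the selected hint template, $G$ additional rollouts $\{o_1^{hinted}, o_2^{hinted}, \ldots, o_G^{hinted}\}$ are sampled.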
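The three-way choice of hint can be sketched as a small function. This is an illustration under assumptions rather than the authors' code: the `Rollout` type and `build_hint` name are hypothetical, deduplicating the wrong answers is an assumed detail, and the templates are copied as quoted above, including the trailing `...` where the summary elides the rest of the wording.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Rollout:
    answer: Optional[str]  # extracted final answer, None if none was produced
    truncated: bool        # True if the response hit the length limit

def build_hint(rollouts: List[Rollout]) -> str:
    """Choose a hint for a none-pass sample from its failed initial rollouts."""
    # Case a) all-truncated: only ask for concision.
    if all(r.truncated for r in rollouts):
        return "Let's think concisely and output the final answer within \\boxed{}."
    # Collect wrong answers from untruncated rollouts (deduplicated, order kept).
    wrong = list(dict.fromkeys(
        r.answer for r in rollouts if not r.truncated and r.answer is not None
    ))
    head = (
        f"Hint: possible incorrect answers include [{', '.join(wrong)}]\n"
        "Do not use or mention the hint explicitly. "
    )
    # Case b) some-truncated: wrong answers plus a request for concision.
    if any(r.truncated for r in rollouts):
        return head + "Let's think concisely..."  # tail elided in the source
    # Case c) none-truncated: wrong answers only.
    return head + "Let's think step by step..."   # tail elided in the source

print(build_hint([Rollout("12", False), Rollout(None, True), Rollout("7", False)]))
```

The hint is only built for prompts whose initial group is none-pass, so every collected answer is known to be wrong by construction.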

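3. Mixed-policy Optimization

If the extra rollouts contain $G'$ correct solutions $\{o'_1, \ldots, o'_{G'}\}$, they randomly replace $G'$ of the responses in the initial rollouts, so the group size stays at $G$.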
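A minimal sketch of the replacement step follows. The function name, the off-policy flag list, and the use of Python's `random` module are illustrative assumptions; the summary only states that $G'$ initial responses are replaced at random.

```python
import random

def mix_rollouts(initial, hinted_correct, rng=None):
    """Swap len(hinted_correct) randomly chosen initial rollouts for hinted ones.

    Returns the mixed group and a parallel list of flags marking which
    entries are off-policy (generated under the hint).
    """
    rng = rng or random.Random()
    g_prime = len(hinted_correct)
    if g_prime > len(initial):
        raise ValueError("more hinted solutions than initial rollouts")
    replaced = set(rng.sample(range(len(initial)), g_prime))
    hinted = iter(hinted_correct)
    mixed, is_off_policy = [], []
    for i, rollout in enumerate(initial):
        if i in replaced:
            mixed.append(next(hinted))
            is_off_policy.append(True)
        else:
            mixed.append(rollout)
            is_off_policy.append(False)
    return mixed, is_off_policy

initial_group = [f"o{i}" for i in range(8)]
mixed_group, flags = mix_rollouts(initial_group, ["h1", "h2"], random.Random(0))
print(mixed_group, flags)
```

After mixing, the group contains $G'$ rewarded responses and $G - G'$ unrewarded ones, so the group-relative advantages are no longer all zero.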

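Key technique: regularized importance sampling for the off-policy samples:

$\hat{r}'_{i,t}(\theta) = \frac{\pi_\theta(o'_{i,t} | q, o'_{i,<t})}{\pi_{\theta_{old}}(o'_{i,t} | H_q, q, o'_{i,<t})}$

$f(\hat{r}'_{i,t}(\theta)) = \frac{\hat{r}'_{i,t}(\theta)}{\hat{r}'_{i,t}(\theta) + \gamma}$

where $\gamma = 0.1$ and $H_q$ is the hint. The numerator scores the hinted response under the current policy without the hint, while the denominator is the old policy that generated it with the hint.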
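The shaping function $f$ is easy to inspect numerically. The sketch below takes per-token log-probabilities as inputs (an assumed interface) and evaluates $f(r) = r / (r + \gamma)$ in the algebraically equivalent form $1 / (1 + \gamma e^{-\Delta})$, where $\Delta$ is the log-ratio, which avoids overflow when the ratio is large.

```python
import numpy as np

GAMMA = 0.1  # value of gamma reported in the summary

def regularized_ratio(logp_new, logp_old_hinted, gamma=GAMMA):
    """Compute f(r) = r / (r + gamma) with r = exp(logp_new - logp_old_hinted).

    Rewritten as 1 / (1 + gamma * exp(-delta)) so large ratios do not overflow.
    """
    delta = np.asarray(logp_new, dtype=float) - np.asarray(logp_old_hinted, dtype=float)
    return 1.0 / (1.0 + gamma * np.exp(-delta))

# Ratios exp(-2), 1 and exp(2): f grows with r but stays below 1.
values = regularized_ratio([-2.0, 0.0, 2.0], [0.0, 0.0, 0.0])
print(values)
```

Since $f'(r) = \gamma / (r + \gamma)^2$, the gradient weight shrinks as the ratio grows, which bounds the contribution of any single off-policy token.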

Mixed-policy objective:

$J_{Mixed}(\theta) = \mathbb{E}_{q,\{o'_i, o_{s_i}\}} \left[ \frac{1}{Z'} \sum_{i=1}^{G'} \sum_{t=1}^{|o'_i|} (f(\hat{r}'_{i,t}(\theta)) \cdot \hat{A}'_{i,t}) + \frac{1}{Z} \sum_{i=1}^{G-G'} \sum_{t=1}^{|o_{s_i}|} \text{CLIP}(r_{s_i,t}(\theta), \hat{A}_{s_i,t}, \epsilon) \right]$
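For a single prompt, the objective can be sketched in NumPy. Several details are assumptions because the summary does not define them: $Z'$ and $Z$ are taken to be the total token counts of the off-policy and on-policy responses, CLIP is taken to be the standard PPO clipped surrogate, and each response carries one advantage shared by all of its tokens. Function and argument names are hypothetical.

```python
import numpy as np

def clip_term(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate (assumed meaning of CLIP)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

def mixed_objective(off_ratios, off_advs, on_ratios, on_advs, gamma=0.1, eps=0.2):
    """Mixed-policy objective for one prompt.

    off_ratios / on_ratios: lists of 1-D arrays of per-token ratios.
    off_advs / on_advs: one scalar advantage per response.
    Z' and Z are assumed to be token counts of each subset.
    """
    total = 0.0
    if off_ratios:
        z_off = sum(len(r) for r in off_ratios)
        off_sum = sum(np.sum(r / (r + gamma) * a) for r, a in zip(off_ratios, off_advs))
        total += off_sum / z_off
    if on_ratios:
        z_on = sum(len(r) for r in on_ratios)
        on_sum = sum(np.sum(clip_term(r, a, eps)) for r, a in zip(on_ratios, on_advs))
        total += on_sum / z_on
    return float(total)

# One hinted correct response (2 tokens) and one failed on-policy response (4 tokens).
value = mixed_objective([np.ones(2)], [1.1], [np.ones(4)], [-0.5])
print(value)
```

In this toy call the off-policy term evaluates to 1.0 and the on-policy term to -0.5, so the printed value is 0.5 up to floating-point error.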

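Technical innovations

Self-directed learning: relies on no external supervision, only on the model's own failed attempts
- Wrong answers act as negative examples that narrow the solution space
- The hint instructs the model not to repeat the same mistakes
State-space pruning: the theoretical analysis shows that the hint prunes the state space from $S_q$ to $S'_q = S_q \setminus S^f_q$ (excluding the failed subspace), which raises the probability of reaching a correct answer
Adaptive hinting: the hint content is adjusted dynamically according to the truncation status
- Handles overlong responses
- Balances exploration depth against efficiency
Mixed-policy training: handles on-policy and off-policy data together
- Keeps training stable
- Makes full use of the information in the extra rollouts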
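The pruning claim can be illustrated with a back-of-the-envelope calculation. This is a simplified reading, not the paper's proof: it assumes the hint removes exactly the failed subspace $S^f_q$, that no correct trajectory lies inside it, and that the remaining probability mass is simply renormalized. The symbols $p$ and $p_f$ are introduced here for illustration only.

```latex
\[
p = \Pr_{\pi_\theta}\bigl[\tau \text{ is correct} \mid q\bigr], \qquad
p_f = \Pr_{\pi_\theta}\bigl[\tau \in S^f_q \mid q\bigr], \qquad
p + p_f \le 1,\; p_f < 1
\]
\[
\Pr_{\pi_\theta}\bigl[\tau \text{ is correct} \mid q,\ \tau \in S'_q\bigr]
  = \frac{p}{1 - p_f} \;\ge\; p
\]
```

For example, with $p = 0.02$ and $p_f = 0.5$, the conditional success probability doubles to $0.04$; the more probability mass the known wrong answers account for, the larger the gain.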

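Experimental setup

Datasets

Training data: Skywork-OR1-RL-Data

Qwen3-4B-Base: Level 1 subset, 9,189 samples
Qwen3-8B-Base: Level 3 subset, 3,236 samples
Selection criterion: moderate difficulty, chosen for the best learnability

Evaluation metrics

Six math benchmarks:

MATH-500: 4 samples per problem, reporting Mean@4 and Pass@4
Minerva: 4 samples per problem, reporting Mean@4 and Pass@4
OlympiadBench: 4 samples per problem, reporting Mean@4 and Pass@4
AMC'23: 16 samples per problem, reporting Mean@16 and Pass@16
AIME'24: 16 samples per problem, reporting Mean@16 and Pass@16
AIME'25: 16 samples per problem, reporting Mean@16 and Pass@16

Core metrics:

Pass@1: single-sample accuracy (exploitation ability)
Pass@k: probability that at least one of k samples is correct (upper bound on exploration)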
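Both metrics can be computed from a matrix of per-sample correctness flags. The sketch below assumes that Mean@k serves as the Pass@1 estimate (accuracy averaged over all k samples) and that Pass@k counts a problem as solved if any of its k samples is correct; the function names and toy data are illustrative.

```python
import numpy as np

def mean_at_k(correct):
    """Average accuracy over all samples, in percent (estimate of Pass@1)."""
    c = np.asarray(correct, dtype=float)
    return 100.0 * c.mean()

def pass_at_k(correct):
    """Share of problems with at least one correct sample, in percent."""
    c = np.asarray(correct, dtype=bool)
    return 100.0 * c.any(axis=1).mean()

# Rows are problems, columns are the k = 4 samples drawn for each problem.
results = [
    [1, 0, 0, 0],  # solved once in four tries
    [0, 0, 0, 0],  # never solved
    [1, 1, 1, 1],  # always solved
]
print(mean_at_k(results), pass_at_k(results))
```

In the toy data, 5 of 12 samples are correct (Mean@4 of about 41.7) while 2 of 3 problems are solved at least once (Pass@4 of about 66.7), which shows how Pass@k can sit well above Pass@1.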

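Compared methods

Base: performance of the base model
GRPO: standard Group Relative Policy Optimization
GRPO + Extra Rollouts: simply samples additional rollouts for none-pass samples (no hints)
LTE: the proposed method

Each method is tested in two variants:

w/o Entropy Loss: no entropy loss
w/ Entropy Loss: entropy loss added with coefficient 0.003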

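Implementation details

Training framework: verl
Key hyperparameters:

Learning rate: 1e-6
Training steps: 300
Batch size: 128
Samples per prompt: 8
Temperature: 1.0 (training), 0.6 (evaluation)
Maximum response length: 16,384 (training), 32,768 (evaluation)
KL coefficient: 0.001
Clip ratio: 0.2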
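For reference, the listed hyperparameters can be gathered into one configuration object. The class and field names below are hypothetical and do not correspond to verl's configuration keys; the helper method additionally assumes that the batch size counts prompts rather than responses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LTETrainConfig:
    """Hyperparameters as listed above; field names are illustrative."""
    learning_rate: float = 1e-6
    train_steps: int = 300
    batch_size: int = 128            # assumed to count prompts
    rollouts_per_prompt: int = 8
    train_temperature: float = 1.0
    eval_temperature: float = 0.6
    max_response_len_train: int = 16_384
    max_response_len_eval: int = 32_768
    kl_coef: float = 0.001
    clip_ratio: float = 0.2

    def initial_rollouts_per_step(self) -> int:
        """Initial rollouts per training step, before any hinted extras."""
        return self.batch_size * self.rollouts_per_prompt

config = LTETrainConfig()
print(config.initial_rollouts_per_step())
```

Under that assumption each step starts from 1,024 initial rollouts, and every none-pass prompt adds up to 8 hinted rollouts on top.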

Evaluation setup: follows the standard protocol. Hints are used only during training and never at evaluation time.

Pass@1 performance

Qwen3-4B-Base:

Method	MATH-500	Minerva	Olympiad	AMC'23	AIME'24	AIME'25	Avg.
Base	45.40	19.49	22.81	35.31	8.75	3.75	22.59
GRPO (w/o entropy)	69.65	32.17	34.33	50.62	12.08	4.38	33.87
Extra Rollouts (w/o entropy)	69.30	31.99	35.59	55.78	11.88	6.46	35.17
LTE (w/o entropy)	71.95	33.82	38.44	58.91	16.88	12.29	38.72
LTE (w/ entropy)	76.00	34.01	40.63	65.16	24.17	18.96	43.16

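Key findings:

LTE (w/ entropy) improves the average score by **+6.38** over standard GRPO, the gain reported in the abstract
The gains are especially large on the hardest benchmarks, AIME'24 and AIME'25 (+5.00 and +10.00)

Qwen3-8B-Base:

LTE (w/ entropy) reaches an average score of 42.40, +1.78 over GRPO
Results are less stable, which is attributed to the small training set (3,236 samples)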

Pass@k performance (Table 3)

Qwen3-4B-Base:

Method	MATH-500	Minerva	Olympiad	AMC'23	AIME'24	AIME'25	Avg.
Base	69.80	37.87	39.70	82.50	33.33	26.67	48.31
GRPO (w/o entropy)	77.20	37.50	42.07	75.00	26.67	26.67	47.52
LTE (w/ entropy)	82.40	42.28	51.11	90.00	60.00	40.00	60.97
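The Pass@k rows shown above can be checked and compared with a short script. It uses only the three rows listed in this table and reports each method's change relative to Base; the variable and function names are illustrative.

```python
BENCHMARKS = ["MATH-500", "Minerva", "Olympiad", "AMC'23", "AIME'24", "AIME'25"]

# Pass@k scores for Qwen3-4B-Base, copied from the rows shown above.
PASS_AT_K = {
    "Base": [69.80, 37.87, 39.70, 82.50, 33.33, 26.67],
    "GRPO (w/o entropy)": [77.20, 37.50, 42.07, 75.00, 26.67, 26.67],
    "LTE (w/ entropy)": [82.40, 42.28, 51.11, 90.00, 60.00, 40.00],
}

def average(scores):
    """Unweighted mean over the six benchmarks."""
    return sum(scores) / len(scores)

def delta_vs_base(method):
    """Per-benchmark change relative to the Base row, rounded to 2 decimals."""
    base = PASS_AT_K["Base"]
    return [round(m - b, 2) for m, b in zip(PASS_AT_K[method], base)]

for name in ("GRPO (w/o entropy)", "LTE (w/ entropy)"):
    deltas = dict(zip(BENCHMARKS, delta_vs_base(name)))
    print(name, round(average(PASS_AT_K[name]), 2), deltas)
```

On these rows, GRPO (w/o entropy) falls below Base on AMC'23 and AIME'24 and on the average, while LTE (w/ entropy) is above Base on all six benchmarks, consistent with the claim that LTE preserves exploration.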