2025-11-22T18:43:16.829121

You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

Roy, Hajimirsadeghi, Zhai et al.

Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that utilizes curriculum learning to progressively introduce harder problems during training and mask no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa

Basic Information

Paper ID: 2511.04902
Title: You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models
Authors: Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei (RBC Borealis)
Categories: cs.LG, cs.AI
Venue: NeurIPS 2025 Workshop: MATH-AI
Paper link: https://arxiv.org/abs/2511.04902
Code link: https://github.com/BorealisAI/CuMa

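Summary

This paper systematically studies how label-free reinforcement learning (label-free RL) methods perform on language models of different sizes (0.5B to 7B parameters) and reasoning strengths. It identifies a key limitation: label-free RL depends heavily on the base model's pre-existing reasoning capability, and for weaker models performance often falls below the baseline. The study finds that small models cannot generate chains of thought (CoT) that are long or diverse enough to support effective self-reflection, and that training data difficulty plays a crucial role in whether training succeeds. To address these challenges, the authors propose CuMa, which uses curriculum learning to introduce harder problems progressively and masks rollouts that have no majority-vote answer during training. The method shows consistent improvements across all model sizes.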

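Research Background and Motivation

Core Problem

In recent years, improvements in the reasoning ability of large language models have relied mainly on reinforcement learning, but conventional methods (such as RLHF and RLVR) depend heavily on external supervision signals (human annotations or domain-specific ground-truth labels). To remove this scalability bottleneck, researchers have proposed label-free RL methods (such as TTRL and Intuitor), but these have been validated mainly on large models with strong reasoning ability (such as Qwen2.5-Math-7B). The core question of this paper is: can these label-free RL methods generalize to small base models with limited reasoning ability?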

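Why the Problem Matters

Resource-constrained settings: on edge devices or in environments with limited compute, small models are more practical
Scalability: understanding how small models learn is essential for building scalable reasoning systems
Theoretical significance: it reveals the minimal prerequisites for bootstrapping reasoning ability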

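Limitations of Existing Methods

TTRL: estimates rewards on unlabeled test data through majority voting, but a small model produces too few correct outputs early in training, so the pseudo-labels are wrong
Intuitor: uses the model's own confidence (self-certainty) as an intrinsic reward, but small models have poorly calibrated confidence
Lack of research on weak models: existing methods do not consider failure modes that arise when base reasoning ability is insufficient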

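Research Motivation

Use systematic experiments to uncover the root causes of label-free RL failing on weak models, and propose targeted solutions so that resource-constrained models can also benefit from unsupervised RL.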

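Core Contributions

First systematic analysis: shows how the performance of label-free RL methods differs across model sizes (0.5B to 7B) and finds that weak models suffer significant performance degradation or even collapse
Key findings:
- Label-free RL depends heavily on the base model's pre-existing reasoning capability
- Small models cannot generate chains of thought that are long or diverse enough for self-reflection
- Training data difficulty is a key factor in determining success
- CoT length is not a direct reflection of strong reasoning ability
CuMa method: a combined framework of curriculum learning, reward masking, and data generation
- A progressive training strategy that moves from easy to hard problems
- Masking of the reward signal for samples with no majority consensus
- An LLM-based data generation pipeline with controllable difficulty
Empirical validation: evaluated on several reasoning benchmarks, including Math 500, GPQA, AIME24, GSM8K, and LCB, showing that the method is effective at all model sizes, with especially large gains for weak models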

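Method Details

Task Definition

Input: an unlabeled dataset of reasoning problems $D = \{x_1, \dots, x_M\}$ (for example, math problems)
Output: an optimized policy model $\pi_\theta$ that can generate correct reasoning chains and answers
Constraint: ground-truth labels are not accessible during training; the model can learn only from multiple candidate solutions that it generates itself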

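Model Architecture

1. Curriculum Learning Framework

The dataset is partitioned into K=5 difficulty levels: $D = D_1 \cup D_2 \cup \dots \cup D_K$, where $D_1$ contains the easiest problems and $D_K$ the hardest. Training proceeds in the order $D_1 \to D_K$.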
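
As a rough illustration of the easy-to-hard ordering described above, the following Python sketch groups problems into K levels and returns them in training order. The function name `curriculum_stages`, the `difficulty` callback, and the (question, level) tuples are hypothetical and are not taken from the CuMa code; how difficulty levels are actually assigned is not specified in this section.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")  # type of a single problem record (an assumption)


def curriculum_stages(
    problems: Sequence[P],
    difficulty: Callable[[P], int],
    num_levels: int = 5,
) -> List[List[P]]:
    """Split problems into D_1..D_K by difficulty and return them easiest first.

    `difficulty` is a hypothetical callback that maps a problem to an
    integer level in 1..num_levels.
    """
    stages: List[List[P]] = [[] for _ in range(num_levels)]
    for problem in problems:
        level = difficulty(problem)
        if not 1 <= level <= num_levels:
            raise ValueError(f"difficulty level {level} outside 1..{num_levels}")
        stages[level - 1].append(problem)
    return stages


# Hypothetical toy data: (question, difficulty level) pairs.
toy_problems = [("q_hard", 5), ("q_easy", 1), ("q_mid", 3)]
stages = curriculum_stages(toy_problems, difficulty=lambda p: p[1])
for k, stage in enumerate(stages, start=1):
    # A real training loop would run RL updates on `stage` here.
    print(f"D_{k}: {len(stage)} problem(s)")
```

Returning a list of stages rather than one sorted list keeps the boundary between levels explicit, which makes it easy to decide when to move from one level to the next.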

2. Majority-Voting Reward Mechanism

For each prompt $x_i$, N candidate solutions $\{y_i^{(1)}, \dots, y_i^{(N)}\}$ are generated, and the reward function is defined as: $r(x_i, y_i^{(j)}) = \mathbb{I}[y_i^{(j)} = \text{majority\_vote}(\{y_i^{(1)}, \dots, y_i^{(N)}\})]$
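
The indicator reward above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each candidate solution has already been reduced to a final answer string, the function name `majority_vote_rewards` is hypothetical, and ties are resolved by whichever answer `Counter.most_common` returns first, which is an arbitrary choice.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def majority_vote_rewards(answers: Sequence[Hashable]) -> List[int]:
    """Return 1 for each answer equal to the most frequent answer, else 0.

    `answers` holds the extracted final answers of the N rollouts
    for a single prompt (answer extraction itself is assumed done).
    """
    if not answers:
        return []
    # The most frequent answer serves as the pseudo-label.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1 if answer == pseudo_label else 0 for answer in answers]


# Four hypothetical rollouts for one prompt.
print(majority_vote_rewards(["42", "41", "42", "7"]))  # [1, 0, 1, 0]
```

Note that the reward measures agreement with the model's own consensus, not correctness: if the most frequent answer is wrong, the wrong rollouts are the ones rewarded.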