2025-11-22T18:43:16.829121

You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

Roy, Hajimirsadeghi, Zhai et al.

Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that utilizes curriculum learning to progressively introduce harder problems during training and mask no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa

Basic Information

Paper ID: 2511.04902
Title: You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models
Authors: Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei (RBC Borealis)
Classification: cs.LG, cs.AI
Conference: NeurIPS 2025 Workshop: MATH-AI
Paper Link: https://arxiv.org/abs/2511.04902
Code Link: https://github.com/BorealisAI/CuMa

Abstract

This paper systematically investigates the performance of label-free reinforcement learning (RL) methods across language models of varying scales (0.5B to 7B parameters) and reasoning capabilities. The study reveals a critical limitation: label-free RL heavily depends on the pre-existing reasoning capacity of base models, with performance often degrading below baseline levels for weaker models. The research finds that small models cannot generate sufficiently long or diverse chain-of-thought (CoT) sequences for effective self-reflection, and training data difficulty plays a crucial role in determining success. To address these challenges, the authors propose CuMa, which employs curriculum learning to progressively introduce harder problems and masks samples lacking majority voting consensus during training. The method demonstrates consistent improvements across all model scales.

Research Background and Motivation

Core Problem to Address

Recent improvements in large language model reasoning capabilities have primarily relied on reinforcement learning techniques. However, traditional approaches (such as RLHF and RLVR) heavily depend on external supervision signals (human annotations or domain-specific ground truth labels). To address this scalability bottleneck, researchers have proposed label-free RL methods (such as TTRL and Intuitor), but these have been primarily validated on large, reasoning-capable models (e.g., Qwen2.5-Math-7B). The core question this paper addresses is: Can these label-free RL methods generalize to small-scale base models with limited reasoning capabilities?

Problem Significance

Resource-Constrained Scenarios: Small models are more practical in edge devices or computationally limited environments
Scalability: Understanding learning mechanisms in small models is crucial for building scalable reasoning systems
Theoretical Significance: Revealing the minimum prerequisites for bootstrapping reasoning capabilities

Limitations of Existing Methods

TTRL: Estimates rewards through majority voting on unlabeled test data, but small models produce too few correct outputs early in training, leading to pseudo-label errors
Intuitor: Uses model self-certainty as intrinsic reward, but small models have poor confidence calibration
Lack of Research on Weak Models: Existing methods do not account for failure modes when base reasoning capacity is insufficient

Research Motivation

The paper uses systematic experiments to identify why label-free RL fails on weak models, then proposes targeted remedies so that resource-constrained models can also benefit from unsupervised RL.

Core Contributions

First Systematic Analysis: Reveals performance differences of label-free RL methods across model scales (0.5B-7B), discovering significant performance degradation and even collapse in weak models
Key Findings:
- Label-free RL heavily depends on pre-existing reasoning capacity of base models
- Small models cannot generate sufficiently long or diverse chain-of-thought sequences for self-reflection
- Training data difficulty is the key factor determining success
- CoT length is not a direct reflection of strong reasoning ability
Proposes CuMa Method: A comprehensive framework combining curriculum learning, reward masking, and data generation
- Progressive training strategy from simple to difficult problems
- Masks reward signals for samples without majority consensus
- LLM-based difficulty-controlled data generation pipeline
Empirical Validation: Verified on multiple reasoning benchmarks including Math 500, GPQA, AIME24, GSM8K, and LCB, demonstrating effectiveness across all model scales, with particularly significant improvements for weak models

Method Details

Task Definition

Input: Unlabeled reasoning problem dataset $D = \{x_1, ..., x_M\}$ (e.g., math problems)
Output: Optimized policy model $\pi_\theta$ capable of generating correct reasoning chains and answers
Constraint: No access to ground truth labels during training; learning only through multiple candidate solutions generated by the model itself

Model Architecture

1. Curriculum Learning Framework

Partition the dataset into $K = 5$ difficulty levels: $D = D_1 \cup D_2 \cup \cdots \cup D_K$

where $D_1$ contains the simplest problems and $D_K$ contains the most difficult ones. Training proceeds in the order $D_1 \to D_K$.
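
The staged ordering can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `curriculum_order`, the `(text, level)` tuple layout, and the choice to shuffle only within a level are all assumptions made here.

```python
import random
from collections import defaultdict

def curriculum_order(problems, num_levels=5, seed=0):
    """Order problems from level 1 (easiest) to num_levels (hardest).

    `problems` is a list of (text, level) pairs; this layout is a
    hypothetical choice for illustration.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for text, level in problems:
        if not 1 <= level <= num_levels:
            raise ValueError(f"level {level} outside 1..{num_levels}")
        buckets[level].append(text)

    ordered = []
    for level in range(1, num_levels + 1):
        stage = buckets[level]
        rng.shuffle(stage)  # shuffle inside a stage only (assumption)
        ordered.extend((text, level) for text in stage)
    return ordered

sample = [("integrate x^2", 3), ("2 + 2", 1), ("solve x + 1 = 3", 2), ("3 * 4", 1)]
schedule = curriculum_order(sample)
levels = [level for _, level in schedule]
```

Here `levels` is non-decreasing, so every level-1 problem is seen before any level-2 problem, mirroring the $D_1 \to D_K$ order.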

2. Majority Voting Reward Mechanism

For each prompt $x_i$, generate $N$ candidate solutions $\{y_i^{(1)}, \ldots, y_i^{(N)}\}$, with the reward function defined as: $r(x_i, y_i^{(j)}) = \mathbb{I}[y_i^{(j)} = \text{majority\_vote}(\{y_i^{(1)}, \ldots, y_i^{(N)}\})]$
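
A minimal sketch of this reward, assuming the final answers have already been extracted from each candidate as strings. The function name `majority_vote_rewards` and the tie-breaking rule (first-seen answer wins) are assumptions, not details taken from the paper.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for one prompt's extracted answers."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    counts = Counter(answers)
    # most_common orders equal counts by first occurrence; using that as the
    # tie-break is an assumption made for this sketch.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

label, rewards = majority_vote_rewards(["12", "12", "7", "12", "9", "7", "12", "3"])
```

With eight candidates, four of which answer "12", the pseudo-label is "12" and exactly those four candidates receive reward 1.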

3. Reward Masking Mechanism

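Mask out the learning signal for prompts whose candidates lack majority consensus (i.e., the most frequent answer occurs fewer than 2 times). The mask is 1 when the update is kept and 0 when it is dropped: $\text{mask}(x_i) = \mathbb{I}\left[\max_j |\{k : y_i^{(k)} = y_i^{(j)}\}| \geq 2\right]$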

This prevents the model from learning noisy feedback from uncertain predictions.
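
The masking rule can be sketched on top of majority voting as below. The helper names `consensus_mask` and `masked_rewards`, and the convention of returning `None` for a masked prompt so the caller can skip its update, are assumptions for illustration.

```python
from collections import Counter

def consensus_mask(answers, min_count=2):
    """1 if the most frequent answer appears at least min_count times, else 0."""
    if not answers:
        return 0
    top_count = Counter(answers).most_common(1)[0][1]
    return 1 if top_count >= min_count else 0

def masked_rewards(answers, min_count=2):
    """Majority-vote rewards, or None when the prompt is masked out."""
    if consensus_mask(answers, min_count) == 0:
        return None  # caller skips this prompt's update (assumption)
    majority = Counter(answers).most_common(1)[0][0]
    return [1.0 if a == majority else 0.0 for a in answers]

agree = masked_rewards(["5", "5", "8", "5"])
scattered = masked_rewards(["1", "2", "3", "4"])
```

In the second call every candidate gives a different answer, so no pseudo-label is trusted and the prompt contributes no reward signal.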

4. Data Generation Pipeline

Use an LLM to generate synthetic data at predefined difficulty levels:

Structured prompting strategy explicitly specifying difficulty levels (1-5)
Example problems provided for each level as reference
Dynamic example refreshing to increase diversity
Generate 25 samples per iteration covering different mathematical sub-topics
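
The prompting steps above could be assembled roughly as follows. The prompt wording, the function name `build_generation_prompt`, and the `example_pool` structure are hypothetical; the prompts actually used by the authors are not reproduced in this summary.

```python
import random

def build_generation_prompt(level, example_pool, num_examples=2,
                            batch_size=25, rng=None):
    """Assemble a prompt asking an LLM for `batch_size` problems at `level`."""
    if not 1 <= level <= 5:
        raise ValueError("level must be between 1 and 5")
    rng = rng or random.Random()
    pool = example_pool[level]
    # Resample the reference problems on every call so that successive
    # batches see different examples (dynamic example refreshing).
    examples = rng.sample(pool, k=min(num_examples, len(pool)))
    lines = [
        f"Generate {batch_size} math problems at difficulty level {level} of 5.",
        "Cover different mathematical sub-topics.",
        f"Reference problems at level {level}:",
    ]
    lines += [f"- {ex}" for ex in examples]
    return "\n".join(lines)

pool = {1: ["What is 7 + 5?", "What is 9 - 4?", "What is 3 * 6?"]}
prompt = build_generation_prompt(1, pool, rng=random.Random(0))
```

The returned string would then be sent to whichever generator model is available; that call is deliberately left out because the summary does not specify it.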

Technical Innovations

1. Progressive Difficulty Adjustment

Differences from Baseline:

TTRL/Intuitor: Train on fixed-difficulty data
CuMa: Start with simple problems, progressively increase difficulty

Design Rationale:

Small models produce almost no correct solutions on difficult problems (as shown in Figure 2, 0.5B model has near-zero accuracy early in training)
Build foundational reasoning ability from simple problems before transferring to complex ones
Aligns with human cognitive learning principles

2. Selective Learning Signals

Innovation: Update model only when clear majority consensus exists

Problem Solved:

Early in training, small models generate highly dispersed candidate solutions
Lack of majority consensus indicates model uncertainty on that problem
Forced learning introduces noise, causing performance degradation

Experimental Evidence: The Table 2 ablation (0.5B model, Math 500) shows accuracy dropping from 32.8 to 30.7 without reward masking

3. Difficulty-Controlled Data Augmentation

Technical Details:

Use structured prompt engineering to generate math problems of varying difficulty
Cover multiple sub-domains: algebra, geometry, probability, etc.
Dynamically sample example problems to avoid overfitting to specific patterns

Function: Provides sufficient samples of each difficulty level for curriculum learning

Experimental Setup

Datasets

Math 500: 500 high-quality mathematical problems
GPQA: Graduate-level science question answering (biology, physics, chemistry)
AIME24: American Invitational Mathematics Examination 2024 problems
GSM8K: Elementary school math word problems (8,000+ problems)
LCB: LiveCodeBench, a code generation benchmark

Evaluation Metrics

Accuracy: Proportion of generated answers that exactly match the reference answers
All experiments report percentage accuracy
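
As a rough sketch of the metric, exact-match accuracy in percent can be computed as below. The whitespace stripping is an assumption; real evaluation harnesses typically normalize mathematical answers more carefully.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to the reference after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("empty evaluation set")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

score = exact_match_accuracy(["42", " 7", "x=3", "10"], ["42", "7", "3", "11"])
```

Two of the four predictions match their references, so `score` is 50.0. Note that "x=3" does not match "3" under plain string comparison, which is why answer normalization matters in practice.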

Comparison Methods

Base Model: Untuned base model without RL training
GRPO: Supervised reinforcement learning using ground truth labels (upper bound reference)
Intuitor: Label-free RL based on self-certainty
TTRL: Test-time RL based on majority voting

Implementation Details

Optimizer: AdamW
Learning Rate: Peak 3×10⁻⁶, cosine decay
Sampling Strategy: Generate 8 candidates per prompt, temperature 0.6
Maximum Generation Length: 3,072 tokens
Training Length: 1 episode
Hardware: 4×NVIDIA H100 80GB GPUs
Model Family: Qwen2.5 (0.5B, 1.5B, 3B, 7B)
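
For reference, a cosine decay schedule with the reported peak learning rate looks like the sketch below. The absence of warmup, the floor of zero, and the step count of 1000 are assumptions; the summary states only the peak value and the decay shape.

```python
import math

PEAK_LR = 3e-6  # peak learning rate reported above

def cosine_lr(step, total_steps, peak_lr=PEAK_LR, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# total_steps=1000 is a placeholder, not a value from the paper
start, middle, end = cosine_lr(0, 1000), cosine_lr(500, 1000), cosine_lr(1000, 1000)
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the floor at the final step.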

Experimental Results

Main Results

1. Performance Comparison Across Model Scales (Table 1)

0.5B Model:

Base: Math 500=23.4, GSM8K=26.38
TTRL: Complete collapse (Math 500=0.0)
Intuitor: Performance degradation (GSM8K=0.68)
CuMa: Math 500=32.8 (+40% relative to base), GSM8K=32.9 (+25% relative to base)

7B Model:

Base: Math 500=58.2, GSM8K=81.5
GRPO (labeled upper bound): Math 500=73.8, GSM8K=85.67
TTRL: Math 500=73.6, GSM8K=84.39
Intuitor: Math 500=72.2, GSM8K=78.19
CuMa: Math 500=74.0, GSM8K=84.49 (on par with labeled GRPO)

Key Findings:

All label-free methods are effective on the 7B model
Only CuMa shows stable improvement on small models; other methods degrade or collapse
CuMa avoids collapse on 0.5B models, achieving significant improvements

2. Cross-Benchmark Generalization

CuMa demonstrates improvements across 5 different reasoning benchmarks:

Math 500: Improvements across all model scales
GPQA: 7B model from 27.77→32.32
AIME24: 7B model from 6.67→13.33 (doubled)
LCB: 3B model from 5.20→8.04

Ablation Study

Table 2 shows contributions of CuMa components (0.5B model, Math 500):

Configuration	Math 500 Accuracy (%)	Relative Change vs. Full CuMa
Full CuMa	32.8	-
Without Reward Masking	30.7	-6.4%
Without Data Generation	24.5	-25.3%
Without Curriculum Learning	20.1	-38.7%

Key Insights:

Curriculum Learning Most Critical: Without it, accuracy falls to 20.1, below the base model's 23.4
Data Generation Important: Provides sufficient samples of each difficulty level supporting curriculum learning
Reward Masking Effective: Prevents learning from noise signals, stabilizes training
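
The relative changes in the ablation can be checked with a few lines of arithmetic. The accuracy values are copied from the ablation figures above; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(value, reference):
    """Percentage change of value relative to reference."""
    return 100.0 * (value - reference) / reference

full_cuma = 32.8   # full method, 0.5B model, Math 500
base_model = 23.4  # untuned 0.5B base model, Math 500
ablations = {
    "Without Reward Masking": 30.7,
    "Without Data Generation": 24.5,
    "Without Curriculum Learning": 20.1,
}

changes = {name: round(relative_change(acc, full_cuma), 1)
           for name, acc in ablations.items()}
# How far the no-curriculum variant sits below the untuned base model
below_base = round(relative_change(20.1, base_model), 1)
```

The computed changes are -6.4, -25.3, and -38.7 percent relative to full CuMa, and the variant without curriculum learning ends up about 14.1 percent below the base model in relative terms.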

Case Analysis

Figure 2: Early Training Correct Answer Generation

0.5B Model: Almost no correct outputs in first 50 steps
Consequence: TTRL's majority voting produces incorrect pseudo-labels → model collapse
CuMa Solution: Start with simple problems, enabling partial correct answers early in training

Figure 3: CoT Length Changes During Training

7B Model: Length increases from 500→1400 tokens, including self-reflection
0.5B/1.5B Models: Length remains 500-700, no significant growth
Finding: Length growth is not a reliable indicator for small models

Figure 4: Impact of Training Data Difficulty

Testing different difficulty levels (Level 1-2 to 1-5) on 0.5B model:

Math 500: Accuracy (as a fraction) falls from 0.35 with Levels 1-2 to near 0 with Levels 1-4 (collapse)
GSM8K: Accuracy (as a fraction) gradually decreases from 0.28 to 0.15
Conclusion: Overly difficult data causes small model learning failure

Experimental Findings

Reasoning Capacity Threshold: Label-free RL requires minimum reasoning capacity as prerequisite
Data-Capacity Alignment: Training data difficulty must match model capability
Majority Voting Reliability: Depends on base model generating some correct solutions
Curriculum Learning Universality: Helps all model scales, but more critical for weak models
CoT Length Misleading: Cannot be sole indicator of reasoning improvement in small models

Related Work

1. Supervised Reinforcement Learning

RLHF: Aligning models through human feedback
GRPO: Group Relative Policy Optimization, used with rule-based rewards for mathematical reasoning
DeepSeek-R1: Large-scale reasoning models
Limitation: Requires annotated data, limited scalability

2. Label-Free/Self-Improvement Methods

Self-rewarding LMs: Model self-evaluation
Self-play Fine-tuning: Self-play improvement
DPO: Direct Preference Optimization
Paper's Distinction: Focuses on RL method applicability to weak models

3. Test-Time Optimization

TTRL: Test-time majority voting RL
Intuitor: Self-certainty based approach
Paper's Contribution: Reveals failure modes in weak models and proposes solutions

4. Curriculum Learning

Traditional curriculum learning mainly applied to supervised learning
Paper's Innovation: First systematic application of curriculum learning to label-free RL reasoning tasks

Conclusions and Discussion

Main Conclusions

Core Finding: Label-free RL is not a "free lunch"; it requires base reasoning capacity as prerequisite
Failure Mechanisms:
- Weak models cannot generate sufficient correct solutions → majority voting fails
- Lack of diverse CoT → self-reflection mechanism ineffective
- Overly difficult data → sparse learning signals
Solution Effectiveness: CuMa improves performance across 0.5B-7B scales, with particularly significant gains for weak models
Theoretical Significance: Reveals minimum conditions and pathways for reasoning capability bootstrapping

Limitations

Model Range: Validated only on Qwen models; generalization to other architectures (LLaMA, Mistral) unknown
Domain Restriction: Primarily focuses on mathematical reasoning; applicability to other reasoning types (commonsense, logical reasoning) requires further verification
Curriculum Design: Difficulty grading depends on manual definition or LLM generation, lacking automated difficulty assessment
Computational Cost: Requires generating many candidate solutions (8 per problem), high inference cost
Minimum Capacity Threshold: No clear quantitative standard for "sufficient reasoning capacity"
Data Generation Quality: Synthetic data diversity and quality depend on generating model

Future Directions

Adaptive Curriculum: Dynamically adjust difficulty based on real-time model performance
Hybrid Rewards: Combine majority voting and confidence signals
Cross-Domain Validation: Extend to code generation, scientific reasoning
Theoretical Analysis: Establish formal relationships between reasoning capacity and RL effectiveness
Efficiency Optimization: Reduce candidate generation count, lower computational cost

In-Depth Evaluation

Strengths

1. Precise Problem Identification

First systematic revelation of label-free RL failure in weak models
In-depth root cause analysis through multi-dimensional experiments (model scale, data difficulty, CoT length)
Figure 2 visualization intuitively demonstrates early training collapse mechanism

2. Reasonable Method Design

Simple and Effective: Three components (curriculum learning, reward masking, data generation) each well-motivated
Theoretical Support: Curriculum learning aligns with cognitive science and ML theory
Engineering Feasibility: Easy to implement without introducing complex new components

3. Comprehensive Experiments

Full Scale Coverage: Covers 0.5B-7B four model scales
Diverse Benchmarks: 5 different types of reasoning tasks
Complete Comparisons: Includes labeled upper bound (GRPO) and multiple label-free baselines
Detailed Ablations: Verifies contribution of each component

4. High Practical Value

Provides viable solution for resource-constrained scenarios (edge devices, low-cost deployment)
Open-source code with strong reproducibility
General method, extensible to other RL paradigms

5. Clear Writing

Rigorous logical structure: problem → analysis → method → verification
Effective visualizations (Figures 1-4 intuitively present key findings)
Well-summarized core contributions

Weaknesses

1. Limited Theoretical Depth

Lacks Formal Analysis: No theoretical relationship established between reasoning capacity and RL convergence
Vague Difficulty Definition: Level 1-5 division depends on subjective judgment
Unquantified Threshold: What degree of reasoning capacity suffices for label-free RL?

2. Experimental Design Flaws

Single Model Family: Only Qwen models; architectural bias not eliminated
Data Generation Dependency: Synthetic data quality depends on Qwen-72B, potential bias
Missing Statistical Significance: No variance and confidence intervals from multiple runs
Unreported Computational Cost: Training time, GPU hours not disclosed

3. Method Limitations

Fixed Curriculum: 5 difficulty levels and order are hyperparameters, lacking adaptive mechanisms
Fragile Majority Voting: Still depends on base model generating some correct solutions
Conservative Reward Masking: May miss valuable learning opportunities from difficult samples

4. Insufficient Analysis

Missing Failure Cases: No demonstration of CuMa failures
Incomplete Human Learning Comparison: Curriculum learning analogy not deeply explored
Unknown Long-term Effects: Only 1 episode training; stability with continued training unverified

5. Questionable Generalization

Single Task Type: Primarily mathematical reasoning; other types insufficiently verified
Language Limitation: English data only; multilingual scenarios unconsidered
Domain Knowledge: Applicability to specialized domains (medicine, law) unknown

Impact

Contribution to Field

Fills Research Gap: First systematic study of label-free RL behavior in weak models
Methodological Insights: Demonstrates curriculum learning effectiveness in RL reasoning tasks
Practical Guidance: Provides viable pathway for small model reasoning improvement
Theoretical Foundation: Establishes basis for subsequent research on reasoning capability bootstrapping

Practical Value

Edge Deployment: Enables small models to improve through RL, reducing deployment costs
Educational Applications: Progressive learning strategy applicable to personalized education systems
Research Tools: Open-source code and data generation pipeline available to community

Reproducibility

✅ Code open-sourced (GitHub)
✅ Detailed hyperparameters (learning rate, temperature, generation length, etc.)
✅ Data generation prompts disclosed (Appendix B)
⚠️ High computational resource requirements (4×H100)
⚠️ Synthetic data not directly released

Applicable Scenarios

Suitable Scenarios

Resource-Constrained Environments: Need reasoning improvement on small models
Unlabeled Data: Abundant reasoning problems but lacking standard answers
Progressive Learning: Tasks with clear difficulty hierarchy (education, competition training)
Math/Code Reasoning: Closed-domain tasks with objective correct answers

Unsuitable Scenarios

Open-Domain Generation: Creative writing, dialogue systems (no clear correct answers)
Extremely Weak Models: <0.5B or near-random base reasoning capacity
Real-Time Systems: Cannot afford multiple sampling overhead
Subjective Tasks: Sentiment analysis, style transfer (majority voting meaningless)

References

DeepSeekMath [1]: Open language model for mathematical reasoning; introduces GRPO
DeepSeek-R1 [2]: Large-scale reasoning models and RL training
TTRL [3]: Test-time reinforcement learning framework
Intuitor [4]: Unsupervised RL based on intrinsic confidence
RL Foundations [5]: Sutton & Barto classic textbook
RLHF [6]: Classic learning from human feedback method
PPO [7]: Proximal Policy Optimization algorithm
Chain-of-Thought [8]: Chain-of-thought prompting technique
Self-rewarding LMs [14-16]: Self-reward and self-improvement
DPO [17]: Direct Preference Optimization

Summary

This paper combines an empirical study of why label-free reinforcement learning fails on weak reasoning models with a method that addresses those failures. Its core value lies in revealing the prerequisite for bootstrapping reasoning capability: base models must possess a minimum level of reasoning capacity to benefit from unsupervised RL. By combining curriculum learning, reward masking, and data generation, CuMa achieves stable improvement even on the weak 0.5B model.

Highlights: Precise problem identification, simple effective method, comprehensive experiments, high practical value.
Shortcomings: Limited theoretical analysis, restricted generalization verification, missing statistical significance.

Recommendation Score: ⭐⭐⭐⭐ (4/5)
Recommended for researchers interested in small model reasoning, unsupervised learning, and curriculum learning. Also a valuable reference for industry practitioners deploying reasoning models in resource-constrained scenarios.