2025-11-22T18:43:16.829121

You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

Roy, Hajimirsadeghi, Zhai et al.
Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that utilizes curriculum learning to progressively introduce harder problems during training and mask no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa

You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

Basic Information

  • Paper ID: 2511.04902
  • Title: You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models
  • Authors: Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei (RBC Borealis)
  • Classification: cs.LG, cs.AI
  • Conference: NeurIPS 2025 Workshop: MATH-AI
  • Paper Link: https://arxiv.org/abs/2511.04902
  • Code Link: https://github.com/BorealisAI/CuMa

Abstract

This paper systematically investigates the performance of label-free reinforcement learning (RL) methods across language models of varying scales (0.5B to 7B parameters) and reasoning capabilities. The study reveals a critical limitation: label-free RL heavily depends on the pre-existing reasoning capacity of base models, with performance often degrading below baseline levels for weaker models. The research finds that small models cannot generate sufficiently long or diverse chain-of-thought (CoT) sequences for effective self-reflection, and training data difficulty plays a crucial role in determining success. To address these challenges, the authors propose CuMa, which employs curriculum learning to progressively introduce harder problems and masks samples lacking majority voting consensus during training. The method demonstrates consistent improvements across all model scales.

Research Background and Motivation

Core Problem to Address

Recent improvements in large language model reasoning capabilities have primarily relied on reinforcement learning techniques. However, traditional approaches (such as RLHF and RLVR) heavily depend on external supervision signals (human annotations or domain-specific ground truth labels). To address this scalability bottleneck, researchers have proposed label-free RL methods (such as TTRL and Intuitor), but these have been primarily validated on large, reasoning-capable models (e.g., Qwen2.5-Math-7B). The core question this paper addresses is: Can these label-free RL methods generalize to small-scale base models with limited reasoning capabilities?

Problem Significance

  1. Resource-Constrained Scenarios: Small models are more practical on edge devices and in computationally limited environments
  2. Scalability: Understanding learning mechanisms in small models is crucial for building scalable reasoning systems
  3. Theoretical Significance: Revealing the minimum prerequisites for bootstrapping reasoning capabilities

Limitations of Existing Methods

  1. TTRL: Estimates rewards through majority voting on unlabeled test data, but small models produce too few correct outputs early in training, leading to pseudo-label errors
  2. Intuitor: Uses model self-certainty as intrinsic reward, but small models have poor confidence calibration
  3. Lack of Research on Weak Models: Existing methods do not account for failure modes when base reasoning capacity is insufficient

Research Motivation

The paper sets out to identify, through systematic experiments, the fundamental reasons label-free RL fails in weak models, and to propose targeted solutions that let resource-constrained models benefit from unsupervised RL.

Core Contributions

  1. First Systematic Analysis: Reveals performance differences of label-free RL methods across model scales (0.5B-7B), discovering significant performance degradation and even collapse in weak models
  2. Key Findings:
    • Label-free RL heavily depends on pre-existing reasoning capacity of base models
    • Small models cannot generate sufficiently long or diverse chain-of-thought sequences for self-reflection
    • Training data difficulty is the key factor determining success
    • CoT length is not a direct reflection of strong reasoning ability
  3. Proposes CuMa Method: A comprehensive framework combining curriculum learning, reward masking, and data generation
    • Progressive training strategy from simple to difficult problems
    • Masks reward signals for samples without majority consensus
    • LLM-based difficulty-controlled data generation pipeline
  4. Empirical Validation: Verified on multiple reasoning benchmarks including Math 500, GPQA, AIME24, GSM8K, and LCB, demonstrating effectiveness across all model scales, with particularly significant improvements for weak models

Method Details

Task Definition

Input: Unlabeled reasoning problem dataset $D = \{x_1, \ldots, x_M\}$ (e.g., math problems)
Output: Optimized policy model $\pi_\theta$ capable of generating correct reasoning chains and answers
Constraint: No access to ground truth labels during training; learning only through multiple candidate solutions generated by the model itself

Model Architecture

1. Curriculum Learning Framework

Partition the dataset into $K = 5$ difficulty levels: $D = D_1 \cup D_2 \cup \cdots \cup D_K$

where $D_1$ contains the simplest problems and $D_K$ contains the most difficult ones. Training proceeds in the order $D_1 \to D_K$.
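
The ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_curriculum`, the `(problem, level)` input format, and the toy problems are hypothetical and are not taken from the paper's code.

```python
from collections import defaultdict

def build_curriculum(problems, num_levels=5):
    """Group (problem, level) pairs into stages ordered from level 1 upward.

    Hypothetical helper: the (text, level) input format is an assumption.
    """
    stages = defaultdict(list)
    for text, level in problems:
        if not 1 <= level <= num_levels:
            raise ValueError(f"level {level} is outside 1..{num_levels}")
        stages[level].append(text)
    # Easiest stage first; levels with no problems are skipped.
    return [stages[k] for k in range(1, num_levels + 1) if stages[k]]

# Toy data, invented for illustration only.
problems = [
    ("integrate x^2", 3),
    ("2 + 3", 1),
    ("solve x + 1 = 4", 2),
    ("7 - 5", 1),
]
curriculum = build_curriculum(problems)
for stage_index, stage in enumerate(curriculum, start=1):
    print(stage_index, stage)
```

A training loop would then consume the stages in list order, so the policy only sees a harder stage after it has trained on the easier ones.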

2. Majority Voting Reward Mechanism

For each prompt $x_i$, generate $N$ candidate solutions $\{y_i^{(1)}, \ldots, y_i^{(N)}\}$, with the reward function defined as: $r(x_i, y_i^{(j)}) = \mathbb{I}\left[y_i^{(j)} = \text{majority\_vote}(\{y_i^{(1)}, \ldots, y_i^{(N)}\})\right]$
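
A minimal Python sketch of this reward, assuming the final answers have already been extracted from each rollout as strings. The function name and the tie-breaking rule (the first answer to reach the top count wins) are assumptions made for illustration; the summary does not say how ties are resolved.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return the pseudo-label and a 0/1 reward per candidate answer.

    Assumption: ties are broken in favour of the answer seen first.
    """
    if not answers:
        raise ValueError("need at least one candidate answer")
    counts = Counter(answers)
    majority_answer, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == majority_answer else 0.0 for a in answers]
    return majority_answer, rewards

# Eight candidates per prompt, matching the reported sampling setup.
answers = ["4", "4", "5", "4", "7", "4", "5", "4"]
pseudo_label, rewards = majority_vote_rewards(answers)
print(pseudo_label, rewards)
```

Note that the reward measures agreement with the group, not correctness: if most rollouts share the same wrong answer, that answer is reinforced.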

3. Reward Masking Mechanism

Mask learning signals when samples lack majority consensus (i.e., maximum occurrence count < 2): $\text{mask}(x_i) = \mathbb{I}\left[\max_j |\{k : y_i^{(k)} = y_i^{(j)}\}| \geq 2\right]$

This prevents the model from learning noisy feedback from uncertain predictions.
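
The mask can be computed directly from the answer counts. The sketch below assumes string answers and uses a hypothetical function name; the threshold of 2 follows the definition above.

```python
from collections import Counter

def consensus_mask(answers, min_count=2):
    """Return 1 if some answer occurs at least min_count times, else 0."""
    if not answers:
        return 0  # assumption: a prompt with no rollouts is treated as masked
    top_count = max(Counter(answers).values())
    return 1 if top_count >= min_count else 0

print(consensus_mask(["1", "2", "3", "4"]))  # all answers differ
print(consensus_mask(["1", "2", "1", "4"]))  # "1" appears twice
```

With this threshold a prompt is dropped only when every rollout gives a different answer, so the filter targets the fully dispersed case rather than requiring a strict majority.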

4. Data Generation Pipeline

Use an LLM to generate synthetic data at predefined difficulty levels:

  • Structured prompting strategy explicitly specifying difficulty levels (1-5)
  • Example problems provided for each level as reference
  • Dynamic example refreshing to increase diversity
  • Generate 25 samples per iteration covering different mathematical sub-topics
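
To make the prompting strategy concrete, here is a hedged Python sketch of a prompt builder. The prompt wording, the function name `build_generation_prompt`, the `shots` parameter, and the example pool are all hypothetical; this is not the authors' template, only an illustration of level-conditioned prompting with re-sampled reference examples.

```python
import random

def build_generation_prompt(examples_by_level, level, num_problems=25, shots=2, rng=None):
    """Assemble a level-conditioned prompt with freshly sampled reference problems.

    All wording here is illustrative, not the paper's prompt.
    """
    rng = rng or random.Random()
    pool = examples_by_level[level]
    # Re-sampling the references on every call mimics "dynamic example refreshing".
    shown = rng.sample(pool, k=min(shots, len(pool)))
    lines = [
        f"Write {num_problems} new math problems at difficulty level {level} on a 1-5 scale.",
        "Cover different mathematical sub-topics.",
        f"Reference problems at level {level}:",
    ]
    lines += [f"- {example}" for example in shown]
    return "\n".join(lines)

# Invented example pool for illustration.
examples_by_level = {1: ["1 + 1", "2 + 2", "3 + 3"]}
prompt = build_generation_prompt(examples_by_level, level=1, rng=random.Random(0))
print(prompt)
```

Passing a seeded `random.Random` makes a given prompt reproducible, while leaving `rng` unset yields different reference examples on each call.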

Technical Innovations

1. Progressive Difficulty Adjustment

Differences from Baseline:

  • TTRL/Intuitor: Train on fixed-difficulty data
  • CuMa: Start with simple problems, progressively increase difficulty

Design Rationale:

  • Small models produce almost no correct solutions on difficult problems (as shown in Figure 2, the 0.5B model has near-zero accuracy early in training)
  • Build foundational reasoning ability from simple problems before transferring to complex ones
  • Aligns with human cognitive learning principles

2. Selective Learning Signals

Innovation: Update model only when clear majority consensus exists

Problem Solved:

  • Early in training, small models generate highly dispersed candidate solutions
  • Lack of majority consensus indicates model uncertainty on that problem
  • Forced learning introduces noise, causing performance degradation

Experimental Evidence: The Table 2 ablation (0.5B model, Math 500) shows accuracy dropping from 32.8 to 30.7 when reward masking is removed
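
One way to see how the mask affects the update is to look at per-rollout advantages. The sketch below assumes a GRPO-style group-normalized advantage, which is an assumption on our part rather than something this summary states about CuMa's optimizer; the function name is hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def masked_advantages(answers, min_count=2, eps=1e-6):
    """Group-normalized advantages, zeroed when no answer reaches min_count.

    Assumption: GRPO-style normalization (reward minus group mean, divided by
    group standard deviation).
    """
    majority, top_count = Counter(answers).most_common(1)[0]
    if top_count < min_count:
        # No consensus: this prompt contributes no learning signal.
        return [0.0] * len(answers)
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(masked_advantages(["1", "2", "3", "4"]))  # masked: all zeros
print(masked_advantages(["4", "4", "5", "4"]))  # agreeing rollouts get positive values
```

Without the mask, the fully dispersed group would still pick an arbitrary pseudo-label and push the policy toward it, which is the noisy update the masking is meant to avoid.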

3. Difficulty-Controlled Data Augmentation

Technical Details:

  • Use structured prompt engineering to generate math problems of varying difficulty
  • Cover multiple sub-domains: algebra, geometry, probability, etc.
  • Dynamically sample example problems to avoid overfitting to specific patterns

Function: Provides sufficient samples of each difficulty level for curriculum learning

Experimental Setup

Datasets

  1. Math 500: 500 high-quality mathematical problems
  2. GPQA: Graduate-level science question answering (biology, physics, chemistry)
  3. AIME24: American Invitational Mathematics Examination 2024 problems
  4. GSM8K: Elementary school math word problems (8,000+ problems)
  5. LCB: LiveCodeBench, a code generation benchmark

Evaluation Metrics

  • Accuracy: Proportion of generated answers exactly matching standard answers
  • All experiments report percentage accuracy
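
As a small illustration of the metric, the following Python sketch computes exact-match accuracy as a percentage. Stripping surrounding whitespace is an assumption made here; math benchmarks typically apply more elaborate answer normalization, which this sketch does not attempt.

```python
def accuracy_percent(predictions, references):
    """Percentage of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    # Assumption: only leading/trailing whitespace is normalized.
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

score = accuracy_percent(["4", " 5", "6"], ["4", "5", "7"])
print(f"{score:.2f}")
```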

Comparison Methods

  1. Base Model: Untuned base model without RL training
  2. GRPO: Supervised reinforcement learning using ground truth labels (upper bound reference)
  3. Intuitor: Label-free RL based on self-certainty
  4. TTRL: Test-time RL based on majority voting

Implementation Details

  • Optimizer: AdamW
  • Learning Rate: Peak 3×10⁻⁶, cosine decay
  • Sampling Strategy: Generate 8 candidates per prompt, temperature 0.6
  • Maximum Generation Length: 3,072 tokens
  • Training Duration: 1 episode
  • Hardware: 4×NVIDIA H100 80GB GPUs
  • Model Family: Qwen2.5 (0.5B, 1.5B, 3B, 7B)
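
For readers unfamiliar with cosine decay, the schedule can be sketched as below. The peak value of 3e-6 comes from the list above; the absence of warmup, the decay floor of zero, and the step count of 100 are assumptions chosen for illustration and are not reported settings.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-6, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    Assumptions: no warmup phase and a floor of min_lr=0.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# total_steps=100 is a hypothetical value for the example.
for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100))
```

The rate stays close to the peak early on, falls fastest around the midpoint, and flattens out again near the end of training.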

Experimental Results

Main Results

1. Performance Comparison Across Model Scales (Table 1)

0.5B Model:

  • Base: Math 500=23.4, GSM8K=26.38
  • TTRL: Complete collapse (Math 500=0.0)
  • Intuitor: Performance degradation (GSM8K=0.68)
  • CuMa: Math 500=32.8 (+40% relative to base), GSM8K=32.9 (+25% relative to base)

7B Model:

  • Base: Math 500=58.2, GSM8K=81.5
  • GRPO: Math 500=73.8, GSM8K=85.67 (labeled upper bound)
  • TTRL: Math 500=73.6, GSM8K=84.39
  • Intuitor: Math 500=72.2, GSM8K=78.19
  • CuMa: Math 500=74.0, GSM8K=84.49 (comparable to the labeled method)

Key Findings:

  • All label-free methods are effective on large models
  • Only CuMa shows stable improvement on small models; other methods degrade or collapse
  • CuMa avoids collapse on 0.5B models, achieving significant improvements

2. Cross-Benchmark Generalization

CuMa demonstrates improvements across 5 different reasoning benchmarks:

  • Math 500: Improvements across all model scales
  • GPQA: 7B model from 27.77→32.32
  • AIME24: 7B model from 6.67→13.33 (doubled)
  • LCB: 3B model from 5.20→8.04

Ablation Study

Table 2 shows contributions of CuMa components (0.5B model, Math 500):

Configuration                | Performance | Degradation
Full CuMa                    | 32.8        | -
Without Reward Masking       | 30.7        | -6.4%
Without Data Generation      | 24.5        | -25.3%
Without Curriculum Learning  | 20.1        | -38.7%
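
The degradation figures are relative to the full CuMa score, which the short Python check below reproduces from the accuracies in the ablation. The dictionary keys and function name are our own labels; the numbers are the ones reported above.

```python
FULL_CUMA = 32.8  # Math 500 accuracy of the full method on the 0.5B model

ablations = {
    "without_reward_masking": 30.7,
    "without_data_generation": 24.5,
    "without_curriculum_learning": 20.1,
}

def relative_change_percent(score, reference=FULL_CUMA):
    """Relative change of score with respect to reference, in percent."""
    return 100.0 * (score - reference) / reference

changes = {name: relative_change_percent(score) for name, score in ablations.items()}
for name, change in changes.items():
    print(f"{name}: {change:.1f}%")
```

The same arithmetic applied to the 0.5B base model score of 23.4 shows that removing curriculum learning leaves the model below its untrained starting point.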

Key Insights:

  1. Curriculum Learning Most Critical: Without it, accuracy falls below the untrained base model (20.1 vs. 23.4)
  2. Data Generation Important: Provides sufficient samples of each difficulty level supporting curriculum learning
  3. Reward Masking Effective: Prevents learning from noise signals, stabilizes training

Case Analysis

Figure 2: Early Training Correct Answer Generation

  • 0.5B Model: Almost no correct outputs in first 50 steps
  • Consequence: TTRL's majority voting produces incorrect pseudo-labels → model collapse
  • CuMa Solution: Start with simple problems, enabling partial correct answers early in training

Figure 3: CoT Length Changes During Training

  • 7B Model: Length increases from 500→1400 tokens, including self-reflection
  • 0.5B/1.5B Models: Length remains 500-700, no significant growth
  • Finding: Length growth is not a reliable indicator for small models

Figure 4: Impact of Training Data Difficulty

Training the 0.5B model on cumulative difficulty ranges (Levels 1-2 through Levels 1-5):

  • Math 500: 0.35 at Levels 1-2 → near 0 at Levels 1-4 (collapse)
  • GSM8K: Gradually decreases from 0.28 to 0.15
  • Conclusion: Overly difficult data causes small model learning failure

Experimental Findings

  1. Reasoning Capacity Threshold: Label-free RL requires minimum reasoning capacity as prerequisite
  2. Data-Capacity Alignment: Training data difficulty must match model capability
  3. Majority Voting Reliability: Depends on base model generating some correct solutions
  4. Curriculum Learning Universality: Helps all model scales, but more critical for weak models
  5. CoT Length Misleading: Cannot be sole indicator of reasoning improvement in small models

Related Work

1. Supervised Reinforcement Learning

  • RLHF: Aligning models through human feedback
  • GRPO: Rule-based reward methods for mathematical reasoning
  • DeepSeek-R1: Large-scale reasoning models
  • Limitation: Requires annotated data, limited scalability

2. Label-Free/Self-Improvement Methods

  • Self-rewarding LMs: Model self-evaluation
  • Self-play Fine-tuning: Self-play improvement
  • DPO: Direct Preference Optimization
  • Paper's Distinction: Focuses on RL method applicability to weak models

3. Test-Time Optimization

  • TTRL: Test-time majority voting RL
  • Intuitor: Self-certainty based approach
  • Paper's Contribution: Reveals failure modes in weak models and proposes solutions

4. Curriculum Learning

  • Traditional curriculum learning mainly applied to supervised learning
  • Paper's Innovation: First systematic application of curriculum learning to label-free RL reasoning tasks

Conclusions and Discussion

Main Conclusions

  1. Core Finding: Label-free RL is not a "free lunch"; it requires base reasoning capacity as prerequisite
  2. Failure Mechanisms:
    • Weak models cannot generate sufficient correct solutions → majority voting fails
    • Lack of diverse CoT → self-reflection mechanism ineffective
    • Overly difficult data → sparse learning signals
  3. Solution Effectiveness: CuMa improves performance across 0.5B-7B scales, with particularly significant gains for weak models
  4. Theoretical Significance: Reveals minimum conditions and pathways for reasoning capability bootstrapping

Limitations

  1. Model Range: Validated only on Qwen models; generalization to other architectures (LLaMA, Mistral) unknown
  2. Domain Restriction: Primarily focuses on mathematical reasoning; applicability to other reasoning types (commonsense, logical reasoning) requires further verification
  3. Curriculum Design: Difficulty grading depends on manual definition or LLM generation, lacking automated difficulty assessment
  4. Computational Cost: Requires generating many candidate solutions (8 per problem), high inference cost
  5. Minimum Capacity Threshold: No clear quantitative standard for "sufficient reasoning capacity"
  6. Data Generation Quality: Synthetic data diversity and quality depend on generating model

Future Directions

  1. Adaptive Curriculum: Dynamically adjust difficulty based on real-time model performance
  2. Hybrid Rewards: Combine majority voting and confidence signals
  3. Cross-Domain Validation: Extend to code generation, scientific reasoning
  4. Theoretical Analysis: Establish formal relationships between reasoning capacity and RL effectiveness
  5. Efficiency Optimization: Reduce candidate generation count, lower computational cost

In-Depth Evaluation

Strengths

1. Precise Problem Identification

  • First systematic revelation of label-free RL failure in weak models
  • In-depth root cause analysis through multi-dimensional experiments (model scale, data difficulty, CoT length)
  • Figure 2 visualization intuitively demonstrates early training collapse mechanism

2. Reasonable Method Design

  • Simple and Effective: Three components (curriculum learning, reward masking, data generation) each well-motivated
  • Theoretical Support: Curriculum learning aligns with cognitive science and ML theory
  • Engineering Feasibility: Easy to implement without introducing complex new components

3. Comprehensive Experiments

  • Full Scale Coverage: Covers four model scales from 0.5B to 7B
  • Diverse Benchmarks: 5 different types of reasoning tasks
  • Complete Comparisons: Includes labeled upper bound (GRPO) and multiple label-free baselines
  • Detailed Ablations: Verifies contribution of each component

4. High Practical Value

  • Provides viable solution for resource-constrained scenarios (edge devices, low-cost deployment)
  • Open-source code with strong reproducibility
  • General method, extensible to other RL paradigms

5. Clear Writing

  • Rigorous logical structure: problem → analysis → method → verification
  • Effective visualizations (Figures 1-4 intuitively present key findings)
  • Well-summarized core contributions

Weaknesses

1. Limited Theoretical Depth

  • Lacks Formal Analysis: No theoretical relationship established between reasoning capacity and RL convergence
  • Vague Difficulty Definition: Level 1-5 division depends on subjective judgment
  • Unquantified Threshold: What degree of reasoning capacity suffices for label-free RL?

2. Experimental Design Flaws

  • Single Model Family: Only Qwen models; architectural bias not eliminated
  • Data Generation Dependency: Synthetic data quality depends on Qwen-72B, potential bias
  • Missing Statistical Significance: No variance and confidence intervals from multiple runs
  • Unreported Computational Cost: Training time, GPU hours not disclosed

3. Method Limitations

  • Fixed Curriculum: 5 difficulty levels and order are hyperparameters, lacking adaptive mechanisms
  • Fragile Majority Voting: Still depends on base model generating some correct solutions
  • Conservative Reward Masking: May miss valuable learning opportunities from difficult samples

4. Insufficient Analysis

  • Missing Failure Cases: No demonstration of CuMa failures
  • Incomplete Human Learning Comparison: Curriculum learning analogy not deeply explored
  • Unknown Long-term Effects: Only 1 episode training; stability with continued training unverified

5. Questionable Generalization

  • Single Task Type: Primarily mathematical reasoning; other types insufficiently verified
  • Language Limitation: English data only; multilingual scenarios unconsidered
  • Domain Knowledge: Applicability to specialized domains (medicine, law) unknown

Impact

Contribution to Field

  1. Fills Research Gap: First systematic study of label-free RL behavior in weak models
  2. Methodological Insights: Demonstrates curriculum learning effectiveness in RL reasoning tasks
  3. Practical Guidance: Provides viable pathway for small model reasoning improvement
  4. Theoretical Foundation: Establishes basis for subsequent research on reasoning capability bootstrapping

Practical Value

  • Edge Deployment: Enables small models to improve through RL, reducing deployment costs
  • Educational Applications: Progressive learning strategy applicable to personalized education systems
  • Research Tools: Open-source code and data generation pipeline available to community

Reproducibility

  • ✅ Code open-sourced (GitHub)
  • ✅ Detailed hyperparameters (learning rate, temperature, generation length, etc.)
  • ✅ Data generation prompts disclosed (Appendix B)
  • ⚠️ High computational resource requirements (4×H100)
  • ⚠️ Synthetic data not directly released

Applicable Scenarios

Suitable Scenarios

  1. Resource-Constrained Environments: Need reasoning improvement on small models
  2. Unlabeled Data: Abundant reasoning problems but lacking standard answers
  3. Progressive Learning: Tasks with clear difficulty hierarchy (education, competition training)
  4. Math/Code Reasoning: Closed-domain tasks with objective correct answers

Unsuitable Scenarios

  1. Open-Domain Generation: Creative writing, dialogue systems (no clear correct answers)
  2. Extremely Weak Models: <0.5B or near-random base reasoning capacity
  3. Real-Time Systems: Cannot afford multiple sampling overhead
  4. Subjective Tasks: Sentiment analysis, style transfer (majority voting meaningless)

References

  1. DeepSeekMath [1]: Open model benchmark for mathematical reasoning
  2. DeepSeek-R1 [2]: Large-scale reasoning models and RL training
  3. TTRL [3]: Test-time reinforcement learning framework
  4. Intuitor [4]: Unsupervised RL based on intrinsic confidence
  5. RLHF [6]: Classic learning from human feedback method
  6. PPO [7]: Proximal Policy Optimization algorithm
  7. Chain-of-Thought [8]: Chain-of-thought prompting technique
  8. RL Foundations [5]: Sutton & Barto classic textbook
  9. DPO [17]: Direct Preference Optimization
  10. Self-rewarding LMs [14-16]: Self-reward and self-improvement

Summary

This paper conducts in-depth empirical research and methodological innovation on label-free reinforcement learning failure in weak reasoning models. The core value lies in revealing the prerequisites for reasoning capability bootstrapping: base models must possess minimum reasoning capacity to benefit from unsupervised RL. CuMa achieves stable improvement even in 0.5B weak models through synergistic design of curriculum learning, reward masking, and data generation.

Highlights: Precise problem identification, simple effective method, comprehensive experiments, high practical value.
Shortcomings: Limited theoretical analysis, restricted generalization verification, missing statistical significance.

Recommendation Score: ⭐⭐⭐⭐ (4/5)
Recommended for researchers interested in small model reasoning, unsupervised learning, and curriculum learning. Also valuable reference for industry practitioners deploying reasoning models in resource-constrained scenarios.