Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that utilizes curriculum learning to progressively introduce harder problems during training and mask no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa
- Paper ID: 2511.04902
- Title: You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models
- Authors: Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei (RBC Borealis)
- Classification: cs.LG, cs.AI
- Conference: NeurIPS 2025 Workshop: MATH-AI
- Paper Link: https://arxiv.org/abs/2511.04902
- Code Link: https://github.com/BorealisAI/CuMa
This paper systematically investigates the performance of label-free reinforcement learning (RL) methods across language models of varying scales (0.5B to 7B parameters) and reasoning capabilities. The study reveals a critical limitation: label-free RL heavily depends on the pre-existing reasoning capacity of base models, with performance often degrading below baseline levels for weaker models. The research finds that small models cannot generate sufficiently long or diverse chain-of-thought (CoT) sequences for effective self-reflection, and training data difficulty plays a crucial role in determining success. To address these challenges, the authors propose CuMa, which employs curriculum learning to progressively introduce harder problems and masks samples lacking majority voting consensus during training. The method demonstrates consistent improvements across all model scales.
Recent improvements in large language model reasoning capabilities have primarily relied on reinforcement learning techniques. However, traditional approaches (such as RLHF and RLVR) heavily depend on external supervision signals (human annotations or domain-specific ground truth labels). To address this scalability bottleneck, researchers have proposed label-free RL methods (such as TTRL and Intuitor), but these have been primarily validated on large, reasoning-capable models (e.g., Qwen2.5-Math-7B). The core question this paper addresses is: Can these label-free RL methods generalize to small-scale base models with limited reasoning capabilities?
- Resource-Constrained Scenarios: Small models are more practical in edge devices or computationally limited environments
- Scalability: Understanding learning mechanisms in small models is crucial for building scalable reasoning systems
- Theoretical Significance: Revealing the minimum prerequisites for bootstrapping reasoning capabilities
- TTRL: Estimates rewards through majority voting on unlabeled test data, but small models produce too few correct outputs early in training, leading to pseudo-label errors
- Intuitor: Uses model self-certainty as intrinsic reward, but small models have poor confidence calibration
- Lack of Research on Weak Models: Existing methods do not account for failure modes when base reasoning capacity is insufficient
Through systematic experiments, the paper aims to reveal the fundamental reasons label-free RL fails in weak models and to propose targeted solutions, so that resource-constrained models can benefit from unsupervised RL.
- First Systematic Analysis: Reveals performance differences of label-free RL methods across model scales (0.5B-7B), discovering significant performance degradation and even collapse in weak models
- Key Findings:
  - Label-free RL heavily depends on the pre-existing reasoning capacity of base models
  - Small models cannot generate sufficiently long or diverse chain-of-thought sequences for self-reflection
  - Training data difficulty is a key factor in determining success
  - CoT length is not a direct reflection of strong reasoning ability
- Proposes CuMa Method: A comprehensive framework combining curriculum learning, reward masking, and data generation
  - Progressive training strategy from simple to difficult problems
  - Masks reward signals for samples without majority consensus
  - LLM-based difficulty-controlled data generation pipeline
- Empirical Validation: Verified on multiple reasoning benchmarks including Math 500, GPQA, AIME24, GSM8K, and LCB, demonstrating effectiveness across all model scales, with particularly significant improvements for weak models
Input: Unlabeled reasoning problem dataset $D = \{x_1, \dots, x_M\}$ (e.g., math problems)
Output: Optimized policy model $\pi_\theta$ capable of generating correct reasoning chains and answers
Constraint: No access to ground truth labels during training; learning only through multiple candidate solutions generated by the model itself
Partition the dataset into $K = 5$ difficulty levels:
$$D = D_1 \cup D_2 \cup \dots \cup D_K$$
where $D_1$ contains the simplest problems and $D_K$ contains the most difficult ones. Training proceeds in the order $D_1 \to D_2 \to \dots \to D_K$.
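The staged ordering can be sketched as follows. This is a minimal illustration, assuming each problem already carries an integer difficulty label from 1 to K; the `Problem` type and the `curriculum_stages` helper are hypothetical names and are not taken from the CuMa repository.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Problem:
    text: str
    difficulty: int  # assumed label in 1..K, where 1 is the easiest


def curriculum_stages(problems: List[Problem], k: int = 5) -> Iterator[List[Problem]]:
    """Yield the subsets D_1, ..., D_K in order of increasing difficulty."""
    buckets = defaultdict(list)
    for p in problems:
        if not 1 <= p.difficulty <= k:
            raise ValueError(f"difficulty {p.difficulty} is outside 1..{k}")
        buckets[p.difficulty].append(p)
    # Levels with no problems yield an empty stage rather than being skipped.
    for level in range(1, k + 1):
        yield buckets[level]
```

A training loop would iterate over the yielded stages in order and run the RL update on each stage before moving to the next; how many steps to spend per stage is not specified in this summary.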
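For each prompt $x_i$, generate $N$ candidate solutions $\{y_i^{(1)}, \dots, y_i^{(N)}\}$, with the reward function defined as:
$$r(x_i, y_i^{(j)}) = \mathbb{I}\left[ y_i^{(j)} = \operatorname{majority\_vote}\left(\{y_i^{(1)}, \dots, y_i^{(N)}\}\right) \right]$$
Mask the learning signal when a prompt's rollouts lack majority consensus (i.e., the maximum occurrence count is below 2). The mask equals 1 when the prompt is kept and 0 when it is dropped:
$$\operatorname{mask}(x_i) = \mathbb{I}\left[ \max_j \left| \{k : y_i^{(k)} = y_i^{(j)}\} \right| \ge 2 \right]$$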
This prevents the model from learning noisy feedback from uncertain predictions.
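A minimal sketch of the majority-vote reward and the consensus mask for a single prompt is given below. It assumes the final answers have already been extracted from the rollouts and normalized to strings; the function name `majority_vote_rewards` is hypothetical, and breaking ties in favor of the first-seen answer is an assumption of this sketch rather than something the summary specifies.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_vote_rewards(
    answers: Sequence[str], min_count: int = 2
) -> Tuple[List[float], bool]:
    """Return per-rollout rewards and a keep flag for one prompt.

    answers: final answers extracted from the N rollouts.
    keep is False when no answer occurs at least `min_count` times,
    in which case the prompt should be excluded from the update.
    """
    if not answers:
        return [], False
    counts = Counter(answers)
    # most_common(1) returns the first-seen answer among equal counts.
    majority, top_count = counts.most_common(1)[0]
    keep = top_count >= min_count
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return rewards, keep
```

For example, the answers `["4", "5", "4", "7"]` give rewards `[1.0, 0.0, 1.0, 0.0]` with the prompt kept, while `["1", "2", "3"]` has no repeated answer and is masked out.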
Use LLM to generate synthetic data of predefined difficulty:
- Structured prompting strategy explicitly specifying difficulty levels (1-5)
- Example problems provided for each level as reference
- Dynamic example refreshing to increase diversity
- Generate 25 samples per iteration covering different mathematical sub-topics
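The prompting steps above can be sketched as a small prompt builder. This is an illustration only: the prompt wording, the `build_generation_prompt` name, and the `example_pool` structure are assumptions of this sketch and do not reproduce the prompts in the paper's appendix. Dynamic example refreshing is modeled here as re-sampling the reference problems on every call.

```python
import random
from typing import Dict, List, Optional


def build_generation_prompt(
    level: int,
    example_pool: Dict[int, List[str]],
    n_examples: int = 2,
    n_problems: int = 25,
    rng: Optional[random.Random] = None,
) -> str:
    """Build a difficulty-conditioned prompt with freshly sampled examples."""
    rng = rng or random.Random()
    pool = example_pool[level]
    # Re-sampling on each call changes the reference problems between iterations.
    examples = rng.sample(pool, k=min(n_examples, len(pool)))
    shots = "\n".join(f"- {e}" for e in examples)
    return (
        f"Write {n_problems} new math problems at difficulty level {level} "
        "on a 1-5 scale, covering different mathematical sub-topics.\n"
        f"Reference problems at this level:\n{shots}\n"
    )
```

The returned string would then be sent to whichever generator LLM is in use; that call is left out because the summary does not describe its interface.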
Differences from Baseline:
- TTRL/Intuitor: Train on fixed-difficulty data
- CuMa: Start with simple problems, progressively increase difficulty
Design Rationale:
- Small models produce almost no correct solutions on difficult problems (as shown in Figure 2, 0.5B model has near-zero accuracy early in training)
- Build foundational reasoning ability from simple problems before transferring to complex ones
- Aligns with human cognitive learning principles
Innovation: Update model only when clear majority consensus exists
Problem Solved:
- Early in training, small models generate highly dispersed candidate solutions
- Lack of majority consensus indicates model uncertainty on that problem
- Forced learning introduces noise, causing performance degradation
Experimental Evidence: Table 2 ablation shows performance drops from 32.8 to 30.7 without reward masking
Technical Details:
- Use structured prompt engineering to generate math problems of varying difficulty
- Cover multiple sub-domains: algebra, geometry, probability, etc.
- Dynamically sample example problems to avoid overfitting to specific patterns
Function: Provides sufficient samples of each difficulty level for curriculum learning
- Math 500: A 500-problem subset of the MATH benchmark
- GPQA: Graduate-level, Google-proof question answering in biology, physics, and chemistry
- AIME24: American Invitational Mathematics Examination 2024 problems
- GSM8K: Elementary school math word problems (8,000+ problems)
- LCB: LiveCodeBench, a code generation benchmark
- Accuracy: Proportion of generated answers exactly matching standard answers
- All experiments report percentage accuracy
- Base Model: Untuned base model without RL training
- GRPO: Supervised reinforcement learning using ground truth labels (upper bound reference)
- Intuitor: Label-free RL based on self-certainty
- TTRL: Test-time RL based on majority voting
- Optimizer: AdamW
- Learning Rate: Peak 3×10⁻⁶, cosine decay
- Sampling Strategy: Generate 8 candidates per prompt, temperature 0.6
- Maximum Generation Length: 3,072 tokens
- Training Epochs: 1 episode
- Hardware: 4×NVIDIA H100 80GB GPUs
- Model Family: Qwen2.5 (0.5B, 1.5B, 3B, 7B)
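To make the learning-rate setting concrete, here is a minimal sketch of a cosine decay schedule with the reported peak of 3×10⁻⁶. The summary does not state the warmup length, the final learning rate, or the total number of steps, so `total_steps` and `min_lr` are assumptions of this sketch and warmup is omitted.

```python
import math


def cosine_lr(
    step: int, total_steps: int, peak_lr: float = 3e-6, min_lr: float = 0.0
) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate is 3×10⁻⁶ at the first step, half of that at the midpoint, and approaches `min_lr` at the end of training.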
0.5B Model:
- Base: Math 500=23.4, GSM8K=26.38
- TTRL: Complete collapse (Math 500=0.0)
- Intuitor: Performance degradation (GSM8K=0.68)
- CuMa: Math 500=32.8 (+40% relative to base), GSM8K=32.9 (+25% relative to base)
7B Model:
- Base: Math 500=58.2, GSM8K=81.5
- GRPO: Math 500=73.8, GSM8K=85.67 (labeled reference)
- TTRL: Math 500=73.6, GSM8K=84.39
- Intuitor: Math 500=72.2, GSM8K=78.19
- CuMa: Math 500=74.0, GSM8K=84.49 (on par with the labeled GRPO reference)
Key Findings:
- All label-free methods are effective on large models
- Only CuMa shows stable improvement on small models; other methods degrade or collapse
- CuMa avoids collapse on 0.5B models, achieving significant improvements
CuMa demonstrates improvements across 5 different reasoning benchmarks:
- Math 500: Improvements across all model scales
- GPQA: 7B model from 27.77→32.32
- AIME24: 7B model from 6.67→13.33 (doubled)
- LCB: 3B model from 5.20→8.04
Table 2 shows contributions of CuMa components (0.5B model, Math 500):
| Configuration | Math 500 Accuracy (%) | Relative Degradation |
|---|---|---|
| Full CuMa | 32.8 | - |
| Without Reward Masking | 30.7 | -6.4% |
| Without Data Generation | 24.5 | -25.3% |
| Without Curriculum Learning | 20.1 | -38.7% |
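The degradation column is a relative change with respect to the full CuMa score, not an absolute difference in accuracy points. A short sketch reproducing the arithmetic from the table values (the `relative_drop` helper is a hypothetical name):

```python
def relative_drop(full: float, ablated: float) -> float:
    """Percentage change of the ablated score relative to the full-method score."""
    return 100.0 * (ablated - full) / full


full_cuma = 32.8  # Math 500 accuracy of full CuMa on the 0.5B model
ablations = [
    ("Without Reward Masking", 30.7),
    ("Without Data Generation", 24.5),
    ("Without Curriculum Learning", 20.1),
]
for name, score in ablations:
    print(f"{name}: {relative_drop(full_cuma, score):.1f}%")
# Without Reward Masking: -6.4%
# Without Data Generation: -25.3%
# Without Curriculum Learning: -38.7%
```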
Key Insights:
- Curriculum Learning Most Critical: Without it, performance falls below the untrained base model (20.1 vs base 23.4)
- Data Generation Important: Provides sufficient samples of each difficulty level supporting curriculum learning
- Reward Masking Effective: Prevents learning from noise signals, stabilizes training
- 0.5B Model: Almost no correct outputs in first 50 steps
- Consequence: TTRL's majority voting produces incorrect pseudo-labels → model collapse
- CuMa Solution: Start with simple problems, enabling partial correct answers early in training
- 7B Model: Length increases from 500→1400 tokens, including self-reflection
- 0.5B/1.5B Models: Length remains 500-700, no significant growth
- Finding: Length growth is not a reliable indicator for small models
Training the 0.5B model on progressively wider difficulty ranges (Levels 1-2 through Levels 1-5):
- Math 500: 0.35 with Levels 1-2 → near 0 with Levels 1-4 (collapse)
- GSM8K: Gradually decreases from 0.28 to 0.15 as harder levels are added
- Conclusion: Overly difficult data causes small model learning failure
- Reasoning Capacity Threshold: Label-free RL requires minimum reasoning capacity as prerequisite
- Data-Capacity Alignment: Training data difficulty must match model capability
- Majority Voting Reliability: Depends on base model generating some correct solutions
- Curriculum Learning Universality: Helps all model scales, but more critical for weak models
- CoT Length Misleading: Cannot be sole indicator of reasoning improvement in small models
- RLHF: Aligning models through human feedback
- GRPO: Rule-based reward methods for mathematical reasoning
- DeepSeek-R1: Large-scale reasoning models
- Limitation: Requires annotated data, limited scalability
- Self-rewarding LMs: Model self-evaluation
- Self-play Fine-tuning: Self-play improvement
- DPO: Direct Preference Optimization
- Paper's Distinction: Focuses on RL method applicability to weak models
- TTRL: Test-time majority voting RL
- Intuitor: Self-certainty based approach
- Paper's Contribution: Reveals failure modes in weak models and proposes solutions
- Traditional curriculum learning mainly applied to supervised learning
- Paper's Innovation: First systematic application of curriculum learning to label-free RL reasoning tasks
- Core Finding: Label-free RL is not a "free lunch"; it requires base reasoning capacity as prerequisite
- Failure Mechanisms:
  - Weak models cannot generate sufficient correct solutions → majority voting fails
  - Lack of diverse CoT → self-reflection mechanism ineffective
  - Overly difficult data → sparse learning signals
- Solution Effectiveness: CuMa improves performance across 0.5B-7B scales, with particularly significant gains for weak models
- Theoretical Significance: Reveals minimum conditions and pathways for reasoning capability bootstrapping
- Model Range: Validated only on Qwen models; generalization to other architectures (LLaMA, Mistral) unknown
- Domain Restriction: Primarily focuses on mathematical reasoning; applicability to other reasoning types (commonsense, logical reasoning) requires further verification
- Curriculum Design: Difficulty grading depends on manual definition or LLM generation, lacking automated difficulty assessment
- Computational Cost: Requires generating many candidate solutions (8 per problem), high inference cost
- Minimum Capacity Threshold: No clear quantitative standard for "sufficient reasoning capacity"
- Data Generation Quality: Synthetic data diversity and quality depend on generating model
- Adaptive Curriculum: Dynamically adjust difficulty based on real-time model performance
- Hybrid Rewards: Combine majority voting and confidence signals
- Cross-Domain Validation: Extend to code generation, scientific reasoning
- Theoretical Analysis: Establish formal relationships between reasoning capacity and RL effectiveness
- Efficiency Optimization: Reduce candidate generation count, lower computational cost
- First systematic revelation of label-free RL failure in weak models
- In-depth root cause analysis through multi-dimensional experiments (model scale, data difficulty, CoT length)
- Figure 2 visualization intuitively demonstrates early training collapse mechanism
- Simple and Effective: Three components (curriculum learning, reward masking, data generation) each well-motivated
- Theoretical Support: Curriculum learning aligns with cognitive science and ML theory
- Engineering Feasibility: Easy to implement without introducing complex new components
- Full Scale Coverage: Covers four model scales from 0.5B to 7B
- Diverse Benchmarks: 5 different types of reasoning tasks
- Complete Comparisons: Includes labeled upper bound (GRPO) and multiple label-free baselines
- Detailed Ablations: Verifies contribution of each component
- Provides viable solution for resource-constrained scenarios (edge devices, low-cost deployment)
- Open-source code with strong reproducibility
- General method, extensible to other RL paradigms
- Rigorous logical structure: problem → analysis → method → verification
- Effective visualizations (Figures 1-4 intuitively present key findings)
- Well-summarized core contributions
- Lacks Formal Analysis: No theoretical relationship established between reasoning capacity and RL convergence
- Vague Difficulty Definition: Level 1-5 division depends on subjective judgment
- Unquantified Threshold: What degree of reasoning capacity suffices for label-free RL?
- Single Model Family: Only Qwen models; architectural bias not eliminated
- Data Generation Dependency: Synthetic data quality depends on Qwen-72B, potential bias
- Missing Statistical Significance: No variance and confidence intervals from multiple runs
- Unreported Computational Cost: Training time, GPU hours not disclosed
- Fixed Curriculum: 5 difficulty levels and order are hyperparameters, lacking adaptive mechanisms
- Fragile Majority Voting: Still depends on base model generating some correct solutions
- Conservative Reward Masking: May miss valuable learning opportunities from difficult samples
- Missing Failure Cases: No demonstration of CuMa failures
- Incomplete Human Learning Comparison: Curriculum learning analogy not deeply explored
- Unknown Long-term Effects: Only 1 episode training; stability with continued training unverified
- Single Task Type: Primarily mathematical reasoning; other types insufficiently verified
- Language Limitation: English data only; multilingual scenarios unconsidered
- Domain Knowledge: Applicability to specialized domains (medicine, law) unknown
- Fills Research Gap: First systematic study of label-free RL behavior in weak models
- Methodological Insights: Demonstrates curriculum learning effectiveness in RL reasoning tasks
- Practical Guidance: Provides viable pathway for small model reasoning improvement
- Theoretical Foundation: Establishes basis for subsequent research on reasoning capability bootstrapping
- Edge Deployment: Enables small models to improve through RL, reducing deployment costs
- Educational Applications: Progressive learning strategy applicable to personalized education systems
- Research Tools: Open-source code and data generation pipeline available to community
- ✅ Code open-sourced (GitHub)
- ✅ Detailed hyperparameters (learning rate, temperature, generation length, etc.)
- ✅ Data generation prompts disclosed (Appendix B)
- ⚠️ High computational resource requirements (4×H100)
- ⚠️ Synthetic data not directly released
- Resource-Constrained Environments: Need reasoning improvement on small models
- Unlabeled Data: Abundant reasoning problems but lacking standard answers
- Progressive Learning: Tasks with clear difficulty hierarchy (education, competition training)
- Math/Code Reasoning: Closed-domain tasks with objective correct answers
- Open-Domain Generation: Creative writing, dialogue systems (no clear correct answers)
- Extremely Weak Models: <0.5B or near-random base reasoning capacity
- Real-Time Systems: Cannot afford multiple sampling overhead
- Subjective Tasks: Sentiment analysis, style transfer (majority voting meaningless)
- DeepSeekMath [1]: Open language model for mathematical reasoning
- DeepSeek-R1 [2]: Large-scale reasoning models and RL training
- TTRL [3]: Test-time reinforcement learning framework
- Intuitor [4]: Unsupervised RL based on intrinsic confidence
- RL Foundations [5]: Sutton & Barto classic textbook
- RLHF [6]: Classic learning from human feedback method
- PPO [7]: Proximal Policy Optimization algorithm
- Chain-of-Thought [8]: Chain-of-thought prompting technique
- Self-rewarding LMs [14-16]: Self-reward and self-improvement
- DPO [17]: Direct Preference Optimization
This paper pairs an in-depth empirical study of why label-free reinforcement learning fails in weak reasoning models with a method that addresses those failures. Its core value lies in revealing the prerequisites for bootstrapping reasoning capability: base models must possess a minimum level of reasoning capacity to benefit from unsupervised RL. CuMa achieves stable improvement even in the weak 0.5B model by combining curriculum learning, reward masking, and data generation.
Highlights: Precise problem identification, simple effective method, comprehensive experiments, high practical value.
Shortcomings: Limited theoretical analysis, restricted generalization verification, missing statistical significance.
Recommendation Score: ⭐⭐⭐⭐ (4/5)
Recommended for researchers interested in small model reasoning, unsupervised learning, and curriculum learning. It is also a valuable reference for industry practitioners deploying reasoning models in resource-constrained scenarios.