Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training

Xiong, Ye, Liao et al.

Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.

Basic Information

Paper ID: 2510.04996
Title: Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
Authors: Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
Categories: cs.LG cs.AI cs.CL stat.ML
Publication Date: October 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2510.04996
Code Link: https://github.com/RLHFlow/Reinforce-Ada

Abstract

Reinforcement learning applied to large language models (LLMs) for reasoning tasks often suffers from unstable gradient estimates because responses are sampled with a fixed, uniform budget across prompts. This paper proposes Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process and stops sampling a prompt automatically once sufficient signal has been collected. To stabilize updates, the method forms fixed-size groups with enforced reward diversity and computes advantage baselines from global statistics aggregated over the adaptive sampling phase.

Research Background and Motivation

Core Problems

Unstable Gradient Estimation: Standard RL training of LLMs samples a small, fixed number n of responses per prompt, which yields high-variance gradient estimates and unstable training.
Signal Collapse Problem: When all n responses for a prompt receive the same reward (all correct or all incorrect), advantage calculation in GRPO produces zero gradients, causing training signal loss.
Low Sampling Efficiency: Uniform sampling strategies cannot dynamically allocate computational resources based on prompt difficulty and learning value.
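
To make the signal collapse concrete, the sketch below computes group-normalized advantages in the commonly used GRPO form, (rᵢ - mean) / (std + eps). The function name and the eps value are illustrative assumptions and are not taken from the paper's code.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

mixed = grpo_advantages([1, 0, 0, 1])      # mixed rewards: non-zero advantages
all_wrong = grpo_advantages([0, 0, 0, 0])  # uniform rewards: every advantage is 0
all_right = grpo_advantages([1, 1, 1, 1])  # uniform rewards: every advantage is 0
```

When every advantage in a group is zero, that prompt contributes nothing to the policy gradient, however many tokens were generated for it.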

Problem Significance

In mathematical reasoning tasks, over 50% of prompts fall into "zero gradient" states
Simply increasing the number of samples, while alleviating the problem, incurs excessive computational costs (e.g., cost explosion when n=512)
Existing passive filtering methods discard large quantities of already-generated responses, causing resource waste
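
As a rough illustration of why a small n leaves many prompts without signal, assume each response to a prompt is an independent Bernoulli draw with pass rate p. This is a simplifying assumption made here, not a claim from the paper. Under it, the chance that all n responses share the same reward is pⁿ + (1-p)ⁿ:

```python
def p_uniform_group(p, n):
    """Probability that n i.i.d. Bernoulli(p) rewards are all 1 or all 0."""
    return p ** n + (1 - p) ** n

# Hypothetical pass rates, chosen only to show the trend.
for p in (0.5, 0.9, 0.99):
    for n in (4, 32):
        print(f"p={p:<5} n={n:<3} P(zero gradient)={p_uniform_group(p, n):.4f}")
```

Under this model a prompt with a 90% pass rate yields a uniform group about two times in three at n=4, and a much larger n is needed before a mixed group becomes likely. That is the cost trade-off described above.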

Limitations of Existing Methods

Fixed Sampling in GRPO: Cannot adapt to difficulty variations across different prompts
Passive Filtering Methods: Inefficient, generating large numbers of useless responses before discarding them
Two-Stage Budget Allocation: Methods like GVM-RAFT separate estimation from sampling, which is inefficient and difficult to implement online

Core Contributions

Proposes Reinforce-Ada Adaptive Sampling Framework: Unifies estimation and sampling into an online successive elimination process, dynamically allocating inference budget
Designs Two Exit Conditions:
- Reinforce-Ada-pos: Focuses on positive sample collection
- Reinforce-Ada-balance: Balances positive and negative samples, maintaining exploration
Introduces Global Statistics Normalization: Uses statistical information from the entire sampling process to compute advantages, improving estimation stability
Enables Plug-and-Play Replacement: Can directly replace the generation step in existing RL pipelines without architectural modifications
Validates Effectiveness Across Multiple Models and Benchmarks: Consistently improves convergence speed and final performance on mathematical reasoning tasks

Method Details

Task Definition

Given a prompt distribution d₀, policy πθ generates responses a ~ πθ(·|x), and a verifier provides rewards r⋆(x,a) ∈ {0,1}. The objective is to maximize expected reward:

J(θ) = E_{x∼d₀,a∼πθ(·|x)}r⋆(x,a)

Core Algorithm Architecture

1. Adaptive Sampling Process

Algorithm Flow:
1. Initialization: Mark all prompts as active
2. Multi-round Sampling:
   - Sample M responses for each active prompt
   - Evaluate exit conditions
   - Mark prompts satisfying conditions as inactive
3. Repeat until all prompts exit or maximum rounds N reached
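
The flow above can be sketched as a short loop. This is a minimal simulation under stated assumptions, not the authors' implementation: rewards are drawn from a hypothetical per-prompt pass rate instead of from an LLM and verifier, and the names `adaptive_sample`, `pass_rates`, and `should_exit` are invented for illustration.

```python
import random

def adaptive_sample(pass_rates, M, N, should_exit, seed=0):
    """Sample M responses per active prompt per round, for at most N rounds.

    pass_rates:  dict prompt -> probability of a correct response (simulated)
    should_exit: callable taking a prompt's reward pool, True when it may stop
    """
    rng = random.Random(seed)
    pools = {x: [] for x in pass_rates}
    active = set(pass_rates)               # 1. all prompts start active
    for _ in range(N):                     # 3. at most N rounds
        if not active:
            break
        for x in sorted(active):           # 2. sample only active prompts
            pools[x].extend(
                1 if rng.random() < pass_rates[x] else 0 for _ in range(M)
            )
        # Prompts whose exit condition holds become inactive.
        active = {x for x in active if not should_exit(pools[x])}
    return pools

# Exit as soon as one correct response is seen (M=4, N=8 are assumed values).
pools = adaptive_sample(
    {"always": 1.0, "never": 0.0, "medium": 0.5},
    M=4, N=8, should_exit=lambda pool: sum(pool) >= 1,
)
print({x: len(pool) for x, pool in pools.items()})
```

A prompt that is always solved stops after one round, while one that is never solved consumes the full M×N budget, so sampling effort concentrates on prompts that have not yet produced the required signal.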

2. Exit Condition Design

Reinforce-Ada-pos: Exit upon collecting at least one correct response
Reinforce-Ada-balance: Exit only after collecting at least n/2 correct and n/2 incorrect responses
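
The two exit conditions can be written as simple predicates over a prompt's collected 0/1 rewards. The function names are illustrative, and n is the target group size.

```python
def exit_pos(rewards):
    """Reinforce-Ada-pos: stop once at least one correct response exists."""
    return sum(rewards) >= 1

def exit_balance(rewards, n):
    """Reinforce-Ada-balance: stop once there are at least n/2 correct
    and at least n/2 incorrect responses."""
    correct = sum(rewards)
    incorrect = len(rewards) - correct
    return correct >= n / 2 and incorrect >= n / 2

print(exit_pos([0, 0, 1]))               # True
print(exit_balance([1, 1, 1, 0], n=4))   # False: only one incorrect response
print(exit_balance([1, 1, 0, 0], n=4))   # True
```

The balance condition is strictly harder to satisfy, so prompts stay active longer and the pool ends up with both positive and negative examples.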

3. Training Batch Construction

Downsample responses from each prompt's pool to fixed size n
Prioritize maintaining positive-negative sample balance (n/2 each)
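Compute advantages using global statistics: A(x,aᵢ) = rᵢ - r̄, where r̄ is the mean reward over all responses collected for prompt x during adaptive sampling, not only the n retained ones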
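
A minimal sketch of this construction step, assuming binary rewards and a simple "take up to n/2 from each side, then fill from the remainder" rule. The fill rule and the name `build_group` are assumptions for illustration; the paper's code may differ.

```python
import random

def build_group(pool, n, seed=0):
    """Downsample a prompt's reward pool to n responses, preferring an
    n/2 : n/2 split, and attach advantages r_i - (mean of the whole pool)."""
    rng = random.Random(seed)
    pos = [i for i, r in enumerate(pool) if r == 1]
    neg = [i for i, r in enumerate(pool) if r == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    half = n // 2
    # Take up to n/2 from each side, then fill any shortfall with leftovers.
    leftovers = pos[half:] + neg[half:]
    chosen = (pos[:half] + neg[:half] + leftovers)[:n]
    baseline = sum(pool) / len(pool)   # global statistic over the full pool
    return [(i, pool[i], pool[i] - baseline) for i in chosen]

# 8 responses collected, 2 correct: the pool mean is 0.25.
group = build_group([1, 0, 0, 0, 0, 0, 0, 1], n=4)
for idx, reward, adv in group:
    print(idx, reward, adv)
```

In this example the retained group is balanced (mean 0.5), but the baseline stays at the pool mean of 0.25, so correct responses get advantage 0.75 and incorrect ones -0.25. The baseline therefore reflects the prompt's observed difficulty, not the artificially balanced subset.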

4. Objective Function

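Employs importance sampling correction and PPO-style ratio clipping:

L(θ) = (1/|B|) ∑_{(x,aᵢ)∈B} ∑_{t=1}^{|aᵢ|} min(ρ_{i,t}·A(x,aᵢ), clip(ρ_{i,t}, 1−ε_low, 1+ε_high)·A(x,aᵢ))

Here ρ_{i,t} is the importance sampling ratio for token t of response aᵢ, and ε_low and ε_high are the lower and upper clipping thresholds.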
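
For a single response, the inner sum of this objective can be sketched as follows. The argument names, the use of per-token log-probabilities, and the default thresholds of 0.2 are assumptions for illustration; the two clip parameters are named `eps_low` and `eps_high` here.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.2):
    """Sum over tokens of min(rho * A, clip(rho, 1 - eps_low, 1 + eps_high) * A),
    where rho = exp(logp_new - logp_old) is the per-token importance ratio."""
    total = 0.0
    for lp_new, lp_old in zip(logp_new, logp_old):
        rho = math.exp(lp_new - lp_old)
        clipped = min(max(rho, 1.0 - eps_low), 1.0 + eps_high)
        total += min(rho * advantage, clipped * advantage)
    return total

# On-policy tokens (ratio 1) contribute exactly the advantage each.
on_policy = clipped_surrogate([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], advantage=0.75)
# A ratio of 2 with positive advantage is capped at 1 + eps_high.
capped = clipped_surrogate([math.log(2.0)], [0.0], advantage=1.0)
# The same ratio with negative advantage is not capped (the min keeps rho * A).
uncapped = clipped_surrogate([math.log(2.0)], [0.0], advantage=-1.0)
```

The batch objective then averages these per-response sums over all (x, aᵢ) pairs in B.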

Technical Innovations

Online Unified Process: Merges the estimation and decision-making of traditional two-stage methods into a single online process
Successive Elimination Mechanism: Borrows ideas from multi-armed bandit theory, dynamically stopping prompts that require no further sampling
Global Normalization Strategy: Uses statistical information from the complete sampling pool rather than final selected subsets, improving estimation robustness
Balanced Sampling Guarantee: Ensures each prompt's training group has non-zero reward variance, avoiding vanishing gradients

Experimental Setup

Datasets

Training Data: Default subset of OpenR1-Math-220k dataset
Preprocessing: Deduplication, verification filtering, medium difficulty filtering (at least 1 correct in 16 samples)

Models

Qwen2.5-Math-7B/1.5B
Qwen3-4B-it
Llama-3.2-3B-it

Evaluation Metrics

Training Metrics: Reward curves, entropy changes
Test Benchmarks: MATH500, Minerva Math, OlympiadBench, AIME-like
Evaluation Method: Avg@32, the accuracy averaged over 32 sampled responses per problem (temperature 1.0, max 4096 tokens)

Implementation Details

Batch size: 512 prompts
Effective group size: n=4
Maximum samples: 32 responses/prompt
Learning rate: 1×10⁻⁶ (AdamW)
Entropy regularization: 1×10⁻⁴
Training steps: 600

Experimental Results

Main Results

Training Efficiency Improvements

Convergence Speed: Reinforce-Ada shows clear advantages within 50-150 steps
Final Performance: Achieves higher reward ceilings across all tested models
Stability: Reinforce-Ada-balance demonstrates most stable performance

Test Benchmark Performance

Model	Method	Math500	Minerva	Olympiad	AIME-like	Weighted Avg
Qwen2.5-Math-1.5B	GRPO	74.2	34.4	38.4	16.2	45.3
	Reinforce-Ada-balance	77.4	36.5	40.5	17.5	47.6 (+2.3)
Qwen2.5-Math-7B	GRPO	82.2	44.7	45.6	23.2	53.3
	Reinforce-Ada-balance	84.0	45.2	47.1	23.7	54.6 (+1.3)

Ablation Studies

Importance of Balanced Sampling

Reinforce-Ada-balance consistently outperforms Reinforce-Ada-pos
In later training stages, balanced sampling maintains exploration, avoiding entropy collapse

Computational Overhead Analysis

Model	Method	Avg Step Time (sec)	Relative Cost
Qwen2.5-Math-1.5B	GRPO	102	1.0×
	Reinforce-Ada-balance	290	2.8×
Qwen2.5-Math-7B	GRPO	236	1.0×
	Reinforce-Ada-balance	375	1.59×
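
The relative-cost column is the ratio of average step times. A quick check of the table's figures (the dictionary layout is only for illustration):

```python
# (GRPO, Reinforce-Ada-balance) average step time in seconds, from the table.
step_time = {
    "Qwen2.5-Math-1.5B": (102, 290),
    "Qwen2.5-Math-7B": (236, 375),
}

relative = {model: round(ada / grpo, 2) for model, (grpo, ada) in step_time.items()}
print(relative)  # {'Qwen2.5-Math-1.5B': 2.84, 'Qwen2.5-Math-7B': 1.59}
```

The overhead is noticeably smaller for the 7B model than for the 1.5B model.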

Prompt Difficulty Impact

Reinforce-Ada's advantages are more pronounced on difficult prompt sets
Benefits are relatively smaller on simple prompt sets, as most prompts satisfy exit conditions within the first two rounds

Sampling Dynamics Analysis

Early Training: Main bottleneck is lack of positive samples; both Reinforce-Ada-pos and balance are effective
Late Training: Bottleneck shifts to lack of negative samples; balance version shows clear advantages
Adaptive Allocation: Difficult prompts receive more sampling budget; simple prompts exit early

Related Work

Data Filtering and Selection

Passive Filtering Methods: Yu et al. (2025), Xiong et al. (2025) directly discard uniformly rewarded groups
Budget Allocation Methods: GVM-RAFT (Yao et al., 2025) adopts two-stage exploration-exploitation paradigm
Curriculum Learning: Shi et al. (2025), Zhang et al. (2025) focus on prompt-level selection

GRPO Variant Design

Advantage Estimation Improvements: Hu (2025), Zhu et al. (2025) modify core update rules
Signal Loss Solutions: Nan et al. (2025) adds constants to avoid zero variance; Le et al. (2025) uses entropy information

Multi-Armed Bandit Theory

Borrows ideas from successive elimination algorithms (Slivkins, 2019) for online decision-making
Treats prompts as arms, dynamically allocating sampling budget

Conclusions and Discussion

Main Conclusions

Adaptive Sampling is Effective: Significantly improves training efficiency and final performance compared to fixed sampling strategies
Balanced Sampling is Critical: Maintaining positive-negative sample balance is essential for sustaining exploration and avoiding entropy collapse
Plug-and-Play Practicality: Can be directly integrated into existing RL training frameworks

Limitations

Computational Overhead: Roughly 1.6-2.8× the average step time of GRPO
Domain Limitations: Experiments primarily focus on mathematical reasoning
Prompt Difficulty Dependence: Limited benefits on datasets dominated by simple prompts
Hyperparameter Sensitivity: Requires careful tuning of maximum rounds N and samples per round M

Future Directions

End-to-End Data Management: Combine with curriculum learning and other macro-level strategies
Multi-Domain Validation: Extend to code generation, dialogue, and other tasks
Theoretical Analysis: Provide theoretical guarantees for convergence and sample complexity
Efficiency Optimization: Investigate more efficient exit conditions and sampling strategies

In-Depth Evaluation

Strengths

Accurate Problem Identification: Clearly identifies the root cause of signal collapse in GRPO
Clever Method Design: Innovatively applies multi-armed bandit ideas to LLM training
Comprehensive Experiments: Full validation across multiple models and benchmarks
Engineering-Friendly: Provides plug-and-play implementation for practical application
In-Depth Analysis: Detailed dynamic analysis and ablation studies

Weaknesses

Weak Theoretical Foundation: Lacks theoretical analysis such as convergence guarantees
Cost-Benefit Trade-off: Whether increased computational overhead is worthwhile requires further analysis
Limited Applicability: Primarily validated on mathematical reasoning; generalization remains uncertain
Complex Parameter Tuning: Introduces additional hyperparameters requiring adjustment

Impact

Academic Value: Provides new perspective on data sampling for LLM reinforcement learning
Practical Value: Can be directly applied to existing training pipelines
Inspirational Significance: Promotes application of adaptive data management in RL

Applicable Scenarios

High-Quality Requirements: Applications with high performance demands
Sufficient Computational Resources: Scenarios capable of bearing additional computational costs
Reasoning Tasks: Particularly suitable for mathematical reasoning, code generation, and other multi-step reasoning tasks
Online Training: Scenarios requiring dynamic training strategy adjustment

References

Shao et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.
Yao et al. (2025). Optimizing chain-of-thought reasoners via gradient variance minimization in rejection sampling and RL.
Yu et al. (2025). DAPO: An open-source LLM reinforcement learning system at scale.
Slivkins (2019). Introduction to multi-armed bandits.
Dong et al. (2023). RAFT: Reward ranked finetuning for generative foundation model alignment.

Summary: Reinforce-Ada is an adaptive sampling framework that addresses the signal collapse problem in RL post-training of LLMs. At roughly 1.6-2.8× the step time of GRPO, the balanced variant converges faster and gains 1.3 to 2.3 weighted-average points on the two Qwen2.5-Math models reported above, making adaptive, variance-aware sampling a practical addition to existing RL pipelines.