Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training

Xiong, Ye, Liao et al.
Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.

Basic Information

  • Paper ID: 2510.04996
  • Title: Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
  • Authors: Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
  • Categories: cs.LG cs.AI cs.CL stat.ML
  • Publication Date: October 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2510.04996
  • Code Link: https://github.com/RLHFlow/Reinforce-Ada

Abstract

Reinforcement learning applied to large language models (LLMs) for reasoning tasks often suffers from unstable gradient estimates because responses are sampled with a fixed, uniform budget across prompts. This paper proposes Reinforce-Ada, an adaptive sampling framework for online reinforcement learning post-training of LLMs, which continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike traditional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process and automatically stops sampling a prompt once sufficient signal has been collected. To stabilize updates, the method forms fixed-size groups with enforced reward diversity and computes advantage baselines using global statistics aggregated over the adaptive sampling phase.

Research Background and Motivation

Core Problems

  1. Unstable Gradient Estimation: Standard RL pipelines for LLMs sample a small, fixed number n of responses per prompt, which yields high-variance gradient estimates and unstable training.
  2. Signal Collapse Problem: When all n responses for a prompt receive the same reward (all correct or all incorrect), advantage calculation in GRPO produces zero gradients, causing training signal loss.
  3. Low Sampling Efficiency: Uniform sampling strategies cannot dynamically allocate computational resources based on prompt difficulty and learning value.
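
The signal collapse in item 2 can be illustrated with a minimal sketch of GRPO-style group normalization. The function name `grpo_advantages`, the use of the population standard deviation, and the small `eps` stabilizer are assumptions made for this illustration, not details taken from the paper.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - group mean) / (group std + eps).

    Illustrative only: eps and the population std are assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed rewards carries a learning signal.
mixed = grpo_advantages([1, 0, 0, 1])

# Groups with identical rewards give zero advantage for every response,
# so the policy-gradient contribution of the prompt vanishes.
all_correct = grpo_advantages([1, 1, 1, 1])
all_wrong = grpo_advantages([0, 0, 0, 0])
```

In this sketch every entry of `all_correct` and `all_wrong` is zero, while `mixed` contains both positive and negative advantages.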

Problem Significance

  • In mathematical reasoning tasks, over 50% of prompts fall into "zero gradient" states
  • Simply increasing the number of samples per prompt alleviates the problem but is computationally prohibitive (e.g., n=512)
  • Existing passive filtering methods discard large quantities of already-generated responses, causing resource waste
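
To see why a small fixed n leaves many prompts without signal, consider a simplified model that is not taken from the paper: each response to a prompt is correct independently with probability p. Under that assumption, the chance that all n rewards coincide is p^n + (1-p)^n. The helper name `p_uniform_group` is hypothetical.

```python
def p_uniform_group(p, n):
    """Probability that n i.i.d. Bernoulli(p) rewards are all 1 or all 0.

    Assumes independent responses with a fixed per-prompt pass rate p.
    """
    return p ** n + (1 - p) ** n

# Compare a small group with a large one for an easy prompt (p = 0.9)
# and a borderline prompt (p = 0.5).
for p in (0.5, 0.9):
    for n in (4, 32):
        print(f"p={p}, n={n}: {p_uniform_group(p, n):.4f}")
```

For p = 0.5 and n = 4 the value is 0.125, whereas for p = 0.9 and n = 4 it is about 0.656: easy (or, symmetrically, very hard) prompts are the ones most likely to produce a zero-gradient group, and a larger n reduces that probability only at a proportionally higher generation cost.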

Limitations of Existing Methods

  1. Fixed Sampling in GRPO: Cannot adapt to difficulty variations across different prompts
  2. Passive Filtering Methods: Inefficient, generating large numbers of useless responses before discarding them
  3. Two-Stage Budget Allocation: Methods like GVM-RAFT separate estimation from sampling, which is inefficient and difficult to implement in an online setting

Core Contributions

  1. Proposes Reinforce-Ada Adaptive Sampling Framework: Unifies estimation and sampling into an online successive elimination process, dynamically allocating inference budget
  2. Designs Two Exit Conditions:
    • Reinforce-Ada-pos: Focuses on positive sample collection
    • Reinforce-Ada-balance: Balances positive and negative samples, maintaining exploration
  3. Introduces Global Statistics Normalization: Uses statistical information from the entire sampling process to compute advantages, improving estimation stability
  4. Enables Plug-and-Play Replacement: Can directly replace the generation step in existing RL pipelines without architectural modifications
  5. Validates Effectiveness Across Multiple Models and Benchmarks: Consistently improves convergence speed and final performance on mathematical reasoning tasks

Method Details

Task Definition

Given a prompt distribution d₀, policy πθ generates responses a ~ πθ(·|x), and a verifier provides rewards r⋆(x,a) ∈ {0,1}. The objective is to maximize expected reward:

J(θ) = E_{x∼d₀, a∼πθ(·|x)}[r⋆(x,a)]

Core Algorithm Architecture

1. Adaptive Sampling Process

Algorithm Flow:
1. Initialization: Mark all prompts as active
2. Multi-round Sampling:
   - Sample M responses for each active prompt
   - Evaluate exit conditions
   - Mark prompts satisfying conditions as inactive
3. Repeat until all prompts exit or maximum rounds N reached
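
The algorithm flow above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `adaptive_sample`, `sample_fn`, `should_exit`, and `toy_sample` are hypothetical names, rewards are binary, and the toy sampler stands in for LLM generation plus verification. The values m=4 and max_rounds=8 are chosen only for the demo.

```python
import random

def adaptive_sample(prompts, sample_fn, should_exit, m, max_rounds):
    """Successive-elimination style sampling over a batch of prompts.

    sample_fn(prompt, m) -> list of m binary rewards (hypothetical stand-in
    for generation plus verification).
    should_exit(rewards) -> True once a prompt's pool has enough signal.
    Returns the per-prompt reward pools and the prompts still active.
    """
    pools = {x: [] for x in prompts}
    active = set(prompts)                 # step 1: all prompts start active
    for _ in range(max_rounds):           # step 3: at most max_rounds rounds
        if not active:
            break
        for x in list(active):
            pools[x].extend(sample_fn(x, m))   # step 2: m new responses
            if should_exit(pools[x]):
                active.discard(x)              # prompt becomes inactive
    return pools, active

# Toy demo: one prompt that is always solved, one that never is.
rng = random.Random(0)
pass_rate = {"easy": 1.0, "hard": 0.0}

def toy_sample(x, m):
    return [1 if rng.random() < pass_rate[x] else 0 for _ in range(m)]

pools, still_active = adaptive_sample(
    list(pass_rate), toy_sample, lambda rs: any(rs), m=4, max_rounds=8
)
```

With an exit rule of "at least one correct response", the easy prompt stops after a single round of 4 samples, while the unsolved prompt consumes the full 8 × 4 = 32 samples and remains active when the loop ends.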

2. Exit Condition Design

  • Reinforce-Ada-pos: Exit upon collecting at least one correct response
  • Reinforce-Ada-balance: Exit only after collecting at least n/2 correct and n/2 incorrect responses
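
The two exit rules can be written as simple predicates over a prompt's accumulated rewards. This is a sketch assuming binary rewards and an even group size n; the function names `exit_pos` and `exit_balance` are hypothetical.

```python
def exit_pos(rewards):
    """Reinforce-Ada-pos style rule: stop once at least one response is correct."""
    return sum(rewards) >= 1

def exit_balance(rewards, n):
    """Reinforce-Ada-balance style rule: stop once the pool holds at least
    n/2 correct and n/2 incorrect responses (n assumed even)."""
    need = n // 2
    n_pos = sum(rewards)
    n_neg = len(rewards) - n_pos
    return n_pos >= need and n_neg >= need
```

For example, with n = 4 a pool of `[1, 1, 1, 1, 1]` satisfies `exit_pos` but not `exit_balance`, so the balanced variant keeps sampling an easy prompt until it also finds failures (or the round limit is reached).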

3. Training Batch Construction

  • Downsample responses from each prompt's pool to fixed size n
  • Prioritize maintaining positive-negative sample balance (n/2 each)
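  • Compute advantages using global statistics: A(x,aᵢ) = rᵢ - r̄, where r̄ is the mean reward over all responses sampled for prompt x during the adaptive phase, not only the n responses kept for training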
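
A minimal sketch of this construction step follows, assuming binary rewards and a non-empty pool. The name `build_group` and the tie-breaking details (uniform random choice within each class, topping up from the other class when one side is short) are assumptions for illustration; the baseline follows the formula A(x,aᵢ) = rᵢ - r̄ with r̄ taken over the full pool.

```python
import random

def build_group(pool, n, rng):
    """Downsample a prompt's reward pool to at most n responses.

    Returns (chosen indices, advantages). Prefers n/2 correct and n/2
    incorrect responses; if one class is short, the other fills the gap.
    """
    pos = [i for i, r in enumerate(pool) if r == 1]
    neg = [i for i, r in enumerate(pool) if r == 0]
    k_pos = min(n // 2, len(pos))
    k_neg = min(n - k_pos, len(neg))
    k_pos = min(n - k_neg, len(pos))      # top up positives if negatives ran out
    chosen = rng.sample(pos, k_pos) + rng.sample(neg, k_neg)
    baseline = sum(pool) / len(pool)      # global mean over the whole pool
    advantages = [pool[i] - baseline for i in chosen]
    return chosen, advantages

# Example: 8 sampled responses, only the first one correct, group size 4.
rng = random.Random(0)
pool = [1, 0, 0, 0, 0, 0, 0, 0]
chosen, advantages = build_group(pool, n=4, rng=rng)
```

Here the baseline is 1/8 = 0.125, so the single correct response receives advantage 0.875 and each of the three sampled failures receives -0.125. A baseline computed from the selected subset alone would instead be 0.25, which is why the global pool gives a different (and less selection-biased) estimate.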

4. Objective Function

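The loss applies an importance-sampling correction with PPO-style ratio clipping:

L(θ) = (1/|B|) ∑_{(x,aᵢ)∈B} ∑_{t=1}^{|aᵢ|} min(ρ_{i,t}·A(x,aᵢ), clip(ρ_{i,t}, 1-ε_low, 1+ε_high)·A(x,aᵢ))

Here ρ_{i,t} is the token-level importance ratio between the current policy and the policy that generated response aᵢ, and ε_low and ε_high are the lower and upper clipping thresholds.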
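
The clipped surrogate can be sketched in plain Python on precomputed token ratios. This is an illustration only: the names `clipped_surrogate`, `batch_objective`, `eps_low`, and `eps_high` are assumptions (the two clipping bounds are named here for readability), and a real training loop would compute the ratios from model log-probabilities and differentiate through them.

```python
def clipped_surrogate(ratios, advantage, eps_low, eps_high):
    """Sum over tokens of min(rho * A, clip(rho, 1 - eps_low, 1 + eps_high) * A)."""
    total = 0.0
    for rho in ratios:
        clipped = min(max(rho, 1.0 - eps_low), 1.0 + eps_high)
        total += min(rho * advantage, clipped * advantage)
    return total

def batch_objective(batch, eps_low, eps_high):
    """Average the per-response surrogate over a batch.

    batch: list of (per-token ratios, response-level advantage) pairs.
    """
    return sum(
        clipped_surrogate(ratios, adv, eps_low, eps_high) for ratios, adv in batch
    ) / len(batch)
```

With eps_low = eps_high = 0.2, a token with ratio 1.5 and advantage +1 contributes 1.2 (the gain is capped), while the same ratio with advantage -1 contributes -1.5 (the penalty is not capped), which is the usual pessimistic behavior of the min in PPO-style objectives.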

Technical Innovations

  1. Online Unified Process: Merges the estimation and decision-making of traditional two-stage methods into a single online process
  2. Successive Elimination Mechanism: Borrows ideas from multi-armed bandit theory, dynamically stopping prompts that require no further sampling
  3. Global Normalization Strategy: Uses statistical information from the complete sampling pool rather than final selected subsets, improving estimation robustness
  4. Balanced Sampling Guarantee: Ensures each training batch maintains non-zero variance, avoiding gradient vanishing

Experimental Setup

Datasets

  • Training Data: Default subset of OpenR1-Math-220k dataset
  • Preprocessing: Deduplication, verification filtering, medium difficulty filtering (at least 1 correct in 16 samples)

Models

  • Qwen2.5-Math-7B/1.5B
  • Qwen3-4B-it
  • Llama-3.2-3B-it

Evaluation Metrics

  • Training Metrics: Reward curves, entropy changes
  • Test Benchmarks: MATH500, Minerva Math, OlympiadBench, AIME-like
  • Evaluation Method: Avg@32, i.e., accuracy averaged over 32 sampled responses per problem (temperature 1.0, max 4096 tokens)

Implementation Details

  • Batch size: 512 prompts
  • Effective group size: n=4
  • Maximum samples: 32 responses/prompt
  • Learning rate: 1×10⁻⁶ (AdamW)
  • Entropy regularization: 1×10⁻⁴
  • Training steps: 600
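
For orientation, the settings above can be collected into a small configuration object, from which the per-step response counts follow by simple arithmetic. The class and field names are hypothetical and do not correspond to the released code; only the numeric values come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    # Field names are illustrative; values are those listed above.
    prompts_per_batch: int = 512
    group_size: int = 4                 # effective group size n
    max_samples_per_prompt: int = 32
    learning_rate: float = 1e-6         # AdamW
    entropy_coef: float = 1e-4
    train_steps: int = 600

cfg = RunConfig()

# Responses kept for the gradient update in each step.
kept_per_step = cfg.prompts_per_batch * cfg.group_size

# Upper bound on responses generated per step, reached only if
# every prompt uses its full sampling budget.
max_generated_per_step = cfg.prompts_per_batch * cfg.max_samples_per_prompt
```

This gives 2,048 training responses per step and at most 16,384 generated responses per step; the actual generation count depends on how quickly prompts meet their exit condition.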

Experimental Results

Main Results

Training Efficiency Improvements

  • Convergence Speed: Reinforce-Ada shows clear advantages within 50-150 steps
  • Final Performance: Achieves higher reward ceilings across all tested models
  • Stability: Reinforce-Ada-balance shows the most stable training behavior

Test Benchmark Performance

Model               Method                  Math500   Minerva   Olympiad   AIME-like   Weighted Avg
Qwen2.5-Math-1.5B   GRPO                    74.2      34.4      38.4       16.2        45.3
Qwen2.5-Math-1.5B   Reinforce-Ada-balance   77.4      36.5      40.5       17.5        47.6 (+2.3)
Qwen2.5-Math-7B     GRPO                    82.2      44.7      45.6       23.2        53.3
Qwen2.5-Math-7B     Reinforce-Ada-balance   84.0      45.2      47.1       23.7        54.6 (+1.3)

Ablation Studies

Importance of Balanced Sampling

  • Reinforce-Ada-balance consistently outperforms Reinforce-Ada-pos
  • In later training stages, balanced sampling maintains exploration, avoiding entropy collapse

Computational Overhead Analysis

Model               Method                  Avg Step Time (sec)   Relative Cost
Qwen2.5-Math-1.5B   GRPO                    102                   1.0×
Qwen2.5-Math-1.5B   Reinforce-Ada-balance   290                   2.8×
Qwen2.5-Math-7B     GRPO                    236                   1.0×
Qwen2.5-Math-7B     Reinforce-Ada-balance   375                   1.59×

Prompt Difficulty Impact

  • Reinforce-Ada's advantages are more pronounced on difficult prompt sets
  • Benefits are relatively smaller on simple prompt sets, as most prompts satisfy exit conditions within the first two rounds

Sampling Dynamics Analysis

  1. Early Training: Main bottleneck is lack of positive samples; both Reinforce-Ada-pos and balance are effective
  2. Late Training: Bottleneck shifts to lack of negative samples; balance version shows clear advantages
  3. Adaptive Allocation: Difficult prompts receive more sampling budget; simple prompts exit early

Related Work

Data Filtering and Selection

  • Passive Filtering Methods: Yu et al. (2025), Xiong et al. (2025) directly discard uniformly rewarded groups
  • Budget Allocation Methods: GVM-RAFT (Yao et al., 2025) adopts two-stage exploration-exploitation paradigm
  • Curriculum Learning: Shi et al. (2025), Zhang et al. (2025) focus on prompt-level selection

GRPO Variant Design

  • Advantage Estimation Improvements: Hu (2025), Zhu et al. (2025) modify core update rules
  • Signal Loss Solutions: Nan et al. (2025) adds constants to avoid zero variance; Le et al. (2025) uses entropy information

Multi-Armed Bandit Theory

  • Borrows ideas from successive elimination algorithms (Slivkins et al., 2019) for online decision-making
  • Treats prompts as arms, dynamically allocating sampling budget

Conclusions and Discussion

Main Conclusions

  1. Adaptive Sampling is Effective: Significantly improves training efficiency and final performance compared to fixed sampling strategies
  2. Balanced Sampling is Critical: Maintaining positive-negative sample balance is essential for sustaining exploration and avoiding overfitting
  3. Plug-and-Play Practicality: Can be directly integrated into existing RL training frameworks

Limitations

  1. Computational Overhead: Average step time is roughly 1.6× to 2.8× that of GRPO in the reported runs
  2. Domain Limitations: Experiments primarily focus on mathematical reasoning
  3. Prompt Difficulty Dependence: Limited benefits on datasets dominated by simple prompts
  4. Hyperparameter Sensitivity: Requires careful tuning of maximum rounds N and samples per round M

Future Directions

  1. End-to-End Data Management: Combine with curriculum learning and other macro-level strategies
  2. Multi-Domain Validation: Extend to code generation, dialogue, and other tasks
  3. Theoretical Analysis: Provide theoretical guarantees for convergence and sample complexity
  4. Efficiency Optimization: Investigate more efficient exit conditions and sampling strategies

In-Depth Evaluation

Strengths

  1. Accurate Problem Identification: Clearly identifies the root cause of signal collapse in GRPO
  2. Clever Method Design: Innovatively applies multi-armed bandit ideas to LLM training
  3. Comprehensive Experiments: Full validation across multiple models and benchmarks
  4. Engineering-Friendly: Provides plug-and-play implementation for practical application
  5. In-Depth Analysis: Detailed dynamic analysis and ablation studies

Weaknesses

  1. Weak Theoretical Foundation: Lacks theoretical analysis such as convergence guarantees
  2. Cost-Benefit Trade-off: Whether increased computational overhead is worthwhile requires further analysis
  3. Limited Applicability: Primarily validated on mathematical reasoning; generalization remains uncertain
  4. Complex Parameter Tuning: Introduces additional hyperparameters requiring adjustment

Impact

  1. Academic Value: Provides new perspective on data sampling for LLM reinforcement learning
  2. Practical Value: Can be directly applied to existing training pipelines
  3. Inspirational Significance: Promotes application of adaptive data management in RL

Applicable Scenarios

  1. High-Quality Requirements: Applications with high performance demands
  2. Sufficient Computational Resources: Scenarios capable of bearing additional computational costs
  3. Reasoning Tasks: Particularly suitable for mathematical reasoning, code generation, and other multi-step reasoning tasks
  4. Online Training: Scenarios requiring dynamic training strategy adjustment

References

  1. Shao et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.
  2. Yao et al. (2025). Optimizing chain-of-thought reasoners via gradient variance minimization in rejection sampling and RL.
  3. Yu et al. (2025). DAPO: An open-source LLM reinforcement learning system at scale.
  4. Slivkins et al. (2019). Introduction to multi-armed bandits.
  5. Dong et al. (2023). RAFT: Reward ranked finetuning for generative foundation model alignment.

Summary: Reinforce-Ada proposes an innovative adaptive sampling framework that effectively addresses the signal collapse problem in LLM reinforcement learning. While introducing additional computational costs, it demonstrates significant improvements in training efficiency and final performance, providing valuable new insights for LLM reinforcement learning training.