Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
Xiong, Ye, Liao et al.
Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
Reinforcement learning applied to large language models (LLMs) for reasoning tasks often suffers from unstable gradient estimates caused by fixed, uniform response sampling across prompts. This paper proposes Reinforce-Ada, an adaptive sampling framework for online reinforcement learning post-training of LLMs, which continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process and automatically stops sampling for a prompt once sufficient signal has been collected. To stabilize updates, the method forms fixed-size groups with enforced reward diversity and computes advantage baselines using global statistics aggregated over the adaptive sampling phase.
Unstable Gradient Estimation: Conventional reinforcement learning methods draw a fixed, small number (n) of responses per prompt when training LLMs, which yields high-variance gradient estimates and unstable training.
Signal Collapse Problem: When all n responses for a prompt receive the same reward (all correct or all incorrect), advantage calculation in GRPO produces zero gradients, causing training signal loss.
Low Sampling Efficiency: Uniform sampling strategies cannot dynamically allocate computational resources based on prompt difficulty and learning value.
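The signal collapse problem listed above can be illustrated with a minimal sketch. It assumes a GRPO-style advantage that centers each reward by the group mean and scales by the group standard deviation; the function name `group_advantages`, the use of the population standard deviation, and the `eps` constant are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (assumed GRPO-style form):
    center by the group mean, scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group carries signal: correct responses get positive advantage,
# incorrect ones get negative advantage.
mixed = group_advantages([1, 0, 0, 1])

# Uniform groups collapse: every advantage is exactly zero, so the prompt
# contributes nothing to the policy gradient.
all_correct = group_advantages([1, 1, 1, 1])
all_wrong = group_advantages([0, 0, 0, 0])

print(mixed)
print(all_correct, all_wrong)
```

With n fixed and small, very easy and very hard prompts fall into the uniform case most of the time, which is the situation adaptive sampling is meant to avoid.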
Proposes Reinforce-Ada Adaptive Sampling Framework: Unifies estimation and sampling into an online successive elimination process, dynamically allocating inference budget
Designs Two Exit Conditions:
Reinforce-Ada-pos: Focuses on positive sample collection
Reinforce-Ada-balance: Balances positive and negative samples, maintaining exploration
Introduces Global Statistics Normalization: Uses statistical information from the entire sampling process to compute advantages, improving estimation stability
Enables Plug-and-Play Replacement: Can directly replace the generation step in existing RL pipelines without architectural modifications
Validates Effectiveness Across Multiple Models and Benchmarks: Consistently improves convergence speed and final performance on mathematical reasoning tasks
Prompts x are drawn from a distribution d₀, the policy πθ generates responses a ~ πθ(·|x), and a verifier provides binary rewards r⋆(x,a) ∈ {0,1}. The objective is to maximize the expected reward: J(θ) = E_{x ~ d₀, a ~ πθ(·|x)} [ r⋆(x,a) ].
Algorithm Flow:
1. Initialization: Mark all prompts as active
2. Multi-round Sampling:
- Sample M responses for each active prompt
- Evaluate exit conditions
- Mark prompts satisfying conditions as inactive
3. Repeat until all prompts exit or maximum rounds N reached
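The algorithm flow above can be sketched as a short loop. This is a toy illustration, not the authors' implementation: `adaptive_sample`, `sample_fn`, `exit_pos`, `exit_balance`, and the thresholds `k` are hypothetical names and values, and the two exit rules are only one plausible reading of the "pos" and "balance" variants.

```python
import random

def adaptive_sample(prompts, sample_fn, should_exit, m=4, max_rounds=8):
    """Successive-elimination sampling: only active prompts keep receiving samples."""
    pool = {p: [] for p in prompts}  # every reward collected so far, per prompt
    active = set(prompts)            # step 1: all prompts start active
    for _ in range(max_rounds):      # step 3: at most N = max_rounds rounds
        if not active:
            break
        for p in [q for q in prompts if q in active]:
            pool[p].extend(sample_fn(p) for _ in range(m))  # sample M responses
            if should_exit(pool[p]):                        # evaluate exit condition
                active.discard(p)                           # mark prompt inactive
    return pool, active

def exit_pos(rewards, k=1):
    """Assumed 'pos' rule: stop once k correct responses have been collected."""
    return sum(rewards) >= k

def exit_balance(rewards, k=2):
    """Assumed 'balance' rule: stop once k correct AND k incorrect are collected."""
    n_pos = sum(rewards)
    return n_pos >= k and len(rewards) - n_pos >= k

# Toy verifier: each prompt has a hidden pass rate (hypothetical values).
rng = random.Random(0)
pass_rate = {"easy": 0.9, "medium": 0.5, "hard": 0.02}
sample_fn = lambda p: int(rng.random() < pass_rate[p])

pool, still_active = adaptive_sample(list(pass_rate), sample_fn, exit_balance)
for p, rewards in pool.items():
    print(p, len(rewards), "samples,", sum(rewards), "correct")
```

In this sketch, prompts whose pass rate is near 0.5 tend to exit after one or two rounds, while very easy or very hard prompts keep drawing samples until the missing reward value appears or the round limit is hit.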
Online Unified Process: Merges the estimation and decision-making of traditional two-stage methods into a single online process
Successive Elimination Mechanism: Borrows ideas from multi-armed bandit theory, dynamically stopping prompts that require no further sampling
Global Normalization Strategy: Uses statistical information from the complete sampling pool rather than final selected subsets, improving estimation robustness
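Balanced Sampling Guarantee: Builds each prompt's training group to contain both correct and incorrect responses whenever both were sampled, so the group keeps non-zero reward variance and its gradient does not vanish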
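The global normalization and balanced group construction described above can be sketched together. The code is a hedged illustration under stated assumptions: `build_group` and `global_advantages` are hypothetical helpers, the half-and-half split target is an assumption, and scaling by the pool standard deviation is assumed rather than confirmed by the summary.

```python
import random
from statistics import mean, pstdev

def build_group(pool_rewards, n=4, rng=random):
    """Down-sample a prompt's pool to n response indices, mixing correct and
    incorrect responses as evenly as the pool allows (assumed target: n // 2 each)."""
    pos = [i for i, r in enumerate(pool_rewards) if r == 1]
    neg = [i for i, r in enumerate(pool_rewards) if r == 0]
    n_pos = min(len(pos), max(n // 2, n - len(neg)))
    n_neg = min(len(neg), n - n_pos)
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)

def global_advantages(pool_rewards, chosen, eps=1e-6):
    """Advantages for the chosen responses, normalized with statistics of the
    WHOLE pool rather than of the selected subset."""
    mu = mean(pool_rewards)       # global baseline: empirical pass rate
    sigma = pstdev(pool_rewards)  # assumed scaling; not stated in the summary
    return [(pool_rewards[i] - mu) / (sigma + eps) for i in chosen]

# Hard prompt: 2 correct responses out of 16 sampled.
pool = [0] * 14 + [1] * 2
chosen = build_group(pool, n=4, rng=random.Random(0))
adv = global_advantages(pool, chosen)

print([pool[i] for i in chosen])
print(adv)
```

The selected group has a mean reward of 0.5, but the baseline used here is the pool mean of 0.125, so the rare correct responses receive a larger advantage magnitude than the common incorrect ones. Using the subset mean instead would hide how hard the prompt actually is.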
Shao et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.
Yao et al. (2025). Optimizing chain-of-thought reasoners via gradient variance minimization in rejection sampling and RL.
Yu et al. (2025). DAPO: An open-source LLM reinforcement learning system at scale.
Slivkins (2019). Introduction to multi-armed bandits.
Dong et al. (2023). RAFT: Reward ranked finetuning for generative foundation model alignment.
Summary: Reinforce-Ada is an adaptive sampling framework that addresses the signal collapse problem in reinforcement learning for LLMs by continuing to sample a prompt until its responses carry usable reward signal. At the cost of additional sampling computation, it accelerates convergence and improves final performance relative to GRPO on mathematical reasoning benchmarks, with the balanced sampling variant performing best.