ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering

Lupart, Aliannejadi, Kanoulas
We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static "rewrite, retrieve, and generate" pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1's performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.

Basic Information

  • Paper ID: 2510.13312
  • Title: ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering
  • Authors: Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas (University of Amsterdam)
  • Classification: cs.CL, cs.IR
  • Publication Date: October 15, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.13312

Abstract

This paper proposes ChatR1, a conversational question-answering reasoning framework based on reinforcement learning. In conversational QA, user intent evolves across multiple turns, utterances are often incomplete and require contextual interpretation, and dynamic coordination between query reformulation and retrieval-augmented generation is necessary. Unlike static "rewrite-retrieve-generate" pipelines, ChatR1 alternates between search and reasoning across multiple dialogue turns, enabling exploratory and adaptive behavior through reinforcement learning. To address the challenges of sparse and delayed rewards in RL, the authors propose intent-aware rewards that provide turn-level feedback by aligning retrieval and reasoning with evolving user objectives. ChatR1 demonstrates strong performance on both 3B and 7B models, surpassing competing approaches across five CQA datasets.

Research Background and Motivation

Problem Definition

The core challenges faced by conversational question answering (CQA) include:

  1. Evolving User Intent: User intent changes and evolves across multiple dialogue turns
  2. Utterance Incompleteness: User expressions often depend on context, with issues of coreference resolution and ellipsis
  3. Dynamic Coordination Requirements: Dynamic coordination between retrieval and generation is necessary

Limitations of Existing Methods

  1. Static Pipeline Constraints: Existing methods predominantly employ static "rewrite-retrieve-generate" pipelines, lacking flexibility
  2. Supervised Learning Dependency: Most methods rely on supervised fine-tuning (SFT), making it difficult to adapt to dialogue scenarios unseen during training
  3. Single-Turn Interaction Assumption: Existing RL reasoning frameworks primarily target single-turn interactions, failing to account for multi-turn dialogue complexity

Research Motivation

Commercial systems (such as Perplexity.ai and SearchGPT) increasingly favor multi-turn conversational search, yet academic research lags in this area. Reinforcement learning can enable models to learn dynamic retrieval and reasoning strategies rather than relying on static demonstration data.

Core Contributions

  1. Proposes ChatR1 Framework: The first RL-based CQA reasoning model that end-to-end optimizes multi-turn retrieval and generation, learning dynamic behavior rather than static pipelines
  2. Designs Intent-Aware Rewards: A reward mechanism specifically tailored for CQA that reduces reward sparsity by aligning with evolving user intent
  3. Comprehensive Experimental Validation: Validates performance across five CQA datasets of varying complexity, demonstrating cross-domain generalization capability
  4. In-Depth Analytical Insights: Reveals that ChatR1 generates diverse reasoning paths, effectively utilizes search tools, and exhibits cross-domain robustness

Methodology Details

Task Definition

The setting assumes a dataset D of multi-turn user-system dialogues and a document collection C. At each turn, the system receives the dialogue history H and the current user query q, and must generate an answer y that uses context from H and is grounded in C. User intent is defined as the reformulated query q_rw, which resolves the contextual references and ambiguities in q.

Model Architecture

Interaction Loop

ChatR1 is a policy model π_θ that generates trajectories τ at each turn, comprising:

  • Reasoning Trajectory: Thought process (...)
  • Intermediate Search Queries: Q = {q_k}_{k=1}^{K}, sent to the search engine R
  • Retrieved Documents: Relevant documents returned based on search queries
  • Final Answer: y
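
The interaction loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `<search>`, `<answer>`, and `<documents>` tags, the `max_searches` budget, and the helper names `run_turn` and `make_scripted_policy` are all assumptions introduced here, and the policy and search engine are stand-in callables.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical tag format; the summary does not specify the real markup.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_turn(
    generate: Callable[[str], str],      # stand-in for the policy pi_theta
    search: Callable[[str], List[str]],  # stand-in for the search engine R
    history: List[str],                  # dialogue history H
    query: str,                          # current user query q
    max_searches: int = 3,               # assumed search budget per turn
) -> Tuple[str, List[str], str]:
    """Run one dialogue turn; return (answer y, search queries Q, full trace)."""
    context = "\n".join(history + [f"User: {query}"])
    queries: List[str] = []
    for _ in range(max_searches + 1):
        step = generate(context)          # reasoning text plus one action
        context += "\n" + step
        answer = ANSWER_RE.search(step)
        if answer:                        # the policy decided to answer
            return answer.group(1).strip(), queries, context
        request = SEARCH_RE.search(step)
        if request is None or len(queries) >= max_searches:
            break                         # no valid action or budget used up
        q_k = request.group(1).strip()
        queries.append(q_k)
        docs = search(q_k)                # retrieved documents join the context
        context += "\n<documents>" + " | ".join(docs) + "</documents>"
    return "", queries, context           # no answer was produced


def make_scripted_policy() -> Callable[[str], str]:
    """Toy policy that replays two fixed steps, for demonstration only."""
    steps = iter([
        "The question is underspecified. <search>capital of France</search>",
        "The documents answer it. <answer>Paris</answer>",
    ])
    return lambda context: next(steps)
```

With the scripted policy, one turn issues a single search query and then answers; in training, the returned queries and answer would be the inputs to the reward.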

RL Objective Function

The optimization objective maximizes expected rewards while minimizing distance from the original policy:

J(θ) = E_{(q,H)~D, τ~π_θ(·|q,H;R)} [R(τ)] - β D_KL(π_θ || π_ref)
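
As a rough illustration of this objective, the sketch below estimates J(θ) from a batch of sampled trajectories. It assumes sequence-level log-probabilities under both policies are available and uses the plain sample estimate of the KL term; the function name and the value of `beta` are illustrative choices, not values taken from the paper.

```python
from typing import Sequence


def kl_regularized_objective(
    rewards: Sequence[float],          # R(tau) for each sampled trajectory
    policy_logprobs: Sequence[float],  # log pi_theta(tau) for each trajectory
    ref_logprobs: Sequence[float],     # log pi_ref(tau) for each trajectory
    beta: float = 0.001,               # assumed KL weight, for illustration
) -> float:
    """Monte Carlo estimate of E[R(tau)] - beta * KL(pi_theta || pi_ref)."""
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one trajectory")
    mean_reward = sum(rewards) / n
    # With tau sampled from pi_theta, the mean log-ratio estimates the KL term.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs)) / n
    return mean_reward - beta * kl_estimate
```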

PPO Optimization

Training uses the Proximal Policy Optimization (PPO) algorithm, which maximizes the clipped surrogate objective:

L_PPO(θ) = E_{(q,H;R;i)~μ} [min(ρ_i(θ)Â_i, clip(ρ_i(θ), 1-ε, 1+ε)Â_i)]

where ρ_i(θ) is the probability ratio between new and old policies, and Â_i is the estimated advantage function.
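
To make the clipping concrete, here is a small pure-Python sketch of the clipped surrogate averaged over a handful of tokens. The default `epsilon=0.2` is a commonly used value and an assumption here, since the summary does not report the clipping range.

```python
import math
from typing import Sequence


def clipped_surrogate(
    new_logprobs: Sequence[float],  # log-probs under the current policy
    old_logprobs: Sequence[float],  # log-probs under the rollout policy
    advantages: Sequence[float],    # estimated advantages A_i
    epsilon: float = 0.2,           # assumed clipping range
) -> float:
    """Mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) over samples."""
    if not advantages:
        raise ValueError("need at least one sample")
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # rho_i(theta)
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

For a positive advantage the ratio stops contributing once it exceeds 1 + ε, and for a negative advantage once it falls below 1 - ε, which is what keeps each update close to the rollout policy.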

Reward Mechanism Design

Composite Reward Function

R(τ) = R_answer(y) + α R_intent(Q)

Answer Reward

Evaluates final answer quality based on token-level F1 score:

R_answer(y) = F1(y, y*)
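
A minimal sketch of token-level F1 is shown below. The normalization (lowercasing and whitespace tokenization) is an assumption; the summary does not say which normalization the paper applies.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted string and a reference string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings match; one empty string against text does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "paris france" against the reference "paris" has precision 0.5 and recall 1.0, so its F1 is 2/3.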

Intent Reward

Measures alignment between search queries and user intent:

R_intent(Q) = max_{q_k∈Q} F1(q_k, q_rw)

Taking the maximum ensures the model receives rewards for formulating semantically correct reformulations while maintaining flexibility for exploratory queries.
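
The intent reward and the composite reward can be sketched as below. The tokenization inside `token_f1`, the zero reward when no search query is issued, and the default `alpha=0.2` are assumptions made for illustration; the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 with lowercased whitespace tokens (assumed normalization)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def intent_reward(search_queries: Sequence[str], rewritten_query: str) -> float:
    """R_intent(Q): best F1 between any issued query q_k and the rewrite q_rw."""
    if not search_queries:
        return 0.0  # assumption: no search means no intent reward
    return max(token_f1(q_k, rewritten_query) for q_k in search_queries)


def total_reward(
    answer: str,
    gold_answer: str,
    search_queries: Sequence[str],
    rewritten_query: str,
    alpha: float = 0.2,  # assumed weight on the intent term
) -> float:
    """R(tau) = R_answer(y) + alpha * R_intent(Q)."""
    answer_term = token_f1(answer, gold_answer)
    return answer_term + alpha * intent_reward(search_queries, rewritten_query)
```

Because only the best-matching query counts, a trajectory can issue additional exploratory queries without lowering its intent reward.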

Technical Innovations

  1. End-to-End Optimization: Unlike traditional decoupled pipelines, ChatR1 jointly optimizes reasoning, retrieval, and generation
  2. Intent-Aware Design: A reward mechanism specifically designed for CQA that directly evaluates query quality rather than relying on retrieval results
  3. Adaptive Reasoning: Learns when and how to search through RL rather than relying on predefined static policies

Experimental Setup

Datasets

Uses five diverse CQA datasets:

Dataset        Turns       Primary Challenge
TopiOCQA       45k/2.5k    Topic switching, intent evolution
QReCC          63k/16k     Large-scale corpus, query reformulation
INSCIT         1.8k/3.3k   Mixed-initiative, open intent
MultiDoc2Dial  18k/3.3k    Multi-document grounding, domain reasoning
FaithDial      18k/3.5k    Faithfulness, hallucination control

Evaluation Metrics

  • Generation Quality: F1, BERTScore, LLM-as-judge
  • Retrieval Quality: nDCG, Recall, MRR, hit@N
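
For reference, the retrieval metrics listed above can be computed as in the following sketch. It assumes binary relevance labels and a single ranked list per query; the paper's exact cutoffs and evaluation tooling are not given in this summary, so the function names and signatures are illustrative.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return float(any(doc in relevant for doc in ranked[:k]))


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary gains and a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(k, len(relevant))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal > 0 else 0.0
```

Per-query values are typically averaged over all evaluation turns to obtain the reported score.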

Baseline Methods

  1. Zero-shot Approaches: GPT-3.5, Claude, Qwen with direct reasoning and CoT
  2. Supervised Fine-tuning: conv-ANCE+Mistral, ChatRetriever+Mistral, UniConv
  3. RL Training: CoT R1, QR Search R1, etc.

Implementation Details

  • Base Model: Qwen2.5-3B/7B-Instruct
  • Retrieval Model: intfloat/e5-base-v2 (about 110M parameters)
  • Training Configuration: Batch size 512, PPO mini-batch 64, learning rate 1e-6
  • Hardware: 4 H100 GPUs

Experimental Results

Main Results

Performance comparisons across five datasets show:

  1. ChatR1-3B Outperforms Large Closed-Source Models: Surpasses ChatGPT and Claude while using fewer parameters
  2. Exceeds Supervised Baselines: ChatR1-3B outperforms all 3B supervised and RL baselines on most datasets in F1 and BERTScore
  3. Clear Scaling Effects: ChatR1-7B shows average improvements of 1.4 F1 points and 0.5 BERTScore over the 3B version

Generalization Capability

Cross-domain transfer experiments (training on QReCC, testing on other datasets) demonstrate:

  • ChatR1-3B shows only 0.2 point loss on MultiDoc2Dial
  • Still surpasses ChatGPT zero-shot performance on three datasets
  • Exhibits strong retrieval tool usage ability rather than domain-specific overfitting

Ablation Studies

Intent Reward Effectiveness

  • ChatR1-3B shows average improvement of 2.2 F1 points compared to the version without intent rewards
  • Query-level F1 rewards outperform document-based hit@k rewards
  • Optimal performance achieved at retrieval/generation reward ratio of 0.2/1.0

Reward Design Analysis

Advantages of intent rewards over retrieval rewards:

  1. Higher Density: Provides stronger learning signals for PPO
  2. Error Decoupling: Independent of search engine, separates retrieval and query formulation errors
  3. Complete Annotation: Avoids incompleteness issues in document relevance annotations

Case Analysis

Reasoning Path Diversity

Different datasets exhibit different reasoning length distributions:

  • MultiDoc2Dial and QReCC require the longest reasoning trajectories
  • FaithDial trajectories are relatively short
  • INSCIT shows the most dispersed distribution, reflecting its mixed-initiative nature

Retrieval Performance

When its search queries are evaluated as a retrieval tool, ChatR1 is comparable to supervised methods:

  • ChatR1-7B matches or exceeds supervised baselines on TopiOCQA and QReCC
  • This indicates that effective retrieval behavior can be learned autonomously through interaction

Related Work

Conversational Question Answering

Traditional CQA methods primarily rely on static RAG pipelines and supervised fine-tuning, lacking explicit reasoning mechanisms to determine when and how to search.

RL-Based Reasoning QA

Recent work such as Search-R1 and ReSearch applies RL to single-turn reasoning but has not been extended to multi-turn dialogue scenarios.

Tool Usage

Methods like CALM extend reasoning to multi-turn dialogue but still rely on supervised fine-tuning rather than RL training.

Conclusions and Discussion

Main Conclusions

  1. RL Reasoning Effectiveness: ChatR1 demonstrates that RL can improve reasoning capabilities in CQA
  2. Intent Reward Importance: Specially designed intent-aware rewards significantly enhance performance
  3. Cross-Domain Generalization: RL reasoning exhibits stronger flexibility and context sensitivity compared to static CQA pipelines

Limitations

  1. Single Optimization Strategy: Only uses PPO, without exploring other optimization strategies
  2. Dialogue Length Constraints: Experiments focus on medium-length dialogues (10-12 turns)
  3. Computational Cost: RL training increases computational overhead for both training and inference
  4. Lack of Personalization: Does not consider user-specific adaptation and personalization

Future Directions

  1. Dialogue-Level Optimization: Use simulated users and preference-based feedback
  2. Longer Dialogue Handling: Enhance memory and context modeling capabilities
  3. Efficiency Optimization: Develop more efficient optimization schedules
  4. Bias Mitigation: Explore bias mitigation and stronger factual grounding in RL optimization

In-Depth Evaluation

Strengths

  1. Strong Innovation: First systematic application of RL to multi-turn CQA, filling an important research gap
  2. Reasonable Design: Intent-aware rewards are carefully designed for CQA characteristics, addressing reward sparsity
  3. Comprehensive Experiments: Five datasets covering different dialogue complexities provide thorough evaluation
  4. In-Depth Analysis: Provides multi-perspective analytical insights including reasoning paths and retrieval quality

Weaknesses

  1. Theoretical Foundation: Lacks theoretical analysis of convergence and stability of RL in CQA
  2. Computational Efficiency: Insufficient discussion of computational cost trade-offs compared to supervised methods
  3. User Studies: Lacks real user interaction evaluation, relying only on offline metrics
  4. Error Analysis: Insufficient analysis of failure cases

Impact

  1. Academic Value: Introduces a new RL paradigm to the CQA field, inspiring subsequent research
  2. Practical Value: Methods applicable to real conversational systems, enhancing user experience
  3. Reproducibility: Provides implementation details and open-source code for easy reproduction

Applicable Scenarios

  1. Information Retrieval Systems: Search engines and QA systems requiring multi-turn interaction
  2. Customer Service Chatbots: Intelligent customer service scenarios handling complex queries
  3. Educational Tutoring: Online learning platforms requiring progressive guidance

References

The paper cites important works in reinforcement learning, dialogue systems, and information retrieval, particularly:

  • PPO Algorithm (Schulman et al., 2017)
  • RL Reasoning Work such as Search-R1 (Jin et al., 2025)
  • Conversational QA Dataset Construction Work (Adlakha et al., 2022; Anantha et al., 2021)

Overall Assessment: This is a high-quality research paper demonstrating excellence in technical innovation, experimental design, and analytical depth. Introducing reinforcement learning to multi-turn conversational question answering represents a meaningful research direction, and the design of intent-aware rewards cleverly addresses key challenges in CQA. Despite some limitations, the paper makes important contributions to the field and merits further research and application.