ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering

Lupart, Aliannejadi, Kanoulas

We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static "rewrite, retrieve, and generate" pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1's performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.

Basic Information

Paper ID: 2510.13312
Title: ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering
Authors: Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas (University of Amsterdam)
Classification: cs.CL, cs.IR
Publication Date: October 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.13312

Abstract

This paper proposes ChatR1, a conversational question-answering reasoning framework based on reinforcement learning. In conversational QA, user intent evolves across multiple turns, utterances are often incomplete and require contextual interpretation, and dynamic coordination between query reformulation and retrieval-augmented generation is necessary. Unlike static "rewrite-retrieve-generate" pipelines, ChatR1 alternates between search and reasoning across multiple dialogue turns, enabling exploratory and adaptive behavior through reinforcement learning. To address the challenges of sparse and delayed rewards in RL, the authors propose intent-aware rewards that provide turn-level feedback by aligning retrieval and reasoning with evolving user objectives. ChatR1 demonstrates strong performance on both 3B and 7B models, surpassing competing approaches across five CQA datasets.

Research Background and Motivation

Problem Definition

The core challenges faced by conversational question answering (CQA) include:

Evolving User Intent: What the user wants shifts from one dialogue turn to the next
Utterance Incompleteness: User utterances often depend on context and require coreference and ellipsis resolution
Dynamic Coordination Requirements: Retrieval and generation must be coordinated dynamically as the conversation unfolds

Limitations of Existing Methods

Static Pipeline Constraints: Existing methods predominantly employ static "rewrite-retrieve-generate" pipelines, lacking flexibility
Supervised Learning Dependency: Most methods rely on supervised fine-tuning (SFT), making it difficult to adapt to dialogue scenarios unseen during training
Single-Turn Interaction Assumption: Existing RL reasoning frameworks primarily target single-turn interactions, failing to account for multi-turn dialogue complexity

Research Motivation

Commercial systems (such as Perplexity.ai and SearchGPT) increasingly favor multi-turn conversational search, yet academic research lags in this area. Reinforcement learning can enable models to learn dynamic retrieval and reasoning strategies rather than relying on static demonstration data.

Core Contributions

Proposes ChatR1 Framework: The first RL-based CQA reasoning model that end-to-end optimizes multi-turn retrieval and generation, learning dynamic behavior rather than static pipelines
Designs Intent-Aware Rewards: A reward mechanism specifically tailored for CQA that reduces reward sparsity by aligning with evolving user intent
Comprehensive Experimental Validation: Validates performance across five CQA datasets of varying complexity, demonstrating cross-domain generalization capability
In-Depth Analytical Insights: Reveals that ChatR1 generates diverse reasoning paths, effectively utilizes search tools, and exhibits cross-domain robustness

Methodology Details

Task Definition

The setting consists of a dataset D of multi-turn user-system dialogues and a document collection C. At each turn, the system receives the dialogue history H and the current user query q, and must generate an answer y that uses the context in H and grounds its facts in C. User intent is defined as the reformulated query q_rw, which resolves the contextual references and ambiguities in q.
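
The task inputs can be pictured as a small data structure. The sketch below is only illustrative: the class and field names are hypothetical, and using reference answers as the system side of the history is an assumption made for this example, not something the summary specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One conversational turn (field names are illustrative, not from the paper)."""
    query: str            # current user query q, possibly underspecified
    rewritten_query: str  # q_rw, the reformulated query that defines user intent
    gold_answer: str      # reference answer for this turn


@dataclass
class Dialogue:
    """A multi-turn dialogue from the dataset D."""
    turns: list[Turn] = field(default_factory=list)

    def history(self, t: int) -> list[tuple[str, str]]:
        """Dialogue history H at turn t: all earlier (query, answer) pairs.

        Assumption: the history is built from reference answers.
        """
        return [(turn.query, turn.gold_answer) for turn in self.turns[:t]]
```

With two turns such as "Who wrote Dune?" followed by "When was it published?", `history(1)` returns only the first question and its answer, which is the context needed to resolve "it".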

Model Architecture

Interaction Loop

ChatR1 is a policy model π_θ that generates a trajectory τ at each turn, comprising:

Reasoning Trajectory: Thought process (...)
Intermediate Search Queries: Q = {q_k}_{k=1}^{K}, sent to the search engine R
Retrieved Documents: Relevant documents returned based on search queries
Final Answer: y
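
The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the "SEARCH:" / "ANSWER:" / "DOCS:" step markers, the `run_turn` function, the `max_steps` budget, and the toy policy and search functions are all hypothetical stand-ins for the model's real output format and retriever.

```python
from typing import Callable


def run_turn(policy: Callable[[str], str],
             search: Callable[[str], list[str]],
             prompt: str,
             max_steps: int = 6) -> dict:
    """Roll out one turn by alternating policy steps and search calls.

    Assumption: each policy step starts with "SEARCH:" or "ANSWER:";
    anything else is kept in the context as free-form reasoning.
    """
    context = prompt
    queries: list[str] = []
    for _ in range(max_steps):
        step = policy(context)
        context += "\n" + step
        if step.startswith("ANSWER:"):
            answer = step[len("ANSWER:"):].strip()
            return {"queries": queries, "answer": answer, "trajectory": context}
        if step.startswith("SEARCH:"):
            query = step[len("SEARCH:"):].strip()
            queries.append(query)
            # Retrieved documents are appended so the next step can read them.
            context += "\nDOCS: " + " | ".join(search(query))
    # Step budget exhausted without a final answer.
    return {"queries": queries, "answer": "", "trajectory": context}


def toy_policy(context: str) -> str:
    """Toy stand-in for the policy: search once, then answer."""
    if "DOCS:" not in context:
        return "SEARCH: capital of France"
    return "ANSWER: Paris"


def toy_search(query: str) -> list[str]:
    """Toy stand-in for the search engine."""
    return ["Paris is the capital of France."]


result = run_turn(toy_policy, toy_search, "User: what is its capital?")
```

Here `result["queries"]` holds the search queries issued during the turn and `result["answer"]` the final answer, which are the two pieces the reward later scores.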

RL Objective Function

The optimization objective maximizes the expected reward while a KL term, weighted by β, penalizes divergence from the reference policy π_ref:

J(θ) = E_{(q,H)~D, τ~π_θ(·|q,H;R)} [R(τ)] - β D_KL(π_θ || π_ref)

PPO Optimization

Uses Proximal Policy Optimization (PPO) algorithm, maximizing the clipped surrogate objective:

L_PPO(θ) = E_{(q,H;R;i)~μ} [min(ρ_i(θ)Â_i, clip(ρ_i(θ), 1-ε, 1+ε)Â_i)]

where ρ_i(θ) is the probability ratio between new and old policies, and Â_i is the estimated advantage function.
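
To make the clipping behavior concrete, here is a small pure-Python sketch of the clipped surrogate for a batch of samples. The function name is hypothetical, and eps = 0.2 is a commonly used default assumed for illustration; the summary does not state the value of ε used in the paper.

```python
def clipped_surrogate(ratios: list[float],
                      advantages: list[float],
                      eps: float = 0.2) -> float:
    """Mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) over samples.

    eps = 0.2 is an assumed default, not a value taken from the paper.
    """
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for rho, adv in zip(ratios, advantages):
        clipped_rho = min(max(rho, 1.0 - eps), 1.0 + eps)
        # The min keeps the more pessimistic of the two estimates.
        total += min(rho * adv, clipped_rho * adv)
    return total / len(ratios)
```

With a positive advantage, a ratio above 1 + ε earns no extra credit, so the update has no incentive to move the policy further in that direction. With a negative advantage, the unclipped term is kept whenever it is the smaller one, so a bad action that became more likely is still fully penalized.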

Reward Mechanism Design

Composite Reward Function

R(τ) = R_answer(y) + α R_intent(Q)

Answer Reward

Evaluates final answer quality based on token-level F1 score:

R_answer(y) = F1(y, y*)
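
A token-level F1 of this kind can be sketched as below. The normalization here (lowercasing and whitespace splitting) is an assumption made for brevity; the paper may apply a different normalization, for example stripping punctuation and articles, and the function name is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference string.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "the cat sat" against the reference "the cat" shares two tokens, giving precision 2/3, recall 1, and an F1 of 0.8.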

Intent Reward

Measures alignment between search queries and user intent:

R_intent(Q) = max_{q_k∈Q} F1(q_k, q_rw)

Taking the maximum ensures the model receives rewards for formulating semantically correct reformulations while maintaining flexibility for exploratory queries.
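
The intent reward and the composite reward can be sketched together as follows. This is a hedged illustration: the function names are hypothetical, the F1 uses simple lowercased whitespace tokens, an empty query set is assumed to score 0, and alpha = 0.2 is an assumption read from the 0.2/1.0 reward ratio reported in the ablations.

```python
from collections import Counter


def token_f1(a: str, b: str) -> float:
    """Token-level F1 over lowercased whitespace tokens (assumed normalization)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return float(ta == tb)
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r)


def intent_reward(queries: list[str], rewritten_query: str) -> float:
    """R_intent(Q): best F1 between any issued query q_k and q_rw.

    Assumption: a turn with no search queries gets an intent reward of 0.
    """
    return max((token_f1(q, rewritten_query) for q in queries), default=0.0)


def total_reward(answer: str, gold_answer: str,
                 queries: list[str], rewritten_query: str,
                 alpha: float = 0.2) -> float:
    """R(tau) = R_answer(y) + alpha * R_intent(Q); alpha = 0.2 is assumed."""
    return token_f1(answer, gold_answer) + alpha * intent_reward(queries, rewritten_query)
```

If a trajectory issues the queries "eiffel tower" and "how tall is the eiffel tower" and q_rw is "how tall is the eiffel tower", the first query scores 0.5 and the second 1.0, so the intent reward is 1.0. The exploratory first query costs nothing, which is the flexibility the maximum is meant to preserve.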

Technical Innovations

End-to-End Optimization: Unlike traditional decoupled pipelines, ChatR1 jointly optimizes reasoning, retrieval, and generation
Intent-Aware Design: A reward mechanism specifically designed for CQA that directly evaluates query quality rather than relying on retrieval results
Adaptive Reasoning: Learns when and how to search through RL rather than relying on predefined static policies

Experimental Setup

Datasets

Uses five diverse CQA datasets:

Dataset	Turns (train/test)	Primary Challenge
TopiOCQA	45k/2.5k	Topic switching, intent evolution
QReCC	63k/16k	Large-scale corpus, query reformulation
INSCIT	1.8k/3.3k	Mixed-initiative, open intent
MultiDoc2Dial	18k/3.3k	Multi-document grounding, domain reasoning
FaithDial	18k/3.5k	Faithfulness, hallucination control

Evaluation Metrics

Generation Quality: F1, BERTScore, LLM-as-judge
Retrieval Quality: nDCG, Recall, MRR, hit@N
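
For reference, the retrieval metrics listed above can be computed as in the sketch below. It assumes binary relevance labels and a single ranked list per query; the function names are hypothetical and the paper's exact cutoffs and evaluation tooling are not given in this summary.

```python
import math


def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return float(any(doc in relevant for doc in ranked[:k]))


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """nDCG@k with binary gains (assumed) and a log2 position discount."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```

For a ranking ["d3", "d1", "d7"] with one relevant document "d1", hit@1 is 0, recall@3 is 1, MRR is 0.5, and nDCG@3 is 1/log2(3), roughly 0.63.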

Baseline Methods

Zero-shot Approaches: GPT-3.5, Claude, Qwen with direct reasoning and CoT
Supervised Fine-tuning: conv-ANCE+Mistral, ChatRetriever+Mistral, UniConv
RL Training: CoT R1, QR Search R1, etc.

Implementation Details

Base Model: Qwen2.5-3B/7B-Instruct
Retrieval Model: intfloat/e5-base-v2 (300M parameters)
Training Configuration: Batch size 512, PPO mini-batch 64, learning rate 1e-6
Hardware: 4 H100 GPUs

Experimental Results

Main Results

Performance comparisons across five datasets show:

ChatR1-3B Outperforms Large Closed-Source Models: Surpasses ChatGPT and Claude while using fewer parameters
Exceeds Supervised Baselines: ChatR1-3B outperforms all 3B supervised and RL baselines on most datasets in F1 and BERTScore
Clear Scaling Effects: ChatR1-7B shows average improvements of 1.4 F1 points and 0.5 BERTScore over the 3B version

Generalization Capability

Cross-domain transfer experiments (training on QReCC, testing on other datasets) demonstrate:

ChatR1-3B loses only 0.2 points on MultiDoc2Dial
It still surpasses ChatGPT zero-shot performance on three datasets
These results suggest the model learned general use of the retrieval tool rather than overfitting to one domain

Ablation Studies

Intent Reward Effectiveness

ChatR1-3B shows an average improvement of 2.2 F1 points over the version trained without intent rewards
Query-level F1 rewards outperform document-based hit@k rewards
Best performance is achieved at a retrieval/generation reward weighting of 0.2/1.0, which matches α = 0.2 in the composite reward when the answer reward has weight 1.0

Reward Design Analysis

Advantages of intent rewards over retrieval rewards:

Higher Density: Provides stronger learning signals for PPO
Error Decoupling: Independent of search engine, separates retrieval and query formulation errors
Complete Annotation: Avoids incompleteness issues in document relevance annotations

Case Analysis

Reasoning Path Diversity

Different datasets exhibit different reasoning length distributions:

MultiDoc2Dial and QReCC require the longest reasoning trajectories
FaithDial trajectories are relatively shorter
INSCIT trajectory lengths are the most dispersed, reflecting its mixed-initiative nature

Retrieval Performance

ChatR1's retrieval performance as a tool is comparable to supervised methods:

ChatR1-7B matches or exceeds supervised baselines on TopiOCQA and QReCC
Shows that effective retrieval behavior can be learned through interaction alone, without retrieval supervision

Related Work

Conversational Question Answering

Traditional CQA methods primarily rely on static RAG pipelines and supervised fine-tuning, lacking explicit reasoning mechanisms to determine when and how to search.

RL-Based Reasoning QA

Recent methods such as Search-R1 and ReSearch apply RL to single-turn reasoning but have not been extended to multi-turn dialogue scenarios.

Tool Usage

Methods like CALM extend reasoning to multi-turn dialogue but still rely on supervised fine-tuning rather than RL training.

Conclusions and Discussion

Main Conclusions

RL Reasoning Effectiveness: ChatR1 demonstrates that RL can improve reasoning capabilities in CQA
Intent Reward Importance: Specially designed intent-aware rewards significantly enhance performance
Cross-Domain Generalization: RL reasoning exhibits stronger flexibility and context sensitivity compared to static CQA pipelines

Limitations

Single Optimization Strategy: Only uses PPO, without exploring other optimization strategies
Dialogue Length Constraints: Experiments focus on medium-length dialogues (10-12 turns)
Computational Cost: RL training increases computational overhead for both training and inference
Lack of Personalization: Does not consider user-specific adaptation and personalization

Future Directions

Dialogue-Level Optimization: Use simulated users and preference-based feedback
Longer Dialogue Handling: Enhance memory and context modeling capabilities
Efficiency Optimization: Develop more efficient optimization schedules
Bias Mitigation: Explore bias mitigation and stronger factual grounding in RL optimization

In-Depth Evaluation

Strengths

Strong Innovation: First systematic application of RL to multi-turn CQA, filling an important research gap
Reasonable Design: Intent-aware rewards are carefully designed for CQA characteristics, addressing reward sparsity
Comprehensive Experiments: Five datasets covering different dialogue complexities provide thorough evaluation
In-Depth Analysis: Provides multi-perspective analytical insights including reasoning paths and retrieval quality

Weaknesses

Theoretical Foundation: Lacks theoretical analysis of convergence and stability of RL in CQA
Computational Efficiency: Insufficient discussion of computational cost trade-offs compared to supervised methods
User Studies: Lacks real user interaction evaluation, relying only on offline metrics
Error Analysis: Insufficient analysis of failure cases

Impact

Academic Value: Introduces a new RL paradigm to the CQA field, inspiring subsequent research
Practical Value: Methods applicable to real conversational systems, enhancing user experience
Reproducibility: Provides implementation details and open-source code for easy reproduction

Applicable Scenarios

Information Retrieval Systems: Search engines and QA systems requiring multi-turn interaction
Customer Service Chatbots: Intelligent customer service scenarios handling complex queries
Educational Tutoring: Online learning platforms requiring progressive guidance

References

The paper cites important works in reinforcement learning, dialogue systems, and information retrieval, particularly:

PPO Algorithm (Schulman et al., 2017)
RL Reasoning Work such as Search-R1 (Jin et al., 2025)
Conversational QA Dataset Construction Work (Adlakha et al., 2022; Anantha et al., 2021)

Overall Assessment: This is a high-quality research paper demonstrating excellence in technical innovation, experimental design, and analytical depth. Introducing reinforcement learning to multi-turn conversational question answering represents a meaningful research direction, and the design of intent-aware rewards cleverly addresses key challenges in CQA. Despite some limitations, the paper makes important contributions to the field and merits further research and application.