From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization
Wang, Su, Tian et al.
Improving the multi-step reasoning ability of Large Language Models (LLMs) is a critical yet challenging task. The dominant paradigm, outcome-supervised reinforcement learning (RLVR), rewards only correct final answers, often propagating flawed reasoning and suffering from sparse reward signals. While process-level reward models (PRMs) provide denser, step-by-step feedback, they lack generalizability and interpretability, requiring task-specific segmentation of the reasoning process. To this end, we propose the Dimension-level Reward Model (DRM), a new supervision framework that bridges the gap between these two approaches. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions: Confidence for uncertainty calibration, Relevance for semantic alignment, and Coherence for logical consistency. Together, these dimensions capture aspects beyond final answer correctness and enable interpretable assessment without requiring ground truth answers. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability. In particular, DRM-supervised training achieves consistent gains on both in-distribution and out-of-distribution open-domain tasks, including mathematics, question answering, code execution, and puzzles. Our findings demonstrate that multidimensional supervision of the reasoning process can improve the generalized reasoning ability of LLMs beyond the training distribution.
From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization
Enhancing the multi-step reasoning capabilities of large language models (LLMs) is a critical yet challenging task. The mainstream paradigm, reinforcement learning with verifiable rewards (RLVR), rewards only correct final answers, which often propagates flawed reasoning and suffers from sparse reward signals. Process-level reward models (PRMs) provide denser, step-by-step feedback, but they lack generalizability and interpretability and require task-specific segmentation of the reasoning process. To address this, the authors propose the Dimension-level Reward Model (DRM), a supervision framework that bridges these two approaches. DRM assesses the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions: confidence (uncertainty calibration), relevance (semantic alignment), and coherence (logical consistency). Together, these dimensions capture aspects beyond final-answer correctness and enable interpretable evaluation without requiring ground-truth answers. Experimental results show that DRM provides effective supervision signals that guide LLM optimization and enhance reasoning capabilities.
The core challenge for current LLMs in multi-step reasoning tasks is how to supervise and optimize the quality of the reasoning process itself, rather than only the correctness of the final answer.
The authors observe that a high-quality reasoning process should have three key characteristics: it is certain about its outputs, grounded in the given inputs, and internally consistent. Based on this insight, they propose a multidimensional supervision framework.
Proposes the DRM framework: the first to decompose reasoning supervision into three complementary dimensions (confidence, relevance, coherence), providing dense and interpretable supervision signals.
Addresses existing limitations: avoids RLVR's sparse-reward problem and PRMs' need for task-specific segmentation.
Achieves significant performance improvements: consistent gains across multiple open-domain tasks, such as MATH500 (+8.8), 2WIKI RAG (+8.7), and CRUXEVAL (+7.1).
Provides theoretical and practical insights: demonstrates that multidimensional supervision of the reasoning process improves LLMs' generalization beyond the training distribution.
Formal definition: Given an input I, the model output O is decomposed into a reasoning process R and an answer A. In open-domain scenarios, I contains a question Q and additional information D. The entire input-output structure is represented as a quadruple: (Q, D, R, A).
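The quadruple above can be sketched as a small data structure. This is only an illustration: the class name ReasoningSample, the helper split_output, and the <think>/<answer> tag format are assumptions made for this example and are not specified in the summary.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReasoningSample:
    """Hypothetical container for the quadruple (Q, D, R, A)."""
    question: str    # Q: the question
    documents: str   # D: additional information supplied with the question
    reasoning: str   # R: the reasoning process
    answer: str      # A: the final answer


def split_output(output: str) -> Tuple[str, str]:
    """Split a model output O into (R, A).

    Assumes the output wraps R in <think>...</think> and A in
    <answer>...</answer>; this tag format is an assumption of the sketch.
    """
    think = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    if think is None or answer is None:
        raise ValueError("output must contain <think> and <answer> blocks")
    return think.group(1).strip(), answer.group(1).strip()


reasoning, answer = split_output("<think>2 + 2 = 4</think><answer>4</answer>")
sample = ReasoningSample("What is 2 + 2?", "", reasoning, answer)
```

Keeping R and A as separate fields lets each part be scored on its own, which is what dimension-level supervision needs.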
Confidence. Objective: Assess the model's certainty about its output
Implementation:
$\mathrm{scoreConf}_R = \frac{1}{|R|} \sum_{t \in R} \log p(t)$ (average log probability of all tokens in R)
$\mathrm{scoreConf}_A = \sum_{t \in A} \log p(t)$ (sum of log probabilities of all tokens in A)
$\mathrm{scoreConf} = \mathrm{scoreConf}_R + \mathrm{scoreConf}_A$
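The confidence score defined above can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-token log probabilities of R and A have already been obtained from the model; the function name confidence_score and its parameter names are hypothetical.

```python
from typing import Sequence


def confidence_score(reasoning_logprobs: Sequence[float],
                     answer_logprobs: Sequence[float]) -> float:
    """Combine token log probabilities into a single confidence score.

    reasoning_logprobs: log p of each token in the reasoning process R.
    answer_logprobs:    log p of each token in the answer A.
    """
    if not reasoning_logprobs:
        raise ValueError("reasoning_logprobs must not be empty")
    # scoreConf_R: average log probability over the tokens of R.
    score_r = sum(reasoning_logprobs) / len(reasoning_logprobs)
    # scoreConf_A: sum of log probabilities over the tokens of A.
    score_a = sum(answer_logprobs)
    # scoreConf = scoreConf_R + scoreConf_A
    return score_r + score_a


# Three reasoning tokens (mean -1.0) and two answer tokens (sum -0.5).
example = confidence_score([-0.5, -1.5, -1.0], [-0.25, -0.25])
print(example)  # -1.5
```

Averaging over R keeps the reasoning term independent of how long the reasoning is, while the summed term for A grows with answer length.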
Coherence. Objective: Assess the logical consistency and text quality of the reasoning process
Implementation: Use an external outcome-level reward model (ORM) to evaluate logical consistency, fluency, and overall text quality
RLVR effectively improves LLM reasoning capabilities by using automatically verifiable correctness signals as rewards, but suffers from sparse rewards and neglects reasoning process quality.
The paper cites important works in reasoning evaluation, reinforcement learning, and reward modeling, which provide a solid theoretical foundation and comparison baselines for this research.
Overall Assessment: This is a high-quality research paper proposing an innovative multidimensional reasoning supervision framework that effectively addresses limitations of existing methods. The experimental design is comprehensive, results are convincing, and the work has significant theoretical and practical value for enhancing LLM reasoning capabilities.