Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization

Lu, Liu, Qu et al.
Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model's first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model's reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method using data generated by a small model can achieve better performance in most cases compared with approaches using data generated by a strong model or human-annotated.

Basic Information

  • Paper ID: 2510.11104
  • Title: Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization
  • Authors: Junjie Lu, Yuliang Liu, Chaofeng Qu, Wei Shen, Zhouhan Lin, Min Xu
  • Classification: cs.CL cs.AI
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.11104

Abstract

Current methods for enhancing large language model reasoning capabilities often introduce training biases toward human reasoning trajectories. In step-wise preference optimization in particular, dependence on intermediate-step annotations from humans or high-capability models limits exploration of alternative non-human reasoning paths and thereby constrains achievable performance. In a small-scale pilot study, the authors observe that in approximately 75% of cases the model's first erroneous step occurs after its lowest-confidence point. This suggests that guiding the model at the lowest-confidence point, before an error occurs, provides more accurate supervision than locating the first explicit error. The paper proposes Confidence-Guided Reasoning Path Preference Optimization (CGPO), which uses confidence signals to identify points of maximum uncertainty in the model's reasoning process and applies self-generated non-human reasoning path guidance to mitigate trajectory drift.

Research Background and Motivation

Problem Definition

The core challenges faced by current methods for enhancing large language model reasoning capabilities are:

  1. Human Bias Limitations: Existing methods over-rely on reasoning trajectories from humans or strong models, limiting exploration of non-human reasoning paths
  2. Inaccurate Error Localization: Traditional methods supervise by locating the first explicit error, but this is often not the optimal intervention point
  3. High Annotation Costs: Step-wise preference optimization requires extensive human or strong model annotations, making practical application costly

Research Motivation

Through analysis, the authors discovered that in approximately 75% of error cases, the model's first erroneous step occurs after its lowest confidence point. This observation inspired the idea of optimizing reasoning paths based on model confidence rather than human cognition.

Limitations of Existing Methods

  1. Step-DPO and Similar Methods: Rely on human or strong model annotations to locate error steps, with high costs and limited exploration space
  2. Traditional RLHF: Primarily focuses on outcome optimization with insufficient attention to intermediate steps in reasoning trajectories
  3. Human Alignment Bias: Forcing models to follow human reasoning patterns may limit their potential capabilities

Core Contributions

  1. Proposes CGPO Method: A confidence-guided reasoning path preference optimization method that does not rely on stronger models or human supervision
  2. Non-Human Reasoning Path Exploration: Constructs preference learning data through the model's own confidence signals, exploring non-human reasoning paths
  3. Multi-Domain Validation: Validates method effectiveness on mathematical reasoning and code generation tasks, demonstrating generalizability
  4. Open-Source Contribution: Commits to releasing complete code repositories, datasets, and trained models to promote reproducibility

Method Details

Task Definition

Given an input problem x, the initial policy model π₀ generates a reasoning sequence y = (y₁, y₂, ..., y_T), where each token yₜ ∈ V (the vocabulary). At decoding timestep t, model confidence is defined as the probability the model assigns to the token it emits:

cₜ ≜ p(yₜ | x, y_{<t}; π₀)
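
As a minimal sketch of this definition, the confidence of each generated token can be read off the softmax of the logits at that decoding step. The function names and the toy logits below are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_confidences(step_logits, token_ids):
    """Return c_t = p(y_t | x, y_<t) for each generated token.

    step_logits[t] holds the logits produced at decoding step t, and
    token_ids[t] is the vocabulary index of the token actually emitted.
    """
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]

# Toy example with a 3-token vocabulary (made-up numbers).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
emitted = [0, 2]
confidences = token_confidences(logits, emitted)
```

With uniform logits at the second step, the emitted token gets confidence 1/3, which is the kind of low-confidence position the method looks for.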

Model Architecture

1. Reasoning Step Definition

  • Uses confidence threshold τ to segment reasoning steps, with τ determined based on the distribution of all confidence values in the dataset
  • Tokens with confidence below τ serve as segmentation points, reconstructing sequence y into step sequence s = (s₁, s₂, ..., sⱼ)
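
The segmentation rule above can be sketched as follows. This is an illustration under stated assumptions: the nearest-rank percentile and the choice that a low-confidence token opens a new step (rather than closing the previous one) are assumptions made here, and the token list is a toy example.

```python
import math

def percentile_threshold(confidences, pct):
    """Nearest-rank percentile of a pooled list of confidence values."""
    ordered = sorted(confidences)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def segment_steps(tokens, confidences, tau):
    """Split tokens into steps, opening a new step at each token with c_t < tau."""
    steps = []
    current = []
    for tok, c in zip(tokens, confidences):
        if c < tau and current:
            # A low-confidence token marks a boundary: close the running step.
            steps.append(current)
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return steps

tokens = ["2", "+", "3", "=", "5", "so", "x", "=", "5"]
conf = [0.9, 0.95, 0.9, 0.99, 0.4, 0.3, 0.8, 0.97, 0.6]
steps = segment_steps(tokens, conf, tau=0.5)
```

In practice τ would come from `percentile_threshold` applied to confidences pooled over the whole dataset; the fixed `tau=0.5` here only keeps the toy example readable.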

2. Preference Pair Construction Process

Initial Trajectory Determination:

  • Selects the sequence before the most uncertain step as the shared initial reasoning trajectory sᵢₙᵢₜ

Chosen/Rejected Pair Construction:

  • Introduces reward model R to evaluate Top-k candidate tokens given (x, sᵢₙᵢₜ)
  • Selects the highest and lowest scoring tokens as starting tokens for chosen and rejected branches respectively
  • π₀ continues sampling from each starting token until it emits an end-of-sequence token or a token with confidence below τ
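
The construction process above can be sketched as a single function. The helpers `top_k_at`, `reward`, and `continue_from` are hypothetical stand-ins for the policy's Top-k proposal, the reward model R, and the policy's continued sampling; none of these names come from the paper. The sketch also assumes that the branching point is the single lowest-confidence token and that each prefix implicitly carries the problem x.

```python
def build_preference_pair(tokens, confidences, top_k_at, reward, continue_from, k=8):
    """Build (s_init, chosen, rejected) for one sampled trajectory.

    top_k_at(prefix, k)   -> list of up to k candidate next tokens (hypothetical)
    reward(prefix, token) -> float score from a reward model R (hypothetical)
    continue_from(prefix) -> tokens sampled until EOS or a token with c_t < tau
    """
    # Shared prefix: everything before the lowest-confidence token.
    pivot = min(range(len(confidences)), key=lambda i: confidences[i])
    s_init = tokens[:pivot]

    # Score the candidate next tokens and keep the two extremes.
    candidates = top_k_at(s_init, k)
    scored = sorted(candidates, key=lambda tok: reward(s_init, tok))
    worst, best = scored[0], scored[-1]

    # Let the policy finish each branch from its starting token.
    chosen = [best] + continue_from(s_init + [best])
    rejected = [worst] + continue_from(s_init + [worst])
    return s_init, chosen, rejected

# Toy stubs standing in for the policy and the reward model.
tokens = ["2", "+", "3", "=", "6"]
conf = [0.9, 0.95, 0.9, 0.99, 0.3]
toy_scores = {"5": 0.9, "6": 0.1, "4": 0.2}
s_init, chosen, rejected = build_preference_pair(
    tokens,
    conf,
    top_k_at=lambda prefix, k: list(toy_scores)[:k],
    reward=lambda prefix, tok: toy_scores[tok],
    continue_from=lambda prefix: ["<eos>"],
    k=3,
)
```

Both branches share `s_init`, so the preference signal is concentrated on the continuation that starts at the most uncertain position.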

3. Training Objective

Employs a DPO-style objective function:

L_CGPO(θ) = -E_{(s_init,s+,s-)~D}[log σ(β(Δ))]

where:

Δ = Δ_θ - Δ_ref
Δ_θ ≜ log π_θ(s+ | s_init) - log π_θ(s- | s_init)
Δ_ref ≜ log π_ref(s+ | s_init) - log π_ref(s- | s_init)
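
For a single preference pair, the objective above reduces to a few lines of arithmetic on sequence log-probabilities. The sketch below assumes those four log-probabilities have already been computed; the function name and example values are illustrative.

```python
import math

def cgpo_pair_loss(logp_theta_chosen, logp_theta_rejected,
                   logp_ref_chosen, logp_ref_rejected, beta=0.3):
    """-log sigmoid(beta * (delta_theta - delta_ref)) for one (s_init, s+, s-) triple."""
    delta_theta = logp_theta_chosen - logp_theta_rejected
    delta_ref = logp_ref_chosen - logp_ref_rejected
    z = beta * (delta_theta - delta_ref)
    # -log(sigmoid(z)) equals log(1 + exp(-z)); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log(2).
neutral = cgpo_pair_loss(-5.0, -5.0, -5.0, -5.0)

# Policy prefers the chosen branch more than the reference does: lower loss.
improved = cgpo_pair_loss(-4.0, -6.0, -5.0, -5.0, beta=0.3)
```

The full objective is the average of this quantity over the preference dataset D; minimizing it widens the policy's margin between s+ and s- relative to the reference model.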

Technical Innovations

  1. Confidence-Driven Step Segmentation: Breaks free from predefined anchor points, segmenting reasoning steps based on model's inherent uncertainty
  2. Self-Supervised Preference Construction: Utilizes reward models to select optimal/suboptimal tokens at the most uncertain point without human annotation
  3. Non-Human Reasoning Exploration: Allows models to explore reasoning paths that may not align with human cognitive habits but could be more effective

Experimental Setup

Datasets

Mathematical Reasoning Tasks:

  • Training data: 10,795 prompts from Step-DPO-10k dataset
  • Evaluation datasets: GSM8K, MATH, Omni-Math
  • Models: MetaMath-Mistral-7B, MetaMath-Llama-8B, Qwen2-7B-SFT, etc.

Code Generation Tasks:

  • Training data: 2,641 samples from LeetCodeDataset training set
  • Evaluation datasets: LiveCodeBench, LeetCodeDataset
  • Models: DeepSeek-Coder-7B-Instruct-v1.5

Evaluation Metrics

  • Mathematical Reasoning: Exact match accuracy (final answer exactly matches standard answer)
  • Code Generation: Pass rate (generated code passes all test cases in sandbox environment)
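
Both metrics are simple ratios, sketched below. The whitespace-only normalization in the exact-match check is an assumption made for illustration; the summary does not say how answers are normalized, and the sandbox execution that produces the per-test results is outside the scope of this sketch.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of final answers that equal the reference after trimming whitespace."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

def pass_rate(results):
    """Fraction of problems whose generated code passes every test case.

    results[i] is a list of booleans, one per test case of problem i.
    """
    return sum(all(case_results) for case_results in results) / len(results)

math_accuracy = exact_match_accuracy(["42", " 7 ", "3"], ["42", "7", "4"])
code_pass_rate = pass_rate([[True, True], [True, False], [True]])
```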

Baseline Methods

  • Base Model: Original baseline model
  • Step-DPO: Step-wise preference optimization method based on human or strong-model annotations

Implementation Details

  • Confidence threshold: 2nd percentile of dataset confidence distribution
  • Top-k candidates: k=8
  • Training configuration: β=0.3-0.4, learning rate 5e-7, batch size 128, 4-8 training epochs

Experimental Results

Main Results

Mathematical Reasoning Task Performance:

  • GSM8K: CGPO outperforms Step-DPO on all models, with most significant improvement on MetaMath-Llama-8B (+4.3% vs base)
  • MATH: Outperforms Step-DPO on MetaMath-Llama-8B and Qwen2-7B-SFT
  • Key Finding: Even when Step-DPO shows performance decline (e.g., MetaMath-Mistral-7B), CGPO still provides improvements

Code Generation Task Performance:

  • LiveCodeBench: 2.1% relative improvement (19.3% → 19.7%, +0.4 percentage points)
  • LeetCodeDataset: reported 4.0% relative improvement (12.7% → 13.2%, +0.5 percentage points)
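
The percentage gains quoted for the code tasks are consistent with relative rather than absolute improvements, which a quick calculation makes explicit. The helper below is illustrative and uses only the rounded pass rates given above.

```python
def improvements(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)

livecodebench = improvements(19.3, 19.7)  # (0.4, 2.1)
leetcode = improvements(12.7, 13.2)       # (0.5, 3.9)
```

From the rounded pass rates, the LeetCodeDataset gain works out to about 3.9%; the reported 4.0% presumably reflects unrounded scores, which is an assumption since the raw numbers are not given here.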

Ablation Studies

1. Scalability Analysis

Validates method scalability by increasing training data size (10k → 80k):

  • MetaMath-Llama-8B improves from 85.3% to 86.4% on GSM8K
  • Qwen2-7B-SFT improves from 88.6% to 89.5% on GSM8K
  • Demonstrates good data scalability of CGPO

2. Reward Model Impact

Compares two reward models: ASPRM and Math-Shepherd:

  • ASPRM performs better, but improvements are observed even with weaker Math-Shepherd
  • Validates importance of fine-grained token-level evaluation

3. Confidence Threshold Analysis

  • Higher thresholds generally improve performance, but excessively high thresholds produce overly short sequences
  • Different models have different optimal thresholds requiring task-specific tuning

Generalization Ability Verification

Performance on Omni-Math (Olympiad-level mathematical competition problems):

  • CGPO outperforms Step-DPO on 4/5 models
  • Demonstrates good out-of-distribution generalization capability

Case Analysis

Analysis of 200 error samples supports the core hypothesis:

  • MetaMath-Llama-8B: in 78% of cases, the first error occurs after the lowest-confidence point
  • Qwen2-7B-SFT: in 72% of cases, the first error occurs after the lowest-confidence point
  • These rates support the design choice of intervening early, at the point of lowest confidence
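
The statistic behind this analysis can be sketched as a count over annotated error samples. The sample format, the strict "after" comparison, and the toy positions below are assumptions for illustration, not the paper's data.

```python
def fraction_error_after_lowest_confidence(samples):
    """samples: list of (first_error_index, lowest_confidence_index) token positions."""
    hits = sum(1 for first_error, lowest_conf in samples if first_error > lowest_conf)
    return hits / len(samples)

# Toy positions: two of the four first errors come strictly after the confidence minimum.
toy_samples = [(12, 7), (5, 9), (20, 20), (31, 4)]
rate = fraction_error_after_lowest_confidence(toy_samples)
```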

Related Work

Preference Optimization Methods

  • PPO: High complexity but stable performance
  • DPO/SimPO: Direct optimization of paired preference signals with lower computational overhead
  • This Work's Contribution: Extends preference optimization to intermediate steps in reasoning paths

Confidence-Aware Methods

  • Direct Probability Method: Uses probability of predicted tokens (adopted in this work)
  • Generation Consistency Method: Measures confidence through answer consistency
  • This Work's Innovation: Applies confidence to step segmentation and optimization of reasoning paths

Reasoning Trajectory Optimization

  • Supervised Fine-Tuning: Direct alignment to annotated sequences
  • RLHF: Optimizes trajectories toward higher scores
  • This Work's Advantage: No strong model annotation required, explores non-human reasoning paths

Conclusions and Discussion

Main Conclusions

  1. Value of Non-Human Reasoning Paths: Models can achieve better performance by exploring non-human reasoning paths
  2. Effectiveness of Confidence Signals: Model confidence is an effective indicator for identifying reasoning difficulty points
  3. Potential of Self-Supervised Learning: Effective reasoning capability enhancement is achievable without strong model or human annotations

Limitations

  1. Computational Resource Constraints: Unable to verify scalability on larger models (e.g., 70B)
  2. Domain Limitations: Primarily validated on mathematical and code domains; applicability to commonsense reasoning remains to be verified
  3. Reward Model Dependency: Still requires domain-specific fine-grained reward models

Future Directions

  1. Larger-Scale Validation: Verify method effectiveness on larger models and more domains
  2. Universal Reward Models: Develop cross-domain universal fine-grained evaluation models
  3. Theoretical Analysis: Deepen understanding of theoretical foundations for non-human reasoning paths

In-Depth Evaluation

Strengths

  1. Deep Problem Insight: Identifies human bias problems in existing methods and proposes novel solutions
  2. Clever Method Design: Combines confidence signals with preference optimization, optimizing reasoning paths without human or strong-model step annotations
  3. Comprehensive Experimental Validation: Multi-model, multi-task, multi-perspective experiments with convincing results
  4. High Practical Value: Reduces dependence on strong model annotations while improving performance

Weaknesses

  1. Insufficient Theoretical Foundation: Lacks deep theoretical explanation for why non-human reasoning paths are more effective
  2. Limited Applicability Scope: Primarily validated on structured reasoning tasks; applicability to open-ended tasks remains unknown
  3. Confidence Reliability: Model confidence itself may be unreliable, particularly on out-of-distribution data
  4. Computational Overhead Analysis: Lacks detailed analysis of computational cost changes compared to baseline methods

Impact

  1. Academic Value: Provides new research direction for reasoning capability optimization, potentially inspiring related work
  2. Practical Value: Reduces annotation costs while improving performance, with important engineering applications
  3. Reproducibility: Commits to open-sourcing complete code and data, facilitating method dissemination and improvement

Applicable Scenarios

  1. Resource-Constrained Environments: Reasoning capability enhancement when strong model annotations are unavailable
  2. Structured Reasoning Tasks: Mathematical, code, logical reasoning and other tasks with clear evaluation standards
  3. Model Self-Improvement: As a technical component for continuous learning and self-optimization

References

The paper cites important works in reasoning optimization, preference learning, and confidence estimation, providing solid theoretical foundations for method design. Particularly noteworthy are comparative analyses with directly related preference optimization methods such as Step-DPO and DPO.


Overall Assessment: This paper makes a notable contribution to optimizing the reasoning capabilities of large language models. By introducing non-human reasoning paths and a confidence-based optimization strategy, it opens a new research direction. The theoretical explanation and the scope of applicability leave room for improvement, but the method's practical value and novelty make it a meaningful advance.