2025-11-19T10:19:14.428770

Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization

Lu, Liu, Qu et al.

Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model's first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model's reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method using data generated by a small model achieves better performance in most cases than approaches using human-annotated data or data generated by a strong model.

Basic Information

Paper ID: 2510.11104
Title: Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization
Authors: Junjie Lu, Yuliang Liu, Chaofeng Qu, Wei Shen, Zhouhan Lin, Min Xu
Classification: cs.CL cs.AI
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.11104

Abstract

Current methods for enhancing large language model reasoning capabilities often introduce training biases toward human reasoning trajectories. In step-wise preference optimization in particular, dependence on annotations of intermediate steps from humans or high-capability models limits exploration of alternative non-human reasoning paths, thereby constraining achievable performance. In a small-scale pilot study, the authors observe that in approximately 75% of cases, the model's first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at the lowest-confidence point, before the error occurs, provides more accurate supervision than locating the first explicit error. The paper proposes Confidence-Guided Reasoning Path Preference Optimization (CGPO), which leverages confidence signals to identify points of maximum uncertainty in the model's reasoning process and applies self-generated non-human reasoning path guidance to mitigate trajectory drift.

Research Background and Motivation

Problem Definition

The core challenges faced by current methods for enhancing large language model reasoning capabilities are:

Human Bias Limitations: Existing methods over-rely on reasoning trajectories from humans or strong models, limiting exploration of non-human reasoning paths
Inaccurate Error Localization: Traditional methods supervise by locating the first explicit error, but this is often not the optimal intervention point
High Annotation Costs: Step-wise preference optimization requires extensive human or strong model annotations, making practical application costly

Research Motivation

Through analysis, the authors discovered that in approximately 75% of error cases, the model's first erroneous step occurs after its lowest confidence point. This observation inspired the idea of optimizing reasoning paths based on model confidence rather than human cognition.

Limitations of Existing Methods

Step-DPO and Similar Methods: Rely on human or strong model annotations to locate error steps, with high costs and limited exploration space
Traditional RLHF: Primarily focuses on outcome optimization with insufficient attention to intermediate steps in reasoning trajectories
Human Alignment Bias: Forcing models to follow human reasoning patterns may limit their potential capabilities

Core Contributions

Proposes CGPO Method: A confidence-guided reasoning path preference optimization method that does not require reliance on stronger models or human supervision
Non-Human Reasoning Path Exploration: Constructs preference learning data through the model's own confidence signals, exploring non-human reasoning paths
Multi-Domain Validation: Validates method effectiveness on mathematical reasoning and code generation tasks, demonstrating generalizability
Open-Source Contribution: Commits to releasing complete code repositories, datasets, and trained models to promote reproducibility

Method Details

Task Definition

Given an input problem x, the initial policy model π₀ generates a reasoning sequence y = (y₁, y₂, ..., y_T) of T tokens, where each token yₜ ∈ V (the vocabulary). At decoding timestep t, the model confidence is defined as the probability the model assigns to the token it emits:

cₜ ≜ p(yₜ | π₀, x, y_{<t})
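
As an illustration of this definition, the following minimal sketch computes cₜ as the softmax probability of the emitted token at each decoding step. The function name `token_confidences` and the plain-list representation of logits are assumptions made for this example, not part of the paper's code.

```python
import math

def token_confidences(step_logits, chosen_ids):
    """Return c_t = softmax(logits_t)[y_t] for every decoding step t.

    step_logits: one list of vocabulary logits per decoding step (assumed input format).
    chosen_ids:  index of the token actually emitted at each step.
    """
    confidences = []
    for logits, token_id in zip(step_logits, chosen_ids):
        # Subtract the max logit for numerical stability before exponentiating.
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        confidences.append(exps[token_id] / sum(exps))
    return confidences

# Toy example: a 3-token vocabulary and two decoding steps.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(token_confidences(toy_logits, chosen_ids=[0, 1]))
```

In the toy example the second step has a flat distribution, so its confidence is 1/3, while the first step is dominated by one logit and yields a higher confidence.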

Model Architecture

1. Reasoning Step Definition

Uses confidence threshold τ to segment reasoning steps, with τ determined based on the distribution of all confidence values in the dataset
Tokens with confidence below τ serve as segmentation points, reconstructing sequence y into step sequence s = (s₁, s₂, ..., sⱼ)
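
The segmentation rule above can be sketched as follows. This is a minimal illustration under one stated assumption: a token whose confidence falls below τ opens a new step and belongs to that step. The summary does not say whether the low-confidence token starts or ends a step, and `segment_steps` is a hypothetical helper name.

```python
def segment_steps(tokens, confidences, tau):
    """Split a token sequence into steps at low-confidence tokens."""
    steps = []
    current = []
    for token, conf in zip(tokens, confidences):
        # Assumption: a token with confidence < tau starts a new step.
        if conf < tau and current:
            steps.append(current)
            current = []
        current.append(token)
    if current:
        steps.append(current)
    return steps

tokens = ["x", "=", "3", "so", "y", "=", "7"]
confs = [0.99, 0.98, 0.40, 0.97, 0.95, 0.99, 0.30]
print(segment_steps(tokens, confs, tau=0.5))
# [['x', '='], ['3', 'so', 'y', '='], ['7']]
```

Under this convention every step except possibly the first begins at a point where the model was uncertain, which is where the branching described next takes place.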

2. Preference Pair Construction Process

Initial Trajectory Determination:

Selects the sequence before the most uncertain step as the shared initial reasoning trajectory sᵢₙᵢₜ

Chosen/Rejected Pair Construction:

Introduces reward model R to evaluate Top-k candidate tokens given (x, sᵢₙᵢₜ)
Selects the highest and lowest scoring tokens as starting tokens for chosen and rejected branches respectively
π₀ continues sampling from each starting token until it encounters an end-of-sequence token or a token with confidence below τ
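
The construction process above can be sketched in a few lines. The callables `top_k`, `reward`, and `rollout` are hypothetical stand-ins for the policy's Top-k proposal, the reward model R, and continued sampling from π₀; their signatures are assumptions for this illustration, as is the use of one confidence value per step (for example, the confidence of the step's first token).

```python
def build_preference_pair(x, steps, step_conf, top_k, reward, rollout):
    """Return (s_init, chosen, rejected), branching at the least confident step.

    Hypothetical callables (not taken from the paper's code):
      top_k(x, s_init)         -> list of candidate next tokens
      reward(x, s_init, token) -> float score from the reward model R
      rollout(x, prefix)       -> tokens sampled until end of sequence
                                  or a token with confidence below tau
    """
    # Everything before the most uncertain step is the shared prefix s_init.
    pivot = min(range(len(steps)), key=lambda j: step_conf[j])
    s_init = [tok for step in steps[:pivot] for tok in step]

    # Score the candidate first tokens and keep the best and the worst.
    scored = sorted(top_k(x, s_init), key=lambda tok: reward(x, s_init, tok))
    worst, best = scored[0], scored[-1]

    chosen = [best] + rollout(x, s_init + [best])
    rejected = [worst] + rollout(x, s_init + [worst])
    return s_init, chosen, rejected

# Toy stubs to show the data flow only.
scores = {"p": 0.1, "q": 0.9, "r": 0.5}
pair = build_preference_pair(
    x="problem",
    steps=[["a", "b"], ["c", "d"], ["e"]],
    step_conf=[0.9, 0.2, 0.4],
    top_k=lambda x, s: ["p", "q", "r"],
    reward=lambda x, s, tok: scores[tok],
    rollout=lambda x, prefix: ["!"],
)
print(pair)  # (['a', 'b'], ['q', '!'], ['p', '!'])
```

Both branches share the same prefix, so the resulting pair differs only from the point of maximal uncertainty onward.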

3. Training Objective

Employs a DPO-style objective function:

L_CGPO(θ) = -E_{(s_init,s+,s-)~D}[log σ(β(Δ))]

where:

Δ = Δ_θ - Δ_ref
Δ_θ ≜ log π_θ(s+ | s_init) - log π_θ(s- | s_init)
Δ_ref ≜ log π_ref(s+ | s_init) - log π_ref(s- | s_init)
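
For a single preference pair, the objective above reduces to a scalar computation on four log-probabilities. The sketch below is a minimal illustration; the function name and the default β = 0.3 are choices made for this example, and a real training loop would batch this and backpropagate through π_θ.

```python
import math

def cgpo_pair_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.3):
    """-log sigmoid(beta * (delta_theta - delta_ref)) for one (s_init, s+, s-) triple.

    logp_pos / logp_neg:         log pi_theta(s+ | s_init), log pi_theta(s- | s_init)
    ref_logp_pos / ref_logp_neg: the same quantities under the frozen reference pi_ref
    """
    delta_theta = logp_pos - logp_neg
    delta_ref = ref_logp_pos - ref_logp_neg
    margin = beta * (delta_theta - delta_ref)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(cgpo_pair_loss(-5.0, -6.0, -5.0, -6.0))
# Preferring s+ more strongly than the reference does lowers the loss.
print(cgpo_pair_loss(-4.0, -8.0, -5.0, -6.0))
```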

Technical Innovations

Confidence-Driven Step Segmentation: Breaks free from predefined anchor points, segmenting reasoning steps based on model's inherent uncertainty
Self-Supervised Preference Construction: Utilizes reward models to select optimal/suboptimal tokens at the most uncertain point without human annotation
Non-Human Reasoning Exploration: Allows models to explore reasoning paths that may not align with human cognitive habits but could be more effective

Experimental Setup

Datasets

Mathematical Reasoning Tasks:

Training data: 10,795 prompts from Step-DPO-10k dataset
Evaluation datasets: GSM8K, MATH, Omni-Math
Models: MetaMath-Mistral-7B, MetaMath-Llama-8B, Qwen2-7B-SFT, etc.

Code Generation Tasks:

Training data: 2,641 samples from LeetCodeDataset training set
Evaluation datasets: LiveCodeBench, LeetCodeDataset
Models: Deepseek-Coder-7B-Instruct-v1.5

Evaluation Metrics

Mathematical Reasoning: Exact match accuracy (final answer exactly matches standard answer)
Code Generation: Pass rate (generated code passes all test cases in sandbox environment)
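
Both metrics are simple ratios. The sketch below illustrates them under simplifying assumptions: answers are compared as whitespace-stripped strings (real evaluation harnesses usually normalize answers further), and sandbox execution is represented by precomputed booleans. The function names are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of problems whose final answer equals the reference answer."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

def pass_rate(test_results):
    """Fraction of problems whose generated code passes every test case.

    test_results: one list of booleans per problem, one entry per test case.
    A problem with no recorded test cases is counted as not solved.
    """
    solved = sum(bool(cases) and all(cases) for cases in test_results)
    return solved / len(test_results)

print(exact_match_accuracy(["42", " 7 "], ["42", "8"]))  # 0.5
print(pass_rate([[True, True], [True, False]]))          # 0.5
```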

Baseline Methods

Base Model: Original baseline model
Step-DPO: Step-wise preference optimization method based on human or strong-model annotations

Implementation Details

Confidence threshold: 2nd percentile of dataset confidence distribution
Top-k candidates: k=8
Training configuration: β=0.3-0.4, learning rate 5e-7, batch size 128, 4-8 training epochs
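
To make the threshold setting concrete, the sketch below derives τ as the 2nd percentile of pooled token confidences. It uses the nearest-rank percentile convention, which is an assumption of this example; the summary does not state which percentile definition the authors use, and `confidence_threshold` is a hypothetical helper.

```python
import math

def confidence_threshold(confidences, percentile=2.0):
    """Nearest-rank percentile of all token confidences in the dataset."""
    ordered = sorted(confidences)
    # Nearest-rank: the smallest value with at least `percentile` % of the data at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]

# Toy pool of 100 confidences: 0.01, 0.02, ..., 1.00.
pool = [i / 100 for i in range(1, 101)]
tau = confidence_threshold(pool, percentile=2.0)
print(tau)  # 0.02
```

With a percentile this low, only a small fraction of tokens fall below τ, so each trajectory is cut into relatively few, long steps.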

Experimental Results

Main Results

Mathematical Reasoning Task Performance:

GSM8K: CGPO outperforms Step-DPO on all models, with most significant improvement on MetaMath-Llama-8B (+4.3% vs base)
MATH: Outperforms Step-DPO on MetaMath-Llama-8B and Qwen2-7B-SFT
Key Finding: Even when Step-DPO shows performance decline (e.g., MetaMath-Mistral-7B), CGPO still provides improvements

Code Generation Task Performance:

LiveCodeBench: 2.1% relative improvement (19.3% → 19.7%, +0.4 points)
LeetCodeDataset: 4.0% relative improvement (12.7% → 13.2%, +0.5 points)
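
The improvement percentages above are consistent with relative gains over the base pass rate rather than absolute point differences. The short check below is illustrative; `gains` is a hypothetical helper. From the rounded pass rates the second figure comes out near 3.9%, so the reported 4.0% presumably reflects unrounded values.

```python
def gains(base, new):
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = new - base
    relative = 100.0 * absolute / base
    return round(absolute, 1), round(relative, 1)

print(gains(19.3, 19.7))  # LiveCodeBench:   (0.4, 2.1)
print(gains(12.7, 13.2))  # LeetCodeDataset: (0.5, 3.9)
```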

Ablation Studies

1. Scalability Analysis

Validates method scalability by increasing training data size (10k → 80k):

MetaMath-Llama-8B improves from 85.3% to 86.4% on GSM8K
Qwen2-7B-SFT improves from 88.6% to 89.5% on GSM8K
Demonstrates good data scalability of CGPO

2. Reward Model Impact

Compares two reward models: ASPRM and Math-Shepherd:

ASPRM performs better, but improvements are observed even with weaker Math-Shepherd
Validates importance of fine-grained token-level evaluation

3. Confidence Threshold Analysis

Higher thresholds generally yield performance improvements but excessively high thresholds result in overly short sequences
Different models have different optimal thresholds requiring task-specific tuning

Generalization Ability Verification

Performance on Omni-Math (Olympiad-level mathematical competition problems):

CGPO outperforms Step-DPO on 4/5 models
Demonstrates good out-of-distribution generalization capability

Case Analysis

Analysis of 200 error samples validates core hypothesis:

MetaMath-Llama-8B: 78% of errors occur after the lowest confidence point
Qwen2-7B-SFT: 72% of errors occur after the lowest confidence point
Supports the design philosophy of early intervention based on confidence
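
The statistic reported in this analysis can be computed with a few lines once each error sample has per-step confidences and the index of its first erroneous step. The sketch below is an illustration only: the helper names are hypothetical, the toy samples are made up to show the computation, and "after" is interpreted strictly (an error at the lowest-confidence step itself does not count).

```python
def error_follows_lowest_confidence(step_conf, first_error_step):
    """True if the first erroneous step comes strictly after the least confident step."""
    lowest = min(range(len(step_conf)), key=lambda j: step_conf[j])
    return first_error_step > lowest

def fraction_after(samples):
    """samples: list of (step_conf, first_error_step) pairs for incorrect solutions."""
    hits = sum(error_follows_lowest_confidence(conf, err) for conf, err in samples)
    return hits / len(samples)

# Toy data, not results from the paper.
toy_samples = [
    ([0.9, 0.2, 0.8], 2),  # lowest at step 1, error at step 2 -> counts
    ([0.3, 0.9, 0.8], 0),  # error at the lowest-confidence step -> does not count
    ([0.9, 0.8, 0.1], 1),  # error before the lowest-confidence step -> does not count
    ([0.5, 0.4, 0.9], 2),  # lowest at step 1, error at step 2 -> counts
]
print(fraction_after(toy_samples))  # 0.5
```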

Related Work

Preference Optimization Methods

PPO: High complexity but stable performance
DPO/SimPO: Direct optimization of paired preference signals with lower computational overhead
This Work's Contribution: Extends preference optimization to intermediate steps in reasoning paths

Confidence-Aware Methods

Direct Probability Method: Uses probability of predicted tokens (adopted in this work)
Generation Consistency Method: Measures confidence through answer consistency
This Work's Innovation: Applies confidence to step segmentation and optimization of reasoning paths

Reasoning Trajectory Optimization

Supervised Fine-Tuning: Direct alignment to annotated sequences
RLHF: Optimizes trajectories toward higher scores
This Work's Advantage: No strong model annotation required, explores non-human reasoning paths

Conclusions and Discussion

Main Conclusions

Value of Non-Human Reasoning Paths: Models can achieve better performance by exploring non-human reasoning paths
Effectiveness of Confidence Signals: Model confidence is an effective indicator for identifying reasoning difficulty points
Potential of Self-Supervised Learning: Effective reasoning capability enhancement is achievable without strong model or human annotations

Limitations

Computational Resource Constraints: Unable to verify scalability on larger models (e.g., 70B)
Domain Limitations: Primarily validated on mathematical and code domains; applicability to commonsense reasoning remains to be verified
Reward Model Dependency: Still requires domain-specific fine-grained reward models

Future Directions

Larger-Scale Validation: Verify method effectiveness on larger models and more domains
Universal Reward Models: Develop cross-domain universal fine-grained evaluation models
Theoretical Analysis: Deepen understanding of theoretical foundations for non-human reasoning paths

In-Depth Evaluation

Strengths

Deep Problem Insight: Identifies human bias problems in existing methods and proposes novel solutions
Clever Method Design: Combines confidence signals with preference optimization, optimizing reasoning paths without human or strong-model step annotations (a reward model is still required)
Comprehensive Experimental Validation: Multi-model, multi-task, multi-perspective experiments with convincing results
High Practical Value: Reduces dependence on strong model annotations while improving performance

Weaknesses

Insufficient Theoretical Foundation: Lacks deep theoretical explanation for why non-human reasoning paths are more effective
Limited Applicability Scope: Primarily validated on structured reasoning tasks; applicability to open-ended tasks remains unknown
Confidence Reliability: Model confidence itself may be unreliable, particularly on out-of-distribution data
Computational Overhead Analysis: Lacks detailed analysis of computational cost changes compared to baseline methods

Impact

Academic Value: Provides new research direction for reasoning capability optimization, potentially inspiring related work
Practical Value: Reduces annotation costs while improving performance, with important engineering applications
Reproducibility: Commits to open-sourcing complete code and data, facilitating method dissemination and improvement

Applicable Scenarios

Resource-Constrained Environments: Reasoning capability enhancement when strong model annotations are unavailable
Structured Reasoning Tasks: Mathematical, code, logical reasoning and other tasks with clear evaluation standards
Model Self-Improvement: As a technical component for continuous learning and self-optimization

References

The paper cites important works in reasoning optimization, preference learning, and confidence estimation, providing solid theoretical foundations for method design. Particularly noteworthy are comparative analyses with directly related preference optimization methods such as Step-DPO and DPO.

Overall Assessment: CGPO offers a concrete alternative to annotation-driven step-wise preference optimization: it locates intervention points through model confidence and builds preference pairs from self-generated, non-human reasoning paths. Its theoretical explanation and breadth of applicability leave room for improvement, but the reduced annotation cost and the reported gains make it a useful advance in optimizing the reasoning capabilities of large language models.