Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization
Lu, Liu, Qu et al.
Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model's first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model's reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method trained on data generated by a small model outperforms, in most cases, approaches that use data generated by a strong model or annotated by humans.
Current methods for enhancing the reasoning capabilities of large language models often bias training toward human reasoning trajectories. In step-wise preference optimization in particular, dependence on intermediate-step annotations from humans or high-capability models limits exploration of alternative, non-human reasoning paths and thereby constrains achievable performance. In a small-scale pilot study, the authors observe that in approximately 75% of cases the model's first erroneous step occurs after its lowest-confidence point. This suggests that guiding the model at the lowest-confidence point, before the error occurs, provides more accurate supervision than locating the first explicit error. The paper proposes Confidence-Guided Reasoning Path Preference Optimization (CGPO), which uses a confidence signal to identify the points of maximum uncertainty in the model's reasoning process and applies self-generated, non-human reasoning-path guidance to mitigate trajectory drift.
The core challenges faced by current methods for enhancing large language model reasoning capabilities are:
Human Bias Limitations: Existing methods over-rely on reasoning trajectories from humans or strong models, limiting exploration of non-human reasoning paths
Inaccurate Error Localization: Traditional methods supervise by locating the first explicit error, but this is often not the optimal intervention point
High Annotation Costs: Step-wise preference optimization requires extensive human or strong model annotations, making practical application costly
Through analysis, the authors discovered that in approximately 75% of error cases, the model's first erroneous step occurs after its lowest confidence point. This observation inspired the idea of optimizing reasoning paths based on model confidence rather than human cognition.
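The pilot statistic above can be pictured with a small sketch. It assumes each failed trajectory is summarized by a list of per-position confidences and the index of its first erroneous position; the helper names and the toy data are hypothetical illustrations, not the authors' pilot data or code.

```python
def lowest_confidence_index(confidences):
    """Return the index of the minimum confidence (first one on ties)."""
    return min(range(len(confidences)), key=confidences.__getitem__)


def fraction_error_after_low_point(cases):
    """Share of cases whose first error comes strictly after the low point.

    `cases` is an iterable of (confidences, first_error_index) pairs.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(
        1 for confidences, first_error in cases
        if first_error > lowest_confidence_index(confidences)
    )
    return hits / len(cases)


# Toy data, invented purely for illustration (assumption, not from the paper).
toy_cases = [
    ([0.9, 0.4, 0.8, 0.7], 2),  # low point at 1, first error at 2
    ([0.9, 0.8, 0.3, 0.6], 3),  # low point at 2, first error at 3
    ([0.5, 0.9, 0.8, 0.7], 2),  # low point at 0, first error at 2
    ([0.9, 0.8, 0.7, 0.2], 1),  # low point at 3, first error at 1
]
fraction = fraction_error_after_low_point(toy_cases)
```

On this toy set, three of the four trajectories have their first error after the lowest-confidence position.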
Proposes the CGPO Method: A confidence-guided reasoning path preference optimization method that does not rely on stronger models or human supervision
Non-Human Reasoning Path Exploration: Constructs preference learning data through the model's own confidence signals, exploring non-human reasoning paths
Multi-Domain Validation: Validates method effectiveness on mathematical reasoning and code generation tasks, demonstrating generalizability
Open-Source Contribution: Commits to releasing complete code repositories, datasets, and trained models to promote reproducibility
Given an input problem x, the initial policy model π₀ generates a reasoning sequence y = (y₁, y₂, ..., yₙ), where each token yₜ ∈ V (the vocabulary) for 1 ≤ t ≤ n. At decoding timestep t, model confidence is defined as:
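The defining formula is not reproduced in this summary. As a rough illustration only, the sketch below assumes one common choice of confidence signal: the probability that the policy assigns to its most likely next token at each decoding step. That choice, and the function names, are assumptions and may differ from the paper's definition.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one step's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidence(logits):
    """Assumed confidence: the largest next-token probability at this step."""
    return max(softmax(logits))


def sequence_confidences(step_logits):
    """One confidence value per decoding step of a generated sequence."""
    return [token_confidence(logits) for logits in step_logits]


# Two steps over a two-token vocabulary: one undecided, one near-certain.
example = sequence_confidences([[0.0, 0.0], [10.0, 0.0]])
```

A step with evenly split logits gets confidence 0.5 here, while a sharply peaked step approaches 1.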
Confidence-Driven Step Segmentation: Breaks free from predefined anchor points, segmenting reasoning steps based on the model's inherent uncertainty
Self-Supervised Preference Construction: Uses a reward model to select the preferred and dispreferred tokens at the most uncertain point, without human annotation
Non-Human Reasoning Exploration: Allows models to explore reasoning paths that may not align with human cognitive habits but could be more effective
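The three points above can be sketched as a single data-construction step: split the trajectory at its lowest-confidence position, propose alternative tokens there, and let a reward signal pick the preferred and dispreferred continuation. This is a minimal sketch under stated assumptions, not the authors' implementation; `propose_tokens` and `reward_fn` are hypothetical stand-ins for sampling from the policy and for querying a reward model.

```python
def build_preference_pair(tokens, confidences, propose_tokens, reward_fn):
    """Build one (prefix, chosen, rejected) record at the most uncertain position.

    propose_tokens(prefix) -> list of candidate next tokens   (hypothetical)
    reward_fn(prefix, token) -> float score for a candidate   (hypothetical)
    """
    if not tokens or len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must be non-empty and aligned")
    # The position of maximal uncertainty marks the segmentation point.
    t_star = min(range(len(confidences)), key=confidences.__getitem__)
    prefix = list(tokens[:t_star])
    candidates = propose_tokens(prefix)
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    # Rank candidates by reward; no human label is consulted.
    ranked = sorted(candidates, key=lambda tok: reward_fn(prefix, tok))
    return {
        "position": t_star,
        "prefix": prefix,
        "chosen": ranked[-1],
        "rejected": ranked[0],
    }


# Toy usage with invented values (assumptions for illustration only).
pair = build_preference_pair(
    tokens=["2", "+", "2", "=", "5"],
    confidences=[0.9, 0.95, 0.9, 0.8, 0.4],
    propose_tokens=lambda prefix: ["4", "5", "3"],
    reward_fn=lambda prefix, tok: 1.0 if tok == "4" else 0.0,
)
```

In this toy run the split lands on the final token, and the record pairs "4" (chosen) against "5" (rejected). A record of this shape could then feed a DPO-style preference objective.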
The paper cites important works in reasoning optimization, preference learning, and confidence estimation, providing solid theoretical foundations for method design. Particularly noteworthy are comparative analyses with directly related preference optimization methods such as Step-DPO and DPO.
Overall Assessment: This is a notable contribution to research on improving the reasoning capabilities of large language models. By introducing non-human reasoning paths and a confidence-based optimization strategy, it opens new research directions. While the theoretical explanation and the scope of applicability leave room for improvement, the method's practical value and novelty make it a meaningful advance.