Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

You, Wang, Wang et al.

While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of the resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs' ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.

Basic Information

Paper ID: 2510.08800
Title: Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective
Authors: Wangjie You, Xusheng Wang, Xing Wang, Wenxiang Jiao, Chao Feng, Juntao Li, Min Zhang
Classification: cs.CL cs.AI
Publication Date: October 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.08800
Institutions: ByteDance Douyin Content Group, School of Computer Science and Technology, Soochow University

Abstract

Although Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, comprehensive evaluation in the Chinese context remains insufficient. To address this gap, this paper proposes the Chinese Commonsense Multi-hop Reasoning (CCMOR) benchmark, aimed at evaluating LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, the authors first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-based pipeline to generate multi-hop questions based on chains of factual units. To ensure dataset quality, a human-in-the-loop validation system is implemented, with domain experts systematically verifying and refining generated questions. Using CCMOR to evaluate state-of-the-art LLMs, results demonstrate that LLMs exhibit persistent limitations in handling long-tail knowledge and executing knowledge-intensive reasoning. Notably, retrieval-augmented generation significantly alleviates these knowledge gaps, yielding substantial performance improvements.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is how to comprehensively evaluate the capabilities of large language models on Chinese commonsense multi-hop reasoning tasks. Specifically, it involves three sub-problems:

Missing Chinese Reasoning Evaluation: Existing multi-hop reasoning datasets primarily focus on English, lacking systematic evaluation resources for the Chinese context
Insufficient Cultural Relevance: There is a need for evaluation benchmarks grounded in Chinese cultural knowledge, idioms, and logical reasoning patterns
Reasoning vs. Memorization: Distinguishing genuine reasoning ability from simple memorization capability

Research Significance

Technical Necessity: With the emergence of specialized reasoning models such as OpenAI-o1 and DeepSeek-R1, there is a need for specialized evaluation tailored to Chinese scenarios
Practical Value: Chinese is one of the most widely spoken languages globally, making evaluation of Chinese reasoning capabilities of significant practical importance
Academic Gap: Filling the academic void in Chinese multi-hop reasoning evaluation

Limitations of Existing Approaches

Language Limitations: HotpotQA, WikiHop, DROP, and similar benchmarks primarily focus on English
Poor Cultural Adaptability: Directly translated datasets fail to reflect Chinese-specific cultural and reasoning patterns
Quality Control Challenges: Constructing high-quality Chinese multi-hop reasoning datasets faces challenges in accuracy, consistency, and clarity

Core Contributions

Proposing the CCMOR Benchmark: The first comprehensive evaluation benchmark specifically designed for Chinese commonsense multi-hop reasoning
Innovative Data Construction Method: Develops an LLM-based automated pipeline combined with a human-in-the-loop validation system
Comprehensive Experimental Evaluation: Systematic evaluation of state-of-the-art LLMs, revealing their limitations in knowledge-intensive reasoning
In-depth Analytical Insights: Provides detailed analysis regarding different reasoning styles, prompting strategies, and RAG effectiveness

Methodology Details

Task Definition

CCMOR frames the evaluation task as follows:

Input: Chinese multi-hop reasoning questions requiring integration of multiple facts for reasoning
Output: Final answer along with optional intermediate reasoning steps
Constraints: Questions must be based on verifiable fact chains, with unique and specific answers

Data Construction Pipeline

Step 1: Seed Data Sampling

Data Sources: Existing Chinese factual QA datasets such as Chinese SimpleQA and CHARM-Memorization
Domain Classification: An LLM reclassifies questions into six major domains: Chinese Culture, Humanities and Social Sciences, Engineering and Technology, Life and Arts, Society, and Natural Sciences
Quality Control: Multiple LLMs evaluate the correctness and clarity of each QA pair
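
The domain-balanced sampling in Step 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_balanced_seeds`, the `per_domain` parameter, and the dict layout of each QA pair are all assumptions, and the domain label is assumed to have been assigned already by the LLM reclassification step.

```python
import random
from collections import defaultdict

# The six domains named in the summary above.
DOMAINS = [
    "Chinese Culture",
    "Humanities and Social Sciences",
    "Engineering and Technology",
    "Life and Arts",
    "Society",
    "Natural Sciences",
]

def sample_balanced_seeds(qa_pairs, per_domain, seed=0):
    """Draw up to `per_domain` QA pairs from each domain (hypothetical helper).

    qa_pairs: list of dicts that each carry a 'domain' key.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_domain = defaultdict(list)
    for qa in qa_pairs:
        by_domain[qa["domain"]].append(qa)

    seeds = []
    for domain in DOMAINS:
        pool = by_domain.get(domain, [])
        # Never ask for more items than the domain actually has.
        k = min(per_domain, len(pool))
        seeds.extend(rng.sample(pool, k))
    return seeds
```

Capping each domain at the same quota is one simple way to keep the seed set balanced; the paper may use a different scheme.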

Step 2: Recursive Sub-question Generation

Anchored Facts: Using the answer from the previous layer as an anchored fact to generate subsequent questions

Recursive Expansion: At each layer $\ell \in [1, N]$, generate $n$ new QA pairs for each QA pair:

$$QA^{\ell} = \bigcup_{i \in QA^{\ell-1}} \left\{ (q^{\ell}_{i,1}, a^{\ell}_{i,1}), \dots, (q^{\ell}_{i,n}, a^{\ell}_{i,n}) \right\}$$

Diversity Assurance: Alternating use of different LLMs reduces model-specific bias
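
The recursive expansion in Step 2 can be pictured as growing a tree whose children ask about their parent's answer. The sketch below makes several assumptions: `QANode`, `expand`, and `stub_generator` are hypothetical names, the generators are stand-ins for real LLM calls, and alternating generators once per layer is only one possible reading of "alternating use of different LLMs".

```python
from dataclasses import dataclass, field
from itertools import cycle

@dataclass
class QANode:
    question: str
    answer: str
    children: list = field(default_factory=list)

def expand(seed, generators, depth, n):
    """Grow a QA tree for `depth` layers with `n` children per node.

    generators: callables (anchor_answer, n) -> list of (question, answer);
    each one stands in for a different LLM (assumption).
    """
    gen_cycle = cycle(generators)
    frontier = [seed]
    for _ in range(depth):
        generate = next(gen_cycle)  # switch generator at every layer
        next_frontier = []
        for node in frontier:
            # The parent's answer is the anchored fact for the new questions.
            for q, a in generate(node.answer, n):
                child = QANode(q, a)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return seed

def stub_generator(tag):
    """Placeholder generator that fabricates labelled strings, not real facts."""
    def generate(anchor, n):
        return [(f"[{tag}] question {k} about {anchor}", f"{anchor}/{tag}{k}")
                for k in range(n)]
    return generate
```

With `depth` layers and `n` children per node, the last layer holds `n ** depth` QA pairs per seed, which is why later filtering matters.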

Step 3: Multi-hop Question Composition

Path Sampling: Sample all valid paths of length L from the tree structure
Question Composition: Combine independent QA pairs into coherent multi-hop questions
Quality Assessment: Evaluate global answer uniqueness, sequence consistency, and harmlessness
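
Path sampling in Step 3 amounts to enumerating chains of linked QA pairs in the tree. The following sketch assumes a tree encoded as nested `(question, answer, children)` tuples and assumes that a valid path may start at any node and only moves downward; the summary does not specify either detail. The LLM-based composition and quality assessment are not shown.

```python
def downward_paths(node, length):
    """Paths with exactly `length` QA pairs that start at `node` and go down."""
    question, answer, children = node
    if length == 1:
        return [[(question, answer)]]
    paths = []
    for child in children:
        for tail in downward_paths(child, length - 1):
            paths.append([(question, answer)] + tail)
    return paths

def all_paths(root, length):
    """Collect downward paths of `length` QA pairs starting at every node."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        found.extend(downward_paths(node, length))
        stack.extend(node[2])  # visit the children as possible start points
    return found

# Toy tree with placeholder strings instead of real questions.
tree = ("q1", "a1", [
    ("q2", "a2", [("q3", "a3", []), ("q3b", "a3b", [])]),
    ("q2b", "a2b", []),
])
three_hop_chains = all_paths(tree, 3)
```

Each returned chain is a list of QA pairs in hop order, ready to be handed to a composition step that rewrites it as one multi-hop question.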

Quality Control Mechanisms

LLM Verification Standards

Answerability and Verifiability: Questions must have concrete, finite, verifiable answer sets
Specificity and Determinism: Questions should clearly target specific facts or relationships
Temporal and Factual Stability: Answers must be objective, time-invariant facts

Human-in-the-Loop Validation

Professional Annotators: Independent review by domain experts
Multi-round Verification: Each instance independently reviewed by two annotators, with disagreements resolved by a third party
Authoritative Verification: All facts verified against authoritative sources

Experimental Setup

Dataset Scale

3-hop Questions: 480 (filtered from 1,000 initial samples)
6-hop Questions: 166 (filtered from 1,000 initial samples)
Average Length: 39.19 characters for 3-hop questions, 68.51 characters for 6-hop questions
Domain Coverage: Average 1.65 domains (3-hop) and 2.26 domains (6-hop)
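
The retention rates implied by these counts follow from simple arithmetic. The snippet below only restates the figures listed above; the variable names are illustrative.

```python
# Counts taken from the Dataset Scale figures above.
initial = {"3-hop": 1000, "6-hop": 1000}
kept = {"3-hop": 480, "6-hop": 166}

# Share of initial samples that survived filtering, per hop setting.
retention = {name: kept[name] / initial[name] for name in kept}
total_kept = sum(kept.values())
overall_retention = total_kept / sum(initial.values())

for name in kept:
    print(f"{name}: kept {kept[name]} of {initial[name]} ({retention[name]:.1%})")
print(f"total: {total_kept} questions ({overall_retention:.1%} overall)")
```

Roughly half of the 3-hop candidates survive filtering, against about one in six of the 6-hop candidates, which suggests that longer chains are harder to keep unambiguous and verifiable.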

Evaluation Metrics

ROUGE-L Recall: Measures lexical-level overlap
LLM-as-Judge Accuracy: Uses three independent judge models for semantic-level evaluation with majority voting
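
Both metrics can be sketched in a few lines. ROUGE-L recall is the length of the longest common subsequence (LCS) between reference and prediction divided by the reference length. Character-level tokenization is an assumption made here because Chinese text has no whitespace word boundaries; the paper's tokenization is not stated in this summary. The function names are hypothetical, and `majority_vote` assumes boolean verdicts from an odd number of judges.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(reference, prediction):
    """LCS(reference, prediction) / len(reference), computed over characters."""
    ref, pred = list(reference), list(prediction)
    if not ref:
        return 0.0
    return lcs_length(ref, pred) / len(ref)

def majority_vote(verdicts):
    """True when more than half of the judge verdicts are True."""
    return sum(verdicts) > len(verdicts) / 2
```

Recall rewards a long answer that merely contains the reference string, which is one reason a semantic judge is a useful complement to the lexical score.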

Evaluation Settings

Step-by-step QA (SQA): Decompose multi-hop questions into sub-questions, answering sequentially
Overall Answer (OA): Directly answer complete multi-hop questions
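
The difference between the two settings can be illustrated with a toy model. Everything below is hypothetical: `toy_model` is a lookup table standing in for an LLM, and feeding the previous answer into the next sub-question through a `{prev}` placeholder is an assumption about how SQA chains its steps, not a detail confirmed by this summary.

```python
# A lookup table that only knows single-hop facts (stand-in for an LLM).
FACTS = {
    "Who wrote 'Quiet Night Thought'?": "Li Bai",
    "In which dynasty did Li Bai live?": "Tang",
}

def toy_model(question):
    return FACTS.get(question, "unknown")

def overall_answer(model, multi_hop_question):
    """OA: pose the composed multi-hop question in one shot."""
    return model(multi_hop_question)

def step_by_step_answer(model, sub_questions):
    """SQA: answer sub-questions in order, passing each answer forward.

    A sub-question may contain '{prev}', which is replaced by the
    previous step's answer (assumed chaining scheme).
    """
    prev = ""
    for template in sub_questions:
        prev = model(template.format(prev=prev))
    return prev

sqa = step_by_step_answer(
    toy_model,
    ["Who wrote 'Quiet Night Thought'?", "In which dynasty did {prev} live?"],
)
oa = overall_answer(
    toy_model,
    "In which dynasty did the author of 'Quiet Night Thought' live?",
)
```

Here the toy model succeeds in the SQA setting but fails in the OA setting because it cannot chain the two facts on its own, which mirrors the kind of gap the two settings are designed to expose.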

Comparison Models

System-1 Style: Qwen2.5/3 series, LLaMA3, GPT-4 series, Gemini-2.5, etc.
System-2 Style: DeepSeek-R1, OpenAI-o1, Qwen-QwQ, and other models with long-chain reasoning

Experimental Results

Main Results

Overall Performance: Even top-tier models achieve average multi-hop accuracy below 75%, demonstrating the benchmark's challenge level
System-2 Advantage: Models with deep reasoning capabilities significantly outperform System-1 models in OA settings
Hop Count Impact: Performance significantly decreases with increasing reasoning hops
SQA vs OA Gap: All models show persistent performance gaps between SQA and OA, indicating that comprehensive reasoning remains challenging

Specific Performance Data

Best Model: Gemini-2.5-Pro achieves 73.61% average accuracy
Chinese Advantage: Chinese community models such as Yi-lightning, GLM-4, and Doubao show outstanding performance in certain settings
Closed-source vs Open-source: Closed-source models generally outperform open-source models

Domain Analysis

Easiest Domain: Natural Sciences with average score of 83.93
Most Difficult Domain: Life and Arts with average score of 66.61
Chinese Culture: Chinese community models perform better in the Chinese Culture domain

RAG Effectiveness

Significant Improvement: RAG brings average accuracy improvement of 9.5 percentage points
Model Differences: Doubao shows the largest improvement, while Kimi and Wenxin show limited improvement
Multi-round Retrieval: Models supporting multi-round retrieval show advantages in multi-hop reasoning

Related Work

Multi-hop Reasoning Benchmarks

English Benchmarks: HotpotQA, 2WikiMultiHopQA, MuSiQue, etc., established the foundation
Recent Developments: MoreHopQA, Multihop-RAG, and others leverage LLMs to construct higher-quality questions
Chinese Gap: NLPCC-MH, CoreCode, CHARM, and similar initial efforts exist, but they lack systematic, verifiable multi-step reasoning

Chinese Commonsense Benchmarks

Development Trajectory: From translating English benchmarks to native Chinese evaluation
Representative Works: C3, CMQA, Chinese SimpleQA, etc.
Limitations: Primarily focus on single-hop factual questions, lacking multi-hop reasoning evaluation

Conclusions and Discussion

Main Conclusions

Performance Limitations: Current state-of-the-art LLMs exhibit significant limitations in Chinese multi-hop reasoning
Importance of Reasoning Style: System-2 style deep reasoning is crucial for multi-hop reasoning
RAG Effectiveness: Retrieval-augmented generation significantly improves knowledge-intensive reasoning
Domain Differences: Fact-centric domains are relatively easier, while procedural or abstract reasoning is more challenging

Limitations

LLM Dependency: The data construction process relies on LLM generation, potentially introducing hallucinations or biases
Evaluation Method: LLM-as-Judge evaluation may be influenced by model-specific preferences
Coverage Scope: Focuses on textual commonsense knowledge, not covering multimodal reasoning

Future Directions

Multimodal Extension: Extend the benchmark to multimodal reasoning tasks
Interactive Reasoning: Incorporate reasoning scenarios requiring multi-turn interaction
Specialized Reasoning: Develop specialized models for reasoning

In-depth Evaluation

Strengths

Filling Important Gap: The first systematic Chinese multi-hop reasoning benchmark with significant academic and practical value
Methodological Innovation: LLM-driven data construction pipeline combined with human-in-the-loop validation ensures data quality
Comprehensive Evaluation: Systematic evaluation covering multiple model types, reasoning styles, and enhancement techniques
In-depth Analysis: Provides rich analytical dimensions including domains, reasoning styles, and prompting strategies
High Quality Control: Rigorous quality control standards and multi-round verification mechanisms

Shortcomings

Scale Limitation: Relatively small dataset size (646 questions) may impact evaluation comprehensiveness
Construction Cost: Human-in-the-loop construction approach is costly and difficult to scale
Evaluation Dependency: Over-reliance on LLM-as-Judge may introduce evaluation bias
Domain Balance: While pursuing domain balance, certain domains may still have insufficient samples

Impact

Academic Contribution: Provides important evaluation resources for the Chinese NLP field
Practical Value: Offers direct guidance for development and evaluation of Chinese LLMs
Methodological Inspiration: Data construction methodology provides reference value for similar benchmarks in other languages
Reproducibility: Detailed methodology description and promised data release ensure reproducibility

Applicable Scenarios

Model Evaluation: Evaluating reasoning capabilities of Chinese LLMs
Model Development: Guiding improvements in reasoning capabilities
Application Deployment: Providing performance reference for Chinese applications requiring complex reasoning
Research Benchmark: Serving as a standard evaluation benchmark for Chinese reasoning research

References

The paper cites multiple important related works, including:

HotpotQA (Yang et al., 2018): Foundational work in multi-hop reasoning
Chinese SimpleQA (He et al., 2024): High-quality Chinese factual QA benchmark
MoreHopQA (Schnitzler et al., 2024): Partial inspiration for this paper's methodology
CHARM (Sun et al., 2024): Related work in Chinese commonsense reasoning

Overall Assessment: This is a high-quality research paper that fills an important gap in Chinese multi-hop reasoning evaluation. The paper employs rigorous methodology, comprehensive experiments, and in-depth analysis, contributing significantly to advancing Chinese NLP and reasoning research. While there are some limitations in dataset scale and evaluation methodology, its contributions are substantial, establishing an important foundation for the field's development.