Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

You, Wang, Wang et al.
While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of the resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs' ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.

Basic Information

  • Paper ID: 2510.08800
  • Title: Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective
  • Authors: Wangjie You, Xusheng Wang, Xing Wang, Wenxiang Jiao, Chao Feng, Juntao Li, Min Zhang
  • Classification: cs.CL cs.AI
  • Publication Date: October 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.08800
  • Institutions: ByteDance Douyin Content Group, School of Computer Science and Technology, Soochow University

Abstract

Although Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, comprehensive evaluation in the Chinese context remains insufficient. To address this gap, this paper proposes the Chinese Commonsense Multi-hop Reasoning (CCMOR) benchmark, aimed at evaluating LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, the authors first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-based pipeline to generate multi-hop questions based on chains of factual units. To ensure dataset quality, a human-in-the-loop validation system is implemented, with domain experts systematically verifying and refining generated questions. Using CCMOR to evaluate state-of-the-art LLMs, results demonstrate that LLMs exhibit persistent limitations in handling long-tail knowledge and executing knowledge-intensive reasoning. Notably, retrieval-augmented generation significantly alleviates these knowledge gaps, yielding substantial performance improvements.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is how to comprehensively evaluate the capabilities of large language models on Chinese commonsense multi-hop reasoning tasks. Specifically, this involves three issues:

  1. Missing Chinese Reasoning Evaluation: Existing multi-hop reasoning datasets primarily focus on English, lacking systematic evaluation resources for the Chinese context
  2. Insufficient Cultural Relevance: There is a need for evaluation benchmarks grounded in Chinese cultural knowledge, idioms, and logical reasoning patterns
  3. Reasoning vs. Memorization: Distinguishing genuine reasoning ability from simple memorization capability

Research Significance

  1. Technical Necessity: With the emergence of specialized reasoning models such as OpenAI-o1 and DeepSeek-R1, there is a need for specialized evaluation tailored to Chinese scenarios
  2. Practical Value: Chinese is one of the most widely spoken languages globally, making evaluation of Chinese reasoning capabilities of significant practical importance
  3. Academic Gap: Filling the academic void in Chinese multi-hop reasoning evaluation

Limitations of Existing Approaches

  1. Language Limitations: HotpotQA, WikiHop, DROP, and similar benchmarks primarily focus on English
  2. Poor Cultural Adaptability: Directly translated datasets fail to reflect Chinese-specific cultural and reasoning patterns
  3. Quality Control Challenges: Constructing high-quality Chinese multi-hop reasoning datasets faces challenges in accuracy, consistency, and clarity

Core Contributions

  1. Proposing the CCMOR Benchmark: The first comprehensive evaluation benchmark specifically designed for Chinese commonsense multi-hop reasoning
  2. Innovative Data Construction Method: Develops an LLM-based automated pipeline combined with a human-in-the-loop validation system
  3. Comprehensive Experimental Evaluation: Systematic evaluation of state-of-the-art LLMs, revealing their limitations in knowledge-intensive reasoning
  4. In-depth Analytical Insights: Provides detailed analysis regarding different reasoning styles, prompting strategies, and RAG effectiveness

Methodology Details

Task Definition

CCMOR aims to evaluate LLMs' capabilities in the following aspects:

  • Input: Chinese multi-hop reasoning questions requiring integration of multiple facts for reasoning
  • Output: Final answer along with optional intermediate reasoning steps
  • Constraints: Questions must be based on verifiable fact chains, with unique and specific answers

Data Construction Pipeline

Step 1: Seed Data Sampling

  • Data Sources: Existing Chinese factual QA datasets such as Chinese SimpleQA and CHARM-Memorization
  • Domain Classification: An LLM reclassifies the questions into six major domains: Chinese Culture, Humanities and Social Sciences, Engineering and Technology, Life and Arts, Society, and Natural Sciences
  • Quality Control: Multiple LLMs evaluate the correctness and clarity of each QA pair
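
The domain-balanced sampling in Step 1 can be illustrated with a minimal sketch. The function name `sample_balanced_seeds`, the `domain` field, and the per-domain quota are hypothetical choices made for illustration; the summary does not state how many seeds are drawn per domain or how the authors implement the sampling.

```python
import random
from collections import defaultdict

def sample_balanced_seeds(qa_pairs, per_domain, seed=0):
    """Draw up to `per_domain` QA pairs from each domain label.

    Assumption: every item is a dict carrying a "domain" key that an
    LLM classifier has already filled in.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for item in qa_pairs:
        by_domain[item["domain"]].append(item)

    seeds = []
    for domain in sorted(by_domain):          # fixed order for reproducibility
        pool = by_domain[domain]
        k = min(per_domain, len(pool))        # small domains contribute all they have
        seeds.extend(rng.sample(pool, k))
    return seeds

if __name__ == "__main__":
    toy = [
        {"q": f"question {i}", "a": f"answer {i}", "domain": d}
        for d in ("Chinese Culture", "Natural Sciences")
        for i in range(5)
    ]
    print(len(sample_balanced_seeds(toy, per_domain=3)))
```

Capping each domain at the same quota keeps a large source domain from dominating the seed set, which is one plausible reading of "domain-balanced".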

Step 2: Recursive Sub-question Generation

  • Anchored Facts: Using the answer from the previous layer as an anchored fact to generate subsequent questions
  • Recursive Expansion: At each layer $\ell \in [1, N]$, generate $n$ new QA pairs for each QA pair:
    $\mathrm{QA}^{\ell} = \bigcup_{i \in \mathrm{QA}^{\ell-1}} \{(q^{\ell}_{i,1}, a^{\ell}_{i,1}), \ldots, (q^{\ell}_{i,n}, a^{\ell}_{i,n})\}$
  • Diversity Assurance: Alternating use of different LLMs reduces model-specific bias
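
A rough sketch of the recursive expansion in Step 2 follows. The generator functions here are stubs standing in for LLM calls, and switching generators once per layer is an assumption; the summary only says that different LLMs are used alternately. All names (`expand_tree`, `stub_generator`) are hypothetical.

```python
from itertools import cycle

def expand_tree(seed_qa, generators, num_layers, branching):
    """Return a list of layers; layers[l] holds the QA nodes created at layer l.

    Each new question is anchored on the answer of its parent node.
    """
    layers = [[{"q": seed_qa[0], "a": seed_qa[1], "parent": None}]]
    gen_cycle = cycle(generators)
    for _ in range(num_layers):
        generate = next(gen_cycle)            # assumption: alternate per layer
        new_layer = []
        for parent in layers[-1]:
            for q, a in generate(parent["a"], branching):
                new_layer.append({"q": q, "a": a, "parent": parent})
        layers.append(new_layer)
    return layers

def stub_generator(tag):
    """Stand-in for an LLM call: returns n placeholder QA pairs about `anchor`."""
    def generate(anchor, n):
        return [
            (f"[{tag}] question {i} about {anchor}", f"{anchor}/{tag}{i}")
            for i in range(n)
        ]
    return generate

if __name__ == "__main__":
    layers = expand_tree(
        ("seed question", "seed answer"),
        [stub_generator("A"), stub_generator("B")],
        num_layers=2,
        branching=2,
    )
    print([len(layer) for layer in layers])   # layer sizes grow as n ** l
```

With branching factor n, layer ℓ contains n^ℓ nodes, so the tree grows quickly and later filtering becomes necessary.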

Step 3: Multi-hop Question Composition

  • Path Sampling: Sample all valid paths of length L from the tree structure
  • Question Composition: Combine independent QA pairs into coherent multi-hop questions
  • Quality Assessment: Evaluate global answer uniqueness, sequence consistency, and harmlessness
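
The path sampling in Step 3 can be sketched as a tree traversal. This example assumes a path is a downward chain of L consecutive QA nodes that may start at any node; whether the authors restrict paths to start at the seed is not stated here. The tree representation and the helper names are hypothetical, and the final composition into a single fluent question (done by an LLM in the paper) is not shown.

```python
def node(q, a, children=()):
    """Build a QA tree node; children are questions anchored on this answer."""
    return {"q": q, "a": a, "children": list(children)}

def paths_of_length(root, length):
    """Yield every downward chain of `length` consecutive nodes in the tree."""
    def chains_from(start, remaining):
        if remaining == 1:
            yield [start]
            return
        for child in start["children"]:
            for tail in chains_from(child, remaining - 1):
                yield [start] + tail

    stack = [root]
    while stack:                      # visit every node as a possible start
        current = stack.pop()
        yield from chains_from(current, length)
        stack.extend(current["children"])

if __name__ == "__main__":
    tree = node("q0", "a0", [
        node("q1", "a1", [node("q3", "a3"), node("q4", "a4")]),
        node("q2", "a2", [node("q5", "a5"), node("q6", "a6")]),
    ])
    for path in paths_of_length(tree, 3):
        print(" -> ".join(hop["q"] for hop in path))
```

Each yielded path is an ordered list of hops, which is the input a composition step would need to merge into one multi-hop question.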

Quality Control Mechanisms

LLM Verification Standards

  1. Answerability and Verifiability: Questions must have concrete, finite, verifiable answer sets
  2. Specificity and Determinism: Questions should clearly target specific facts or relationships
  3. Temporal and Factual Stability: Answers must be objective, time-invariant facts

Human-in-the-Loop Validation

  • Professional Annotators: Independent review by domain experts
  • Multi-round Verification: Each instance independently reviewed by two annotators, with disagreements resolved by a third party
  • Authoritative Verification: All facts verified against authoritative sources
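
The two-annotator review with third-party resolution can be expressed as a small adjudication rule. This is only a sketch of the protocol as summarized above; the label values and the `adjudicate` helper are hypothetical, and the actual annotation tooling is not described.

```python
def adjudicate(label_a, label_b, third_party=None):
    """Combine two independent annotator labels into a final decision.

    If the annotators agree, their label stands. Otherwise `third_party`
    (a callable returning a label) is consulted to break the tie.
    """
    if label_a == label_b:
        return label_a
    if third_party is None:
        raise ValueError("disagreement requires a third-party decision")
    return third_party()

if __name__ == "__main__":
    print(adjudicate("valid", "valid"))
    print(adjudicate("valid", "invalid", third_party=lambda: "invalid"))
```

Passing the third party as a callable means the extra review is only requested when the first two annotators actually disagree.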

Experimental Setup

Dataset Scale

  • 3-hop Questions: 480 (filtered from 1,000 initial samples)
  • 6-hop Questions: 166 (filtered from 1,000 initial samples)
  • Average Length: 39.19 characters for 3-hop questions, 68.51 characters for 6-hop questions
  • Domain Coverage: Average 1.65 domains (3-hop) and 2.26 domains (6-hop)
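
The retention rates implied by these counts can be checked with a few lines of arithmetic. The sketch assumes, as the bullets above suggest, that each hop setting started from its own pool of 1,000 samples.

```python
initial = 1000                         # initial samples per hop setting
kept = {"3-hop": 480, "6-hop": 166}    # questions surviving filtering

retention = {name: count / initial for name, count in kept.items()}
total = sum(kept.values())

for name, rate in retention.items():
    print(f"{name}: {rate:.1%} retained")   # 48.0% and 16.6%
print(f"total questions: {total}")          # 646
```

The much lower retention for 6-hop questions is consistent with longer chains having more opportunities to fail a quality check.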

Evaluation Metrics

  1. ROUGE-L Recall: Measures lexical-level overlap
  2. LLM-as-Judge Accuracy: Uses three independent judge models for semantic-level evaluation with majority voting
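
Both metrics can be sketched in a few lines. ROUGE-L recall is the length of the longest common subsequence (LCS) between reference and prediction divided by the reference length. Tokenizing at the character level is an assumption made here because Chinese text has no whitespace word boundaries; the paper's exact tokenization and judge prompts are not given in this summary, and the function names are hypothetical.

```python
from collections import Counter

def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(reference, candidate):
    """LCS length divided by reference length (character-level tokens)."""
    ref, cand = list(reference), list(candidate)
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)

def majority_vote(verdicts):
    """Return the most common verdict among the judges."""
    return Counter(verdicts).most_common(1)[0][0]

def judge_accuracy(per_question_verdicts):
    """Fraction of questions whose majority verdict is True."""
    votes = [majority_vote(v) for v in per_question_verdicts]
    return sum(votes) / len(votes)

if __name__ == "__main__":
    print(rouge_l_recall("北京", "答案是北京"))                      # 1.0
    print(judge_accuracy([[True, True, False], [False, False, True]]))  # 0.5
```

Recall rewards a prediction for containing the reference answer even when it is wrapped in extra text, which is why a semantic judge is a useful complement: a long wrong answer can still overlap lexically with the reference.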

Evaluation Settings

  1. Step-by-step QA (SQA): Decompose multi-hop questions into sub-questions, answering sequentially
  2. Overall Answer (OA): Directly answer complete multi-hop questions

Comparison Models

  • System-1 Style: Qwen2.5/3 series, LLaMA3, GPT-4 series, Gemini-2.5, etc.
  • System-2 Style: DeepSeek-R1, OpenAI-o1, Qwen-QwQ, and other models with long-chain reasoning

Experimental Results

Main Results

  1. Overall Performance: Even top-tier models achieve average multi-hop accuracy below 75%, demonstrating the benchmark's challenge level
  2. System-2 Advantage: Models with deep reasoning capabilities significantly outperform System-1 models in OA settings
  3. Hop Count Impact: Performance significantly decreases with increasing reasoning hops
  4. SQA vs OA Gap: All models show persistent performance gaps between SQA and OA, indicating that comprehensive reasoning remains challenging

Specific Performance Data

  • Best Model: Gemini-2.5-Pro achieves 73.61% average accuracy
  • Chinese Advantage: Chinese community models such as Yi-lightning, GLM-4, and Doubao show outstanding performance in certain settings
  • Closed-source vs Open-source: Closed-source models generally outperform open-source models

Domain Analysis

  • Easiest Domain: Natural Sciences with average score of 83.93
  • Most Difficult Domain: Life and Arts with average score of 66.61
  • Chinese Culture: Chinese community models perform better in the Chinese Culture domain

RAG Effectiveness

  • Significant Improvement: RAG improves average accuracy by 9.5 percentage points
  • Model Differences: Doubao shows the largest improvement, while Kimi and Wenxin show limited improvement
  • Multi-round Retrieval: Models supporting multi-round retrieval show advantages in multi-hop reasoning

Related Work

Multi-hop Reasoning Benchmarks

  • English Benchmarks: HotpotQA, 2WikiMultiHopQA, MuSiQue, etc., established the foundation
  • Recent Developments: MoreHopQA, Multihop-RAG, and others leverage LLMs to construct higher-quality questions
  • Chinese Gap: NLPCC-MH, CoreCode, CHARM, and other initial efforts exist, but they lack systematic, verifiable multi-step reasoning

Chinese Commonsense Benchmarks

  • Development Trajectory: From translating English benchmarks to native Chinese evaluation
  • Representative Works: C3, CMQA, Chinese SimpleQA, etc.
  • Limitations: Primarily focus on single-hop factual questions, lacking multi-hop reasoning evaluation

Conclusions and Discussion

Main Conclusions

  1. Performance Limitations: Current state-of-the-art LLMs exhibit significant limitations in Chinese multi-hop reasoning
  2. Importance of Reasoning Style: System-2 style deep reasoning is crucial for multi-hop reasoning
  3. RAG Effectiveness: Retrieval-augmented generation significantly improves knowledge-intensive reasoning
  4. Domain Differences: Fact-centric domains are relatively easier, while procedural or abstract reasoning is more challenging

Limitations

  1. LLM Dependency: The data construction process relies on LLM generation, potentially introducing hallucinations or biases
  2. Evaluation Method: LLM-as-Judge evaluation may be influenced by model-specific preferences
  3. Coverage Scope: Focuses on textual commonsense knowledge, not covering multimodal reasoning

Future Directions

  1. Multimodal Extension: Extend the benchmark to multimodal reasoning tasks
  2. Interactive Reasoning: Incorporate reasoning scenarios requiring multi-turn interaction
  3. Specialized Reasoning: Develop specialized models for reasoning

In-depth Evaluation

Strengths

  1. Filling Important Gap: The first systematic Chinese multi-hop reasoning benchmark with significant academic and practical value
  2. Methodological Innovation: LLM-driven data construction pipeline combined with human-in-the-loop validation ensures data quality
  3. Comprehensive Evaluation: Systematic evaluation covering multiple model types, reasoning styles, and enhancement techniques
  4. In-depth Analysis: Provides rich analytical dimensions including domains, reasoning styles, and prompting strategies
  5. High Quality Control: Rigorous quality control standards and multi-round verification mechanisms

Shortcomings

  1. Scale Limitation: Relatively small dataset size (646 questions) may impact evaluation comprehensiveness
  2. Construction Cost: Human-in-the-loop construction approach is costly and difficult to scale
  3. Evaluation Dependency: Over-reliance on LLM-as-Judge may introduce evaluation bias
  4. Domain Balance: While pursuing domain balance, certain domains may still have insufficient samples

Impact

  1. Academic Contribution: Provides important evaluation resources for the Chinese NLP field
  2. Practical Value: Offers direct guidance for development and evaluation of Chinese LLMs
  3. Methodological Inspiration: Data construction methodology provides reference value for similar benchmarks in other languages
  4. Reproducibility: Detailed methodology description and promised data release support reproducibility

Applicable Scenarios

  1. Model Evaluation: Evaluating reasoning capabilities of Chinese LLMs
  2. Model Development: Guiding improvements in reasoning capabilities
  3. Application Deployment: Providing performance reference for Chinese applications requiring complex reasoning
  4. Research Benchmark: Serving as a standard evaluation benchmark for Chinese reasoning research

References

The paper cites multiple important related works, including:

  • HotpotQA (Yang et al., 2018): Foundational work in multi-hop reasoning
  • Chinese SimpleQA (He et al., 2024): High-quality Chinese factual QA benchmark
  • MoreHopQA (Schnitzler et al., 2024): Partial inspiration for this paper's methodology
  • CHARM (Sun et al., 2024): Related work in Chinese commonsense reasoning

Overall Assessment: This is a high-quality research paper that fills an important gap in Chinese multi-hop reasoning evaluation. The paper employs rigorous methodology, comprehensive experiments, and in-depth analysis, contributing significantly to advancing Chinese NLP and reasoning research. While there are some limitations in dataset scale and evaluation methodology, its contributions are substantial, establishing an important foundation for the field's development.