Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective
You, Wang, Wang et al.
While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of the resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs' ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.
Although Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in Chinese-language contexts remains insufficient. To address this gap, the paper proposes the Chinese Commonsense Multi-hop Reasoning (CCMOR) benchmark, which evaluates LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. The authors first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-based pipeline that generates multi-hop questions from chains of factual units. To ensure dataset quality, they implement a human-in-the-loop validation system in which domain experts systematically verify and refine the generated questions. Evaluating state-of-the-art LLMs on CCMOR shows persistent limitations in handling long-tail knowledge and executing knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially alleviates these knowledge gaps, yielding significant performance improvements.
The core problem addressed by this research is how to comprehensively evaluate the capabilities of large language models on Chinese commonsense multi-hop reasoning tasks. Specifically, this includes:
Missing Chinese Reasoning Evaluation: Existing multi-hop reasoning datasets primarily focus on English, lacking systematic evaluation resources for the Chinese context
Insufficient Cultural Relevance: There is a need for evaluation benchmarks grounded in Chinese cultural knowledge, idioms, and logical reasoning patterns
Reasoning vs. Memorization: Distinguishing genuine reasoning ability from simple memorization
Technical Necessity: With the emergence of specialized reasoning models such as OpenAI-o1 and DeepSeek-R1, there is a need for dedicated evaluation tailored to Chinese scenarios
Practical Value: Chinese is one of the most widely spoken languages globally, making evaluation of Chinese reasoning capabilities of significant practical importance
Academic Gap: Filling the academic void in Chinese multi-hop reasoning evaluation
Data Sources: Existing Chinese factual QA datasets such as Chinese SimpleQA and CHARM-Memorization
Domain Classification: Using an LLM to reclassify questions into six major domains: Chinese Culture, Humanities and Social Sciences, Engineering and Technology, Life and Arts, Society, and Natural Sciences
Quality Control: Multiple LLMs evaluate the correctness and clarity of each QA pair
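The seed-set steps above (domain labeling, multi-LLM quality control, domain balancing) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each QA pair already carries a domain label and one boolean verdict per judge LLM, stored in the hypothetical fields `domain` and `judge_votes`. The strict-majority rule, the `per_domain` quota, and both function names are assumptions made for illustration; the summary does not say how judge verdicts are aggregated or how balancing is done.

```python
import random
from collections import defaultdict

# The six domains named in the summary above.
DOMAINS = [
    "Chinese Culture",
    "Humanities and Social Sciences",
    "Engineering and Technology",
    "Life and Arts",
    "Society",
    "Natural Sciences",
]


def passes_quality_check(votes):
    """Assumed rule: keep a QA pair only if a strict majority of judges approve."""
    return sum(votes) * 2 > len(votes)


def build_seed_set(items, per_domain, seed=0):
    """Drop low-quality pairs, then sample up to `per_domain` pairs per domain."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_domain = defaultdict(list)
    for item in items:
        # Hypothetical fields: "domain" (LLM-assigned label) and
        # "judge_votes" (one boolean verdict per judge LLM).
        if item["domain"] in DOMAINS and passes_quality_check(item["judge_votes"]):
            by_domain[item["domain"]].append(item)

    seed_set = []
    for domain in DOMAINS:
        pool = by_domain[domain]
        # Cap every domain at the same quota so no domain dominates.
        seed_set.extend(rng.sample(pool, min(per_domain, len(pool))))
    return seed_set
```

Capping every domain at the same quota is one simple way to obtain a domain-balanced set. A domain with fewer surviving pairs than the quota simply contributes all of them.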
The paper cites multiple important related works, including:
HotpotQA (Yang et al., 2018): Foundational work in multi-hop reasoning
Chinese SimpleQA (He et al., 2024): High-quality Chinese factual QA benchmark
MoreHopQA (Schnitzler et al., 2024): Partial inspiration for this paper's methodology
CHARM (Sun et al., 2024): Related work in Chinese commonsense reasoning
Overall Assessment: This is a high-quality research paper that fills an important gap in Chinese multi-hop reasoning evaluation. The paper employs rigorous methodology, comprehensive experiments, and in-depth analysis, contributing significantly to advancing Chinese NLP and reasoning research. While there are some limitations in dataset scale and evaluation methodology, its contributions are substantial, establishing an important foundation for the field's development.