RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
Kattamuri, Fartale, Vats et al.
Data contamination poses a significant challenge to reliable LLM evaluation, because models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties, including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.
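The pipeline described above (features extracted per response, then an ensemble vote on recall vs. reasoning) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the three features, the hand-set thresholds, the function names, and the premise that uniformly high, flat token confidence suggests recall. The actual framework uses 37 features and trained classifiers; this toy version uses three surface-level confidence features and fixed rules in their place.

```python
from statistics import mean


def confidence_features(token_probs):
    """Summarize a per-token confidence trajectory (hypothetical features)."""
    n = len(token_probs)
    avg = mean(token_probs)
    # Least-squares slope of confidence against token position.
    x_bar = (n - 1) / 2
    denom = sum((x - x_bar) ** 2 for x in range(n))
    cov = sum((x - x_bar) * (p - avg) for x, p in enumerate(token_probs))
    slope = cov / denom if denom else 0.0
    return {"mean_conf": avg, "min_conf": min(token_probs), "slope": slope}


# Stand-ins for trained ensemble members: each maps a feature dict to a label.
# The thresholds are illustrative assumptions, not values from the paper.
VOTERS = [
    lambda f: "recall" if f["mean_conf"] > 0.9 else "reasoning",
    lambda f: "recall" if f["min_conf"] > 0.8 else "reasoning",
    lambda f: "recall" if abs(f["slope"]) < 0.05 else "reasoning",
]


def classify(token_probs, voters=VOTERS):
    """Majority vote over the ensemble members; ties go to 'reasoning'."""
    features = confidence_features(token_probs)
    votes = [vote(features) for vote in voters]
    recall_votes = sum(v == "recall" for v in votes)
    return "recall" if 2 * recall_votes > len(votes) else "reasoning"
```

Under these assumptions, a flat high-confidence trajectory such as classify([0.97, 0.98, 0.99, 0.98]) is labeled "recall", while a rising trajectory such as classify([0.40, 0.55, 0.70, 0.90]) is labeled "reasoning". A faithful implementation would replace the fixed rules with classifiers fitted on labeled examples and add the mechanistic features the abstract lists.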
academic
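Data contamination poses a major challenge to reliable evaluation of large language models (LLMs): a model may score highly by memorizing training data rather than by demonstrating genuine reasoning ability. This paper proposes RADAR (Recall vs. Reasoning Detection through Activation Representation), a new framework that uses mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features covering surface-level confidence trajectories and deep mechanistic properties, including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR reaches 93% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous examples.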