RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
Kattamuri, Fartale, Vats et al.
Data contamination poses a significant challenge to reliable LLM evaluation, where models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.
Data contamination poses a significant challenge to reliable large language model (LLM) evaluation, as models may achieve high performance through memorization of training data rather than demonstrating genuine reasoning capabilities. This paper proposes RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework leveraging mechanistic interpretability to detect contamination by distinguishing between recall-based and reasoning-based model responses. RADAR extracts 37 features encompassing surface-level confidence trajectories and deep mechanistic properties, including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble classifier trained on these features, RADAR achieves 93% accuracy on diverse evaluation sets, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous samples.
Data contamination in large language model evaluation refers to overlap between training and evaluation data, causing models to solve tasks through memorization rather than reasoning, thereby inflating evaluation metrics and obscuring true capabilities.
Evaluation Reliability: Data contamination severely compromises the credibility of model evaluation, making it impossible to accurately assess models' genuine reasoning abilities
Scientific Research Value: Distinguishing between memorization and reasoning is crucial for understanding models' cognitive mechanisms
Practical Application: In real-world deployment, it is essential to ensure models possess genuine reasoning capabilities rather than relying solely on memorization
This paper proposes analyzing the problem from the perspective of internal computational dynamics, leveraging mechanistic interpretability techniques to distinguish between recall and reasoning processes by analyzing attention, hidden states, and activation flows.
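As a rough illustration of the kind of quantity such an analysis can compute from attention maps, the sketch below measures the Shannon entropy of a single head's attention distribution. The function names and the toy matrices are assumptions made for illustration; the paper's actual feature definitions are not reproduced here, and the sketch makes no claim about which entropy range corresponds to recall or to reasoning.

```python
import math
from typing import List


def attention_entropy(weights: List[float]) -> float:
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_head_entropy(attn: List[List[float]]) -> float:
    """Average entropy over the query rows of one head's attention matrix."""
    if not attn:
        raise ValueError("attention matrix must have at least one row")
    return sum(attention_entropy(row) for row in attn) / len(attn)


# Toy attention matrices (rows = query positions, columns = key positions).
focused_head = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
diffuse_head = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

focused_entropy = mean_head_entropy(focused_head)
diffuse_entropy = mean_head_entropy(diffuse_head)
```

A head that concentrates its weight on one position yields a lower value than a head that spreads weight uniformly, whose entropy equals the logarithm of the number of key positions.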
Methodological Innovation: Proposes the RADAR framework, applying mechanistic interpretability to data contamination detection for the first time, distinguishing recall from reasoning through internal computational process analysis
Feature Engineering: Designs 37 features, including 17 surface features and 20 mechanistic features, comprehensively characterizing model internal processing
Performance Breakthrough: Achieves 93% accuracy on a diverse 100-sample test set, supporting the effectiveness of mechanistic features in distinguishing recall from reasoning
Practical Value: Provides a contamination detection tool without requiring training data access, with good interpretability and practicality
Theoretical Insights: Reveals different mechanistic signatures of recall and reasoning processes within models, providing new perspectives for understanding model cognitive processes
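To make the notion of a surface-level confidence trajectory more concrete, the following minimal sketch summarizes a sequence of confidence values with a few statistics. It assumes the trajectory is available as a plain list of per-step confidences; the feature names (`mean`, `std`, `slope`, `min`, `final`) are hypothetical and are not claimed to match any of the 17 surface features in the paper.

```python
from statistics import mean, pstdev
from typing import Dict, List


def confidence_trajectory_features(confidences: List[float]) -> Dict[str, float]:
    """Summarize a per-step confidence trajectory with simple statistics."""
    n = len(confidences)
    if n < 2:
        raise ValueError("need at least two confidence values")
    x_bar = (n - 1) / 2          # mean of the step indices 0..n-1
    y_bar = mean(confidences)
    # Ordinary least-squares slope of confidence against step index.
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(confidences))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return {
        "mean": y_bar,
        "std": pstdev(confidences),
        "slope": num / den,
        "min": min(confidences),
        "final": confidences[-1],
    }


# Toy trajectory in which confidence rises steadily across steps.
features = confidence_trajectory_features([0.2, 0.4, 0.6, 0.8])
```

A full feature vector in the spirit of the framework would concatenate statistics like these with mechanistic measurements before classification.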
Input: Given a prompt and corresponding model response
Output: Binary classification label determining whether the model response is based on recall or reasoning
Objective: Identify potential data contamination by analyzing model internal computational processes
Mechanistic Interpretability Application: First application of transformer circuit analysis to contamination detection, understanding model behavior from an internal computational perspective
Multi-level Feature Design: Combines surface trajectory features with deep mechanistic features, comprehensively characterizing model processing
Training Data Independence: Does not require access to original training data; contamination detection relies solely on analyzing model internal states
Enhanced Interpretability: Provides concrete feature explanations, clarifying why a response is classified as recall or reasoning
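The summary states that an ensemble classifier maps the extracted features to a recall or reasoning label, but it does not describe the ensemble's members or how their outputs are combined. The sketch below therefore assumes a simple soft-voting scheme; the scorers, their rules, and the feature names are placeholders invented for illustration, not the paper's classifiers.

```python
from typing import Callable, Dict, Sequence

FeatureVector = Dict[str, float]
Scorer = Callable[[FeatureVector], float]  # returns an estimated P(recall) in [0, 1]


def soft_vote(scorers: Sequence[Scorer], features: FeatureVector,
              threshold: float = 0.5) -> str:
    """Average the members' P(recall) estimates and threshold the mean."""
    if not scorers:
        raise ValueError("at least one scorer is required")
    avg = sum(scorer(features) for scorer in scorers) / len(scorers)
    return "recall" if avg >= threshold else "reasoning"


# Placeholder members with arbitrary rules (assumptions, not learned models).
def scorer_a(f: FeatureVector) -> float:
    return 0.9 if f["attention_entropy"] < 0.5 else 0.2


def scorer_b(f: FeatureVector) -> float:
    return 0.8 if f["confidence_mean"] > 0.7 else 0.3


def scorer_c(f: FeatureVector) -> float:
    return 0.6  # a constant, mildly recall-leaning member


members = [scorer_a, scorer_b, scorer_c]
label_1 = soft_vote(members, {"attention_entropy": 0.3, "confidence_mean": 0.9})
label_2 = soft_vote(members, {"attention_entropy": 1.2, "confidence_mean": 0.4})
```

In practice each member would be a trained model rather than a hand-written rule; the voting step itself stays the same.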
The paper reports RADAR's performance without a direct comparison against other contamination detection methods. The rationale given is that existing methods rely primarily on text similarity, whereas RADAR analyzes the model's internal mechanisms.
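For contrast, a text-similarity check of the kind attributed to existing methods can be sketched as an n-gram overlap score between a benchmark item and a reference text. This is a generic illustration rather than any specific published method, and it presupposes access to reference text, which RADAR is described as not needing.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate: str, reference: str, n: int = 3) -> float:
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0  # candidate shorter than n tokens
    return len(cand & ngrams(reference, n)) / len(cand)


benchmark_item = "the capital of France is Paris"
seen_text = "Everyone knows the capital of France is Paris"
unseen_text = "A completely unrelated sentence about cooking pasta tonight"

score_seen = ngram_overlap(benchmark_item, seen_text)
score_unseen = ngram_overlap(benchmark_item, unseen_text)
```

A high score suggests the item may appear verbatim in the reference text, but the check says nothing about how the model actually produced its answer.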
Mechanistic Feature Effectiveness: Mechanistic features effectively distinguish recall from reasoning processes, validating the value of internal computational analysis
Challenging Case Analysis: The 76.7% accuracy on challenging cases indicates room for improvement on ambiguous boundary cases, which typically involve a mismatch between surface form and internal processing
Feature Complementarity: The combination of surface and mechanistic features provides a more comprehensive analytical perspective
Interpretability Validation: Feature analysis results align with cognitive science theories regarding memory and reasoning
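The three reported accuracies can be checked against each other with a weighted average. The summary gives a 100-sample test set but not the number of clear and challenging cases, so the 70/30 split below is an assumption, chosen only because it reproduces the reported figures.

```python
def overall_accuracy(n_clear: int, acc_clear: float,
                     n_hard: int, acc_hard: float) -> float:
    """Accuracy over the union of two subsets, weighted by subset size."""
    return (n_clear * acc_clear + n_hard * acc_hard) / (n_clear + n_hard)


# Assumed split of the 100 test samples (not stated in the summary).
n_clear, n_hard = 70, 30
hard_correct = 23                 # 23 / 30 is about 76.7%
acc_hard = hard_correct / n_hard

combined = overall_accuracy(n_clear, 1.0, n_hard, acc_hard)
```

Under this assumed split, perfect accuracy on clear cases and 23 of 30 correct challenging cases give 93 correct answers out of 100.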
Theoretical Contribution: Reveals different computational signatures of recall and reasoning within models, providing new perspectives for understanding LLM cognitive mechanisms
Practical Value: RADAR provides a contamination detection tool that needs no access to training data and offers good interpretability
Method Generalizability: The framework is, in principle, applicable to other model architectures, although experiments so far have been conducted primarily on DialoGPT-medium
Scale Limitations: Current experiments primarily conducted on DialoGPT-medium; applicability to larger models requires verification
Dataset Scale: Training set contains only 30 samples, test set 100 samples; relatively small scale
Proxy Features: Some mechanistic features use proxy measures rather than direct computation (e.g., causal effects approximated through attention entropy)
Task Scope: Currently focuses on simple factual recall vs. logical reasoning; applicability to complex tasks requires further verification
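The effect of the small test set can be quantified with a confidence interval on the reported accuracy. The sketch below uses the Wilson score interval, which is a choice made here rather than something taken from the paper, and it assumes that 93% accuracy corresponds to 93 correct predictions out of the 100 test samples.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("require total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Assumption: 93% accuracy means 93 of 100 test samples were classified correctly.
low, high = wilson_interval(93, 100)
```

With only 100 samples the interval spans roughly ten percentage points, which is one way to express why the summary flags dataset scale as a limitation.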
Summary: RADAR represents significant progress in LLM contamination detection, providing new solutions through mechanistic interpretability. While there is room for improvement in experimental scale and theoretical analysis, its innovation and practical value make it an important contribution to the field. This work not only addresses practical problems but also provides new tools and perspectives for understanding LLM internal mechanisms.