RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation

Kattamuri, Fartale, Vats et al.

Data contamination poses a significant challenge to reliable LLM evaluation, where models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.

Basic Information

Paper ID: 2510.08931
Title: RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
Authors: Ashish Kattamuri (Proofpoint), Harshwardhan Fartale (Indian Institute of Science), Arpita Vats (LinkedIn), Rahul Raja (LinkedIn), Ishita Prasad (Meta FAIR)
Classification: cs.AI, cs.LG
Publication Date: October 10, 2025 (Preprint)
Paper Link: https://arxiv.org/abs/2510.08931v1

Abstract

Data contamination poses a significant challenge to reliable large language model (LLM) evaluation, as models may achieve high performance through memorization of training data rather than demonstrating genuine reasoning capabilities. This paper proposes RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework leveraging mechanistic interpretability to detect contamination by distinguishing between recall-based and reasoning-based model responses. RADAR extracts 37 features encompassing surface-level confidence trajectories and deep mechanistic properties, including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble classifier trained on these features, RADAR achieves 93% accuracy on diverse evaluation sets, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous samples.

Research Background and Motivation

Problem Definition

Data contamination in large language model evaluation refers to overlap between training and evaluation data, causing models to solve tasks through memorization rather than reasoning, thereby inflating evaluation metrics and obscuring true capabilities.

Problem Significance

Evaluation Reliability: Data contamination severely compromises the credibility of model evaluation, making it impossible to accurately assess models' genuine reasoning abilities
Scientific Research Value: Distinguishing between memorization and reasoning is crucial for understanding models' cognitive mechanisms
Practical Application: In real-world deployment, it is essential to ensure models possess genuine reasoning capabilities rather than relying solely on memorization

Limitations of Existing Methods

Traditional detection methods primarily include:

Comparing evaluation data with training corpora
Checking n-gram overlap
Flagging verbatim outputs

These methods have the following limitations:

Require access to training data
Cannot handle paraphrased contamination
Cannot reveal whether models solve tasks through recall or reasoning
Focus only on surface-level similarity
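
To make the paraphrase limitation concrete, here is a minimal sketch of a word-level n-gram overlap check. The function names, the trigram size, and the example strings are illustrative assumptions, not taken from the paper.

```python
def ngrams(text, n=3):
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(eval_text, corpus_text, n=3):
    """Fraction of the evaluation text's n-grams that also occur in the corpus."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(corpus_text, n)) / len(eval_grams)


# Hypothetical corpus snippet and two evaluation items.
corpus = "the capital of france is paris and it lies on the seine"
verbatim = "the capital of france is paris"
paraphrase = "paris serves as the french capital city"

print(overlap_ratio(verbatim, corpus))    # every trigram appears in the corpus
print(overlap_ratio(paraphrase, corpus))  # no trigram is shared
```

The paraphrased item carries the same fact yet shares no trigram with the corpus, so a check of this kind would not flag it. The check also needs the corpus text itself, which is often unavailable.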

Research Motivation

This paper proposes analyzing the problem from the perspective of internal computational dynamics, leveraging mechanistic interpretability techniques to distinguish between recall and reasoning processes by analyzing attention, hidden states, and activation flows.

Core Contributions

Methodological Innovation: Proposes the RADAR framework, applying mechanistic interpretability to data contamination detection for the first time, distinguishing recall from reasoning through internal computational process analysis
Feature Engineering: Designs 37 features, including 17 surface features and 20 mechanistic features, comprehensively characterizing model internal processing
Performance Breakthrough: Achieves 93% accuracy on diverse evaluation sets, demonstrating the effectiveness of mechanistic features in distinguishing recall from reasoning
Practical Value: Provides a contamination detection tool without requiring training data access, with good interpretability and practicality
Theoretical Insights: Reveals different mechanistic signatures of recall and reasoning processes within models, providing new perspectives for understanding model cognitive processes

Methodology Details

Task Definition

Input: Given a prompt and corresponding model response
Output: Binary classification label determining whether the model response is based on recall or reasoning
Objective: Identify potential data contamination by analyzing model internal computational processes

Model Architecture

The RADAR framework comprises three core components:

1. Mechanistic Analyzer

Interfaces with target LLM, configured to output attention weights and hidden states
Analyzes attention patterns across all heads and layers
Computes entropy and specialization metrics
Examines hidden state dynamics, including variance, norm, and effective rank
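
As a rough illustration of the entropy and specialization metrics, the sketch below computes a mean attention entropy per head and counts low-entropy heads. The array layout, the entropy threshold of 0.5, and the synthetic weights are assumptions made for this example; the paper's exact definitions may differ.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of the attention rows of each head.

    attn: array of shape (layers, heads, query_len, key_len) whose last
    axis sums to 1, as softmax attention weights do.
    Returns an array of shape (layers, heads).
    """
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return row_entropy.mean(axis=-1)


def count_specialized_heads(attn, threshold=0.5):
    """Count heads whose mean entropy falls below `threshold` (a hypothetical cutoff)."""
    return int((attention_entropy(attn) < threshold).sum())


# Synthetic weights: 1 layer, 2 heads, 2 query positions, 4 key positions.
focused = np.array([[0.97, 0.01, 0.01, 0.01]] * 2)  # nearly one-hot attention
diffuse = np.full((2, 4), 0.25)                     # uniform attention
attn = np.stack([focused, diffuse])[None, ...]      # shape (1, 2, 2, 4)

entropy = attention_entropy(attn)
print(entropy.round(3))
print(count_specialized_heads(attn))
```

In this toy input the focused head has much lower entropy than the uniform head, so only the focused head is counted as specialized.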

2. Feature Extraction

Extracts 37 features divided into two categories:

Surface Features (17):

Confidence statistics: mean, standard deviation, maximum, minimum, range
Convergence properties: convergence layer, convergence speed, confidence slope
Entropy measures: average entropy, entropy change, information gain
Stability indicators: prediction stability, layer consistency
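
A few of the surface features can be sketched from a per-layer confidence trajectory. The sketch assumes that the trajectory holds one confidence value per layer and that convergence means first reaching a threshold of 0.9; both are assumptions, and the two trajectories are invented for illustration.

```python
import statistics


def surface_features(confidences, threshold=0.9):
    """Summarize a per-layer confidence trajectory (needs at least 2 layers)."""
    n = len(confidences)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(confidences)
    # Least-squares slope of confidence against layer index.
    slope = (
        sum((i - mean_x) * (c - mean_y) for i, c in enumerate(confidences))
        / sum((i - mean_x) ** 2 for i in range(n))
    )
    # First layer at which confidence reaches the threshold, or None.
    convergence_layer = next(
        (i for i, c in enumerate(confidences) if c >= threshold), None
    )
    return {
        "mean": mean_y,
        "std": statistics.pstdev(confidences),
        "max": max(confidences),
        "min": min(confidences),
        "range": max(confidences) - min(confidences),
        "slope": slope,
        "convergence_layer": convergence_layer,
    }


# Invented trajectories: one saturates early, one builds up gradually.
recall_like = [0.2, 0.95, 0.97, 0.98, 0.99, 0.99]
reasoning_like = [0.1, 0.2, 0.35, 0.5, 0.7, 0.92]

print(surface_features(recall_like)["convergence_layer"])
print(surface_features(reasoning_like)["convergence_layer"])
```

The early-saturating trajectory crosses the threshold at a much shallower layer than the gradual one, which is the kind of contrast the convergence features are meant to capture.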

Mechanistic Features (20):

Attention specialization: number of specialized heads, specialization score, attention entropy
Circuit dynamics: circuit depth, complexity, activation flow variance
Intervention sensitivity: ablation robustness, critical component count
Working memory: hidden state variance, norm trajectory
Causal effects: logit attribution, mediation score
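
The working-memory features can be illustrated with simple hidden-state statistics. The sketch below uses one common definition of effective rank (the exponential of the entropy of the normalized singular values); whether RADAR uses this exact definition is an assumption, and the hidden states are synthetic.

```python
import numpy as np


def effective_rank(hidden, eps=1e-12):
    """Exponential of the entropy of the normalized singular values.

    hidden: array of shape (tokens, hidden_dim) from a single layer.
    """
    s = np.linalg.svd(hidden, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))


def hidden_state_summary(layers):
    """Per-layer mean L2 norm, variance, and effective rank."""
    return {
        "norm_trajectory": [float(np.linalg.norm(h, axis=-1).mean()) for h in layers],
        "variance": [float(h.var()) for h in layers],
        "effective_rank": [effective_rank(h) for h in layers],
    }


rng = np.random.default_rng(0)
rank_one = np.outer(np.ones(4), rng.normal(size=8))  # every token shares one direction
full_rank = rng.normal(size=(4, 8))                  # tokens point in different directions

summary = hidden_state_summary([rank_one, full_rank])
print([round(r, 2) for r in summary["effective_rank"]])
```

A layer whose token representations collapse onto one direction has an effective rank close to 1, while more varied representations score higher, up to the number of tokens.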

3. Classification System

Employs an ensemble of four supervised learning models:

Random Forest
Gradient Boosting
Support Vector Machine (SVM)
Logistic Regression

Ensemble Strategy:

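$$\hat{y} = \mathbb{1}\left[\frac{1}{M}\sum_{j=1}^{M}\hat{y}_j > \frac{1}{2}\right]$$
$$\bar{p} = \frac{1}{M}\sum_{j=1}^{M} p_j$$

Confidence Calculation:

$$\mathrm{conf} = \begin{cases} \bar{p}, & \text{if } \hat{y} = 1 \text{ (recall)} \\ 1 - \bar{p}, & \text{if } \hat{y} = 0 \text{ (reasoning)} \end{cases}$$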
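
The majority vote and averaged probability defined above can be sketched in a few lines. The sketch assumes that each classifier's probability refers to the recall class, which the confidence rule implies but the text does not state; the vote and probability values are invented.

```python
def ensemble_predict(votes, probs):
    """Majority vote with averaged probability.

    votes: per-classifier hard labels (1 = recall, 0 = reasoning).
    probs: per-classifier probability of the recall class (an assumption).
    Returns (label, confidence).
    """
    m = len(votes)
    y_hat = 1 if sum(votes) / m > 0.5 else 0
    p_bar = sum(probs) / m
    conf = p_bar if y_hat == 1 else 1 - p_bar
    return y_hat, conf


# Three of four classifiers vote recall.
print(ensemble_predict([1, 1, 1, 0], [0.9, 0.8, 0.7, 0.4]))
# A 2-2 tie does not exceed 1/2, so the label falls back to reasoning.
print(ensemble_predict([1, 1, 0, 0], [0.6, 0.55, 0.4, 0.3]))
```

Because the inequality is strict and the ensemble has four members, a tied vote is labeled as reasoning.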

Technical Innovations

Mechanistic Interpretability Application: First application of transformer circuit analysis to contamination detection, understanding model behavior from internal computational perspective
Multi-level Feature Design: Combines surface trajectory features with deep mechanistic features, comprehensively characterizing model processing
Training Data Independence: Does not require access to original training data; contamination detection relies solely on analyzing model internal states
Enhanced Interpretability: Provides concrete feature explanations, clarifying why a response is classified as recall or reasoning

Experimental Setup

Datasets

Training Set:

Total samples: 30 (15 recall, 15 reasoning)
Foundation for training classifiers

Test Set:

Total samples: 100
Clear recall: 20
Clear reasoning: 20
Challenging cases: 30
Complex reasoning: 30

Sample Examples:

Category	Example Prompt	Label
Clear recall	"The capital of France is"	recall
Clear reasoning	"If X is the capital of France, then X is"	reasoning
Challenging case	"What is the sum of 10 and 15?"	reasoning
Complex reasoning	"If a store has 100 items and sells 30% of them, how many items remain?"	reasoning

Evaluation Metrics

Overall Accuracy: Classification accuracy across all samples
Class-wise Accuracy: Separate accuracy for recall and reasoning tasks
Category-wise Accuracy: Accuracy across different difficulty categories
Cross-validation Accuracy: k-fold cross-validation results during training

Comparison Methods

The paper reports RADAR's performance without a direct comparison to other contamination detection methods. The stated rationale is that existing methods rely on text similarity, whereas RADAR analyzes the model's internal mechanisms.

Implementation Details

Target Model: microsoft/DialoGPT-medium
Configuration: output_attentions=True, output_hidden_states=True
Feature Normalization: StandardScaler for zero-mean unit-variance normalization
Training Strategy: k-fold cross-validation ensures robust performance estimation
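
The normalization and cross-validation steps can be sketched with NumPy alone. The `standardize` helper applies the same zero-mean, unit-variance transform that scikit-learn's StandardScaler performs, fitted on the training fold only. The choice of k = 5, the helper names, and the random placeholder data are assumptions; the source does not state the value of k.

```python
import numpy as np


def standardize(train, test, eps=1e-12):
    """Scale both splits with the training split's mean and standard deviation."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std < eps, 1.0, std)  # leave constant features unscaled
    return (train - mean) / std, (test - mean) / std


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]


# Placeholder data with the shapes described in the text: 30 samples, 37 features.
X = np.random.default_rng(1).normal(loc=5.0, scale=3.0, size=(30, 37))

splits = list(kfold_indices(len(X), k=5))
train_idx, val_idx = splits[0]
X_train, X_val = standardize(X[train_idx], X[val_idx])
print(X_train.shape, X_val.shape)
```

Fitting the scaling statistics on the training fold alone keeps validation samples from leaking into the normalization, which matters with a training set this small.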

Experimental Results

Main Results

Overall Performance:

Overall Accuracy: 93.0%
Recall Task Accuracy: 97.7%
Reasoning Task Accuracy: 89.3%
Training Cross-validation Accuracy: 96.7%

Category-wise Performance:

Category	Accuracy
Clear recall	100% (20/20)
Clear reasoning	100% (20/20)
Challenging cases	76.7% (23/30)
Complex reasoning	100% (30/30)
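
As a quick consistency check, the category counts in the table above can be aggregated to recover the reported overall accuracy. The snippet only re-adds numbers already given.

```python
# (correct, total) per category, copied from the table above.
categories = {
    "Clear recall": (20, 20),
    "Clear reasoning": (20, 20),
    "Challenging cases": (23, 30),
    "Complex reasoning": (30, 30),
}

correct = sum(c for c, _ in categories.values())
total = sum(t for _, t in categories.values())
overall = correct / total

for name, (c, t) in categories.items():
    print(f"{name}: {c / t:.1%}")
print(f"Overall: {correct}/{total} = {overall:.1%}")
```

All seven errors fall in the challenging category, so the overall figure is driven entirely by the ambiguous cases.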

Feature Analysis

Key Discriminative Features:

Specialized Attention Heads: Higher in recall tasks
Circuit Complexity: Higher in reasoning tasks
Confidence Convergence Pattern: Faster convergence in recall tasks

Recall Detection Score (RDS):

Average RDS for recall tasks: 0.933
Average RDS for reasoning tasks: 0.375
Demonstrates clear separability

Mechanistic Signature Differences:

Recall Process: Focused attention patterns, rapid confidence convergence, specialized head activation
Reasoning Process: Distributed attention, progressive confidence building, higher activation flow variance

Experimental Findings

Mechanistic Feature Effectiveness: Mechanistic features effectively distinguish recall from reasoning processes, validating the value of internal computational analysis
Challenging Case Analysis: 76.7% accuracy indicates room for improvement in ambiguous boundary cases, typically involving mismatches between surface form and internal processing
Feature Complementarity: Combination of surface and mechanistic features provides more comprehensive analytical perspective
Interpretability Validation: Feature analysis results align with cognitive science theories regarding memory and reasoning

Related Work

Data Contamination Detection

Traditional Methods: Based on n-gram overlap, text similarity comparison
Representative Work: Carlini et al. (2021) training data extraction methods
Limitations: Depend on training data access, cannot handle paraphrased contamination

Mechanistic Interpretability

Transformer Circuits: Mathematical framework by Elhage et al. (2021)
Attention Analysis: Circuit visualization methods by Olah et al. (2020)
Paper Contribution: First application of mechanistic analysis to contamination detection

LLM Evaluation

Memory vs. Reasoning: Theoretical analysis by Feldman (2020) on learning and memorization
Evaluation Reliability: Time travel detection methods by Golchin and Surdeanu (2023)
Paper Advantage: Provides evaluation methods from internal mechanistic perspective

Conclusions and Discussion

Main Conclusions

Technical Feasibility: Mechanistic interpretability features can separate recall-style from reasoning-style responses; RADAR reaches 93% accuracy on the 100-sample test set, which supports the approach at this scale
Theoretical Contribution: Reveals different computational signatures of recall and reasoning within models, providing new perspectives for understanding LLM cognitive mechanisms
Practical Value: RADAR provides contamination detection tool without training data access, with good interpretability
Method Generalizability: The framework is intended to extend to other model architectures, although the experiments cover only DialoGPT-medium

Limitations

Scale Limitations: Current experiments primarily conducted on DialoGPT-medium; applicability to larger models requires verification
Dataset Scale: Training set contains only 30 samples, test set 100 samples; relatively small scale
Proxy Features: Some mechanistic features use proxy measures rather than direct computation (e.g., causal effects approximated through attention entropy)
Task Scope: Currently focuses on simple factual recall vs. logical reasoning; applicability to complex tasks requires further verification
Computational Overhead: Requires extracting model internal states, potentially increasing computational cost

Future Directions

Large Model Extension: Explore applications on larger-scale models
Unsupervised Detection: Develop unsupervised contamination detection methods
Multiple Contamination Types: Extend to detecting other types of data contamination
Real-time Detection: Develop efficient online contamination detection systems

In-depth Evaluation

Strengths

Strong Innovation: First application of mechanistic interpretability to contamination detection, opening new research directions
Scientific Methodology: Feature design has theoretical foundation; ensemble classifier enhances robustness
Good Interpretability: Provides concrete feature explanations, enhancing method credibility
High Practical Value: No training data access required, lowering application barriers
Comprehensive Experiments: Includes test cases of varying difficulty, validating method robustness

Weaknesses

Experimental Scale: Relatively small dataset scale, potential overfitting risk
Benchmark Comparison: Lacks direct comparison with existing contamination detection methods
Feature Engineering: Some features use proxy measures, potentially affecting accuracy
Generalization Ability: Validated on only one model; generalization capability requires verification
Theoretical Analysis: Lacks in-depth theoretical analysis of why these features are effective

Impact

Academic Contribution: Provides new perspectives for LLM evaluation and mechanistic interpretability research
Practical Value: Provides practical contamination detection tool for industry
Reproducibility: Provides complete code implementation, facilitating reproduction and extension
Research Inspiration: May inspire more research on model internal mechanisms

Applicable Scenarios

Model Evaluation: Detecting potential data contamination in LLM benchmarking
Research Tool: Analyzing model cognitive mechanisms as research tool
Quality Control: Ensuring evaluation reliability during model development
Educational Application: Helping understand and teach LLM internal workings

References

Main references include:

Golchin & Surdeanu (2023): Time travel in LLMs: Tracing data contamination
Carlini et al. (2021): Extracting training data from large language models
Elhage et al. (2021): A mathematical framework for transformer circuits
Olah et al. (2020): Zoom in: An introduction to circuits
Feldman (2020): Does learning require memorization?

Summary: RADAR represents significant progress in LLM contamination detection, providing new solutions through mechanistic interpretability. While there is room for improvement in experimental scale and theoretical analysis, its innovation and practical value make it an important contribution to the field. This work not only addresses practical problems but also provides new tools and perspectives for understanding LLM internal mechanisms.