
RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation

Kattamuri, Fartale, Vats et al.
Data contamination poses a significant challenge to reliable LLM evaluation, where models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.

Basic Information

  • Paper ID: 2510.08931
  • Title: RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
  • Authors: Ashish Kattamuri (Proofpoint), Harshwardhan Fartale (Indian Institute of Science), Arpita Vats (LinkedIn), Rahul Raja (LinkedIn), Ishita Prasad (Meta FAIR)
  • Classification: cs.AI, cs.LG
  • Publication Date: October 10, 2025 (Preprint)
  • Paper Link: https://arxiv.org/abs/2510.08931v1

Abstract

Data contamination poses a significant challenge to reliable large language model (LLM) evaluation, as models may achieve high performance through memorization of training data rather than demonstrating genuine reasoning capabilities. This paper proposes RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework leveraging mechanistic interpretability to detect contamination by distinguishing between recall-based and reasoning-based model responses. RADAR extracts 37 features encompassing surface-level confidence trajectories and deep mechanistic properties, including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble classifier trained on these features, RADAR achieves 93% accuracy on diverse evaluation sets, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous samples.

Research Background and Motivation

Problem Definition

Data contamination in large language model evaluation refers to overlap between training and evaluation data, causing models to solve tasks through memorization rather than reasoning, thereby inflating evaluation metrics and obscuring true capabilities.

Problem Significance

  1. Evaluation Reliability: Data contamination severely compromises the credibility of model evaluation, making it impossible to accurately assess models' genuine reasoning abilities
  2. Scientific Research Value: Distinguishing between memorization and reasoning is crucial for understanding models' cognitive mechanisms
  3. Practical Application: In real-world deployment, it is essential to ensure models possess genuine reasoning capabilities rather than relying solely on memorization

Limitations of Existing Methods

Traditional detection methods primarily include:

  • Comparing evaluation data with training corpora
  • Checking n-gram overlap
  • Flagging verbatim outputs

These methods have the following limitations:

  1. Require access to training data
  2. Cannot handle paraphrased contamination
  3. Cannot reveal whether models solve tasks through recall or reasoning
  4. Focus only on surface-level similarity

Research Motivation

This paper proposes analyzing the problem from the perspective of internal computational dynamics, leveraging mechanistic interpretability techniques to distinguish between recall and reasoning processes by analyzing attention, hidden states, and activation flows.

Core Contributions

  1. Methodological Innovation: Proposes the RADAR framework, applying mechanistic interpretability to data contamination detection for the first time, distinguishing recall from reasoning through internal computational process analysis
  2. Feature Engineering: Designs 37 features, including 17 surface features and 20 mechanistic features, comprehensively characterizing model internal processing
  3. Performance Breakthrough: Achieves 93% accuracy on diverse evaluation sets, demonstrating the effectiveness of mechanistic features in distinguishing recall from reasoning
  4. Practical Value: Provides a contamination detection tool without requiring training data access, with good interpretability and practicality
  5. Theoretical Insights: Reveals different mechanistic signatures of recall and reasoning processes within models, providing new perspectives for understanding model cognitive processes

Methodology Details

Task Definition

  • Input: A prompt and the corresponding model response
  • Output: A binary classification label indicating whether the model response is based on recall or reasoning
  • Objective: Identify potential data contamination by analyzing the model's internal computational processes

Model Architecture

The RADAR framework comprises three core components:

1. Mechanistic Analyzer

  • Interfaces with target LLM, configured to output attention weights and hidden states
  • Analyzes attention patterns across all heads and layers
  • Computes entropy and specialization metrics
  • Examines hidden state dynamics, including variance, norm, and effective rank
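
As a rough illustration of the entropy and specialization metrics listed above, the sketch below computes the Shannon entropy of an attention distribution and counts heads whose mean entropy falls below a threshold. The function names, the threshold of 1.0 nats, and the low-entropy definition of a "specialized" head are assumptions made for illustration; this summary does not give the paper's exact definitions.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` is one row of an attention matrix: non-negative values
    that sum to 1 over the attended positions.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def mean_head_entropy(head_rows):
    """Average entropy over all query positions of a single head."""
    return sum(attention_entropy(row) for row in head_rows) / len(head_rows)

def count_specialized_heads(heads, threshold=1.0):
    """Count heads whose mean entropy is below `threshold`.

    Assumption: a head counts as "specialized" when its attention is
    sharply focused (low entropy). The threshold is hypothetical.
    """
    return sum(1 for head in heads if mean_head_entropy(head) < threshold)

# Toy data: two heads, each with 2 query rows over 4 positions.
focused_head = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
diffuse_head = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

print(mean_head_entropy(focused_head))   # low: attention is concentrated
print(mean_head_entropy(diffuse_head))   # ln(4): uniform over 4 positions
print(count_specialized_heads([focused_head, diffuse_head]))
```

In a real pipeline the rows would come from the attention tensors the target model returns when configured to output attention weights; here they are hand-written toy values.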

2. Feature Extraction

Extracts 37 features divided into two categories:

Surface Features (17):

  • Confidence statistics: mean, standard deviation, maximum, minimum, range
  • Convergence properties: convergence layer, convergence speed, confidence slope
  • Entropy measures: average entropy, entropy change, information gain
  • Stability indicators: prediction stability, layer consistency
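
To make the confidence statistics and convergence properties above more concrete, here is a minimal sketch that derives a few of them from a per-layer confidence trajectory. It assumes that `confidences[i]` is the probability the model assigns to its final answer when read out at layer `i`; how the paper obtains these values is not stated in this summary. The 0.9 convergence threshold and the feature names are hypothetical.

```python
import statistics

def confidence_features(confidences, threshold=0.9):
    """Summarize a per-layer confidence trajectory (needs >= 2 layers)."""
    n = len(confidences)
    mean = statistics.fmean(confidences)
    # Least-squares slope of confidence against layer index.
    x_mean = (n - 1) / 2
    num = sum((i - x_mean) * (c - mean) for i, c in enumerate(confidences))
    den = sum((i - x_mean) ** 2 for i in range(n))
    # First layer whose confidence reaches the threshold; None if never.
    convergence_layer = next(
        (i for i, c in enumerate(confidences) if c >= threshold), None
    )
    return {
        "mean": mean,
        "std": statistics.pstdev(confidences),
        "max": max(confidences),
        "min": min(confidences),
        "range": max(confidences) - min(confidences),
        "slope": num / den,
        "convergence_layer": convergence_layer,
    }

# Hypothetical trajectories: early saturation vs. gradual build-up.
recall_like = [0.2, 0.9, 0.95, 0.97, 0.98, 0.99]
reasoning_like = [0.1, 0.2, 0.35, 0.5, 0.7, 0.92]

print(confidence_features(recall_like)["convergence_layer"])
print(confidence_features(reasoning_like)["convergence_layer"])
```

The two toy trajectories mirror the pattern reported later in the feature analysis, where recall is associated with faster confidence convergence than reasoning.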

Mechanistic Features (20):

  • Attention specialization: number of specialized heads, specialization score, attention entropy
  • Circuit dynamics: circuit depth, complexity, activation flow variance
  • Intervention sensitivity: ablation robustness, critical component count
  • Working memory: hidden state variance, norm trajectory
  • Causal effects: logit attribution, mediation score

3. Classification System

Employs an ensemble of four supervised learning models:

  • Random Forest
  • Gradient Boosting
  • Support Vector Machine (SVM)
  • Logistic Regression

Ensemble Strategy:

\[
\hat{y} = \mathbb{1}\!\left[\frac{1}{M}\sum_{j=1}^{M} \hat{y}_j > \frac{1}{2}\right],
\qquad
\bar{p} = \frac{1}{M}\sum_{j=1}^{M} p_j
\]

Confidence Calculation:

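\[
\mathrm{conf} =
\begin{cases}
\bar{p}, & \text{if } \hat{y} = 1 \text{ (recall)} \\
1 - \bar{p}, & \text{if } \hat{y} = 0 \text{ (reasoning)}
\end{cases}
\]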
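
The ensemble rule and confidence calculation above can be sketched in a few lines. The sketch assumes that each classifier j contributes a hard vote ŷⱼ in {0, 1} and a probability pⱼ for the recall class (label 1), which is the reading suggested by the confidence definition; the function name `ensemble_predict` is hypothetical.

```python
def ensemble_predict(votes, probs):
    """Combine M classifiers by majority vote and mean probability.

    votes: hard predictions, 1 = recall, 0 = reasoning
    probs: each classifier's predicted probability of recall
    Returns (y_hat, p_bar, conf).
    """
    m = len(votes)
    # Strict inequality: a tie (e.g. 2 of 4 votes) resolves to 0.
    y_hat = 1 if sum(votes) / m > 0.5 else 0
    p_bar = sum(probs) / m
    conf = p_bar if y_hat == 1 else 1.0 - p_bar
    return y_hat, p_bar, conf

# Four classifiers, three of which vote "recall".
print(ensemble_predict([1, 1, 1, 0], [0.9, 0.8, 0.7, 0.4]))

# Tie between the four classifiers: the label falls back to reasoning.
print(ensemble_predict([1, 1, 0, 0], [0.6, 0.6, 0.4, 0.4]))
```

Because the label comes from the hard votes while the confidence comes from the averaged probabilities, the two can disagree, in which case the reported confidence drops below 0.5. With an even number of classifiers, the strict inequality also means that ties are labeled as reasoning.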

Technical Innovations

  1. Mechanistic Interpretability Application: First application of transformer circuit analysis to contamination detection, understanding model behavior from internal computational perspective
  2. Multi-level Feature Design: Combines surface trajectory features with deep mechanistic features, comprehensively characterizing model processing
  3. Training Data Independence: Does not require access to original training data; contamination detection relies solely on analyzing model internal states
  4. Enhanced Interpretability: Provides concrete feature explanations, clarifying why a response is classified as recall or reasoning

Experimental Setup

Datasets

Training Set:

  • Total samples: 30 (15 recall, 15 reasoning)
  • Foundation for training classifiers

Test Set:

  • Total samples: 100
  • Clear recall: 20
  • Clear reasoning: 20
  • Challenging cases: 30
  • Complex reasoning: 30

Sample Examples:

Category          | Example Prompt                                                              | Label
Clear recall      | "The capital of France is"                                                  | recall
Clear reasoning   | "If X is the capital of France, then X is"                                  | reasoning
Challenging case  | "What is the sum of 10 and 15?"                                             | reasoning
Complex reasoning | "If a store has 100 items and sells 30% of them, how many items remain?"    | reasoning

Evaluation Metrics

  • Overall Accuracy: Classification accuracy across all samples
  • Class-wise Accuracy: Separate accuracy for recall and reasoning tasks
  • Category-wise Accuracy: Accuracy across different difficulty categories
  • Cross-validation Accuracy: k-fold cross-validation results during training
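
The overall, class-wise, and category-wise metrics above differ only in how the predictions are grouped. The sketch below shows one way to compute all three from a list of prediction records; the record layout, the helper names, and the toy records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def overall_accuracy(records):
    """Fraction of records whose prediction matches the label."""
    return sum(r["pred"] == r["label"] for r in records) / len(records)

def grouped_accuracy(records, key):
    """Accuracy per group, where `key` names the grouping field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["pred"] == r["label"])
    return {group: hits[group] / totals[group] for group in totals}

# Toy records for illustration only.
records = [
    {"category": "clear_recall", "label": "recall", "pred": "recall"},
    {"category": "clear_reasoning", "label": "reasoning", "pred": "reasoning"},
    {"category": "challenging", "label": "reasoning", "pred": "recall"},
    {"category": "challenging", "label": "reasoning", "pred": "reasoning"},
]

print(overall_accuracy(records))               # overall accuracy
print(grouped_accuracy(records, "label"))      # class-wise accuracy
print(grouped_accuracy(records, "category"))   # category-wise accuracy
```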

Comparison Methods

The paper reports RADAR's performance without a direct comparison to other contamination detection methods. The stated rationale is that existing methods rely on text similarity, whereas RADAR analyzes the model's internal mechanisms.

Implementation Details

  • Target Model: microsoft/DialoGPT-medium
  • Configuration: output_attentions=True, output_hidden_states=True
  • Feature Normalization: StandardScaler for zero-mean unit-variance normalization
  • Training Strategy: k-fold cross-validation ensures robust performance estimation
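
The feature normalization step can be illustrated without any dependencies. The sketch below reproduces what scikit-learn's StandardScaler does with its default settings: each feature column is shifted to zero mean and divided by its population standard deviation, and constant columns are left unscaled. The helper names and the toy feature rows are hypothetical, and fitting on the training features only is an assumption about the setup rather than something this summary states.

```python
import statistics

def fit_scaler(rows):
    """Compute per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has std 0; use 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Apply previously fitted statistics to each row."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Toy feature matrix: 3 training samples, 2 features (second is constant).
train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
test = [[4.0, 12.0]]

means, stds = fit_scaler(train)          # statistics come from training data
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test features keeps information about the test set out of the fitted scaler, which matters when the training set is as small as the 30 samples described above.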

Experimental Results

Main Results

Overall Performance:

  • Overall Accuracy: 93.0%
  • Recall Task Accuracy: 97.7%
  • Reasoning Task Accuracy: 89.3%
  • Training Cross-validation Accuracy: 96.7%

Category-wise Performance:

Category          | Accuracy
Clear recall      | 100% (20/20)
Clear reasoning   | 100% (20/20)
Challenging cases | 76.7% (23/30)
Complex reasoning | 100% (30/30)
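
As a quick consistency check, the per-category counts above can be summed to recover the overall figure. The counts are taken directly from the table; only the variable names are introduced here.

```python
# (correct, total) per category, as reported in the table above.
category_results = {
    "Clear recall": (20, 20),
    "Clear reasoning": (20, 20),
    "Challenging cases": (23, 30),
    "Complex reasoning": (30, 30),
}

correct = sum(c for c, _ in category_results.values())
total = sum(n for _, n in category_results.values())

print(f"overall: {correct}/{total} = {correct / total:.1%}")
for name, (c, n) in category_results.items():
    print(f"{name}: {c}/{n} = {c / n:.1%}")
```

The totals come to 93 correct out of 100, matching the reported 93.0% overall accuracy, and all 7 errors fall in the challenging category.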

Feature Analysis

Key Discriminative Features:

  1. Specialized Attention Heads: Higher in recall tasks
  2. Circuit Complexity: Higher in reasoning tasks
  3. Confidence Convergence Pattern: Faster convergence in recall tasks

Recall Detection Score (RDS):

  • Average RDS for recall tasks: 0.933
  • Average RDS for reasoning tasks: 0.375
  • Demonstrates clear separability

Mechanistic Signature Differences:

  • Recall Process: Focused attention patterns, rapid confidence convergence, specialized head activation
  • Reasoning Process: Distributed attention, progressive confidence building, higher activation flow variance

Experimental Findings

  1. Mechanistic Feature Effectiveness: Mechanistic features effectively distinguish recall from reasoning processes, validating the value of internal computational analysis
  2. Challenging Case Analysis: 76.7% accuracy indicates room for improvement in ambiguous boundary cases, typically involving mismatches between surface form and internal processing
  3. Feature Complementarity: Combination of surface and mechanistic features provides more comprehensive analytical perspective
  4. Interpretability Validation: Feature analysis results align with cognitive science theories regarding memory and reasoning

Related Work

Data Contamination Detection

  • Traditional Methods: Based on n-gram overlap, text similarity comparison
  • Representative Work: Carlini et al. (2021) training data extraction methods
  • Limitations: Depend on training data access, cannot handle paraphrased contamination

Mechanistic Interpretability

  • Transformer Circuits: Mathematical framework by Elhage et al. (2021)
  • Attention Analysis: Circuit visualization methods by Olah et al. (2020)
  • Paper Contribution: First application of mechanistic analysis to contamination detection

LLM Evaluation

  • Memory vs. Reasoning: Theoretical analysis by Feldman (2020) on learning and memorization
  • Evaluation Reliability: Time travel detection methods by Golchin and Surdeanu (2023)
  • Paper Advantage: Provides evaluation methods from internal mechanistic perspective

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: Mechanistic interpretability effectively detects data contamination; 93% accuracy validates method effectiveness
  2. Theoretical Contribution: Reveals different computational signatures of recall and reasoning within models, providing new perspectives for understanding LLM cognitive mechanisms
  3. Practical Value: RADAR provides contamination detection tool without training data access, with good interpretability
  4. Method Generalizability: Framework is designed to extend to other model architectures, although it has so far been validated only on DialoGPT-medium

Limitations

  1. Scale Limitations: Current experiments primarily conducted on DialoGPT-medium; applicability to larger models requires verification
  2. Dataset Scale: Training set contains only 30 samples, test set 100 samples; relatively small scale
  3. Proxy Features: Some mechanistic features use proxy measures rather than direct computation (e.g., causal effects approximated through attention entropy)
  4. Task Scope: Currently focuses on simple factual recall vs. logical reasoning; applicability to complex tasks requires further verification
  5. Computational Overhead: Requires extracting model internal states, potentially increasing computational cost

Future Directions

  1. Large Model Extension: Explore applications on larger-scale models
  2. Unsupervised Detection: Develop unsupervised contamination detection methods
  3. Multiple Contamination Types: Extend to detecting other types of data contamination
  4. Real-time Detection: Develop efficient online contamination detection systems

In-depth Evaluation

Strengths

  1. Strong Innovation: First application of mechanistic interpretability to contamination detection, opening new research directions
  2. Scientific Methodology: Feature design has theoretical foundation; ensemble classifier enhances robustness
  3. Good Interpretability: Provides concrete feature explanations, enhancing method credibility
  4. High Practical Value: No training data access required, lowering application barriers
  5. Comprehensive Experiments: Includes test cases of varying difficulty, validating method robustness

Weaknesses

  1. Experimental Scale: Relatively small dataset scale, potential overfitting risk
  2. Benchmark Comparison: Lacks direct comparison with existing contamination detection methods
  3. Feature Engineering: Some features use proxy measures, potentially affecting accuracy
  4. Generalization Ability: Validated on only one model; generalization capability requires verification
  5. Theoretical Analysis: Lacks in-depth theoretical analysis of why these features are effective

Impact

  1. Academic Contribution: Provides new perspectives for LLM evaluation and mechanistic interpretability research
  2. Practical Value: Provides practical contamination detection tool for industry
  3. Reproducibility: Provides complete code implementation, facilitating reproduction and extension
  4. Research Inspiration: May inspire more research on model internal mechanisms

Applicable Scenarios

  1. Model Evaluation: Detecting potential data contamination in LLM benchmarking
  2. Research Tool: Analyzing model cognitive mechanisms as research tool
  3. Quality Control: Ensuring evaluation reliability during model development
  4. Educational Application: Helping understand and teach LLM internal workings

References

Main references include:

  • Golchin & Surdeanu (2023): Time travel in LLMs: Tracing data contamination
  • Carlini et al. (2021): Extracting training data from large language models
  • Elhage et al. (2021): A mathematical framework for transformer circuits
  • Olah et al. (2020): Zoom in: An introduction to circuits
  • Feldman (2020): Does learning require memorization?

Summary: RADAR represents significant progress in LLM contamination detection, providing new solutions through mechanistic interpretability. While there is room for improvement in experimental scale and theoretical analysis, its innovation and practical value make it an important contribution to the field. This work not only addresses practical problems but also provides new tools and perspectives for understanding LLM internal mechanisms.