2025-11-21T11:01:15.942804

High-Power Training Data Identification with Provable Statistical Guarantees

Liu, Zeng, Huang et al.
Identifying training data within large-scale models is critical for copyright litigation, privacy auditing, and ensuring fair evaluation. Conventional approaches treat it as a simple binary classification task without statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong, easily violated assumptions. In this paper, we introduce Provable Training Data Identification (PTDI), a rigorous method that identifies a set of training data with strict FDR control. Specifically, our method computes p-values for each data point using a set of known unseen data, and then constructs a conservative estimator for the data usage proportion of the test set, which allows us to scale these p-values. Our approach then selects the final set of training data by identifying all points whose scaled p-values fall below a data-dependent threshold. This entire procedure enables the discovery of training data with provable, strict FDR control and significantly boosted power. Extensive experiments across a wide range of models (LLMs and VLMs) and datasets demonstrate that PTDI strictly controls the FDR and achieves higher power.

Basic Information

  • Paper ID: 2510.09717
  • Title: High-Power Training Data Identification with Provable Statistical Guarantees
  • Authors: Zhenlong Liu, Hao Zeng, Weiran Huang, Hongxin Wei
  • Classification: cs.LG cs.AI
  • Publication Date/Venue: Preprint (October 2025)
  • Paper Link: https://arxiv.org/abs/2510.09717

Abstract

Identifying training data in large-scale models is crucial for copyright litigation, privacy audits, and ensuring fair evaluation. Traditional methods treat this as a simple binary classification task and offer no statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but it relies on strong assumptions that are easily violated. This paper proposes Provable Training Data Identification (PTDI), a method that rigorously controls the FDR. The method computes a p-value for each data point using a set of known unseen data, then constructs a conservative estimator of the data usage proportion of the test set (the fraction of test points that were used in training) and uses it to scale these p-values. Finally, it selects as training data all points whose scaled p-values fall below a data-dependent threshold. The entire procedure achieves provable FDR control while significantly improving statistical power.

Research Background and Motivation

Problem Importance

With the widespread deployment of machine learning models, identifying training data has become increasingly important in three areas:

  1. Copyright Disputes: For example, Strike 3 v. Meta involves 2,396 copyrighted films, with potential statutory damages exceeding $350 million
  2. Data Privacy: Compliance with privacy regulations such as GDPR and CCPA
  3. Data Contamination: Ensuring fairness of evaluation benchmarks and preventing training data leakage

Limitations of Existing Methods

  1. Traditional Methods: Treat training data detection as a simple binary classification task, lacking theoretical guarantees
  2. Recent Methods: The knockoff-based approach of Hu et al. (2025) is designed to control FDR but has several drawbacks:
    • It requires access to model gradients, which are unavailable in black-box settings
    • Effective knockoffs are difficult to construct, so the required symmetric-distribution assumption is easily violated
    • When that assumption fails, the FDR guarantee may no longer hold

Research Motivation

This paper aims to design a distribution-free method that provides rigorous FDR control in both white-box and black-box settings while achieving higher statistical power.

Core Contributions

  1. Proposes PTDI Method: A novel and general method achieving distribution-free finite-sample FDR control, compatible with existing detection methods
  2. Theoretical Guarantees: Provides rigorous theoretical proof (Theorem 1) ensuring PTDI strictly controls the false discovery rate
  3. Extensive Experimental Validation: Verifies method effectiveness across multiple models (LLMs and VLMs), tasks (pretraining and fine-tuning), and datasets
  4. Practical Applicability: Model-agnostic method applicable to both black-box and white-box settings, requiring only unseen data as a calibration set

Methodology Details

Task Definition

Given a target model θ, a calibration set D_cal of n known unseen points, and a test set D_test = {X_{n+j}}_{j=1}^m, the goal is to select an index subset S ⊆ {1,...,m} such that the false discovery rate is controlled at a user-specified level α ∈ (0,1):

$$\text{FDR} = E\left[\frac{\sum_{j=1}^m \mathbf{1}\{M_{n+j} = 0,\ j \in S\}}{\max(|S|, 1)}\right] \leq \alpha$$

where M_{n+j} = 0 indicates that test point X_{n+j} was not used in training, so the numerator counts the non-members wrongly included in S.

Core Algorithm: PTDI

Step 1: Construct Conformal p-values

Compute a conformal p-value for each test point:

$$p_j = \frac{1 + \sum_{i=1}^n \mathbf{1}\{T_i \leq T_{n+j}\}}{n+1}$$

where T_i = T(X_i; θ) is the detection score (e.g., perplexity) of the i-th point, with lower scores indicating a higher likelihood of being a training member.
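
The p-value formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `conformal_p_values` and the toy scores are assumptions made for the example, and the detection scores are taken as already computed.

```python
from bisect import bisect_right


def conformal_p_values(cal_scores, test_scores):
    """Return p_j = (1 + #{i : T_i <= T_{n+j}}) / (n + 1) for each test score."""
    cal_sorted = sorted(cal_scores)
    n = len(cal_sorted)
    # bisect_right gives the number of calibration scores <= t.
    return [(1 + bisect_right(cal_sorted, t)) / (n + 1) for t in test_scores]


# Hypothetical scores: calibration points are known to be unseen by the model.
cal = [3.1, 2.7, 3.5, 2.9]
test = [1.2, 3.0, 4.0]
p = conformal_p_values(cal, test)
print(p)  # [0.2, 0.6, 1.0]
```

A test score lower than every calibration score receives the smallest possible p-value, 1/(n+1), which is why a larger calibration set allows finer-grained evidence of membership.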

Step 2: Estimate Data Usage Proportion

Use the subtraction estimator π̂_sub to estimate π_test, the proportion of training data in the test set:

$$\hat{\pi}_{sub} = 1 - \frac{\frac{1}{m+1}\left(1 + \sum_{j=1}^m \mathbf{1}\{T(X_{n+j}) \in R\}\right)}{\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{T(X_i) \in R\}}$$

where R = (τ, +∞) is a region of high scores in which training members are sparse, with the threshold τ set through a quantile level η.
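
As a rough sketch, the estimator can be computed directly from the two sets of scores. The function name `subtraction_estimator` is hypothetical, and the threshold `tau` is passed in as a plain argument because the summary does not spell out how τ is derived from η.

```python
def subtraction_estimator(cal_scores, test_scores, tau):
    """Estimate the proportion of training members among the test points.

    R = (tau, +inf) is the region where members are assumed to be sparse.
    """
    n, m = len(cal_scores), len(test_scores)
    cal_in_r = sum(t > tau for t in cal_scores)
    test_in_r = sum(t > tau for t in test_scores)
    if cal_in_r == 0:
        raise ValueError("no calibration score falls in R; choose a smaller tau")
    # Ratio of the (slightly inflated) test fraction in R to the calibration
    # fraction in R: an estimate of the non-member proportion.
    non_member_fraction = ((1 + test_in_r) / (m + 1)) / (cal_in_r / n)
    return 1 - non_member_fraction


# Hypothetical scores: 5 of 10 calibration scores and 2 of 9 test scores exceed tau.
cal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test = [0.5, 0.6, 0.7, 0.8, 0.9, 6, 7, 1.5, 2.5]
pi_hat = subtraction_estimator(cal, test, tau=5)
print(round(pi_hat, 3))  # 0.4
```

The sketch follows the formula literally and does not clip the result, so the estimate can be negative on small samples; that only makes the later scaling more conservative.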

Step 3: Scale p-values

Compute scaled p-values:

$$\tilde{p}_j = (1-\hat{\pi}_{test})\,p_j$$

where π̂_test denotes the estimated data usage proportion from Step 2 (here, π̂_sub).

Step 4: Benjamini-Hochberg Procedure

Apply the BH procedure to the scaled p-values to select the final set:

$$S = \left\{ j : \tilde{p}_j \leq \frac{k^*}{m}\alpha \right\}, \qquad k^* = \max\left\{ k : \tilde{p}_{(k)} \leq \frac{k}{m}\alpha \right\}$$

where p̃_(k) is the k-th smallest scaled p-value.
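
Steps 3 and 4 can be sketched together as follows. The function name `ptdi_select` and the toy p-values are assumptions made for illustration, and the returned indices are 0-based rather than the 1-based indices used in the formulas.

```python
def ptdi_select(p_values, pi_hat, alpha):
    """Scale p-values by (1 - pi_hat), then apply Benjamini-Hochberg at level alpha."""
    m = len(p_values)
    scaled = [(1 - pi_hat) * p for p in p_values]
    order = sorted(range(m), key=lambda j: scaled[j])
    # k* is the largest rank k whose k-th smallest scaled p-value is <= k * alpha / m.
    k_star = 0
    for rank, j in enumerate(order, start=1):
        if scaled[j] <= rank * alpha / m:
            k_star = rank
    if k_star == 0:
        return []  # nothing passes the threshold
    threshold = k_star * alpha / m
    return [j for j in range(m) if scaled[j] <= threshold]


# Hypothetical p-values for five test points.
p = [0.01, 0.5, 0.03, 0.9, 0.07]
print(ptdi_select(p, pi_hat=0.0, alpha=0.1))  # [0, 2]     (plain BH)
print(ptdi_select(p, pi_hat=0.4, alpha=0.1))  # [0, 2, 4]  (scaled p-values)
```

With `pi_hat=0.0` the procedure reduces to plain BH; in this toy input a positive estimate admits one additional point, which is the mechanism behind the power gain.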

Technical Innovations

  1. Conservative Estimator Design: The subtraction estimator ensures E[(1-π_test)/(1-π̂_sub)] ≤ 1, maintaining FDR control
  2. P-value Scaling Technique: Overcomes the conservativeness of the standard BH procedure through p-value scaling, significantly improving statistical power
  3. Distribution-Free Guarantees: Does not depend on specific distributional assumptions, providing broad applicability
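
A short heuristic calculation may help explain why scaling raises power and why a condition on the expected ratio is the natural one. This is an informal sketch and not the paper's proof: it treats the estimate as fixed, assumes 1 - π̂_test > 0, and relies on the classical BH bound, which holds under that procedure's usual conditions on the p-values.

```latex
\begin{align*}
% Scaling the p-values is the same as running BH at an inflated level.
\tilde{p}_{(k)} \le \frac{k}{m}\alpha
  &\iff p_{(k)} \le \frac{k}{m}\cdot\frac{\alpha}{1-\hat{\pi}_{test}} \\
% Classical BH at level \alpha' is conservative by the non-member fraction.
\text{FDR}_{\text{BH}}(\alpha') &\le (1-\pi_{test})\,\alpha' \\
% Heuristically plugging in \alpha' = \alpha / (1-\hat{\pi}_{test}):
\text{FDR} &\lesssim \alpha\, E\!\left[\frac{1-\pi_{test}}{1-\hat{\pi}_{test}}\right] \le \alpha
  \quad\text{when}\quad E\!\left[\frac{1-\pi_{test}}{1-\hat{\pi}_{test}}\right] \le 1
\end{align*}
```

The first line is an exact equivalence; the last line is only suggestive, since a rigorous argument has to account for the dependence between the estimator and the p-values.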

Experimental Setup

Datasets

  • LLM Pretraining: WikiMIA, ArxivTection
  • LLM Fine-tuning: XSum, BBC Real Time
  • Vision-Language Models: VL-MIA/Flickr, VL-MIA/DALL-E

Models

  • LLMs: GPT-2, GPT-Neo, GPT-NeoX-20B, LLaMA-7B, Pythia (1.4B and 6.9B)
  • VLMs: LLaVA-1.5, MiniGPT-4

Detection Scores

  • LLMs: Perplexity, Zlib compression ratio, MIN-K%, Modified Entropy (M-Entropy)
  • VLMs: MaxRényi-K%

Evaluation Metrics

  • FDR: Empirical estimate of false discovery rate
  • Power: Statistical power, proportion of true members correctly identified
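
Both metrics are simple to compute when ground-truth membership labels are available. The sketch below uses a hypothetical helper, `fdp_and_power`, that returns the false discovery proportion (FDP) of a single run; averaging FDP over repeated runs gives the empirical FDR.

```python
def fdp_and_power(selected, is_member):
    """Return (FDP, power) for one run.

    selected  : indices declared to be training data
    is_member : ground-truth membership flag for every test point
    """
    selected = set(selected)
    false_discoveries = sum(1 for j in selected if not is_member[j])
    true_discoveries = len(selected) - false_discoveries
    fdp = false_discoveries / max(len(selected), 1)
    n_members = sum(is_member)
    # Convention chosen for this sketch: power is 0 when there are no true members.
    power = true_discoveries / n_members if n_members else 0.0
    return fdp, power


# Hypothetical run: three points selected, one of which is a non-member.
fdp, power = fdp_and_power([0, 1, 2], [True, False, True, True, False])
print(round(fdp, 3), round(power, 3))  # 0.333 0.667
```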

Experimental Results

Main Results

FDR Control Effectiveness

The PTDI method strictly controls FDR below the target level across all experimental settings:

  • Pythia-1.4B on WikiMIA, target FDR=5%: PTDI achieves an empirical FDR of 4.94%, versus 13.11% for KTD
  • All model and dataset combinations show actual FDR below target level

Statistical Power Improvement

P-value scaling significantly improves statistical power:

  • GPT-NeoX-20B on WikiMIA, target FDR=0.5, MIN-K% score: power improves from 0.44 to 0.75
  • Across different target FDR levels, the scaling method consistently outperforms the vanilla method

Ablation Studies

Impact of Calibration Set Size

  • Increasing the calibration-to-test ratio ρ = n/m from 0.1 to 1.0 reduces the variance of the false discovery proportion (FDP) and of power
  • All ρ values effectively control FDR

Robustness of Hyperparameter η

  • Method robustly controls FDR across η ∈ {0.01, 0.05, 0.1, 0.5}
  • Default setting η = 0.05

Robustness to π_test Variation

  • Maintains FDR control across different data usage proportions (π_test = 0.3, 0.5, 0.7)
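
One way to probe this kind of robustness is a toy simulation on synthetic scores. The sketch below is not an experiment from the paper: the Gaussian score model, the sample sizes, the `shift` parameter, and the function names are all assumptions, and τ is set to an empirical (1 - η) quantile of the calibration scores, which is a guess at one reasonable construction rather than the authors' rule.

```python
import random
from bisect import bisect_right


def one_trial(rng, n, m, pi_test, alpha, eta, shift):
    # Synthetic scores: non-members ~ N(0, 1), members ~ N(-shift, 1) (lower = member).
    cal = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    member = [rng.random() < pi_test for _ in range(m)]
    test = [rng.gauss(-shift if mem else 0.0, 1.0) for mem in member]

    # Step 1: conformal p-values.
    p = [(1 + bisect_right(cal, t)) / (n + 1) for t in test]

    # Step 2: subtraction estimator. ASSUMPTION: tau is an empirical quantile of cal.
    tau = cal[max(0, min(n - 2, int((1 - eta) * n)))]
    cal_frac = (n - bisect_right(cal, tau)) / n
    test_frac = (1 + sum(t > tau for t in test)) / (m + 1)
    pi_hat = 1 - test_frac / cal_frac

    # Steps 3 and 4: scale the p-values, then apply BH.
    scaled = [(1 - pi_hat) * pj for pj in p]
    ranked = sorted(scaled)
    k_star = max((k for k in range(1, m + 1) if ranked[k - 1] <= k * alpha / m), default=0)
    selected = [j for j in range(m) if k_star and scaled[j] <= k_star * alpha / m]

    false_disc = sum(not member[j] for j in selected)
    fdp = false_disc / max(len(selected), 1)
    power = (len(selected) - false_disc) / max(sum(member), 1)
    return fdp, power


def simulate(pi_test, trials=30, seed=0):
    """Average FDP and power over repeated synthetic trials."""
    rng = random.Random(seed)
    runs = [one_trial(rng, n=200, m=200, pi_test=pi_test, alpha=0.1, eta=0.05, shift=3.0)
            for _ in range(trials)]
    return sum(r[0] for r in runs) / trials, sum(r[1] for r in runs) / trials


for pi in (0.3, 0.5, 0.7):
    mean_fdp, mean_power = simulate(pi)
    print(f"pi_test={pi}: mean FDP={mean_fdp:.3f}, mean power={mean_power:.3f}")
```

The printed averages depend on the synthetic setup and say nothing about the paper's reported numbers; the point is only to show how the four steps fit together when π_test is varied.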

Comparison with KTD Method

  • PTDI strictly controls FDR across all test settings
  • KTD loses control on WikiMIA and XSum at certain α values
  • In settings where both methods control FDR, PTDI achieves higher power on GPT-2

Adjusted Moment Estimator

The paper also proposes a bias-corrected moment estimator π̂_mom, which further improves power when confirmed member data is available while maintaining FDR control.

Related Work

Training Data Detection in Large-Scale Models

  • Data Contamination Research: Preventing benchmark data leakage into training sets
  • Heuristic Detection Scores: Methods like perplexity and MIN-K% lack theoretical guarantees
  • Statistically Rigorous Methods: Approaches by Dekoninck et al. and Oren et al. apply only at the dataset level

Membership Inference Attacks

  • Privacy Perspective: MIA aims to determine whether specific data points were used for training
  • Binary Classification Methods: Focus on average classification accuracy
  • Hypothesis Testing Framework: Methods like Attack-P prioritize TPR at low FPR

FDR Control

  • Benjamini-Hochberg Procedure: Classical FDR control tool
  • Conformal P-values: Jin & Candès' method requires strong i.i.d. assumptions
  • Knockoff Statistics: Hu et al.'s method requires high-quality knockoff generation

Conclusions and Discussion

Main Conclusions

  1. The PTDI method achieves rigorous FDR control with distribution-free finite-sample guarantees
  2. P-value scaling technique significantly improves statistical power while maintaining theoretical rigor
  3. The method has broad applicability and can be combined with existing detection methods

Limitations

  1. Calibration Set Requirement: Requires an unseen calibration set with distribution similar to the test set
  2. Heterogeneous Data Challenges: Constructing representative calibration sets is difficult for highly heterogeneous test data
  3. Distribution Mismatch: Significant distribution mismatch between calibration and test data may invalidate FDR guarantees

Future Directions

  1. Develop more robust methods for estimating data usage proportions
  2. Study FDR control under distribution mismatch
  3. Extend to more complex detection scenarios

In-Depth Evaluation

Strengths

  1. Theoretical Rigor: Provides complete mathematical proofs and finite-sample guarantees
  2. Strong Practicality: Simple and easy to implement, compatible with existing tools
  3. Comprehensive Experiments: Extensive evaluation across multiple models, tasks, and datasets
  4. Innovation: P-value scaling technique cleverly addresses the conservativeness of the BH procedure

Weaknesses

  1. Assumption Limitations: Depends on the availability of a suitable calibration set
  2. Computational Overhead: Requires computing detection scores for numerous candidate data points
  3. Parameter Selection: While robust to η, optimal selection still requires empirical guidance

Impact

  1. Academic Contribution: Provides a distribution-free framework with finite-sample FDR guarantees for training data identification
  2. Practical Value: Direct applications in copyright litigation and privacy audits
  3. Reproducibility: Clear algorithm description, easy to reproduce and extend

Applicable Scenarios

  1. Copyright Protection: Identifying copyrighted content used in model training
  2. Privacy Audits: Verifying whether personal data was used for model training
  3. Benchmark Evaluation: Detecting and removing contaminated samples in evaluation datasets
  4. Model Auditing: Verifying model compliance in regulatory environments

References

The paper cites important works including:

  • Benjamini & Hochberg (1995): Classical BH procedure for FDR control
  • Shi et al. (2024): WikiMIA dataset and MIN-K% detection method
  • Hu et al. (2025): Training data detection based on knockoff statistics
  • Jin & Candès (2023): Conformal p-values in selection problems

Summary: PTDI combines conformal p-values, a conservative estimate of the data usage proportion, and the BH procedure to identify training data with distribution-free, finite-sample FDR control and higher power than the unscaled procedure. The work offers a practical tool for addressing transparency and accountability in AI models.