
High-Power Training Data Identification with Provable Statistical Guarantees

Liu, Zeng, Huang et al.

Identifying training data within large-scale models is critical for copyright litigation, privacy auditing, and ensuring fair evaluation. The conventional approaches treat it as a simple binary classification task without statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong, easily violated assumptions. In this paper, we introduce Provable Training Data Identification (PTDI), a rigorous method that identifies a set of training data with strict false discovery rate (FDR) control. Specifically, our method computes p-values for each data point using a set of known unseen data, and then constructs a conservative estimator for the data usage proportion of the test set, which allows us to scale these p-values. Our approach then selects the final set of training data by identifying all points whose scaled p-values fall below a data-dependent threshold. This entire procedure enables the discovery of training data with provable, strict FDR control and significantly boosted power. Extensive experiments across a wide range of models (LLMs and VLMs), and datasets demonstrate that PTDI strictly controls the FDR and achieves higher power.


Basic Information

Paper ID: 2510.09717
Title: High-Power Training Data Identification with Provable Statistical Guarantees
Authors: Zhenlong Liu, Hao Zeng, Weiran Huang, Hongxin Wei
Classification: cs.LG cs.AI
Publication Date/Venue: Preprint (October 2025)
Paper Link: https://arxiv.org/abs/2510.09717

Abstract

Identifying training data in large-scale models is crucial for copyright litigation, privacy audits, and fair evaluation. Traditional methods treat this as a simple binary classification task and offer no statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong assumptions that are easily violated. This paper proposes Provable Training Data Identification (PTDI), a method that rigorously controls the FDR. The method computes a p-value for each data point using a set of known unseen data, then constructs a conservative estimator of the test set's data usage proportion (the fraction of test points used in training) and uses it to scale the p-values. Finally, it selects all points whose scaled p-values fall below a data-dependent threshold. The entire procedure achieves provable, strict FDR control while significantly improving statistical power.

Research Background and Motivation

Problem Importance

As machine learning models are deployed more widely, training data identification has become increasingly important in three areas:

Copyright Disputes: Such as Strike 3 v. Meta, involving 2,396 copyrighted films with potential statutory damages exceeding $350 million
Data Privacy: Compliance with privacy regulations such as GDPR and CCPA
Data Contamination: Ensuring fairness of evaluation benchmarks and preventing training data leakage

Limitations of Existing Methods

Traditional Methods: Treat training data detection as a simple binary classification task, lacking theoretical guarantees
Recent Methods: The knockoff-statistics approach proposed by Hu et al. (2025) aims to control FDR but has several drawbacks:
- It requires access to model gradients, which are unavailable in black-box settings
- Effective knockoffs are difficult to construct, so the symmetric distribution assumption is easily violated
- As a result, its FDR control can become invalid

Research Motivation

This paper aims to design a distribution-free method that provides rigorous FDR control in both white-box and black-box settings while achieving higher statistical power.

Core Contributions

Proposes PTDI Method: A novel and general method achieving distribution-free finite-sample FDR control, compatible with existing detection methods
Theoretical Guarantees: Provides rigorous theoretical proof (Theorem 1) ensuring PTDI strictly controls the false discovery rate
Extensive Experimental Validation: Verifies method effectiveness across multiple models (LLMs and VLMs), tasks (pretraining and fine-tuning), and datasets
Practical Applicability: Model-agnostic method applicable to both black-box and white-box settings, requiring only unseen data as a calibration set

Methodology Details

Task Definition

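Given a target model θ, a calibration set D_cal of known unseen data (size n), and a test set D_test = {X_{n+j}}_{j=1}^m, where M_{n+j} ∈ {0,1} indicates whether X_{n+j} was used to train θ (M_{n+j} = 1 for training members), the goal is to select an index subset S ⊆ {1,...,m} such that the false discovery rate is controlled at a user-specified level α ∈ (0,1):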

$\text{FDR} = E\left[\frac{\sum_{j=1}^m \mathbf{1}\{M_{n+j} = 0, j \in S\}}{\max(|S|, 1)}\right] \leq \alpha$

Core Algorithm: PTDI

Step 1: Construct Conformal p-values

Compute p-values for each test point: $p_j = \frac{1 + \sum_{i=1}^n \mathbf{1}\{T_i \leq T_{n+j}\}}{n+1}$

where T_i = T(X_i;θ) is the detection score of the i-th point (e.g., perplexity), with lower scores indicating a higher likelihood of being a training member.
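The p-value formula in Step 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the function name `conformal_p_values` and the toy scores are assumptions made for the example.

```python
import numpy as np

def conformal_p_values(cal_scores, test_scores):
    """p_j = (1 + #{i : T_i <= T_{n+j}}) / (n + 1).

    cal_scores: detection scores of the n calibration (known unseen) points.
    test_scores: detection scores of the m test points.
    Lower scores are assumed to indicate training members, so members
    tend to receive small p-values.
    """
    cal = np.sort(np.asarray(cal_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    # side="right" counts calibration scores that are <= each test score.
    counts = np.searchsorted(cal, test, side="right")
    return (1.0 + counts) / (cal.size + 1.0)

# Toy scores (hypothetical): n = 4 calibration points, m = 3 test points.
p = conformal_p_values([1.0, 2.0, 3.0, 4.0], [0.5, 2.0, 5.0])
print(p.tolist())  # -> [0.2, 0.6, 1.0]
```

The smallest attainable p-value is 1/(n+1), so the calibration set size limits how small a p-value can get.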

Step 2: Estimate Data Usage Proportion

Use the subtraction estimator π̂_sub to estimate the proportion of training data in the test set π_test: $\hat{\pi}_{sub} = 1 - \frac{\frac{1}{m+1}(1 + \sum_{j=1}^m \mathbf{1}\{T(X_{n+j}) \in R\})}{\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{T(X_i) \in R\}}$

where R = (τ,+∞) is a region of high scores in which training members are sparse, with the cutoff τ set through a quantile threshold η.
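The subtraction estimator can be sketched as follows. This is a hedged illustration: the function name `subtraction_estimator` is an assumption, and the cutoff `tau` is passed in directly because the exact rule for deriving τ from η is not reproduced in this summary.

```python
import numpy as np

def subtraction_estimator(cal_scores, test_scores, tau):
    """Estimate pi_test with R = (tau, +inf).

    pi_hat = 1 - [(1 + #{test in R}) / (m + 1)] / [#{cal in R} / n]
    """
    cal = np.asarray(cal_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    m = test.size
    # Fraction of calibration (non-member) scores that land in R.
    cal_frac = np.mean(cal > tau)
    if cal_frac == 0:
        raise ValueError("no calibration score exceeds tau; choose a smaller tau")
    # The +1 terms keep the numerator strictly positive, so pi_hat < 1.
    test_frac = (1.0 + np.sum(test > tau)) / (m + 1.0)
    return 1.0 - test_frac / cal_frac

# Toy example (hypothetical scores): half of the calibration scores exceed
# tau = 2.5, while only 2 of the 9 test scores do.
cal = [1.0, 2.0, 3.0, 4.0]
test = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 0.2, 0.8, 1.2]
print(round(subtraction_estimator(cal, test, tau=2.5), 6))  # -> 0.4
```

The intuition is that non-members fall in R at the same rate in both sets, so a shortfall of test points in R signals the presence of members. The raw estimate can be negative when the test set has relatively many points in R; the sketch leaves it unclipped to stay close to the stated formula.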

Step 3: Scale p-values

Compute scaled p-values: $\tilde{p}_j = (1-\hat{\pi}_{test})p_j$, where $\hat{\pi}_{test}$ is the estimate of π_test from Step 2 (here $\hat{\pi}_{sub}$).

Step 4: Benjamini-Hochberg Procedure

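Apply the BH procedure to select the final set: $S = \{j \mid \tilde{p}_j \leq \frac{k^*}{m}\alpha\}$ where $k^* = \max\{k \mid \tilde{p}_{(k)} \leq \frac{k}{m}\alpha\}$ and $\tilde{p}_{(k)}$ denotes the k-th smallest scaled p-value. If no k satisfies the condition, S is empty.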
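Steps 3 and 4 can be sketched together: scale the p-values, then apply the BH step-up rule. This is a minimal illustration under stated assumptions; the function name `scaled_bh_select` and the toy p-values are hypothetical, and the returned indices are 0-based.

```python
import numpy as np

def scaled_bh_select(p_values, pi_hat, alpha):
    """Scale p-values by (1 - pi_hat), then run Benjamini-Hochberg at level alpha.

    Returns the sorted 0-based indices of the selected test points.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    scaled = (1.0 - pi_hat) * p
    # BH thresholds k * alpha / m for k = 1, ..., m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(np.sort(scaled) <= thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)
    # k* is the largest k whose k-th smallest scaled p-value passes.
    cutoff = thresholds[passed[-1]]
    return np.nonzero(scaled <= cutoff)[0]

p = [0.01, 0.05, 0.07, 0.5, 0.9]
# pi_hat = 0 recovers the vanilla BH procedure.
print(scaled_bh_select(p, pi_hat=0.0, alpha=0.1).tolist())  # -> [0]
# A larger estimated member proportion shrinks the p-values.
print(scaled_bh_select(p, pi_hat=0.5, alpha=0.1).tolist())  # -> [0, 1, 2]
```

In this toy case, scaling turns one discovery into three at the same target level, which illustrates where the power gain comes from.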

Technical Innovations

Conservative Estimator Design: The subtraction estimator ensures E[(1-π_test)/(1-π̂_sub)] ≤ 1, maintaining FDR control
P-value Scaling Technique: Overcomes the conservativeness of the standard BH procedure through p-value scaling, significantly improving statistical power
Distribution-Free Guarantees: Does not depend on specific distributional assumptions, providing broad applicability

Experimental Setup

Datasets

LLM Pretraining: WikiMIA, ArxivTection
LLM Fine-tuning: XSum, BBC Real Time
Vision-Language Models: VL-MIA/Flickr, VL-MIA/DALL-E

Models

LLMs: GPT-2, GPT-Neo, GPT-NeoX-20B, LLaMA-7B, Pythia (1.4B and 6.9B)
VLMs: LLaVA-1.5, MiniGPT-4

Detection Scores

LLMs: Perplexity, Zlib compression ratio, MIN-K%, Modified Entropy (M-Entropy)
VLMs: MaxRényi-K%

Evaluation Metrics

FDR: Empirical estimate of false discovery rate
Power: Statistical power, proportion of true members correctly identified
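Both metrics can be computed for a single run as sketched below, assuming ground-truth membership labels are available for the test set. The function name `fdp_and_power` is hypothetical. The value computed per run is the false discovery proportion (FDP); an empirical FDR estimate is its average over repeated runs.

```python
import numpy as np

def fdp_and_power(selected, is_member):
    """Return (FDP, power) for one run.

    selected: 0-based indices of the test points declared training members.
    is_member: ground-truth membership labels for all m test points.
    """
    is_member = np.asarray(is_member, dtype=bool)
    selected = np.asarray(selected, dtype=int)
    false_discoveries = np.sum(~is_member[selected])
    true_discoveries = np.sum(is_member[selected])
    # max(., 1) mirrors the FDR definition and avoids division by zero.
    fdp = false_discoveries / max(selected.size, 1)
    power = true_discoveries / max(int(is_member.sum()), 1)
    return float(fdp), float(power)

# Three selections, one of which is a non-member; three true members overall.
fdp, power = fdp_and_power([0, 1, 2], [1, 1, 0, 1, 0])
print(round(fdp, 4), round(power, 4))  # -> 0.3333 0.6667
```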

Experimental Results

Main Results

FDR Control Effectiveness

The PTDI method strictly controls FDR below the target level across all experimental settings:

Pythia-1.4B on WikiMIA, target FDR=5%: PTDI achieves an empirical FDR of 4.94%, versus 13.11% for KTD, the knockoff-based baseline of Hu et al. (2025)
All model and dataset combinations show actual FDR below target level

Statistical Power Improvement

P-value scaling significantly improves statistical power:

GPT-NeoX-20B on WikiMIA, target FDR=0.5, MIN-K% score: power improves from 0.44 to 0.75
Across different target FDR levels, the scaling method consistently outperforms the vanilla method

Ablation Studies

Impact of Calibration Set Size

Increasing calibration set size (ρ = n/m from 0.1 to 1.0) reduces variance in FDP and power
All ρ values effectively control FDR

Robustness of Hyperparameter η

Method robustly controls FDR across η ∈ {0.01, 0.05, 0.1, 0.5}
Default setting η = 0.05

Robustness to π_test Variation

Maintains FDR control across different data usage proportions (π_test = 0.3, 0.5, 0.7)
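A synthetic sanity check of this behavior can be sketched as follows. It does not reproduce the paper's experiments: the Gaussian score model (non-members N(0,1), members N(-3,1)), the fixed cutoff `tau = 0`, the sample sizes, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def run_trial(rng, n, m, pi_test, alpha, tau):
    """One synthetic run of the four steps; returns (FDP, power)."""
    cal = rng.normal(0.0, 1.0, size=n)            # calibration: non-members only
    is_member = rng.random(m) < pi_test
    test = np.where(is_member,
                    rng.normal(-3.0, 1.0, size=m),  # members: lower scores
                    rng.normal(0.0, 1.0, size=m))
    # Step 1: conformal p-values.
    p = (1.0 + np.searchsorted(np.sort(cal), test, side="right")) / (n + 1.0)
    # Step 2: subtraction estimator with R = (tau, +inf).
    cal_frac = np.mean(cal > tau)
    pi_hat = 1.0 - ((1.0 + np.sum(test > tau)) / (m + 1.0)) / cal_frac
    # Steps 3 and 4: scale the p-values, then apply BH.
    scaled = (1.0 - pi_hat) * p
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(np.sort(scaled) <= thresholds)[0]
    if passed.size == 0:
        return 0.0, 0.0
    selected = scaled <= thresholds[passed[-1]]
    fdp = np.sum(selected & ~is_member) / max(int(selected.sum()), 1)
    power = np.sum(selected & is_member) / max(int(is_member.sum()), 1)
    return float(fdp), float(power)

def simulate(pi_test, alpha=0.1, n=200, m=200, tau=0.0, trials=200, seed=0):
    """Average FDP and power over repeated synthetic runs."""
    rng = np.random.default_rng(seed)
    results = np.array([run_trial(rng, n, m, pi_test, alpha, tau)
                        for _ in range(trials)])
    return results[:, 0].mean(), results[:, 1].mean()

for pi_test in (0.3, 0.5, 0.7):
    fdr, power = simulate(pi_test)
    print(f"pi_test={pi_test}: empirical FDR={fdr:.3f}, power={power:.3f}")
```

Under these assumptions, the average FDP should stay close to or below the target α = 0.1 for each value of π_test, although individual runs fluctuate.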

Comparison with KTD Method

PTDI strictly controls FDR across all test settings
KTD loses control on WikiMIA and XSum at certain α values
When FDR control is effective, PTDI shows superior power on GPT-2

Adjusted Moment Estimator

Proposes a bias-corrected moment estimator π̂_mom that further improves power when confirmed member data is available, while maintaining FDR control.

Related Work

Training Data Detection in Large-Scale Models

Data Contamination Research: Preventing benchmark data leakage into training sets
Heuristic Detection Scores: Methods like perplexity and MIN-k% lack theoretical guarantees
Statistically Rigorous Methods: Approaches by Dekoninck et al. and Oren et al. apply only at the dataset level

Membership Inference Attacks

Privacy Perspective: MIA aims to determine whether specific data points were used for training
Binary Classification Methods: Focus on average classification accuracy
Hypothesis Testing Framework: Methods like Attack-P prioritize TPR at low FPR

FDR Control

Benjamini-Hochberg Procedure: Classical FDR control tool
Conformal P-values: Jin & Candès' method requires strong i.i.d. assumptions
Knockoff Statistics: Hu et al.'s method requires high-quality knockoff generation

Conclusions and Discussion

Main Conclusions

The PTDI method achieves rigorous FDR control with distribution-free finite-sample guarantees
P-value scaling technique significantly improves statistical power while maintaining theoretical rigor
The method has broad applicability and can be combined with existing detection methods

Limitations

Calibration Set Requirement: Requires an unseen calibration set with distribution similar to the test set
Heterogeneous Data Challenges: Constructing representative calibration sets is difficult for highly heterogeneous test data
Distribution Mismatch: Significant distribution mismatch between calibration and test data may invalidate FDR guarantees

Future Directions

Develop more robust methods for estimating data usage proportions
Study FDR control under distribution mismatch
Extend to more complex detection scenarios

In-Depth Evaluation

Strengths

Theoretical Rigor: Provides complete mathematical proofs and finite-sample guarantees
Strong Practicality: Simple and easy to implement, compatible with existing tools
Comprehensive Experiments: Extensive evaluation across multiple models, tasks, and datasets
Innovation: P-value scaling technique cleverly addresses the conservativeness of the BH procedure

Weaknesses

Assumption Limitations: Depends on the assumption of obtaining suitable calibration sets
Computational Overhead: Requires computing detection scores for numerous candidate data points
Parameter Selection: While robust to η, optimal selection still requires empirical guidance

Impact

Academic Contribution: Provides a rigorous statistical framework for training data identification with distribution-free, finite-sample FDR guarantees
Practical Value: Direct applications in copyright litigation and privacy audits
Reproducibility: Clear algorithm description, easy to reproduce and extend

Applicable Scenarios

Copyright Protection: Identifying copyrighted content used in model training
Privacy Audits: Verifying whether personal data was used for model training
Benchmark Evaluation: Detecting and removing contaminated samples in evaluation datasets
Model Auditing: Verifying model compliance in regulatory environments

References

The paper cites important works including:

Benjamini & Hochberg (1995): Classical BH procedure for FDR control
Shi et al. (2024): WikiMIA dataset and MIN-K% detection method
Hu et al. (2025): Training data detection based on knockoff statistics
Jin & Candès (2023): Conformal p-values in selection problems

Summary: PTDI combines conformal p-values, a conservative estimate of the data usage proportion, and the BH procedure to identify training data with provable FDR control and higher power than the compared baselines. The work offers a statistically grounded tool for addressing transparency and accountability in AI models.