Secret-Protected Evolution for Differentially Private Synthetic Text Generation

Wang, Chen, Du et al.

Text data has become extremely valuable for large language models (LLMs) and may even pave the way toward artificial general intelligence (AGI). Much high-quality real-world text is private and cannot be freely used due to privacy concerns. Differentially private (DP) synthetic text generation has therefore been proposed, aiming to produce high-utility synthetic data while protecting sensitive information. However, existing DP synthetic text generation imposes uniform guarantees that often overprotect non-sensitive content, resulting in substantial utility loss and computational overhead. To address this, we propose Secret-Protected Evolution (SecPE), a novel framework that extends private evolution with secret-aware protection. Theoretically, we show that SecPE satisfies (p,r)-secret protection, a relaxation of Gaussian DP that enables tighter utility-privacy trade-offs, while also substantially reducing computational complexity relative to baseline methods. Empirically, across the OpenReview, PubMed, and Yelp benchmarks, SecPE consistently achieves lower Fréchet Inception Distance (FID) and higher downstream task accuracy than GDP-based Aug-PE baselines, while requiring less noise to attain the same level of protection. Our results highlight that secret-aware guarantees can unlock more practical and effective privacy-preserving synthetic text generation.

Basic Information

Paper ID: 2510.10990
Title: Secret-Protected Evolution for Differentially Private Synthetic Text Generation
Authors: Tianze Wang¹,², Zhaoyu Chen¹, Jian Du¹†, Yingtai Xiao¹, Linjun Zhang², Qiang Yan¹ (¹TikTok, ²Rutgers University)
Classification: cs.CR (Cryptography and Security), cs.CL (Computation and Language), cs.NE (Neural and Evolutionary Computing)
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10990

Abstract

Text data has become extremely valuable in large language models (LLMs) and may even drive the development of artificial general intelligence (AGI). However, many high-quality text datasets in the real world are private and cannot be freely used due to privacy concerns. Consequently, differentially private (DP) synthetic text generation has been proposed to generate high-utility synthetic data while protecting sensitive information. However, existing DP synthetic text generation methods impose uniform guarantees that often over-protect non-sensitive content, resulting in significant utility loss and computational overhead. This paper proposes Secret-Protected Evolution (SecPE), a novel framework that extends private evolution through secret-aware protection. We theoretically prove that SecPE satisfies (p,r)-secret protection, which constitutes a relaxation of Gaussian DP, achieving tighter utility-privacy tradeoffs while substantially reducing computational complexity relative to baseline methods.

Research Background and Motivation

Problem Definition

With the rapid development of large language models, the value of text data has become increasingly prominent. However, training and adapting these models typically rely on large quantities of private user text, which poses serious privacy risks, including memorization and leakage of sensitive content.

Problem Significance

Data Value vs. Privacy Conflict: High-quality text data is crucial for LLMs, but the use of private data is restricted by privacy regulations
Limitations of Existing Methods: Traditional differential privacy approaches provide uniform protection for all records, even when sensitive information may be sparse and vary across users and attributes
Computational Efficiency Issues: Existing private evolution (PE) methods require extensive pairwise similarity computations, incurring substantial computational overhead

Research Motivation

Existing DP methods assume all records are equally sensitive, but in reality:

Sensitive information may be sparsely distributed
Sensitivity levels differ across users and attributes
Secrets may be repeated across records
Uniform guarantees lead to over-protection and utility loss

Core Contributions

Proposes SecPE Framework: A private synthetic data generation framework emphasizing secret protection rather than traditional DP, improving utility by reducing the noise typically required by DP
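Develops Secret-Protected Clustering Method: Significantly reduces runtime complexity compared to PE methods, from O(M·N_syn) to O(K·N_syn), where K is the number of clusters, M the number of private samples, N_syn the number of synthetic samples, and K ≪ M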
Theoretical Guarantees: Proves that SecPE satisfies (p,r)-secret protection, which is a relaxation of Gaussian DP
Experimental Validation: On OpenReview, PubMed, and Yelp datasets, SecPE achieves higher efficiency, lower FID, and better downstream accuracy under equivalent reconstruction guarantees

Methodology Details

Task Definition

Given a private text dataset containing sensitive secrets, generate high-quality synthetic text data such that:

Statistical properties similar to the original data are maintained
Specific secrets are protected from reconstruction
Good performance is preserved in downstream tasks

Secret Protection Definition

Definition 3.1 (Secret Protection): Let D = {x₁,...,xₙ} be a training dataset in which each sample may contain secrets from S = {s₁,...,sₘ}. For each secret sⱼ ∈ S, let πⱼ be a prior distribution over candidate datasets {D¹ⱼ,...,Dᴷⱼ} satisfying Pr(Dᵏⱼ) ≤ pⱼ for every k, where D and Dᵏⱼ differ only in the presence of sⱼ. A randomized mechanism A satisfies (p,r)-secret protection if, for any reconstruction attack B and with Dⱼ drawn from πⱼ:

Pr[B(A(Dⱼ)) = sⱼ] ≤ rⱼ, ∀j

Model Architecture

The SecPE framework comprises two core components:

1. Secret Clustering

Objective: Cluster public data first, then update the cluster statistics with noised private data to form representative centers
Algorithm Flow:
1. Execute K-means clustering on public data: {(eₖ, nₖ)}ᴷₖ₌₁ = Kmeans(Dpub, K)
2. Assign private data to nearest public centers
3. Add calibrated noise to update clustering statistics
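
Steps 2 and 3 above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the public centers are assumed to have been computed already (step 1), and `noisy_cluster_stats`, `sigma`, `weights`, and `seed` are hypothetical names. In particular, `sigma` is a free parameter here, whereas the paper calibrates the noise through SecretNoise.

```python
import math
import random


def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def noisy_cluster_stats(private_points, centers, sigma, weights=None, seed=0):
    """Assign private points to fixed public centers and return noisy statistics.

    `sigma` is a hypothetical, uncalibrated noise scale; `weights` are optional
    per-sample weights in [0, 1]. Returns (noisy_counts, noisy_sums).
    """
    rng = random.Random(seed)
    dim = len(centers[0])
    if weights is None:
        weights = [1.0] * len(private_points)

    counts = [0.0] * len(centers)
    sums = [[0.0] * dim for _ in centers]
    # Step 2: each private point adds its weight to the nearest public center.
    for point, w in zip(private_points, weights):
        k = nearest_center(point, centers)
        counts[k] += w
        for d in range(dim):
            sums[k][d] += w * point[d]

    # Step 3: perturb the aggregated statistics with Gaussian noise.
    noisy_counts = [c + rng.gauss(0.0, sigma) for c in counts]
    noisy_sums = [[v + rng.gauss(0.0, sigma) for v in row] for row in sums]
    return noisy_counts, noisy_sums


# Toy data: two public centers, assumed to come from K-means on public data.
centers = [(0.0, 0.0), (10.0, 10.0)]
private_points = [(0.5, -0.2), (9.0, 11.0), (10.5, 9.5)]
noisy_counts, noisy_sums = noisy_cluster_stats(private_points, centers, sigma=0.5)
```

Only the aggregated per-cluster counts and sums are perturbed, so later stages never need to touch individual private records.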

2. Protected Evolution

Objective: Perform iterative selection based on noisy representatives rather than direct voting on private data
Advantage: Reduces complexity from O(M·N_syn) to O(K·N_syn)
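
The complexity reduction can be illustrated with a short sketch in which each of the K noisy representatives, rather than each of the M private samples, votes for its nearest synthetic candidate. The function names and the top-n selection rule are assumptions made for illustration; the paper's actual selection step may differ.

```python
import math


def representative_histogram(centers, noisy_counts, synthetic_points):
    """Each representative votes, with its noisy count, for its nearest synthetic sample.

    Uses K * N_syn distance evaluations (K = len(centers)) instead of the
    M * N_syn needed when every private sample votes individually.
    """
    histogram = [0.0] * len(synthetic_points)
    for center, count in zip(centers, noisy_counts):
        nearest = min(
            range(len(synthetic_points)),
            key=lambda i: math.dist(center, synthetic_points[i]),
        )
        histogram[nearest] += count
    # Noise can push votes below zero; clipping at zero is an assumption here.
    return [max(h, 0.0) for h in histogram]


def select_top(histogram, synthetic_points, n_keep):
    """Keep the n_keep synthetic samples with the most votes (simplified rule)."""
    ranked = sorted(range(len(histogram)), key=lambda i: histogram[i], reverse=True)
    return [synthetic_points[i] for i in ranked[:n_keep]]


# Toy example: three representatives voting over three synthetic candidates.
centers = [(0.0, 0.0), (10.0, 10.0), (9.0, 9.0)]
noisy_counts = [4.2, 7.9, 1.1]
synthetic_points = [(0.1, 0.1), (5.0, 5.0), (9.5, 9.5)]
hist = representative_histogram(centers, noisy_counts, synthetic_points)
survivors = select_top(hist, synthetic_points, n_keep=2)
```

Because the votes come from statistics that were already noised, the histogram step itself is post-processing and needs no further access to private data.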

Noise Calibration

Algorithm 1 (SecretNoise): Assign weights to each private sample through linear programming:

max Σwᵢ subject to Σwᵢ ≤ ηⱼ, wᵢ ∈ [0,1]

where ηⱼ = Φ⁻¹(1-pⱼ) - Φ⁻¹(1-rⱼ) serves as the capacity constraint.
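
To give a feel for the magnitude of this capacity, the sketch below evaluates ηⱼ with Python's standard library for the prior p = 1×10⁻⁴ and the ratios r/p used in the experiments. The function name `capacity` and its input checks are assumptions; only the formula comes from the text above.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal; inv_cdf plays the role of Phi^{-1}


def capacity(p, r):
    """Return eta = Phi^{-1}(1 - p) - Phi^{-1}(1 - r) for prior bound p and success bound r."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    if r >= 1.0:
        # r >= 1 places no limit on the attacker, i.e. the non-private case.
        return float("inf")
    if r < p:
        # Assumption: a success bound below the prior is treated as invalid.
        raise ValueError("r must be at least p")
    return _STD_NORMAL.inv_cdf(1.0 - p) - _STD_NORMAL.inv_cdf(1.0 - r)


p = 1e-4
for ratio in (2, 10, 50, float("inf")):
    eta = capacity(p, p * ratio)
    print(f"r/p = {ratio}: eta = {eta:.3f}")
```

For p = 1×10⁻⁴ this gives roughly 0.18, 0.63, and 1.14 for r/p = 2, 10, and 50, so a tighter reconstruction bound leaves a smaller capacity to distribute over the sample weights.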

Technical Innovations

From Membership Privacy to Secret Protection: Rather than protecting dataset membership, the method protects specific secret content
Clustering Acceleration: Replaces point-wise voting with representative voting, substantially improving computational efficiency
Relaxed DP Constraints: (p,r)-secret protection constrains the adversary's reconstruction success rate only at a single prior level p, rather than along the entire trade-off curve as Gaussian DP does

Experimental Setup

Datasets

OpenReview: ICLR 2023 paper reviews annotated by research area and recommendation rating
PubMed: Medical paper abstracts
Yelp: User business reviews annotated by business category and rating

Evaluation Metrics

Computational Efficiency: GPU hours and histogram computation time
Downstream Performance: Classification accuracy of RoBERTa/BERT fine-tuned on synthetic data
Real-Synthetic Similarity: FID on text embeddings and text length distribution comparison
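
For reference, the Fréchet distance behind FID is conventionally computed between Gaussians fitted to the real and synthetic embedding sets. The symbols below (mean μ and covariance Σ, with subscripts r for real and s for synthetic) are notation introduced here, and it is an assumption that the paper applies exactly this standard form to its text embeddings.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_s \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2\,(\Sigma_r \Sigma_s)^{1/2} \right)
```

Lower values mean the synthetic embedding distribution is closer to the real one.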

Baseline Methods

Aug-PE: Enhanced private evolution method based on μ-GDP
Varying Cluster Numbers K: SecPE₂₀₀₀, SecPE₃₀₀₀, SecPE₄₀₀₀ variants

Implementation Details

Generation Models: GPT-2, Qwen-2.5-1.5B (main experiments), Llama-3.1-8B, GPT-4o-mini (ablations)
Embedding Model: Sentence-Transformers
Privacy Budget: p = 1×10⁻⁴, r/p ∈ {2, 10, 50, ∞}

Experimental Results

Main Results

Runtime Comparison

Table 2 demonstrates significant acceleration in histogram construction by SecPE:

OpenReview: 126.9s → 1.5s (84× speedup)
PubMed: 32.2s → 0.5s (64× speedup)
Yelp: 30126.4s → 2.3s (approximately 13,000× speedup)
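
The speedup factors follow directly from the reported timings, as the short check below shows. The dictionary layout and variable names are illustrative; the numbers are the ones listed above.

```python
# (seconds before, seconds after) for histogram construction, as listed above.
timings = {
    "OpenReview": (126.9, 1.5),
    "PubMed": (32.2, 0.5),
    "Yelp": (30126.4, 2.3),
}

# Speedup is the ratio of the slower time to the faster time.
speedups = {name: before / after for name, (before, after) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

The ratios come out at about 84.6, 64.4, and 13,098, consistent with the rounded figures quoted above.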

Downstream Task Performance

SecPE matches or outperforms Aug-PE once privacy constraints are imposed, while the two are close in the non-private setting (r/p = ∞):

PubMed (Table 3):

GPT-2 + BERT-small: Aug-PE drops from 29.70 to 24.93 as r/p tightens from ∞ to 2, while SecPE stays essentially flat (29.19 to 29.18)
Advantages of SecPE become more pronounced with stricter privacy requirements

Yelp (Table 5):

At r/p=2, SecPE₈₀₀ achieves 72.74% on category classification vs. Aug-PE's 71.53%
On rating classification, SecPE₈₀₀ achieves 62.46% vs. Aug-PE's 47.02%

Real-Synthetic Similarity

Figure 2 shows that as r/p decreases, SecPE achieves lower FID than Aug-PE (that is, higher similarity to the real data), while in the non-private setting its FID is slightly higher but essentially comparable.

Ablation Studies

LLM Selection Impact (Table 6)

Stronger LLMs produce better results:

GPT-4o-mini (74.84, 62.96) > GPT-2 (73.82, 58.36)
Qwen-2.5-7B (74.56, 63.06) > Qwen-2.5-1.5B (73.12, 62.08)

Related Work

DP-Generator: Trains language models using DP-SGD, which is computationally intensive and requires large quantities of high-quality private data
Private Evolution (PE): Accesses base models via API, iteratively updating randomly initialized samples
This Work's Contribution: Transitions from uniform protection to secret-aware protection

Secret Protection vs. Differential Privacy

Traditional DP: Protects dataset membership and applies the same guarantee uniformly to all records
Secret Protection: Calibrates guarantees for specific secrets, allowing public data to remain unprotected

Conclusions and Discussion

Main Conclusions

SecPE achieves better utility-privacy tradeoffs through secret-aware protection
The clustering method significantly improves computational efficiency
SecPE consistently outperforms GDP-based baselines across multiple datasets
Stronger LLMs produce higher-quality synthetic text

Limitations

Clustering Abstraction Loss: Clustering may abstract away fine-grained details, potentially causing slight utility loss in non-private settings
Secret Definition Challenges: How to formally define secrets and quantify their sensitivity remains an open question
Applicability Scope: The method assumes sensitive information is sparse and repetitive, which may not apply to all scenarios

Future Directions

Explore heterogeneous, secret-specific budgets and adaptive priors
Extend to image domain and investigate secret-protected generators
Further standardize private data usage

In-Depth Evaluation

Strengths

Theoretical Innovation: The (p,r)-secret protection concept is novel and provides a new perspective on privacy protection
Practical Value: Significant computational acceleration makes the method more applicable in practice
Comprehensive Experiments: Full evaluation across multiple datasets and metrics
Technical Soundness: Rigorous theoretical analysis and proofs

Weaknesses

Secret Identification: The paper insufficiently discusses how to identify and define "secrets" in practice
Limited Baselines: Primarily compares against one baseline method, lacking comparison with other DP text generation approaches
Generalization: Limited improvements on PII tasks; the method's generalization capability requires further verification

Impact

Academic Contribution: Provides a new theoretical framework for privacy-preserving synthetic data generation
Practical Value: Significant computational efficiency improvements make the method more suitable for large-scale applications
Reproducibility: Provides implementation details and hyperparameter settings

Applicable Scenarios

Text data where sensitive information is sparse and types are known
Applications requiring large-scale privacy-preserving text generation
Scenarios with high computational efficiency requirements
Domain applications where "secrets" can be explicitly defined

References

The paper cites important works in privacy protection, differential privacy, and text generation, including:

Abadi et al. (2016): Foundational DP-SGD work
Dong et al. (2019): Gaussian differential privacy theory
Xie et al. (2024): Private Evolution method
Ganesh et al. (2025): Secret protection theoretical foundations