Secret-Protected Evolution for Differentially Private Synthetic Text Generation

Wang, Chen, Du et al.
Text data has become extremely valuable for large language models (LLMs) and may even pave the way toward artificial general intelligence (AGI). Much of the high-quality text in the real world is private and cannot be freely used due to privacy concerns. Differentially private (DP) synthetic text generation has therefore been proposed, aiming to produce high-utility synthetic data while protecting sensitive information. However, existing DP synthetic text generation imposes uniform guarantees that often overprotect non-sensitive content, resulting in substantial utility loss and computational overhead. To address this, we propose Secret-Protected Evolution (SecPE), a novel framework that extends private evolution with secret-aware protection. Theoretically, we show that SecPE satisfies (p,r)-secret protection, a relaxation of Gaussian DP that enables tighter utility-privacy trade-offs, while also substantially reducing computational complexity relative to baseline methods. Empirically, across the OpenReview, PubMed, and Yelp benchmarks, SecPE consistently achieves lower Fréchet Inception Distance (FID) and higher downstream task accuracy than GDP-based Aug-PE baselines, while requiring less noise to attain the same level of protection. Our results highlight that secret-aware guarantees can unlock more practical and effective privacy-preserving synthetic text generation.

Basic Information

  • Paper ID: 2510.10990
  • Title: Secret-Protected Evolution for Differentially Private Synthetic Text Generation
  • Authors: Tianze Wang¹,², Zhaoyu Chen¹, Jian Du¹†, Yingtai Xiao¹, Linjun Zhang², Qiang Yan¹ (¹TikTok, ²Rutgers University)
  • Classification: cs.CR (Cryptography and Security), cs.CL (Computation and Language), cs.NE (Neural and Evolutionary Computing)
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10990

Abstract

Text data has become extremely valuable in large language models (LLMs) and may even drive the development of artificial general intelligence (AGI). However, many high-quality text datasets in the real world are private and cannot be freely used due to privacy concerns. Consequently, differentially private (DP) synthetic text generation has been proposed to generate high-utility synthetic data while protecting sensitive information. However, existing DP synthetic text generation methods impose uniform guarantees that often over-protect non-sensitive content, resulting in significant utility loss and computational overhead. This paper proposes Secret-Protected Evolution (SecPE), a novel framework that extends private evolution through secret-aware protection. We theoretically prove that SecPE satisfies (p,r)-secret protection, which constitutes a relaxation of Gaussian DP, achieving tighter utility-privacy tradeoffs while substantially reducing computational complexity relative to baseline methods.

Research Background and Motivation

Problem Definition

With the rapid development of large language models, the value of text data has become increasingly prominent. However, training and adapting these models typically relies on large quantities of private user text data, which poses serious privacy risks, including memorization and leakage of sensitive content.

Problem Significance

  1. Data Value vs. Privacy Conflict: High-quality text data is crucial for LLMs, but the use of private data faces restrictions from privacy regulations
  2. Limitations of Existing Methods: Traditional differential privacy approaches provide uniform protection for all records, even when sensitive information may be sparse and vary across users and attributes
  3. Computational Efficiency Issues: Existing private evolution (PE) methods require extensive pairwise similarity computations, incurring substantial computational overhead

Research Motivation

Existing DP methods assume all records are equally sensitive, but in reality:

  • Sensitive information may be sparsely distributed
  • Sensitivity levels differ across users and attributes
  • Secrets may be repeated across records
  • Uniform guarantees lead to over-protection and utility loss

Core Contributions

  1. Proposes SecPE Framework: A private synthetic data generation framework emphasizing secret protection rather than traditional DP, improving utility by reducing the noise typically required by DP
  2. Develops Secret-Protected Clustering Method: Significantly reduces runtime complexity compared to PE methods, from O(M·N_syn) to O(K·N_syn), where K ≪ M
  3. Theoretical Guarantees: Proves that SecPE satisfies (p,r)-secret protection, which is a relaxation of Gaussian DP
  4. Experimental Validation: On OpenReview, PubMed, and Yelp datasets, SecPE achieves higher efficiency, lower FID, and better downstream accuracy under equivalent reconstruction guarantees

Methodology Details

Task Definition

Given a private text dataset containing sensitive secrets, generate high-quality synthetic text data such that:

  1. Statistical properties similar to the original data are maintained
  2. Specific secrets are protected from reconstruction
  3. Good performance is preserved in downstream tasks

Secret Protection Definition

Definition 3.1 (Secret Protection): Let D = {x₁,...,xₙ} be a training dataset in which each sample may contain secrets from S = {s₁,...,sₘ}. For each secret sⱼ ∈ S, let πⱼ be a prior distribution over candidate datasets {D¹ⱼ,...,Dᴷⱼ} satisfying Pr(Dᵏⱼ) ≤ pⱼ for every k, where D and Dᵏⱼ differ only in the presence of sⱼ. A randomized mechanism A satisfies (p,r)-secret protection if, for any reconstruction attack B and a dataset Dⱼ drawn from πⱼ:

Pr[B(A(Dⱼ)) = sⱼ] ≤ rⱼ, ∀j

Model Architecture

The SecPE framework comprises two core components:

1. Secret Clustering

  • Objective: Perform clustering using public data, then update with noisy private data to form representative centers
  • Algorithm Flow:
    1. Run K-means clustering on the public data: {(eₖ, nₖ)}ᴷₖ₌₁ = Kmeans(D_pub, K)
    2. Assign private data to nearest public centers
    3. Add calibrated noise to update clustering statistics
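
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the choice of statistics (weighted counts and sums per cluster), and the single noise scale `sigma` are assumptions made for the example, and the centers are taken as already computed from public data.

```python
import random

def nearest_center(x, centers):
    """Index of the center closest to x in squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centers[k])))

def noisy_cluster_stats(private_embs, weights, centers, sigma, rng):
    """Weighted per-cluster counts and embedding sums with Gaussian noise added.

    `centers` are assumed to come from K-means on public data (step 1).
    `weights` and `sigma` are placeholders; how they are calibrated to the
    secret-protection target is not shown here.
    """
    num_clusters, dim = len(centers), len(centers[0])
    counts = [0.0] * num_clusters
    sums = [[0.0] * dim for _ in range(num_clusters)]
    for x, w in zip(private_embs, weights):
        k = nearest_center(x, centers)          # step 2: assign to nearest public center
        counts[k] += w
        for j in range(dim):
            sums[k][j] += w * x[j]
    # step 3: perturb the statistics before they leave the private side
    noisy_counts = [c + rng.gauss(0.0, sigma) for c in counts]
    noisy_sums = [[s + rng.gauss(0.0, sigma) for s in row] for row in sums]
    return noisy_counts, noisy_sums

if __name__ == "__main__":
    centers = [[0.0, 0.0], [1.0, 1.0]]
    private = [[0.1, 0.0], [0.9, 1.1], [1.2, 0.8]]
    print(noisy_cluster_stats(private, [1.0, 0.5, 0.5], centers, 0.5, random.Random(0)))
```

Only the K noisy statistics are used downstream, so later steps never touch individual private records.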

2. Protected Evolution

  • Objective: Perform iterative selection based on noisy representatives rather than direct voting on private data
  • Advantage: Reduces complexity from O(M·N_syn) to O(K·N_syn)
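
To make the complexity claim concrete, here is a hedged sketch of representative voting: each of the K cluster centers casts one weighted vote for its nearest synthetic sample, so the number of distance evaluations is K times the number of synthetic samples instead of scaling with the number of private voters. The function names and the clipping of negative noisy counts are assumptions for illustration, not details taken from the paper.

```python
def representative_votes(centers, noisy_counts, synthetic_embs):
    """Histogram over synthetic samples built from K representative votes."""
    hist = [0.0] * len(synthetic_embs)
    for center, count in zip(centers, noisy_counts):
        nearest = min(
            range(len(synthetic_embs)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(center, synthetic_embs[i])),
        )
        # Assumption: negative noisy counts are clipped to zero before voting.
        hist[nearest] += max(count, 0.0)
    return hist

def distance_evaluations(num_voters, num_synthetic):
    """Pairwise distance computations needed for one voting round."""
    return num_voters * num_synthetic

if __name__ == "__main__":
    centers = [[0.0, 0.0], [1.0, 1.0]]
    synthetic = [[0.1, 0.1], [0.9, 0.9], [5.0, 5.0]]
    print(representative_votes(centers, [3.2, -0.4], synthetic))
    # Hypothetical sizes, chosen only to show the ratio between the two costs.
    print(distance_evaluations(100_000, 2_000), distance_evaluations(2_000, 2_000))
```

Synthetic samples with high histogram mass would then be kept and varied in the next evolution round, as in private evolution.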

Noise Calibration

Algorithm 1 (SecretNoise): Assign weights to each private sample through linear programming:

max Σwᵢ subject to Σwᵢ ≤ ηⱼ, wᵢ ∈ [0,1]

where ηⱼ = Φ⁻¹(1-pⱼ) - Φ⁻¹(1-rⱼ) serves as the capacity constraint.
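
As a worked example, the capacity ηⱼ can be evaluated with the standard normal quantile function. The sketch below uses Python's standard library; the function names are hypothetical, and the greedy weight assignment covers only the simplest case of a single secret, where the program above reduces to filling weights until the capacity is used up. With several overlapping secrets a general LP solver would be needed instead.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def capacity(p, r):
    """eta = Phi^{-1}(1 - p) - Phi^{-1}(1 - r) for prior bound p and target r."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    if r >= 1.0:
        return math.inf          # r/p = infinity: no constraint on this secret
    if r < p:
        raise ValueError("this sketch expects r >= p")
    return _STD_NORMAL.inv_cdf(1.0 - p) - _STD_NORMAL.inv_cdf(1.0 - r)

def greedy_weights(num_samples, eta):
    """Maximize sum(w) subject to sum(w) <= eta and 0 <= w_i <= 1 (one secret only)."""
    weights, budget = [], eta
    for _ in range(num_samples):
        w = min(1.0, max(budget, 0.0))
        weights.append(w)
        budget -= w
    return weights

if __name__ == "__main__":
    p = 1e-4
    for ratio in (2, 10, 50):
        print(ratio, capacity(p, ratio * p))
    print(greedy_weights(5, 1.5))
```

A larger ratio r/p yields a larger capacity, which is consistent with the summary's claim that looser reconstruction targets need less noise.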

Technical Innovations

  1. From Membership Privacy to Secret Protection: Rather than protecting dataset membership, the method protects specific secret content
  2. Clustering Acceleration: Replaces point-wise voting with representative voting, substantially improving computational efficiency
  3. Relaxed DP Constraints: (p,r)-secret protection only bounds the adversary's reconstruction success at a single prior level p, i.e., one point on the trade-off curve, rather than constraining the entire trade-off curve as Gaussian DP does

Experimental Setup

Datasets

  1. OpenReview: ICLR 2023 paper reviews annotated by research area and recommendation rating
  2. PubMed: Medical paper abstracts
  3. Yelp: User business reviews annotated by business category and rating

Evaluation Metrics

  1. Computational Efficiency: GPU hours and histogram computation time
  2. Downstream Performance: Classification accuracy of RoBERTa/BERT fine-tuned on synthetic data
  3. Real-Synthetic Similarity: FID on text embeddings and text length distribution comparison
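
For readers unfamiliar with the similarity metric, the following sketch computes a Fréchet distance between two sets of embeddings. It is deliberately simplified: it assumes diagonal covariances, in which case the trace term reduces to a sum of squared differences of per-dimension standard deviations. The usual FID uses full covariance matrices and a matrix square root, and this is not the paper's evaluation code.

```python
import math

def diag_frechet_distance(real, synth):
    """Fréchet distance between Gaussians fitted with DIAGONAL covariances.

    ||mu_r - mu_s||^2 + sum_j (sqrt(var_r[j]) - sqrt(var_s[j]))^2
    """
    def mean_and_var(rows):
        n, dim = len(rows), len(rows[0])
        mu = [sum(row[j] for row in rows) / n for j in range(dim)]
        var = [sum((row[j] - mu[j]) ** 2 for row in rows) / n for j in range(dim)]
        return mu, var

    mu_r, var_r = mean_and_var(real)
    mu_s, var_s = mean_and_var(synth)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_s))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_s))
    return mean_term + cov_term

if __name__ == "__main__":
    real = [[0.0, 0.0], [2.0, 2.0]]
    shifted = [[1.0, 1.0], [3.0, 3.0]]
    print(diag_frechet_distance(real, real), diag_frechet_distance(real, shifted))
```

Lower values mean the synthetic embedding distribution sits closer to the real one.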

Baseline Methods

  • Aug-PE: Enhanced private evolution method based on μ-GDP
  • SecPE variants with different cluster numbers K, e.g., SecPE₂₀₀₀, SecPE₃₀₀₀, SecPE₄₀₀₀ (the subscript denotes K)

Implementation Details

  • Generation Models: GPT-2, Qwen-2.5-1.5B (main experiments), Llama-3.1-8B, GPT-4o-Mini (ablations)
  • Embedding Model: Sentence-Transformers
  • Privacy Budget: p = 1×10⁻⁴, r/p ∈ {2, 10, 50, ∞}

Experimental Results

Main Results

Runtime Comparison

Table 2 demonstrates significant acceleration in histogram construction by SecPE:

  • OpenReview: 126.9s → 1.5s (84× speedup)
  • PubMed: 32.2s → 0.5s (64× speedup)
  • Yelp: 30126.4s → 2.3s (approximately 13,000× speedup)
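
The speedup factors follow directly from the reported timings. The short check below recomputes them, taking the first number in each pair as the baseline time and the second as the SecPE time.

```python
# Histogram construction times in seconds, as quoted above: (before, after).
timings = {
    "OpenReview": (126.9, 1.5),
    "PubMed": (32.2, 0.5),
    "Yelp": (30126.4, 2.3),
}

def speedup(before, after):
    """Ratio of baseline time to accelerated time."""
    return before / after

speedups = {name: speedup(before, after) for name, (before, after) in timings.items()}

if __name__ == "__main__":
    for name, factor in speedups.items():
        print(f"{name}: {factor:.1f}x")
```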

Downstream Task Performance

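SecPE outperforms Aug-PE across all datasets once a finite reconstruction bound is imposed; in the non-private setting (r/p = ∞) the two are comparable, with Aug-PE sometimes slightly ahead: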

PubMed (Table 3):

  • GPT-2 + BERT-small: Aug-PE drops from 29.70 to 24.93 as r/p tightens from ∞ to 2, while SecPE stays essentially flat (29.19 to 29.18)
  • Advantages of SecPE become more pronounced with stricter privacy requirements

Yelp (Table 5):

  • At r/p=2, SecPE₈₀₀ achieves 72.74% on category classification vs. Aug-PE's 71.53%
  • On rating classification, SecPE₈₀₀ achieves 62.46% vs. Aug-PE's 47.02%

Real-Synthetic Similarity

Figure 2 shows that as r/p decreases, SecPE achieves lower FID than Aug-PE (i.e., its synthetic text is closer to the real data). In the non-private setting, SecPE's FID is slightly higher but essentially comparable.

Ablation Studies

LLM Selection Impact (Table 6)

Stronger LLMs produce better results:

  • GPT-4o-mini (74.84, 62.96) > GPT-2 (73.82, 58.36)
  • Qwen-2.5-7B (74.56, 63.06) > Qwen-2.5-1.5B (73.12, 62.08)

Impact of Cluster Number K

Performance is largely insensitive to the choice of K, indicating that the method is robust to this hyperparameter.

PII Task Results

On real PII detection tasks, SecPE's gains over Aug-PE are modest, though it remains competitive.

Related Work

Differentially Private Text Generation

  1. DP-Generator: Trains language models with DP-SGD, which is computationally intensive and requires large quantities of high-quality private data
  2. Private Evolution (PE): Accesses base models via API, iteratively updating randomly initialized samples
  3. This Work's Contribution: Transitions from uniform protection to secret-aware protection

Secret Protection vs. Differential Privacy

  • Traditional DP: Protects membership relationships, provides uniform protection for all records
  • Secret Protection: Calibrates guarantees for specific secrets, allowing public data to remain unprotected

Conclusions and Discussion

Main Conclusions

  1. SecPE achieves better utility-privacy tradeoffs through secret-aware protection
  2. The clustering method significantly improves computational efficiency
  3. SecPE consistently outperforms the GDP-based Aug-PE baseline across multiple datasets
  4. Stronger LLMs produce higher-quality synthetic text

Limitations

  1. Clustering Abstraction Loss: Clustering may abstract away fine-grained details, potentially causing slight utility loss in non-private settings
  2. Secret Definition Challenges: How to formally define secrets and quantify their sensitivity remains an open question
  3. Applicability Scope: The method assumes sensitive information is sparse and repetitive, which may not apply to all scenarios

Future Directions

  1. Explore heterogeneous, secret-specific budgets and adaptive priors
  2. Extend to image domain and investigate secret-protected generators
  3. Further standardize private data usage

In-Depth Evaluation

Strengths

  1. Theoretical Innovation: The (p,r)-secret protection concept is novel and provides a new perspective on privacy protection
  2. Practical Value: Significant computational acceleration makes the method more applicable in practice
  3. Comprehensive Experiments: Full evaluation across multiple datasets and metrics
  4. Solid Technique: Rigorous theoretical analysis and proofs

Weaknesses

  1. Secret Identification: The paper insufficiently discusses how to identify and define "secrets" in practice
  2. Limited Baselines: Primarily compares against one baseline method, lacking comparison with other DP text generation approaches
  3. Generalization: Limited improvements on PII tasks; the method's generalization capability requires further verification

Impact

  1. Academic Contribution: Provides a new theoretical framework for privacy-preserving synthetic data generation
  2. Practical Value: Significant computational efficiency improvements make the method more suitable for large-scale applications
  3. Reproducibility: Provides detailed implementation details and hyperparameter settings

Applicable Scenarios

  1. Text data where sensitive information is sparse and types are known
  2. Applications requiring large-scale privacy-preserving text generation
  3. Scenarios with high computational efficiency requirements
  4. Domain applications where "secrets" can be explicitly defined

References

The paper cites important works in privacy protection, differential privacy, and text generation, including:

  • Abadi et al. (2016): Foundational DP-SGD work
  • Dong et al. (2019): Gaussian differential privacy theory
  • Xie et al. (2024): Private Evolution method
  • Ganesh et al. (2025): Secret protection theoretical foundations