Text data has become extremely valuable for large language models (LLMs) and may even drive the development of artificial general intelligence (AGI). However, many high-quality real-world text datasets are private and cannot be used freely because of privacy concerns. Differentially private (DP) synthetic text generation has therefore been proposed to produce high-utility synthetic data while protecting sensitive information. Existing DP synthetic text generation methods nevertheless impose uniform guarantees that often over-protect non-sensitive content, resulting in significant utility loss and computational overhead. This paper proposes Secret-Protected Evolution (SecPE), a novel framework that extends private evolution with secret-aware protection. We theoretically prove that SecPE satisfies (p,r)-secret protection, a relaxation of Gaussian DP, which yields tighter utility-privacy tradeoffs while substantially reducing computational complexity relative to baseline methods.
With the rapid development of large language models, the value of text data has become increasingly prominent. However, training and adapting these models typically rely on large quantities of private user text data, which poses serious privacy risks, including memorization and leakage of sensitive content.
Data Value vs. Privacy Conflict: High-quality text data is crucial for LLMs, but the use of private data is restricted by privacy regulations
Limitations of Existing Methods: Traditional differential privacy approaches provide uniform protection for all records, even when sensitive information may be sparse and vary across users and attributes
Proposes SecPE Framework: A private synthetic data generation framework emphasizing secret protection rather than traditional DP, improving utility by reducing the noise typically required by DP
Develops Secret-Protected Clustering Method: Significantly reduces runtime complexity compared to PE methods, from O(M·N_syn) to O(K·N_syn), where K ≪ M
Theoretical Guarantees: Proves that SecPE satisfies (p,r)-secret protection, which is a relaxation of Gaussian DP
Experimental Validation: On OpenReview, PubMed, and Yelp datasets, SecPE achieves higher efficiency, lower FID, and better downstream accuracy under equivalent reconstruction guarantees
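The complexity reduction claimed for the clustering contribution can be illustrated with a minimal sketch. It assumes that M is the number of private samples, N_syn the number of synthetic candidates, and K the number of cluster centers, and that PE-style selection has each private sample vote for its nearest synthetic candidate; the summary does not define these symbols, so this reading is an assumption. The helper names (`nearest_index`, `vote_histogram`, `kmeans`) are hypothetical, the k-means routine is a plain non-private stand-in, and the noise that secret protection would require is deliberately omitted.

```python
import math
import random


def nearest_index(point, candidates):
    """Return the index of the candidate closest to `point` (squared Euclidean)."""
    best, best_dist = 0, math.inf
    for i, cand in enumerate(candidates):
        dist = sum((a - b) ** 2 for a, b in zip(point, cand))
        if dist < best_dist:
            best, best_dist = i, dist
    return best


def vote_histogram(voters, synthetic, weights=None):
    """Each voter adds its weight to its nearest synthetic candidate.

    Cost: len(voters) * len(synthetic) distance evaluations, i.e.
    O(M * N_syn) with raw private samples as voters, or O(K * N_syn)
    with K cluster centers as voters.
    """
    if weights is None:
        weights = [1.0] * len(voters)
    hist = [0.0] * len(synthetic)
    for voter, weight in zip(voters, weights):
        hist[nearest_index(voter, synthetic)] += weight
    return hist


def kmeans(points, k, iters=10, seed=0):
    """Plain (non-private) Lloyd's k-means; returns centers and cluster sizes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_index(p, centers)].append(p)
        # Recompute each center as the mean of its group; keep empty clusters fixed.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    sizes = [0] * k
    for p in points:
        sizes[nearest_index(p, centers)] += 1
    return centers, sizes
```

In this sketch, clustering is done once, and each later voting round calls `vote_histogram(centers, synthetic, weights=sizes)` instead of `vote_histogram(private, synthetic)`. The per-round cost then scales with K rather than M, which is where the saving would come from if K ≪ M.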
Definition 3.1 (Secret Protection): Let D = {x₁,...,xₙ} be a training dataset where each sample may contain secrets from S = {s₁,...,sₘ}. For each secret sⱼ ∈ S, let πⱼ be a prior distribution over the candidate datasets {D¹ⱼ,...,Dᴷⱼ} satisfying Pr(Dᵏⱼ) ≤ pⱼ, where D and Dᵏⱼ differ only in the presence of sⱼ. A randomized mechanism A satisfies (p,r)-secret protection if for any reconstruction attack B:
Relaxed DP Constraints: (p,r)-secret protection only constrains the adversary's success rate for single-point priors, rather than the entire tradeoff curve
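To make the single-point view concrete, the sketch below uses the standard hypothesis-testing bound for a μ-Gaussian-DP mechanism: an event with prior probability at most p has probability at most Φ(Φ⁻¹(p) + μ) after the mechanism's output is observed, where Φ is the standard normal CDF. Reading r as this bound is an assumption made for illustration; the summary does not state the paper's exact conversion. The function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def reconstruction_bound(p, mu):
    """Bound r = Phi(Phi^{-1}(p) + mu) implied by mu-GDP at prior level p.

    Assumption: this is the generic mu-GDP bound evaluated at a single
    point p of the tradeoff curve, not necessarily the paper's theorem.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if mu < 0.0:
        raise ValueError("mu must be non-negative")
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(p) + mu)


def required_mu(p, r):
    """Largest mu whose bound at prior p does not exceed the target r."""
    if not 0.0 < p <= r < 1.0:
        raise ValueError("need 0 < p <= r < 1")
    return _STD_NORMAL.inv_cdf(r) - _STD_NORMAL.inv_cdf(p)
```

Under this reading, a (p,r) target pins down only one point of the Gaussian tradeoff curve. Secrets with different priors or tolerances could therefore be given different noise levels, whereas a uniform guarantee holds every record to the strictest one.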
Figure 2 shows that as r/p decreases, SecPE achieves lower FID (higher similarity), while FID in the non-private setting is slightly higher but essentially comparable.