Voice anonymization techniques have been found to successfully obscure a speaker's acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which pose a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker could exploit an individual's vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose new content anonymization approaches. Our approach performs a contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. Then we show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.
- Paper ID: 2510.12780
- Title: Content Anonymization for Privacy in Long-form Audio
- Authors: Cristina Aggazzotti, Ashi Garg, Zexin Cai, Nicholas Andrews (Johns Hopkins University)
- Classification: cs.SD (Sound), cs.CL (Computation and Language)
- Publication Date: October 14, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.12780
Existing speech anonymization techniques successfully conceal speaker acoustic identity in short, isolated utterances as demonstrated in benchmarks such as the VoicePrivacy Challenge. However, in practical applications, utterances rarely appear in isolation: long-form audio is prevalent in interviews, phone calls, and meetings. In these scenarios, multiple utterances from the same speaker are available, presenting greater privacy risks: attackers can re-identify individuals by exploiting their vocabulary, grammar, and expression patterns, even if their voice is completely disguised. To address this risk, this paper proposes novel content anonymization methods. The approach performs context-aware rewriting of transcribed text within an ASR-TTS pipeline to eliminate speaker-specific stylistic markers while preserving semantics. The research demonstrates the effectiveness of content-based attacks against anonymized speech in long-form telephone conversation settings, then shows how the proposed content-based anonymization methods mitigate this risk while maintaining speech utility.
Existing speech anonymization techniques primarily focus on acoustic identity concealment at the individual utterance level but face significant challenges in long-form audio scenarios:
- Prevalence of Long-form Audio: In practical applications such as interviews, phone calls, and meetings, audio typically contains multiple utterances from the same speaker
- Linguistic Content as Biometric Side-channel: Attackers can exploit speaker-specific linguistic features including vocabulary choices, grammatical structures, and expression habits for identity recognition
- Limitations of Existing Methods: Existing methods anonymize only the acoustic signal and neglect identity information embedded in linguistic content
- Privacy Protection Needs: As voice data is used in more applications, protecting speaker identity becomes more critical
- Practical Application Scenarios: Existing benchmarks diverge from real-world applications, necessitating consideration of long-form audio characteristics
- Multi-modal Threats: Attackers may simultaneously exploit acoustic and linguistic features, requiring comprehensive protection
- Single-modality Protection: Addresses only acoustic features while ignoring linguistic content
- Simplistic PII Handling: Removes only obvious personally identifiable information without addressing linguistic style
- Utterance-level Processing: Lacks consideration of discourse structure in long-form audio
- First Systematic Study: First comprehensive evaluation of content-based attacks against speech anonymization in long-form audio
- Context-aware Rewriting Method: Proposes multi-utterance joint rewriting technique using sliding windows that considers dialogue context
- Privacy-Utility Trade-off Quantification: Quantifies the trade-off between privacy protection and utility using modern generative models and detection systems
- Multi-model Comparison: Compares performance of API models (GPT-4o-mini, GPT-5) and local models (Gemma-3-4B)
- Comprehensive Evaluation Framework: Establishes multi-dimensional evaluation system encompassing privacy protection, content fidelity, and audio naturalness
Given a long-form audio recording X = (u_1, u_2, ..., u_N) consisting of N utterances from a source speaker s, the objective is to produce an anonymized version X′ = g(X) that cannot be attributed to s. Anonymization is considered successful when the attacker's Equal Error Rate (EER) reaches 50%, which corresponds to random guessing.
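To make the 50% target concrete, the following is a minimal sketch of how an EER can be approximated from verification scores. It assumes that higher scores mean "more likely the same speaker" and that labels are 1 for same-speaker trials and 0 otherwise. The function name and the threshold sweep are illustrative choices, not the paper's evaluation code.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a verification system.

    scores: similarity scores, higher = more likely same speaker (assumption).
    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative trial")

    best_gap, eer = float("inf"), None
    # Sweep every observed score as a threshold, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    for threshold in sorted(set(scores)) + [max(scores) + 1.0]:
        # False acceptance: a different-speaker trial scored at or above threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejection: a same-speaker trial scored below threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

An attacker whose scores separate the two classes perfectly obtains an EER of 0, while scores that carry no information about the speaker yield an EER of 0.5, which is the anonymization target described above.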
- ASR Stage: Transcribes raw audio to text using Whisper-medium
- Content Anonymization Stage: Performs rewriting of transcribed text
- TTS Stage: Synthesizes new speech using XTTS with pseudo-target speaker embeddings
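The three stages above can be sketched as a simple cascade. This is only an illustration of the data flow: the `transcribe`, `rewrite`, and `synthesize` callables are hypothetical placeholders for Whisper-medium, the LLM rewriter, and XTTS, and the toy stand-ins below do not call any real model.

```python
from typing import Callable, List, Sequence


def anonymize_recording(
    audio_utterances: Sequence[bytes],
    transcribe: Callable[[bytes], str],
    rewrite: Callable[[List[str]], List[str]],
    synthesize: Callable[[str], bytes],
) -> List[bytes]:
    """Run the cascade: ASR, then content rewriting, then TTS."""
    # ASR stage: one transcript per input utterance.
    transcripts = [transcribe(audio) for audio in audio_utterances]
    # Content anonymization stage: the rewriter sees all transcripts at once,
    # so it may work per utterance or across several utterances.
    rewritten = rewrite(transcripts)
    # TTS stage: synthesize each rewritten utterance in a new voice.
    return [synthesize(text) for text in rewritten]


# Toy stand-ins (assumptions for illustration only, not real models).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_rewrite(transcripts: List[str]) -> List[str]:
    return [text.lower() for text in transcripts]


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")
```

Passing the stages as callables keeps the sketch independent of any particular ASR, LLM, or TTS library, and lets the rewriter return a different number of utterances than it received.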
1. Utterance-level Rewriting (GPT-4o-mini)
- Processes each utterance independently
- Suited to short utterances
2. Segment-level Rewriting (Gemma-3-4B, GPT-5)
- Processes text segments spanning multiple utterances (16 utterances or approximately 300 tokens)
- Captures and modifies broader discourse patterns
- Uses a sliding window that supplies the N=8 preceding utterances as context
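The segment-level strategy can be sketched as follows. The sketch assumes non-overlapping segments of 16 utterances, each paired with up to 8 preceding utterances as read-only context. It ignores the alternative 300-token budget, and the function name and return shape are illustrative rather than taken from the paper.

```python
from typing import List, Tuple


def make_segments(
    utterances: List[str],
    segment_size: int = 16,
    context_size: int = 8,
) -> List[Tuple[List[str], List[str]]]:
    """Split a transcript into (context, segment) pairs for rewriting.

    context: up to `context_size` utterances immediately before the segment,
             shown to the rewriter but not rewritten in this step.
    segment: the utterances to rewrite jointly.
    """
    if segment_size < 1 or context_size < 0:
        raise ValueError("segment_size must be >= 1 and context_size >= 0")

    windows = []
    for start in range(0, len(utterances), segment_size):
        context = utterances[max(0, start - context_size):start]
        segment = utterances[start:start + segment_size]
        windows.append((context, segment))
    return windows
```

For a 40-utterance transcript this yields three windows: the first has no context, and the last covers the remaining 8 utterances with the 8 before them as context.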
- PII Replacement: Substitutes personally identifiable information with fictional but gender-consistent alternatives
- Style Modification: Alters linguistic style to eliminate speaker-specific characteristics
- Length Adjustment: Compresses content and varies sentence length
- Context Awareness: Considers dialogue history during rewriting
- Multi-utterance Joint Rewriting: Transcends traditional single-utterance processing limitations by considering discourse structure
- Context Window Mechanism: Leverages dialogue history for more accurate rewriting
- Localization Solutions: Provides privacy-preserving yet practical local model alternatives
- Multi-dimensional Optimization: Simultaneously considers privacy protection, semantic fidelity, and detection evasion
- Fisher Speech Corpus: Contains approximately 2000 hours of conversational telephone speech
- Experimental Configuration: Employs "difficult" setting (1944 trials)
- Positive samples (959): Different topic conversations from same speaker
- Negative samples (985): Same topic conversations from different speakers
- VoxCeleb2: Used for generating pseudo-target speaker embeddings
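The "difficult" trial design described above pairs conversations so that topic cannot serve as a shortcut for speaker identity. A minimal sketch of that pairing rule follows, assuming a hypothetical metadata schema with `id`, `speaker`, and `topic` fields. It enumerates all qualifying pairs and does not reproduce how the 1944 trials were actually selected.

```python
from itertools import combinations


def build_hard_trials(conversations):
    """Pair conversations into hard verification trials.

    conversations: list of dicts with 'id', 'speaker', 'topic' keys
                   (a hypothetical schema used only for illustration).
    Returns (positives, negatives) as lists of (id_a, id_b) pairs.
    """
    positives, negatives = [], []
    for a, b in combinations(conversations, 2):
        same_speaker = a["speaker"] == b["speaker"]
        same_topic = a["topic"] == b["topic"]
        if same_speaker and not same_topic:
            # Positive: same speaker, different topics.
            positives.append((a["id"], b["id"]))
        elif same_topic and not same_speaker:
            # Negative: different speakers, same topic.
            negatives.append((a["id"], b["id"]))
    return positives, negatives
```

Pairs that differ in both speaker and topic are discarded, since topic alone could separate them.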
- Equal Error Rate (EER): The attacker's error rate at the operating point where false acceptances and false rejections are equally frequent when deciding whether two recordings come from the same speaker
- Target: EER = 50% (random guessing level)
- UTMOS: Automatically predicted mean opinion score (MOS) for speech naturalness (1-5 scale)
- Semantic Similarity:
- Greedy Alignment Score (GAS)
- Dynamic Time Warping Similarity (DTW-Sim)
- Synthetic Text Detection: Using Binoculars detector
- Synthetic Speech Detection: Using SSL-AASIST detector
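This summary does not define the two semantic similarity scores, so the sketch below shows only one plausible reading, labeled as an assumption. Here GAS matches each original utterance to its most similar rewritten utterance, and DTW-Sim aligns the two sequences monotonically with dynamic time warping. Both operate on precomputed sentence embeddings, represented as plain lists of floats, and every name in the code is illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_alignment_score(src, tgt):
    """Assumed GAS: mean over source utterances of the best cosine match in tgt."""
    if not src or not tgt:
        raise ValueError("both sequences must be non-empty")
    return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)


def dtw_similarity(src, tgt):
    """Assumed DTW-Sim: 1 minus the mean cosine distance along the DTW path."""
    if not src or not tgt:
        raise ValueError("both sequences must be non-empty")
    n, m = len(src), len(tgt)
    inf = float("inf")
    # table[i][j] = (accumulated distance, path length) for src[:i] vs tgt[:j].
    table = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = 1.0 - cosine(src[i - 1], tgt[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            prev = min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
            table[i][j] = (prev[0] + dist, prev[1] + 1)
    total, steps = table[n][m]
    return 1.0 - total / steps
```

Under this reading, the greedy score ignores utterance order while the DTW score rewards rewrites that preserve the order of the conversation, which would explain why the two numbers can diverge for the same model.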
- Audio-only Anonymization: Standard ASR-TTS pipeline without content modification
- Content-only Anonymization: Rewrites content while preserving original voice
- Audio + Content Anonymization: Simultaneously performs content rewriting and voice anonymization
- Speech Attack: WavLM-Base speaker verification model
- Content Attack: LUAR (Learning Universal Authorship Representations) model
- Content-based Attack Threats: As the number of available utterances increases, the content attack's EER falls from approximately 0.4 to 0.1 (40% to 10%), showing that linguistic content alone carries substantial identifying information
- Anonymization Effectiveness: All rewriting methods significantly increase EER, bringing content attacks close to random guessing level
- Model Comparison: Segment-level rewriting (GPT-5, Gemma-3-4B) proves more effective than utterance-level rewriting (GPT-4o-mini)
- Audio Naturalness: Anonymized speech achieves a UTMOS score of 3.14, higher than the 2.09 of the original recordings
- Semantic Fidelity:
- GPT-5: GAS=0.699, DTW-Sim=0.739
- Gemma-3-4B: GAS=0.648, DTW-Sim=0.582
- GPT-4o-mini: GAS=0.678, DTW-Sim=0.702
- Conservative Strategy (Gemma3-4Bc): Retains 50% of the original utterances and is the least detectable as machine-generated
- Complete Rewriting: Provides stronger privacy protection but slightly higher detectability
- Synthetic Speech Detection: More accurate than synthetic text detection, particularly with fewer utterances
- Re-transcription Effect: Re-transcription after synthesis naturally removes certain machine-generated text artifacts
Experiments demonstrate that re-transcription through the ASR-TTS pipeline naturally removes certain machine-generated text characteristics, making final anonymized text more difficult to detect as artificially generated.
- VoicePrivacy Challenge: Primarily focuses on acoustic anonymization of short utterances
- Traditional Methods: kNN-based voice conversion and similar methods perform well in single-utterance scenarios
- PII Processing: Existing methods primarily address explicit identifiers such as names and locations
- Style Anonymization: Lacks systematic treatment of linguistic style characteristics
- Text Analysis: Based on vocabulary choices, grammar, and function word usage
- Speech Transcription: Recent work demonstrates identity information in transcribed text
- Content Threats Are Real: Linguistic content in long-form audio constitutes significant privacy risk
- Rewriting Protection Is Effective: LLM-based rewriting effectively defends against content attacks
- Local Solutions Are Feasible: Small open-source models (Gemma-3-4B) approach API model performance
- Utility Preservation Is Achievable: Maintains speech quality and semantic integrity while providing privacy protection
- ASR Error Propagation: Errors in ASR stage may affect final quality
- Semantic Fidelity: Rewriting process may lose subtle semantic information or ironic tone
- Attack Model Limitations: Primarily considers uninformed attackers; semi-informed attacks may be more effective
- End-to-end Absence: Current methods rely on cascaded pipeline, lacking end-to-end solutions
- End-to-end Models: Develop end-to-end systems jointly anonymizing speech and content
- Robust Rewriting: Improve rewriting models' balance between semantic fidelity and style anonymization
- Strong Attack Defense: Research protection strategies against semi-informed attackers
- Real-time Processing: Develop efficient anonymization methods applicable to real-time scenarios
- Problem Importance: First systematic identification and resolution of content threats in long-form audio anonymization
- Method Innovation: Proposes context-aware multi-utterance joint rewriting strategy
- Experimental Sufficiency:
- Multi-dimensional evaluation system (privacy, utility, detectability)
- Comparison of multiple models and strategies
- Validation on real datasets
- Practical Value: Provides complete solutions from API models to local models
- Research Rigor: Employs established attack models and evaluation protocols
- Single Dataset: Primarily validated on Fisher corpus, lacks cross-domain generalization verification
- Attack Model Limitations: Does not consider stronger adaptive attacks or multi-modal attacks
- Missing Computational Cost Analysis: Lacks detailed analysis of computational overhead for different methods
- Absence of User Studies: Lacks subjective evaluation by real users on anonymization effectiveness
- Long-term Security: Does not consider impact of advancing attack techniques on protection effectiveness
- Academic Contribution:
- Fills research gap in long-form audio anonymization
- Establishes new evaluation paradigm and benchmark
- Provides important foundation for subsequent research
- Practical Value:
- Provides practical privacy protection solutions for voice data processing
- Has direct applicability in interviews, meeting recordings, etc.
- Supports compliance with relevant privacy regulations
- Reproducibility: Authors commit to open-sourcing code and prompts, facilitating research reproduction and extension
- High-privacy Requirement Scenarios: Medical interviews, legal consultations, psychotherapy
- Commercial Applications: Customer service calls, meeting record privacy protection
- Research Data Sharing: Privacy-preserving release of speech corpora
- Compliance Requirements: Meeting GDPR and other privacy regulation technical requirements
This paper cites 26 references spanning speech anonymization, content privacy, and authorship identification, providing a solid theoretical foundation. Key references include work related to the VoicePrivacy Challenge, the LUAR authorship identification model, and recent advances in speech anonymization.
Overall Assessment: This is a high-quality research paper that identifies and addresses an important problem in speech anonymization. The methodology is innovative, experiments are comprehensive, and results are convincing, with significant value for both academia and industry. Despite certain limitations, it opens new research directions for long-form audio privacy protection.