Most of the existing speaker anonymization research has focused on single-speaker audio, leading to the development of techniques and evaluation metrics optimized for such conditions. This study addresses the significant challenge of speaker anonymization within multi-speaker conversational audio, specifically when only a single target speaker needs to be anonymized. This scenario is highly relevant in contexts like call centers, where customer privacy necessitates anonymizing only the customer's voice in interactions with operators. Conventional anonymization methods are often not suitable for this task. Moreover, current evaluation methodology does not allow us to accurately assess privacy protection and utility in this complex multi-speaker scenario. This work aims to bridge these gaps by exploring effective strategies for targeted speaker anonymization in conversational audio, highlighting potential problems in their development and proposing corresponding improved evaluation methodologies.
- Paper ID: 2510.09307
- Title: Target Speaker Anonymization in Multi-Speaker Recordings
- Authors: Natalia Tomashenko¹, Junichi Yamagishi², Xin Wang², Yun Liu², Emmanuel Vincent¹
- Institutions: ¹Université de Lorraine, CNRS, Inria, Loria, France; ²National Institute of Informatics, Tokyo, Japan
- Classification: eess.AS (Audio and Speech Processing), cs.CL (Computational Linguistics), cs.CR (Cryptography and Security)
- Publication Date: October 10, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.09307
Existing speaker anonymization research has primarily focused on single-speaker audio, resulting in techniques and evaluation metrics optimized for such conditions. This work addresses the significant challenge of speaker anonymization in multi-speaker conversational audio, particularly in scenarios requiring anonymization of only a single target speaker. Such scenarios are highly relevant in environments like call centers, where customer privacy must be protected through anonymization of customer voice while preserving operator information. Traditional anonymization methods are often inadequate for this task. Furthermore, current evaluation methodologies cannot accurately assess privacy protection and utility in such complex multi-speaker scenarios. This work aims to bridge these gaps by exploring effective target speaker anonymization strategies for conversational audio, highlighting potential issues in their development and proposing improved evaluation methods.
The core problem addressed in this research is selective anonymization of specific target speakers in multi-speaker conversational recordings—a novel and challenging task. Traditional speaker anonymization techniques are designed primarily for single-speaker audio and cannot effectively handle selective anonymization requirements in multi-speaker scenarios.
- Legal Compliance Requirements: With the implementation of privacy protection regulations such as GDPR, voice data privacy protection has become critical
- Practical Application Scenarios: In call centers, medical consultations, and similar environments, there is a need to protect customer privacy while preserving service personnel information
- Technical Challenges: Voice data contains rich personal information (age, gender, health status, emotional state, etc.), requiring privacy protection while maintaining linguistic content
- Technical Limitations: Existing anonymization methods cannot selectively target specific speakers in mixed audio
- Insufficient Evaluation: Lack of privacy protection and utility evaluation metrics tailored for multi-speaker scenarios
- Limited Applicability: Traditional methods perform poorly with overlapping speech and complex conversational scenarios
- Proposes Target Speaker Anonymization (TSA) Framework: First systematic approach to selective anonymization in multi-speaker conversations
- Develops Comprehensive Evaluation Methodology: Establishes privacy protection and utility evaluation systems for multi-speaker anonymization scenarios
- Experimental Validation and Analysis: Comprehensive experimental evaluation based on two state-of-the-art target speaker extraction methods
- Identifies Key Challenges: In-depth analysis of inherent limitations and technical challenges, providing guidance for future research
- Input: Mixed audio signals containing multiple speakers
- Output: Mixed audio with anonymization applied only to the target speaker
- Constraints: Preserve original speech of non-target speakers and maintain overall conversational intelligibility and utility
TSA employs a three-step pipeline approach:
- Target Speaker Extraction (TSE):
  - Uses pre-trained speaker embedding vectors to identify the target speaker
  - Estimates complex-valued soft masks to separate target speaker time-frequency spectra
  - Extracts target speaker speech segments from mixed audio
- Speaker Anonymization:
  - Applies anonymization processing only to extracted target speaker speech
  - Utilizes anonymization systems based on vector quantization bottleneck (VQ-BN) features
  - Synthesizes anonymized speech through HiFi-GAN networks
- Speech Recombination:
  - Combines anonymized target speaker speech with original non-target speaker speech
  - Generates final partially anonymized mixed audio
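
The three steps above can be sketched as a small wrapper. This is only an illustration under stated assumptions, not the authors' implementation: the names `target_speaker_anonymize`, `extract`, and `anonymize` are hypothetical, and recovering the non-target speech by subtracting the extracted target from the mixture in the time domain is an assumption made here for simplicity.

```python
import numpy as np
from typing import Callable

Array = np.ndarray


def target_speaker_anonymize(
    mixture: Array,
    enrollment: Array,
    extract: Callable[[Array, Array], Array],
    anonymize: Callable[[Array], Array],
) -> Array:
    """Anonymize only the target speaker in a single-channel mixture.

    `extract` and `anonymize` are placeholders for a TSE model and an
    anonymization system; both are hypothetical callables.
    """
    # Step 1: estimate the target speaker's waveform from the mixture.
    target_est = extract(mixture, enrollment)

    # Assumption: whatever the extractor did not attribute to the target
    # is treated as the non-target speech.
    residual = mixture - target_est

    # Step 2: anonymize only the extracted target speech.
    target_anon = anonymize(target_est)

    # Step 3: recombine. Trim or zero-pad in case the anonymizer
    # returned a waveform of a different length.
    n = len(mixture)
    pad = max(0, n - len(target_anon))
    target_anon = np.pad(target_anon[:n], (0, pad))
    return residual + target_anon
```

Because the two stages are passed in as callables, either one can be swapped without touching the other, which mirrors the modular design described in the summary.
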
Conformer-based TSE:
- Combines convolutional layers and self-attention mechanisms to process STFT spectrograms
- Reconstructs real and imaginary components of target speaker STFT spectra
- Integrates speaker embeddings to identify and focus on target speakers
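
To make the role of the real and imaginary components concrete, the sketch below applies a complex-valued mask to a mixture STFT. It assumes the mask is applied by element-wise complex multiplication, which is a common choice but is not confirmed by this summary; the function name `apply_complex_mask` and the random toy inputs are illustrative only.

```python
import numpy as np


def apply_complex_mask(mix_stft, mask_real, mask_imag):
    """Element-wise complex masking of a (freq_bins, frames) STFT."""
    mix_stft = np.asarray(mix_stft, dtype=np.complex128)
    # Spell out the complex product (Mr + j*Mi) * (Yr + j*Yi) so the
    # real and imaginary parts of the estimate are explicit.
    est_real = mask_real * mix_stft.real - mask_imag * mix_stft.imag
    est_imag = mask_real * mix_stft.imag + mask_imag * mix_stft.real
    return est_real + 1j * est_imag


# Toy example: random values stand in for a real STFT and a predicted mask.
rng = np.random.default_rng(0)
mix = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))
m_re = rng.uniform(-1, 1, size=(5, 4))
m_im = rng.uniform(-1, 1, size=(5, 4))
target_est = apply_complex_mask(mix, m_re, m_im)
```

Unlike a real-valued magnitude mask, a complex mask can also adjust the phase of each time-frequency bin.
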
WeSep BSRNN TSE:
- Explicitly partitions audio spectra into multiple frequency bands
- Performs fine-grained modeling of unique spectral features for each band
- Based on band-split recurrent neural network architecture
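
The band-splitting idea can be illustrated with a minimal sketch that cuts a spectrogram into contiguous frequency bands, each of which would then be modeled separately. The band widths used here are arbitrary placeholders, not the configuration of the WeSep BSRNN model, and `split_into_bands` is a hypothetical helper.

```python
import numpy as np
from typing import List, Sequence


def split_into_bands(spec: np.ndarray, band_widths: Sequence[int]) -> List[np.ndarray]:
    """Split a (freq_bins, frames) spectrogram into contiguous sub-bands."""
    if sum(band_widths) != spec.shape[0]:
        raise ValueError("band widths must cover every frequency bin exactly once")
    # Cumulative widths give the bin indices at which to cut.
    edges = np.cumsum(band_widths)[:-1]
    return np.split(spec, edges, axis=0)


# Toy spectrogram with 10 frequency bins and 3 frames.
spec = np.arange(30, dtype=float).reshape(10, 3)
# Hypothetical widths: narrow bands at low frequencies, wider ones above.
bands = split_into_bands(spec, [2, 3, 5])
```
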
- Novel Framework: First complete solution for target speaker anonymization in multi-speaker scenarios
- Modular Design: Decoupled TSE and anonymization modules enabling optimization and replacement
- Innovative Evaluation System: Adopts multi-speaker metrics such as tcpWER so that utility can be assessed alongside privacy protection
- Attacker Modeling: Considers semi-informed attacker scenarios for more realistic privacy evaluation
- SparseLibri2Mix: Multi-speaker dataset constructed from LibriSpeech test-clean subset
- Overlap Conditions: Five different overlap levels (20%, 40%, 60%, 80%, 100%)
- Data Scale: 500 mixed files per condition, totaling 2,500 files (~5 hours of speech)
- Speaker Count: 40 speakers, with the first speaker designated as target speaker
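
As a rough illustration of what an overlap condition means, the sketch below mixes two utterances so that a chosen fraction of the shorter one overlaps the other. This is an assumed definition of overlap for illustration; the actual SparseLibri2Mix construction may differ, and `mix_with_overlap` is a hypothetical helper.

```python
import numpy as np


def mix_with_overlap(first: np.ndarray, second: np.ndarray, overlap: float) -> np.ndarray:
    """Place `second` so that it overlaps the end of `first`.

    Assumption: `overlap` is a fraction of the shorter utterance's length.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    shorter = min(len(first), len(second))
    n_overlap = int(round(overlap * shorter))
    start = len(first) - n_overlap  # sample index where `second` begins
    total = max(len(first), start + len(second))
    mix = np.zeros(total, dtype=np.float64)
    mix[: len(first)] += first
    mix[start : start + len(second)] += second
    return mix


# Five overlap levels with 500 mixtures each give the 2,500 files reported.
overlap_levels = [0.2, 0.4, 0.6, 0.8, 1.0]
total_files = len(overlap_levels) * 500
```
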
- Equal Error Rate (EER): Evaluates anonymization effectiveness using automatic speaker verification (ASV) systems
- Attacker Model: Semi-informed attacker with access to anonymization system and training data
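
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate; a higher EER after anonymization means the attacker's ASV system is less able to tell speakers apart. The sketch below is a simple threshold sweep over toy scores, not the evaluation code used in the paper.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores) -> float:
    """Approximate EER by sweeping thresholds over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection: a same-speaker trial scores below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance: a different-speaker trial scores at or above it.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2)
```

With perfectly separated scores the function returns 0, while identical score distributions for target and non-target trials give 0.5, which corresponds to chance-level verification.
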
- Primary Metric: Time-constrained minimum permutation word error rate (tcpWER)
- Auxiliary Metrics:
  - Diarization error rate (DER)
  - Word error rate (WER) for target speaker ASR
  - Scale-invariant signal-to-distortion ratio (SI-SDR)
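
Of these metrics, SI-SDR is the simplest to compute directly. The sketch below follows the standard definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the remainder); it is a generic implementation, not the paper's evaluation script.

```python
import numpy as np


def si_sdr(estimate, reference, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    est = np.asarray(estimate, dtype=float)
    ref = np.asarray(reference, dtype=float)
    # Remove the mean so a DC offset does not affect the score.
    est = est - est.mean()
    ref = ref - ref.mean()
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return float(10.0 * np.log10(ratio))


# Toy signals: the distortion has 1% of the reference energy.
ref = np.array([1.0, -1.0, 1.0, -1.0])
est = ref + np.array([0.1, 0.1, -0.1, -0.1])
score = si_sdr(est, ref)
```

Rescaling the estimate by any nonzero constant leaves the score unchanged, which is what makes the metric scale-invariant.
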
- Anonymization System: B5 baseline system from VoicePrivacy 2024 Challenge
- TSE Models: Conformer-based TSE vs. WeSep BSRNN TSE
- Evaluation Models: ECAPA-TDNN ASV system, DiCoW ASR system
| Overlap Rate (%) | 20 | 40 | 60 | 80 | 100 | Average |
|---|---|---|---|---|---|---|
| Conformer TSE | 17.9 | 15.8 | 14.6 | 14.0 | 14.0 | 15.3 |
| WeSep BSRNN TSE | 18.6 | 17.5 | 17.2 | 16.7 | 16.2 | 17.2 |
- Single-speaker Scenario: EER rises from 3.0% to 32.4% after anonymization
- Multi-speaker Scenario:
  - Conformer TSE: Average EER 36.4%
  - WeSep BSRNN TSE: Average EER 36.9%
- Privacy Improvement: Average EER is 12-14% higher in relative terms (4.0 to 4.5 percentage points) than in the single-speaker scenario
- tcpWER Results:
  - Conformer TSE: Average 17.8%
  - WeSep BSRNN TSE: Average 14.6% (superior)
- DER Results: WeSep BSRNN outperforms Conformer across all overlap conditions
- Original Signal Extraction: TSE process causes significant relative decline in EER and WER compared to original mixed signals
- Anonymization Impact: WER further increases after anonymization, primarily due to insertion errors from residual non-target speaker signals
- Overlap Degree Impact: TSE performance degrades with increasing overlap, but privacy protection effectiveness remains relatively stable
- Reference Signal Selection: Attacks using original reference signals outperform those using anonymized reference signals
- TSE Model Consistency: Attacks are most effective when attackers use the same TSE model as users
- TSE is Critical Bottleneck: TSE quality directly impacts final privacy protection and utility
- Overlapping Speech Challenge: TSE performance significantly degrades under high overlap conditions
- Insertion Error Problem: Residual non-target speaker signals increase ASR insertion errors
- Privacy-Utility Trade-off: Inherent trade-off exists between privacy protection and speech utility
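
The insertion-error effect can be illustrated with a word-level edit distance that reports substitutions, deletions, and insertions separately. The sketch below is a plain WER breakdown, not tcpWER, and the example sentences are invented: the extra words stand in for residual non-target speech that the ASR system transcribes as part of the target speaker's turn.

```python
from typing import Dict, List


def wer_breakdown(reference: List[str], hypothesis: List[str]) -> Dict[str, float]:
    """Word error rate with counts of each error type."""
    n, m = len(reference), len(hypothesis)
    # Each cell holds (total_errors, substitutions, deletions, insertions).
    table = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (t, s, d, k)
            else:
                diag = (t + 1, s + 1, d, k)
            t, s, d, k = table[i - 1][j]
            delete = (t + 1, s, d + 1, k)
            t, s, d, k = table[i][j - 1]
            insert = (t + 1, s, d, k + 1)
            # Tuples compare by total error count first.
            table[i][j] = min(diag, delete, insert)
    total, subs, dels, ins = table[n][m]
    return {
        "wer": total / max(n, 1),
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }


# Invented example: "yes" and "okay" leak in from the other speaker.
ref_words = "please hold the line".split()
hyp_words = "please hold yes the line okay".split()
result = wer_breakdown(ref_words, hyp_words)
```

Here every target word is recognized correctly, yet the two leaked words alone produce a 50% WER, which is why residual non-target speech is costly for utility.
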
- Signal Processing Methods: Simple transformation methods such as McAdams coefficients and pitch shifting
- Neural Speech Conversion Methods: Anonymization techniques based on decoupled representation learning
- VoicePrivacy Challenge: Advanced single-speaker anonymization technology development
- Deep Learning Methods: Speech separation techniques based on deep neural networks
- Attention Mechanisms: Attention mechanisms guided by speaker embeddings
- Band-Split Techniques: Advanced frequency-domain processing methods such as BSRNN
Existing multi-speaker anonymization research is extremely limited; this paper represents pioneering work in this field.
- Technical Feasibility: TSA framework enables selective anonymization of target speakers in multi-speaker scenarios
- Performance Trade-offs: Trade-offs exist among privacy protection, speech quality, and computational complexity
- Evaluation Importance: New evaluation metrics are critical for accurately assessing multi-speaker anonymization effectiveness
- Improvement Potential: Current methods have significant room for improvement in utility preservation
- TSE Dependency: Method performance heavily depends on TSE module quality
- Computational Complexity: Three-step pipeline increases system complexity and computational overhead
- Utility Degradation: tcpWER is significantly worse (higher) than on the original audio
- Dataset Limitations: Experiments conducted only on simulated datasets, lacking validation on real conversational data
- End-to-End Training: Joint training of TSE and anonymization modules to optimize overall performance
- Improved TSE: Development of TSE models specifically optimized for anonymization tasks
- Real-Time Processing: Exploration of real-time or near-real-time TSA solutions
- Multimodal Anonymization: Integration of visual information for multimodal privacy protection
- Strong Innovation: First systematic approach to multi-speaker target anonymization, filling an important research gap
- Complete Methodology: Provides comprehensive solution from technical framework to evaluation methods
- Sufficient Experiments: Comprehensive comparative experiments across multiple TSE models and overlap conditions
- In-Depth Analysis: Detailed analysis of module contributions and system limitations
- Practical Significance: Addresses urgent needs in practical applications such as call centers
- Performance Limitations: tcpWER is significantly worse (higher) than on the original audio, so utility still requires improvement
- Computational Efficiency: Three-step pipeline has high computational complexity, limiting real-time applications
- Data Limitations: Lacks validation on real conversational data
- Attacker Model: Relatively simple attacker model, not considering more sophisticated attack strategies
- Privacy Evaluation: EER results of 36-37% indicate remaining privacy leakage risks
- Academic Contribution: Pioneering work opening new research direction in multi-speaker target anonymization
- Practical Value: Provides privacy protection solutions for call centers, medical, and related industries
- Technology Advancement: Promotes convergence of TSE and speech anonymization technologies
- Standard Setting: Provides reference for related evaluation standards and benchmark development
- Call Centers: Protecting customer privacy while preserving service quality analysis capabilities
- Medical Consultation: Anonymizing patient speech for medical research and training purposes
- Legal Recordings: Processing court recordings to protect party privacy
- Educational Training: Anonymizing student speech for teaching and research purposes
This paper cites 31 relevant references covering multiple related fields including voice privacy protection, speaker anonymization, target speaker extraction, and automatic speech recognition, providing a solid theoretical foundation for the research.
Overall Assessment: This is a high-quality research paper making pioneering contributions to the important and challenging problem of multi-speaker voice privacy protection. While there remains room for improvement in technical performance, its innovative framework design, comprehensive evaluation methodology, and in-depth analysis establish an important foundation for subsequent research in this field.