VoiceVector: Multimodal Enrolment Vectors for Speaker Separation

Rahimi, Afouras, Zisserman
We present a transformer-based architecture for voice separation of a target speaker from multiple other speakers and ambient noise. We achieve this by using two separate neural networks: (A) An enrolment network designed to craft speaker-specific embeddings, exploiting various combinations of audio and visual modalities; and (B) A separation network that accepts both the noisy signal and enrolment vectors as inputs, outputting the clean signal of the target speaker. The novelties are: (i) the enrolment vector can be produced from: audio only, audio-visual data (using lip movements) or visual data alone (using lip movements from silent video); and (ii) the flexibility in conditioning the separation on multiple positive and negative enrolment vectors. We compare with previous methods and obtain superior performance.

Basic Information

  • Paper ID: 2501.01401
  • Title: VoiceVector: Multimodal Enrolment Vectors for Speaker Separation
  • Authors: Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman (VGG Group, University of Oxford)
  • Classification: eess.AS (Electrical Engineering and Systems Science - Audio and Speech Processing)
  • Publication Date: January 2, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.01401

Abstract

This paper proposes a Transformer-based architecture for separating a target speaker's voice from multiple other speakers and environmental noise. The method employs two independent neural networks: (A) an enrolment network that uses various combinations of audio and visual modalities to generate speaker-specific embedding vectors; (B) a separation network that takes the noisy signal and the enrolment vectors as input and outputs the clean signal of the target speaker. The main innovations are: (i) enrolment vectors can be generated from audio only, from audiovisual data (using lip movements), or from visual data alone (using lip movements in silent video); (ii) the separation can be conditioned flexibly on multiple positive and negative enrolment vectors.

Research Background and Motivation

Problem Definition

Speaker separation is a core challenge in audio processing, particularly in noisy environments and multi-speaker scenarios. Existing applications such as hearing aids, voice-activated systems, and video conferencing heavily depend on speaker separation performance.

Limitations of Existing Methods

  1. Audio Embedding-Based Methods: Methods such as VoiceFilter rely on clean, noise-free audio to generate speaker embeddings, which is difficult to obtain in practical noisy environments.
  2. Audiovisual Methods: Methods such as Looking to Listen and VoiceFormer leverage visual cues (lip movements) but require continuous visual information during separation. Performance degrades when visual data is occluded or missing.

Research Motivation

This work aims to combine the advantages of audio and visual conditioning methods while circumventing their inherent challenges. It does so through a two-stage design: the enrolment stage can leverage multimodal information to generate robust speaker representations, while the separation stage relies solely on audio data, which improves computational efficiency and robustness to variations in the visual input.

Core Contributions

  1. Multimodal Enrolment Network: Proposes a speaker embedding network capable of processing audio, audiovisual, and pure visual inputs, with innovative support for generating enrolment vectors from silent video alone.
  2. Positive and Negative Sample Conditioning: Introduces a contrastive learning mechanism that simultaneously uses positive (target speaker) and negative (non-target speaker) enrolment vectors.
  3. Two-Stage Architecture Advantages: The separation stage is completely independent of visual information, addressing the limitations of traditional audiovisual methods when visual information is unavailable.
  4. Performance Improvement: Achieves superior performance compared to existing methods on LRS3 and LibriSpeech datasets.

Methodology Details

Task Definition

Given a mixed audio signal containing a target speaker, other speakers, and environmental noise, the objective is to separate the speech component of the target speaker with specific acoustic characteristics while filtering out competing voices and environmental noise.
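
One common way to write this task down is given below. It is a sketch rather than the paper's own notation: the symbols $x$, $s_{\text{tgt}}$, $s_k$, $n$, $e$ and $f_\theta$ are assumptions introduced here, with $e$ standing for the enrolment vector(s) and $f_\theta$ for the separation network.

```latex
x(t) = s_{\text{tgt}}(t) + \sum_{k=1}^{K} s_k(t) + n(t),
\qquad
\hat{s}_{\text{tgt}} = f_\theta(x, e)
```

Here $K$ is the number of interfering speakers and the goal is for $\hat{s}_{\text{tgt}}$ to be as close as possible to $s_{\text{tgt}}$.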

Model Architecture

1. Speaker Enrolment Network

Audio-Only Network (Figure 1a):

  • Uses pre-trained ECAPA-TDNN model as speaker feature extractor
  • Input: Spectrogram of clean audio $S(f,t) = \mathrm{STFT}(a_c)$
  • Output: 192-dimensional speaker embedding $S_{ac} \in \mathbb{R}^{192}$

Audiovisual Network (Figure 1b):

  • Audio encoding: $E_a \in \mathbb{R}^{t_a \times 768}$
  • Video encoding (lip movements): $E_v \in \mathbb{R}^{t_v \times 512}$
  • Face image encoding: $E_f \in \mathbb{R}^{128}$
  • Feature fusion: $F(E_a, E_v, E_f) = (E_a; E_v; E_f) \in \mathbb{R}^{(t_a+t_v+1) \times 768}$
  • Fused features processed through three-layer Transformer encoder
  • Output: 192-dimensional enrolment vector $S_{avf} \in \mathbb{R}^{192}$
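
The fused shape of (t_a + t_v + 1) x 768 only works if the 512-dimensional lip tokens and the 128-dimensional face encoding are first brought to the 768-dimensional audio width. The summary does not say how this is done, so the sketch below assumes a plain linear projection per modality followed by concatenation along the time axis. The helper names (`make_linear`, `project`, `fuse`) and the random weights are hypothetical stand-ins for learned layers, written in pure Python for readability rather than speed.

```python
import random

D_MODEL = 768  # common token width, taken from the fused shape in the list above


def make_linear(in_dim, out_dim, rng):
    """Random in_dim x out_dim weight matrix (stand-in for a learned projection)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(tokens, weight):
    """Apply the same linear map to every token (row) of a sequence."""
    out_dim = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(tok)) for j in range(out_dim)]
        for tok in tokens
    ]


def fuse(e_a, e_v, e_f, w_v, w_f):
    """Concatenate audio, lip and face tokens along time (assumed fusion scheme)."""
    lip_tokens = project(e_v, w_v)        # (t_v, 512) -> (t_v, 768)
    face_token = project([e_f], w_f)      # (128,)     -> (1, 768)
    return e_a + lip_tokens + face_token  # (t_a + t_v + 1, 768)
```

The face encoding contributes a single token, which is where the "+1" in the sequence length comes from.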

Pure Visual Network (Figure 1b):

  • Uses only visual information (lip movements and/or face images)
  • Output: $S_{vf} = \text{SpeakerExtractor}(\text{Transformer}([E_v; E_f]))$

2. Speaker Separation Network

  • Based on VoiceFormer architecture, comprising audio encoder-decoder and speaker embedding encoder
  • Input: Noisy audio waveform and multiple positive and negative enrolment vectors
  • Uses three-layer Transformer encoder to fuse audio and speaker encodings
  • Employs attention mechanisms to enhance features matching target speaker while suppressing non-target speaker features
  • Skip connections between encoder and decoder preserve low-level and high-level information
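
The summary does not spell out how the positive and negative enrolment vectors enter the fusion Transformer. One plausible arrangement, shown purely as an assumption, is to append them to the audio tokens as extra sequence elements and add a role tag to each so that attention can tell target from non-target speakers. The function name, the additive tags, and the requirement that enrolment vectors are already projected to the audio token width are all hypothetical.

```python
def build_conditioning_sequence(audio_tokens, positives, negatives, pos_tag, neg_tag):
    """Append role-tagged enrolment tokens to the audio tokens (assumed scheme).

    Every token and tag must share one width, i.e. the enrolment vectors are
    assumed to have been projected to the audio token dimension beforehand.
    """
    width = len(pos_tag)
    vectors = list(audio_tokens) + list(positives) + list(negatives) + [neg_tag]
    if any(len(v) != width for v in vectors):
        raise ValueError("all tokens and tags must have the same width")

    def tag(vec, role):
        # Additive role embedding: marks a token as target or non-target.
        return [v + r for v, r in zip(vec, role)]

    sequence = [list(tok) for tok in audio_tokens]
    sequence += [tag(p, pos_tag) for p in positives]
    sequence += [tag(n, neg_tag) for n in negatives]
    return sequence
```

With this layout, a configuration such as 3P-2N simply adds five tokens to the sequence, and the number of enrolment vectors can change at inference time without altering the network.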

Technical Innovations

  1. Knowledge Distillation Training Strategy: The audiovisual enrolment network learns to mimic the output of the audio-only network through knowledge distillation, ensuring consistency across modalities.
  2. Multimodal Flexibility: Supports generating enrolment vectors from different modality combinations, including the innovative pure visual mode.
  3. Contrastive Learning Mechanism: Simultaneously uses positive and negative samples to provide stronger speaker discrimination capability.
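
Item 1 says the audiovisual enrolment network is trained to mimic the audio-only network, but the summary does not state which loss is used. The sketch below assumes a cosine-distance objective between the student (audiovisual) and teacher (audio-only) embeddings; a mean-squared error would be an equally reasonable guess. The function names are illustrative.

```python
import math


def cosine_distance(student, teacher, eps=1e-12):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / max(norm_s * norm_t, eps)  # eps guards against zero vectors


def distillation_loss(student_batch, teacher_batch):
    """Mean cosine distance over a batch of (student, teacher) embedding pairs.

    The teacher embeddings are treated as fixed targets: only the student
    network would receive gradients in an actual training loop.
    """
    pairs = list(zip(student_batch, teacher_batch))
    return sum(cosine_distance(s, t) for s, t in pairs) / len(pairs)
```

Because the student is pulled towards the teacher's embedding space, vectors from any enrolment modality can be fed to the same separation network.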

Experimental Setup

Datasets

  • LRS3: Large-scale audiovisual dataset from public TED and TEDx videos with diverse speaking styles and topics
  • LibriSpeech: Large-scale pure audio dataset from public domain audiobooks
  • Test set speakers are unseen during training, ensuring generalization capability assessment

Evaluation Metrics

  • SDR (Signal-to-Distortion Ratio): Measures separation output quality
  • STOI (Short-Time Objective Intelligibility): Quantifies signal intelligibility
  • PESQ (Perceptual Evaluation of Speech Quality): Reflects perceived quality scores
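
To make the first metric concrete, here is a simplified SDR computation, assuming the reference and estimate are time-aligned and equally scaled. Evaluation toolkits typically use a more involved definition that allows for filtering or rescaling of the reference, so this is an illustration of the idea and not necessarily the exact metric behind the reported numbers.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB.

    Everything in the estimate that differs from the reference counts as
    distortion; eps avoids division by zero for a perfect estimate.
    """
    signal_energy = sum(r * r for r in reference)
    distortion_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (distortion_energy + eps))
```

An estimate whose error has one hundredth of the reference energy scores 20 dB under this definition, and every halving of the error energy adds about 3 dB.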

Comparison Methods

  • Audio Methods: VoiceFilter
  • Audiovisual Methods: The Conversation, VisualVoice, VoiceFormer

Implementation Details

  • Implemented in PyTorch
  • Video data: 25 FPS, face crops restricted to the speaker's mouth region
  • Audio: Mono, 16 kHz sampling rate
  • Transformer: 3 layers, 8 attention heads, model dimension 532
  • Training data: 4-second audio segments with random cropping and data augmentation including speed, pitch, and decibel adjustments
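
The numbers above fix the clip geometry: 4 seconds at 16 kHz is 64,000 audio samples, and at 25 FPS it is 100 video frames, so each frame spans 640 samples. The sketch below shows one way to pick a random crop that stays aligned across both streams, plus the decibel adjustment from the augmentation list. The function names are hypothetical, and speed and pitch augmentation are left out.

```python
import random

SAMPLE_RATE = 16_000                     # Hz, from the implementation details
FPS = 25                                 # video frames per second
CLIP_SECONDS = 4                         # training segment length
SAMPLES_PER_FRAME = SAMPLE_RATE // FPS   # 640 audio samples per video frame


def random_aligned_crop(num_video_frames, rng):
    """Pick a random 4-second window, returned as (frame_range, sample_range).

    The crop starts on a frame boundary so audio and video stay in sync.
    Ranges are half-open: (start, stop).
    """
    clip_frames = CLIP_SECONDS * FPS  # 100 frames
    if num_video_frames < clip_frames:
        raise ValueError("clip is shorter than the training segment length")
    start = rng.randrange(num_video_frames - clip_frames + 1)
    stop = start + clip_frames
    return (start, stop), (start * SAMPLES_PER_FRAME, stop * SAMPLES_PER_FRAME)


def apply_gain_db(samples, gain_db):
    """Scale a waveform by a gain given in decibels (amplitude convention)."""
    scale = 10 ** (gain_db / 20)
    return [s * scale for s in samples]
```

For a 10-second video (250 frames), `random_aligned_crop(250, random.Random(0))` returns a 100-frame window and the matching 64,000-sample audio span.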

Experimental Results

Main Results

Positive and Negative Embedding Vector Effects (Table 1):

| Configuration | 1P-0N | 1P-1N | 3P-2N | 3P-3N |
|---------------|-------|-------|-------|-------|
| SDR↑          | 13.8  | 14.0  | 14.4  | 14.5  |

SDR rises steadily from 13.8 with a single positive enrolment vector (1P-0N) to 14.5 with three positive and three negative vectors (3P-3N), so adding enrolment vectors of either kind improves separation performance.

Multimodal Comparison (Table 2):

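| Modality (audio / visual) | SDR↑ | STOI↑ | PESQ↑ |
|---------------------------|------|-------|-------|
| Clean audio               | 14.4 | 91    | 2.52  |
| Clean audio + lips        | 14.5 | 91    | 2.55  |
| Noisy audio               | 6.3  | 58    | 1.82  |
| Noisy audio + lips        | 13.7 | 88    | 2.45  |
| Lips only                 | 11.1 | 77    | 2.25  |
| Lips + face               | 12.0 | 80    | 2.35  |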

Comparison with SOTA Methods (Table 3):

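| Method      | Dataset     | SDR↑ | STOI↑ | PESQ↑ |
|-------------|-------------|------|-------|-------|
| VoiceFormer | LRS3        | 14.4 | 92    | 2.42  |
| VoiceVector | LRS3        | 14.5 | 91    | 2.52  |
| VoiceFilter | LibriSpeech | 12.6 | -     | -     |
| VoiceVector | LibriSpeech | 13.1 | 89    | 2.12  |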

Key Findings

  1. Effectiveness of Pure Visual Mode: Enrolment from lip movements alone reaches an SDR of 11.1, and adding the face image raises it to 12.0, so silent video carries enough speaker information for usable enrolment.
  2. Noise Robustness: Adding lip movements to noisy enrolment audio raises SDR from 6.3 to 13.7, close to the 14.4 obtained with clean audio.
  3. Cross-Dataset Generalization: Outperforms the VoiceFilter baseline on LibriSpeech (SDR 13.1 vs. 12.6).
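
To put these gaps in perspective, the arithmetic below converts the SDR differences into power ratios. It assumes the SDR values in Tables 2 and 3 are reported in dB, as is standard for this metric.

```latex
\Delta_{\text{noisy}} = 13.7 - 6.3 = 7.4\ \text{dB}
\;\Rightarrow\; 10^{7.4/10} \approx 5.5
\\
\Delta_{\text{LibriSpeech}} = 13.1 - 12.6 = 0.5\ \text{dB}
\;\Rightarrow\; 10^{0.5/10} \approx 1.12
```

In other words, adding lips to noisy enrolment audio cuts the residual distortion power by a factor of about 5.5 relative to the signal, whereas the LibriSpeech margin over VoiceFilter corresponds to roughly a 12% improvement in the same ratio.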

Main Research Directions

  1. Multimodal Conditioning Methods: Leveraging visual cues (primarily lip movements) to guide separation
  2. Speaker-Specific Embedding Methods: Generating speaker embeddings from clean speech samples for conditioning

Advantages of This Work

  • Compared to traditional audiovisual methods: separation stage does not require visual information, improving robustness and computational efficiency
  • Compared to pure audio methods: provides stronger speaker discrimination through multimodal enrolment vectors
  • Introduces negative enrolment vectors: previous methods condition only on positive samples, whereas negatives give the separation network explicit examples of the voices to suppress

Conclusions and Discussion

Main Conclusions

  1. The proposed two-stage architecture successfully combines the advantages of audio and visual conditioning
  2. Multimodal enrolment vectors demonstrate good performance across various scenarios
  3. The contrastive learning mechanism with positive and negative samples effectively improves separation performance
  4. Achieves superior performance compared to existing methods on standard datasets

Limitations

  1. Synthetic Data Dependency: Primarily trained and tested on synthetic mixed audio, with potential domain gap from real-world noisy environments
  2. Visual Quality Requirements: Pure visual mode still requires clear lip movement video
  3. Computational Complexity: Two-stage architecture increases overall system complexity

Future Directions

  1. Validation and optimization in real noisy environments
  2. Exploring the fusion of additional visual cues (gestures, facial expressions)
  3. Further research on end-to-end optimization strategies

In-Depth Evaluation

Strengths

  1. Strong Technical Innovation: First implementation of pure visual speaker enrolment, opening new directions for visual speech processing
  2. Reasonable Architecture Design: Two-stage design cleverly balances performance and practicality
  3. Comprehensive Experiments: Thorough evaluation covering multiple modality combinations and comparison methods
  4. Significant Performance Improvement: Surpasses existing SOTA methods on multiple metrics

Weaknesses

  1. Insufficient Real-World Validation: Primarily based on synthetic data, lacking validation in real noisy environments
  2. Missing Computational Efficiency Analysis: Lacks detailed computational complexity and inference time analysis
  3. Insufficient Failure Case Analysis: Lacks in-depth analysis of method limitations

Impact

  1. Academic Value: Provides new research directions for multimodal speech separation
  2. Practical Value: Potential applications in hearing aids, video conferencing, and other real-world scenarios
  3. Reproducibility: Describes the implementation in enough detail to facilitate reproduction

Applicable Scenarios

  1. Video Conferencing Systems: Leveraging visual information of participants for speech separation
  2. Intelligent Hearing Devices: Highlighting target speaker voice in noisy environments
  3. Multimedia Content Processing: Extracting specific speaker voice from audiovisual content

References

The paper cites important works in the speech separation field, including:

  • VoiceFilter series: Speaker embedding-based separation methods
  • Looking to Listen, VoiceFormer: Representative audiovisual separation works
  • ECAPA-TDNN: Classical speaker recognition model
  • LRS3, LibriSpeech: Standard speech processing datasets

Overall Assessment: This is an excellent paper with strong technical innovation and a reasonable experimental design. Through its two-stage architecture and multimodal fusion strategy, it improves on prior methods on the speaker separation task. In particular, enrolment from the visual modality alone opens a new research direction for the field. Although validation in real-world scenarios leaves room for improvement, the overall quality of the work is high, with clear academic and practical value.