This study explores the design and application of Complex-Valued Convolutional Neural Networks (CVCNNs) in audio signal processing, with a focus on preserving and utilizing phase information often neglected in real-valued networks. We begin by presenting the foundational theoretical concepts of CVCNNs, including complex convolutions, pooling layers, Wirtinger-based differentiation, and various complex-valued activation functions. These are complemented by critical adaptations of training techniques, including complex batch normalization and weight initialization schemes, to ensure stability in training dynamics. Empirical evaluations are conducted across three stages. First, CVCNNs are benchmarked on standard image datasets, where they demonstrate competitive performance with real-valued CNNs, even under synthetic complex perturbations. Although our focus is audio signal processing, we first evaluate CVCNNs on image datasets to establish baseline performance and validate training stability before applying them to audio tasks. In the second experiment, we focus on audio classification using Mel-Frequency Cepstral Coefficients (MFCCs). CVCNNs trained on real-valued MFCCs slightly outperform real CNNs, while preserving phase in input workflows highlights challenges in exploiting phase without architectural modifications. Finally, a third experiment introduces GNNs to model phase information via edge weighting, where the inclusion of phase yields measurable gains in both binary and multi-class genre classification. These results underscore the expressive capacity of complex-valued architectures and confirm phase as a meaningful and exploitable feature in audio processing applications. While current methods show promise, especially with activations like cardioid, future advances in phase-aware design will be essential to leverage the potential of complex representations in neural networks.
- Paper ID: 2510.09926
- Title: Phase-Aware Deep Learning with Complex-Valued CNNs for Audio Signal Applications
- Author: Agrawal Naman (National University of Singapore)
- Classification: cs.LG cs.AI cs.SD
- Publication Date: October 10, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.09926
This study explores the design and application of complex-valued convolutional neural networks (CVCNNs) in audio signal processing, with particular emphasis on preserving and leveraging phase information that is typically neglected in traditional real-valued networks. The research first establishes the theoretical foundations of CVCNNs, including complex-valued convolution, pooling layers, Wirtinger-based differentiation, and various complex-valued activation functions, accompanied by critical training techniques such as complex-valued batch normalization and weight initialization schemes. The experiments proceed in three stages: first validating the baseline performance of CVCNNs on standard image datasets; second evaluating performance on audio classification tasks using Mel-Frequency Cepstral Coefficients (MFCCs); and finally introducing Graph Neural Networks (GNNs) to explicitly model phase information through edge weights. Results demonstrate that CVCNNs possess strong expressive capacity, and phase information is indeed a meaningful and exploitable feature in audio processing.
Traditional real-valued convolutional neural networks suffer from a fundamental deficiency in audio signal processing: they inherently discard or insufficiently utilize phase information, which constitutes a critical component in many signal processing tasks.
- Value of Phase Information: When audio signals are transformed to the frequency domain via the Short-Time Fourier Transform (STFT), the resulting complex-valued output contains a magnitude, which represents amplitude, and a phase, which encodes important temporal and spatial information
- Application Requirements: In tasks such as speech enhancement, sound source localization, and audio classification, phase information holds potential value for performance improvement
- Technical Development: CVCNNs have demonstrated significant advantages in domains such as remote sensing, medical imaging, and communication systems
- Conventional CNNs process only magnitude spectra, completely ignoring phase information
- Lack of effective complex-valued network training techniques and theoretical frameworks
- Existing complex-valued activation functions face challenges in training stability
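To make the magnitude/phase split concrete, here is a minimal sketch of an STFT in pure Python. The helper names (`dft`, `stft`), the Hann window, and the frame sizes are illustrative assumptions, not taken from the paper. It builds two tones with the same frequency and amplitude but a quarter-cycle phase offset and shows that a magnitude-only view cannot tell them apart.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def stft(signal, frame_len, hop):
    """Short-time Fourier transform with a periodic Hann window (no padding)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append(dft(frame))
    return frames


# Two tones: same frequency and amplitude, phase shifted by pi/2.
N = 64
tone_a = [math.cos(2 * math.pi * 4 * t / 16) for t in range(N)]
tone_b = [math.cos(2 * math.pi * 4 * t / 16 + math.pi / 2) for t in range(N)]

spec_a = stft(tone_a, frame_len=16, hop=8)
spec_b = stft(tone_b, frame_len=16, hop=8)

# Frequency bin 4 of the first frame carries the tone.
bin_a, bin_b = spec_a[0][4], spec_b[0][4]
print("magnitudes:", abs(bin_a), abs(bin_b))                  # equal up to rounding
print("phases:    ", cmath.phase(bin_a), cmath.phase(bin_b))  # differ by about pi/2
```

A network fed only `abs(bin)` receives identical inputs for both tones; the difference survives only in the complex value or its phase.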
The objective is to extend CNNs to the complex-valued domain, constructing neural network architectures that process magnitude and phase information simultaneously and thereby provide more expressive and efficient representations for audio signal processing.
- Theoretical Framework Establishment: Systematically establishes the mathematical foundations of CVCNNs, including a complete theoretical system for complex-valued convolution, pooling, activation functions, and batch normalization
- Training Technique Optimization: Proposes weight initialization strategies and batch normalization methods applicable to complex-valued networks, ensuring training stability
- Activation Function Improvement: Introduces smooth zReLU activation function, addressing the discontinuity issues of the original zReLU
- Phase Information Validation: Explicitly verifies the value of phase information in audio classification tasks through GNN experiments
- Comprehensive Evaluation: Conducts thorough experimental validation across both image and audio domains, providing empirical support for CVCNN applications
This paper primarily investigates audio signal classification tasks, particularly music genre classification. Input consists of MFCC feature representations of audio signals, with output being classification labels. The core challenge is how to effectively utilize phase information from audio signals within neural networks.
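For complex-valued input matrix $X = A_1 + iB_1$ and complex-valued convolution kernel $W = A_2 + iB_2$, complex-valued convolution is defined as:
$$W * X = (A_1 * A_2 - B_1 * B_2) + i\,(B_1 * A_2 + A_1 * B_2)$$
This can be expressed in matrix form as:
$$W * X = \begin{pmatrix} A_1 & -B_1 \\ B_1 & A_1 \end{pmatrix} * \begin{pmatrix} A_2 & -B_2 \\ B_2 & A_2 \end{pmatrix}$$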
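The definition above reduces one complex convolution to four real ones. The following sketch illustrates this in one dimension with plain Python lists. The function names and the sample values are hypothetical, and "convolution" here means the sliding dot product (cross-correlation) that CNN layers usually compute.

```python
def conv1d_real(x, w):
    """Valid-mode sliding dot product of two real sequences."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


def conv1d_complex(a1, b1, a2, b2):
    """Complex convolution of X = a1 + i*b1 with W = a2 + i*b2,
    assembled from four real convolutions."""
    real = [p - q for p, q in zip(conv1d_real(a1, a2), conv1d_real(b1, b2))]
    imag = [p + q for p, q in zip(conv1d_real(b1, a2), conv1d_real(a1, b2))]
    return real, imag


# Illustrative input (real part a1, imaginary part b1) and kernel (a2, b2).
a1, b1 = [1.0, 2.0, 3.0, -1.0], [0.0, 1.0, -1.0, 2.0]
a2, b2 = [1.0, -1.0], [2.0, 0.5]

real_part, imag_part = conv1d_complex(a1, b1, a2, b2)

# Cross-check against direct complex arithmetic.
x = [complex(a, b) for a, b in zip(a1, b1)]
w = [complex(a, b) for a, b in zip(a2, b2)]
direct = [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(len(x) - len(w) + 1)]
print(list(zip(real_part, imag_part)))
print(direct)
```

Splitting the operation this way is what lets a complex layer run on hardware and libraries that only provide real-valued convolution kernels, at the cost of four real convolutions per complex one.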
- Max Pooling: Selects maximum values based on complex magnitude, with corresponding phase recovered through the index of maximum magnitude
- Average Pooling: Performs averaging operations separately on real and imaginary parts
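The two pooling rules can be sketched in a few lines. This is a one-dimensional illustration with non-overlapping windows. The helper names are hypothetical and the window layout is an assumption.

```python
def complex_max(window):
    """Pick the element with the largest magnitude; its phase is kept
    because the whole complex value is returned."""
    return max(window, key=abs)


def complex_mean(window):
    """Average the real and imaginary parts separately."""
    n = len(window)
    return complex(sum(v.real for v in window) / n, sum(v.imag for v in window) / n)


def pool1d(values, size, reducer):
    """Apply `reducer` to consecutive non-overlapping windows of length `size`."""
    return [reducer(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


features = [1 + 1j, -3 + 0j, 0 + 2j, 0 + 0.5j]
max_pooled = pool1d(features, 2, complex_max)
avg_pooled = pool1d(features, 2, complex_mean)
print(max_pooled)
print(avg_pooled)
```

Note that magnitude-based max pooling can select a value with a negative real part (here `-3+0j`), which a max over real parts alone would never do.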
The paper provides a detailed comparison of five complex-valued activation functions:
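- CReLU: $\mathrm{CReLU}(z) = \mathrm{ReLU}(\mathrm{Re}(z)) + i\,\mathrm{ReLU}(\mathrm{Im}(z))$
- modReLU: $\mathrm{modReLU}(z) = \mathrm{ReLU}(|z| + b) \cdot \frac{z}{|z|}$
- zReLU: Returns original value only when both real and imaginary parts are non-negative
- smooth zReLU: $z \cdot \sigma(\alpha \cdot \mathrm{Re}(z)) \cdot \sigma(\alpha \cdot \mathrm{Im}(z))$
- cardioid: $g(z) = \frac{z}{2}\,(1 + \cos\phi_z)$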
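The five activations can be written as scalar functions on Python's built-in `complex` type. This is a minimal sketch for intuition, not the paper's implementation: the default `alpha`, the handling of `z = 0` in `modrelu`, and the function names are assumptions.

```python
import cmath
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def crelu(z):
    """ReLU applied independently to the real and imaginary parts."""
    return complex(relu(z.real), relu(z.imag))


def modrelu(z, b):
    """Shrinks the magnitude by the bias b and keeps the phase."""
    r = abs(z)
    if r == 0.0:
        return 0j  # phase is undefined at the origin (assumed convention)
    return relu(r + b) * z / r


def zrelu(z):
    """Passes z only when it lies in the first quadrant."""
    return z if z.real >= 0 and z.imag >= 0 else 0j


def smooth_zrelu(z, alpha=10.0):
    """Sigmoid-gated version of zReLU; larger alpha gives a sharper gate."""
    return z * sigmoid(alpha * z.real) * sigmoid(alpha * z.imag)


def cardioid(z):
    """Scales z by (1 + cos(phase)) / 2, so attenuation depends only on phase."""
    return 0.5 * (1.0 + math.cos(cmath.phase(z))) * z


for name, f in [("crelu", crelu), ("zrelu", zrelu),
                ("smooth_zrelu", smooth_zrelu), ("cardioid", cardioid)]:
    print(name, f(1 - 1j))
print("modrelu", modrelu(3 + 4j, -1.0))
```

On positive real inputs cardioid reduces to the identity, and on negative real inputs it returns zero, so it behaves like ReLU along the real axis while varying smoothly with phase elsewhere.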
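Standardization process for complex-valued vector $x$:
$$\tilde{x} = V^{-1/2}\,(x - E(x))$$
where the covariance matrix is:
$$V = \begin{pmatrix} \mathrm{Cov}(\mathrm{Re}(x), \mathrm{Re}(x)) & \mathrm{Cov}(\mathrm{Re}(x), \mathrm{Im}(x)) \\ \mathrm{Cov}(\mathrm{Im}(x), \mathrm{Re}(x)) & \mathrm{Cov}(\mathrm{Im}(x), \mathrm{Im}(x)) \end{pmatrix} + \lambda I$$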
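The whitening step can be sketched for a batch of scalar complex activations. Because the covariance matrix is 2x2 and symmetric positive definite, its inverse square root has a closed form. The function name `complex_whiten`, the default value of `lam` (standing in for the regularizer λ), and the synthetic data are assumptions. The learnable scale and shift that normally follow the whitening are omitted.

```python
import math
import random


def complex_whiten(zs, lam=1e-5):
    """Center a batch of complex numbers and multiply by V^{-1/2},
    where V is the 2x2 real/imaginary covariance plus lam * I."""
    n = len(zs)
    mean = sum(zs) / n
    centered = [z - mean for z in zs]

    vrr = sum(c.real * c.real for c in centered) / n + lam
    vii = sum(c.imag * c.imag for c in centered) / n + lam
    vri = sum(c.real * c.imag for c in centered) / n

    # Closed-form inverse square root of a 2x2 SPD matrix:
    # sqrt(V) = (V + s*I) / t with s = sqrt(det V), t = sqrt(trace V + 2s).
    s = math.sqrt(vrr * vii - vri * vri)
    t = math.sqrt(vrr + vii + 2.0 * s)
    scale = 1.0 / (s * t)
    wrr = (vii + s) * scale
    wii = (vrr + s) * scale
    wri = -vri * scale

    return [complex(wrr * c.real + wri * c.imag, wri * c.real + wii * c.imag)
            for c in centered]


# Synthetic batch with correlated real and imaginary parts and a nonzero mean.
random.seed(0)
data = []
for _ in range(2000):
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    data.append(complex(2 * u + 1, u + 0.5 * v - 3))

white = complex_whiten(data)
```

After whitening, the real and imaginary parts have approximately unit variance and near-zero correlation, which is the property that scaling each part independently would not guarantee.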
- Wirtinger Calculus Application: Addresses gradient computation for non-analytic complex-valued functions
- Phase-Aware Feature Extraction: Designs two phase-preserving MFCC extraction pipelines
- Graph Neural Network Integration: Innovatively employs GNN edge weights to explicitly model phase information
- Activation Function Optimization: Proposes smooth zReLU to address training instability issues
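For readers unfamiliar with Wirtinger calculus, the sketch below estimates the two Wirtinger derivatives numerically, using the standard definitions ∂f/∂z = ½(∂f/∂x − i ∂f/∂y) and ∂f/∂z̄ = ½(∂f/∂x + i ∂f/∂y) with z = x + iy. The helper name and step size are illustrative assumptions. The example contrasts a non-analytic function, |z|², with an analytic one, z².

```python
def wirtinger(f, z, h=1e-6):
    """Central-difference estimates of (df/dz, df/d(conj z)) at the point z."""
    dfdx = (f(z + h) - f(z - h)) / (2 * h)            # partial derivative along the real axis
    dfdy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # partial derivative along the imaginary axis
    d_z = 0.5 * (dfdx - 1j * dfdy)
    d_zbar = 0.5 * (dfdx + 1j * dfdy)
    return d_z, d_zbar


z0 = 1 + 2j

# |z|^2 = z * conj(z) is not complex-differentiable, yet both Wirtinger
# derivatives exist: df/dz = conj(z) and df/d(conj z) = z.
dz_abs2, dzbar_abs2 = wirtinger(lambda z: abs(z) ** 2, z0)

# z^2 is analytic: df/d(conj z) vanishes and df/dz = 2z.
dz_sq, dzbar_sq = wirtinger(lambda z: z * z, z0)

print(dz_abs2, dzbar_abs2)
print(dz_sq, dzbar_sq)
```

This is why real-valued losses built from magnitudes can still be backpropagated through complex layers: the conjugate derivative is well defined even where the ordinary complex derivative is not.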
- Image Datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST
- Audio Dataset: GTZAN music genre dataset (1000 30-second audio clips, 10 genres)
- Training and testing accuracy
- Training time comparison
- Convergence analysis
- Standard real-valued CNN (baseline)
- CVCNN with different configurations (real-valued input, complex-valued input, etc.)
- CVCNN variants with different activation functions
- PyTorch and complexPyTorch libraries
- CPU training on Apple M2 Pro chip
- Gradient clipping to prevent training instability
- 5-10 epoch training cycles
On MNIST, KMNIST, and Fashion-MNIST, CVCNNs achieve comparable performance to real-valued CNNs across various input configurations:
- MNIST: ~99% test accuracy
- KMNIST: ~95% test accuracy
- Fashion-MNIST: ~90% test accuracy
On binary music genre classification tasks:
- Real-valued CNN baseline: 92.5% test accuracy
- CVCNN (real-valued MFCC): 95.34% test accuracy (cardioid activation)
- CVCNN (complex-valued MFCC): Performance degradation, revealing current architectural limitations
The cardioid activation function demonstrates superior performance across all experiments:
- Most stable under complex-valued input perturbations
- Achieves highest accuracy on audio tasks
- Most stable training process
Experimental results reveal:
- Cardioid: Excellent performance across various settings, particularly under phase perturbations
- modReLU: Unstable under fixed phase and imaginary part settings, with significant accuracy drops
- smooth zReLU: Good performance under no transformation and noise settings
- CReLU: Serves as a stable baseline choice
GNN experiments explicitly demonstrate the value of phase information:
- GNN without phase information (baseline)
- GNN with phase difference-based edge weights: significantly outperforms baseline on both binary and ten-class classification tasks
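This summary does not give the exact edge-weighting formula, so the following is only one plausible sketch of phase-difference-based edge weights. The cosine mapping, the undirected weighted-mean aggregation, and all names are assumptions made for illustration. Nodes could be, for example, time-frequency bins carrying a magnitude feature and a phase.

```python
import math


def phase_edge_weights(phases, edges):
    """Map each undirected edge (i, j) to a weight in [0, 1]:
    1 when the two phases agree, 0 when they are opposite.
    The cosine form is an assumed choice, not the paper's formula."""
    return {(i, j): 0.5 * (1.0 + math.cos(phases[i] - phases[j])) for i, j in edges}


def aggregate(features, weights):
    """One weighted-mean message-passing step over undirected edges.
    A node with zero total incoming weight keeps its own feature."""
    totals = [0.0] * len(features)
    norms = [0.0] * len(features)
    for (i, j), w in weights.items():
        totals[j] += w * features[i]
        norms[j] += w
        totals[i] += w * features[j]
        norms[i] += w
    return [t / n if n > 0 else f for t, n, f in zip(totals, norms, features)]


# Three nodes: 0 and 1 are in phase, node 2 is in anti-phase with both.
phases = [0.0, 0.0, math.pi]
magnitudes = [1.0, 2.0, 10.0]
edges = [(0, 1), (1, 2), (0, 2)]

weights = phase_edge_weights(phases, edges)
updated = aggregate(magnitudes, weights)
print(weights)
print(updated)
```

With uniform edge weights, node 2's large magnitude would dominate its neighbors' updates. Here the phase-derived weights suppress that exchange, which is the kind of structural signal a phase-free baseline graph cannot express.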
- Training Efficiency: CVCNN training time is approximately 4-5 times that of real-valued CNNs
- Stability: Appropriate activation function selection is critical for training stability
- Phase Utilization: Current architectures have limited capacity for directly exploiting phase information
- Generalization Capacity: CVCNNs demonstrate good robustness under complex-valued perturbations
- Early work primarily focused on theoretical foundations and basic architectures
- Recent breakthroughs in specific domains such as MRI reconstruction and SAR image processing
- Traditional methods primarily based on magnitude spectrum features
- Phase-aware methods gaining attention, such as Deep Complex U-Net
Compared to existing work, this paper provides a more systematic theoretical framework and more comprehensive experimental validation, particularly in activation function comparison and phase information value verification.
- Architectural Feasibility: CVCNNs maintain comparable performance to real-valued CNNs while providing capacity for processing complex-valued information
- Phase Information Value: GNN experiments explicitly demonstrate the discriminative value of phase information in audio classification
- Activation Function Importance: Phase-aware activation functions such as cardioid significantly outperform traditional choices
- Application Potential: With appropriate architectural design, CVCNNs show promise for breakthroughs in audio processing tasks
- Computational Overhead: Significantly increased training time (4-5 times)
- Architectural Constraints: Current design still falls short in directly exploiting phase information
- Domain Specificity: The value of phase information may be limited in certain tasks
- Implementation Complexity: Requires specialized complex-valued computation libraries
- Architectural Innovation: Design specialized phase-aware modules and attention mechanisms
- Training Optimization: Develop more efficient training algorithms for complex-valued networks
- Application Extension: Explore applications in speech recognition, sound source localization, and other tasks
- Theoretical Deepening: Further understand the expressive capacity and learning dynamics of complex-valued representations
- Theoretical Completeness: Provides a complete mathematical framework for CVCNNs, from basic operations to training techniques
- Experimental Comprehensiveness: Systematic evaluation across domains (image + audio) and multiple perspectives (different activation functions, input configurations)
- Innovation Verification: Cleverly validates the intrinsic value of phase information through GNN experiments
- Practical Guidance: Provides concrete technical guidance for practical CVCNN applications
- Limited Performance Improvement: CVCNNs' advantages over real-valued CNNs are not pronounced in certain tasks
- Computational Efficiency: Significant computational overhead may limit practical applications
- Insufficient Architectural Exploration: Primarily uses standard CNN architectures, lacking specialized designs for complex-valued characteristics
- Dataset Scale: Experiments mainly conducted on relatively simple datasets
- Academic Contribution: Provides important theoretical and experimental foundations for complex-valued neural network research
- Practical Value: Introduces new technical approaches to the audio signal processing field
- Reproducibility: Provides complete code implementation, facilitating subsequent research
- Inspirational Value: Points direction for development of phase-aware deep learning
- Audio Processing: Music analysis, speech enhancement, acoustic scene classification
- Signal Processing: Radar signal processing, communication systems, biomedical signal analysis
- Scientific Computing: Physics simulation and numerical computation involving complex-valued data
- Research Tools: Serves as foundational platform for exploring phase information value
The paper cites 37 important references covering complex-valued neural network theory, audio signal processing, deep learning optimization, and other domains, providing solid theoretical foundation and technical support for the research.
Overall Assessment: This is a highly systematic research paper that bridges theory and practical application of complex-valued neural networks. While performance improvements in certain aspects are not yet substantial, it provides important foundational work and research directions for the field's development.