Phase-Aware Deep Learning with Complex-Valued CNNs for Audio Signal Applications

Agrawal

This study explores the design and application of Complex-Valued Convolutional Neural Networks (CVCNNs) in audio signal processing, with a focus on preserving and utilizing phase information often neglected in real-valued networks. We begin by presenting the foundational theoretical concepts of CVCNNs, including complex convolutions, pooling layers, Wirtinger-based differentiation, and various complex-valued activation functions. These are complemented by critical adaptations of training techniques, including complex batch normalization and weight initialization schemes, to ensure stability in training dynamics. Empirical evaluations are conducted across three stages. First, CVCNNs are benchmarked on standard image datasets, where they demonstrate competitive performance with real-valued CNNs, even under synthetic complex perturbations. Although our focus is audio signal processing, we first evaluate CVCNNs on image datasets to establish baseline performance and validate training stability before applying them to audio tasks. In the second experiment, we focus on audio classification using Mel-Frequency Cepstral Coefficients (MFCCs). CVCNNs trained on real-valued MFCCs slightly outperform real CNNs, while preserving phase in input workflows highlights challenges in exploiting phase without architectural modifications. Finally, a third experiment introduces GNNs to model phase information via edge weighting, where the inclusion of phase yields measurable gains in both binary and multi-class genre classification. These results underscore the expressive capacity of complex-valued architectures and confirm phase as a meaningful and exploitable feature in audio processing applications. While current methods show promise, especially with activations like cardioid, future advances in phase-aware design will be essential to leverage the potential of complex representations in neural networks.

Basic Information

Paper ID: 2510.09926
Title: Phase-Aware Deep Learning with Complex-Valued CNNs for Audio Signal Applications
Author: Agrawal Naman (National University of Singapore)
Classification: cs.LG cs.AI cs.SD
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09926

Abstract

This study explores the design and application of complex-valued convolutional neural networks (CVCNNs) in audio signal processing, with particular emphasis on preserving and leveraging phase information that is typically neglected in traditional real-valued networks. The research first establishes the theoretical foundations of CVCNNs, including complex-valued convolution, pooling layers, Wirtinger-based differentiation, and various complex-valued activation functions, accompanied by critical training techniques such as complex-valued batch normalization and weight initialization schemes. The experiments proceed in three stages: first validating the baseline performance of CVCNNs on standard image datasets; second evaluating performance on audio classification tasks using Mel-Frequency Cepstral Coefficients (MFCCs); and finally introducing Graph Neural Networks (GNNs) to explicitly model phase information through edge weights. Results demonstrate that CVCNNs possess strong expressive capacity, and phase information is indeed a meaningful and exploitable feature in audio processing.

Research Background and Motivation

Problem Definition

Traditional real-valued convolutional neural networks suffer from a fundamental deficiency in audio signal processing: they inherently discard or insufficiently utilize phase information, which constitutes a critical component in many signal processing tasks.

Importance Analysis

Value of Phase Information: When an audio signal is transformed to the frequency domain via the Short-Time Fourier Transform (STFT), the resulting complex-valued output has two components: a magnitude, which represents amplitude, and a phase, which encodes temporal and spatial information
Application Requirements: In tasks such as speech enhancement, sound source localization, and audio classification, phase information holds potential value for performance improvement
Technical Development: CVCNNs have demonstrated significant advantages in domains such as remote sensing, medical imaging, and communication systems
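
To make the magnitude/phase distinction concrete, the following minimal sketch computes a naive DFT of a single frame and of the same frame delayed by one sample. The frame, its length, and the helper name `dft` are illustrative assumptions, not taken from the paper. The delay leaves the magnitude of the dominant bin unchanged and alters only its phase, which is exactly the information a magnitude-only pipeline cannot see.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Hypothetical test frame: a sinusoid completing 4 cycles over 16 samples.
n = 16
frame = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
# The same frame circularly delayed by one sample.
shifted = frame[-1:] + frame[:-1]

spec, spec_shifted = dft(frame), dft(shifted)
k = 4  # bin that holds the sinusoid's energy

# The magnitudes of the two frames agree; only the phases differ.
print("magnitude:", abs(spec[k]), abs(spec_shifted[k]))
print("phase:", cmath.phase(spec[k]), cmath.phase(spec_shifted[k]))
```

A real-valued model fed only `abs(spec[k])` would treat the two frames as identical.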

Limitations of Existing Methods

Conventional CNNs process only magnitude spectra, completely ignoring phase information
Lack of effective complex-valued network training techniques and theoretical frameworks
Existing complex-valued activation functions face challenges in training stability

Research Motivation

The goal is to extend CNNs to the complex-valued domain, yielding neural network architectures that process magnitude and phase jointly and thereby provide a more expressive representation for audio signal processing.

Core Contributions

Theoretical Framework Establishment: Systematically establishes the mathematical foundations of CVCNNs, including a complete theoretical system for complex-valued convolution, pooling, activation functions, and batch normalization
Training Technique Optimization: Proposes weight initialization strategies and batch normalization methods applicable to complex-valued networks, ensuring training stability
Activation Function Improvement: Introduces the smooth zReLU activation function, which addresses the discontinuity of the original zReLU
Phase Information Validation: Explicitly verifies the value of phase information in audio classification tasks through GNN experiments
Comprehensive Evaluation: Conducts thorough experimental validation across both image and audio domains, providing empirical support for CVCNN applications

Methodology Details

Task Definition

This paper primarily investigates audio signal classification tasks, particularly music genre classification. Input consists of MFCC feature representations of audio signals, with output being classification labels. The core challenge is how to effectively utilize phase information from audio signals within neural networks.

Model Architecture

Complex-Valued Convolution Operation

For a complex-valued input matrix $X = A_1 + iB_1$ and a complex-valued convolution kernel $W = A_2 + iB_2$, complex-valued convolution is defined as:

$W * X = (A_1 * A_2 - B_1 * B_2) + i(B_1 * A_2 + A_1 * B_2)$

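Identifying a complex quantity $A + iB$ with the real block matrix $\begin{pmatrix} A & -B \\ B & A \end{pmatrix}$, the same product can be written in matrix form as: $\begin{pmatrix} A_1 & -B_1 \\ B_1 & A_1 \end{pmatrix} * \begin{pmatrix} A_2 & -B_2 \\ B_2 & A_2 \end{pmatrix}$, whose first column holds the real part $A_1 * A_2 - B_1 * B_2$ and the imaginary part $B_1 * A_2 + A_1 * B_2$.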
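
The decomposition into four real-valued convolutions can be checked with a small sketch. It is one-dimensional and uses the cross-correlation convention common in CNN libraries; the function names and sample values are illustrative assumptions rather than the paper's implementation.

```python
def conv_valid(x, w):
    """1-D 'valid' cross-correlation (the usual CNN convention)."""
    n, k = len(x), len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(n - k + 1)]

def complex_conv(x, w):
    """Complex convolution assembled from four real-valued convolutions."""
    a1, b1 = [v.real for v in x], [v.imag for v in x]  # X = A1 + iB1
    a2, b2 = [v.real for v in w], [v.imag for v in w]  # W = A2 + iB2
    # Real part: A1 * A2 - B1 * B2
    real = [p - q for p, q in zip(conv_valid(a1, a2), conv_valid(b1, b2))]
    # Imaginary part: B1 * A2 + A1 * B2
    imag = [p + q for p, q in zip(conv_valid(b1, a2), conv_valid(a1, b2))]
    return [complex(r, i) for r, i in zip(real, imag)]

x = [1 + 2j, 3 - 1j, 0 + 1j, 2 + 2j]  # illustrative input
w = [1 - 1j, 0.5 + 2j]                # illustrative kernel

out = complex_conv(x, w)
direct = conv_valid(x, w)  # same sliding sum in native complex arithmetic
print(out)
print(direct)
```

Both lists should agree up to floating-point rounding, which is why a complex layer can be built from ordinary real-valued convolution kernels.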

Complex-Valued Pooling Layers

Max Pooling: Selects maximum values based on complex magnitude, with corresponding phase recovered through the index of maximum magnitude
Average Pooling: Performs averaging operations separately on real and imaginary parts
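
A minimal one-dimensional sketch of both pooling rules follows, assuming non-overlapping windows; the function names and inputs are illustrative. Selecting the whole complex entry with the largest magnitude keeps its phase automatically, and summing complex numbers averages the real and imaginary parts independently.

```python
def complex_max_pool(z, size):
    """Per window, keep the entry with the largest magnitude (phase included)."""
    return [max(z[i:i + size], key=abs) for i in range(0, len(z) - size + 1, size)]

def complex_avg_pool(z, size):
    """Per window, average the real and imaginary parts separately."""
    return [sum(z[i:i + size]) / size for i in range(0, len(z) - size + 1, size)]

z = [1 + 1j, -3 + 0j, 0 + 2j, 1 - 1j]  # illustrative feature row
max_pooled = complex_max_pool(z, 2)
avg_pooled = complex_avg_pool(z, 2)
print(max_pooled)
print(avg_pooled)
```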

Complex-Valued Activation Functions

The paper provides a detailed comparison of five complex-valued activation functions:

CReLU: $\text{CReLU}(z) = \text{ReLU}(\text{Re}(z)) + i\text{ReLU}(\text{Im}(z))$
modReLU: $\text{modReLU}(z) = \text{ReLU}(|z| + b) \cdot \frac{z}{|z|}$, where $b$ is a learnable bias
zReLU: Returns original value only when both real and imaginary parts are non-negative
smooth zReLU: $z \cdot \sigma(\alpha \cdot \text{Re}(z)) \cdot \sigma(\alpha \cdot \text{Im}(z))$
cardioid: $g(z) = \frac{z}{2}(1 + \cos \phi_z)$, where $\phi_z$ is the phase of $z$
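
The five definitions above translate directly into scalar functions. The sketch below is a plain-Python illustration rather than the paper's code: the default `alpha`, the bias passed to `modrelu`, and the convention of returning zero at $z = 0$ for modReLU are assumptions made for the example.

```python
import cmath
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    # Overflow-safe form of 1 / (1 + exp(-x)).
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def crelu(z):
    """ReLU applied to the real and imaginary parts independently."""
    return complex(relu(z.real), relu(z.imag))

def modrelu(z, b):
    """Shrinks the magnitude by the bias b and keeps the phase."""
    r = abs(z)
    return 0j if r == 0 else relu(r + b) * z / r  # assumed convention at z = 0

def zrelu(z):
    """Passes z only when it lies in the first quadrant."""
    return z if z.real >= 0 and z.imag >= 0 else 0j

def smooth_zrelu(z, alpha=10.0):
    """Sigmoid-gated zReLU; alpha is an assumed sharpness value."""
    return z * sigmoid(alpha * z.real) * sigmoid(alpha * z.imag)

def cardioid(z):
    """Scales z by (1 + cos(phase)) / 2, leaving the phase untouched."""
    return 0.5 * (1.0 + math.cos(cmath.phase(z))) * z

for name, f in [("CReLU", crelu), ("zReLU", zrelu),
                ("smooth zReLU", smooth_zrelu), ("cardioid", cardioid)]:
    print(name, f(1 - 2j))
print("modReLU", modrelu(3 + 4j, -1.0))
```

Note that cardioid and modReLU only rescale the magnitude, whereas CReLU and the zReLU variants can change or zero out the phase.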

Complex-Valued Batch Normalization

Standardization process for a complex-valued vector $x$: $\tilde{x} = V^{-1/2}(x - E(x))$

where the covariance matrix is: $V = \begin{pmatrix} \text{Cov}(\text{Re}(x), \text{Re}(x)) & \text{Cov}(\text{Re}(x), \text{Im}(x)) \\ \text{Cov}(\text{Im}(x), \text{Re}(x)) & \text{Cov}(\text{Im}(x), \text{Im}(x)) \end{pmatrix} + \lambda I$
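
As a rough illustration of this whitening step, the sketch below normalizes a batch of complex scalars using the closed-form inverse square root of a symmetric positive-definite 2x2 matrix. It omits the learnable scale and shift parameters and the running statistics that a full layer would typically carry; the function name, the value of `lam`, and the sample batch are assumptions.

```python
import math

def complex_batch_norm(xs, lam=1e-5):
    """Whiten a batch of complex numbers with the 2x2 real/imag covariance."""
    n = len(xs)
    mean = sum(xs) / n
    re = [(x - mean).real for x in xs]
    im = [(x - mean).imag for x in xs]
    # Entries of V = Cov + lam * I (lam is an assumed small regularizer).
    vrr = sum(r * r for r in re) / n + lam
    vii = sum(i * i for i in im) / n + lam
    vri = sum(r * i for r, i in zip(re, im)) / n
    # For a symmetric positive-definite 2x2 matrix:
    # sqrt(V) = (V + s*I) / t with s = sqrt(det V), t = sqrt(trace V + 2s),
    # and det(sqrt(V)) = s, which gives the inverse below.
    s = math.sqrt(vrr * vii - vri * vri)
    t = math.sqrt(vrr + vii + 2.0 * s)
    wrr, wii, wri = (vii + s) / (s * t), (vrr + s) / (s * t), -vri / (s * t)
    # Apply V^{-1/2} to each centered (real, imag) pair.
    return [complex(wrr * r + wri * i, wri * r + wii * i) for r, i in zip(re, im)]

batch = [1 + 2j, 2 + 1j, 3 + 5j, -1 + 0j, 0.5 - 2j]  # illustrative batch
normed = complex_batch_norm(batch)
print(normed)
```

After this step the real and imaginary parts of the batch have approximately unit variance and zero correlation, which is stronger than normalizing each part on its own.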

Technical Innovations

Wirtinger Calculus Application: Addresses gradient computation for non-analytic complex-valued functions
Phase-Aware Feature Extraction: Designs two phase-preserving MFCC extraction pipelines
Graph Neural Network Integration: Innovatively employs GNN edge weights to explicitly model phase information
Activation Function Optimization: Proposes smooth zReLU to address training instability issues
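
The role of Wirtinger calculus can be illustrated numerically. Writing $z = x + iy$, the Wirtinger derivatives are $\partial/\partial z = \frac{1}{2}(\partial/\partial x - i\,\partial/\partial y)$ and $\partial/\partial \bar{z} = \frac{1}{2}(\partial/\partial x + i\,\partial/\partial y)$. The following sketch estimates both with central differences; the function name `wirtinger`, the step size, and the test point are illustrative assumptions.

```python
def wirtinger(f, z, h=1e-6):
    """Central-difference estimates of (df/dz, df/dconj(z)) at z."""
    dfdx = (f(z + h) - f(z - h)) / (2 * h)            # partial along the real axis
    dfdy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # partial along the imaginary axis
    d_dz = 0.5 * (dfdx - 1j * dfdy)
    d_dzbar = 0.5 * (dfdx + 1j * dfdy)
    return d_dz, d_dzbar

z0 = 1.5 - 0.5j

# |z|^2 = z * conj(z) is real-valued and not complex-differentiable, yet its
# Wirtinger derivatives exist: d/dz = conj(z) and d/dconj(z) = z.
d_dz, d_dzbar = wirtinger(lambda z: abs(z) ** 2, z0)

# z^2 is holomorphic, so d/dconj(z) vanishes and d/dz = 2z.
h_dz, h_dzbar = wirtinger(lambda z: z * z, z0)

print(d_dz, d_dzbar)
print(h_dz, h_dzbar)
```

Real-valued losses such as $|z|^2$ fall in the first category, which is why training complex-valued networks relies on Wirtinger derivatives rather than ordinary complex differentiation.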

Experimental Setup

Datasets

Image Datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST
Audio Dataset: GTZAN music genre dataset (1,000 audio clips of 30 seconds each, spanning 10 genres)

Evaluation Metrics

Training and testing accuracy
Training time comparison
Convergence analysis

Baseline Methods

Standard real-valued CNN (baseline)
CVCNN with different configurations (real-valued input, complex-valued input, etc.)
CVCNN variants with different activation functions

Implementation Details

PyTorch and complexPyTorch libraries
CPU training on Apple M2 Pro chip
Gradient clipping to prevent training instability
Training runs of 5-10 epochs

Experimental Results

Main Results

Image Classification Experiments

On MNIST, KMNIST, and Fashion-MNIST, CVCNNs achieve comparable performance to real-valued CNNs across various input configurations:

MNIST: ~99% test accuracy
KMNIST: ~95% test accuracy
Fashion-MNIST: ~90% test accuracy

Audio Classification Experiments

On binary music genre classification tasks:

Real-valued CNN baseline: 92.5% test accuracy
CVCNN (real-valued MFCC): 95.34% test accuracy (cardioid activation)
CVCNN (complex-valued MFCC): Performance degradation, revealing current architectural limitations

Activation Function Comparison

The cardioid activation function demonstrates superior performance across all experiments:

Most stable under complex-valued input perturbations
Achieves highest accuracy on audio tasks
Most stable training process

Ablation Studies

Impact of Different Activation Functions

Experimental results reveal:

Cardioid: Excellent performance across various settings, particularly under phase perturbations
modReLU: Unstable under fixed phase and imaginary part settings, with significant accuracy drops
smooth zReLU: Good performance under no transformation and noise settings
CReLU: Serves as a stable baseline choice

Phase Information Value Verification

GNN experiments explicitly demonstrate the value of phase information:

GNN without phase information (baseline)
GNN with phase difference-based edge weights: significantly outperforms baseline on both binary and ten-class classification tasks

Experimental Findings

Training Efficiency: CVCNN training time is approximately 4-5 times that of real-valued CNNs
Stability: Appropriate activation function selection is critical for training stability
Phase Utilization: Current architectures have limited capacity for directly exploiting phase information
Generalization Capacity: CVCNNs demonstrate good robustness under complex-valued perturbations

Related Work

Complex-Valued Neural Network Development

Early work primarily focused on theoretical foundations and basic architectures
Recent breakthroughs in specific domains such as MRI reconstruction and SAR image processing

Deep Learning in Audio Signal Processing

Traditional methods primarily based on magnitude spectrum features
Phase-aware methods gaining attention, such as Deep Complex U-Net

Advantages of This Work

Compared to existing work, this paper provides a more systematic theoretical framework and more comprehensive experimental validation, particularly in activation function comparison and phase information value verification.

Conclusions and Discussion

Main Conclusions

Architectural Feasibility: CVCNNs maintain comparable performance to real-valued CNNs while providing capacity for processing complex-valued information
Phase Information Value: GNN experiments explicitly demonstrate the discriminative value of phase information in audio classification
Activation Function Importance: Phase-aware activation functions such as cardioid significantly outperform traditional choices
Application Potential: With appropriate architectural design, CVCNNs show promise for breakthroughs in audio processing tasks

Limitations

Computational Overhead: Significantly increased training time (4-5 times)
Architectural Constraints: Current design still falls short in directly exploiting phase information
Domain Specificity: The value of phase information may be limited in certain tasks
Implementation Complexity: Requires specialized complex-valued computation libraries

Future Directions

Architectural Innovation: Design specialized phase-aware modules and attention mechanisms
Training Optimization: Develop more efficient training algorithms for complex-valued networks
Application Extension: Explore applications in speech recognition, sound source localization, and other tasks
Theoretical Deepening: Further understand the expressive capacity and learning dynamics of complex-valued representations

In-Depth Evaluation

Strengths

Theoretical Completeness: Provides a complete mathematical framework for CVCNNs, from basic operations to training techniques
Experimental Comprehensiveness: Systematic evaluation across domains (image + audio) and multiple perspectives (different activation functions, input configurations)
Innovation Verification: Cleverly validates the intrinsic value of phase information through GNN experiments
Practical Guidance: Provides concrete technical guidance for practical CVCNN applications

Weaknesses

Limited Performance Improvement: CVCNNs' advantages over real-valued CNNs are not pronounced in certain tasks
Computational Efficiency: Significant computational overhead may limit practical applications
Insufficient Architectural Exploration: Primarily uses standard CNN architectures, lacking specialized designs for complex-valued characteristics
Dataset Scale: Experiments mainly conducted on relatively simple datasets

Impact

Academic Contribution: Provides important theoretical and experimental foundations for complex-valued neural network research
Practical Value: Introduces new technical approaches to the audio signal processing field
Reproducibility: Provides complete code implementation, facilitating subsequent research
Inspirational Value: Points to directions for developing phase-aware deep learning

Applicable Scenarios

Audio Processing: Music analysis, speech enhancement, acoustic scene classification
Signal Processing: Radar signal processing, communication systems, biomedical signal analysis
Scientific Computing: Physics simulation and numerical computation involving complex-valued data
Research Tools: Serves as a foundational platform for exploring the value of phase information

References

The paper cites 37 important references covering complex-valued neural network theory, audio signal processing, deep learning optimization, and other domains, providing solid theoretical foundation and technical support for the research.

Overall Assessment: This is a highly systematic research paper that bridges theory and practical application of complex-valued neural networks. While performance improvements in certain aspects are not yet substantial, it provides important foundational work and research directions for the field's development.