2025-11-18T05:49:12.501691

Phase-Aware Deep Learning with Complex-Valued CNNs for Audio Signal Applications

Agrawal

This study explores the design and application of Complex-Valued Convolutional Neural Networks (CVCNNs) in audio signal processing, with a focus on preserving and exploiting the phase information that real-valued networks often neglect. We first present the theoretical foundations of CVCNNs, including complex convolutions, pooling layers, Wirtinger-based differentiation, and various complex-valued activation functions. These are complemented by key adaptations of training techniques, including complex batch normalization and weight initialization schemes, to keep training stable. Empirical evaluations are conducted in three stages. Although our focus is audio signal processing, we first evaluate CVCNNs on standard image datasets to establish baseline performance and validate training stability; there they perform competitively with real-valued CNNs, even under synthetic complex perturbations. In the second experiment, we turn to audio classification using Mel-Frequency Cepstral Coefficients (MFCCs). CVCNNs trained on real-valued MFCCs slightly outperform real-valued CNNs, whereas input pipelines that preserve phase show how difficult it is to exploit phase without architectural modifications. Finally, a third experiment introduces Graph Neural Networks (GNNs) that model phase information through edge weighting; including phase yields measurable gains in both binary and multi-class genre classification. These results underscore the expressive capacity of complex-valued architectures and confirm phase as a meaningful and exploitable feature in audio processing. While current methods show promise, especially with activations such as cardioid, future advances in phase-aware design will be essential to realize the potential of complex representations in neural networks.

Basic Information

Paper ID: 2510.09926
Title: Phase-Aware Deep Learning with Complex-Valued CNNs for Audio Signal Applications
Author: Agrawal Naman (National University of Singapore)
Categories: cs.LG cs.AI cs.SD
Published: October 10, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.09926

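Abstract

This study investigates the design and application of complex-valued convolutional neural networks (CVCNNs) in audio signal processing. It focuses on preserving and exploiting the phase information that conventional real-valued networks ignore. The study first establishes the theoretical foundations of CVCNNs, including complex-valued convolution, pooling layers, Wirtinger-based differentiation, and a range of complex-valued activation functions. It pairs these with key training techniques such as complex batch normalization and weight initialization schemes. The experiments proceed in three stages. First, the baseline performance of CVCNNs is validated on standard image datasets. Second, they are evaluated on audio classification using Mel-frequency cepstral coefficients (MFCCs). Finally, graph neural networks (GNNs) are introduced to model phase information explicitly through edge weights. The results show that CVCNNs have strong expressive power and that phase is a meaningful, exploitable feature in audio processing.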

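Research Background and Motivation

Problem Definition

Conventional real-valued convolutional neural networks have a fundamental shortcoming in audio signal processing: they inherently discard or underuse phase information. Phase is nonetheless a crucial component in many signal processing tasks.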

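Significance

Value of phase information: When an audio signal is transformed into the frequency domain with the short-time Fourier transform (STFT), the output is complex-valued. The magnitude gives the amplitude of each time-frequency component, while the phase carries important timing and spatial information.
Application needs: In tasks such as speech enhancement, sound source localization, and audio classification, phase information can potentially improve performance.
Technical progress: CVCNNs have already shown clear advantages in fields such as remote sensing, medical imaging, and communication systems.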
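To make the magnitude/phase split concrete, here is a minimal NumPy sketch written for this summary; it is not code from the paper. It computes a Hann-windowed STFT and separates it into magnitude and phase. The frame length `n_fft=256`, the hop `hop=128`, and the 440 Hz test tone at 8 kHz are arbitrary assumptions.

```python
import numpy as np

def stft(signal, n_fft=256, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # one-sided spectrum of each frame

# Hypothetical test signal: one second of a 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)

S = stft(x)               # complex-valued time-frequency representation
magnitude = np.abs(S)     # what a magnitude-only CNN consumes
phase = np.angle(S)       # what such a network discards, in radians

# Magnitude and phase together recover the complex STFT exactly.
assert np.allclose(magnitude * np.exp(1j * phase), S)
```

A real-valued CNN fed only `magnitude` loses `phase` entirely. A CVCNN can take `S` directly.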

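Limitations of Existing Methods

Conventional CNNs process only the magnitude spectrum and ignore phase information entirely.
Effective training techniques and a theoretical framework for complex-valued networks are lacking.
Existing complex-valued activation functions make stable training difficult.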

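Research Motivation

The work extends CNNs to the complex domain. It builds neural network architectures that process magnitude and phase information jointly, giving audio signal processing more expressive and efficient representations.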

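Core Contributions

Theoretical framework: Systematically establishes the mathematical foundations of CVCNNs, covering complex-valued convolution, pooling, activation functions, and batch normalization.
Training techniques: Presents weight initialization strategies and batch normalization methods suited to complex-valued networks, ensuring stable training.
Improved activation function: Proposes a smooth zReLU activation that removes the discontinuity of the original zReLU.
Validation of phase information: GNN experiments explicitly demonstrate the value of phase information in audio classification.
Comprehensive evaluation: Experiments in both the image and audio domains provide empirical support for applying CVCNNs.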

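Method Details

Task Definition

This work studies audio signal classification, in particular music genre classification. The input is an MFCC feature representation of the audio signal, and the output is a class label. The core challenge is how to exploit the phase information of audio signals effectively within a neural network.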

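Model Architecture

Complex-Valued Convolution

For a complex-valued input matrix $X = A_1 + iB_1$ and a complex-valued convolution kernel $W = A_2 + iB_2$, the complex convolution is defined as: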

$W * X = (A_1 * A_2 - B_1 * B_2) + i(B_1 * A_2 + A_1 * B_2)$

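Equivalently, each complex quantity $a + ib$ can be represented by the real block matrix $\begin{pmatrix} a & -b \\ b & a \end{pmatrix}$. The complex convolution then becomes a block-matrix product of real convolutions: $\begin{pmatrix} A_1 & -B_1 \\ B_1 & A_1 \end{pmatrix} * \begin{pmatrix} A_2 & -B_2 \\ B_2 & A_2 \end{pmatrix} = \begin{pmatrix} \text{Re}(W * X) & -\text{Im}(W * X) \\ \text{Im}(W * X) & \text{Re}(W * X) \end{pmatrix}$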
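To make the definition concrete, the following NumPy sketch builds the complex convolution from four real-valued convolutions, exactly as in the expansion of $W * X$ into real and imaginary parts. It then checks the result against a direct convolution of the complex arrays. The helpers `conv2d_valid` (a plain loop-based "valid" convolution) and `complex_conv2d` are illustrative names, not the paper's implementation. A deep-learning framework would substitute its own real convolution layers for `conv2d_valid`.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2-D 'valid' convolution (kernel flipped); works for real or complex arrays."""
    kf = k[::-1, ::-1]
    kh, kw = kf.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.result_type(x, k))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kf)
    return out

def complex_conv2d(A1, B1, A2, B2):
    """W * X with X = A1 + i*B1 (input) and W = A2 + i*B2 (kernel), via real convolutions."""
    real = conv2d_valid(A1, A2) - conv2d_valid(B1, B2)   # A1*A2 - B1*B2
    imag = conv2d_valid(B1, A2) + conv2d_valid(A1, B2)   # B1*A2 + A1*B2
    return real, imag

# Check against convolving the complex arrays directly (random, hypothetical data).
rng = np.random.default_rng(0)
A1, B1 = rng.standard_normal((2, 5, 5))
A2, B2 = rng.standard_normal((2, 3, 3))
re, im = complex_conv2d(A1, B1, A2, B2)
direct = conv2d_valid(A1 + 1j * B1, A2 + 1j * B2)
assert np.allclose(re + 1j * im, direct)
```

The four real convolutions share the kernels $A_2$ and $B_2$. This sharing couples the real and imaginary channels, which is what distinguishes a complex layer from two independent real layers.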

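Complex-Valued Pooling Layers

Max pooling: The maximum is selected by complex magnitude, and the corresponding phase is recovered from the index of the maximum-magnitude element.
Average pooling: The real and imaginary parts are averaged separately.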

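Complex-Valued Activation Functions

The paper compares five complex-valued activation functions in detail:

CReLU: $\text{CReLU}(z) = \text{ReLU}(\text{Re}(z)) + i\text{ReLU}(\text{Im}(z))$
modReLU: $\text{modReLU}(z) = \text{ReLU}(|z| + b) \cdot \frac{z}{|z|}$, where $b$ is a learnable real bias
zReLU: returns the input unchanged only when both its real and imaginary parts are non-negative: $\text{zReLU}(z) = \begin{cases} z & \text{if } \text{Re}(z) \ge 0 \text{ and } \text{Im}(z) \ge 0 \\ 0 & \text{otherwise} \end{cases}$
smooth zReLU: $z \cdot \sigma(\alpha \cdot \text{Re}(z)) \cdot \sigma(\alpha \cdot \text{Im}(z))$, where $\sigma$ is the sigmoid function and $\alpha$ controls the sharpness of the transition; as $\alpha \to \infty$ it approaches zReLU for points off the real and imaginary axes
cardioid: $g(z) = \frac{z}{2}(1 + \cos \phi_z)$, where $\phi_z$ is the phase of $z$; it rescales the magnitude by a phase-dependent factor and leaves the phase unchanged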
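The five activations can be written directly from the formulas in a short NumPy sketch. Several details are assumptions not fixed by the text. $\sigma$ is taken to be the logistic sigmoid, and `alpha=10.0` is an arbitrary default sharpness for smooth zReLU. The modReLU bias `b` is passed as a constant rather than learned, and a small `eps` guards the division by $|z|$ at $z = 0$.

```python
import numpy as np

def crelu(z):
    """ReLU applied separately to the real and imaginary parts."""
    return np.maximum(z.real, 0) + 1j * np.maximum(z.imag, 0)

def modrelu(z, b, eps=1e-8):
    """Shift |z| by the bias b, rectify it, and keep the phase z / |z|."""
    mag = np.abs(z)
    return np.maximum(mag + b, 0) * z / (mag + eps)  # eps (assumed) avoids 0/0 at z = 0

def zrelu(z):
    """Pass z through only if Re(z) >= 0 and Im(z) >= 0; otherwise output 0."""
    return np.where((z.real >= 0) & (z.imag >= 0), z, 0)

def sigmoid(x):
    """Logistic sigmoid (assumed meaning of sigma)."""
    return 1.0 / (1.0 + np.exp(-x))

def smooth_zrelu(z, alpha=10.0):
    """Smooth gates on the real and imaginary parts; approaches zReLU as alpha grows."""
    return z * sigmoid(alpha * z.real) * sigmoid(alpha * z.imag)

def cardioid(z):
    """Scale z by (1 + cos(phase)) / 2, a non-negative real factor, so the phase is unchanged."""
    return 0.5 * z * (1 + np.cos(np.angle(z)))

z = np.array([1 + 1j, -1 + 1j, 1 - 1j, -2 + 0j])
for f in (crelu, zrelu, smooth_zrelu, cardioid):
    print(f.__name__, np.round(f(z), 3))
print("modrelu", np.round(modrelu(z, b=-0.5), 3))
```

The sketch makes the continuity issue visible. `zrelu` jumps from `z` to 0 at the quadrant boundary, while `smooth_zrelu` replaces that jump with sigmoid gates.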

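Complex-Valued Batch Normalization

A complex vector $x$ is treated as the real 2-D vector $(\text{Re}(x), \text{Im}(x))$ and standardized as: $\tilde{x} = V^{-1/2}(x - E(x))$

where $V$ is the $2 \times 2$ covariance matrix of the real and imaginary parts. A small multiple of the identity, $\lambda I$, is added to keep it invertible: $V = \begin{pmatrix} \text{Cov}(\text{Re}(x), \text{Re}(x)) & \text{Cov}(\text{Re}(x), \text{Im}(x)) \\ \text{Cov}(\text{Im}(x), \text{Re}(x)) & \text{Cov}(\text{Im}(x), \text{Im}(x)) \end{pmatrix} + \lambda I$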
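This is a minimal NumPy sketch of the whitening step for a single complex feature over a batch, written for illustration rather than taken from the paper. It treats each value as the real vector $(\text{Re}, \text{Im})$, computes $V$ with `np.cov`, and forms $V^{-1/2}$ by eigendecomposition. The function name `complex_batch_norm` and the default `lam=1e-5` are assumptions. The learnable scaling and shift that typically follow this whitening in complex batch normalization are omitted.

```python
import numpy as np

def complex_batch_norm(z, lam=1e-5):
    """Whiten one complex feature over a batch: x_tilde = V^{-1/2} (x - E[x]).

    z   : complex array of shape (N,), one feature across N samples
    lam : lambda in V + lambda * I (assumed small, for numerical stability)
    """
    x = np.stack([z.real, z.imag])                     # (2, N): rows are Re and Im
    centered = x - x.mean(axis=1, keepdims=True)       # x - E(x)
    V = np.cov(centered, bias=True) + lam * np.eye(2)  # 2x2 covariance + lambda * I
    evals, evecs = np.linalg.eigh(V)                   # V is symmetric positive definite
    V_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    out = V_inv_sqrt @ centered                        # decorrelated, unit variance
    return out[0] + 1j * out[1]

# Hypothetical batch with correlated, unequally scaled real and imaginary parts.
rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 1000))
z = (2 * a + 1) + 1j * (a + 0.3 * b - 2)
z_tilde = complex_batch_norm(z)
print(np.round(np.cov(np.stack([z_tilde.real, z_tilde.imag]), bias=True), 3))
```

Whitening with the full $2 \times 2$ matrix $V$ does more than rescale $\text{Re}$ and $\text{Im}$ independently. It also removes their correlation, so the output covariance is approximately the identity.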