learning discriminative features from spectrograms using center loss for speech emotion recognition
Dai, Wu, Li et al.
Identifying the emotional state from speech is essential for natural interaction between a machine and a speaker. However, extracting effective features for emotion recognition is difficult because emotions are ambiguous. We propose a novel approach that learns discriminative features from variable-length spectrograms for emotion recognition by combining softmax cross-entropy loss with center loss. The softmax cross-entropy loss makes features from different emotion categories separable, and center loss efficiently pulls features belonging to the same emotion category toward their center. Combining the two losses greatly enhances discriminative power, which leads the network to learn more effective features for emotion recognition. As the experimental results demonstrate, introducing center loss improves both unweighted accuracy and weighted accuracy by over 3% on Mel-spectrogram input and by more than 4% on Short Time Fourier Transform spectrogram input.
academic
Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition
This paper addresses the challenge of feature extraction in speech emotion recognition (SER) caused by the inherent ambiguity of emotions. The authors propose a novel approach that combines softmax cross-entropy loss with center loss to learn discriminative features from variable-length spectrograms. Softmax cross-entropy loss ensures separability of features across different emotion classes, while center loss effectively pulls features of the same emotion class toward their class center. Experimental results demonstrate that introducing center loss improves both unweighted accuracy (UA) and weighted accuracy (WA) by over 3% on Mel-spectrogram inputs and over 4% on Short-Time Fourier Transform (STFT) spectrogram inputs.
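The two reported metrics can be sketched as follows, assuming the convention common in the SER literature: WA is the overall accuracy over all test utterances, and UA is the mean of the per-class recalls. The function name and the toy labels are illustrative and are not taken from the paper.

```python
import numpy as np

def weighted_and_unweighted_accuracy(y_true, y_pred):
    """Return (WA, UA) for integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # WA: fraction of all utterances classified correctly.
    wa = float(np.mean(y_true == y_pred))
    # UA: recall computed per class, then averaged with equal class weight.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = float(np.mean(recalls))
    return wa, ua

# Toy example with an imbalanced label set (0 = neutral, 1 = angry).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
wa, ua = weighted_and_unweighted_accuracy(y_true, y_pred)
```

On imbalanced data the two values diverge, because UA gives every emotion class the same weight regardless of how many utterances it contains. That is why SER results are usually reported with both.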
Speech emotion recognition (SER) is a key technology for natural human-computer interaction, requiring the extraction of features from speech waveforms and their classification into corresponding emotion categories. However, the inherent ambiguity of emotions makes effective feature extraction challenging.
Traditional approaches: Extract frame-level features from overlapping frames and apply statistical functions, which limits feature expressiveness
Existing deep learning methods: Use neural networks to extract high-level features, but remain insufficient in handling emotion ambiguity
Existing discriminative learning methods: Approaches such as cosine similarity loss and triplet loss rely on two-step strategies that may degrade performance, and they depend on how sample pairs or triplets are selected
The authors therefore propose an end-to-end approach that learns discriminative features under the joint supervision of two loss functions (softmax cross-entropy loss + center loss), avoiding the inconsistency problems of two-step strategies.
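One way to write such a joint objective is sketched below, following the standard formulation of center loss. The notation is an assumption made here for illustration and is not quoted from the paper: x_i is the learned feature of utterance i, y_i its emotion label, c_{y_i} the center of that class, m the mini-batch size, K the number of classes, W and b the parameters of the final linear layer, and \lambda a trade-off weight. Normalizing both terms by m is a convention choice.

```latex
\begin{align}
L_s &= -\frac{1}{m}\sum_{i=1}^{m} \log
       \frac{\exp\left(W_{y_i}^{\top} x_i + b_{y_i}\right)}
            {\sum_{j=1}^{K} \exp\left(W_j^{\top} x_i + b_j\right)} \\
L_c &= \frac{1}{2m}\sum_{i=1}^{m} \left\lVert x_i - c_{y_i} \right\rVert_2^2 \\
L   &= L_s + \lambda L_c
\end{align}
```

Setting \lambda = 0 recovers plain softmax cross-entropy training, while larger values of \lambda put more weight on pulling each feature toward its class center.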
Proposes a novel joint loss function approach: Combines softmax cross-entropy loss with center loss for learning discriminative features from variable-length spectrograms
Achieves end-to-end speech emotion recognition: Avoids the two-step strategy problems of existing methods without requiring construction of sample pairs or triplets
Achieves significant performance improvements on the IEMOCAP dataset: Both UA and WA improve by over 3% on Mel-spectrogram inputs and by over 4% on STFT spectrogram inputs
Provides detailed visualization analysis: Demonstrates the enhancement of feature discriminability through center loss via PCA embedding visualization
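A PCA embedding of this kind can be sketched as below. This is a minimal illustration on synthetic data, not the authors' visualization code: the feature dimension, class count, sample sizes, and the helper name `pca_embed` are all hypothetical, and the plotting step is omitted.

```python
import numpy as np

def pca_embed(features, n_components=2):
    """Project row-vector features onto their top principal components."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
# Synthetic stand-in for learned features: 4 classes, 50 samples each, 16-D.
labels = np.repeat(np.arange(4), 50)
class_means = rng.normal(scale=5.0, size=(4, 16))
features = class_means[labels] + rng.normal(size=(200, 16))
embedded = pca_embed(features)  # shape (200, 2), ready for a scatter plot
```

Coloring the two-dimensional points by emotion label then shows how tightly each class clusters and how far apart the clusters lie.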
Input: Variable-length spectrograms (LT × LF, where LT is the time dimension and LF is the frequency dimension)
Output: Emotion class labels (neutral, angry, happy, sad)
Objective: Learn discriminative features with small intra-class variance and large inter-class variance
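The objective above can be illustrated with a small NumPy sketch of the two loss terms on a toy batch. It assumes utterance-level features have already been extracted from the spectrograms. The trade-off weight `lam`, the batch-mean normalization, and all numeric values are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def center_loss(features, labels, centers):
    """Half the mean squared distance from each feature to its class center."""
    diffs = features - centers[labels]
    return float(0.5 * (diffs ** 2).sum(axis=1).mean())

def joint_loss(logits, features, labels, centers, lam=0.5):
    # lam is a hypothetical trade-off weight; lam = 0 recovers plain softmax.
    ce = softmax_cross_entropy(logits, labels)
    return ce + lam * center_loss(features, labels, centers)

# Toy batch: 4 utterance-level features in 2-D, two emotion classes.
features = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 7.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0, 0.0], [5.0, 6.0]])
logits = np.zeros((4, 2))  # uninformative classifier output
lc = center_loss(features, labels, centers)
total = joint_loss(logits, features, labels, centers, lam=0.5)
```

The center term shrinks as features move toward their class centers, which targets intra-class variance, while the cross-entropy term keeps the classes separable. In center loss as originally formulated, the centers themselves are also updated during training; that update is left out of this sketch.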
The authors propose exploring additional loss function designs, particularly methods for increasing inter-class variance of features, to further improve SER performance.
The paper cites 19 relevant references covering traditional methods, deep learning approaches, and discriminative feature learning in speech emotion recognition, providing sufficient theoretical foundation and technical comparison.
Overall Assessment: This is a technically solid paper with comprehensive experiments that successfully introduces center loss to speech emotion recognition and achieves significant performance improvements. While there is room for improvement in theoretical analysis and cross-dataset validation, its simple and effective approach combined with consistent experimental results provides good academic and practical value.