Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition

Dai, Wu, Li et al.

Identifying the emotional state from speech is essential for natural interaction between a machine and a speaker. However, extracting effective features for emotion recognition is difficult because emotions are ambiguous. We propose a novel approach to learn discriminative features from variable-length spectrograms for emotion recognition by combining softmax cross-entropy loss with center loss. The softmax cross-entropy loss makes features from different emotion categories separable, and center loss efficiently pulls features belonging to the same emotion category toward their center. Combining the two losses enhances the discriminative power of the features, which leads the network to learn more effective features for emotion recognition. As demonstrated by the experimental results, after introducing center loss, both the unweighted accuracy and weighted accuracy improve by over 3% on Mel-spectrogram input and by more than 4% on Short-Time Fourier Transform spectrogram input.

Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition

Basic Information

Paper ID: 2501.01103
Title: Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition
Authors: Dongyang Dai, Zhiyong Wu, Runnan Li, Xixin Wu, Jia Jia, Helen Meng
Classification: eess.AS (Audio and Speech Processing), cs.AI (Artificial Intelligence), cs.SD (Sound)
Publication Date: January 2, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.01103

Abstract

This paper addresses the challenge of feature extraction in speech emotion recognition (SER) caused by the inherent ambiguity of emotions. The authors propose a novel approach that combines softmax cross-entropy loss with center loss to learn discriminative features from variable-length spectrograms. Softmax cross-entropy loss ensures separability of features across different emotion classes, while center loss effectively pulls features of the same emotion class toward their class center. Experimental results demonstrate that introducing center loss improves both unweighted accuracy (UA) and weighted accuracy (WA) by over 3% on Mel-spectrogram inputs and over 4% on Short-Time Fourier Transform (STFT) spectrogram inputs.

Research Background and Motivation

1. Problem Definition

Speech emotion recognition (SER) is a key technology for natural human-computer interaction, requiring the extraction of features from speech waveforms and their classification into corresponding emotion categories. However, the inherent ambiguity of emotions makes effective feature extraction challenging.

2. Problem Significance

Speech emotion recognition is crucial for achieving natural human-computer interaction
Different emotion types are easily confused with one another, which makes effective feature extraction harder
Traditional methods have limitations in handling emotion ambiguity

3. Limitations of Existing Methods

Traditional approaches: Extract frame-level features from overlapping frames and apply statistical functions to them; the resulting features have limited expressiveness
Existing deep learning methods: While utilizing neural networks to extract high-level features, they remain insufficient in handling emotion ambiguity
Existing discriminative learning methods: Approaches based on cosine similarity loss or triplet loss rely on two-step strategies, which may degrade performance, and their results depend on how sample pairs or triplets are selected

4. Research Motivation

Propose an end-to-end approach that learns discriminative features through joint supervised loss functions (softmax cross-entropy loss + center loss), avoiding the inconsistency problems of two-step strategies.

Core Contributions

Proposes a novel joint loss function approach: Combines softmax cross-entropy loss with center loss for learning discriminative features from variable-length spectrograms
Achieves end-to-end speech emotion recognition: Avoids the two-step strategy problems of existing methods without requiring construction of sample pairs or triplets
Achieves significant performance improvements on the IEMOCAP dataset: Over 3% improvement on Mel-spectrogram inputs and over 4% on STFT spectrogram inputs
Provides detailed visualization analysis: Demonstrates the enhancement of feature discriminability through center loss via PCA embedding visualization

Methodology Details

Task Definition

Input: Variable-length spectrograms (LT × LF, where LT is the time dimension and LF is the frequency dimension)
Output: Emotion class labels (neutral, angry, happy, sad)
Objective: Learn discriminative features with small intra-class variance and large inter-class variance

Model Architecture

The model comprises the following components:

CNN Layers: Extract spatial information from spectrograms
- Layer 1: 48 convolution kernels of size 7×7, stride (2, 2), ReLU activation
- Layer 2: 64 convolution kernels of size 3×3, stride (1, 1), ReLU activation
- Layer 3: 80 convolution kernels of size 3×3, stride (1, 1), ReLU activation
- Layer 4: 96 convolution kernels of size 3×3, stride (1, 1), ReLU activation
- Max pooling layer with a 2×2 window and stride (2, 2) after each convolutional layer
Bidirectional RNN Layer (Bi-RNN):
- Uses 128-dimensional GRU units
- Compresses variable-length sequences into fixed-length vectors (256 dimensions)
- Concatenates the final outputs of forward and backward RNNs
Fully Connected Layers:
- FC1: Projects Bi-RNN output to target feature space (64 dimensions) with PReLU activation
- FC2: Outputs posterior probabilities for computing softmax cross-entropy loss

Loss Function Design

1. Softmax Cross-Entropy Loss

$$L_s = -\frac{1}{\sum_{i=1}^{m} \omega_{y_i}} \sum_{i=1}^{m} \omega_{y_i} \log \frac{e^{W_{y_i}^T z_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T z_i + b_j}}$$

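where z_i is the feature of the i-th sample (the FC1 output), y_i is its class label, m is the number of samples in the mini-batch, n is the number of emotion classes, W_j and b_j are the FC2 weights and bias for class j, and ω_j is the weight of class j, used to address class imbalance.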
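
The weighted cross-entropy above can be written out in a few lines of plain Python. This is a minimal sketch, not the authors' code: the function name, the example logits, and the class weights are all made up for illustration, and `logits[i][j]` stands in for W_j^T z_i + b_j.

```python
import math


def weighted_softmax_ce(logits, labels, class_weights):
    """Class-weighted softmax cross-entropy over one mini-batch.

    logits[i][j] plays the role of W_j^T z_i + b_j (the FC2 output),
    labels[i] is y_i, and class_weights[j] is omega_j.
    """
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        log_prob = row[y] - log_norm  # log-softmax of the true class
        total += class_weights[y] * log_prob
        weight_sum += class_weights[y]
    # Normalize by the summed weights of the samples in the batch.
    return -total / weight_sum


# Hypothetical batch: 2 samples, 4 emotion classes, made-up weights.
logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]]
labels = [0, 3]
weights = [0.8, 1.2, 0.8, 1.2]
print(weighted_softmax_ce(logits, labels, weights))
```

Because the loss is divided by the sum of the sample weights, rescaling all class weights by a common factor leaves its value unchanged; only the relative weights matter.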

2. Center Loss

$$L_c = \frac{1}{\sum_{i=1}^{m} \omega_{y_i}} \sum_{i=1}^{m} \omega_{y_i} \lVert z_i - c_{y_i} \rVert^2$$

where c_j is the global center of class j, updated as follows:

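$$c_j^{t+1} = \begin{cases} (1-\alpha)\, c_j^{t} + \alpha\, \dot{c}_j^{t} & \text{if the mini-batch contains samples of class } j \\ c_j^{t} & \text{otherwise} \end{cases}$$

Here ċ_j^t denotes the center (mean) of the class-j features in the current mini-batch, and α controls the update rate.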
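
A small sketch of the center loss and the center update may make the two formulas concrete. It assumes that the mini-batch term in the update rule is the mean of the class-j features in the current batch; the function names and the 2-D toy features are hypothetical (the paper's features are 64-dimensional).

```python
def center_loss(features, labels, centers, class_weights):
    """Class-weighted mean squared distance between z_i and its center c_{y_i}."""
    total, weight_sum = 0.0, 0.0
    for z, y in zip(features, labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(z, centers[y]))
        total += class_weights[y] * sq_dist
        weight_sum += class_weights[y]
    return total / weight_sum


def update_centers(features, labels, centers, alpha):
    """Move each global center toward the mini-batch mean of its class.

    Classes absent from the mini-batch keep their previous center.
    """
    new_centers = [list(c) for c in centers]
    for j, c in enumerate(centers):
        members = [z for z, y in zip(features, labels) if y == j]
        if not members:
            continue  # no class-j samples: c_j stays unchanged
        # ASSUMPTION: the mini-batch term is the mean of the class-j features.
        batch_mean = [sum(col) / len(members) for col in zip(*members)]
        new_centers[j] = [(1 - alpha) * old + alpha * new
                          for old, new in zip(c, batch_mean)]
    return new_centers


# Toy batch: three samples, classes 0 and 2 present, class 1 absent.
features = [[1.0, 1.0], [3.0, 1.0], [0.0, 4.0]]
labels = [0, 0, 2]
centers = [[0.0, 0.0], [5.0, 5.0], [0.0, 0.0]]
weights = [1.0, 1.0, 1.0]

print("loss before update:", center_loss(features, labels, centers, weights))
centers = update_centers(features, labels, centers, alpha=0.5)
print("updated centers:", centers)
print("loss after update:", center_loss(features, labels, centers, weights))
```

In this toy run the center of class 1 is left untouched because no class-1 sample appears in the batch, while the other two centers move halfway toward their batch means.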

3. Joint Loss

L = L_s + λL_c

where λ is a hyperparameter balancing the two losses.

Technical Innovations

End-to-End Learning: Avoids the two-step strategy problems of traditional discriminative learning methods
Natural Integration: Center loss can be naturally integrated into common SER models
No Sample Pairing Required: Eliminates the need for constructing sample pairs or triplets, simplifying the training process
Class Balance Handling: Effectively addresses data imbalance through weighted loss functions

Experimental Setup

Dataset

IEMOCAP Dataset:

Approximately 12 hours of audio-visual data
Four emotion classes: neutral (30.9%), angry (19.9%), happy+excited (29.6%), sad (19.6%)
Total of 5,531 utterances with happy and excited merged
5-fold cross-validation maintaining emotion distribution
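
One way to build folds that keep the emotion distribution is a stratified split, sketched below. The summary does not say how the authors constructed their folds, so this is an illustrative assumption; the function name is hypothetical and the label list is synthetic, sized only to mimic the class proportions above.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            # Deal indices round-robin so every fold gets about 1/k of the class.
            folds[pos % k].append(idx)
    return folds


# Synthetic labels: 1,000 items with roughly the proportions listed above.
labels = ["neutral"] * 309 + ["angry"] * 199 + ["happy"] * 296 + ["sad"] * 196
folds = stratified_folds(labels)
for n, fold in enumerate(folds):
    share = sum(labels[i] == "neutral" for i in fold) / len(fold)
    print(f"fold {n}: {len(fold)} samples, neutral share {share:.3f}")
```

Each fold then serves once as the test set while the remaining four are used for training.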

Evaluation Metrics

Unweighted Accuracy (UA): Average recall rate across all classes
Weighted Accuracy (WA): Number of correctly classified samples divided by total samples
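
The two metrics differ only in how they average, which a short sketch makes visible. The function names and the toy labels below are made up for illustration; the data is deliberately imbalanced so that UA and WA diverge.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """WA: correctly classified samples divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: per-class recall, averaged with equal weight per class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy example: 6 neutral and 2 sad utterances, one sad utterance misclassified.
y_true = ["neutral"] * 6 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["neutral", "sad"]
print(weighted_accuracy(y_true, y_pred))    # 7 of 8 correct
print(unweighted_accuracy(y_true, y_pred))  # mean of recalls 6/6 and 1/2
```

Here a single error on the minority class costs far more in UA than in WA, which is why both metrics are reported on an imbalanced dataset.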

Comparison Methods

Baseline method: Using only softmax cross-entropy loss (λ=0)
Proposed method: Joint softmax cross-entropy loss and center loss

Implementation Details

Optimizer: Adam with learning rate 0.0003
Batch size: 32
Feature dimension: 64 dimensions (FC1 output)
Spectrogram parameters: 10ms hop length, 40ms window length, 16kHz sampling rate, 1024 DFT length
Mel-spectrogram: 128 Mel bands
Maximum utterance length: 14 seconds
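
The spectrogram settings translate into sample counts and input sizes as sketched below. The frame-count formula assumes no padding at the utterance edges, and treating the full one-sided spectrum as the STFT frequency dimension is also an assumption; the variable and function names are hypothetical.

```python
SAMPLE_RATE = 16_000   # Hz
WINDOW_MS = 40
HOP_MS = 10
N_FFT = 1024           # DFT length
MAX_SECONDS = 14

win_length = SAMPLE_RATE * WINDOW_MS // 1000   # samples per analysis window
hop_length = SAMPLE_RATE * HOP_MS // 1000      # samples between frame starts
stft_bins = N_FFT // 2 + 1                     # one-sided spectrum of a real signal
freq_resolution_hz = SAMPLE_RATE / N_FFT       # spacing between DFT bins


def num_frames(num_samples: int) -> int:
    """Frame count ASSUMING no padding: only windows that fit entirely."""
    if num_samples < win_length:
        return 0
    return 1 + (num_samples - win_length) // hop_length


print(win_length, hop_length, stft_bins, freq_resolution_hz)
print(num_frames(MAX_SECONDS * SAMPLE_RATE))
```

Each 640-sample window is zero-padded to the 1,024-point DFT, and under these assumptions a 14-second utterance yields about 1,400 frames, which bounds the time dimension LT of the input.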

Experimental Results

Main Results

Mel-Spectrogram Experimental Results:

Baseline (λ=0): UA=63.80%, WA=61.83%
Proposed method (λ=0.3, α=0.5): UA=66.86%, WA=65.40%
Improvement: UA +3.06, WA +3.57 (absolute percentage points)

STFT Spectrogram Experimental Results:

Baseline (λ=0): UA=60.98%, WA=58.93%
Proposed method (λ=0.3, α=0.5): UA=65.13%, WA=62.96%
Improvement: UA +4.15, WA +4.03 (absolute percentage points)

Hyperparameter Sensitivity Analysis

α parameter: UA and WA are relatively insensitive to α, with stable performance in the range 0.1-0.9
λ parameter: Optimal performance achieved at λ=0.3; both larger and smaller values degrade performance

Visualization Analysis

PCA dimensionality reduction visualization shows:

After introducing center loss, features of the same class cluster more tightly
Separation between different classes improves
Similar improvement patterns observed in both training and test sets

Confusion Matrix Analysis

After introducing center loss, recognition accuracy for each emotion class improves to varying degrees:

Neutral: 57.5%→63.7%
Angry: 69.1%→70.5%
Happy: 51.1%→55.6%
Sad: 77.6%→77.7%

Related Work

Traditional Methods

Statistical methods based on hand-crafted features
Frame-level feature extraction and statistical function application

Deep Learning Methods

DNN combined with extreme learning machines
Bidirectional LSTM for high-level feature representation
End-to-end raw waveform learning
CNN and RNN combined spectrogram learning

Discriminative Learning Methods

Pairwise discriminative tasks: Using cosine similarity loss with binary cross-entropy
Triplet framework: Using triplet loss for learning discriminative features
Advantages of this paper's method compared to these approaches: End-to-end learning without sample pairing

Conclusions and Discussion

Main Conclusions

Center loss effectively reduces intra-class variance and enhances feature discriminability
The joint loss function achieves significant performance improvements on both spectrogram input types
The method can be naturally integrated into existing SER models without requiring additional classifiers

Limitations

Primarily focuses on reducing intra-class variance with limited exploration of increasing inter-class variance
Validation only on the IEMOCAP dataset; generalization requires further verification
For severely imbalanced datasets, weighting strategies may require further optimization

Future Directions

The authors propose exploring additional loss function designs, particularly methods for increasing inter-class variance of features, to further improve SER performance.

In-Depth Evaluation

Strengths

Strong methodological novelty: Successfully transfers center loss from face recognition to speech emotion recognition
Rigorous experimental design: Includes hyperparameter sensitivity analysis, visualization verification, and detailed ablation studies
Convincing results: Consistent performance improvements across two different spectrogram input types
Clear presentation: Technical details are well-described with accurate mathematical formulations

Weaknesses

Single dataset: Validation only on IEMOCAP dataset; lacks cross-dataset generalization verification
Limited comparison methods: Primarily compares against its own baseline; lacks detailed comparison with other state-of-the-art methods
Insufficient theoretical analysis: Lacks in-depth theoretical explanation for why center loss is effective in SER tasks
Missing computational complexity analysis: Does not discuss the impact of introducing center loss on training and inference efficiency

Impact

Technical contribution: Provides a simple and effective feature learning method for speech emotion recognition
Practical value: The method is easy to implement and to integrate into existing systems, which gives it good practical applicability
Reproducibility: Sufficient technical details facilitate reproduction

Applicable Scenarios

Suitable for various spectrogram-based speech emotion recognition tasks
Particularly effective for handling class-imbalanced emotion datasets
Can serve as a performance enhancement module for existing SER systems

References

The paper cites 19 relevant references covering traditional methods, deep learning approaches, and discriminative feature learning in speech emotion recognition, providing sufficient theoretical foundation and technical comparison.

Overall Assessment: This is a technically solid paper with comprehensive experiments that successfully introduces center loss to speech emotion recognition and achieves significant performance improvements. While there is room for improvement in theoretical analysis and cross-dataset validation, its simple and effective approach combined with consistent experimental results provides good academic and practical value.