
learning discriminative features from spectrograms using center loss for speech emotion recognition

Dai, Wu, Li et al.
Identifying the emotional state from speech is essential for the natural interaction of the machine with the speaker. However, extracting effective features for emotion recognition is difficult, as emotions are ambiguous. We propose a novel approach to learn discriminative features from variable length spectrograms for emotion recognition by cooperating softmax cross-entropy loss and center loss together. The softmax cross-entropy loss enables features from different emotion categories separable, and center loss efficiently pulls the features belonging to the same emotion category to their center. By combining the two losses together, the discriminative power will be highly enhanced, which leads to network learning more effective features for emotion recognition. As demonstrated by the experimental results, after introducing center loss, both the unweighted accuracy and weighted accuracy are improved by over 3% on Mel-spectrogram input, and more than 4% on Short Time Fourier Transform spectrogram input.

Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition

Basic Information

  • Paper ID: 2501.01103
  • Title: Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition
  • Authors: Dongyang Dai, Zhiyong Wu, Runnan Li, Xixin Wu, Jia Jia, Helen Meng
  • Classification: eess.AS (Audio and Speech Processing), cs.AI (Artificial Intelligence), cs.SD (Sound)
  • Publication Date: January 2, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.01103

Abstract

This paper addresses the challenge of feature extraction in speech emotion recognition (SER) caused by the inherent ambiguity of emotions. The authors propose a novel approach that combines softmax cross-entropy loss with center loss to learn discriminative features from variable-length spectrograms. Softmax cross-entropy loss ensures separability of features across different emotion classes, while center loss effectively pulls features of the same emotion class toward their class center. Experimental results demonstrate that introducing center loss improves both unweighted accuracy (UA) and weighted accuracy (WA) by over 3% on Mel-spectrogram inputs and over 4% on Short-Time Fourier Transform (STFT) spectrogram inputs.

Research Background and Motivation

1. Problem Definition

Speech emotion recognition (SER) is a key technology for natural human-computer interaction, requiring the extraction of features from speech waveforms and their classification into corresponding emotion categories. However, the inherent ambiguity of emotions makes effective feature extraction challenging.

2. Problem Significance

  • Speech emotion recognition is crucial for achieving natural human-computer interaction
  • Different emotions are easily confused with one another, which makes extracting effective features harder
  • Traditional methods have limitations in handling emotion ambiguity

3. Limitations of Existing Methods

  • Traditional approaches: Extract frame-level features from overlapping frames and summarize them with statistical functions; the resulting features have limited expressiveness
  • Existing deep learning methods: While utilizing neural networks to extract high-level features, they remain insufficient in handling emotion ambiguity
  • Existing discriminative learning methods: Approaches based on cosine similarity loss or triplet loss employ two-step strategies, which may degrade performance, and their results depend on how sample pairs or triplets are selected

4. Research Motivation

The paper proposes an end-to-end approach that learns discriminative features under the joint supervision of two loss functions (softmax cross-entropy loss + center loss), avoiding the inconsistency problems of two-step strategies.

Core Contributions

  1. Proposes a novel joint loss function approach: Combines softmax cross-entropy loss with center loss for learning discriminative features from variable-length spectrograms
  2. Achieves end-to-end speech emotion recognition: Avoids the two-step strategy problems of existing methods without requiring construction of sample pairs or triplets
  3. Achieves significant performance improvements on the IEMOCAP dataset: Over 3% improvement on Mel-spectrogram inputs and over 4% on STFT spectrogram inputs
  4. Provides detailed visualization analysis: Demonstrates the enhancement of feature discriminability through center loss via PCA embedding visualization

Methodology Details

Task Definition

  • Input: Variable-length spectrograms of size LT × LF, where LT is the time dimension and LF is the frequency dimension
  • Output: Emotion class label (neutral, angry, happy, sad)
  • Objective: Learn discriminative features with small intra-class variance and large inter-class variance

Model Architecture

The model comprises the following components:

  1. CNN Layers: Extract spatial information from spectrograms
    • Layer 1: 48 convolution kernels of size 7×7, stride (2, 2), ReLU activation
    • Layer 2: 64 convolution kernels of size 3×3, stride (1, 1), ReLU activation
    • Layer 3: 80 convolution kernels of size 3×3, stride (1, 1), ReLU activation
    • Layer 4: 96 convolution kernels of size 3×3, stride (1, 1), ReLU activation
    • A max pooling layer (2×2, stride (2, 2)) follows each convolutional layer
  2. Bidirectional RNN Layer (Bi-RNN):
    • Uses 128-dimensional GRU units
    • Compresses variable-length sequences into fixed-length vectors (256 dimensions)
    • Concatenates the final outputs of forward and backward RNNs
  3. Fully Connected Layers:
    • FC1: Projects Bi-RNN output to target feature space (64 dimensions) with PReLU activation
    • FC2: Outputs posterior probabilities for computing softmax cross-entropy loss
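
The layer list above fixes how much the CNN shrinks a spectrogram before the Bi-RNN sees it. The following is a minimal sketch of that bookkeeping, not the authors' code. It assumes "same" padding in every convolution and pooling layer (so each layer maps a size to ceil(size / stride)) and assumes the frequency axis and channels are flattened into one vector per time step; neither detail is stated in this summary.

```python
import math

# Stride of every layer, in order. The padding mode is an ASSUMPTION ("same"),
# so a layer with stride s maps a length n to ceil(n / s).
STRIDES = [
    ("conv1 7x7", 2), ("pool1", 2),
    ("conv2 3x3", 1), ("pool2", 2),
    ("conv3 3x3", 1), ("pool3", 2),
    ("conv4 3x3", 1), ("pool4", 2),
]
LAST_CONV_CHANNELS = 96  # number of kernels in layer 4


def cnn_output_shape(n_frames, n_freq_bins):
    """Return (time steps, frequency bins) after the four conv + pool stages."""
    t, f = n_frames, n_freq_bins
    for _name, stride in STRIDES:
        t = math.ceil(t / stride)
        f = math.ceil(f / stride)
    return t, f


def rnn_input_size(n_freq_bins):
    """Per-time-step vector size if frequency and channels are flattened (assumed)."""
    _, f = cnn_output_shape(1, n_freq_bins)
    return f * LAST_CONV_CHANNELS


# Hypothetical example: 1400 frames of a 128-band Mel-spectrogram.
steps, freq = cnn_output_shape(1400, 128)
print(steps, freq, rnn_input_size(128))
```

Under these assumptions both axes shrink by a factor of 32, and the variable number of remaining time steps is what the Bi-RNN compresses into the fixed 256-dimensional vector (2 × 128 GRU units).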

Loss Function Design

1. Softmax Cross-Entropy Loss

$$L_s = -\frac{1}{\sum_{i=1}^{m} \omega_{y_i}} \sum_{i=1}^{m} \omega_{y_i} \log \frac{e^{W_{y_i}^{T} z_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} z_i + b_j}}$$

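where z_i is the FC1 feature of the i-th sample, y_i is its label, m is the mini-batch size, n is the number of emotion classes, W_j and b_j are the FC2 weights and bias for class j, and ω_j is the weight of class j, used to address class imbalance.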
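
A minimal pure-Python sketch of the weighted softmax cross-entropy above may make the normalization clearer. It takes the FC2 outputs (the values W_j^T z_i + b_j) directly as `logits`; the function name and the example weights are illustrative, not taken from the paper.

```python
import math


def weighted_softmax_ce(logits, labels, class_weights):
    """Class-weighted softmax cross-entropy, normalized by the sum of sample weights.

    logits: one list per sample holding W_j^T z_i + b_j for every class j.
    labels: class index y_i of each sample.
    class_weights: weight omega_j of each class.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for row, y in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        log_prob = row[y] - log_norm  # log softmax probability of the true class
        weighted_sum += class_weights[y] * log_prob
        weight_total += class_weights[y]
    return -weighted_sum / weight_total


# Hypothetical mini-batch with four classes: one uncertain and one confident sample.
batch_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 100.0, 0.0, 0.0]]
batch_labels = [0, 1]
print(weighted_softmax_ce(batch_logits, batch_labels, [3.0, 1.0, 1.0, 1.0]))
```

Because the loss divides by the summed weights of the samples in the batch rather than by m, raising the weight of a class increases its share of the loss without changing the overall scale.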

2. Center Loss

$$L_c = \frac{1}{\sum_{i=1}^{m} \omega_{y_i}} \sum_{i=1}^{m} \omega_{y_i} \lVert z_i - c_{y_i} \rVert^2$$

where c_j is the global center of class j, updated as follows:

$$c_j^{t+1} = \begin{cases} (1-\alpha)\, c_j^{t} + \alpha\, \dot{c}_j^{t}, & \text{if class } j \text{ samples exist in the mini-batch} \\ c_j^{t}, & \text{if no class } j \text{ samples exist in the mini-batch} \end{cases}$$
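
The center loss and the center update can be sketched in a few lines of plain Python. This is an illustration under one explicit assumption: the batch-level term ċ_j^t, which this summary does not define, is taken to be the mean of the class-j features in the current mini-batch. Function names are hypothetical.

```python
def center_loss(features, labels, centers, class_weights):
    """Weighted mean of squared distances between each feature z_i and its class center."""
    num = 0.0
    den = 0.0
    for z, y in zip(features, labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(z, centers[y]))
        num += class_weights[y] * sq_dist
        den += class_weights[y]
    return num / den


def update_centers(features, labels, centers, alpha):
    """Moving-average update of the global centers from one mini-batch.

    ASSUMPTION: the batch-level center is the mean of the class-j features
    present in the mini-batch.
    """
    new_centers = []
    for j, center in enumerate(centers):
        members = [z for z, y in zip(features, labels) if y == j]
        if not members:
            new_centers.append(list(center))  # class absent: keep the old center
            continue
        batch_mean = [sum(col) / len(members) for col in zip(*members)]
        new_centers.append(
            [(1 - alpha) * c + alpha * m for c, m in zip(center, batch_mean)]
        )
    return new_centers


# Hypothetical 2-D features; both samples belong to class 0, class 1 is absent.
feats = [[1.0, 1.0], [3.0, 1.0]]
labs = [0, 0]
cents = [[0.0, 0.0], [5.0, 5.0]]
print(center_loss(feats, labs, cents, [1.0, 1.0]))
print(update_centers(feats, labs, cents, alpha=0.5))
```

In this sketch the centers are plain state updated outside the gradient step, which matches the update rule above: α controls how quickly a center follows the current batch, and a class missing from the batch leaves its center untouched.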

3. Joint Loss

L = L_s + λL_c

where λ is a hyperparameter balancing the two losses.
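
To see how λ shapes training, it may help to write out the gradient of the joint loss with respect to a single feature z_i. This is a short derivation sketch that assumes the centers c_j are treated as constants within a gradient step, which is consistent with the moving-average update described above:

```latex
\frac{\partial L}{\partial z_i}
  = \frac{\partial L_s}{\partial z_i}
  + \lambda \, \frac{\partial L_c}{\partial z_i},
\qquad
\frac{\partial L_c}{\partial z_i}
  = \frac{2\, \omega_{y_i}}{\sum_{k=1}^{m} \omega_{y_k}} \left( z_i - c_{y_i} \right)
```

The second term pulls each feature toward its own class center with a strength proportional to λ and to its class weight. Setting λ = 0 removes the pull and leaves only the softmax cross-entropy term.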

Technical Innovations

  1. End-to-End Learning: Avoids the two-step strategy problems of traditional discriminative learning methods
  2. Natural Integration: Center loss can be naturally integrated into common SER models
  3. No Sample Pairing Required: Eliminates the need for constructing sample pairs or triplets, simplifying the training process
  4. Class Balance Handling: Effectively addresses data imbalance through weighted loss functions

Experimental Setup

Dataset

IEMOCAP Dataset:

  • Approximately 12 hours of audio-visual data
  • Four emotion classes: neutral (30.9%), angry (19.9%), happy+excited (29.6%), sad (19.6%)
  • Total of 5,531 utterances with happy and excited merged
  • 5-fold cross-validation maintaining emotion distribution
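
One simple way to build folds that preserve the emotion distribution is to deal the utterances of each class round-robin across the folds. The sketch below is a generic stratified split, not the authors' procedure; in particular, this summary does not say whether speakers or sessions are kept separate across folds, and the sketch ignores that question.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into n_folds lists with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# Hypothetical toy label list: 10 "neutral" and 5 "sad" utterances.
toy_labels = ["neutral"] * 10 + ["sad"] * 5
for fold in stratified_folds(toy_labels):
    print(sorted(fold))
```

Each fold then serves once as the test set while the remaining four are used for training.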

Evaluation Metrics

  • Unweighted Accuracy (UA): Average recall rate across all classes
  • Weighted Accuracy (WA): Number of correctly classified samples divided by total samples
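
The two metrics differ only in how classes are averaged, which a small sketch makes concrete. The function names and the toy labels are illustrative.

```python
def unweighted_accuracy(y_true, y_pred):
    """UA: mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """WA: correctly classified samples divided by all samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical imbalanced case: a model that always predicts class 0.
truth = [0, 0, 0, 0, 1]
preds = [0, 0, 0, 0, 0]
print(weighted_accuracy(truth, preds), unweighted_accuracy(truth, preds))
```

In the toy case the majority-class predictor scores well on WA but poorly on UA, which is why both numbers are reported for a class-imbalanced dataset such as IEMOCAP.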

Comparison Methods

  • Baseline method: Using only softmax cross-entropy loss (λ=0)
  • Proposed method: Joint softmax cross-entropy loss and center loss

Implementation Details

  • Optimizer: Adam with learning rate 0.0003
  • Batch size: 32
  • Feature dimension: 64 dimensions (FC1 output)
  • Spectrogram parameters: 10ms hop length, 40ms window length, 16kHz sampling rate, 1024 DFT length
  • Mel-spectrogram: 128 Mel bands
  • Maximum utterance length: 14 seconds
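
The spectrogram parameters translate into concrete array sizes. The following back-of-the-envelope sketch assumes no padding at the utterance edges and a one-sided spectrum for the real-valued signal; the exact framing convention used by the authors is not given in this summary.

```python
SAMPLE_RATE = 16_000   # Hz
WINDOW_MS = 40
HOP_MS = 10
N_FFT = 1024           # DFT length; the window is zero-padded up to this size
MAX_SECONDS = 14

win_length = SAMPLE_RATE * WINDOW_MS // 1000   # samples per analysis window
hop_length = SAMPLE_RATE * HOP_MS // 1000      # samples between successive frames
stft_bins = N_FFT // 2 + 1                     # one-sided spectrum of a real signal


def n_frames(n_samples):
    """Number of full frames without edge padding (an assumed convention)."""
    if n_samples < win_length:
        return 0
    return 1 + (n_samples - win_length) // hop_length


max_frames = n_frames(SAMPLE_RATE * MAX_SECONDS)
print(win_length, hop_length, stft_bins, max_frames)
```

Under these assumptions an STFT spectrogram has roughly four times as many frequency bins as the 128-band Mel-spectrogram, while both share the same time axis.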

Experimental Results

Main Results

Mel-Spectrogram Experimental Results:

  • Baseline (λ=0): UA=63.80%, WA=61.83%
  • Proposed method (λ=0.3, α=0.5): UA=66.86%, WA=65.40%
  • Improvement: UA +3.06%, WA +3.57%

STFT Spectrogram Experimental Results:

  • Baseline (λ=0): UA=60.98%, WA=58.93%
  • Proposed method (λ=0.3, α=0.5): UA=65.13%, WA=62.96%
  • Improvement: UA +4.15%, WA +4.03%

Hyperparameter Sensitivity Analysis

  • α parameter: UA and WA are relatively insensitive to α, with stable performance in the range 0.1-0.9
  • λ parameter: Optimal performance achieved at λ=0.3; both larger and smaller values degrade performance

Visualization Analysis

PCA dimensionality reduction visualization shows:

  • After introducing center loss, features of the same class cluster more tightly
  • Separation between different classes improves
  • Similar improvement patterns observed in both training and test sets

Confusion Matrix Analysis

After introducing center loss, recognition accuracy for each emotion class improves to varying degrees:

  • Neutral: 57.5%→63.7%
  • Angry: 69.1%→70.5%
  • Happy: 51.1%→55.6%
  • Sad: 77.6%→77.7%
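
The per-class numbers above can be cross-checked against the reported UA and WA. The sketch below assumes that the confusion matrix refers to the Mel-spectrogram setting and that the test folds follow the dataset's overall class proportions; neither is stated explicitly in this summary.

```python
# Class proportions from the dataset description (happy includes excited).
priors = {"neutral": 0.309, "angry": 0.199, "happy": 0.296, "sad": 0.196}

# Per-class recognition accuracy (recall, in percent) from the confusion matrices.
recall_baseline = {"neutral": 57.5, "angry": 69.1, "happy": 51.1, "sad": 77.6}
recall_center = {"neutral": 63.7, "angry": 70.5, "happy": 55.6, "sad": 77.7}


def ua_from_recalls(recalls):
    """UA is the plain mean of the per-class recalls."""
    return sum(recalls.values()) / len(recalls)


def wa_from_recalls(recalls, class_priors):
    """WA is the prior-weighted mean of the per-class recalls."""
    return sum(class_priors[c] * recalls[c] for c in recalls)


for name, recalls in [("baseline", recall_baseline), ("center loss", recall_center)]:
    print(name, round(ua_from_recalls(recalls), 2), round(wa_from_recalls(recalls, priors), 2))
```

Under these assumptions the recomputed values land within about 0.03 percentage points of the Mel-spectrogram results reported in the Main Results section, which suggests the confusion matrices and the headline numbers describe the same experiment. The check also shows that most of the gain comes from the neutral and happy classes.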

Related Work

Traditional Methods

  • Statistical methods based on hand-crafted features
  • Frame-level feature extraction and statistical function application

Deep Learning Methods

  • DNN combined with extreme learning machines
  • Bidirectional LSTM for high-level feature representation
  • End-to-end raw waveform learning
  • CNN and RNN combined spectrogram learning

Discriminative Learning Methods

  • Pairwise discriminative tasks: Using cosine similarity loss with binary cross-entropy
  • Triplet framework: Using triplet loss for learning discriminative features
  • Advantages of this paper's method compared to these approaches: End-to-end learning without sample pairing

Conclusions and Discussion

Main Conclusions

  1. Center loss effectively reduces intra-class variance and enhances feature discriminability
  2. The joint loss function achieves significant performance improvements on both spectrogram input types
  3. The method can be naturally integrated into existing SER models without requiring additional classifiers

Limitations

  1. Primarily focuses on reducing intra-class variance with limited exploration of increasing inter-class variance
  2. Validation only on the IEMOCAP dataset; generalization requires further verification
  3. For severely imbalanced datasets, weighting strategies may require further optimization

Future Directions

The authors propose exploring additional loss function designs, particularly methods for increasing inter-class variance of features, to further improve SER performance.

In-Depth Evaluation

Strengths

  1. Strong methodological novelty: Successfully transfers center loss from face recognition to speech emotion recognition
  2. Rigorous experimental design: Includes hyperparameter sensitivity analysis, visualization verification, and detailed ablation studies
  3. Convincing results: Consistent performance improvements across two different spectrogram input types
  4. Clear presentation: Technical details are well-described with accurate mathematical formulations

Weaknesses

  1. Single dataset: Validation only on IEMOCAP dataset; lacks cross-dataset generalization verification
  2. Limited comparison methods: Primarily compares against self-baseline; lacks detailed comparison with other state-of-the-art methods
  3. Insufficient theoretical analysis: Lacks in-depth theoretical explanation for why center loss is effective in SER tasks
  4. Missing computational complexity analysis: Does not discuss the impact of introducing center loss on training and inference efficiency

Impact

  1. Technical contribution: Provides a simple and effective feature learning method for speech emotion recognition
  2. Practical value: The method is easy to implement and integrate with good practical applicability
  3. Reproducibility: Sufficient technical details facilitate reproduction

Applicable Scenarios

  1. Suitable for various spectrogram-based speech emotion recognition tasks
  2. Particularly effective for handling class-imbalanced emotion datasets
  3. Can serve as a performance enhancement module for existing SER systems

References

The paper cites 19 relevant references covering traditional methods, deep learning approaches, and discriminative feature learning in speech emotion recognition, providing sufficient theoretical foundation and technical comparison.


Overall Assessment: This is a technically solid paper with comprehensive experiments that successfully introduces center loss to speech emotion recognition and achieves significant performance improvements. While there is room for improvement in theoretical analysis and cross-dataset validation, its simple and effective approach combined with consistent experimental results provides good academic and practical value.