This paper presents our contributions to the Speech Emotion Recognition in Naturalistic Conditions (SERNC) Challenge, where we address categorical emotion recognition and emotional attribute prediction. To handle the complexities of natural speech, including intra- and inter-subject variability, we propose Multi-level Acoustic-Textual Emotion Representation (MATER), a novel hierarchical framework that integrates acoustic and textual features at the word, utterance, and embedding levels. By fusing low-level lexical and acoustic cues with high-level contextualized representations, MATER effectively captures both fine-grained prosodic variations and semantic nuances. Additionally, we introduce an uncertainty-aware ensemble strategy to mitigate annotator inconsistencies, improving robustness in ambiguous emotional expressions. MATER ranks fourth in both tasks with a Macro-F1 of 41.01% and an average CCC of 0.5928, securing second place in valence prediction with an impressive CCC of 0.6941.
MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition
- Paper ID: 2506.19887
- Title: MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition
- Authors: Hyo Jin Jon, Longbin Jin, Hyuntaek Jung, Hyunseo Kim, Donghun Min, Eun Yi Kim
- Classification: eess.AS cs.AI cs.SD
- Publication Time/Conference: Interspeech 2025
- Paper Link: https://arxiv.org/abs/2506.19887
This paper proposes MATER (Multi-level Acoustic-Textual Emotion Representation), a hierarchical framework for speech emotion recognition in naturalistic conditions. The method integrates acoustic and textual features at three levels (word, utterance, and embedding), fusing low-level lexical and acoustic cues with high-level contextualized representations to capture both fine-grained prosodic variations and semantic nuances. An uncertainty-aware ensemble strategy is also introduced to mitigate annotator inconsistency and improve robustness on ambiguous emotional expressions. MATER ranks fourth in both tasks, with a Macro-F1 of 41.01% and an average CCC of 0.5928, and takes second place in valence prediction with a CCC of 0.6941.
- Complexity of Natural Speech Emotion Recognition: Most existing SER datasets consist of acted or elicited recordings, so they fail to fully capture real-world emotional expressions and models trained on them generalize poorly.
- Within-speaker and Between-speaker Variability: Natural speech exhibits significant individual differences and complexity in emotional expression.
- Annotation Inconsistency Issues: Overlapping, ambiguous, and highly variable emotional expressions result in insufficient annotator consensus, introducing confidence variations and category biases.
Emotion is fundamental to human experience, influencing decision-making, communication, and mental health. Speech, as the most common form of communication, carries rich cues to speaker identity, emotional state, and linguistic stress.
- Most datasets have limited participant numbers, reducing generalization to diverse real-world scenarios
- Lack of effective integration of multi-level features
- Failure to effectively address bias issues caused by annotation inconsistency
- Proposes MATER Framework: A novel hierarchical framework integrating acoustic and textual features at word-level, utterance-level, and embedding-level
- Multi-level Feature Fusion: Systematically models emotion from low-level syntactic and prosodic cues to high-level contextualized representations
- Uncertainty-aware Ensemble Strategy: Improves robustness by selecting emotion predictions with minimum uncertainty, mitigating annotation bias
- Achieves Strong Results in SERNC Challenge: Ranks fourth in both tasks and second in valence prediction
The research addresses two tasks:
- Task 1: Categorical Emotion Recognition: Classifies speech segments into 8 emotion categories (anger, contempt, disgust, fear, happiness, neutral, sadness, surprise)
- Task 2: Emotion Attribute Prediction: Predicts ratings on a 7-point Likert scale for three emotional attributes (arousal, dominance, valence)
MATER extracts acoustic and textual features at three distinct levels:
Word-level Features:
- Syntactic features: Uses BERTweet syntactic parser to extract linguistic patterns, including grammatical person information of pronouns, forming 20-dimensional syntactic feature vectors
- Prosodic features: Extracts 22-dimensional feature vectors using openSMILE library, including loudness, jitter, shimmer, alpha ratio, and voiced/unvoiced segment statistics
- Forms syntactically-aware prosodic representations through concatenation
Utterance-level Features:
- Emotion features: Derived from SEANCE feature set, producing 517-dimensional representations capturing emotional tendencies across entire transcripts
- Rhythm features: Analyzes speech fluency, intensity, and nuances, including loudness, jitter, shimmer, harmonic-to-noise ratio (HNR), pauses, and voiced/unvoiced statistics, forming 34-dimensional feature vectors
Embedding-level Features:
- Audio encoders: WavLM and HuBERT capture rich phonetic and prosodic information
- Text encoders: BERT and T5 provide semantic information representations
- Post-pretraining on MSP-Podcast corpus to enhance domain adaptation
- Word-level: Processed through two-layer LSTM, with final hidden state serving as word-level embedding
- Utterance-level: First passes through Piecewise Linear Embedding (PLE) layer, then through linear layer producing fixed-dimension representations
- Embedding-level: Uses Perceiver architecture for fusion when multiple embedding sources are employed; otherwise directly uses pooled features
- Final Fusion: Concatenated multi-level embeddings input to linear layer for prediction
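The summary does not spell out how the Piecewise Linear Embedding (PLE) layer encodes each utterance-level feature. The following is a minimal sketch, assuming the commonly used piecewise linear encoding for numerical features, in which each scalar is expanded into one value per bin. The bin edges and the function name `piecewise_linear_encode` are hypothetical and chosen only for illustration; in practice the edges would typically be derived from training-data quantiles.

```python
def piecewise_linear_encode(x, edges):
    """Encode scalar x against sorted bin edges b_0 < ... < b_T into T values.

    Bins entirely below x are set to 1, bins entirely above x to 0, and the
    bin containing x holds the fractional position of x inside that bin.
    The first and last bins extrapolate linearly for out-of-range inputs.
    (Assumed formulation; the paper's exact PLE variant may differ.)
    """
    last = len(edges) - 2  # index of the final bin
    encoded = []
    for t in range(len(edges) - 1):
        lo, hi = edges[t], edges[t + 1]
        if x < lo and t > 0:
            encoded.append(0.0)
        elif x >= hi and t < last:
            encoded.append(1.0)
        else:
            encoded.append((x - lo) / (hi - lo))
    return encoded


# Hypothetical bin edges for a single rhythm feature.
edges = [0.0, 1.0, 2.0, 4.0]
print(piecewise_linear_encode(1.5, edges))  # [1.0, 0.5, 0.0]
```

Applying this per feature and concatenating the results gives a vector that a linear layer can then project to the fixed-dimension utterance-level representation described above.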
- Multi-level Feature Modeling: Systematically captures complete emotional information from fine-grained syntactic prosodic cues to high-level semantic representations
- Syntactically-aware Prosodic Representation: Models interaction between linguistic structure and intonation, which plays a key role in emotional expression
- Domain Adaptation Strategy: Continues pretraining the pretrained encoders on the target corpus (MSP-Podcast) before fine-tuning
- Uncertainty-aware Ensemble: Estimates epistemic uncertainty through sorted prediction probabilities, prioritizing high-confidence predictions
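The uncertainty-aware ensemble can be illustrated with a small sketch. The summary only states that uncertainty is estimated from sorted prediction probabilities and that the least uncertain prediction is selected; the margin between the two largest probabilities used here is an assumption, and the function names are hypothetical.

```python
def margin_uncertainty(probs):
    """Uncertainty proxy from sorted probabilities (assumed, not the paper's exact rule).

    A small gap between the top-2 probabilities means the model is unsure.
    """
    ranked = sorted(probs, reverse=True)
    return 1.0 - (ranked[0] - ranked[1])


def uncertainty_aware_vote(model_probs):
    """Return the class predicted by the ensemble member with minimum uncertainty."""
    most_confident = min(model_probs, key=margin_uncertainty)
    return max(range(len(most_confident)), key=most_confident.__getitem__)


# Three hypothetical models scoring one utterance over three classes.
model_probs = [
    [0.40, 0.35, 0.25],  # predicts class 0, but barely
    [0.10, 0.80, 0.10],  # predicts class 1 with a large margin
    [0.45, 0.30, 0.25],  # predicts class 0 with a small margin
]
print(uncertainty_aware_vote(model_probs))  # 1
```

In this toy case majority voting would return class 0, because two hesitant models outvote one confident model, whereas the uncertainty-aware rule follows the confident model.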
Uses MSP-Podcast corpus:
- Training Set: 84,260 samples from 2,112 speakers
- Development Set: 31,961 samples from 714 speakers
- Test Set: 3,200 balanced samples covering 8 emotion categories
- Transcripts and forced alignment generated using Whisper-large-v3
- Task 1: Macro-F1 and Accuracy
- Task 2: Concordance Correlation Coefficient (CCC)
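For reference, both metrics can be computed from their standard definitions. The sketch below is a plain-Python illustration, not the challenge's official scoring script; the toy inputs are made up.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def ccc(x, y):
    """Concordance Correlation Coefficient: 2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
print(ccc([1, 2, 3], [1, 2, 3]))  # perfect agreement
print(ccc([1, 2, 3], [2, 3, 4]))  # perfectly correlated, but offset by 1
```

The last call shows why CCC is stricter than Pearson correlation: predictions that track the labels perfectly but are shifted by a constant score 4/7 rather than 1.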
- WavLM baseline method
- Ablation experiments with various feature combinations
- Comparison of different ensemble strategies
- Word-level and utterance-level features projected to 128-dimensional vectors
- Perceiver produces 768-dimensional output using 64×768 latent array
- Task-specific loss functions: weighted cross-entropy for Task 1, CCC loss for Task 2
- Training for 50 epochs, learning rates from 1×10^-5 to 5×10^-7, batch sizes 128-2048
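As a rough illustration of the Task 1 objective, the sketch below implements a class-weighted cross-entropy in plain Python. The inverse-frequency weighting scheme is an assumption (the summary does not say how the class weights are set), and all function names are hypothetical. The Task 2 objective is commonly implemented as 1 − CCC, so that maximizing agreement minimizes the loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_frequency_weights(targets, num_classes):
    """Assumed weighting: rarer classes receive proportionally larger weights."""
    n = len(targets)
    counts = [targets.count(c) for c in range(num_classes)]
    return [n / (num_classes * k) if k else 0.0 for k in counts]


def weighted_cross_entropy(batch_logits, targets, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    total, weight_sum = 0.0, 0.0
    for logits, y in zip(batch_logits, targets):
        total += -class_weights[y] * math.log(softmax(logits)[y])
        weight_sum += class_weights[y]
    return total / weight_sum


# Toy batch: class 1 is rare, so mistakes on it cost more.
targets = [0, 0, 0, 1]
weights = inverse_frequency_weights(targets, num_classes=2)
logits = [[2.0, 0.0], [1.5, 0.5], [1.0, 0.0], [0.5, 0.2]]
print(weights)
print(weighted_cross_entropy(logits, targets, weights))
```

Normalizing by the sum of the applied weights keeps the loss scale comparable across batches with different class mixes.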
Task 1 (Categorical Emotion Recognition):
- Final submission results: Macro-F1 = 41.01%, Accuracy = 40.97%
- Significant improvement over WavLM baseline (32.93% Macro-F1)
- Ranks fourth in SERNC challenge
Task 2 (Emotion Attribute Prediction):
- Average CCC = 0.5928
- Valence CCC = 0.6941 (second place)
- Arousal CCC = 0.6119
- Dominance CCC = 0.4775
- Feature Level Contribution: Word-level features contribute more than utterance-level features, indicating syntactically-aware prosody is more informative for categorical emotion recognition
- Soft Label Effects: Soft labels help the fine-tuned models but yield only marginal gains in MATER
- Ensemble Strategy Comparison: Uncertainty-aware ensemble outperforms averaging and majority voting strategies
Post-challenge Analysis:
- Acoustic features outperform textual features in both tasks
- Optimal encoders differ across tasks, emphasizing necessity of task-specific encoder selection
- Multimodal fusion in MATER enhances performance at word and utterance levels
- Valence depends more on text, while arousal and dominance depend more on acoustic cues
- Traditional SER Methods: Primarily use acted or elicited datasets
- Natural Speech SER: Emergence of datasets like MSP-Podcast
- Multimodal Emotion Recognition: Fusion of acoustic and textual features
- Uncertainty Handling: Methods for addressing annotation inconsistency
- Systematic multi-level feature modeling
- Novel uncertainty-aware ensemble strategy
- Validation on large-scale natural speech datasets
MATER improves speech emotion recognition in naturalistic conditions through multi-level feature fusion and an uncertainty-aware ensemble, and it performs particularly well on valence prediction.
- Arousal and Dominance Prediction: Remain challenging, possibly because the text-oriented fusion strategy fails to fully exploit acoustic variations
- Computational Complexity: Multi-level feature extraction and Perceiver architecture increase computational overhead
- Domain Adaptation: Primarily validated on podcast data; generalization to other domains remains to be verified
- Emotion-specific Feature Selection: Adopt adaptive feature weighting for different emotion dimensions
- Dynamic Fusion Strategy: Balance audio-text integration through dynamic fusion
- Extension to Diverse Datasets: Verify MATER performance across different SER datasets
- Methodological Innovation: Multi-level feature modeling and uncertainty-aware ensemble demonstrate novelty
- Systematic Design: Complete feature hierarchy from word-level to embedding-level is well-designed
- Experimental Sufficiency: Detailed ablation studies and post-hoc analysis provide deep insights
- Practical Application Value: Validation in large-scale challenge demonstrates method effectiveness
- Insufficient Theoretical Analysis: Lacks theoretical explanation for why multi-level fusion is effective
- Missing Computational Efficiency Analysis: No detailed computational complexity and inference time analysis provided
- Cross-domain Generalization: Validated only on podcast data; lacks cross-domain experiments
- Interpretability: Although the title emphasizes interpretability, the paper provides little analysis of it
- Academic Contribution: Provides new framework perspective for natural speech emotion recognition
- Practical Value: Excellent performance in actual challenge demonstrates practical utility
- Reproducibility: Thorough implementation details facilitate reproduction
- Natural speech emotion recognition systems
- Multimodal emotion analysis applications
- Affective computing tasks that require handling annotation uncertainty
- Podcast, dialogue system, and other natural speech scenarios
The paper cites 68 references covering affective computing, speech processing, deep learning, and related fields, which provide a solid foundation for the research.