Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations

Zhang, Li, Yu et al.

Long-sequence causal reasoning seeks to uncover causal relationships within extended time series data but is hindered by complex dependencies and the challenges of validating causal links. To address the limitations of large-scale language models (e.g., GPT-4) in capturing intricate emotional causality within extended dialogues, we propose CauseMotion, a long-sequence emotional causal reasoning framework grounded in Retrieval-Augmented Generation (RAG) and multimodal fusion. Unlike conventional methods relying only on textual information, CauseMotion enriches semantic representations by incorporating audio-derived features (vocal emotion, emotional intensity, and speech rate) into textual modalities. By integrating RAG with a sliding window mechanism, it effectively retrieves and leverages contextually relevant dialogue segments, thus enabling the inference of complex emotional causal chains spanning multiple conversational turns. To evaluate its effectiveness, we constructed the first benchmark dataset dedicated to long-sequence emotional causal reasoning, featuring dialogues with over 70 turns. Experimental results demonstrate that the proposed RAG-based multimodal integrated approach substantially enhances both the depth of emotional understanding and the causal inference capabilities of large-scale language models. A GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy over the original model and surpasses GPT-4o by 1.2%. Additionally, on the publicly available DiaASQ dataset, CauseMotion-GLM-4 achieves state-of-the-art results in accuracy, F1 score, and causal reasoning accuracy.

Basic Information

Paper ID: 2501.00778
Title: Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations
Authors: Yuxuan Zhang, Yulong Li, Zichen Yu, Feilong Tang, Zhixiang Lu, Chong Li, Kang Dang, Jionglong Su
Categories: cs.CL (Computation and Language), cs.CY (Computers and Society)
Publication Date: January 1, 2025
Paper Link: https://arxiv.org/abs/2501.00778

Abstract

This paper proposes CauseMotion, a long-sequence emotional causality reasoning framework based on Retrieval-Augmented Generation (RAG) and multimodal fusion. The framework integrates audio features (vocal emotion, emotional intensity, speech rate) and textual modality, utilizing a sliding window mechanism to retrieve relevant dialogue segments, enabling reasoning over complex emotional causal chains spanning multiple dialogue turns. Experimental results demonstrate that GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy compared to the baseline model and surpasses GPT-4o by 1.2%.

Research Background and Motivation

Problem Definition

Long-sequence causality reasoning aims to discover causal relationships in extended temporal sequence data but is hindered by complex dependencies and challenges in causal chain verification. Existing large language models exhibit significant limitations in capturing complex emotional causal relationships within extended conversations.

Research Significance

Emotional causality reasoning is crucial for intelligent human-computer interaction systems. With the proliferation of social media, emotional expressions have become increasingly complex, involving long text sequences and multimodal information. Understanding the origins, development, and consequences of emotions is essential for building more emotionally intelligent systems.

Limitations of Existing Approaches

Input Length Constraints: Requires truncation or segmentation of text, resulting in loss of global context and hindering the capture of long-range dependencies across segments or dialogue turns
Difficulty in Long-range Dependency Modeling: Struggles to accurately establish global causal associations, leading to incomplete or imprecise reasoning
Segment-based Processing: May disrupt event sequences and logical relationships, weakening the model's understanding of overall causal chains
Multimodal Fusion Challenges: Significant differences between text and audio modalities in feature representation and statistical properties; proprietary nature of closed-source models restricts deep integration of audio features

Core Contributions

Multimodal Fusion Mechanism: Proposes methods for deeply embedding audio features into model input design and dialogue knowledge bases, achieving effective fusion of text and audio data
Large-scale Long-sequence Dataset: Constructs ATLAS-6, the first benchmark dataset specifically designed for long-sequence emotional causality reasoning, with dialogues of 70-300 turns
CauseMotion Framework: Proposes a novel causality reasoning framework integrating RAG, effectively capturing long-range dependencies and complex causal chains
State-of-the-art Performance: Achieves state-of-the-art performance on the DiaASQ dataset; CauseMotion-GLM-4 surpasses GPT-4o on the ATLAS-6 dataset

Methodology Details

Task Definition

Let D = {u₁, u₂, ..., uₙ} be a dialogue of n utterances, where each utterance uᵢ = {wᵢ₁, wᵢ₂, ..., wᵢₘ} consists of m words. The objective is to extract all emotional causality sextuplets Q = {(hⱼ, tⱼ, aⱼ, oⱼ, pⱼ, rⱼ)} from the input temporal window W, where:

hⱼ: Holder (emotion holder)
tⱼ: Target
aⱼ: Aspect
oⱼ: Opinion
pⱼ: Sentiment
rⱼ: Rationale (rationale)
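As an illustration, one sextuplet can be held in a small record type. This is a minimal sketch, not the paper's data format: the class name `Sextuple`, the use of plain strings for every field, and the example values are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sextuple:
    """One emotional causality sextuplet (h, t, a, o, p, r)."""
    holder: str     # h: who holds the emotion
    target: str     # t: what the emotion is directed at
    aspect: str     # a: the aspect of the target being discussed
    opinion: str    # o: the opinion expression
    sentiment: str  # p: the sentiment label
    rationale: str  # r: the stated or inferred reason


# Hypothetical example values, invented for illustration only.
example = Sextuple(
    holder="Speaker A",
    target="phone",
    aspect="battery",
    opinion="drains too fast",
    sentiment="negative",
    rationale="it died during an important call",
)
```

Making the record immutable (`frozen=True`) lets sextuplets be stored in sets, which is convenient when comparing predicted and gold annotations.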

Model Architecture

1. Multimodal Fusion Mechanism

Extracts emotional features from audio using SenseVoice, including:

Vocal emotion eᵢ ∈ ℝᵈ
Emotional intensity θᵢ ∈ ℝ
Speech rate rᵢ = m/(t_end_i - t_start_i)

Audio feature vector defined as:

aᵢ = {eᵢ, θᵢ}

Multimodal embedding achieved through concatenation operation:

Em = Concat(Et, Ee, Er)
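The speech-rate formula and the concatenation step can be sketched as follows. This is a hedged illustration under stated assumptions: embeddings are plain Python lists, the speech rate enters the fused vector as a single scalar (the summary does not say how Er is encoded), and the function names `speech_rate` and `fuse` are hypothetical.

```python
def speech_rate(num_words, t_start, t_end):
    """Words per second for one utterance: r_i = m / (t_end_i - t_start_i)."""
    duration = t_end - t_start
    if duration <= 0:
        raise ValueError("utterance must have a positive duration")
    return num_words / duration


def fuse(text_emb, emotion_emb, rate):
    """Concatenate text embedding, emotion embedding, and speech rate.

    Assumption: the rate is appended as one scalar dimension.
    """
    return list(text_emb) + list(emotion_emb) + [rate]


# Toy inputs: a 3-dim text embedding and a 2-dim emotion embedding.
rate = speech_rate(num_words=12, t_start=3.0, t_end=7.0)
fused = fuse([0.1, 0.2, 0.3], [0.7, 0.1], rate)
```

With these toy inputs the fused vector has 3 + 2 + 1 = 6 dimensions; in practice the dimensions would come from the text encoder and from SenseVoice.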

2. Dialogue Knowledge Base Construction

Employs sliding temporal window method to create local dialogue subsets:

Dₜ = {uₜ, uₜ₊₁, ..., uₜ₊ₖ}

Constructs dialogue knowledge base containing multimodal features:

Kd = {(W1, Em1), (W2, Em2), ..., (Wj, Emj)}
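The window and knowledge-base construction can be sketched as below. Several points are assumptions rather than details from the paper: the windows advance one utterance at a time, the windows written D and W above are treated as the same objects, and `embed` stands in for whatever multimodal encoder produces Em (here a toy function).

```python
def sliding_windows(utterances, k):
    """Return every window [u_t, ..., u_{t+k}] (k + 1 utterances each)."""
    size = k + 1
    if len(utterances) <= size:
        return [list(utterances)]
    return [utterances[t:t + size] for t in range(len(utterances) - size + 1)]


def build_knowledge_base(windows, embed):
    """Pair each window with its embedding: [(W_1, Em_1), (W_2, Em_2), ...]."""
    return [(window, embed(window)) for window in windows]


def toy_embed(window):
    # Hypothetical placeholder encoder: embeds a window as its length.
    return [float(len(window))]


dialogue = ["u1", "u2", "u3", "u4", "u5"]
windows = sliding_windows(dialogue, k=2)
kb = build_knowledge_base(windows, toy_embed)
```

For a five-utterance dialogue and k = 2 this yields three overlapping windows, so every utterance except those at the edges appears in several knowledge-base entries.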

3. RAG Mechanism

RAG module retrieves the most relevant dialogue segments through cosine similarity:

Similarity(Wj, Wi) = (Wj · Wi) / (||Wj|| ||Wi||)

Retrieval process defined as:

Cj = RAG(Wj, Kd)
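A minimal sketch of the retrieval step, assuming each window is already represented by a dense vector and that the module returns the `top_k` most similar entries. The `top_k` parameter, the function names, and the toy knowledge base are assumptions; the summary does not state how many segments are retrieved.

```python
import math


def cosine(a, b):
    """Cosine similarity (a . b) / (||a|| ||b||); 0.0 if either norm is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, knowledge_base, top_k=2):
    """Return the top_k windows whose vectors are most similar to the query."""
    scored = [(cosine(query_vec, vec), window) for window, vec in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [window for _, window in scored[:top_k]]


# Toy knowledge base of (window id, vector) pairs, invented for illustration.
toy_kb = [("w1", [1.0, 0.0]), ("w2", [0.0, 1.0]), ("w3", [1.0, 1.0])]
context = retrieve([1.0, 0.1], toy_kb, top_k=2)
```

A linear scan like this is adequate for a single dialogue of a few hundred turns; a larger knowledge base would typically use an approximate nearest-neighbour index instead.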

Technical Innovations

1. Complex Causal Chain Reasoning

Establishes causal connections based on three scoring metrics:

Semantic Consistency Score:

Semantic Score(ojk, pik) = (ojk · pik) / (||ojk|| ||pik||)

Temporal Constraint Score:

Temporal Score(Δtij) = exp(-Δtij/τ)

Rationale Alignment Score:

Rationale Score(rjk, Qi) = log(1 + PNLI(rjk → Qi))

Final weight calculation:

Weight(eij) = α·Semantic Score + β·Temporal Score + γ·Rationale Score
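The three scores and their weighted sum can be sketched as below. This is an illustration under assumptions: the logarithm is taken as the natural log, the semantic score and the NLI entailment probability are passed in as precomputed numbers, and the example values of α, β, γ, and τ are arbitrary. The constraint α + β + γ = 1 with each weight strictly between 0 and 1 follows the implementation details reported for the paper.

```python
import math


def temporal_score(delta_t, tau):
    """exp(-delta_t / tau): decays as two events move further apart."""
    return math.exp(-delta_t / tau)


def rationale_score(p_nli):
    """log(1 + P_NLI), with P_NLI an entailment probability in [0, 1]."""
    return math.log(1.0 + p_nli)


def edge_weight(semantic, delta_t, p_nli, alpha, beta, gamma, tau):
    """Weighted sum of the semantic, temporal, and rationale scores."""
    if not all(0.0 < w < 1.0 for w in (alpha, beta, gamma)):
        raise ValueError("each weight must lie strictly between 0 and 1")
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("weights must sum to 1")
    return (alpha * semantic
            + beta * temporal_score(delta_t, tau)
            + gamma * rationale_score(p_nli))


# Arbitrary illustrative values, not taken from the paper.
w = edge_weight(semantic=0.8, delta_t=2.0, p_nli=0.9,
                alpha=0.5, beta=0.3, gamma=0.2, tau=2.0)
```

Note that the rationale term is bounded by log 2 (about 0.693) when the probability is at most 1, so the three scores are not on identical scales; the weights have to absorb that difference.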

2. Sliding Window Mechanism

Continuously processes dialogue sequences through sliding windows, effectively alleviating input length constraints while maintaining global contextual information.

Experimental Setup

Datasets

ATLAS-6 dataset comprises two components:

Auxiliary Synthetic Dataset: 20,000 extended dialogue texts (70-300 turns) covering 8 scenarios
Real Validation Dataset: 2,745 long-sequence dialogues sourced from movies and social networks

Each utterance is annotated with six key elements, subject to rigorous manual annotation and cross-checking.

Evaluation Metrics

Causal Correctness = Number of correct causal links / Total predicted causal links
Causal Consistency = Number of consistent causal links / Total causal links
Causal Chain Score = 0.5 × Causal Correctness + 0.5 × Causal Consistency
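The three metrics reduce to simple ratios, sketched below. The summary does not define how a link is judged "consistent", so this sketch takes the counts as inputs rather than deriving them; the function names and the example counts are assumptions.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def causal_chain_score(correct, predicted_total, consistent, total):
    """0.5 * Causal Correctness + 0.5 * Causal Consistency."""
    correctness = ratio(correct, predicted_total)
    consistency = ratio(consistent, total)
    return 0.5 * correctness + 0.5 * consistency


# Hypothetical counts: 6 of 8 predicted links correct, 7 of 10 links consistent.
score = causal_chain_score(correct=6, predicted_total=8, consistent=7, total=10)
```

With these counts, correctness is 0.75 and consistency is 0.70, so the chain score is their unweighted mean, 0.725.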

Baseline Methods

Open-source models: Llama-3.3-70B, Qwen2.5-72B, InternLM2.5-20B
Proprietary models: GLM-4, GPT-4o
Traditional methods: CRF-Extract-Classify, SpERT, DiaASQ, ParaPhrase, Span-ASTE

Implementation Details

Open-source models trained using distributed training on 64 A800 GPUs
Proprietary models accessed through official APIs
Weight parameters α, β, γ satisfy α + β + γ = 1 and 0 < α, β, γ < 1

Experimental Results

Main Results

Performance on DiaASQ Dataset

CauseMotion-GLM-4 significantly outperforms other models across all metrics:

Target span matching F1: 91.43
Aspect span matching F1: 77.63
Opinion extraction F1: 61.35
T-A pair extraction F1: 64.15
T-O pair extraction F1: 50.22
A-O pair extraction F1: 59.16

Performance on ATLAS-6 Dataset

CauseMotion-GLM-4 achieves the highest emotional causality reasoning chain accuracy, 0.574. This is 8.7 points above the original GLM-4 (0.487) and, as the abstract reports, 1.2% above GPT-4o.

Ablation Studies

Ablation experiments demonstrate significant performance degradation when removing the CauseMotion framework:

GLM-4: Drops from 0.574 to 0.487 (-0.087)
Other models exhibit similar performance decline trends

This validates the critical role of the CauseMotion framework in enhancing emotional causality reasoning.

Key Findings

Effectiveness of Multimodal Fusion: Integration of audio features significantly enhances the depth of emotional understanding
Importance of RAG Mechanism: Dynamic retrieval mechanism effectively mitigates challenges in long-sequence processing
Framework Generalizability: CauseMotion effectively improves performance across different base models

Related Work

Development of Sentiment Analysis

Sentiment analysis has evolved from aspect-based sentiment analysis (ABSA) to fine-grained analysis that extracts targets, aspects, opinions, and sentiments from text, but it faces new challenges in processing long text sequences and multimodal information.

Long-sequence Reasoning

Existing research primarily focuses on short texts, lacking modeling capabilities for long-range dependencies and complex multi-layer relationships, limiting understanding of deep emotional causal chains.

Multimodal Fusion

Traditional approaches primarily rely on textual information; this work achieves more comprehensive understanding of emotional expression through integration of audio features.

Conclusions and Discussion

Main Conclusions

The CauseMotion framework effectively addresses challenges in long-sequence emotional causality reasoning through RAG and multimodal fusion
Deep integration of audio features significantly enhances emotional understanding capabilities
The constructed ATLAS-6 dataset provides important foundational resources for the field

Limitations

Current focus primarily on dialogue scenarios; applicability to other text types requires further verification
Audio feature extraction depends on specific pre-trained models (SenseVoice)
High computational complexity may limit practical applications

Future Directions

Extend framework to other domains and text types
Integrate additional modalities (e.g., visual information)
Optimize computational efficiency and model compression

In-depth Evaluation

Strengths

Strong Technical Innovation: First systematic application of RAG technology to long-sequence emotional causality reasoning
Deep Multimodal Fusion: Innovatively embeds audio features into knowledge bases and input design
Significant Dataset Contribution: Constructs the first large-scale long-sequence emotional causality reasoning dataset
Comprehensive Experiments: Conducts thorough evaluation across multiple datasets and models
Significant Performance Improvements: Achieves notable improvements over state-of-the-art methods

Weaknesses

Computational Complexity: Multimodal fusion and RAG mechanisms increase computational overhead
Strong Dependencies: Heavily relies on audio feature extraction models and pre-trained language models
Unknown Generalizability: Primarily validated in dialogue scenarios; applicability to other scenarios requires additional experiments
Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why the approach is effective

Impact

Academic Contribution: Opens new research directions for long-sequence emotional causality reasoning
Practical Value: Holds significant value in applications such as intelligent customer service and sentiment analysis
Reproducibility: Provides anonymized code repository facilitating research reproduction

Applicable Scenarios

Emotional understanding in long dialogue systems
Social media sentiment monitoring
Customer service quality analysis
Mental health assessment systems
Educational dialogue systems

References

The paper cites 34 relevant references covering multiple research domains including sentiment analysis, multimodal fusion, retrieval-augmented generation, and large language models, providing a solid theoretical foundation for this research.

Overall Assessment: This is a high-quality research paper that proposes innovative solutions to the important and challenging task of long-sequence emotional causality reasoning. The paper's technical contributions, experimental design, and results are impressive, making significant contributions to the development of the related field.