Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations

Zhang, Li, Yu et al.
Long-sequence causal reasoning seeks to uncover causal relationships within extended time series data but is hindered by complex dependencies and the challenges of validating causal links. To address the limitations of large-scale language models (e.g., GPT-4) in capturing intricate emotional causality within extended dialogues, we propose CauseMotion, a long-sequence emotional causal reasoning framework grounded in Retrieval-Augmented Generation (RAG) and multimodal fusion. Unlike conventional methods relying only on textual information, CauseMotion enriches semantic representations by incorporating audio-derived features (vocal emotion, emotional intensity, and speech rate) into textual modalities. By integrating RAG with a sliding window mechanism, it effectively retrieves and leverages contextually relevant dialogue segments, thus enabling the inference of complex emotional causal chains spanning multiple conversational turns. To evaluate its effectiveness, we constructed the first benchmark dataset dedicated to long-sequence emotional causal reasoning, featuring dialogues with over 70 turns. Experimental results demonstrate that the proposed RAG-based multimodal integrated approach substantially enhances both the depth of emotional understanding and the causal inference capabilities of large-scale language models. GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy over the original model and surpasses GPT-4o by 1.2%. Additionally, on the publicly available DiaASQ dataset, CauseMotion-GLM-4 achieves state-of-the-art results in accuracy, F1 score, and causal reasoning accuracy.

Basic Information

  • Paper ID: 2501.00778
  • Title: Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations
  • Authors: Yuxuan Zhang, Yulong Li, Zichen Yu, Feilong Tang, Zhixiang Lu, Chong Li, Kang Dang, Jionglong Su
  • Categories: cs.CL (Computation and Language), cs.CY (Computers and Society)
  • Publication Date: January 1, 2025
  • Paper Link: https://arxiv.org/abs/2501.00778

Abstract

This paper proposes CauseMotion, a long-sequence emotional causality reasoning framework based on Retrieval-Augmented Generation (RAG) and multimodal fusion. The framework integrates audio features (vocal emotion, emotional intensity, speech rate) and textual modality, utilizing a sliding window mechanism to retrieve relevant dialogue segments, enabling reasoning over complex emotional causal chains spanning multiple dialogue turns. Experimental results demonstrate that GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy compared to the baseline model and surpasses GPT-4o by 1.2%.

Research Background and Motivation

Problem Definition

Long-sequence causality reasoning aims to discover causal relationships in extended temporal sequence data but is hindered by complex dependencies and challenges in causal chain verification. Existing large language models exhibit significant limitations in capturing complex emotional causal relationships within extended conversations.

Research Significance

Emotional causality reasoning is crucial for intelligent human-computer interaction systems. With the proliferation of social media, emotional expressions have become increasingly complex, involving long text sequences and multimodal information. Understanding the origins, development, and consequences of emotions is essential for building more emotionally intelligent systems.

Limitations of Existing Approaches

  1. Input Length Constraints: Requires truncation or segmentation of text, resulting in loss of global context and hindering the capture of long-range dependencies across segments or dialogue turns
  2. Difficulty in Long-range Dependency Modeling: Struggles to accurately establish global causal associations, leading to incomplete or imprecise reasoning
  3. Segment-based Processing: May disrupt event sequences and logical relationships, weakening the model's understanding of overall causal chains
  4. Multimodal Fusion Challenges: Significant differences between text and audio modalities in feature representation and statistical properties; proprietary nature of closed-source models restricts deep integration of audio features

Core Contributions

  1. Multimodal Fusion Mechanism: Proposes methods for deeply embedding audio features into model input design and dialogue knowledge bases, achieving effective fusion of text and audio data
  2. Large-scale Long-sequence Dataset: Constructs ATLAS-6, the first benchmark dataset specifically designed for long-sequence emotional causality reasoning, containing 70-300 dialogue turns
  3. CauseMotion Framework: Proposes a novel causality reasoning framework integrating RAG, effectively capturing long-range dependencies and complex causal chains
  4. State-of-the-art Performance: Achieves state-of-the-art performance on the DiaASQ dataset; CauseMotion-GLM-4 comprehensively surpasses GPT-4o on the ATLAS-6 dataset

Methodology Details

Task Definition

Let D = {u₁, u₂, ..., uₙ} be a dialogue of n utterances, where each utterance uᵢ = {wᵢ₁, wᵢ₂, ..., wᵢₘ} contains m words. The objective is to extract all emotional causality sextuplets Q = {(hⱼ, tⱼ, aⱼ, oⱼ, pⱼ, rⱼ)} from the input temporal window W, where:

  • hⱼ: Holder (the emotion holder)
  • tⱼ: Target
  • aⱼ: Aspect
  • oⱼ: Opinion
  • pⱼ: Sentiment
  • rⱼ: Rationale
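
As a rough illustration of the output structure, the sextuplet can be represented as a small record type. This is only a sketch: the class name `Sextuplet`, the field types, and the example values are assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sextuplet:
    """One emotional-causality record (h, t, a, o, p, r); hypothetical layout."""
    holder: str      # h: who holds the emotion
    target: str      # t: target
    aspect: str      # a: aspect
    opinion: str     # o: opinion
    sentiment: str   # p: sentiment
    rationale: str   # r: rationale


# Invented example values, used only to show the shape of Q.
example = Sextuplet(
    holder="Speaker A",
    target="the new phone",
    aspect="battery life",
    opinion="drains too fast",
    sentiment="negative",
    rationale="it died during an important call",
)

# Q is the collection of sextuplets extracted from one window W.
Q: List[Sextuplet] = [example]
```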

Model Architecture

1. Multimodal Fusion Mechanism

Extracts emotional features from audio using SenseVoice, including:

  • Vocal emotion eᵢ ∈ ℝᵈ
  • Emotional intensity θᵢ ∈ ℝ
  • Speech rate rᵢ = m / (t_end,i − t_start,i), i.e., the number of words in utterance i divided by its duration

Audio feature vector defined as:

aᵢ = {eᵢ, θᵢ}

Multimodal embedding achieved through concatenation operation:

Em = Concat(Et, Ee, Er)
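
The speech-rate formula and the concatenation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, embeddings are plain Python lists, and treating the speech rate as a one-element vector Er is an assumption, since the summary does not say how Er is derived from rᵢ.

```python
from typing import List, Sequence


def speech_rate(num_words: int, t_start: float, t_end: float) -> float:
    """Words per unit time for one utterance: m / (t_end - t_start)."""
    duration = t_end - t_start
    if duration <= 0:
        raise ValueError("t_end must be later than t_start")
    return num_words / duration


def concat_embeddings(e_text: Sequence[float],
                      e_emotion: Sequence[float],
                      e_rate: Sequence[float]) -> List[float]:
    """Em = Concat(Et, Ee, Er): join the three vectors end to end."""
    return list(e_text) + list(e_emotion) + list(e_rate)


# Toy values: a 12-word utterance spoken between 3.0 s and 7.0 s.
rate = speech_rate(12, 3.0, 7.0)
# Assumption: the rate enters the fused vector as a single scalar feature.
e_m = concat_embeddings([0.1, 0.2], [0.7], [rate])
```

Concatenation keeps each modality's features in separate dimensions, so the fused vector's length is simply the sum of the three input lengths.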

2. Dialogue Knowledge Base Construction

Employs sliding temporal window method to create local dialogue subsets:

Dt = {ut, ut+1, ..., ut+k}

Constructs dialogue knowledge base containing multimodal features:

Kd = {(W1, Em1), (W2, Em2), ..., (Wj, Emj)}
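
A minimal sketch of the windowing and knowledge-base construction described above. It assumes a stride of one utterance between consecutive windows and takes the embedding function as a parameter; `toy_embed` is a placeholder invented here, standing in for whatever multimodal encoder produces Em.

```python
from typing import Callable, List, Sequence, Tuple

Window = List[str]
Vector = List[float]


def sliding_windows(utterances: Sequence[str], k: int) -> List[Window]:
    """Return windows D_t = {u_t, ..., u_{t+k}}, each holding k + 1 utterances."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if not utterances:
        return []
    # Last valid start index; a dialogue shorter than k + 1 yields one window.
    last_start = max(len(utterances) - (k + 1), 0)
    return [list(utterances[t:t + k + 1]) for t in range(last_start + 1)]


def build_knowledge_base(utterances: Sequence[str], k: int,
                         embed: Callable[[Window], Vector]
                         ) -> List[Tuple[Window, Vector]]:
    """Kd = {(W_1, Em_1), ..., (W_j, Em_j)}: pair each window with its embedding."""
    return [(w, embed(w)) for w in sliding_windows(utterances, k)]


def toy_embed(window: Window) -> Vector:
    """Placeholder encoder (assumption): utterance count and total characters."""
    return [float(len(window)), float(sum(len(u) for u in window))]


dialogue = ["u1", "u2", "u3", "u4", "u5"]
windows = sliding_windows(dialogue, k=2)
kb = build_knowledge_base(dialogue, k=2, embed=toy_embed)
```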

3. RAG Mechanism

RAG module retrieves the most relevant dialogue segments through cosine similarity:

Similarity(Wj, Wi) = (Wj · Wi) / (||Wj|| ||Wi||)

Retrieval process defined as:

Cj = RAG(Wj, Kd)
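
The retrieval step can be sketched as a cosine-similarity ranking over the knowledge base. The function names and the `top_k` parameter are assumptions for illustration; the summary states only that the most relevant segments are retrieved by cosine similarity.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """(x . y) / (||x|| ||y||); returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def retrieve(query_vec: Sequence[float],
             knowledge_base: List[Tuple[str, Sequence[float]]],
             top_k: int = 2) -> List[str]:
    """Return the top_k windows whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), window)
              for window, vec in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [window for _, window in scored[:top_k]]


# Toy knowledge base: window labels paired with 2-d embeddings.
toy_kb = [("w1", [1.0, 0.0]), ("w2", [0.0, 1.0]), ("w3", [1.0, 1.0])]
context = retrieve([1.0, 0.1], toy_kb, top_k=2)
```

A linear scan like this is adequate for a single dialogue; a larger knowledge base would normally use a vector index instead.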

Technical Innovations

1. Complex Causal Chain Reasoning

Establishes causal connections based on three scoring metrics:

Semantic Consistency Score:

Semantic Score(ojk, pik) = (ojk · pik) / (||ojk|| ||pik||)

Temporal Constraint Score:

Temporal Score(Δtij) = exp(-Δtij/τ)

Rationale Alignment Score:

Rationale Score(rjk, Qi) = log(1 + P_NLI(rjk → Qi))

Final weight calculation:

Weight(eij) = α·Semantic Score + β·Temporal Score + γ·Rationale Score
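
The three scores and their weighted combination can be sketched as below. Several details are assumptions rather than statements from the paper: the logarithm is taken as the natural log, the semantic score is passed in as a precomputed cosine value, `p_nli` is treated as a probability in [0, 1], and the function names are hypothetical. The weight check mirrors the constraint α + β + γ = 1 with each weight strictly between 0 and 1.

```python
import math


def temporal_score(delta_t: float, tau: float) -> float:
    """exp(-delta_t / tau): decays as two events move further apart in time."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return math.exp(-delta_t / tau)


def rationale_score(p_nli: float) -> float:
    """log(1 + P_NLI), using the natural logarithm (assumption)."""
    return math.log1p(p_nli)


def edge_weight(semantic: float, delta_t: float, p_nli: float,
                alpha: float, beta: float, gamma: float, tau: float) -> float:
    """Weight(e_ij) = alpha*Semantic + beta*Temporal + gamma*Rationale."""
    weights = (alpha, beta, gamma)
    if not all(0.0 < w < 1.0 for w in weights):
        raise ValueError("each weight must lie strictly between 0 and 1")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return (alpha * semantic
            + beta * temporal_score(delta_t, tau)
            + gamma * rationale_score(p_nli))


# Toy inputs: adjacent events (delta_t = 0) with a fully entailed rationale.
w = edge_weight(semantic=0.8, delta_t=0.0, p_nli=1.0,
                alpha=0.5, beta=0.3, gamma=0.2, tau=5.0)
```

With these toy inputs the weight equals 0.5·0.8 + 0.3·1 + 0.2·ln 2, which shows that the rationale term is bounded by ln 2 when P_NLI is a probability.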

2. Sliding Window Mechanism

Continuously processes dialogue sequences through sliding windows, effectively alleviating input length constraints while maintaining global contextual information.

Experimental Setup

Datasets

ATLAS-6 dataset comprises two components:

  1. Auxiliary Synthetic Dataset: 20,000 extended dialogue texts (70-300 turns) covering 8 scenarios
  2. Real Validation Dataset: 2,745 long-sequence dialogues sourced from movies and social networks

Each utterance is annotated with six key elements, subject to rigorous manual annotation and cross-checking.

Evaluation Metrics

  1. Causal Correctness = Number of correct causal links / Total predicted causal links
  2. Causal Consistency = Number of consistent causal links / Total causal links
  3. Causal Chain Score = 0.5 × Causal Correctness + 0.5 × Causal Consistency
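
The three metrics reduce to two ratios and their average. The sketch below works from link counts, because the summary does not specify how a link is judged correct or consistent; the function names and the zero-denominator convention are assumptions.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Safe division; returns 0.0 when there is nothing to count (assumption)."""
    return numerator / denominator if denominator else 0.0


def causal_correctness(num_correct: int, num_predicted: int) -> float:
    """Correct causal links divided by all predicted causal links."""
    return ratio(num_correct, num_predicted)


def causal_consistency(num_consistent: int, num_total: int) -> float:
    """Consistent causal links divided by all causal links."""
    return ratio(num_consistent, num_total)


def causal_chain_score(correctness: float, consistency: float) -> float:
    """Equal-weight average of correctness and consistency."""
    return 0.5 * correctness + 0.5 * consistency


# Toy counts: 6 of 10 predicted links correct, 8 of 10 links consistent.
correctness = causal_correctness(6, 10)
consistency = causal_consistency(8, 10)
chain_score = causal_chain_score(correctness, consistency)
```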

Baseline Methods

  • Open-source models: Llama-3.3-70B, Qwen2.5-72B, InternLM2.5-20B
  • Proprietary models: GLM-4, GPT-4o
  • Traditional methods: CRF-Extract-Classify, SpERT, DiaASQ, ParaPhrase, Span-ASTE

Implementation Details

  • Open-source models trained using distributed training on 64 A800 GPUs
  • Proprietary models accessed through official APIs
  • Weight parameters α, β, γ satisfy α + β + γ = 1 and 0 < α, β, γ < 1

Experimental Results

Main Results

Performance on DiaASQ Dataset

CauseMotion-GLM-4 significantly outperforms other models across all metrics:

  • Target span matching F1: 91.43
  • Aspect span matching F1: 77.63
  • Opinion extraction F1: 61.35
  • T-A pair extraction F1: 64.15
  • T-O pair extraction F1: 50.22
  • A-O pair extraction F1: 59.16

Performance on ATLAS-6 Dataset

CauseMotion-GLM-4 achieves the highest emotional causality reasoning chain accuracy of 0.574, representing an 8.7% improvement over GPT-4o's 0.528.

Ablation Studies

Ablation experiments demonstrate significant performance degradation when removing the CauseMotion framework:

  • GLM-4: Drops from 0.574 to 0.487 (-0.087)
  • Other models exhibit similar performance decline trends

This validates the critical role of the CauseMotion framework in enhancing emotional causality reasoning.
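
The size of the GLM-4 ablation gap can be checked with simple arithmetic on the two scores quoted above. The variable names are illustrative only.

```python
# Scores quoted in the ablation: GLM-4 with and without CauseMotion.
with_framework = 0.574
without_framework = 0.487

# Absolute drop in score when the framework is removed.
absolute_drop = round(with_framework - without_framework, 3)  # 0.087

# The same drop expressed relative to the full system's score, in percent.
relative_drop_pct = round(absolute_drop / with_framework * 100, 1)

print(f"absolute drop: {absolute_drop}")
print(f"relative drop: {relative_drop_pct}%")
```

The absolute difference of 0.087 corresponds to the 8.7-point improvement over the original model reported in the abstract.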

Key Findings

  1. Effectiveness of Multimodal Fusion: Integration of audio features significantly enhances the depth of emotional understanding
  2. Importance of RAG Mechanism: Dynamic retrieval mechanism effectively mitigates challenges in long-sequence processing
  3. Framework Generalizability: CauseMotion effectively improves performance across different base models

Related Work

Development of Sentiment Analysis

Sentiment analysis has evolved from aspect-based sentiment analysis (ABSA) to fine-grained analysis that extracts targets, aspects, opinions, and sentiments from text, but it now faces new challenges in processing long text sequences and multimodal information.

Long-sequence Reasoning

Existing research focuses primarily on short texts and lacks the capacity to model long-range dependencies and complex multi-layer relationships, which limits its understanding of deep emotional causal chains.

Multimodal Fusion

Traditional approaches primarily rely on textual information; this work achieves more comprehensive understanding of emotional expression through integration of audio features.

Conclusions and Discussion

Main Conclusions

  1. The CauseMotion framework effectively addresses challenges in long-sequence emotional causality reasoning through RAG and multimodal fusion
  2. Deep integration of audio features significantly enhances emotional understanding capabilities
  3. The constructed ATLAS-6 dataset provides important foundational resources for the field

Limitations

  1. Current focus primarily on dialogue scenarios; applicability to other text types requires further verification
  2. Audio feature extraction depends on specific pre-trained models (SenseVoice)
  3. High computational complexity may limit practical applications

Future Directions

  1. Extend framework to other domains and text types
  2. Integrate additional modalities (e.g., visual information)
  3. Optimize computational efficiency and model compression

In-depth Evaluation

Strengths

  1. Strong Technical Innovation: First systematic application of RAG technology to long-sequence emotional causality reasoning
  2. Deep Multimodal Fusion: Innovatively embeds audio features into knowledge bases and input design
  3. Significant Dataset Contribution: Constructs the first large-scale long-sequence emotional causality reasoning dataset
  4. Comprehensive Experiments: Conducts thorough evaluation across multiple datasets and models
  5. Significant Performance Improvements: Achieves notable improvements over state-of-the-art methods

Weaknesses

  1. Computational Complexity: Multimodal fusion and RAG mechanisms increase computational overhead
  2. Strong Dependencies: Heavily relies on audio feature extraction models and pre-trained language models
  3. Unknown Generalizability: Primarily validated in dialogue scenarios; applicability to other scenarios requires additional experiments
  4. Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why the approach is effective

Impact

  1. Academic Contribution: Opens new research directions for long-sequence emotional causality reasoning
  2. Practical Value: Holds significant value in applications such as intelligent customer service and sentiment analysis
  3. Reproducibility: Provides anonymized code repository facilitating research reproduction

Applicable Scenarios

  1. Emotional understanding in long dialogue systems
  2. Social media sentiment monitoring
  3. Customer service quality analysis
  4. Mental health assessment systems
  5. Educational dialogue systems

References

The paper cites 34 relevant references covering multiple research domains including sentiment analysis, multimodal fusion, retrieval-augmented generation, and large language models, providing a solid theoretical foundation for this research.


Overall Assessment: This is a high-quality research paper that proposes innovative solutions to the important and challenging task of long-sequence emotional causality reasoning. The paper's technical contributions, experimental design, and results are impressive, making significant contributions to the development of the related field.