The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Ghazal, Caubrière, Vielzeuf
This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
This paper presents a comparative study of context management strategies for end-to-end spoken dialogue state tracking with Speech-LLMs. The authors systematically evaluate three approaches: traditional multimodal context (text history combined with the spoken current turn), full spoken history, and compressed spoken history. Experiments on the SpokenWOZ corpus show that providing the complete spoken dialogue as input achieves the highest performance among models of similar size, significantly surpassing prior methods. Compressing the spoken history with attention pooling offers a strong trade-off, maintaining competitive accuracy with a smaller context.
Dialogue state tracking (DST) is a critical component of task-oriented dialogue systems, responsible for understanding and maintaining context across multiple dialogue turns. However, spoken dialogue state tracking (Spoken DST) remains a relatively immature research field, and current systems lag significantly behind their written-dialogue counterparts.
Error Propagation in Cascade Systems: Traditional methods use cascaded ASR + DST architectures, which are susceptible to errors propagating from the ASR stage, particularly for proper nouns and domain-specific terminology.
Inconsistent Context Management Strategies: Existing end-to-end methods handle context differently, with no consensus on how to integrate spoken and textual information effectively.
Lack of Systematic Comparison: Different context management strategies have not been systematically evaluated and analyzed.
The authors pose a core question: What if we rely entirely on spoken context? Should this be achieved by providing the system with speech representations of the entire dialogue, or through intermediate modules that compress these representations? This research aims to explore these possibilities and provide systematic answers.
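The three context options under comparison can be pictured as three ways of assembling the model input. The following is a minimal sketch, not the authors' implementation: the `Turn` and `Segment` types, the strategy names, the speaker labels, and the `compress` callback are all hypothetical, and keeping the current turn uncompressed in the compressed setting is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    speaker: str         # hypothetical label, e.g. "user" or "agent"
    audio: List[float]   # placeholder for the turn's speech features
    transcript: str      # text of the turn (used only by the multimodal strategy)


@dataclass
class Segment:
    modality: str        # "text" or "audio"
    content: object


def build_context(
    history: List[Turn],
    current: Turn,
    strategy: str,
    compress: Optional[Callable[[List[float]], List[float]]] = None,
) -> List[Segment]:
    """Assemble the input segments for one of three context strategies."""
    if strategy == "multimodal":
        # Text history, spoken current turn.
        segments = [Segment("text", f"{t.speaker}: {t.transcript}") for t in history]
    elif strategy == "full_spoken":
        # Every past turn is passed as speech, unmodified.
        segments = [Segment("audio", t.audio) for t in history]
    elif strategy == "compressed_spoken":
        # Every past turn is passed as a shortened speech representation.
        if compress is None:
            raise ValueError("compressed_spoken requires a compress function")
        segments = [Segment("audio", compress(t.audio)) for t in history]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Assumption: the current turn is always spoken and never compressed.
    segments.append(Segment("audio", current.audio))
    return segments
```

Calling `build_context(history, current, "full_spoken")` returns one audio segment per past turn followed by the current turn, which corresponds to the setting the paper reports as performing best.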
Validates the effectiveness of Speech-LLMs for spoken DST tasks, providing a new technical pathway for the field.
Proposes two context management methods that achieve state-of-the-art performance: full spoken context and compressed spoken context.
Demonstrates a simple yet effective approach: feeding the entire spoken dialogue directly to the model, without additional compression or modality mixing, yields the best performance.
Provides detailed analysis and ablation experiments, verifying that the improvements stem from more effective context utilization.
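To make the compressed spoken context concrete, here is a generic single-head attention pooling sketch in pure Python. It is an illustration under stated assumptions, not the paper's architecture: this summary does not describe the compression module in detail, so the fixed set of `queries` (standing in for learned query vectors) and the scaled dot-product scoring are assumptions. A real system would use a tensor library rather than nested lists.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames: List[Vector], queries: List[Vector]) -> List[Vector]:
    """Compress len(frames) vectors into len(queries) vectors.

    Each query scores every frame with a scaled dot product, and the
    pooled vector is the attention-weighted average of the frames.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    pooled = []
    for query in queries:
        scores = [sum(q * f for q, f in zip(query, frame)) / scale for frame in frames]
        weights = softmax(scores)
        pooled.append(
            [sum(w * frame[j] for w, frame in zip(weights, frames)) for j in range(dim)]
        )
    return pooled
```

With this design the output length is set by the number of queries rather than by the length of the audio, which is what reduces the context size: a turn with hundreds of frames and a turn with a few dozen both shrink to the same number of vectors.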
Given a sequence of spoken dialogue turns $U_1, A_2, \ldots, A_{t-1}, U_t$, the objective is to predict $k$ relevant domains $(\text{domain}_1, \text{domain}_2, \ldots, \text{domain}_k)$ and $n$ slot-value pairs $(\text{slot}_1 = \text{value}_1, \text{slot}_2 = \text{value}_2, \ldots, \text{slot}_n = \text{value}_n)$, represented as a JSON structure.
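As an illustration of the JSON-structured target, the sketch below serializes a dialogue state and parses a model output back into domains and slot-value pairs. The key names `"domains"` and `"slots"` and the example slot name are hypothetical; the exact schema used in the paper is not given in this summary.

```python
import json
from typing import Dict, List, Tuple


def format_dst_target(domains: List[str], slots: Dict[str, str]) -> str:
    """Serialize predicted domains and slot-value pairs as a JSON string."""
    # "domains" and "slots" are assumed key names, chosen for illustration.
    return json.dumps({"domains": list(domains), "slots": dict(slots)}, sort_keys=True)


def parse_dst_output(text: str) -> Tuple[List[str], Dict[str, str]]:
    """Parse generated JSON; fall back to an empty state if it is malformed."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return [], {}
    if not isinstance(obj, dict):
        return [], {}
    return obj.get("domains", []), obj.get("slots", {})


# Hypothetical example state: one domain, one slot-value pair.
example = format_dst_target(["hotel"], {"hotel-area": "north"})
domains, slots = parse_dst_output(example)
```

Falling back to an empty state on malformed output is one possible design choice: a generative model can emit invalid JSON, and an evaluation script needs a defined behavior for that case.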
Clear Problem Definition: Systematic study of context management, a key issue in spoken DST
Strong Method Innovation: First systematic comparison of different context management strategies, proposing a simple yet effective full spoken context approach
Comprehensive Experimental Design: Includes sufficient ablation experiments, fine-grained analysis, and error analysis
Convincing Results: Demonstrates method effectiveness across multiple dimensions with significant performance improvements
Thorough Analysis: Analyzes method advantages from multiple perspectives including slot types and dialogue turns
This paper cites important literature from dialogue state tracking, spoken dialogue systems, Speech-LLM and related fields, particularly:
SpokenWOZ dataset-related work
DSTC challenge series
End-to-end spoken dialogue system research
Speech-LLM model development
Overall Assessment: This is a high-quality research paper that proposes a simple yet effective solution to core problems in spoken dialogue state tracking. The experimental design is comprehensive, analysis is thorough, and it makes important contributions to the field. Despite some limitations, its innovation and practicality provide significant academic and application value.