The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

Ghazal, Caubrière, Vielzeuf
This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.

Basic Information

  • Paper ID: 2510.09424
  • Title: The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
  • Authors: Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf (Orange Innovation)
  • Classification: cs.CL cs.AI cs.LG eess.AS
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.09424

Abstract

This paper presents a comparative study of context management strategies based on Speech-LLM for end-to-end spoken dialogue state tracking tasks. The authors systematically evaluate three approaches: traditional multimodal context (combining text history and current spoken turns), full spoken history, and compressed spoken history. Experiments on the SpokenWOZ corpus demonstrate that providing complete spoken dialogue as input achieves the highest performance among models of equivalent scale, significantly surpassing existing methods. Additionally, attention pooling-based spoken history compression provides a strong trade-off solution, maintaining competitive accuracy while reducing context size.

Research Background and Motivation

Problem Definition

Dialogue state tracking (DST) is a critical component of task-oriented dialogue systems, responsible for understanding and maintaining context across multiple dialogue turns. However, spoken dialogue state tracking (Spoken DST) remains a relatively immature research field, with current system performance significantly lagging behind written dialogue scenarios.

Limitations of Existing Methods

  1. Error Propagation in Cascade Systems: Traditional methods employ cascaded ASR + DST architectures, susceptible to error propagation from the ASR stage, particularly when handling proper nouns and domain-specific terminology
  2. Inconsistent Context Management Strategies: Existing end-to-end methods diverge in context processing, with no consensus on effectively integrating spoken and textual information
  3. Lack of Systematic Comparison: Absence of systematic evaluation and analysis of different context management strategies

Research Motivation

The authors pose a core question: What if we rely entirely on spoken context? Should this be achieved by providing the system with speech representations of the entire dialogue, or through intermediate modules that compress these representations? This research aims to explore these possibilities and provide systematic answers.

Core Contributions

  1. Validates the effectiveness of Speech-LLM for spoken DST tasks, providing a new technical pathway for the field
  2. Proposes two context management methods achieving SOTA performance: full spoken context and compressed spoken context
  3. Demonstrates a simple yet effective approach: directly inputting entire spoken dialogue to the model without additional compression or modality mixing yields optimal performance
  4. Provides detailed analysis and ablation experiments, verifying that improvements stem from more effective context utilization

Methodology Details

Task Definition

Given a sequence of spoken dialogue turns $U_1, A_2, \ldots, A_{t-1}, U_{t-1}$, the objective is to predict k relevant domains $(domain_1, domain_2, \ldots, domain_k)$ and n slot-value pairs $(slot_1 = value_1, slot_2 = value_2, \ldots, slot_n = value_n)$, represented as JSON structures.
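
To make the target format concrete, here is a minimal Python sketch of one dialogue state serialized as JSON. The keys "domains" and "predicted state" follow the prompt formats quoted in this summary; the domain, slot names, and values are hypothetical and only illustrate the shape (k = 1 domain, n = 2 slot-value pairs).

```python
import json

# Hypothetical example: domain, slot names, and values are illustrative only.
state = {
    "domains": ["hotel"],            # k = 1 relevant domain
    "predicted state": {             # n = 2 slot-value pairs
        "hotel-area": "north",
        "hotel-stars": "4",
    },
}

target_text = json.dumps(state)      # string form a model would be asked to emit
parsed = json.loads(target_text)     # parsing recovers the structured state

k = len(parsed["domains"])
n = len(parsed["predicted state"])
print(k, n)
```

Emitting the state as JSON means a standard parser can recover the prediction for scoring.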

Model Architecture

The system comprises three main components, plus an optional compression module:

  1. Speech Encoder: Processes entire dialogue history, computing dense representations for each turn
  2. Connector: Maps speech features to LLM input space
  3. Large Language Model (LLM): Generates dialogue state in an autoregressive manner
  4. Compression Module (optional): Reduces context length

Three Context Management Strategies

1. Multimodal Context

  • Input: Spoken user utterance $U^{spoken}_n$ + written dialogue history
  • Prompt Format:
h_n { "history": Context_n, "user last turn": U^{text}_n, 
     "domains": D_n, "predicted state": S_n }
  • Characteristics: Combines spoken current turn and textual history information

2. Full Spoken Context

  • Input: Complete spoken dialogue $Context_n = (U^{spoken}_1, A^{spoken}_2, \ldots, U^{spoken}_n)$
  • Prompt Format:
Speech_Emb {"domains": D_n, "predicted state": S_n}
  • Characteristics: Pure spoken input, avoiding modality conversion loss
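
The two prompt layouts above can be compared side by side with a small Python sketch. It only shows how the text part of each sequence is laid out. The `SPEECH` string is a hypothetical placeholder for the speech embeddings produced by the connector, and the function names and example values are assumptions, not the authors' code.

```python
import json

SPEECH = "<speech>"  # hypothetical placeholder for the connector's output embeddings

def multimodal_sequence(history_text, last_turn_text, domains, state):
    """Strategy 1: speech of the current turn, then written history and state."""
    body = {
        "history": history_text,
        "user last turn": last_turn_text,
        "domains": domains,
        "predicted state": state,
    }
    return SPEECH + " " + json.dumps(body)

def full_spoken_sequence(domains, state):
    """Strategy 2: speech of the whole dialogue, then only domains and state."""
    body = {"domains": domains, "predicted state": state}
    return SPEECH + " " + json.dumps(body)

# Illustrative values only.
mm = multimodal_sequence(
    "user: i need a hotel. agent: which area?",
    "in the north please",
    ["hotel"],
    {"hotel-area": "north"},
)
fs = full_spoken_sequence(["hotel"], {"hotel-area": "north"})
print(len(mm) > len(fs))
```

In this sketch the full spoken variant carries no written history at all, so everything the model knows about earlier turns has to come from the speech embeddings.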

3. Compressed Spoken Context

  • Compression Mechanism: Uses $N_{queries}$ trainable query vectors Q, computed via TransformerDecoder:
z_i = TransformerDecoder(Q, h_i)
Speech_Emb = (z_1||z_2||...||z_n)
  • Characteristics: Significantly reduces context length while maintaining performance
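
The pooling idea can be sketched in plain Python. This is a deliberately simplified stand-in: a single attention head with no learned projections, feed-forward block, or normalization, whereas the module described above is a Transformer decoder layer. The dimensions, random inputs, and function names are assumptions chosen for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(queries, frames):
    """Pool one turn's frames (T x d) into len(queries) vectors (N x d)."""
    d = len(frames[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between the query and every frame.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in frames]
        weights = softmax(scores)
        # Weighted average of the frames.
        pooled.append([sum(w * f[j] for w, f in zip(weights, frames)) for j in range(d)])
    return pooled

def compress_dialogue(queries, turns):
    """Concatenate pooled vectors of all turns: Speech_Emb = (z_1 || ... || z_n)."""
    speech_emb = []
    for frames in turns:
        speech_emb.extend(attention_pool(queries, frames))
    return speech_emb

random.seed(0)
d, n_queries = 8, 10
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_queries)]
# Three turns with 50, 120, and 80 frames (illustrative lengths).
turns = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(t)] for t in (50, 120, 80)]
speech_emb = compress_dialogue(queries, turns)
print(len(speech_emb), "vectors instead of", sum(len(t) for t in turns))
# prints: 30 vectors instead of 250
```

Each turn contributes exactly N_queries vectors regardless of its duration, so the context length grows with the number of turns rather than with the amount of audio.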

Training Strategy

Employs two-stage training:

  1. ASR Pre-training: Freezes LLM, trains speech encoder and connector to align speech-text modalities
  2. DST Fine-tuning: Freezes speech encoder, trains connector, compression module, and LLM LoRA adapters
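
The two stages can be summarized as a table of which modules receive gradient updates. The following Python sketch uses hypothetical module labels. It also assumes that anything not listed as trained in a stage, such as the compression module and LoRA adapters during ASR pre-training, is simply not updated there.

```python
# Module names are hypothetical labels for the components described above.
MODULES = ("speech_encoder", "connector", "compression_module", "llm_base", "llm_lora")

# Trainable modules per stage, as listed in the summary.
TRAINABLE = {
    "asr_pretraining": {"speech_encoder", "connector"},
    "dst_finetuning": {"connector", "compression_module", "llm_lora"},
}

def frozen_modules(stage):
    """Modules that receive no gradient updates in the given stage."""
    return [m for m in MODULES if m not in TRAINABLE[stage]]

for stage in TRAINABLE:
    print(stage, "-> frozen:", frozen_modules(stage))
```

The connector is the only module trained in both stages. The base LLM weights stay frozen throughout, and only the LoRA adapters adapt the LLM to the DST task.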

Experimental Setup

Datasets

  • ASR Pre-training: Loquacious Medium (2,500 hours) + Fisher Corpus (1,960 hours) + SpokenWOZ training set (200 hours)
  • DST Fine-tuning: SpokenWOZ dataset, removing 9 corrupted dialogues, evaluated using Joint Goal Accuracy (JGA)

Model Configuration

  • Speech Encoder: W2v-BERT
  • Connector: Single-layer Transformer encoder (hidden dimension 1024, 16 attention heads)
  • Compression Module: Single-layer Transformer decoder (same configuration)
  • LLM: OLMo 2 1B, using LoRA adapters (rank=16, alpha=1)
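
For readers unfamiliar with LoRA, the listed hyperparameters can be read through the standard LoRA update rule. This assumes the common scaling convention of alpha divided by rank; the paper's implementation may use a different convention. Here W is a frozen LLM weight matrix and B, A are the trainable low-rank factors.

```latex
W' = W + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},
\qquad \frac{\alpha}{r} = \frac{1}{16} = 0.0625
```

Under that assumption, rank 16 with alpha 1 scales the low-rank update by 0.0625.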

Evaluation Metrics

Primarily uses Joint Goal Accuracy (JGA), with post-processing including temporal expression normalization and fuzzy matching.
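
A minimal sketch of JGA with fuzzy value matching follows, using Python's standard `difflib`. The 0.9 threshold, the lowercasing, and the slot names are assumptions made for illustration. The paper's exact matching rule and its temporal normalization are not reproduced here.

```python
from difflib import SequenceMatcher

def values_match(pred, gold, threshold=0.9):
    """Fuzzy match; the 0.9 threshold is an assumption, not the paper's value."""
    a, b = pred.strip().lower(), gold.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def turn_is_correct(pred_state, gold_state):
    """A turn counts only if every slot is present and every value matches."""
    if pred_state.keys() != gold_state.keys():
        return False
    return all(values_match(pred_state[s], gold_state[s]) for s in gold_state)

def joint_goal_accuracy(preds, golds):
    """Fraction of turns whose full predicted state is correct."""
    correct = sum(turn_is_correct(p, g) for p, g in zip(preds, golds))
    return correct / len(golds)

# Hypothetical three-turn dialogue.
golds = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-name": "acorn guest house"},
    {"hotel-area": "north", "hotel-name": "acorn guest house", "profile-name": "john smith"},
]
preds = [
    {"hotel-area": "North"},                                      # case difference only
    {"hotel-area": "north", "hotel-name": "a corn guest house"},  # near-miss spelling
    {"hotel-area": "north", "hotel-name": "acorn guest house"},   # missing slot
]
print(joint_goal_accuracy(preds, golds))
```

Because JGA scores the whole state at every turn, a single missing or wrong slot makes that turn count as incorrect. This is why the third turn fails here even though two of its three slots are right.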

Experimental Results

Main Results

Model                                    SWOZ Test JGA
SPACE+WavLM_align                        25.65%
E2E (Whisper+T5)                         24.10%
UBAR + GenWOZ                            25.90%
WavLM + conn. + OLMo-1B                  34.66%
Compressed Spoken Context (This Work)    36.49%
Full Spoken Context (This Work)          39.32%
WavLM + conn. + Gemma-2-9B               42.17%

Context Management Method Comparison

Method                                   SWOZ Dev JGA   SWOZ Test JGA
Multimodal Context (Baseline)            31.85%         32.06%
Full Spoken Context                      36.89%         36.29%
Compressed Spoken Context (1 query)      31.03%         30.99%
Compressed Spoken Context (10 queries)   34.26%         33.51%
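
The gaps between strategies are easier to read as differences from the multimodal baseline. This short Python sketch computes them from the test-set figures in the comparison above; the dictionary keys are just labels chosen here.

```python
# Test-set JGA (%) copied from the context management comparison.
test_jga = {
    "multimodal": 32.06,
    "full_spoken": 36.29,
    "compressed_1_query": 30.99,
    "compressed_10_queries": 33.51,
}

baseline = test_jga["multimodal"]
deltas = {name: round(jga - baseline, 2) for name, jga in test_jga.items()}

for name, delta in deltas.items():
    print(f"{name}: {test_jga[name]:.2f}% ({delta:+.2f} points vs. multimodal)")
```

On the test set, full spoken context is 4.23 points above the multimodal baseline and 10-query compression is 1.45 points above it. A single query falls 1.07 points below.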

Fine-grained Analysis

Slot Type Analysis

  • Categorical Slots: All models perform well, with full spoken context showing slight advantages
  • Temporal and Open Slots: Full spoken context and 10-query compression significantly outperform other methods
  • Personal Information Slots: Most challenging category, with full spoken context leading and 1-query model performing worst

Dialogue Turn Analysis

  • Early Turns (1-5): All models perform well
  • Mid Turns (5-30): Accuracy drops rapidly, with full spoken context consistently leading
  • Late Turns (40+): Accuracy approaches zero, limited by small LLM capacity

Error Analysis

Analysis of six slots with highest error rates reveals:

  • Most predictions achieve high fuzzy ratios (>0.8), indicating models typically approximate correct slot values
  • Errors in restaurant names, attraction names, and hotel names primarily stem from insertions and deletions rather than substitutions
  • Personal information-related slots remain extremely challenging
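
The kind of analysis described above can be approximated with Python's standard `difflib`. The sketch computes a similarity ratio and counts character-level insertions, deletions, and substitutions between a gold value and a prediction. `difflib`'s ratio is a stand-in for whatever fuzzy ratio the authors used, and the example strings are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def edit_profile(gold, pred):
    """Return (similarity ratio, character-level edit counts) for one slot value."""
    matcher = SequenceMatcher(None, gold, pred)
    counts = Counter()
    for tag, g1, g2, p1, p2 in matcher.get_opcodes():
        if tag == "insert":        # prediction has extra characters
            counts["insertion"] += p2 - p1
        elif tag == "delete":      # prediction is missing characters
            counts["deletion"] += g2 - g1
        elif tag == "replace":     # characters swapped for others
            counts["substitution"] += max(g2 - g1, p2 - p1)
    return matcher.ratio(), counts

# Hypothetical hotel name with a dropped space in the prediction.
ratio, counts = edit_profile("acorn guest house", "acorn guesthouse")
print(round(ratio, 2), dict(counts))
```

Here the prediction is scored as a near miss: the ratio is above 0.8 and the only edit is a single deletion, which mirrors the pattern reported for name slots.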

Related Work

Traditional Methods

  • Cascade Systems: Pipeline approaches combining ASR + DST, performing well in DSTC11 challenges
  • End-to-End Systems: Direct speech-to-dialogue-state mapping, avoiding error propagation

Speech-LLM Development

  • Speech-aware large language models demonstrate potential in tasks such as ASR and response generation
  • Recent work applies Speech-LLM to spoken DST, achieving SOTA performance

Context Management Strategies

Existing methods diverge in context processing; this paper provides the first systematic comparison of different strategy effects.

Conclusions and Discussion

Main Conclusions

  1. Full Spoken Context Strategy is Most Effective: Directly using entire spoken dialogue as input achieves optimal performance
  2. Compression Strategy Provides Good Trade-off: 10-query compression significantly reduces context size while maintaining competitive performance
  3. Speech-LLM Performs Excellently on Spoken DST Tasks: Provides a new technical pathway for the field

Limitations

  1. Computational Complexity: Full spoken context method may incur high computational overhead for very long dialogues
  2. Model Scale Constraints: Not validated on larger-scale LLMs (e.g., Gemma-2-9B)
  3. Dataset Limitations: Primarily validated on SpokenWOZ; generalization requires validation on additional datasets

Future Directions

  1. Explore more sophisticated and compact spoken context processing methods
  2. Extend to larger-scale models
  3. Validate on additional spoken dialogue datasets

In-Depth Evaluation

Strengths

  1. Clear Problem Definition: Systematic study of context management, a key issue in spoken DST
  2. Strong Method Innovation: First systematic comparison of different context management strategies, proposing a simple yet effective full spoken context approach
  3. Comprehensive Experimental Design: Includes sufficient ablation experiments, fine-grained analysis, and error analysis
  4. Convincing Results: Demonstrates method effectiveness across multiple dimensions with significant performance improvements
  5. Thorough Analysis: Analyzes method advantages from multiple perspectives including slot types and dialogue turns

Weaknesses

  1. Insufficient Computational Efficiency Analysis: Lacks detailed analysis of computational complexity and inference time for different methods
  2. Missing Large Model Validation: Does not verify method scalability on larger-scale LLMs
  3. Limited Cross-dataset Generalization: Primarily validated on single dataset; generalization requires further verification
  4. Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why full spoken context is more effective

Impact

  1. Academic Value: Provides new research perspectives and benchmark methods for the spoken DST field
  2. Practical Value: Simple and effective method, easy to reproduce and apply
  3. Technical Contribution: Demonstrates potential of Speech-LLM in spoken understanding tasks

Applicable Scenarios

  1. Task-Oriented Dialogue Systems: Particularly suitable for spoken dialogue systems requiring accurate state tracking
  2. Multi-turn Dialogue Understanding: Applicable to scenarios requiring long-term context understanding
  3. Resource-Constrained Environments: Relatively small model scale makes it suitable for deployment in resource-limited settings

References

This paper cites important literature from dialogue state tracking, spoken dialogue systems, Speech-LLM and related fields, particularly:

  • SpokenWOZ dataset-related work
  • DSTC challenge series
  • End-to-end spoken dialogue system research
  • Speech-LLM model development

Overall Assessment: This is a high-quality research paper that proposes a simple yet effective solution to core problems in spoken dialogue state tracking. The experimental design is comprehensive, analysis is thorough, and it makes important contributions to the field. Despite some limitations, its innovation and practicality provide significant academic and application value.