The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

Ghazal, Caubrière, Vielzeuf

This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.

Basic Information

Paper ID: 2510.09424
Title: The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Authors: Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf (Orange Innovation)
Classification: cs.CL cs.AI cs.LG eess.AS
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09424

Abstract

This paper presents a comparative study of context management strategies based on Speech-LLM for end-to-end spoken dialogue state tracking tasks. The authors systematically evaluate three approaches: traditional multimodal context (combining text history and current spoken turns), full spoken history, and compressed spoken history. Experiments on the SpokenWOZ corpus demonstrate that providing complete spoken dialogue as input achieves the highest performance among models of equivalent scale, significantly surpassing existing methods. Additionally, attention pooling-based spoken history compression provides a strong trade-off solution, maintaining competitive accuracy while reducing context size.

Research Background and Motivation

Problem Definition

Dialogue state tracking (DST) is a critical component of task-oriented dialogue systems, responsible for understanding and maintaining context across multiple dialogue turns. However, spoken dialogue state tracking (Spoken DST) remains a relatively immature research field, with current system performance significantly lagging behind written dialogue scenarios.

Limitations of Existing Methods

Error Propagation in Cascade Systems: Traditional methods employ cascaded ASR + DST architectures, susceptible to error propagation from the ASR stage, particularly when handling proper nouns and domain-specific terminology
Inconsistent Context Management Strategies: Existing end-to-end methods diverge in context processing, with no consensus on effectively integrating spoken and textual information
Lack of Systematic Comparison: Absence of systematic evaluation and analysis of different context management strategies

Research Motivation

The authors pose a core question: What if we rely entirely on spoken context? Should this be achieved by providing the system with speech representations of the entire dialogue, or through intermediate modules that compress these representations? This research aims to explore these possibilities and provide systematic answers.

Core Contributions

Validates the effectiveness of Speech-LLM for spoken DST tasks, providing a new technical pathway for the field
Proposes two context management methods, full spoken context and compressed spoken context, that achieve state-of-the-art performance among models of similar size
Demonstrates a simple yet effective approach: directly inputting entire spoken dialogue to the model without additional compression or modality mixing yields optimal performance
Provides detailed analysis and ablation experiments, verifying that improvements stem from more effective context utilization

Methodology Details

Task Definition

Given a sequence of spoken dialogue turns $U_1, A_2, \ldots, A_{t-1}, U_t$, where $U$ denotes user turns and $A$ agent turns, the objective is to predict the $k$ relevant domains $(domain_1, domain_2, \ldots, domain_k)$ and $n$ slot-value pairs $(slot_1 = value_1, slot_2 = value_2, \ldots, slot_n = value_n)$, represented as a JSON structure.
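
As a hedged illustration of this target format, the sketch below serializes and parses a dialogue state as JSON. The key names "domains" and "predicted state" follow the prompt formats quoted later in this summary; the helper names, the domain, and the slot names are hypothetical examples, not taken from the paper.

```python
import json


def serialize_state(domains, slots):
    """Serialize a dialogue state as a JSON string (key names are assumed)."""
    return json.dumps(
        {"domains": list(domains), "predicted state": dict(slots)},
        sort_keys=True,
    )


def parse_state(text):
    """Parse a generated JSON string back into (domains, slots)."""
    obj = json.loads(text)
    return obj["domains"], obj["predicted state"]


# Hypothetical state with k = 1 domain and n = 2 slot-value pairs.
state_json = serialize_state(["hotel"], {"hotel-area": "north", "hotel-stars": "4"})
domains, slots = parse_state(state_json)
```

Keeping the state as a flat mapping from slot name to value makes the later exact-match comparison against the reference state straightforward.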

Model Architecture

The system comprises three main components plus an optional compression module:

Speech Encoder: Processes entire dialogue history, computing dense representations for each turn
Connector: Maps speech features to LLM input space
Large Language Model (LLM): Generates dialogue state in an autoregressive manner
Compression Module (optional): Reduces context length

Three Context Management Strategies

1. Multimodal Context

Input: Spoken user utterance $U^{spoken}_n$ + written dialogue history
Prompt Format:

h_n { "history": Context_n, "user last turn": U^{text}_n, 
     "domains": D_n, "predicted state": S_n }

Characteristics: Combines spoken current turn and textual history information

2. Full Spoken Context

Input: Complete spoken dialogue $Context_n = (U^{spoken}_1, A^{spoken}_2, ..., U^{spoken}_n)$
Prompt Format:

Speech_Emb {"domains": D_n, "predicted state": S_n}

Characteristics: Pure spoken input, avoiding modality conversion loss

3. Compressed Spoken Context

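Compression Mechanism: Uses $N_{queries}$ trainable query vectors $Q$ that attend to each turn's speech representation $h_i$ through a TransformerDecoder, producing a fixed number of vectors $z_i$ per turn, which are then concatenated across turns: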

z_i = TransformerDecoder(Q, h_i)
Speech_Emb = (z_1||z_2||...||z_n)

Characteristics: Significantly reduces context length while maintaining performance
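
The compression step can be sketched roughly as follows. This is a simplified, single-head dot-product attention pooling in plain Python, not the paper's TransformerDecoder layer: a real decoder layer would also include learned projections, self-attention over the queries, a feed-forward block, residual connections, and layer normalization. The dimensions, frame counts, and random values are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(queries, frames):
    """Compress `frames` (T x d) into len(queries) vectors via dot-product attention."""
    dim = len(queries[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim) for f in frames]
        weights = softmax(scores)
        pooled.append([sum(w * f[j] for w, f in zip(weights, frames)) for j in range(dim)])
    return pooled


def compress_dialogue(queries, turns):
    """Pool each turn separately, then concatenate: Speech_Emb = (z_1 || ... || z_n)."""
    speech_emb = []
    for frames in turns:
        speech_emb.extend(attention_pool(queries, frames))
    return speech_emb


random.seed(0)
dim, n_queries = 8, 10
# The queries would be trainable parameters in practice; random here.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]
turn_lengths = [40, 75, 120]  # hypothetical frame counts per turn
turns = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(t)] for t in turn_lengths]

compressed = compress_dialogue(queries, turns)
full_length = sum(turn_lengths)  # sequence length without compression
```

In this toy setting the full spoken context would occupy 235 positions, while the compressed context occupies 30 (10 queries for each of 3 turns), regardless of how long each turn is.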

Training Strategy

Employs two-stage training:

ASR Pre-training: Freezes LLM, trains speech encoder and connector to align speech-text modalities
DST Fine-tuning: Freezes speech encoder, trains connector, compression module, and LLM LoRA adapters
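
The two stages can be summarized as a table of which modules receive gradient updates. The sketch below encodes only what is stated above; the module names are hypothetical labels, and treating the compression module as untrained during ASR pre-training and the LLM base weights as frozen during DST fine-tuning are assumptions drawn from the description rather than confirmed details.

```python
# True means the module is trained in that stage, False means frozen or unused.
STAGES = {
    "asr_pretraining": {
        "speech_encoder": True,
        "connector": True,
        "compression_module": False,  # assumption: not part of this stage
        "llm_base": False,
        "llm_lora_adapters": False,
    },
    "dst_finetuning": {
        "speech_encoder": False,
        "connector": True,
        "compression_module": True,
        "llm_base": False,  # assumption: only the LoRA adapters are updated
        "llm_lora_adapters": True,
    },
}


def trainable_modules(stage):
    """Return the sorted names of modules that are trained in `stage`."""
    return sorted(name for name, trained in STAGES[stage].items() if trained)


for stage in STAGES:
    print(stage, "->", ", ".join(trainable_modules(stage)))
```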

Experimental Setup

Datasets

ASR Pre-training: Loquacious Medium (2,500 hours) + Fisher Corpus (1,960 hours) + SpokenWOZ training set (200 hours)
DST Fine-tuning: SpokenWOZ dataset, removing 9 corrupted dialogues, evaluated using Joint Goal Accuracy (JGA)

Model Configuration

Speech Encoder: W2v-BERT
Connector: Single-layer Transformer encoder (hidden dimension 1024, 16 attention heads)
Compression Module: Single-layer Transformer decoder (same configuration)
LLM: OLMo 2 1B, using LoRA adapters (rank=16, alpha=1)
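
For reference, the configuration above can be collected in one place. This is only a bookkeeping sketch: the class and field names are hypothetical, and the values are copied from the list above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransformerLayerConfig:
    """Shared shape of the connector and the compression module."""
    num_layers: int = 1
    hidden_dim: int = 1024
    num_heads: int = 16

    @property
    def head_dim(self):
        # Per-head width, assuming the hidden dimension is split evenly.
        return self.hidden_dim // self.num_heads


@dataclass(frozen=True)
class ModelConfig:
    speech_encoder: str = "W2v-BERT"
    llm: str = "OLMo 2 1B"
    lora_rank: int = 16
    lora_alpha: int = 1
    connector: TransformerLayerConfig = field(default_factory=TransformerLayerConfig)
    compression: TransformerLayerConfig = field(default_factory=TransformerLayerConfig)


config = ModelConfig()
```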

Evaluation Metrics

Primarily uses Joint Goal Accuracy (JGA), with post-processing including temporal expression normalization and fuzzy matching.
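
A minimal sketch of JGA with fuzzy value matching is shown below. It assumes each dialogue state is a flat mapping from slot name to string value, uses Python's difflib as a stand-in for whatever fuzzy matcher the authors used, and picks a threshold of 0.9 arbitrarily. Temporal expression normalization is omitted.

```python
from difflib import SequenceMatcher


def values_match(pred, gold, threshold=0.9):
    """Case-insensitive fuzzy match; the threshold is an assumed value."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    return SequenceMatcher(None, pred, gold).ratio() >= threshold


def turn_is_correct(pred_state, gold_state, threshold=0.9):
    """A turn counts only if the slot sets agree and every value matches."""
    if set(pred_state) != set(gold_state):
        return False
    return all(values_match(pred_state[s], gold_state[s], threshold) for s in gold_state)


def joint_goal_accuracy(preds, golds, threshold=0.9):
    """Fraction of turns whose full predicted state matches the reference."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and of equal length")
    correct = sum(turn_is_correct(p, g, threshold) for p, g in zip(preds, golds))
    return correct / len(golds)


# Hypothetical two-turn example: the second prediction misses a slot.
golds = [
    {"hotel-name": "acorn guest house"},
    {"hotel-name": "acorn guest house", "hotel-stars": "4"},
]
preds = [
    {"hotel-name": "Acorn Guest House"},
    {"hotel-name": "acorn guest house"},
]
jga = joint_goal_accuracy(preds, golds)
```

Because the metric is joint, a single missing or wrong slot makes the whole turn incorrect, which is why the example scores one turn out of two.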

Experimental Results

Main Results

Model	SWOZ Test JGA
SPACE+WavLMalign	25.65%
E2E (Whisper+T5)	24.10%
UBAR + GenWOZ	25.90%
WavLM + conn. + OLMo-1B	34.66%
Compressed Spoken Context (This Work)	36.49%
Full Spoken Context (This Work)	39.32%
WavLM + conn. + Gemma-2-9B	42.17%

Context Management Method Comparison

Method	SWOZ Dev	SWOZ Test
Multimodal Context (Baseline)	31.85%	32.06%
Full Spoken Context	36.89%	36.29%
Compressed Spoken Context (1 query)	31.03%	30.99%
Compressed Spoken Context (10 queries)	34.26%	33.51%
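
The gaps in this table can be made explicit with a few lines of arithmetic. The scores below are copied from the test column above; the function name is hypothetical.

```python
# Test-set JGA (%) copied from the comparison table above.
test_jga = {
    "Multimodal Context (Baseline)": 32.06,
    "Full Spoken Context": 36.29,
    "Compressed Spoken Context (1 query)": 30.99,
    "Compressed Spoken Context (10 queries)": 33.51,
}


def gain_over_baseline(score, baseline):
    """Return (absolute gain in points, relative gain in percent)."""
    delta = score - baseline
    return delta, 100.0 * delta / baseline


baseline = test_jga["Multimodal Context (Baseline)"]
for name, score in test_jga.items():
    delta, relative = gain_over_baseline(score, baseline)
    print(f"{name}: {delta:+.2f} points ({relative:+.1f}% relative)")
```

On these figures, full spoken context is 4.23 points above the multimodal baseline on the test set, 10-query compression is 1.45 points above it, and 1-query compression is 1.07 points below it.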

Fine-grained Analysis

Slot Type Analysis

Categorical Slots: All models perform well, with full spoken context showing slight advantages
Temporal and Open Slots: Full spoken context and 10-query compression significantly outperform other methods
Personal Information Slots: Most challenging category, with full spoken context leading and 1-query model performing worst

Dialogue Turn Analysis

Early Turns (1-5): All models perform well
Mid Turns (5-30): Accuracy drops rapidly, with full spoken context consistently leading
Late Turns (40+): Accuracy approaches zero, limited by small LLM capacity

Error Analysis

Analysis of six slots with highest error rates reveals:

Most predictions achieve high fuzzy ratios (>0.8), indicating models typically approximate correct slot values
Errors in restaurant names, attraction names, and hotel names primarily stem from insertions and deletions rather than substitutions
Personal information-related slots remain extremely challenging
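
An analysis of this kind can be approximated as follows. The sketch uses Python's difflib to compute a similarity ratio and to count inserted, deleted, and substituted characters in a prediction relative to the reference. The paper's exact fuzzy-ratio implementation is not specified in this summary, so difflib is an assumption, and the example strings are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(pred, gold):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, pred, gold).ratio()


def edit_op_counts(pred, gold):
    """Count characters inserted, deleted, or replaced when turning gold into pred."""
    counts = {"insert": 0, "delete": 0, "replace": 0}
    matcher = SequenceMatcher(None, gold, pred)
    for tag, g1, g2, p1, p2 in matcher.get_opcodes():
        if tag == "insert":
            counts["insert"] += p2 - p1  # extra characters in the prediction
        elif tag == "delete":
            counts["delete"] += g2 - g1  # reference characters that were dropped
        elif tag == "replace":
            counts["replace"] += max(g2 - g1, p2 - p1)
    return counts


gold = "the cambridge belfry"
pred = "cambridge belfry"
ratio = fuzzy_ratio(pred, gold)
ops = edit_op_counts(pred, gold)
```

In this example the prediction drops the leading "the ", so the error is a pure deletion while the ratio stays above 0.8, which mirrors the pattern described above for venue names.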

Related Work

Traditional Methods

Cascade Systems: Pipeline approaches combining ASR + DST, performing well in DSTC11 challenges
End-to-End Systems: Direct speech-to-dialogue-state mapping, avoiding error propagation

Speech-LLM Development

Speech-aware large language models demonstrate potential in tasks such as ASR and response generation
Recent work applies Speech-LLM to spoken DST, achieving SOTA performance

Context Management Strategies

Existing methods diverge in how they handle context; this paper provides the first systematic comparison of these strategies.

Conclusions and Discussion

Main Conclusions

Full Spoken Context Strategy is Most Effective: Directly using the entire spoken dialogue as input achieves the best performance among the compared strategies
Compression Strategy Provides Good Trade-off: 10-query compression significantly reduces context size while maintaining competitive performance
Speech-LLM Performs Excellently on Spoken DST Tasks: Provides a new technical pathway for the field

Limitations

Computational Complexity: Full spoken context method may incur high computational overhead for very long dialogues
Model Scale Constraints: Not validated on larger-scale LLMs (e.g., Gemma-2-9B)
Dataset Limitations: Primarily validated on SpokenWOZ; generalization requires validation on additional datasets

Future Directions

Explore more sophisticated and compact spoken context processing methods
Extend to larger-scale models
Validate on additional spoken dialogue datasets

In-Depth Evaluation

Strengths

Clear Problem Definition: Systematic study of context management, a key issue in spoken DST
Strong Method Innovation: First systematic comparison of different context management strategies, proposing a simple yet effective full spoken context approach
Comprehensive Experimental Design: Includes sufficient ablation experiments, fine-grained analysis, and error analysis
Convincing Results: Demonstrates method effectiveness across multiple dimensions with significant performance improvements
Thorough Analysis: Analyzes method advantages from multiple perspectives including slot types and dialogue turns

Weaknesses

Insufficient Computational Efficiency Analysis: Lacks detailed analysis of computational complexity and inference time for different methods
Missing Large Model Validation: Does not verify method scalability on larger-scale LLMs
Limited Cross-dataset Generalization: Primarily validated on single dataset; generalization requires further verification
Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why full spoken context is more effective

Impact

Academic Value: Provides new research perspectives and benchmark methods for the spoken DST field
Practical Value: Simple and effective method, easy to reproduce and apply
Technical Contribution: Demonstrates potential of Speech-LLM in spoken understanding tasks

Applicable Scenarios

Task-Oriented Dialogue Systems: Particularly suitable for spoken dialogue systems requiring accurate state tracking
Multi-turn Dialogue Understanding: Applicable to scenarios requiring long-term context understanding
Resource-Constrained Environments: Relatively small model scale makes it suitable for deployment in resource-limited settings

Overall Assessment: This is a high-quality research paper that proposes a simple yet effective solution to core problems in spoken dialogue state tracking. The experimental design is comprehensive, analysis is thorough, and it makes important contributions to the field. Despite some limitations, its innovation and practicality provide significant academic and application value.