CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR

Zhou, Jia, Sari et al.

CTC compressor can be an effective approach to integrate audio encoders to decoder-only models, which has gained growing interest for different speech applications. In this work, we propose a novel CTC compressor based joint speech and text training (CJST) framework for decoder-only ASR. CJST matches speech and text modalities from both directions by exploring a simple modality adaptor and several features of the CTC compressor, including sequence compression, on-the-fly forced peaky alignment and CTC class embeddings. Experimental results on the Librispeech and TED-LIUM2 corpora show that the proposed CJST achieves an effective text injection without the need of duration handling, leading to the best performance for both in-domain and cross-domain scenarios. We also provide a comprehensive study on CTC compressor, covering various compression modes, edge case handling and behavior under both clean and noisy data conditions, which reveals the most robust setting to use CTC compressor for decoder-only models.

Basic Information

Paper ID: 2411.07607
Title: CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR
Authors: Wei Zhou, Junteng Jia, Leda Sari, Jay Mahadeokar, Ozlem Kalinli (Meta AI)
Categories: eess.AS cs.LG cs.SD
Publication Date: November 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2411.07607

Abstract

CTC compressors have emerged as an effective way to integrate audio encoders into decoder-only models, which have gained growing interest across speech applications. This paper proposes a novel CTC compressor based joint speech and text training (CJST) framework for decoder-only ASR. CJST matches the speech and text modalities from both directions by exploring a simple modality adapter and several features of the CTC compressor, including sequence compression, on-the-fly forced peaky alignment, and CTC class embeddings. Experimental results on the Librispeech and TED-LIUM2 corpora show that CJST achieves effective text injection without the need for duration handling, leading to the best performance in both in-domain and cross-domain scenarios.

Research Background and Motivation

Problem Definition

With the tremendous success of large language models (LLMs), decoder-only architectures have been widely adopted in various speech applications. However, how to effectively integrate speech information into decoder-only models and how to conduct joint speech-text training to enhance ASR performance remain challenging problems.

Research Motivation

Integration Challenge: Feeding continuous acoustic embeddings into a decoder-only model requires a suitable adapter
Modality Matching: Speech and text differ substantially in sequence length and representation space, so an explicit alignment mechanism is needed
Text Injection: Production-level ASR models need a way to exploit text-only data without relying on an external language model

Limitations of Existing Methods

Simple Adapters: Adapters built from time-reduction layers plus a linear projection compress the sequence without regard to its content
RNN-T Methods: Existing joint training methods primarily target RNN-T models, requiring complex duration processing
CTC Compressor Sensitivity: Existing CTC compressor methods show unstable performance on noisy data

Core Contributions

Proposes CJST Framework: A novel joint speech-text training framework based on CTC compressors, achieving bidirectional modality matching
Extends CTC Compressor: Comprehensive investigation of various compression modes, edge case handling, and behavior on clean/noisy data
Duration-Free Processing: Achieves effective text injection through on-the-fly forced peaky alignment and CTC class embeddings, without requiring complex duration modeling
Performance Improvement: Achieves the best performance among the compared systems in both in-domain and cross-domain scenarios, with approximately 6% relative improvement over baselines

Methodology Details

Task Definition

This paper investigates automatic speech recognition for decoder-only architectures, with speech feature sequences as input and corresponding text transcriptions as output. It simultaneously considers how to leverage paired speech-text data and pure text data for joint training.

Extended CTC Compressor

Compression Modes

The paper investigates four compression modes for the CTC compressor:

Blank Prediction Removal: Based on greedy CTC predictions, removes all blank frames
Same Prediction Averaging: Averages adjacent frames with identical predictions
Blank Probability Removal: Removes all frames where blank probability exceeds a predefined threshold
Combination Mode: First applies blank probability removal, then applies same prediction averaging
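
The four modes can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that the blank class has index 0, that `h` is a list of encoder frames, that `probs` holds the per-frame CTC posteriors, and that runs of blank predictions are averaged like any other run in the averaging mode. All function names are hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank class


def mean_frames(frames):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def greedy_labels(probs):
    """Arg-max CTC class for every frame."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def remove_blank_predictions(h, probs):
    """Blank prediction removal: drop frames whose greedy prediction is blank."""
    return [f for f, lab in zip(h, greedy_labels(probs)) if lab != BLANK]


def average_same_predictions(h, probs):
    """Same prediction averaging: average each run of identical greedy predictions."""
    out, start = [], 0
    for _, run in groupby(greedy_labels(probs)):
        n = len(list(run))
        out.append(mean_frames(h[start:start + n]))
        start += n
    return out


def remove_blank_probability(h, probs, threshold=0.95):
    """Blank probability removal: drop frames whose blank probability exceeds the threshold."""
    return [f for f, p in zip(h, probs) if p[BLANK] <= threshold]


def combined(h, probs, threshold=0.95):
    """Combination mode: blank probability removal, then same prediction averaging."""
    keep = [i for i, p in enumerate(probs) if p[BLANK] <= threshold]
    return average_same_predictions([h[i] for i in keep], [probs[i] for i in keep])


# Toy input: 4 frames of dimension 1, 3 CTC classes (greedy labels: 0, 1, 1, 0).
h = [[1.0], [3.0], [5.0], [7.0]]
probs = [[0.98, 0.01, 0.01], [0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.6, 0.1, 0.3]]
print(remove_blank_predictions(h, probs))   # [[3.0], [5.0]]
print(average_same_predictions(h, probs))   # [[1.0], [4.0], [7.0]]
print(remove_blank_probability(h, probs))   # [[3.0], [5.0], [7.0]]
print(combined(h, probs))                   # [[4.0], [7.0]]
```

In the toy input the last frame is predicted as blank but with probability 0.6, so the greedy mode drops it while the threshold-based mode keeps it. This is the kind of difference that matters when the CTC posteriors are less confident.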

Edge Case Handling

To address the issue of CTC compressors potentially producing empty outputs, two solutions are proposed:

Empty Skip: Skips these utterances during training, directly outputs EOS during inference
Empty Fallback: Averages all encoder outputs into a single frame, then proceeds with normal training and inference
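
A minimal sketch of the empty fallback, assuming frames are plain lists of floats and that the compressor output is an empty list when every frame was removed. The function name is hypothetical.

```python
def compress_with_fallback(h, compressed):
    """Return the compressor output, or the mean of all encoder frames if it is empty."""
    if compressed:
        return compressed
    n = len(h)
    # Average all encoder outputs into a single frame.
    return [[sum(col) / n for col in zip(*h)]]


encoder_out = [[1.0, 2.0], [3.0, 4.0]]
print(compress_with_fallback(encoder_out, []))            # [[2.0, 3.0]]
print(compress_with_fallback(encoder_out, [[3.0, 4.0]]))  # [[3.0, 4.0]]
```

With this fallback the decoder always receives at least one acoustic frame, so training and inference follow the same code path as for non-empty outputs.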

Embedding Sharing

The paper also explores sharing the CTC class embeddings with the text embeddings, so that the CTC objective pulls the audio encoder outputs closer to the text embedding space.

CJST Framework

Paired Data Processing

For paired speech-text data:

Conducts regular ASR training through model forward propagation
Uses the compressed acoustic embeddings h' and the CTC probabilities to compute an on-the-fly forced peaky alignment
Trains the modality adapter with an MSE loss to match h' and the pseudo-acoustic embeddings h'_text
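
The alignment and loss steps can be illustrated with a small dynamic program. This sketch assumes that a "peaky" alignment means each target label occupies exactly one frame and every other frame is blank, and that blank has index 0; the summary does not spell out these details, and the function names are hypothetical.

```python
import math

BLANK = 0  # assumed index of the CTC blank class


def forced_peaky_alignment(log_probs, labels):
    """Best frame-level path in which each label is emitted on exactly one frame."""
    T, S = len(log_probs), len(labels)
    if S > T:
        raise ValueError("more labels than frames")
    neg = -math.inf
    # score[t][s]: best log-score after t frames with s labels emitted.
    score = [[neg] * (S + 1) for _ in range(T + 1)]
    emit = [[False] * (S + 1) for _ in range(T + 1)]
    score[0][0] = 0.0
    for t in range(1, T + 1):
        frame = log_probs[t - 1]
        for s in range(min(t, S) + 1):
            stay = score[t - 1][s] + frame[BLANK]
            move = score[t - 1][s - 1] + frame[labels[s - 1]] if s > 0 else neg
            if move > stay:
                score[t][s], emit[t][s] = move, True
            else:
                score[t][s] = stay
    # Backtrack from the state where all labels have been emitted.
    path, s = [], S
    for t in range(T, 0, -1):
        if emit[t][s]:
            path.append(labels[s - 1])
            s -= 1
        else:
            path.append(BLANK)
    return path[::-1]


def mse(a, b):
    """Mean squared error between two equal-shape sequences of vectors."""
    sq = [(x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v)]
    return sum(sq) / len(sq)


probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.1, 0.3], [0.2, 0.1, 0.7]]
log_probs = [[math.log(p) for p in row] for row in probs]
print(forced_peaky_alignment(log_probs, [1, 2]))  # [0, 1, 0, 2]
```

The resulting label sequence has the same length as h', so its CTC class embeddings, passed through the adapter, can be compared with h' frame by frame using `mse`.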

Pure Text Data Processing

For pure text data:

Randomly inserts blank symbols based on recorded length ratio R_len(h', y)
Generates pseudo-acoustic prompts h'_text from the CTC class embeddings and the modality adapter
Trains the decoder model using the ASR objective function
Applies 20% random masking to h'_text to maintain learning difficulty
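
A rough sketch of the blank insertion and masking steps for text-only data. The exact sampling scheme is not described in this summary, so the code assumes that blanks are inserted at uniformly random positions until the sequence reaches about R_len times the text length, and that a masked frame is replaced by a zero vector. Function names and the blank index 0 are hypothetical.

```python
import random

BLANK = 0  # assumed index of the CTC blank class


def insert_blanks(tokens, length_ratio, rng):
    """Insert blanks at random positions until the length is about length_ratio * len(tokens)."""
    target = max(len(tokens), round(length_ratio * len(tokens)))
    out = list(tokens)
    for _ in range(target - len(tokens)):
        out.insert(rng.randint(0, len(out)), BLANK)
    return out


def mask_frames(frames, mask_prob, rng):
    """Replace each frame by a zero vector with probability mask_prob."""
    return [[0.0] * len(f) if rng.random() < mask_prob else list(f) for f in frames]


rng = random.Random(0)
pseudo_alignment = insert_blanks([5, 6, 7], length_ratio=2.0, rng=rng)
print(len(pseudo_alignment))  # 6
```

The blank-augmented token sequence would then be mapped to CTC class embeddings, passed through the modality adapter to obtain h'_text, and masked with `mask_frames(h_text, 0.2, rng)` before being fed to the decoder.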

Modality Adapter

The modality adapter is a simple Conformer layer with a single attention head, a convolution kernel size of 3, and no dimension expansion in the feed-forward module.

Experimental Setup

Datasets

Librispeech: 960 hours of clean speech data
Internal Data: 2M hours of acoustically diverse data, including speed perturbation, simulated reverberation, and random background noise
Text Data: LM training text data from Librispeech and TED-LIUM2

Model Configuration

Decoder: 12-layer LLaMA decoder, 768 hidden dimensions, 12 attention heads
Audio Encoder: 24-layer Conformer, 512 hidden dimensions, 8 attention heads
Vocabulary: 4k SentencePiece units per dataset

Training Strategy

Audio encoder pre-training: 200k steps
Full model training: 200k steps on Librispeech, 500k steps on internal data
Joint training: speech and text loss weights both set to 1.0
Auxiliary CTC loss weight: 0.5
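
One way to read these weights is as a weighted sum of the individual losses. The sketch below assumes the terms are simply added; the weight of the MSE loss used for the modality adapter is not given in this summary, so that term is left out. The function and argument names are hypothetical.

```python
def joint_loss(speech_asr_loss, text_asr_loss, ctc_loss,
               speech_weight=1.0, text_weight=1.0, ctc_weight=0.5):
    """Weighted sum of the speech ASR loss, text ASR loss, and auxiliary CTC loss."""
    return (speech_weight * speech_asr_loss
            + text_weight * text_asr_loss
            + ctc_weight * ctc_loss)


print(joint_loss(2.0, 1.0, 4.0))  # 5.0
```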

Evaluation Metrics

Word Error Rate (WER) is used as the primary evaluation metric, reported on test sets.
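
For reference, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words. The following is a generic sketch of that definition, not the scoring tool used in the paper; text normalization is omitted.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


print(wer("a b", "a x"))  # 0.5
```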

Experimental Results

Comprehensive CTC Compressor Evaluation

Librispeech Results (Table I)

All CTC compressor methods outperform simple adapter methods
Blank probability removal (threshold 0.95) achieves best performance: test-clean 2.17%, test-other 4.94%
Embedding sharing helps in some cases but lacks consistency

Internal Data Results (Table II)

Greedy prediction-based methods perform poorly on noisy data
Blank probability removal (threshold 0.95) is most robust: 12.85% WER
Empty fallback scheme outperforms empty skip scheme

Joint Training Results

Training from Scratch (Table III)

Results on Librispeech:

Baseline adapter: test-clean 3.38%, test-other 5.63%
LM-like text injection: test-clean 2.54%, test-other 5.26%
CJST: test-clean 2.09%, test-other 4.71%

Continued Training (Table IV)

Using in-domain and cross-domain text data:

CJST achieves best performance across all scenarios
Cross-domain TED-LIUM2 test set: reduced from 11.45% to 10.14%
Approximately 6% relative improvement over baseline

Key Findings

Blank probability removal is the most robust compression mode
LM-like training is already quite effective, serving as a strong baseline
CJST provides further improvements across all scenarios
CTC compressor is sensitive to data quality, requiring appropriate configuration

Related Work

Decoder-Only Speech Models

Early work uses simple adapters to integrate audio encoders
Recent research explores discrete audio token approaches
This paper focuses on ASR tasks with continuous representations

CTC Compressor

Originally used for attention-based speech translation
Extended to speech translation in decoder-only models
This paper is the first to systematically investigate its application in ASR

Speech-Text Joint Training

Traditional methods primarily target RNN-T models
Examples include JOIST, textogram, and MAESTRO
This paper is the first to propose an effective solution for decoder-only ASR

Conclusions and Discussion

Main Conclusions

CJST Framework is Effective: Achieves effective text injection through bidirectional modality matching
CTC Compressor Configuration is Critical: Blank probability removal (high threshold) is most robust
Duration-Free Processing: Avoids complex duration modeling through on-the-fly forced peaky alignment and CTC class embeddings
Consistent Improvement: Achieves significant improvements in both in-domain and cross-domain scenarios

Limitations

Computational Overhead: On-the-fly forced alignment adds to the training cost
Data Dependency: CTC compressor performance is highly dependent on data quality
Parameter Sensitivity: Requires careful tuning of hyperparameters such as blank probability threshold
Evaluation Scope: Primarily evaluated on English data; multilingual generalization remains unknown

Future Directions

Explore more efficient online alignment methods
Investigate performance in multilingual and low-resource scenarios
Explore hybrid approaches that combine the method with discrete audio tokens
Optimize robustness of CTC compressor

In-Depth Evaluation

Strengths

Methodological Innovation: First application of CTC compressor to joint speech-text training for decoder-only ASR
Systematic Investigation: Comprehensive experimental analysis of CTC compressor
Practical Value: Duration-free processing simplifies implementation complexity
Thorough Experimentation: Validates the method across multiple datasets and scenarios
Clear Writing: Well-structured paper with detailed technical descriptions

Weaknesses

Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why CJST is effective
Computational Cost: Lacks detailed analysis of training and inference computational overhead
Hyperparameter Sensitivity: Method involves multiple hyperparameters with complex tuning requirements
Evaluation Limitations: Primarily evaluated on English data; lacks multilingual validation

Impact

Academic Contribution: Provides new insights for text injection in decoder-only ASR
Practical Value: Relatively simple method, easy to deploy in production environments
Reproducibility: Provides detailed implementation details and hyperparameter settings
Inspirational Value: Offers valuable insights for further research on CTC compressors

Applicable Scenarios

Production-Level ASR: Suitable for scenarios where external language models cannot be used
Cross-Domain Adaptation: Particularly suitable for applications requiring rapid domain adaptation
Resource-Constrained Settings: More efficient than complex duration modeling methods
Joint Training: Suitable for scenarios with abundant text data but relatively limited speech data

References

The paper cites 32 relevant references, covering important works in large language models, decoder-only architectures, CTC methods, speech recognition, and joint training, providing a solid theoretical foundation for the research.

Overall Assessment: This is a high-quality technical paper proposing an innovative CJST framework that addresses the important problem of joint speech-text training in decoder-only ASR. The experimental design is comprehensive, results are convincing, and the work has significant academic and practical value for the field.