2025-11-22T21:25:24.652246

FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms

Shree, Jupuru

CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.

academic

FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms

Basic Information

Paper ID: 2510.09085
Title: FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms
Authors: Atul Shree, Harshith Jupuru
Classification: cs.LG cs.SD eess.AS
Submission Date: October 10, 2025 (arXiv submission)
Paper Link: https://arxiv.org/abs/2510.09085

Abstract

CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5× runtime speedup and 2.78× memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.

Research Background and Motivation

Problem Definition

This research addresses computational and memory bottlenecks faced by CTC-based automatic speech recognition (ASR) systems in resource-constrained environments. Traditional CTC decoders require exhaustive processing of all possible tokens at each time step, resulting in severe efficiency issues.

Problem Significance

Computational Resource Bottleneck: In systems equipped with L4 GPUs and wav2vec2-large encoders, CTC decoding can consume up to 90% of processing time
Memory Constraints: Traditional CTC decoders incur substantial memory consumption in large-vocabulary models
Real-time Application Requirements: Real-time speech recognition and deployment on low-resource devices impose strict efficiency requirements on decoding

Limitations of Existing Methods

Static Pruning Strategies: Static top-N pruning employed by KenLM and Flashlight lacks frame-level adaptivity
Platform Specificity: GPU-specific acceleration solutions overlook CPU and resource-constrained device scenarios
Architecture Dependency: Optimization methods designed for RNN-T models cannot be directly transferred to CTC architectures

Research Motivation

To develop a universal, platform-agnostic CTC decoding optimization algorithm that significantly improves decoding efficiency through dynamic frame-level token pruning while maintaining recognition accuracy.

Core Contributions

Proposes FLToP CTC Algorithm: A dynamic frame-level token pruning decoding algorithm guided by relative threshold probability
Platform-Agnostic Design: The algorithm is simple and universal, enabling seamless integration into CTC decoders across various platforms (CPUs, GPUs, etc.)
Significant Performance Gains: Achieves 10.5× runtime speedup and 2.78× memory reduction on the LibriSpeech dataset
Statistical Behavior Analysis: Provides in-depth investigation of CTC decoder statistical behavior, offering theoretical support for algorithm design

Methodology Details

Task Definition

Input: Logits sequence [T×V] from CTC model output, where T is the number of time steps and V is the vocabulary size Output: Optimal text sequence Constraints: Minimize computational and memory overhead while maintaining WER performance

Model Architecture

FLToP CTC Algorithm Core

The algorithm employs a two-stage pruning strategy:

Top-N Selection: Select the top N tokens with highest probabilities for the current frame
Relative Threshold Pruning: Retain only tokens with scores exceeding R × highest_score, where R is the relative threshold parameter

Algorithm Procedure

procedure BEAMSEARCHFLTOPCTC(logits, beam_size, beam_threshold, LM, N, R):
    B ← {(ε, 0)}  # Initialize beam
    for t in 0...T:
        B' ← {}
        logits_idx_sorted ← PartialSortDesc(logits[t], N)
        logit_t0 ← logits[t][logits_idx_sorted[0]]  # Highest score
        
        for (prefix, score) in B:
            for i in 0...N:
                logit_ti ← logits[t][logits_idx_sorted[i]]
                if logit_ti ≤ logit_t0 × R:  # Relative threshold pruning
                    break
                # Expand hypothesis
                token ← IdToToken(logits_idx_sorted[i])
                prefix' ← prefix + token
                score' ← score + logit_ti + LM(prefix')
                B'.add((prefix', score'))
        
        B ← SelectTopK(B', beam_size, beam_threshold)
    return GetHighestScorePrefix(B)

Technical Innovations

Dynamic Adaptive Pruning: Compared to static top-N methods, dynamically adjusts the number of retained tokens based on each frame's probability distribution
Relative Threshold Design: Uses proportional thresholds relative to the highest score rather than absolute thresholds, improving adaptability across different scenarios
Conditional Termination Mechanism: Avoids unnecessary token evaluation through early break mechanism, further enhancing efficiency
Platform-Agnostic Implementation: Simple algorithm design requiring no special hardware support, deployable across various computing platforms

Experimental Setup

Datasets

LibriSpeech Dataset: Evaluation using dev-clean, dev-other, test-clean, and test-other subsets
Language Model: 4-gram KenLM language model constructed from training set
Encoder: wav2vec2-large model pre-trained on LibriSpeech and LibriVox data, fine-tuned on 960 hours of LibriSpeech data

Evaluation Metrics

Word Error Rate (WER): Measures recognition accuracy
Decoding Time: Measures computational efficiency
Memory Usage: Indirectly measured through beam count

Comparison Methods

Baseline Configuration: Standard CTC decoder using all 32 tokens
Top-N Pruning: Static top-N pruning method
FLToP CTC: Proposed dynamic pruning method

Implementation Details

Vocabulary: 32 tokens (26 letters + apostrophe + space + special tokens)
Beam Parameters: beam-size=1000, beam-threshold=25
Language Model Weights: lm-weight=1.0, word-score=0.95, sil-score=0.0
Tools: Experiments conducted using flashlight-text, fairseq, and KenLM

Experimental Results

Main Results

Token Selection Statistical Analysis

Through statistical analysis of token selection indices across all test samples:

99.9823% of cases select within top 4 tokens, supporting N=4 configuration
Index 0 (highest probability token) selected 1,123,792 times, far exceeding other indices
Average emission scores demonstrate significant advantage of top tokens

Top-N Threshold Experiments (N=1...32)

Optimal balance at N=4: WER=3.852, superior to baseline's 3.864
Linear growth in decoding time: baseline (N=32) 3.94× slower than N=4 configuration
Negligible WER improvement for N>4, validating N=4 rationality

Relative Threshold Experiments (N=4, varying R)

Key Findings:

Optimal efficiency at R=0.007: WER=3.843, decoding time 369.6 seconds
2.78× speedup compared to Top-4 method, 10.5× speedup versus baseline
Best WER at R=0.001: 3.831, slightly slower than R=0.007 but still faster than Top-4
WER range: Maintains 3.831-4.301 across different R values

Memory Efficiency Analysis

FLToP CTC demonstrates superior performance in beam count control:

Average beam count: 214.4 (FLToP CTC) vs 596.26 (baseline) vs 461.99 (Top-N)
Memory reduction: 2.78× compared to baseline, 2.15× compared to Top-N
Distribution characteristics: Mean, median, and quartiles all significantly lower than comparison methods

Ablation Studies

N Value Impact: Significant performance improvement from N=1 to N=4, diminishing returns for N>4
R Value Impact: R in range 0.001-0.007 provides optimal efficiency-accuracy trade-off
Combined Effect: N=4 combined with R=0.007 achieves optimal efficiency-accuracy balance

CTC Decoding Optimization

Static Pruning Methods: KenLM, Flashlight employ fixed top-N strategies
Hardware-Specific Optimization: GPU acceleration approaches lacking generality
Model Compression: Reduce computation through model compression, potentially affecting accuracy

RNN-T Optimization

Architecture Differences: RNN-T optimization methods cannot be directly applied to CTC due to architectural differences
Pruning Strategies: Provide pruning insights but require redesign for CTC characteristics

Traditional ASR Tools

HMM/Viterbi Methods: Kaldi, HARPY employ state-dependent pruning
Granularity Differences: Traditional methods operate at higher granularity, while FLToP CTC operates at frame level

Conclusions and Discussion

Main Conclusions

Significant Efficiency Gains: FLToP CTC achieves 10.5× runtime speedup and 2.78× memory reduction
Accuracy Preservation: Maintains and even slightly improves WER performance while substantially improving efficiency
Universal Applicability: Simple and universal algorithm deployable across platforms
Statistically-Driven Design: Algorithm parameters designed based on in-depth statistical analysis

Limitations

Vocabulary Scale Dependency: Validated on small vocabulary (32 tokens), large-vocabulary performance requires further verification
Language Specificity: Primarily tested on English datasets, multilingual adaptability needs verification
Model Dependency: Primarily based on wav2vec2 model, adaptability to other CTC models requires verification
Parameter Tuning: R and N parameters may require tuning for different application scenarios

Future Directions

Adaptive Parameter Adjustment: Develop methods to dynamically adjust R values based on input features
Large Vocabulary Extension: Verify algorithm effectiveness in larger vocabulary and multilingual scenarios
End-to-End Optimization: Combine model training process to optimize decoding efficiency
Hardware-Specific Optimization: Further optimize implementation for specific hardware platforms

In-Depth Evaluation

Strengths

High Practical Value: Addresses actual CTC decoding bottlenecks with direct application value
Simple and Effective Method: Algorithm design is simple yet effective, easy to understand and implement
Comprehensive Experiments: Systematic and thorough experimental design from statistical analysis to performance evaluation
Strong Generality: Platform-agnostic design provides broad applicability
Significant Performance Gains: Impressive 10.5× speedup and 2.78× memory reduction

Weaknesses

Limited Evaluation Scope: Evaluated only on LibriSpeech dataset and specific models, lacking broader validation
Insufficient Theoretical Analysis: Lacks analysis of algorithm convergence and theoretical guarantees
Parameter Sensitivity: R and N parameter selection may require tuning for different scenarios
Single Comparison Baseline: Primarily compared with standard CTC decoder, lacking comparison with other optimization methods

Impact

Technical Contribution: Provides new insights and practical methods for CTC decoding optimization
Practical Value: Significant importance for ASR deployment in resource-constrained environments
Reproducibility: Clear algorithm description and relatively simple implementation ensure good reproducibility
Promotion Potential: Strong generality suggests broad industrial application prospects

Applicable Scenarios

Resource-Constrained Environments: Mobile devices, edge computing and other scenarios with limited computational resources
Real-Time Applications: Real-time speech recognition applications sensitive to latency
Large-Scale Deployment: Cloud service scenarios requiring processing large volumes of speech requests
Embedded Systems: IoT devices and applications with strict power and memory constraints

References

The paper cites 32 relevant references, primarily including:

CTC Foundational Theory: Graves et al. (2006), Bourlard & Morgan (1994)
Modern ASR Models: wav2vec 2.0, WavLM
Decoding Optimization Tools: KenLM, Flashlight
Datasets: LibriSpeech, LibriVox
Related Optimization Methods: Important works in model compression and hardware acceleration domains

Overall Assessment: This is a highly practical technical paper that proposes the FLToP CTC algorithm, which is simple yet effective, achieving significant advances in CTC decoding optimization. While there is room for improvement in evaluation scope and theoretical analysis, its practical value and generality make it a valuable contribution to the ASR field.