FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms
Shree, Jupuru
CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders can account for up to 90% of processing time (e.g., with wav2vec2-large on L4 GPUs) and are inefficient because they perform exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands with negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5× runtime speedup and 2.78× memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and real-time applications, and improving the accessibility and efficiency of speech recognition.
This research addresses computational and memory bottlenecks faced by CTC-based automatic speech recognition (ASR) systems in resource-constrained environments. Traditional CTC decoders require exhaustive processing of all possible tokens at each time step, resulting in severe efficiency issues.
The goal is to develop a universal, platform-agnostic CTC decoding optimization algorithm that significantly improves decoding efficiency through dynamic frame-level token pruning while maintaining recognition accuracy.
FLToP CTC Algorithm: Proposes a decoding algorithm that prunes tokens dynamically at the frame level, guided by a relative threshold probability
Platform-Agnostic Design: The algorithm is simple and universal, enabling seamless integration into CTC decoders across various platforms (CPUs, GPUs, etc.)
Significant Performance Gains: Achieves 10.5× runtime speedup and 2.78× memory reduction on the LibriSpeech dataset
Statistical Behavior Analysis: Provides in-depth investigation of CTC decoder statistical behavior, offering theoretical support for algorithm design
Input: Logits sequence [T×V] from CTC model output, where T is the number of time steps and V is the vocabulary size
Output: Optimal text sequence
Constraints: Minimize computational and memory overhead while maintaining WER performance
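To make the input/output contract above concrete, here is a minimal sketch of greedy (best-path) CTC decoding over a [T×V] matrix. This is the simplest CTC decoder, not the beam-search decoder the paper optimizes. The function name `greedy_ctc_decode`, the `vocab` list, and the choice of index 0 for the blank token are illustrative assumptions, not details from the paper.

```python
def greedy_ctc_decode(logits, vocab, blank_id=0):
    """Best-path CTC decoding (illustrative baseline, assumed setup).

    logits: list of T frames, each a list of V scores.
    vocab:  list of V token strings; vocab[blank_id] is the CTC blank.
    """
    out = []
    prev = None
    for frame in logits:
        # Pick the highest-scoring token for this frame.
        best = max(range(len(frame)), key=frame.__getitem__)
        # CTC collapse rule: skip repeated tokens and blanks.
        if best != prev and best != blank_id:
            out.append(vocab[best])
        prev = best
    return "".join(out)


# T = 6 frames, V = 3 tokens (blank, "a", "b").
example_vocab = ["_", "a", "b"]
example_logits = [
    [0, 2, 1],  # a
    [0, 3, 0],  # a (repeat, collapsed)
    [5, 0, 0],  # blank (separates the two a's)
    [0, 2, 1],  # a
    [0, 1, 4],  # b
    [0, 0, 3],  # b (repeat, collapsed)
]
decoded = greedy_ctc_decode(example_logits, example_vocab)
```

Even this baseline touches all V scores in every frame, which is the per-frame cost that token pruning aims to reduce.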
Dynamic Adaptive Pruning: Compared to static top-N methods, dynamically adjusts the number of retained tokens based on each frame's probability distribution
Relative Threshold Design: Uses proportional thresholds relative to the highest score rather than absolute thresholds, improving adaptability across different scenarios
Conditional Termination Mechanism: Avoids unnecessary token evaluation through an early-break mechanism, further enhancing efficiency
Platform-Agnostic Implementation: Simple algorithm design requiring no special hardware support, deployable across various computing platforms
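The pruning ideas listed above can be sketched for a single frame as follows. This is a hedged illustration, not the paper's implementation: it assumes per-frame log-probabilities, a hypothetical parameter `rel_threshold` in (0, 1], and a rule that keeps a token when its probability is at least `rel_threshold` times the frame's highest probability. Sorting the tokens first is also an assumption, made so that the early break is valid.

```python
import math


def prune_frame(log_probs, rel_threshold):
    """Return the token indices kept for one frame (illustrative sketch).

    log_probs:     list of V log-probabilities for the frame.
    rel_threshold: assumed ratio in (0, 1]; a token survives if
                   p(token) >= rel_threshold * p(best token).
    """
    # Visit tokens from most to least probable.
    order = sorted(range(len(log_probs)),
                   key=lambda i: log_probs[i], reverse=True)
    # In log space the relative threshold becomes an additive offset.
    cutoff = log_probs[order[0]] + math.log(rel_threshold)
    kept = []
    for idx in order:
        if log_probs[idx] < cutoff:
            # Early break: every remaining token scores lower still.
            break
        kept.append(idx)
    return kept


# A peaked frame keeps few tokens; a flat frame keeps all of them.
peaked = [math.log(p) for p in (0.7, 0.2, 0.05, 0.05)]
flat = [math.log(0.25)] * 4
kept_peaked = prune_frame(peaked, 0.1)
kept_flat = prune_frame(flat, 0.1)
```

The number of surviving tokens adapts to each frame's distribution, which is the difference from a static top-N cut: a confident frame passes only a couple of candidates on to the decoder, while an ambiguous frame keeps more.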
The paper cites 32 relevant references, primarily including:
CTC Foundational Theory: Graves et al. (2006), Bourlard & Morgan (1994)
Modern ASR Models: wav2vec 2.0, WavLM
Decoding Optimization Tools: KenLM, Flashlight
Datasets: LibriSpeech, LibriVox
Related Optimization Methods: Important works in model compression and hardware acceleration domains
Overall Assessment: This is a highly practical technical paper that proposes the FLToP CTC algorithm, which is simple yet effective, achieving significant advances in CTC decoding optimization. While there is room for improvement in evaluation scope and theoretical analysis, its practical value and generality make it a valuable contribution to the ASR field.