2025-11-22T21:25:24.652246

FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms

Shree, Jupuru
CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.
academic

FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms

Basic Information

  • Paper ID: 2510.09085
  • Title: FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms
  • Authors: Atul Shree, Harshith Jupuru
  • Classification: cs.LG cs.SD eess.AS
  • Submission Date: October 10, 2025 (arXiv submission)
  • Paper Link: https://arxiv.org/abs/2510.09085

Abstract

CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5× runtime speedup and 2.78× memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.

Research Background and Motivation

Problem Definition

This research addresses computational and memory bottlenecks faced by CTC-based automatic speech recognition (ASR) systems in resource-constrained environments. Traditional CTC decoders require exhaustive processing of all possible tokens at each time step, resulting in severe efficiency issues.

Problem Significance

  1. Computational Resource Bottleneck: In systems equipped with L4 GPUs and wav2vec2-large encoders, CTC decoding can consume up to 90% of processing time
  2. Memory Constraints: Traditional CTC decoders incur substantial memory consumption in large-vocabulary models
  3. Real-time Application Requirements: Real-time speech recognition and deployment on low-resource devices impose strict efficiency requirements on decoding

Limitations of Existing Methods

  1. Static Pruning Strategies: Static top-N pruning employed by KenLM and Flashlight lacks frame-level adaptivity
  2. Platform Specificity: GPU-specific acceleration solutions overlook CPU and resource-constrained device scenarios
  3. Architecture Dependency: Optimization methods designed for RNN-T models cannot be directly transferred to CTC architectures

Research Motivation

To develop a universal, platform-agnostic CTC decoding optimization algorithm that significantly improves decoding efficiency through dynamic frame-level token pruning while maintaining recognition accuracy.

Core Contributions

  1. Proposes FLToP CTC Algorithm: A dynamic frame-level token pruning decoding algorithm guided by relative threshold probability
  2. Platform-Agnostic Design: The algorithm is simple and universal, enabling seamless integration into CTC decoders across various platforms (CPUs, GPUs, etc.)
  3. Significant Performance Gains: Achieves 10.5× runtime speedup and 2.78× memory reduction on the LibriSpeech dataset
  4. Statistical Behavior Analysis: Provides in-depth investigation of CTC decoder statistical behavior, offering theoretical support for algorithm design

Methodology Details

Task Definition

Input: Logits sequence [T×V] from CTC model output, where T is the number of time steps and V is the vocabulary size Output: Optimal text sequence Constraints: Minimize computational and memory overhead while maintaining WER performance

Model Architecture

FLToP CTC Algorithm Core

The algorithm employs a two-stage pruning strategy:

  1. Top-N Selection: Select the top N tokens with highest probabilities for the current frame
  2. Relative Threshold Pruning: Retain only tokens with scores exceeding R × highest_score, where R is the relative threshold parameter

Algorithm Procedure

procedure BEAMSEARCHFLTOPCTC(logits, beam_size, beam_threshold, LM, N, R):
    B ← {(ε, 0)}  # Initialize beam
    for t in 0...T:
        B' ← {}
        logits_idx_sorted ← PartialSortDesc(logits[t], N)
        logit_t0 ← logits[t][logits_idx_sorted[0]]  # Highest score
        
        for (prefix, score) in B:
            for i in 0...N:
                logit_ti ← logits[t][logits_idx_sorted[i]]
                if logit_ti ≤ logit_t0 × R:  # Relative threshold pruning
                    break
                # Expand hypothesis
                token ← IdToToken(logits_idx_sorted[i])
                prefix' ← prefix + token
                score' ← score + logit_ti + LM(prefix')
                B'.add((prefix', score'))
        
        B ← SelectTopK(B', beam_size, beam_threshold)
    return GetHighestScorePrefix(B)

Technical Innovations

  1. Dynamic Adaptive Pruning: Compared to static top-N methods, dynamically adjusts the number of retained tokens based on each frame's probability distribution
  2. Relative Threshold Design: Uses proportional thresholds relative to the highest score rather than absolute thresholds, improving adaptability across different scenarios
  3. Conditional Termination Mechanism: Avoids unnecessary token evaluation through early break mechanism, further enhancing efficiency
  4. Platform-Agnostic Implementation: Simple algorithm design requiring no special hardware support, deployable across various computing platforms

Experimental Setup

Datasets

  • LibriSpeech Dataset: Evaluation using dev-clean, dev-other, test-clean, and test-other subsets
  • Language Model: 4-gram KenLM language model constructed from training set
  • Encoder: wav2vec2-large model pre-trained on LibriSpeech and LibriVox data, fine-tuned on 960 hours of LibriSpeech data

Evaluation Metrics

  • Word Error Rate (WER): Measures recognition accuracy
  • Decoding Time: Measures computational efficiency
  • Memory Usage: Indirectly measured through beam count

Comparison Methods

  1. Baseline Configuration: Standard CTC decoder using all 32 tokens
  2. Top-N Pruning: Static top-N pruning method
  3. FLToP CTC: Proposed dynamic pruning method

Implementation Details

  • Vocabulary: 32 tokens (26 letters + apostrophe + space + special tokens)
  • Beam Parameters: beam-size=1000, beam-threshold=25
  • Language Model Weights: lm-weight=1.0, word-score=0.95, sil-score=0.0
  • Tools: Experiments conducted using flashlight-text, fairseq, and KenLM

Experimental Results

Main Results

Token Selection Statistical Analysis

Through statistical analysis of token selection indices across all test samples:

  • 99.9823% of cases select within top 4 tokens, supporting N=4 configuration
  • Index 0 (highest probability token) selected 1,123,792 times, far exceeding other indices
  • Average emission scores demonstrate significant advantage of top tokens

Top-N Threshold Experiments (N=1...32)

  • Optimal balance at N=4: WER=3.852, superior to baseline's 3.864
  • Linear growth in decoding time: baseline (N=32) 3.94× slower than N=4 configuration
  • Negligible WER improvement for N>4, validating N=4 rationality

Relative Threshold Experiments (N=4, varying R)

Key Findings:

  • Optimal efficiency at R=0.007: WER=3.843, decoding time 369.6 seconds
  • 2.78× speedup compared to Top-4 method, 10.5× speedup versus baseline
  • Best WER at R=0.001: 3.831, slightly slower than R=0.007 but still faster than Top-4
  • WER range: Maintains 3.831-4.301 across different R values

Memory Efficiency Analysis

FLToP CTC demonstrates superior performance in beam count control:

  • Average beam count: 214.4 (FLToP CTC) vs 596.26 (baseline) vs 461.99 (Top-N)
  • Memory reduction: 2.78× compared to baseline, 2.15× compared to Top-N
  • Distribution characteristics: Mean, median, and quartiles all significantly lower than comparison methods

Ablation Studies

  1. N Value Impact: Significant performance improvement from N=1 to N=4, diminishing returns for N>4
  2. R Value Impact: R in range 0.001-0.007 provides optimal efficiency-accuracy trade-off
  3. Combined Effect: N=4 combined with R=0.007 achieves optimal efficiency-accuracy balance

CTC Decoding Optimization

  • Static Pruning Methods: KenLM, Flashlight employ fixed top-N strategies
  • Hardware-Specific Optimization: GPU acceleration approaches lacking generality
  • Model Compression: Reduce computation through model compression, potentially affecting accuracy

RNN-T Optimization

  • Architecture Differences: RNN-T optimization methods cannot be directly applied to CTC due to architectural differences
  • Pruning Strategies: Provide pruning insights but require redesign for CTC characteristics

Traditional ASR Tools

  • HMM/Viterbi Methods: Kaldi, HARPY employ state-dependent pruning
  • Granularity Differences: Traditional methods operate at higher granularity, while FLToP CTC operates at frame level

Conclusions and Discussion

Main Conclusions

  1. Significant Efficiency Gains: FLToP CTC achieves 10.5× runtime speedup and 2.78× memory reduction
  2. Accuracy Preservation: Maintains and even slightly improves WER performance while substantially improving efficiency
  3. Universal Applicability: Simple and universal algorithm deployable across platforms
  4. Statistically-Driven Design: Algorithm parameters designed based on in-depth statistical analysis

Limitations

  1. Vocabulary Scale Dependency: Validated on small vocabulary (32 tokens), large-vocabulary performance requires further verification
  2. Language Specificity: Primarily tested on English datasets, multilingual adaptability needs verification
  3. Model Dependency: Primarily based on wav2vec2 model, adaptability to other CTC models requires verification
  4. Parameter Tuning: R and N parameters may require tuning for different application scenarios

Future Directions

  1. Adaptive Parameter Adjustment: Develop methods to dynamically adjust R values based on input features
  2. Large Vocabulary Extension: Verify algorithm effectiveness in larger vocabulary and multilingual scenarios
  3. End-to-End Optimization: Combine model training process to optimize decoding efficiency
  4. Hardware-Specific Optimization: Further optimize implementation for specific hardware platforms

In-Depth Evaluation

Strengths

  1. High Practical Value: Addresses actual CTC decoding bottlenecks with direct application value
  2. Simple and Effective Method: Algorithm design is simple yet effective, easy to understand and implement
  3. Comprehensive Experiments: Systematic and thorough experimental design from statistical analysis to performance evaluation
  4. Strong Generality: Platform-agnostic design provides broad applicability
  5. Significant Performance Gains: Impressive 10.5× speedup and 2.78× memory reduction

Weaknesses

  1. Limited Evaluation Scope: Evaluated only on LibriSpeech dataset and specific models, lacking broader validation
  2. Insufficient Theoretical Analysis: Lacks analysis of algorithm convergence and theoretical guarantees
  3. Parameter Sensitivity: R and N parameter selection may require tuning for different scenarios
  4. Single Comparison Baseline: Primarily compared with standard CTC decoder, lacking comparison with other optimization methods

Impact

  1. Technical Contribution: Provides new insights and practical methods for CTC decoding optimization
  2. Practical Value: Significant importance for ASR deployment in resource-constrained environments
  3. Reproducibility: Clear algorithm description and relatively simple implementation ensure good reproducibility
  4. Promotion Potential: Strong generality suggests broad industrial application prospects

Applicable Scenarios

  1. Resource-Constrained Environments: Mobile devices, edge computing and other scenarios with limited computational resources
  2. Real-Time Applications: Real-time speech recognition applications sensitive to latency
  3. Large-Scale Deployment: Cloud service scenarios requiring processing large volumes of speech requests
  4. Embedded Systems: IoT devices and applications with strict power and memory constraints

References

The paper cites 32 relevant references, primarily including:

  • CTC Foundational Theory: Graves et al. (2006), Bourlard & Morgan (1994)
  • Modern ASR Models: wav2vec 2.0, WavLM
  • Decoding Optimization Tools: KenLM, Flashlight
  • Datasets: LibriSpeech, LibriVox
  • Related Optimization Methods: Important works in model compression and hardware acceleration domains

Overall Assessment: This is a highly practical technical paper that proposes the FLToP CTC algorithm, which is simple yet effective, achieving significant advances in CTC decoding optimization. While there is room for improvement in evaluation scope and theoretical analysis, its practical value and generality make it a valuable contribution to the ASR field.