FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms

Shree, Jupuru

CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.

Basic Information

Paper ID: 2510.09085
Title: FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms
Authors: Atul Shree, Harshith Jupuru
Categories: cs.LG cs.SD eess.AS
Published: October 10, 2025 (arXiv submission)
Paper link: https://arxiv.org/abs/2510.09085

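Research Background and Motivation

Problem Definition

This work addresses the computational and memory bottlenecks that CTC-based automatic speech recognition (ASR) systems face in resource-limited environments. A traditional CTC decoder processes every possible token exhaustively at each time step, which makes decoding highly inefficient.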

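Why the Problem Matters

Compute bottleneck: in a system with an L4 GPU and a wav2vec2-large encoder, CTC decoding can take up to 90% of processing time
Memory limits: traditional CTC decoders consume large amounts of memory with large-vocabulary models
Real-time requirements: real-time speech recognition and deployment on low-resource devices place strict demands on decoding efficiency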

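Limitations of Existing Methods

Static pruning: the static top-N pruning used by, for example, KenLM and Flashlight lacks frame-level adaptivity
Platform specificity: GPU-specific acceleration schemes overlook CPUs and constrained devices
Architecture dependence: optimizations designed for RNN-T models do not transfer directly to CTC architectures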

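Research Motivation

To develop a general, platform-agnostic optimization of CTC decoding that uses dynamic frame-level token pruning to substantially improve decoding efficiency while preserving recognition accuracy.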

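Core Contributions

The FLToP CTC algorithm: a decoding algorithm that prunes tokens dynamically at the frame level using a relative threshold probability
Platform-agnostic design: the algorithm is simple and general, and integrates seamlessly into CTC decoders on various platforms (CPUs, GPUs, etc.)
Significant performance gains: a 10.5× runtime speedup and a 2.78× memory reduction on the LibriSpeech dataset versus standard CTC decoders
Statistical behavior analysis: an in-depth study of the statistical behavior of CTC decoders, which provides the theoretical grounding for the algorithm's design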

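Method Details

Task Definition

Input: the sequence of logits output by the CTC model, of shape [T×V], where T is the number of time steps and V is the vocabulary size
Output: the optimal text sequence
Constraint: minimize compute and memory overhead while preserving WER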

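Model Architecture

Core of the FLToP CTC Algorithm

The algorithm prunes in two stages:

Top-N selection: select the N highest-probability tokens for the current frame
Relative-threshold pruning: keep only tokens whose score is higher than R × the highest score, where R is the relative threshold parameter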
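
The two stages can be sketched for a single frame as follows. This is a minimal illustration, not the authors' implementation: the function name prune_frame is hypothetical, the scores are assumed to be probabilities (non-negative), and R is assumed to satisfy 0 ≤ R < 1 so that the best token always survives.

```python
import numpy as np

def prune_frame(probs, n, r):
    """Return the token ids kept for one frame, highest probability first.

    probs: 1-D array of per-token probabilities for the frame (assumed non-negative)
    n:     top-N cap on the number of candidate tokens (n >= 1)
    r:     relative threshold, assumed to satisfy 0 <= r < 1
    """
    n = min(n, probs.shape[0])
    # Stage 1: top-N selection. argpartition isolates the N largest entries
    # without sorting the whole vocabulary; only those N are then sorted.
    top = np.argpartition(-probs, n - 1)[:n]
    top = top[np.argsort(-probs[top])]
    # Stage 2: relative-threshold pruning against the best score in the frame.
    keep = probs[top] > r * probs[top[0]]
    return top[keep]

frame = np.array([0.70, 0.02, 0.20, 0.05, 0.03])
print(prune_frame(frame, n=3, r=0.1).tolist())  # -> [0, 2]
```

In this example the top-3 candidates are tokens 0, 2 and 3, and the threshold 0.1 × 0.70 = 0.07 then removes token 3 (probability 0.05). A frame dominated by one token keeps very few candidates, while a flatter frame keeps up to N, which is the frame-level adaptivity that static top-N pruning lacks.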

Algorithm Procedure

procedure BEAMSEARCHFLTOPCTC(logits, beam_size, beam_threshold, LM, N, R):
    B ← {(ε, 0)}  # initialize the beam with the empty prefix
    for t in 0...T-1:
        B' ← {}
        logits_idx_sorted ← PartialSortDesc(logits[t], N)  # indices of the N highest scores, descending
        logit_t0 ← logits[t][logits_idx_sorted[0]]  # highest score in frame t

        for (prefix, score) in B:
            for i in 0...N-1:
                logit_ti ← logits[t][logits_idx_sorted[i]]
                # relative-threshold pruning; scores are sorted, so every later token fails too
                # (the multiplicative form assumes non-negative scores, e.g. probabilities)
                if logit_ti ≤ logit_t0 × R:
                    break
                # extend the hypothesis
                token ← IdToToken(logits_idx_sorted[i])
                prefix' ← prefix + token
                score' ← score + logit_ti + LM(prefix')
                B'.add((prefix', score'))

        B ← SelectTopK(B', beam_size, beam_threshold)
    return GetHighestScorePrefix(B)
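
The procedure above can be turned into a small runnable sketch. Several choices here are assumptions rather than details confirmed by the paper. The function name beam_search_fltop is hypothetical. The input is taken to be log-probabilities, so the relative threshold is applied as best + log(R), which is equivalent to R × best in the probability domain. Candidates that produce the same prefix are merged by keeping the higher score. beam_threshold is read as the maximum allowed score gap to the best hypothesis. Like the listing, the sketch concatenates tokens directly and does not show CTC blank or repeat handling.

```python
import math
import numpy as np

def beam_search_fltop(log_probs, beam_size, n, r, id_to_token,
                      lm=None, beam_threshold=math.inf):
    """Simplified beam search with frame-level token pruning (illustrative only).

    log_probs: array of shape [T, V] holding per-frame log-probabilities
    r:         relative threshold, assumed to satisfy 0 < r < 1
    lm:        optional callable scoring a prefix; treated as 0 when omitted
    """
    log_r = math.log(r)
    beam = [("", 0.0)]  # (prefix, score), starting from the empty prefix
    for frame in log_probs:
        n_t = min(n, frame.shape[0])
        # Partial sort: isolate the N best tokens, then order only those.
        top = np.argpartition(-frame, n_t - 1)[:n_t]
        top = top[np.argsort(-frame[top])]
        best = frame[top[0]]
        # The threshold depends only on the frame, so prune once per frame
        # instead of once per prefix as in the listing.
        kept = [int(i) for i in top if frame[i] > best + log_r]

        candidates = {}
        for prefix, score in beam:
            for i in kept:
                new_prefix = prefix + id_to_token(i)
                new_score = score + float(frame[i])
                if lm is not None:
                    new_score += lm(new_prefix)
                # Assumption: identical prefixes keep their best score.
                if new_score > candidates.get(new_prefix, -math.inf):
                    candidates[new_prefix] = new_score

        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        top_score = ranked[0][1]
        beam = [(p, s) for p, s in ranked[:beam_size]
                if s >= top_score - beam_threshold]
    return max(beam, key=lambda ps: ps[1])[0]

vocab = ["a", "b", "c"]
frames = np.log(np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1]]))
print(beam_search_fltop(frames, beam_size=2, n=2, r=0.2,
                        id_to_token=lambda i: vocab[i]))  # -> ab
```

In the second frame only token "b" survives the threshold (0.1 < 0.2 × 0.8), so each hypothesis is extended once instead of N times. Fewer extensions per frame is where the reduction in compute and candidate storage comes from.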