2025-11-11T20:37:15.929319

DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Zhang, Ullah, Schultheis et al.

Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter's hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.


Basic Information

Paper ID: 2510.13847
Title: DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
Authors: Jinbin Zhang (Aalto University), Nasib Ullah (Aalto University), Erik Schultheis (IST Austria), Rohit Babbar (University of Bath)
Categories: cs.CL cs.AI cs.LG
Published: October 17, 2025 (Preprint)
Paper link: https://arxiv.org/abs/2510.13847

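Abstract

Speculative decoding has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens, and a large target model verifies them once per speculation length. As LLM vocabularies have scaled up, the number of tokens has grown substantially. Verification over the full vocabulary has little effect on the target model, but the O(|V|d) parameters in the drafter's output head become a latency bottleneck that slows the entire pipeline. Existing methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's vocabulary, ranked in descending order of token frequency. This reduces draft-time compute but is brittle: (i) frequency lists are corpus-dependent and must be retuned to generalize; (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. The paper proposes DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks.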

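Background and Motivation

Core Problem

As large language models have evolved, vocabulary sizes have grown sharply: from 32k tokens in Llama-2 to 128k in Llama-3, 129k in DeepSeek-V3, 152k in Qwen-2.5, and as many as 262k in Gemma-3. In speculative decoding, the large target model can absorb the compute cost of the full vocabulary, but the O(|V|d) parameters of the small drafter's output layer become a severe latency bottleneck.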

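Limitations of Existing Methods

FR-Spec and VocabTrim: both use a fixed subset of high-frequency tokens, which causes the following problems:
- Frequency lists depend on a specific corpus and generalize poorly across benchmarks
- A static subset may suppress rare or domain-specific tokens, lowering the acceptance rate
Lack of context awareness: existing methods cannot adjust the candidate token set dynamically based on the current context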

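Motivation

Building on the coarse-to-fine routing idea from extreme classification, the paper proposes a context-aware dynamic vocabulary selection mechanism that improves drafting efficiency while keeping verification exact.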

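Key Contributions

DynaSpec framework: introduces lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the drafter operates only on the union of the selected clusters
Theoretical analysis: shows that a dynamic, context-conditioned support is strictly better than any static subset in terms of expected acceptance
Position-aware scheduling: proposes a position-aware cluster budget that allocates more clusters to early tokens and progressively fewer to later ones, balancing acceptance rate against latency
System optimization: a fused index+GEMM kernel and parallel execution reduce the matmul overhead of the dynamic head
Empirical validation: evaluated on 7 standard tasks, with consistent gains in mean accepted length over fixed-shortlist baselines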

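Method

Task Definition

Within the speculative decoding framework, given a target model T and a drafter D, the goals are to:

Reduce the drafter's per-token latency $T_D$
Maintain a high acceptance rate α
Keep verification exact (full vocabulary)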

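Model Architecture

1. Vocabulary Partitioning

Spherical k-means is applied to the column-normalized LM head weights:

$$\left\{ \frac{W_{\mathrm{LM}}[:, v]}{\lVert W_{\mathrm{LM}}[:, v] \rVert_2} \right\}_{v \in V} \;\to\; \{C_1, \dots, C_M\}$$

This partitions the vocabulary V into M coarse-grained token clusters.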
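
The partitioning step can be illustrated with a minimal NumPy sketch of spherical k-means. This is not the authors' implementation: the function name `spherical_kmeans`, the random initialization, the iteration count, and the handling of empty clusters are all assumptions made for illustration.

```python
import numpy as np

def spherical_kmeans(W, num_clusters, num_iters=20, seed=0):
    """Cluster the columns of W (shape d x |V|) by cosine similarity.

    Hypothetical helper: returns an array of length |V| that maps each
    token id to a cluster index in [0, num_clusters).
    """
    rng = np.random.default_rng(seed)
    # Column-normalize: each token's output-head vector gets unit L2 norm.
    X = (W / np.linalg.norm(W, axis=0, keepdims=True)).T  # (|V|, d)
    # Assumption: initialize centroids from randomly chosen tokens.
    centroids = X[rng.choice(X.shape[0], size=num_clusters, replace=False)].copy()
    for _ in range(num_iters):
        # Assign each token to the centroid with the highest cosine similarity.
        assign = np.argmax(X @ centroids.T, axis=1)
        for m in range(num_clusters):
            members = X[assign == m]
            if len(members) == 0:
                continue  # keep the old centroid for an empty cluster
            mean = members.sum(axis=0)
            norm = np.linalg.norm(mean)
            if norm > 0:
                centroids[m] = mean / norm  # project back onto the unit sphere
    # Final assignment against the last centroid update.
    return np.argmax(X @ centroids.T, axis=1)
```

Given the assignment vector, the clusters themselves can be recovered with `[np.flatnonzero(assign == m) for m in range(num_clusters)]`, which yields one array of token ids per cluster.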

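2. Lightweight Router

The meta-classifier $r_\theta : \mathbb{R}^{d_r} \to \mathbb{R}^{M}$ takes the token embedding and the previous step's hidden state as input:

$$s = r_\theta\big([E(x_t), \tilde{H}_{t-1}]\big)$$

It runs in parallel on a separate CUDA stream and computes a score for each cluster.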
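
To make the router's input and output shapes concrete, here is a small NumPy sketch. The summary does not specify the router's internal architecture, so the one-hidden-layer MLP, the ReLU activation, and every parameter name below are assumptions; the CUDA-stream parallelism is not modeled.

```python
import numpy as np

def router_scores(token_emb, prev_hidden, W1, b1, W2, b2):
    """Score M clusters from the concatenated router input.

    Assumed architecture: a single hidden layer with ReLU.
    token_emb   : (d_emb,)  embedding E(x_t) of the current token
    prev_hidden : (d_hid,)  hidden state from the previous step
    Returns     : (M,)      one score per token cluster
    """
    x = np.concatenate([token_emb, prev_hidden])  # d_r = d_emb + d_hid
    h = np.maximum(W1 @ x + b1, 0.0)              # hidden layer with ReLU
    return W2 @ h + b2                            # cluster scores s
```

The only property taken from the text is the mapping from a vector of size d_r to M scores; any lightweight model with that signature would fit the description.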

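3. Position-aware Cluster Selection

A position-aware budget $k_c(t)$ is used:

$$k_c(t) = \begin{cases}
  k_{\max}, & t \in \{0, 1\} \\
  \left\lfloor k_{\max} / \big((t+1) \cdot 2\big) \right\rfloor, & t \ge 2
\end{cases}$$

The top-k clusters, with $k = k_c(t)$, are selected to build the shortlist: $V_S(c,t) = \bigcup_{m \in K(c,t)} C_m$, where $K(c,t)$ is the set of selected cluster indices.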
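
The budget and the shortlist union above can be sketched in plain Python. The function names `cluster_budget` and `build_shortlist` are hypothetical, and the budget follows the formula exactly as written in this summary.

```python
def cluster_budget(t, k_max):
    """Position-aware budget k_c(t) as stated in the summary."""
    if t in (0, 1):
        return k_max
    return k_max // (2 * (t + 1))  # floor(k_max / ((t + 1) * 2))

def build_shortlist(scores, clusters, k):
    """Union of the token ids in the k highest-scoring clusters.

    scores   : list of M router scores
    clusters : list of M lists of token ids (a partition of the vocabulary)
    """
    top = sorted(range(len(scores)), key=lambda m: scores[m], reverse=True)[:k]
    shortlist = set()
    for m in top:
        shortlist.update(clusters[m])
    return sorted(shortlist)

# Example with k_max = 32: budgets for draft positions t = 0..4.
budgets = [cluster_budget(t, 32) for t in range(5)]  # [32, 32, 5, 4, 3]
```

Taken literally, the formula reaches zero once 2(t+1) exceeds k_max (for example t = 16 with k_max = 32). The summary does not say how that case is handled, so an implementation may want a lower bound of one cluster.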

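4. Dynamic Drafting

The drafting time decomposes as:

$$T_D(c,t) \approx T_{\text{embed}} + \max\{T_{\text{core}}, T_{\text{meta}}\} + T_{\text{index+gemm}}\big(B(c,t)\big)$$

where $B(c,t) \ll |V|$, which substantially reduces the vocabulary-dependent compute.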
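
The index+GEMM term can be pictured as gathering the shortlisted columns of the LM head and multiplying only those. The NumPy sketch below does the two steps separately, whereas the paper describes a fused kernel; the function names are hypothetical, and greedy selection is an assumption made to keep the example short.

```python
import numpy as np

def shortlist_logits(hidden, W_lm, shortlist_ids):
    """Logits restricted to the shortlist.

    hidden        : (d,)     drafter hidden state
    W_lm          : (d, |V|) LM head weights
    shortlist_ids : (B,)     token ids in the shortlist, with B << |V|
    """
    W_sub = W_lm[:, shortlist_ids]  # index: gather B columns
    return hidden @ W_sub           # GEMM over B columns instead of |V|

def draft_token(hidden, W_lm, shortlist_ids):
    """Greedy draft token, mapped back to a full-vocabulary id."""
    logits = shortlist_logits(hidden, W_lm, shortlist_ids)
    return int(shortlist_ids[np.argmax(logits)])
```

Because the drafted id is mapped back to the full vocabulary, the target model can verify it over all of V without any change on the verification side.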

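Technical Innovations

Context-aware dynamic selection: unlike static methods, it selects the token clusters most relevant to the current context
Coarse-to-fine routing: borrowing from extreme classification, it replaces the O(|V|d) cost with O((M + |V_S|)d)
Position-aware strategy: prioritizes early steps to balance acceptance rate against compute
Parallel execution: the router and the draft encoding run in parallel on different CUDA streams, reducing wall-clock overhead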
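
A rough count of multiply-accumulate operations shows what the change from O(|V|d) to O((M + |V_S|)d) means in practice. In the sketch below, the vocabulary size 128,256 and hidden size 4096 correspond to Llama-3-8B, while the cluster count of 1024 and the shortlist size of 16,000 are hypothetical values chosen only for illustration, not figures from the paper.

```python
def head_macs_full(vocab_size, d):
    """Multiply-accumulates for a full output head: |V| * d."""
    return vocab_size * d

def head_macs_routed(num_clusters, shortlist_size, d):
    """Multiply-accumulates for cluster scoring plus the shortlist: (M + |V_S|) * d."""
    return (num_clusters + shortlist_size) * d

full = head_macs_full(128_256, 4096)             # 525,336,576
routed = head_macs_routed(1024, 16_000, 4096)    # 69,730,304 (hypothetical M, |V_S|)
reduction = full / routed                        # roughly 7.5x fewer operations
```

This counts only the vocabulary-dependent work; actual latency also depends on memory bandwidth, on the indexing cost, and on how well the router overlaps with the draft encoding.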

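Experimental Setup

Datasets

Seven diverse tasks are used:

Spec-Bench: 6 tasks covering machine translation (WMT14 DE-EN), multi-turn conversation (MT-Bench), retrieval question answering (Natural Questions), mathematical reasoning (GSM8K), summarization (CNN/DailyMail), and RAG
Code generation: HumanEval (164 problems)
80 prompts per task, with generation capped at 1024 tokens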

Evaluation Metrics

Mean Acceptance Length: the average number of tokens committed per draft-verify cycle
Mean vocabulary size: the average size of the dynamic shortlist
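
Both metrics are simple averages over a decoding run. A minimal sketch, assuming per-cycle committed-token counts and per-step shortlist sizes have been logged (the function names and the logging format are hypothetical):

```python
def mean_acceptance_length(committed_per_cycle):
    """Average number of tokens committed per draft-verify cycle."""
    if not committed_per_cycle:
        raise ValueError("need at least one draft-verify cycle")
    return sum(committed_per_cycle) / len(committed_per_cycle)

def mean_vocab_size(shortlist_sizes):
    """Average size of the dynamic shortlist across draft steps."""
    if not shortlist_sizes:
        raise ValueError("need at least one draft step")
    return sum(shortlist_sizes) / len(shortlist_sizes)
```

For example, a run whose four cycles committed 4, 3, 5, and 4 tokens has a mean acceptance length of 4.0.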

Baselines

Full Vocab (EAGLE-2): baseline with the full 128k vocabulary
FR-Spec: a fixed 32k subset ranked by token frequency
DynaSpec variants: fixed top-k vs. position-aware top-k

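Implementation Details

Model: Llama-3-8B-Instruct (128k vocabulary)
Hardware: a single NVIDIA A6000 GPU
The number of clusters M is set, and the router is trained, using subsets of ShareGPT and UltraChat200K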

Experimental Results

Main Results

Method	MT	Conv.	RAG	Math	QA	Summ.	Code	Average
Full Vocab	3.66	4.11	4.03	4.31	3.45	3.68	4.77	4.00
FR-Spec	3.38	3.87	3.85	4.16	3.32	3.51	4.11	3.74
DynaSpec	3.51	4.05	3.91	4.21	3.40	3.51	4.71	3.90