DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
Zhang, Ullah, Schultheis et al.
Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter's hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.
Speculative decoding has become a standard method for accelerating large language model inference: a small draft model proposes several tokens, and a large target model verifies all of them in a single forward pass. LLM vocabularies have grown substantially in recent years. Verifying over the full vocabulary adds little cost for the target model, but the O(|V|d) parameters in the draft model's output head become a latency bottleneck that slows the entire pipeline. Existing methods (e.g., FR-Spec, VocabTrim) restrict the draft model's vocabulary to a fixed subset of the target model's vocabulary, selected by descending token frequency. This reduces draft-time computation but is brittle: (i) frequency lists are corpus-dependent and require retuning to generalize; (ii) static shortlists suppress rare or domain-specific tokens, reducing the expected number of tokens accepted per verification step. The paper proposes DynaSpec, a context-aware dynamic shortlisting mechanism that is robust, accelerates drafting, and generalizes across diverse tasks.
With the development of large language models, vocabulary sizes have grown dramatically: from Llama-2's 32k tokens to Llama-3's 128k, DeepSeek-V3's 129k, Qwen-2.5's 152k, and even Gemma-3's 262k tokens. In speculative decoding, while large target models can handle the computational burden of complete vocabularies, the O(|V|d) parameters in the small draft model's output layer become a severe latency bottleneck.
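To make the O(|V|d) scaling concrete, the following back-of-the-envelope sketch counts output-head weights for the vocabulary sizes quoted above. The hidden size d = 4096 and the 2-bytes-per-weight figure are illustrative assumptions, not values from the paper, and the vocabulary sizes are the rounded numbers from the text rather than exact tokenizer sizes.

```python
# Rough output-head size for the vocabulary sizes quoted above.
# ASSUMPTIONS (not from the paper): hidden size d = 4096, 2 bytes per weight.
HIDDEN_SIZE = 4096
BYTES_PER_WEIGHT = 2

# Rounded vocabulary sizes as quoted in the text.
VOCAB_SIZES = {
    "Llama-2": 32_000,
    "Llama-3": 128_000,
    "DeepSeek-V3": 129_000,
    "Qwen-2.5": 152_000,
    "Gemma-3": 262_000,
}


def head_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in a dense |V| x d output head (no bias)."""
    return vocab_size * hidden_size


def head_megabytes(vocab_size, hidden_size=HIDDEN_SIZE):
    """Size of the head weights in MB (10**6 bytes)."""
    return head_params(vocab_size, hidden_size) * BYTES_PER_WEIGHT / 1e6


if __name__ == "__main__":
    base = head_params(VOCAB_SIZES["Llama-2"])
    for name, v in VOCAB_SIZES.items():
        ratio = head_params(v) / base
        print(f"{name:12s} |V|={v:>7,d}  params={head_params(v):>13,d}  "
              f"~{head_megabytes(v):8.1f} MB  x{ratio:.2f} vs Llama-2")
```

Because the head grows linearly in |V|, the cost ratio between two vocabularies is independent of the assumed d; only the absolute numbers depend on it.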
Building on coarse-to-fine routing ideas from extreme classification, the paper proposes a context-aware dynamic vocabulary selection mechanism that speeds up drafting, while verification keeps the full vocabulary and remains exact.
DynaSpec Framework: Introduces a lightweight, coarse-grained meta-classifier that routes each context to a small number of token clusters; the draft model scores only the union of the selected clusters.
Theoretical Analysis: Proves that dynamic context-conditioned support strictly dominates any static subset in terms of expected acceptance rate.
Position-Aware Scheduling: Proposes a position-aware cluster budget that allocates more clusters to early draft positions and gradually fewer to later ones, balancing acceptance rate against latency.
System Optimization: Mitigates the overhead of the dynamic head matmul through fused index-and-GEMM kernels and by running the meta-classifier in parallel with draft encoding on a separate stream.
Experimental Validation: Evaluates the method on seven standard tasks, with consistent improvements in mean accepted length over fixed-shortlist baselines.
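The routing, shortlist union, and position-aware budget described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the linear router, the linearly decaying budget in `cluster_budget`, the modulo clustering in the demo, and every name and size below are hypothetical. The gather-then-multiply in `draft_logits` is written as plain Python and only stands in for the fused kernel; verification over the full vocabulary is outside the sketch.

```python
# Toy sketch of context-dependent shortlisting with a position-aware budget.
# ASSUMPTIONS (hypothetical, not from the paper): the linear router, the
# linearly decaying budget, and all names and sizes used here.
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cluster_budget(position, k_start=4, k_min=1):
    """More clusters for early draft positions, fewer for later ones."""
    return max(k_min, k_start - position)


def route(hidden, router_weights, k):
    """Score every cluster with a linear meta-classifier; return top-k ids."""
    scores = [dot(w, hidden) for w in router_weights]
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return ranked[:k]


def shortlist(cluster_ids, clusters):
    """Sorted union of the token ids in the selected clusters."""
    return sorted({tok for c in cluster_ids for tok in clusters[c]})


def draft_logits(hidden, head_weights, token_ids):
    """Gather only the shortlisted head rows, then take dot products."""
    return {tok: dot(head_weights[tok], hidden) for tok in token_ids}


def draft_step(hidden, position, router_weights, clusters, head_weights):
    """One greedy draft step restricted to the routed shortlist."""
    k = cluster_budget(position)
    ids = shortlist(route(hidden, router_weights, k), clusters)
    logits = draft_logits(hidden, head_weights, ids)
    best = max(logits, key=logits.get)
    return best, ids


if __name__ == "__main__":
    rng = random.Random(0)
    vocab, dim, n_clusters = 64, 8, 8
    # Toy partition: token t belongs to cluster t % n_clusters.
    clusters = [[t for t in range(vocab) if t % n_clusters == c]
                for c in range(n_clusters)]
    head = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
    router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_clusters)]
    h = [rng.gauss(0, 1) for _ in range(dim)]
    for pos in range(5):
        tok, ids = draft_step(h, pos, router, clusters, head)
        print(f"position={pos} shortlist_size={len(ids)} drafted_token={tok}")
```

In the demo the shortlist shrinks with the draft position, so the head multiplication touches fewer rows for later tokens. A token outside the routed clusters cannot be drafted, which is why the quality of the router matters for the accepted length.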
The paper draws on related work in speculative decoding, large-vocabulary LLMs, and extreme classification. Key references include the EAGLE series, FR-Spec, and the extreme-classification methods LightXML and CascadeXML.