Quantize-Sample-and-Verify: LLM Acceleration via Adaptive Edge-Cloud Speculative Decoding
Zhang, Cai, Yu et al.
In edge-cloud speculative decoding (SD), edge devices equipped with small language models (SLMs) generate draft tokens that are verified by large language models (LLMs) in the cloud. A key bottleneck in such systems is the limited communication bandwidth between edge and cloud, which necessitates quantization of the information transmitted about generated tokens. In this work, we introduce a novel quantize-sample (Q-S) strategy that provably preserves the output distribution of the cloud-based model, ensuring that the verified tokens match the distribution of those that would have been generated directly by the LLM. We develop a throughput model for edge-cloud SD that explicitly accounts for communication latency. Leveraging this model, we propose an adaptive mechanism that optimizes token throughput by dynamically adjusting the draft length and quantization precision in response to both semantic uncertainty and channel conditions. Simulations demonstrate that the proposed Q-S approach significantly improves decoding efficiency in realistic edge-cloud deployment scenarios.
In edge-cloud speculative decoding (SD) systems, edge devices equipped with small language models (SLMs) generate draft tokens, which are subsequently verified by large language models (LLMs) in the cloud. The critical bottleneck in such systems is the limited communication bandwidth between edge and cloud, necessitating quantization of transmitted token information. This work introduces a novel Quantize-Sample (Q-S) strategy that provably preserves the output distribution of the cloud model, ensuring that verified tokens match the distribution of tokens directly generated by the LLM. We develop an explicit throughput model for edge-cloud SD that accounts for communication latency. Based on this model, we propose an adaptive mechanism that dynamically adjusts draft length and quantization precision in response to semantic uncertainty and channel conditions, thereby optimizing token throughput. Simulation results demonstrate that the proposed Q-S method significantly improves decoding efficiency in realistic edge-cloud deployment scenarios.
The core problem addressed by this research is the communication bandwidth limitation in edge-cloud speculative decoding systems. In traditional speculative decoding, edge devices must transmit substantial probability distribution information to the cloud, which severely impacts system performance in bandwidth-constrained environments.
Practical Value: Edge-cloud collaborative inference is an important current trend in LLM deployment because it balances computational resources against response latency.
Technical Challenges: Existing methods distort the LLM's original output distribution when they quantize probability distributions, which degrades generation quality.
Economic Benefits: The approach reduces redundant API calls, improving energy efficiency and system scalability.
The motivation of this work is to design an edge-cloud speculative decoding scheme that reduces communication overhead while exactly preserving the LLM's output distribution.
Proposed Quantize-Sample (Q-S) Strategy: Provably preserves the output distribution of the cloud LLM, ensuring generation quality is not compromised
Established Throughput Model Accounting for Communication Latency: Explicitly models the impact of uplink and downlink transmission delays on system performance
Designed Adaptive Resource Allocation Mechanism: Dynamically adjusts draft length and quantization precision based on reinforcement learning
Provided Theoretical Guarantees: Proves the distributional equivalence of the Q-S method through Proposition 1
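To make the throughput and adaptation contributions more concrete, the following is a minimal sketch under stated assumptions, not the paper's model. It assumes each draft token is accepted independently with probability alpha (the standard speculative decoding analysis), that a round costs drafting time plus uplink transfer plus LLM verification plus a fixed downlink delay, and that a hypothetical table `alpha_by_bits` maps a quantization payload size to an acceptance rate measured offline. An exhaustive grid search stands in for the adaptive mechanism; all function and parameter names are illustrative.

```python
from itertools import product


def expected_tokens(alpha: float, draft_len: int) -> float:
    """Expected tokens emitted per round when each draft token is
    accepted independently with probability alpha (assumption)."""
    if alpha >= 1.0:
        return float(draft_len + 1)
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def round_latency(draft_len, payload_bits, t_slm, t_llm, uplink_bps, t_downlink):
    """Seconds per round: drafting + uplink transfer + verification + downlink."""
    uplink = draft_len * payload_bits / uplink_bps
    return draft_len * t_slm + uplink + t_llm + t_downlink


def throughput(alpha, draft_len, payload_bits, t_slm, t_llm, uplink_bps, t_downlink):
    """Expected tokens per second for one (draft length, payload) choice."""
    latency = round_latency(draft_len, payload_bits, t_slm, t_llm,
                            uplink_bps, t_downlink)
    return expected_tokens(alpha, draft_len) / latency


def best_config(alpha_by_bits, draft_lens, **link):
    """Grid search over (draft_len, payload_bits); a stand-in for the
    adaptive policy, which this sketch does not reproduce."""
    return max(
        product(draft_lens, alpha_by_bits),
        key=lambda c: throughput(alpha_by_bits[c[1]], c[0], c[1], **link),
    )


if __name__ == "__main__":
    # Hypothetical numbers: coarser quantization sends fewer bits per draft
    # token but is assumed to lower the acceptance rate.
    alpha_by_bits = {1000: 0.6, 4000: 0.8}
    link = dict(t_slm=0.01, t_llm=0.05, uplink_bps=1e5, t_downlink=0.005)
    print(best_config(alpha_by_bits, range(1, 9), **link))
```

Re-running `best_config` whenever the measured uplink rate or acceptance rate changes gives a crude picture of why draft length and quantization precision should be tuned jointly.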
The edge-cloud speculative decoding task is defined as follows: given an input prefix s¹, the edge SLM generates draft tokens, the cloud LLM verifies them, and the system ultimately produces a token sequence with the same distribution as direct LLM generation.
Proposition 1: Q-S edge-cloud SD guarantees that the probability P(X = xᵗₗ) of generating token xᵗₗ equals the probability pᵗₗ,xᵗₗ that the LLM assigns to that token.
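The key to this property is that Q-S samples the draft token from the same quantized distribution that the cloud later uses for verification, whereas the S-Q (sample-then-quantize) method draws the draft token from one distribution and verifies it against another, which causes distribution shift.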
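The distinction can be checked numerically with a small sketch. It is an illustration under assumptions rather than the paper's algorithm: the quantizer is a simple largest-remainder rounding onto multiples of 1/levels (the paper's quantizer may differ), verification is the standard speculative-sampling accept/resample rule, and S-Q is read here as "sample from the unquantized SLM distribution, then verify against the quantized one". The toy distributions and all function names are hypothetical.

```python
import random


def quantize(probs, levels):
    """Round probs onto multiples of 1/levels (largest-remainder rounding)."""
    scaled = [p * levels for p in probs]
    counts = [int(s) for s in scaled]              # floor of each scaled mass
    leftover = max(levels - sum(counts), 0)        # grid units still unassigned
    by_fraction = sorted(range(len(probs)),
                         key=lambda i: scaled[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return [c / levels for c in counts]


def verify(x, q_hat, llm_probs, rng):
    """Accept draft x with prob min(1, p/q_hat); else resample the residual."""
    # A token with zero quantized mass can only occur under S-Q; accept it.
    ratio = llm_probs[x] / q_hat[x] if q_hat[x] > 0 else float("inf")
    if rng.random() < min(1.0, ratio):
        return x
    residual = [max(p - q, 0.0) for p, q in zip(llm_probs, q_hat)]
    return rng.choices(range(len(residual)), weights=residual)[0]


def qs_token(slm_probs, llm_probs, levels, rng):
    """Quantize first, then sample the draft from the quantized distribution."""
    q_hat = quantize(slm_probs, levels)
    x = rng.choices(range(len(q_hat)), weights=q_hat)[0]
    return verify(x, q_hat, llm_probs, rng)


def sq_token(slm_probs, llm_probs, levels, rng):
    """Sample from the unquantized distribution, verify against the quantized one."""
    x = rng.choices(range(len(slm_probs)), weights=slm_probs)[0]
    return verify(x, quantize(slm_probs, levels), llm_probs, rng)


def empirical(step, slm_probs, llm_probs, levels, n, seed=0):
    """Empirical output distribution of `step` over n independent tokens."""
    rng = random.Random(seed)
    hits = [0] * len(llm_probs)
    for _ in range(n):
        hits[step(slm_probs, llm_probs, levels, rng)] += 1
    return [h / n for h in hits]


if __name__ == "__main__":
    slm = [0.55, 0.25, 0.15, 0.05]   # toy SLM next-token distribution
    llm = [0.40, 0.30, 0.20, 0.10]   # toy LLM next-token distribution
    print("Q-S:", empirical(qs_token, slm, llm, levels=4, n=50_000))
    print("S-Q:", empirical(sq_token, slm, llm, levels=4, n=50_000))
```

In this toy setting the Q-S frequencies should track the LLM distribution up to sampling noise, while the S-Q frequencies drift away from it, because the acceptance ratio is computed with a distribution the draft token was not drawn from.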
The Shannon entropy of the generated tokens increases with sampling temperature for all methods, confirming that the temperature parameter has the expected effect on output diversity.
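The relationship between temperature and entropy can be illustrated with a short sketch. It assumes the usual temperature-scaled softmax over logits and uses made-up logit values; it does not reproduce the paper's measurements.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max logit for stability."""
    top = max(logits)
    exps = [math.exp((z - top) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
    for temperature in (0.5, 1.0, 2.0):
        entropy = shannon_entropy(softmax(logits, temperature))
        print(f"T={temperature}: H={entropy:.3f} bits")
```

Higher temperatures flatten the distribution toward uniform, so the entropy rises toward its maximum of log2 of the vocabulary size.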
The paper provides detailed experimental setup and open-source code links, demonstrating good reproducibility. However, deployment verification on real edge devices requires further work.
Overall Assessment: This is a high-quality paper with important contributions to the field of edge-cloud speculative decoding. It features rigorous theoretical analysis, comprehensive experimental validation, and addresses critical problems in practical applications. Despite some limitations, its innovation and practical value make it an important work in this field.