Quantize-Sample-and-Verify: LLM Acceleration via Adaptive Edge-Cloud Speculative Decoding

Zhang, Cai, Yu et al.

In edge-cloud speculative decoding (SD), edge devices equipped with small language models (SLMs) generate draft tokens that are verified by large language models (LLMs) in the cloud. A key bottleneck in such systems is the limited communication bandwidth between edge and cloud, which necessitates quantization of the information transmitted about generated tokens. In this work, we introduce a novel quantize-sample (Q-S) strategy that provably preserves the output distribution of the cloud-based model, ensuring that the verified tokens match the distribution of those that would have been generated directly by the LLM. We develop a throughput model for edge-cloud SD that explicitly accounts for communication latency. Leveraging this model, we propose an adaptive mechanism that optimizes token throughput by dynamically adjusting the draft length and quantization precision in response to both semantic uncertainty and channel conditions. Simulations demonstrate that the proposed Q-S approach significantly improves decoding efficiency in realistic edge-cloud deployment scenarios.

Basic Information

Paper ID: 2507.00605
Title: Quantize-Sample-and-Verify: LLM Acceleration via Adaptive Edge-Cloud Speculative Decoding
Authors: Guangyi Zhang, Yunlong Cai, Guanding Yu, Petar Popovski, Osvaldo Simeone
Classification: eess.SP (Electrical Engineering and Systems Science - Signal Processing)
Publication Date: July 1, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2507.00605

Abstract

In edge-cloud speculative decoding (SD) systems, edge devices equipped with small language models (SLMs) generate draft tokens, which are subsequently verified by large language models (LLMs) in the cloud. The critical bottleneck in such systems is the limited communication bandwidth between edge and cloud, necessitating quantization of transmitted token information. This work introduces a novel Quantize-Sample (Q-S) strategy that provably preserves the output distribution of the cloud model, ensuring that verified tokens match the distribution of tokens directly generated by the LLM. We develop an explicit throughput model for edge-cloud SD that accounts for communication latency. Based on this model, we propose an adaptive mechanism that dynamically adjusts draft length and quantization precision in response to semantic uncertainty and channel conditions, thereby optimizing token throughput. Simulation results demonstrate that the proposed Q-S method significantly improves decoding efficiency in realistic edge-cloud deployment scenarios.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is the communication bandwidth limitation in edge-cloud speculative decoding systems. In conventional speculative decoding, the edge device must send the cloud a vocabulary-sized probability distribution for every draft token, which severely limits system performance in bandwidth-constrained environments.

Significance

Practical Value: Edge-cloud collaborative inference is an important current trend in LLM deployment, balancing computational resources and response latency
Technical Challenges: Existing methods distort the original output distribution of the LLM when quantizing probability distributions, which affects generation quality
Economic Benefits: Reduces redundant API calls, improving energy efficiency and system scalability

Limitations of Existing Methods

Existing Sample-Quantize (S-Q) methods have critical flaws:

Sampling first and then quantizing makes the distribution used for sampling at the edge differ from the one used for verification in the cloud
This violates the core property of speculative decoding, namely that the LLM token distribution is preserved
Performance degrades significantly at high sampling temperatures

Research Motivation

The motivation of this work is to design an edge-cloud speculative decoding scheme that both reduces communication overhead and strictly maintains consistency of LLM output distribution.

Core Contributions

Proposed Quantize-Sample (Q-S) Strategy: Provably preserves the output distribution of the cloud LLM, ensuring generation quality is not compromised
Established Throughput Model Accounting for Communication Latency: Explicitly models the impact of uplink and downlink transmission delays on system performance
Designed Adaptive Resource Allocation Mechanism: Dynamically adjusts draft length and quantization precision based on reinforcement learning
Provided Theoretical Guarantees: Proves the distributional equivalence of the Q-S method through Proposition 1

Method Details

Task Definition

The edge-cloud speculative decoding task is defined as: given an input prefix s¹, the system generates draft tokens through edge SLM, verifies them with cloud LLM, and ultimately produces a token sequence with the same distribution as direct LLM generation.

Model Architecture

System Architecture

The system comprises four key stages:

Token Generation: Edge SLM autoregressively generates L^t draft tokens
Uplink Transmission: Transmits quantized probability distributions and tokens to the cloud
Token Verification: Cloud LLM verifies draft tokens in parallel
Downlink Transmission: Returns verification results and newly generated tokens
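
The four stages above can be sketched as two functions that exchange an uplink and a downlink message. This is a minimal illustration, not the authors' implementation: the names `UplinkMessage`, `DownlinkMessage`, `edge_draft`, and `cloud_verify` are hypothetical, `slm` and `llm` are assumed to be callables that map a token prefix to a next-token distribution, and the accept/resample rule is the standard speculative sampling rule applied to the quantized distribution. A real cloud would score all drafts in one parallel forward pass; the loop here is sequential only for readability.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UplinkMessage:       # edge -> cloud (hypothetical container)
    tokens: list           # L draft token indices
    q_hats: list           # L quantized SLM distributions

@dataclass
class DownlinkMessage:     # cloud -> edge (hypothetical container)
    n_accepted: int        # number of accepted draft tokens
    new_token: int         # one extra token produced by the cloud

def edge_draft(prefix, slm, quantize, L, rng):
    """Stage 1: draft L tokens, each sampled from the quantized SLM output."""
    ctx, tokens, q_hats = list(prefix), [], []
    for _ in range(L):
        q_hat = quantize(slm(ctx))
        x = int(rng.choice(len(q_hat), p=q_hat))
        tokens.append(x)
        q_hats.append(q_hat)
        ctx.append(x)
    return UplinkMessage(tokens, q_hats)   # Stage 2: payload for the uplink

def cloud_verify(prefix, msg, llm, rng):
    """Stage 3: accept or reject drafts. Stage 4: build the downlink reply."""
    ctx = list(prefix)
    for n, (x, q_hat) in enumerate(zip(msg.tokens, msg.q_hats)):
        p = llm(ctx)
        if rng.random() < min(1.0, p[x] / q_hat[x]):
            ctx.append(x)
            continue
        # First rejection: resample from the normalized residual and stop.
        residual = np.maximum(p - q_hat, 0.0)
        new = int(rng.choice(len(p), p=residual / residual.sum()))
        return DownlinkMessage(n, new)
    # Every draft accepted: the cloud appends one token of its own.
    p = llm(ctx)
    return DownlinkMessage(len(msg.tokens), int(rng.choice(len(p), p=p)))
```

The downlink reply carries only an acceptance count and one token index, which is why the downlink payload is small compared with the uplink payload.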

Q-S Strategy Core Mechanism

Key Innovation: Quantize probability distributions first, then sample from the quantized distribution

Mathematical Formulation:

Quantized probability vector: $\hat{q}^t_l = \mathrm{Quantize}(q^t_l)$
Sampling from quantized distribution: $x^t_l \sim \hat{q}^t_l$
Verification probability: $\alpha^t_l = \min\left(1,\; p^t_{l,x^t_l} / \hat{q}^t_{l,x^t_l}\right)$
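
The three formulas above can be checked empirically for a single token position. The sketch below is illustrative only: `qs_step` is a hypothetical name, the distributions are toy values, and the resampling step on rejection (drawing from the normalized residual max(0, p - q̂)) is the standard speculative sampling rule, assumed here because the summary does not spell it out.

```python
import numpy as np

def qs_step(p, q_hat, rng):
    """One Q-S draft/verify step for a single token position.

    p     : LLM next-token distribution (cloud side)
    q_hat : quantized SLM distribution, known to both edge and cloud
    Returns (token, accepted_flag).
    """
    # Edge: sample the draft token from the quantized distribution.
    x = rng.choice(len(q_hat), p=q_hat)
    # Cloud: accept with probability min(1, p[x] / q_hat[x]).
    if rng.random() < min(1.0, p[x] / q_hat[x]):
        return int(x), True
    # Rejection implies p != q_hat, so the residual has positive mass.
    residual = np.maximum(p - q_hat, 0.0)
    return int(rng.choice(len(p), p=residual / residual.sum())), False

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])         # toy LLM distribution
q_hat = np.array([0.25, 0.25, 0.5])   # toy quantized SLM distribution
samples = [qs_step(p, q_hat, rng)[0] for _ in range(20000)]
freq = np.bincount(samples, minlength=3) / len(samples)
```

With enough samples, `freq` should be close to `p` rather than to `q_hat`, because the same `q_hat` is used both to draw the draft and to compute the acceptance ratio.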

Lattice Quantization Algorithm

Employs lattice-based probability vector quantization:

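Quantization set: $\mathcal{Q}_\ell = \{[q_1, \ldots, q_V] \in \mathbb{Q}^V \mid q_i = o_i/\ell,\ o_i \in \mathbb{Z}_{\ge 0},\ \sum_i o_i = \ell\}$
Encoded bits: $b = \left\lceil \log_2 \binom{\ell+V-1}{V-1} \right\rceil$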
Complexity: O(V log(V))
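
A minimal sketch of lattice quantization in the style of Reznik's algorithm follows: round each scaled probability to the nearest integer, then repair the total by adjusting the entries with the largest rounding error. The function names `lattice_quantize` and `lattice_bits` are hypothetical, and the bit count assumes that every lattice point is indexed with a fixed-length code, so the number of points is the binomial coefficient C(ℓ+V-1, V-1). The sort is what gives the O(V log V) cost.

```python
import math
import numpy as np

def lattice_quantize(q, ell):
    """Map q to the nearest vector o/ell with non-negative integers o summing to ell."""
    q = np.asarray(q, dtype=float)
    o = np.floor(ell * q + 0.5).astype(int)   # round to nearest integer
    delta = int(o.sum()) - ell                # surplus (>0) or deficit (<0) of units
    if delta != 0:
        err = o - ell * q                     # positive means "rounded up"
        order = np.argsort(err)               # ascending rounding error
        if delta > 0:
            o[order[-delta:]] -= 1            # undo the largest round-ups
        else:
            o[order[:-delta]] += 1            # bump the largest round-downs
    return o / ell, o

def lattice_bits(ell, V):
    """Bits for a fixed-length index over all C(ell+V-1, V-1) lattice points."""
    n_points = math.comb(ell + V - 1, V - 1)
    return (n_points - 1).bit_length()        # equals ceil(log2(n_points))

q_hat, o = lattice_quantize([0.62, 0.27, 0.11], ell=8)
```

A larger `ell` gives a finer lattice and a quantized vector closer to the original, at the cost of more bits per transmitted distribution.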

Technical Innovations

1. Distribution Preservation Proof

Proposition 1: Q-S edge-cloud SD guarantees that the probability $P(X = x^t_l)$ of generated token $x^t_l$ equals the corresponding LLM probability $p^t_{l,x^t_l}$.

The key to this property is that sampling and verification use the same quantized distribution, whereas the S-Q method uses different distributions causing distribution shift.
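
The argument behind Proposition 1 can be sketched with the standard speculative sampling calculation, with the quantized distribution in place of the raw SLM distribution. The derivation below is a sketch rather than the paper's proof: it drops the indices t and l for readability and assumes that a rejected draft is replaced by a sample from the normalized residual max(0, p - q̂). It uses the identity Σ max(0, p - q̂) = 1 - Σ min(p, q̂), so the rejection probability cancels the residual normalizer.

```latex
\begin{align*}
P(X = x)
  &= \hat{q}_x \min\!\left(1, \frac{p_x}{\hat{q}_x}\right)
     + \Bigl(1 - \sum_{x'} \min(p_{x'}, \hat{q}_{x'})\Bigr)
       \frac{\max(0,\, p_x - \hat{q}_x)}
            {\sum_{x'} \max(0,\, p_{x'} - \hat{q}_{x'})} \\
  &= \min(p_x, \hat{q}_x) + \max(0,\, p_x - \hat{q}_x) \\
  &= p_x .
\end{align*}
```

The first term is the probability of drafting and accepting x, and the second is the probability of rejecting some draft and then resampling x. If the edge sampled from q while the cloud computed the ratio with q̂, the first term would no longer reduce to min(p_x, q̂_x), and the cancellation would fail.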

2. Adaptive Optimization Mechanism

A reinforcement learning-based dynamic policy π with state space including:

Semantic information: prefix confidence vector fᵗ and average confidence f̄ᵗ
Connection information: current uplink channel rate Cᵗᵤ

Action space: aᵗ = (Lᵗ, bᵗ), representing draft length and quantization bits

3. Latency Modeling

Total latency model:

$T^t(L^t, b^t; C^t_u, C^t_d) = L^t T_{\mathrm{SLM}} + T^t_u + T_{\mathrm{LLM}} + T^t_d$

Where:

Uplink latency: $T^t_u = (L^t \lceil \log_2 V \rceil + b^t) / C^t_u$
Downlink latency: $T^t_d = (\lceil \log_2 L^t \rceil + \lceil \log_2 V \rceil) / C^t_d$
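
The latency model can be turned into a small throughput calculator. The sketch below transcribes the formulas exactly as stated above; everything else is an assumption for illustration: the function names, the timing and rate values in the usage example, and the expected-tokens formula (1 - α^(L+1)) / (1 - α), which is the standard speculative decoding result for an i.i.d. per-token acceptance rate α. The exhaustive search over L stands in for the learned policy and is not the paper's method.

```python
import math

def latency(L, b, C_u, C_d, V, T_slm, T_llm):
    """Per-round latency in seconds, following the formulas as stated above.

    L: draft length, b: quantization bits, C_u / C_d: uplink / downlink
    rate in bits per second, V: vocabulary size.
    """
    token_bits = math.ceil(math.log2(V))
    T_u = (L * token_bits + b) / C_u
    T_d = (math.ceil(math.log2(L)) + token_bits) / C_d
    return L * T_slm + T_u + T_llm + T_d

def expected_tokens(L, alpha):
    """Expected tokens per round for an assumed i.i.d. acceptance rate alpha < 1."""
    return (1 - alpha ** (L + 1)) / (1 - alpha)

def best_draft_length(alpha, b, C_u, C_d, V, T_slm, T_llm, L_max=16):
    """Exhaustive search for the L in 1..L_max with the highest tokens per second."""
    def throughput(L):
        return expected_tokens(L, alpha) / latency(L, b, C_u, C_d, V, T_slm, T_llm)
    return max(range(1, L_max + 1), key=throughput)

# Hypothetical operating point: 350 kbps uplink, 4 Mbps downlink.
cfg = dict(b=1000, C_u=350e3, C_d=4e6, V=50_000, T_slm=0.005, T_llm=0.05)
L_uncertain = best_draft_length(0.3, **cfg)   # drafts are often rejected
L_confident = best_draft_length(0.9, **cfg)   # drafts are usually accepted
```

Under these assumed numbers, a higher acceptance rate favors longer drafts, which is the intuition behind adapting the draft length to semantic uncertainty.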

Experimental Setup

Dataset

Dataset: CNN/DailyMail
Task: Abstractive text summarization
Evaluation Metrics: ROUGE-2 score, token throughput, Shannon entropy
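
For readers unfamiliar with the quality metric, ROUGE-2 measures bigram overlap between a generated summary and a reference. The sketch below is a simplified F1 variant under stated assumptions: whitespace tokenization, lowercasing, and no stemming. It is not the evaluation script used in the paper, and the function names are hypothetical.

```python
from collections import Counter

def bigrams(text):
    """Multiset of adjacent word pairs after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())   # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge2_f1("the cat sat on the mat", "the cat lay on the mat")
```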

Model Configuration

Cloud LLM: OPT-13B (13 billion parameters)
Edge SLM: OPT-125M (125 million parameters)
Hardware: NVIDIA A100 40GB GPU
Batch Size: 1 (consistent with common practice in the literature)

Channel Model

Employs a two-state Markov model to simulate time-varying uplink channels:

Low-speed state: Average 350 kbps (similar to NB-IoT)
High-speed state: Average 4 Mbps
State transition probabilities: p_low→high and p_high→low
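
A two-state Markov channel of this kind can be simulated in a few lines. The sketch below makes simplifying assumptions: each state has a fixed rate equal to its stated average (the actual rate presumably fluctuates around that average), the chain starts in the low-speed state, and the transition probabilities in the usage example are hypothetical because the summary does not give their values.

```python
import numpy as np

def simulate_channel(n_rounds, p_low_to_high, p_high_to_low,
                     rate_low=350e3, rate_high=4e6, seed=0):
    """Return the uplink rate (bits per second) seen in each decoding round."""
    rng = np.random.default_rng(seed)
    rates = np.empty(n_rounds)
    state = 0                      # 0 = low-speed, 1 = high-speed
    for t in range(n_rounds):
        rates[t] = rate_high if state else rate_low
        flip = p_high_to_low if state else p_low_to_high
        if rng.random() < flip:    # switch state with the given probability
            state = 1 - state
    return rates

rates = simulate_channel(10_000, p_low_to_high=0.1, p_high_to_low=0.3)
fraction_high = float((rates == 4e6).mean())
```

In the long run the chain spends a fraction p_low→high / (p_low→high + p_high→low) of the rounds in the high-speed state, so `fraction_high` should be near 0.25 for the values used here.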

Comparison Methods

LLM: Direct use of cloud LLM
SLM: Edge SLM only
S-Q: Sample-Quantize speculative decoding
Q-S (Static): Static Quantize-Sample method
Q-S (Heuristic): Heuristic adaptive Q-S
Q-S (Dynamic): Reinforcement learning-based dynamic Q-S

Experimental Results

Main Results

1. Generation Quality Preservation

ROUGE-2 Score Comparison:

Q-S methods (static and dynamic) match the ROUGE-2 scores of the LLM across all sampling temperatures
S-Q method significantly deviates from LLM performance at high temperatures
Validates the theoretical guarantee of Proposition 1

2. Throughput Improvement

Low-speed Network Environment (350 kbps):

Q-S (Dynamic) achieves approximately 40-50% token throughput improvement compared to LLM
Approximately 15-20% improvement over static Q-S
Approximately 8-12% improvement over heuristic method

High-speed Network Environment (4 Mbps):

Communication is no longer the primary bottleneck, but dynamic method still achieves 5-10% improvement
Demonstrates robustness of adaptive strategy

3. Entropy Analysis

Shannon entropy of tokens increases with sampling temperature for all methods, confirming the correct influence of temperature parameter on output diversity.
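
The relationship between temperature and entropy can be illustrated on a single next-token distribution. This is a sketch with toy logits and hypothetical function names. It shows the per-step effect only, whereas the paper reports entropy measured on generated tokens.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                   # shift to avoid overflow in exp
    e = np.exp(z)
    return e / e.sum()

def shannon_entropy(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

logits = [2.0, 1.0, 0.0, -1.0]     # toy values
entropies = [shannon_entropy(softmax_with_temperature(logits, T))
             for T in (0.5, 1.0, 2.0)]
```

Raising the temperature flattens the distribution, so the values in `entropies` increase and approach log2 of the vocabulary size as the distribution approaches uniform.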

Ablation Studies

By comparing static, heuristic, and dynamic Q-S variants, the following are validated:

Effectiveness of Quantization Strategy: Advantages of Q-S over S-Q
Value of Adaptive Mechanism: Improvements of dynamic adjustment over fixed parameters
Necessity of Reinforcement Learning: Improvements over simple heuristic rules

Key Findings

Distribution Consistency is Critical: Maintaining consistency between sampling and verification distributions is key to preserving generation quality
Communication Latency Significantly Affects Performance: In low-bandwidth environments, communication overhead becomes the primary bottleneck
Adaptive Strategy is Highly Effective: Dynamic parameter adjustment effectively adapts to different semantic and network conditions

Related Work

Speculative Decoding Research

Foundational Speculative Decoding: Original speculative sampling method proposed by Chen et al. [1]
Edge-Cloud Collaboration: First exploration of edge-cloud collaborative SD by Hao et al. [4]
Uncertainty-based Skipping: Token skipping strategy based on uncertainty proposed by Oh et al. [5]

Quantization Techniques

Probability Vector Quantization: Lattice quantization algorithm by Reznik [10]
Prompt Quantization: Prompt-level quantization by Jiao et al. [11] and Hao et al. [12]
KV Cache Quantization: Key-value cache quantization by He et al. [13]

Advantages of This Work Relative to Prior Work

Theoretical Guarantees: First to provide rigorous proof of distribution preservation
System Modeling: Complete system model explicitly accounting for communication latency
Adaptive Optimization: Dynamic parameter adjustment based on reinforcement learning

Conclusions and Discussion

Main Conclusions

Q-S Strategy Outperforms S-Q: Achieves significant throughput improvement while preserving generation quality
Adaptive Mechanism is Effective: Dynamic adjustment of draft length and quantization precision adapts to different conditions
Theory and Practice Align: Theoretical analysis and experimental results mutually validate each other

Limitations

Model Assumptions: Assumes downlink transmission has no latency; actual scenarios may be more complex
Quantization Method: Only considers lattice quantization; effectiveness of other quantization methods unknown
Task Limitations: Validated only on text summarization task; generalization capability remains to be verified
Hardware Dependency: Experiments based on high-performance GPU; performance on real edge devices may differ

Future Directions

Extension to Other Tasks: Dialogue generation, code generation, and other application scenarios
More Complex Network Models: Considering packet loss, jitter, and other real-world network issues
Multimodal Extension: Image-text, speech-text, and other multimodal scenarios
Hardware Optimization: Optimization strategies for specific edge hardware

In-Depth Evaluation

Strengths

Solid Theoretical Contribution: Proposition 1 provides rigorous mathematical guarantees, filling theoretical gaps in existing methods
Clear Problem Definition: Accurately identifies fundamental flaws in S-Q method and proposes targeted solutions
Comprehensive System Modeling: Fully considers computational and communication latencies, establishing complete performance model
Reasonable Experimental Design: Multi-faceted validation of method effectiveness, including quality, throughput, and robustness
High Practical Value: Addresses practical problems in edge-cloud deployment with important application prospects

Weaknesses

Limited Experimental Scope: Validation on single task and dataset only; insufficient evidence of generalization
Simple Baseline Methods: Compared heuristic methods are relatively simple; lacks stronger baselines
Hardware Simulation: Simulates edge device performance through scaling factors, potentially differing from actual situations
Simplified Network Model: Two-state Markov model is overly simplified; actual networks are more complex
Insufficient Computational Overhead Analysis: Limited analysis of computational costs for quantization and reinforcement learning

Impact

Academic Value: Provides theoretical foundation and practical methods for edge-cloud speculative decoding
Industrial Application: Offers direct guidance for edge AI deployment
Research Inspiration: Provides new insights for related fields (federated learning, distributed inference, etc.)
Standardization Potential: May influence standard-setting for edge-cloud collaboration

Applicable Scenarios

Bandwidth-Constrained Environments: Satellite communications, remote area networks, etc.
Latency-Sensitive Applications: Real-time dialogue systems, edge AI services
Resource-Constrained Devices: Mobile devices, IoT devices, etc.
Hybrid Cloud Architecture: Enterprise applications requiring edge-cloud collaboration

Reproducibility

The paper provides detailed experimental setup and open-source code links, demonstrating good reproducibility. However, deployment verification on real edge devices requires further work.

References

Chen, C., et al. "Accelerating large language model decoding with speculative sampling." arXiv:2302.01318, 2023.
Hao, Z., et al. "Hybrid SLM and LLM for edge-cloud collaborative inference." Proc. Worksh. Edge Mobil. Found. Models, 2024.
Leviathan, Y., et al. "Fast inference from transformers via speculative decoding." Proc. Int. Conf. Mach. Learn. (ICML), 2023.
Reznik, Y. A. "An algorithm for quantization of discrete probability distributions." Data Compress. Conf. (DCC), 2011.

Overall Assessment: This is a high-quality paper with important contributions to the field of edge-cloud speculative decoding. It features rigorous theoretical analysis, comprehensive experimental validation, and addresses critical problems in practical applications. Despite some limitations, its innovation and practical value make it an important work in this field.