Remote inference allows lightweight devices to leverage powerful cloud models. However, communication network latency makes predictions stale and unsuitable for real-time tasks. To address this, we introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, allowing the local device to produce low-latency outputs in real time. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without incurring additional delay, it improves accuracy by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, for a round-trip delay of 100 ms. The advantage grows under longer delays and higher-motion scenes, as delay-mitigated split inference sustains accuracy more effectively, providing clear advantages for real-time tasks that must remain aligned with the current world state.
Dedelayed: Deleting remote inference delay via on-device correction
Remote inference enables lightweight devices to leverage powerful cloud-based models. However, communication network latency renders predictions stale, making them unsuitable for real-time tasks. To address this challenge, the paper introduces Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, enabling local devices to produce low-latency outputs in real time. The method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without introducing additional latency, at a round-trip delay of 100 ms it achieves a 6.4 mIoU improvement over fully local inference and a 9.8 mIoU improvement over remote inference.
The core problem this research addresses is how to overcome network latency in remote inference while maintaining prediction accuracy in real-time video processing applications.
Real-time Application Requirements: Autonomous driving, robotic control, and wearable devices are extremely latency-sensitive, and stale predictions in these settings can have catastrophic consequences
Resource Constraints: Mobile devices are limited by power consumption and computational capacity, making it infeasible to run complex deep learning models
Cloud Advantages: Cloud GPUs possess powerful computational capabilities for processing high-resolution video and complex models
The approach is inspired by the human visual system: the optic nerve can transmit only a small fraction of the information received by the retina, so early processing primarily performs compression, while metabolically intensive processing takes place in deeper layers of the visual cortex. Machines equipped with digital video sensors face comparable constraints.
Proposed Dedelayed Framework: A latency-aware distributed inference framework that mitigates network latency effects by fusing local real-time information with remote delayed features
Latency Quantification Analysis: Provides quantitative measurements of latency's impact on dense visual prediction accuracy
Practical System Validation: Validates system effectiveness on video segmentation tasks in urban driving scenarios, surpassing both local-only and remote-only inference
Simple and Effective Fusion Strategy: Employs additive feature fusion that is easy to deploy and extensible to other real-time methods
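To make the additive fusion idea concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes the remote features are first mapped to the local feature width by a linear projection; the names project, fuse_additive, and the weight matrix W are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def project(remote_feat: Vector, weights: List[Vector]) -> Vector:
    """Linearly map remote features to the local feature width (hypothetical step)."""
    return [sum(w * x for w, x in zip(row, remote_feat)) for row in weights]


def fuse_additive(local_feat: Vector, remote_feat: Vector,
                  weights: List[Vector]) -> Vector:
    """Add the projected remote features to the local features element by element."""
    projected = project(remote_feat, weights)
    if len(projected) != len(local_feat):
        raise ValueError("projected remote features must match the local feature width")
    return [l + p for l, p in zip(local_feat, projected)]


# Toy values: a 2-wide local feature and a 3-wide (delayed) remote feature.
local = [0.5, -1.0]
remote = [1.0, 2.0, 3.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
fused = fuse_additive(local, remote, W)
```

Because the fusion is a plain sum, the local model's own feature path is left unchanged, which is one plausible reason such a strategy is easy to attach to other real-time models.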
Given a fresh input frame x_t at time t, the final prediction ŷ_t is computed through a lightweight local model f_light, which processes x_t and fuses temporally delayed features z_{t-τ} from a heavyweight remote model f_heavy.
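The description above can be written compactly as follows. This is a hedged formalization rather than the paper's own notation: it assumes the remote model sees all frames up to time t − τ (written x_{≤ t−τ}) and is conditioned on the delay τ, in line with the latency embedding mentioned in this summary.

```latex
\begin{aligned}
z_{t-\tau} &= f_{\text{heavy}}\left(x_{\le t-\tau},\ \tau\right), \\
\hat{y}_t  &= f_{\text{light}}\left(x_t,\ z_{t-\tau}\right).
\end{aligned}
```

Under this reading, only f_light sits on the critical path at time t, so the output latency is set by the local model alone.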
Latency Embedding Mechanism: Similar to positional embeddings in text or visual transformers, allowing remote model behavior to adapt to channel variations
Temporal Predictive Training: Simulates a delay of D frames during supervised training, so that the remote model learns to predict D frames into the future
Mixed-Resolution Inference: The local model operates at low resolution, while the remote model processes multiple frames at high resolution
Performance Guarantee: The fused system is intended to perform no worse than either model used on its own
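The delay simulation used in temporal predictive training can be sketched as a data-pairing step. The code below is an illustration under stated assumptions, not the authors' training pipeline: the 30 fps frame rate, the context window length, and the function names delay_in_frames and delayed_training_pairs are all hypothetical.

```python
import math
from typing import List, Tuple


def delay_in_frames(delay_ms: float, fps: float) -> int:
    """Number of whole frames that elapse during delay_ms at the given frame rate."""
    return math.ceil(delay_ms * fps / 1000.0)


def delayed_training_pairs(num_frames: int, delay_frames: int,
                           context: int) -> List[Tuple[List[int], int]]:
    """Pair each target frame t with the stale window the remote model would see.

    The remote input is the `context` frame indices ending at t - delay_frames;
    the supervision target is the label of frame t.
    """
    pairs = []
    first_t = delay_frames + context - 1  # earliest t with a full remote window
    for t in range(first_t, num_frames):
        last_seen = t - delay_frames
        window = list(range(last_seen - context + 1, last_seen + 1))
        pairs.append((window, t))
    return pairs


# 100 ms round-trip delay at an assumed 30 fps corresponds to D = 3 frames.
D = delay_in_frames(100.0, 30.0)
pairs = delayed_training_pairs(num_frames=8, delay_frames=D, context=2)
```

Each pair couples the frames available to the remote model with a label D frames later, which is what forces the remote features to anticipate the current scene rather than describe a past one.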
Existing work such as EfficientViT and MobileNetV4 focuses on minimizing computation for real-time on-device performance, but these models remain constrained by device power consumption and computational limits.
Compared to related work, Dedelayed generalizes to longer and variable delays through adjustable latency conditioning while keeping a simple, reusable design.
The paper cites important works in related fields, including:
The EfficientViT series of lightweight models
BDD100K and Cityscapes datasets
Edge computing and distributed inference research
Biological research on human visual systems
Overall Assessment: This is a high-quality paper that addresses a practical problem, and the proposed Dedelayed framework has both theoretical and practical value. The method is simple and effective, the experimental validation is sufficient, and the work makes a valuable contribution to edge-cloud collaborative inference. Although the evaluation scope and the latency handling leave room for improvement, it is overall a meaningful piece of research.