Dedelayed: Deleting remote inference delay via on-device correction

Jacobellis, Ulhaq, Racapé et al.

Remote inference allows lightweight devices to leverage powerful cloud models. However, communication network latency makes predictions stale and unsuitable for real-time tasks. To address this, we introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, allowing the local device to produce low-latency outputs in real time. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without incurring additional delay, it improves accuracy by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, for a round-trip delay of 100 ms. The advantage grows under longer delays and higher-motion scenes, as delay-mitigated split inference sustains accuracy more effectively, providing clear advantages for real-time tasks that must remain aligned with the current world state.

Basic Information

Paper ID: 2510.13714
Title: Dedelayed: Deleting remote inference delay via on-device correction
Authors: Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja J. Yadwadkar
Classification: eess.IV cs.AI cs.CV cs.LG
Publication Date: October 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.13714

Abstract

Remote inference enables lightweight devices to leverage powerful cloud-based models. However, communication network latency renders predictions stale, making them unsuitable for real-time tasks. To address this challenge, the paper introduces Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, enabling local devices to produce low-latency outputs in real time. The method employs a lightweight local model that processes the current frame and fuses in features computed by a heavyweight remote model from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without introducing additional latency, at a round-trip delay of 100 ms it achieves a 6.4 mIoU improvement over fully local inference and a 9.8 mIoU improvement over remote inference.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is: how to overcome network latency in remote inference while maintaining prediction accuracy in real-time video processing applications.

Problem Significance

Real-time Application Requirements: Autonomous driving, robotic control, and wearable devices are extremely latency-sensitive, where stale predictions can lead to catastrophic consequences
Resource Constraints: Mobile devices are limited by power consumption and computational capacity, making it infeasible to run complex deep learning models
Cloud Advantages: Cloud GPUs possess powerful computational capabilities for processing high-resolution video and complex models

Limitations of Existing Methods

Existing distributed computing approaches have three major shortcomings:

Allocate all device resources to a single linear inference pipeline without reserving resources for local fallback options
Fail to consider the impact of latency on prediction accuracy
Significantly reduce spatiotemporal resolution to manage computational costs, losing rich visual details from modern camera systems

Research Motivation

The design is inspired by the human visual system: the optic nerve can transmit only a small fraction of the information received by the retina, so early processing primarily performs compression, while the metabolically intensive processing happens in deeper layers of the visual cortex. Machines equipped with digital video sensors face comparable constraints.

Core Contributions

Proposed Dedelayed Framework: A latency-aware distributed inference framework that mitigates network latency effects by fusing local real-time information with remote delayed features
Latency Quantification Analysis: Provides quantitative measurements of latency's impact on dense visual prediction accuracy
Practical System Validation: Validates system effectiveness on video segmentation tasks in urban driving scenarios, surpassing existing local or remote inference approaches
Simple and Effective Fusion Strategy: Employs additive feature fusion that is easy to deploy and extensible to other real-time methods

Methodology Details

Task Definition

Given a fresh input frame x_t at time t, the final prediction ŷ_t is computed through a lightweight local model f_light, which processes x_t and fuses temporally delayed features z_{t-τ} from a heavyweight remote model f_heavy.

Mathematical formulation:

z_{t-τ} = f_heavy(τ, x_{≤t-τ})     (1)
ŷ_t = f_light(x_t, z_{t-τ})        (2)
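
As a rough illustration of equations (1) and (2), the sketch below replays a frame sequence offline with a fixed delay counted in whole frames. The name run_stream and the callables passed as f_heavy and f_light are placeholders for this example only; they stand in for the paper's models and do not reproduce them.

```python
from typing import Any, Callable, List, Optional, Sequence


def run_stream(
    frames: Sequence[Any],
    delay: int,
    f_heavy: Callable[[int, Sequence[Any]], Any],
    f_light: Callable[[Any, Optional[Any]], Any],
) -> List[Any]:
    """Replay eq. (1)-(2) offline with a fixed delay of `delay` frames."""
    outputs = []
    for t, x_t in enumerate(frames):
        last = t - delay  # index of the newest frame the remote model has seen
        if last >= 0:
            # Eq. (1): remote features from frames up to and including t - tau.
            z = f_heavy(delay, frames[: last + 1])
        else:
            # Nothing has come back from the server yet.
            z = None
        # Eq. (2): the local model always sees the fresh frame x_t.
        outputs.append(f_light(x_t, z))
    return outputs


# Toy stand-ins: the "remote model" returns the newest frame it saw,
# and the "local model" pairs the fresh frame with whatever arrived.
result = run_stream(
    [0, 1, 2, 3, 4],
    delay=2,
    f_heavy=lambda tau, history: history[-1],
    f_light=lambda x, z: (x, z),
)
```

With these toy callables, result pairs each frame index t with index t - 2, and the first two entries carry None because no remote features have arrived yet.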

Model Architecture

Overall System Architecture

The Dedelayed system comprises two main components:

Local Lightweight Model: Processes current frames, providing real-time response capability
Remote Prediction Model: Processes historical frame sequences, providing high-quality features

Remote Prediction Module

Employs EfficientViT-L1 as a 2D ViT backbone with effective patch size of 8×8
Maintains a context window of K most recent frames
Concatenates features from each frame along the temporal axis, spatially merging into larger 16×16 patches
Adds learnable latency embeddings based on measured latency τ
Produces latency-conditioned features through 3D ViT encoder and learned pooling (MLP-pool-MLP)
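
One way to picture the latency embedding step is a lookup table indexed by the delay in frames, whose row is added to every token. The sketch below is a plain-Python illustration under several assumptions: the rounding rule in delay_index, the table size, the 4-channel tokens, and the random initialization are all hypothetical, and a trained system would learn the table rather than sample it.

```python
import random
from typing import List

FPS = 30  # BDD100K frame rate quoted in this summary


def delay_index(latency_ms: float, max_frames: int) -> int:
    """Quantize a measured latency to a whole number of frames (assumed rule)."""
    frames = round(latency_ms * FPS / 1000.0)
    return max(0, min(max_frames, frames))


def add_latency_embedding(
    tokens: List[List[float]], table: List[List[float]], latency_ms: float
) -> List[List[float]]:
    """Add the embedding row for the quantized delay to every token."""
    row = table[delay_index(latency_ms, max_frames=len(table) - 1)]
    return [[v + e for v, e in zip(token, row)] for token in tokens]


# Hypothetical table: 6 rows for delays of 0..5 frames, 4 channels each.
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(4)] for _ in range(6)]
tokens = [[0.0] * 4 for _ in range(3)]
conditioned = add_latency_embedding(tokens, table, latency_ms=100.0)
```

At 30 fps a 100 ms delay maps to row 3 of the table, so every token in conditioned receives the same offset.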

Local Model and Fusion

Computes first-stage features: h = T1(x_t)
Performs early fusion via element-wise addition: h' = h + z_{t-τ}
Both tensors have shape 96 × H/8 × W/8, requiring no projection or resizing
If z_{t-τ} is unavailable, the local model falls back to h' = h
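
The fusion rule above is simple enough to sketch directly. The following NumPy example is a minimal illustration, not the authors' code: the function name fuse and the 480×640 input size are assumptions chosen only to make the 96 × H/8 × W/8 shape concrete.

```python
from typing import Optional

import numpy as np


def fuse(h: np.ndarray, z: Optional[np.ndarray] = None) -> np.ndarray:
    """Early fusion h' = h + z, or h' = h when no remote features arrived."""
    if z is None:
        return h
    if z.shape != h.shape:
        raise ValueError(f"shape mismatch: {h.shape} vs {z.shape}")
    return h + z


# Illustrative 480x640 input, so both tensors are 96 x 60 x 80.
height, width = 480, 640
h = np.zeros((96, height // 8, width // 8), dtype=np.float32)
z = np.ones_like(h)
fused = fuse(h, z)
fallback = fuse(h)  # remote features missing: local features pass through
```

Because the two tensors already share a shape, no projection layer is needed, and the fallback path costs nothing beyond the local model itself.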

Technical Innovations

Latency Embedding Mechanism: Similar to positional embeddings in text or visual transformers, allowing remote model behavior to adapt to channel variations
Temporal Predictive Training: Simulates D-frame delays during supervised training, training the remote model to predict future frames
Mixed-Resolution Inference: Local model uses low resolution, remote model uses high-resolution multi-frame processing
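Performance Floor: In the reported experiments, the fused system matches or exceeds the stronger of the local-only and remote-only models at all realistic delays beyond 33 ms, and it falls back to the local model when remote features are unavailable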
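
To make temporal predictive training concrete, the sketch below builds training samples in which the remote model sees K context frames ending D frames before the frame whose label it must match. The helper name training_indices and the index-based representation are assumptions for illustration; the paper's data loader may be organized differently.

```python
from typing import List, Tuple


def training_indices(
    num_frames: int, context: int, delay: int
) -> List[Tuple[List[int], int]]:
    """Pair each target frame t with the K context frames ending at t - D."""
    samples = []
    # The earliest usable target needs `context` frames plus `delay` frames before it.
    for t in range(context - 1 + delay, num_frames):
        last = t - delay  # newest frame the remote model may see
        context_frames = list(range(last - context + 1, last + 1))
        samples.append((context_frames, t))
    return samples


# A 6-frame clip, K = 2 context frames, simulated delay D = 3 frames.
samples = training_indices(num_frames=6, context=2, delay=3)
```

Here samples holds two entries: frames 0 and 1 supervise against the label of frame 4, and frames 1 and 2 against the label of frame 5.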

Experimental Setup

Dataset

BDD100K Video Dataset: Contains 30fps driving scene videos
Uses pre-trained EoMT model to generate pseudo-labels, ignoring low-confidence pixels
Uses 19-label subset from Cityscapes
Applies WebP image codec (quality 85) for uplink video stream compression
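
The pseudo-labeling step can be sketched as a per-pixel filter that replaces low-confidence labels with an ignore value. The threshold of 0.9 and the ignore value 255 below are assumptions for illustration; this summary does not state the values used with the EoMT model.

```python
from typing import List, Sequence

IGNORE_INDEX = 255  # common "void" label in segmentation tooling; assumed here


def mask_pseudo_labels(
    labels: Sequence[int],
    confidences: Sequence[float],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
    ignore_index: int = IGNORE_INDEX,
) -> List[int]:
    """Keep a pseudo-label only where the teacher's confidence is high enough."""
    if len(labels) != len(confidences):
        raise ValueError("labels and confidences must have the same length")
    return [
        label if conf >= threshold else ignore_index
        for label, conf in zip(labels, confidences)
    ]


# Three flattened pixels: the middle one is too uncertain to supervise on.
masked = mask_pseudo_labels([3, 7, 0], [0.95, 0.40, 0.90])
```

Pixels marked with the ignore value would then be skipped by both the training loss and the evaluation metric.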

Evaluation Metrics

mIoU (mean Intersection over Union): Standard evaluation metric for semantic segmentation
Latency Range: 0-5 frames (0-165ms), representing typical round-trip delays
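
For reference, mIoU can be computed from flattened prediction and label arrays as shown in the sketch below. It assumes predictions are valid class indices and that pixels labeled 255 are ignored; both conventions are assumptions of this example. The function returns a fraction, whereas the figures quoted in this summary (such as 6.4 mIoU) are in percentage points.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: int = 255,
) -> float:
    """Mean of per-class intersection over union, skipping absent classes."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: contributes to neither count
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both the predicted
            # class and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 255], num_classes=2)
```

In this toy case each class has one correct pixel and a union of two, so score is 0.5.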

Comparison Methods

Local image: Traditional single-frame local inference
Remote image: Traditional single-frame remote inference
Remote video: Remote video processing without future prediction
Remote predictive: Latency-aware remote prediction model
Local + remote predictive: Complete Dedelayed system

Implementation Details

Multi-stage Training Strategy: Remote and local models trained independently first, then jointly fine-tuned
Optimizer: Adan optimizer
Learning Rate Schedule: Trapezoidal cosine learning rate schedule
Loss Function: Cross-entropy loss
Pre-training: ImageNet classification → Cityscapes segmentation → BDD100K fine-tuning
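
This summary does not spell out the shape of the trapezoidal cosine schedule. One plausible reading, sketched below purely as an assumption, is a linear warmup, a flat plateau at the peak rate, and a cosine-shaped cooldown. The function name, the warmup and cooldown fractions, and the peak rate are all hypothetical.

```python
import math


def trapezoid_cosine_lr(
    step: int,
    total_steps: int,
    peak_lr: float,
    warmup_frac: float = 0.1,    # hypothetical fraction of steps spent warming up
    cooldown_frac: float = 0.2,  # hypothetical fraction spent cooling down
) -> float:
    """Linear warmup, constant plateau, then cosine decay toward zero."""
    warmup = int(total_steps * warmup_frac)
    cooldown = int(total_steps * cooldown_frac)
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step >= total_steps - cooldown:
        progress = (step - (total_steps - cooldown)) / max(1, cooldown)
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr


schedule = [trapezoid_cosine_lr(s, total_steps=100, peak_lr=1.0) for s in range(100)]
```

Under these settings the rate climbs over the first 10 steps, holds at the peak through step 80, and decays over the final 20 steps.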

Experimental Results

Main Results

Significant Performance Improvements:
- 6.4 mIoU improvement over local-only inference at a 100 ms round-trip delay
- 9.8 mIoU improvement over remote-only inference at the same delay
- Outperforms the stronger of the two baselines across all realistic delays beyond 33 ms
Latency Robustness:
- Advantages increase with longer delays
- Better performance in high-motion scenes
- Distributed inference with latency mitigation more effectively maintains accuracy

Ablation Studies

Experiments validate the contribution of each component:

Remote video vs Remote image: Using only historical frame context is insufficient for performance improvement
Remote predictive vs Remote video: Temporal predictive training significantly enhances latency robustness
Local + remote predictive vs Remote predictive: Local information fusion further improves performance

Latency Jitter Analysis

Model maintains good performance when input delay mismatches observed delay
Performance degrades more gracefully when observed delay exceeds input delay
Maintains advantages even in high-jitter networks (σ=15ms)
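
The mismatch between the delay the model is conditioned on (input delay) and the delay that actually occurs (observed delay) can be explored with a small simulation. The sketch below assumes Gaussian jitter around a mean round-trip delay and rounds delays to whole frames at 30 fps; the distribution, the rounding rule, and the function name are assumptions of this example rather than the paper's protocol.

```python
import random
from typing import List

FRAME_MS = 1000.0 / 30.0  # frame period at 30 fps


def simulate_delay_mismatch(
    mean_ms: float, sigma_ms: float, num_samples: int, seed: int = 0
) -> List[int]:
    """Return observed-minus-input delay, in frames, for each sampled request."""
    rng = random.Random(seed)
    input_frames = round(mean_ms / FRAME_MS)  # delay the model is told to expect
    mismatches = []
    for _ in range(num_samples):
        observed_ms = max(0.0, rng.gauss(mean_ms, sigma_ms))  # delays cannot be negative
        observed_frames = round(observed_ms / FRAME_MS)
        mismatches.append(observed_frames - input_frames)
    return mismatches


# 100 ms mean round trip with the sigma = 15 ms jitter level mentioned above.
mismatches = simulate_delay_mismatch(mean_ms=100.0, sigma_ms=15.0, num_samples=1000)
late_fraction = sum(m > 0 for m in mismatches) / len(mismatches)
```

A positive entry means the features arrived later than the model was conditioned for, which is the case the summary describes as degrading more gracefully.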

Resolution Adaptability

Remote-assisted local models can operate at lower resolutions without accuracy loss, demonstrating system resource efficiency.

Related Work

Lightweight Architecture Research

Existing work such as EfficientViT and MobileNetV4 focuses on minimizing computation for real-time on-device performance, but such models remain constrained by device power and compute budgets.

Distributed Computing Methods

MPEG AI and JPEG AI: Focus on bandwidth reduction, lacking latency compensation mechanisms
Clockwork Convnets: Reuse stale features to reduce latency, but with limited temporal reasoning capability
Accel: Uses optical flow forward transformation for model features, but unsuitable for cross-network operations
Knowledge Boosting: Most closely related to this work, but assumes fixed latency

Advantages of This Work

Compared to related work, Dedelayed generalizes to longer and variable delays through adjustable latency conditioning while maintaining simple design and reusability.

Conclusions and Discussion

Main Conclusions

Dedelayed successfully addresses the core challenge of remote computation in real-time systems: prediction staleness caused by network latency
By elevating latency to a first-class variable, the system surpasses strong baselines under realistic network conditions
The framework applies to a broad range of real-time problem domains, enabling intelligent systems to be both accurate and reliably timely

Limitations

Latency Variability: The current implementation primarily targets relatively stable latency, with limited adaptability to extreme jitter
Computational Overhead: While the local model is lightweight, additional fusion computation is still required
Dataset Limitations: Primarily validated on driving scenarios; generalization to other domains remains to be verified
Network Dependency: The accuracy gains depend on network connectivity; when the network is unavailable, the system relies solely on the local model

Future Directions

Future research directions proposed in the paper include:

Investigating variable and stochastic latency distributions
Handling high-motion data
Developing lighter local models
Exploring local future prediction capabilities

In-Depth Evaluation

Strengths

Problem Importance: Addresses a critical issue in edge computing with significant practical value
Method Novelty: The combination of latency embedding and temporal predictive training is novel
Experimental Comprehensiveness: Thorough ablation studies and latency jitter analysis
Strong Practicality: Simple fusion strategy based on existing models, easy to deploy
Theoretical Foundation: Inspired by human visual system, biologically plausible

Weaknesses

Limited Evaluation Scope: Validated only on semantic segmentation task, lacking verification on other tasks
Latency Range: Maximum 165ms latency may be insufficient to cover all practical scenarios
Insufficient Computational Cost Analysis: Lacks detailed analysis of computational and communication costs
Limited Baseline Comparisons: Could compare with more recent edge computing methods

Impact

Academic Contribution: Provides new insights for edge-cloud collaborative inference
Practical Value: Direct application potential in autonomous driving, robotics, and other domains
Reproducibility: Provides detailed implementation code for easy reproduction and extension

Applicable Scenarios

Autonomous Driving: Vehicle systems require real-time and accurate environmental perception
Mobile Robotics: Navigation and obstacle avoidance require low-latency response
AR/VR Applications: Real-time scene understanding and rendering
Video Surveillance: Real-time object detection and tracking

Overall Assessment: This is a high-quality paper addressing practical problems, with the proposed Dedelayed framework having important value both theoretically and practically. The method is simple and effective, with sufficient experimental validation, providing valuable contributions to the edge-cloud collaborative inference field. While there is room for improvement in evaluation scope and latency handling capability, it is overall a meaningful research work.