Dedelayed: Deleting remote inference delay via on-device correction

Jacobellis, Ulhaq, Racapé et al.
Remote inference allows lightweight devices to leverage powerful cloud models. However, communication network latency makes predictions stale and unsuitable for real-time tasks. To address this, we introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, allowing the local device to produce low-latency outputs in real time. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without incurring additional delay, it improves accuracy by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, for a round-trip delay of 100 ms. The advantage grows under longer delays and higher-motion scenes, as delay-mitigated split inference sustains accuracy more effectively, providing clear advantages for real-time tasks that must remain aligned with the current world state.

Basic Information

  • Paper ID: 2510.13714
  • Title: Dedelayed: Deleting remote inference delay via on-device correction
  • Authors: Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja J. Yadwadkar
  • Classification: eess.IV cs.AI cs.CV cs.LG
  • Publication Date: October 15, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.13714

Abstract

Remote inference enables lightweight devices to leverage powerful cloud-based models. However, communication network latency renders predictions stale, making them unsuitable for real-time tasks. To address this challenge, the paper introduces Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, enabling local devices to produce low-latency outputs in real time. The method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without introducing additional latency, at a round-trip delay of 100 ms it achieves a 6.4 mIoU improvement over fully local inference and a 9.8 mIoU improvement over remote inference.

Research Background and Motivation

Problem Definition

The paper asks how real-time video applications can use remote inference without network latency making the predictions stale, so that outputs remain accurate for the current frame.

Problem Significance

  1. Real-time Application Requirements: Autonomous driving, robotic control, and wearable devices are extremely latency-sensitive, where stale predictions can lead to catastrophic consequences
  2. Resource Constraints: Mobile devices are limited by power consumption and computational capacity, making it infeasible to run complex deep learning models
  3. Cloud Advantages: Cloud GPUs possess powerful computational capabilities for processing high-resolution video and complex models

Limitations of Existing Methods

Existing distributed computing approaches have three major shortcomings:

  1. Allocate all device resources to a single linear inference pipeline without reserving resources for local fallback options
  2. Fail to consider the impact of latency on prediction accuracy
  3. Significantly reduce spatiotemporal resolution to manage computational costs, losing rich visual details from modern camera systems

Research Motivation

The design is inspired by the human visual system: the optic nerve can transmit only a small fraction of the information received by the retina, so early processing primarily performs compression, and the metabolically intensive processing happens in deeper layers of the visual cortex. Machines equipped with digital video sensors face comparable constraints.

Core Contributions

  1. Proposed Dedelayed Framework: A latency-aware distributed inference framework that mitigates network latency effects by fusing local real-time information with remote delayed features
  2. Latency Quantification Analysis: Provides quantitative measurements of latency's impact on dense visual prediction accuracy
  3. Practical System Validation: Validates system effectiveness on video segmentation tasks in urban driving scenarios, surpassing existing local or remote inference approaches
  4. Simple and Effective Fusion Strategy: Employs additive feature fusion that is easy to deploy and extensible to other real-time methods

Methodology Details

Task Definition

Given a fresh input frame x_t at time t, the final prediction ŷ_t is computed through a lightweight local model f_light, which processes x_t and fuses temporally delayed features z_{t-τ} from a heavyweight remote model f_heavy.

Mathematical formulation:

z_{t-τ} = f_heavy(τ, x_{≤t-τ})     (1)
ŷ_t = f_light(x_t, z_{t-τ})        (2)
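
Equations (1) and (2) can be read as a per-frame loop in which the local model always sees the newest frame while the remote features lag behind by τ. The following is a minimal sketch of that timing only, assuming a fixed delay counted in whole frames. The functions `f_heavy`, `f_light`, and `run`, the `context` parameter, and the use of strings in place of tensors are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

def f_heavy(delay, frames):
    """Hypothetical stand-in for the remote model in Eq. (1)."""
    # Sees only frames up to t - delay and is told the delay it must bridge.
    return {"from_frame": frames[-1], "delay": delay}

def f_light(frame, z):
    """Hypothetical stand-in for the local model in Eq. (2)."""
    if z is None:
        source = "local only"
    else:
        source = f"remote features from frame {z['from_frame']}"
    return f"prediction for frame {frame} ({source})"

def run(num_frames, delay, context=4):
    """Simulate which remote features are available at each time step."""
    history = deque(maxlen=context)  # frames the remote side has received
    outputs = []
    for t in range(num_frames):
        # At time t the remote side has only seen frames up to t - delay.
        if t - delay >= 0:
            history.append(t - delay)
        z = f_heavy(delay, list(history)) if history else None
        outputs.append(f_light(t, z))
    return outputs

for line in run(num_frames=5, delay=3):
    print(line)
```

With a delay of 3 frames, the first three predictions are local-only, and the prediction for frame 3 is the first one that can use remote features (computed from frame 0).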

Model Architecture

Overall System Architecture

The Dedelayed system comprises two main components:

  1. Local Lightweight Model: Processes current frames, providing real-time response capability
  2. Remote Prediction Model: Processes historical frame sequences, providing high-quality features

Remote Prediction Module

  • Employs EfficientViT-L1 as a 2D ViT backbone with effective patch size of 8×8
  • Maintains a context window of K most recent frames
  • Concatenates features from each frame along the temporal axis, spatially merging into larger 16×16 patches
  • Adds learnable latency embeddings based on measured latency τ
  • Produces latency-conditioned features through 3D ViT encoder and learned pooling (MLP-pool-MLP)
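
The latency-embedding step can be illustrated with a small lookup table. This is a sketch under stated assumptions: the measured delay is quantized to whole frames at 30 fps, one embedding vector is kept per delay bucket, and that vector is added to every token. The names `delay_to_frames`, `latency_table`, and `add_latency_embedding`, the width `DIM`, and the random initialization are hypothetical; in a trained model the table entries would be learned.

```python
import random

FPS = 30              # BDD100K frame rate
MAX_DELAY_FRAMES = 5  # delays of 0-5 frames are evaluated
DIM = 8               # illustrative feature width (assumption)

def delay_to_frames(delay_ms, fps=FPS, max_frames=MAX_DELAY_FRAMES):
    """Quantize a measured delay in milliseconds to a whole number of frames."""
    frames = round(delay_ms * fps / 1000)
    return max(0, min(max_frames, frames))

rng = random.Random(0)
# One vector per delay bucket; randomly initialized here, learned in practice.
latency_table = [[rng.gauss(0.0, 0.02) for _ in range(DIM)]
                 for _ in range(MAX_DELAY_FRAMES + 1)]

def add_latency_embedding(tokens, delay_ms):
    """Add the embedding for the measured delay to every token vector."""
    emb = latency_table[delay_to_frames(delay_ms)]
    return [[x + e for x, e in zip(tok, emb)] for tok in tokens]

tokens = [[0.0] * DIM, [1.0] * DIM]
conditioned = add_latency_embedding(tokens, delay_ms=100)  # 100 ms -> 3 frames
```

Because the delay only selects a row of the table, the same remote weights can serve different channel conditions by changing a single input value.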

Local Model and Fusion

  • Computes first-stage features: h = T1(x_t)
  • Performs early fusion via element-wise addition: h' = h + z_{t-τ}
  • Both tensors have shape 96 × H/8 × W/8, requiring no projection or resizing
  • If z_{t-τ} is unavailable, the local model falls back to h' = h
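
The fusion rule and its fallback can be sketched in a few lines. The example below uses nested Python lists as stand-ins for tensors, purely for illustration; the helper names `shape`, `stage1_tensor`, and `fuse` are hypothetical, and a real implementation would use a tensor library.

```python
def shape(t):
    """Return (C, H, W) of a nested-list tensor."""
    return (len(t), len(t[0]), len(t[0][0]))

def stage1_tensor(height, width, channels=96, value=0.0):
    """Build a constant (channels, height/8, width/8) tensor."""
    return [[[value] * (width // 8) for _ in range(height // 8)]
            for _ in range(channels)]

def fuse(h, z=None):
    """Early fusion h' = h + z, falling back to h' = h when z is missing."""
    if z is None:
        return h
    if shape(h) != shape(z):
        raise ValueError(f"shape mismatch: {shape(h)} vs {shape(z)}")
    return [[[a + b for a, b in zip(row_h, row_z)]
             for row_h, row_z in zip(ch_h, ch_z)]
            for ch_h, ch_z in zip(h, z)]

h = stage1_tensor(64, 128, value=1.0)   # local first-stage features
z = stage1_tensor(64, 128, value=0.5)   # delayed remote features
fused = fuse(h, z)                      # element-wise sum, same shape as h
local_only = fuse(h)                    # remote features unavailable
```

Since the two tensors already share a shape, fusion adds no learned parameters, and dropping the remote term leaves a valid input for the rest of the local model.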

Technical Innovations

  1. Latency Embedding Mechanism: Similar to positional embeddings in text or visual transformers, allowing remote model behavior to adapt to channel variations
  2. Temporal Predictive Training: Simulates D-frame delays during supervised training, training the remote model to predict future frames
  3. Mixed-Resolution Inference: Local model uses low resolution, remote model uses high-resolution multi-frame processing
  4. Performance Floor: In the reported experiments, system accuracy stays at or above that of the stronger standalone model for realistic delays beyond 33 ms
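
Temporal predictive training can be pictured as choosing, for each training example, a simulated delay D and then showing the remote model only frames that are at least D frames older than the labeled frame. The sketch below shows one way to pick the frame indices. The function name, the uniform sampling of D, and the default window of 4 frames are assumptions made for illustration rather than details taken from the paper.

```python
import random

def sample_training_example(num_frames, context=4, max_delay=5, rng=random):
    """Pick frame indices for one training example with a simulated delay."""
    if num_frames < max_delay + context:
        raise ValueError("clip too short for the requested delay and context")
    delay = rng.randint(0, max_delay)      # simulated delay D in frames
    # The labeled frame must leave room for the delay plus the context window.
    t = rng.randint(delay + context - 1, num_frames - 1)
    newest_remote = t - delay              # newest frame the remote model sees
    remote_frames = list(range(newest_remote - context + 1, newest_remote + 1))
    return {
        "delay": delay,                    # also fed to the latency embedding
        "remote_frames": remote_frames,    # input to the remote model
        "local_frame": t,                  # input to the local model
        "label_frame": t,                  # supervision comes from frame t
    }

example = sample_training_example(num_frames=30, rng=random.Random(0))
```

The key property is that the label always comes from frame t while the remote input ends at frame t - D, so the remote model is pushed to anticipate D frames ahead.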

Experimental Setup

Dataset

  • BDD100K Video Dataset: Contains 30fps driving scene videos
  • Uses pre-trained EoMT model to generate pseudo-labels, ignoring low-confidence pixels
  • Uses 19-label subset from Cityscapes
  • Applies WebP image codec (quality 85) for uplink video stream compression
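
The pseudo-labeling step, in which low-confidence pixels are ignored, can be sketched as a simple threshold on the teacher's per-pixel class probabilities. The threshold of 0.9 and the function name are illustrative assumptions, not values from the paper; 255 is used as the ignore label, following the common Cityscapes convention.

```python
IGNORE_INDEX = 255  # Cityscapes-style "ignore" label (assumed here)

def make_pseudo_labels(probs, threshold=0.9):
    """Turn per-pixel class probabilities into labels, masking uncertain pixels.

    probs: list of pixels, each a list of class probabilities.
    threshold: illustrative confidence cut-off, not a value from the paper.
    """
    labels = []
    for p in probs:
        best = max(range(len(p)), key=p.__getitem__)  # most likely class
        labels.append(best if p[best] >= threshold else IGNORE_INDEX)
    return labels

probs = [[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]]
labels = make_pseudo_labels(probs)  # the middle pixel is masked out
```

Pixels marked with the ignore label contribute neither to the training loss nor to the evaluation metric.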

Evaluation Metrics

  • mIoU (mean Intersection over Union): Standard evaluation metric for semantic segmentation
  • Latency Range: 0-5 frames (0-165 ms, taking one frame at 30 fps as roughly 33 ms), representing typical round-trip delays
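
For reference, mIoU averages the per-class ratio of intersection to union between predicted and reference labels. The following is a minimal sketch over flat label lists, assuming predictions are valid class indices and that pixels with the ignore label in the reference are skipped. Benchmark implementations usually accumulate the counts over the whole dataset before dividing.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the target."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue                # ignored pixels do not count
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1           # false positive for class p
            union[t] += 1           # false negative for class t
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")

pred = [0, 0, 1, 1]
target = [0, 1, 1, 255]
print(mean_iou(pred, target, num_classes=2))  # 0.5
```

In this example each class has one correctly labeled pixel and a union of two pixels, so both per-class IoUs are 0.5.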

Comparison Methods

  1. Local image: Traditional single-frame local inference
  2. Remote image: Traditional single-frame remote inference
  3. Remote video: Remote video processing without future prediction
  4. Remote predictive: Latency-aware remote prediction model
  5. Local + remote predictive: Complete Dedelayed system

Implementation Details

  • Multi-stage Training Strategy: Remote and local models trained independently first, then jointly fine-tuned
  • Optimizer: Adan optimizer
  • Learning Rate Schedule: Trapezoidal cosine learning rate schedule
  • Loss Function: Cross-entropy loss
  • Pre-training: ImageNet classification → Cityscapes segmentation → BDD100K fine-tuning
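
The summary does not spell out the shape of the trapezoidal cosine schedule. One plausible reading is a linear warmup, a flat plateau, and a cosine-shaped decay, which the sketch below implements. The function name, the peak rate, and the warmup and decay fractions are all assumptions for illustration and may differ from the paper's settings.

```python
import math

def trapezoid_cosine_lr(step, total_steps, peak_lr=1e-3,
                        warmup_frac=0.1, decay_frac=0.3):
    """Linear warmup, flat plateau, then cosine decay toward zero."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # ramp up
    if step < decay_start:
        return peak_lr                               # plateau
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [trapezoid_cosine_lr(s, total_steps=100) for s in range(100)]
```

With 100 steps and the default fractions, the rate rises over the first 10 steps, stays at the peak until step 70, and then decays over the final 30 steps.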

Experimental Results

Main Results

  1. Significant Performance Improvements:
    • 6.4 mIoU improvement over pure local inference at 100ms round-trip latency
    • 9.8 mIoU improvement over remote inference
    • Outperforms strongest baseline across all realistic delays exceeding 33ms
  2. Latency Robustness:
    • Advantages increase with longer delays
    • Better performance in high-motion scenes
    • Distributed inference with latency mitigation more effectively maintains accuracy

Ablation Studies

Experiments validate the contribution of each component:

  • Remote video vs Remote image: Using only historical frame context is insufficient for performance improvement
  • Remote predictive vs Remote video: Temporal predictive training significantly enhances latency robustness
  • Local + remote predictive vs Remote predictive: Local information fusion further improves performance

Latency Jitter Analysis

  • The model maintains good performance when the delay value it is given as input does not match the delay actually observed
  • Performance degrades more gracefully when the observed delay exceeds the input delay than in the opposite case
  • Maintains advantages even in high-jitter networks (σ=15ms)
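
To get a feel for how often jitter of this size causes a mismatch between the delay given to the model and the delay actually experienced, a small simulation helps. The sketch below assumes Gaussian jitter with σ = 15 ms around a 100 ms mean and quantizes delays to whole frames at 30 fps. The Gaussian shape, the mean, and the function name are assumptions for illustration, and the code measures only the mismatch frequency, not segmentation accuracy.

```python
import random

FRAME_MS = 1000 / 30  # frame period at 30 fps

def simulate_jitter(mean_ms=100.0, sigma_ms=15.0, trials=10_000, seed=0):
    """Estimate the distribution of (experienced - assumed) delay in frames."""
    rng = random.Random(seed)
    assumed = round(mean_ms / FRAME_MS)       # delay the model is conditioned on
    counts = {}
    for _ in range(trials):
        actual_ms = max(0.0, rng.gauss(mean_ms, sigma_ms))
        actual = round(actual_ms / FRAME_MS)  # delay actually experienced
        diff = actual - assumed
        counts[diff] = counts.get(diff, 0) + 1
    return {d: c / trials for d, c in sorted(counts.items())}

for diff, freq in simulate_jitter().items():
    print(f"mismatch of {diff:+d} frame(s): {freq:.1%}")
```

Under these assumptions most samples land in the assumed delay bucket, and nearly all of the rest are off by a single frame in either direction.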

Resolution Adaptability

Remote-assisted local models can operate at lower resolutions without accuracy loss, demonstrating system resource efficiency.

Related Work

Lightweight Architecture Research

Existing work such as EfficientViT and MobileNetV4 focuses on minimizing computation for real-time on-device performance, but these models remain constrained by device power and compute budgets.

Distributed Computing Methods

  • MPEG AI and JPEG AI: Focus on bandwidth reduction, lacking latency compensation mechanisms
  • Clockwork Convnets: Reuse stale features to reduce latency, but with limited temporal reasoning capability
  • Accel: Warps model features forward using optical flow, but is not designed for operation across a network
  • Knowledge Boosting: Most closely related to this work, but assumes fixed latency

Advantages of This Work

Compared to related work, Dedelayed generalizes to longer and variable delays through adjustable latency conditioning while maintaining simple design and reusability.

Conclusions and Discussion

Main Conclusions

  1. Dedelayed successfully addresses the core challenge of remote computation in real-time systems: prediction staleness caused by network latency
  2. By elevating latency to a first-class variable, the system surpasses strong baselines under realistic network conditions
  3. The framework applies to a broad range of real-time problem domains, enabling intelligent systems to be both accurate and reliably timely

Limitations

  1. Limited Jitter Handling: The current implementation primarily targets relatively stable latency, with limited adaptability to extreme jitter
  2. Computational Overhead: While the local model is lightweight, additional fusion computation is still required
  3. Dataset Limitations: Primarily validated on driving scenarios; generalization to other domains remains to be verified
  4. Network Dependency: The accuracy gains depend on network connectivity; when the network is unavailable, the system falls back to the local model alone

Future Directions

Future research directions proposed in the paper include:

  1. Investigating variable and stochastic latency distributions
  2. Handling high-motion data
  3. Developing lighter local models
  4. Exploring local future prediction capabilities

In-Depth Evaluation

Strengths

  1. Problem Importance: Addresses a critical issue in edge computing with significant practical value
  2. Method Novelty: The combination of latency embedding and temporal predictive training is novel
  3. Experimental Comprehensiveness: Thorough ablation studies and latency jitter analysis
  4. Strong Practicality: Simple fusion strategy based on existing models, easy to deploy
  5. Theoretical Foundation: Inspired by human visual system, biologically plausible

Weaknesses

  1. Limited Evaluation Scope: Validated only on semantic segmentation task, lacking verification on other tasks
  2. Latency Range: Maximum 165ms latency may be insufficient to cover all practical scenarios
  3. Insufficient Computational Cost Analysis: Lacks detailed analysis of computational and communication costs
  4. Limited Baseline Comparisons: Could compare with more recent edge computing methods

Impact

  1. Academic Contribution: Provides new insights for edge-cloud collaborative inference
  2. Practical Value: Direct application potential in autonomous driving, robotics, and other domains
  3. Reproducibility: Provides detailed implementation code for easy reproduction and extension

Applicable Scenarios

  1. Autonomous Driving: Vehicle systems require real-time and accurate environmental perception
  2. Mobile Robotics: Navigation and obstacle avoidance require low-latency response
  3. AR/VR Applications: Real-time scene understanding and rendering
  4. Video Surveillance: Real-time object detection and tracking

Overall Assessment: This is a high-quality paper addressing practical problems, with the proposed Dedelayed framework having important value both theoretically and practically. The method is simple and effective, with sufficient experimental validation, providing valuable contributions to the edge-cloud collaborative inference field. While there is room for improvement in evaluation scope and latency handling capability, it is overall a meaningful research work.