
Bridging Memory Gaps: Scaling Federated Learning for Heterogeneous Clients

Wu, Li, Tian et al.
Federated Learning (FL) enables multiple clients to collaboratively train a shared model while preserving data privacy. However, the high memory demand during model training severely limits the deployment of FL on resource-constrained clients. To this end, we propose SCALEFL, a scalable and inclusive FL framework designed to overcome memory limitations through sequential block-wise training. The core idea of SCALEFL is to partition the global model into blocks and train them sequentially, thereby reducing training memory requirements. To mitigate information loss during block-wise training, SCALEFL introduces a Curriculum Mentor that crafts curriculum-aware training objectives for each block to steer their learning process. Moreover, SCALEFL incorporates a Training Harmonizer that designs a parameter co-adaptation training scheme to coordinate block updates, effectively breaking inter-block information isolation. Extensive experiments on both simulation and hardware testbeds demonstrate that SCALEFL significantly improves model performance by up to 84.2%, reduces peak memory usage by up to 50.4%, and accelerates training by up to 1.9×.

Basic Information

  • Paper ID: 2408.10826
  • Title: Bridging Memory Gaps: Scaling Federated Learning for Heterogeneous Clients
  • Authors: Yebo Wu, Jingguang Li, Chunlin Tian, KaHou Tam, Li Li, Chengzhong Xu (University of Macau)
  • Category: cs.DC (Distributed, Parallel, and Cluster Computing)
  • Publication Date: August 2024 (arXiv v2: October 2025)
  • Paper Link: https://arxiv.org/abs/2408.10826v2

Abstract

Federated Learning (FL) enables multiple clients to collaboratively train a shared model while preserving data privacy. However, high memory requirements during model training severely limit FL deployment on resource-constrained clients. To address this, the paper proposes SCALEFL, a scalable and inclusive FL framework that overcomes memory limitations through sequential block-wise training. The core idea of SCALEFL is to partition the global model into blocks and train them sequentially, thereby reducing training memory requirements. To mitigate information loss in block-wise training, SCALEFL introduces a Curriculum Mentor that formulates curriculum-aware training objectives for each block. Additionally, SCALEFL integrates a Training Harmonizer that designs a parameter co-adaptation training scheme to coordinate block updates and break information isolation between blocks.

Research Background and Motivation

Core Problems

  1. Memory Wall Problem: Federated learning training requires retaining all intermediate activations, model weights, and optimizer states in memory, resulting in high memory consumption. For example, training ResNet34 on ImageNet consumes over 12GB of memory, while typical mobile devices have only 4-12GB of RAM.
  2. Device Heterogeneity: Resource-constrained edge devices cannot participate in local training, preventing their valuable data from contributing to the global model.
  3. Limitations of Existing Methods:
    • Model Heterogeneous Training: Requires high-quality public datasets for knowledge distillation, which are difficult to obtain in FL
    • Partial Training: Width scaling breaks the model architecture, while depth scaling is limited by the memory capacity of the most constrained client

Research Motivation

As model architectures become deeper and wider to achieve higher analytical capability, the memory problem is further exacerbated. This paper aims to design an FL framework that can significantly reduce memory requirements while maintaining model performance.

Core Contributions

  1. Proposes SCALEFL Framework: Significantly reduces training memory requirements through sequential block-wise training, enabling resource-constrained devices to participate effectively
  2. Designs Two Core Components: Curriculum Mentor and Training Harmonizer jointly shape the learning behavior of each block, promoting coherent structured feature learning
  3. Comprehensive Experimental Validation: Demonstrates the effectiveness and robustness of SCALEFL on multiple benchmark datasets
  4. Theoretical Analysis: Provides convergence analysis proving the theoretical reliability of the method

Method Details

Task Definition

In an FL system containing N clients, each client n possesses a local dataset Dn. The goal is to train a global model Θ while satisfying memory constraints of all clients.

Sequential Block-wise Training Paradigm

Basic Procedure:

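  1. Model Construction: Server constructs a sub-model Θ_(g,t) = [θ_(1,F), θ_(2,F), ..., θ_t, θ_Op] for the current training stage t, where the subscript F marks previously trained, frozen blocks, θ_t is the block being trained, and θ_Op is the output module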
  2. Local Training: Only updates block θt and output module θOp
  3. Model Aggregation: Aggregates parameter updates using weighted averaging
  4. Progress Evaluation: Monitors training progress of block θt and determines convergence
  5. Model Growth: Freezes converged blocks and introduces new blocks
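
The five steps above can be sketched as a server-side loop. This is a minimal, framework-free illustration under stated assumptions, not the authors' implementation: blocks are represented as flat lists of floats, the output module θOp is omitted for brevity, and `local_train`, `has_converged`, and the toy callbacks are hypothetical stand-ins for real client training and progress evaluation.

```python
from typing import Callable, List, Sequence

Params = List[float]  # toy representation: one flat parameter vector per block


def weighted_average(updates: Sequence[Params], weights: Sequence[float]) -> Params:
    """Step 3: aggregate client updates, weighting each client by its weight."""
    total = sum(weights)
    return [
        sum(w * u[i] for u, w in zip(updates, weights)) / total
        for i in range(len(updates[0]))
    ]


def train_blockwise(
    blocks: List[Params],
    client_sizes: Sequence[int],
    local_train: Callable[[List[Params], Params, int], Params],
    has_converged: Callable[[Params, Params], bool],
    max_rounds: int = 50,
) -> List[Params]:
    """Train blocks one at a time; blocks trained earlier stay frozen."""
    frozen: List[Params] = []
    for active in blocks:
        for _ in range(max_rounds):
            # Steps 1-2: each client receives the frozen blocks plus the active
            # block and returns an updated copy of the active block only.
            updates = [
                local_train(frozen, active, n) for n in range(len(client_sizes))
            ]
            new_active = weighted_average(updates, client_sizes)  # step 3
            done = has_converged(active, new_active)  # step 4
            active = new_active
            if done:
                break
        frozen.append(active)  # step 5: freeze the block, then move to the next
    return frozen


# Toy usage (hypothetical): every client moves the block halfway toward 1.0.
def toy_local_train(frozen: List[Params], active: Params, client_id: int) -> Params:
    return [p + 0.5 * (1.0 - p) for p in active]


def toy_converged(old: Params, new: Params) -> bool:
    return max(abs(a - b) for a, b in zip(old, new)) < 1e-3


trained = train_blockwise([[0.0, 0.0], [0.0]], [10, 30], toy_local_train, toy_converged)
```

Because only the active block is handed back by clients, the memory needed for gradients and optimizer state scales with one block rather than the whole model, which is the point of the paradigm.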

Core Technical Components

1. Curriculum Mentor

Problem Analysis: Using information bottleneck theory, the paper finds that sequential block-wise training (SBT) causes severe information loss. Dynamic analysis on the nHSIC plane (nHSIC: normalized Hilbert-Schmidt Independence Criterion) shows that SBT loses substantial input information after training the first block, preventing subsequent blocks from extracting critical features.

Solution: Design curriculum-aware training objectives

L_θt = L_CE - λt · nHSIC(X;Zt) - γt · nHSIC(Y;Zt)

Where:

  • L_CE is cross-entropy loss
  • nHSIC(X;Zt) measures input information retention
  • nHSIC(Y;Zt) measures task relevance
  • λt and γt are dynamically adjusted according to training stage
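
To make the nHSIC terms concrete, here is a small pure-Python sketch of one common way to compute a normalized HSIC score between two sets of samples. It assumes a Gaussian kernel and normalization by Frobenius norms of the centered Gram matrices; the summary does not state which kernel or normalization the paper uses, so treat both choices, and the function names, as assumptions.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def gaussian_gram(samples: Sequence[Sequence[float]], sigma: float = 1.0) -> Matrix:
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(xi, xj)) / (2 * sigma**2))
            for xj in samples
        ]
        for xi in samples
    ]


def center(k: Matrix) -> Matrix:
    """Double-center a symmetric Gram matrix (equivalent to H K H)."""
    n = len(k)
    row_mean = [sum(row) / n for row in k]
    total_mean = sum(row_mean) / n
    return [
        [k[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def frobenius_inner(a: Matrix, b: Matrix) -> float:
    """Sum of elementwise products; equals trace(A B) for symmetric matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def nhsic(
    xs: Sequence[Sequence[float]], zs: Sequence[Sequence[float]], sigma: float = 1.0
) -> float:
    """Normalized HSIC: close to 1 for strongly dependent samples, 0 if either is constant."""
    kx = center(gaussian_gram(xs, sigma))
    kz = center(gaussian_gram(zs, sigma))
    denom = math.sqrt(frobenius_inner(kx, kx) * frobenius_inner(kz, kz))
    return frobenius_inner(kx, kz) / denom if denom > 0 else 0.0


# Toy usage: a representation identical to the input keeps all dependence,
# while a constant representation keeps none.
inputs = [[0.0], [1.0], [2.0], [3.0]]
score_identity = nhsic(inputs, inputs)
score_constant = nhsic(inputs, [[5.0]] * 4)
```

In the objective above, nHSIC(X;Zt) would be computed between a batch of inputs and the block's outputs, and nHSIC(Y;Zt) between labels (for example one-hot vectors) and the same outputs.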

Strategy: Early stages use higher λt and lower γt to emphasize input information retention, while later stages gradually decrease λt and increase γt to shift toward task-specific feature extraction.
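
The stage-dependent weighting can be illustrated with a simple schedule. The linear interpolation below is an assumption chosen for clarity (the summary only says λt decreases and γt increases over stages), and `curriculum_weights`, `lam_max`, and `gamma_max` are hypothetical names.

```python
from typing import Tuple


def curriculum_weights(
    stage: int, num_stages: int, lam_max: float = 1.0, gamma_max: float = 1.0
) -> Tuple[float, float]:
    """Return (lambda_t, gamma_t): lambda_t falls and gamma_t rises with the stage."""
    if num_stages < 1 or not 0 <= stage < num_stages:
        raise ValueError("stage must satisfy 0 <= stage < num_stages")
    progress = stage / (num_stages - 1) if num_stages > 1 else 1.0
    return lam_max * (1.0 - progress), gamma_max * progress


def block_objective(
    ce_loss: float, nhsic_xz: float, nhsic_yz: float, lam: float, gamma: float
) -> float:
    """L_theta_t = L_CE - lambda_t * nHSIC(X;Z_t) - gamma_t * nHSIC(Y;Z_t)."""
    return ce_loss - lam * nhsic_xz - gamma * nhsic_yz


# Toy usage: weights for a model split into four blocks.
schedule = [curriculum_weights(t, 4) for t in range(4)]
```

With this schedule the first block is rewarded mainly for retaining input information, and the last block mainly for task relevance, matching the strategy described above.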

2. Training Harmonizer

Problem Identification:

  • Limited Forward Information Flow: Downstream blocks only begin training after upstream blocks converge
  • Limited Backward Information Flow: Gradients are confined within blocks, causing gradient isolation

Parameter Co-adaptation Scheme:

  1. Dynamic Model Growth: Dynamically orchestrates the learning process of each block in each round, enabling downstream blocks to adapt to upstream block updates in real-time
  2. Concurrent Training Strategy: Trains the current block simultaneously with the last few layers of the upstream block, promoting gradient flow

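Update formula (reading the symbols from context, the + joins the parameters of block t with the trailing layers of upstream block t-1 into one jointly updated set):

θ^(k+1)_(n,t) + L^(k+1)_(n,t-1) ← (θ^k_(n,t) + L^k_(n,t-1)) - η · ∂L^k_(n,t)/∂(θ^k_(n,t) + L^k_(n,t-1))

Where:

  • θ^k_(n,t) is client n's copy of block t at step k
  • L^k_(n,t-1) on the parameter side denotes the last few layers of upstream block t-1
  • L^k_(n,t) in the numerator is the training loss for block t
  • η is the learning rate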
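
One way to read this update is as a single SGD step over the concatenation of the current block's parameters and the trailing layers of the upstream block. The sketch below takes that reading as an assumption; parameters and gradients are flat lists, and `joint_sgd_step` is a hypothetical helper rather than anything defined in the paper.

```python
from typing import List, Tuple


def joint_sgd_step(
    block_t: List[float],
    upstream_tail: List[float],
    grad_block_t: List[float],
    grad_tail: List[float],
    lr: float,
) -> Tuple[List[float], List[float]]:
    """Apply one SGD step to block t and the upstream tail layers together."""
    joint = block_t + upstream_tail  # list concatenation, not arithmetic addition
    grads = grad_block_t + grad_tail
    if len(joint) != len(grads):
        raise ValueError("each parameter needs exactly one gradient entry")
    updated = [p - lr * g for p, g in zip(joint, grads)]
    split = len(block_t)
    return updated[:split], updated[split:]


# Toy usage: both parameter groups move against their gradients in one step.
new_block, new_tail = joint_sgd_step([1.0, 2.0], [3.0], [0.5, 0.5], [1.0], lr=0.5)
```

Letting gradients reach the upstream tail is what relaxes the gradient isolation noted under Problem Identification, at the cost of keeping those extra layers trainable in memory.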

Complete Training Objective

Adding an L2 regularization term to handle data heterogeneity:

L^r_t = L_θt + (μ/2)||θ^r_t - θ^(r-1)_t||^2_2
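
The regularized objective can be evaluated directly. In this sketch θ^(r-1)_t is read as the block's parameters from the previous round, which is an assumption based on the superscripts; the function names and the value of μ are illustrative.

```python
from typing import Sequence


def proximal_penalty(
    theta_now: Sequence[float], theta_prev: Sequence[float], mu: float
) -> float:
    """(mu / 2) * squared L2 distance between current and previous parameters."""
    return 0.5 * mu * sum((a - b) ** 2 for a, b in zip(theta_now, theta_prev))


def round_objective(
    block_loss: float,
    theta_now: Sequence[float],
    theta_prev: Sequence[float],
    mu: float,
) -> float:
    """L^r_t = L_theta_t + (mu / 2) * ||theta^r_t - theta^(r-1)_t||^2."""
    return block_loss + proximal_penalty(theta_now, theta_prev, mu)


# Toy usage: squared distance is 1^2 + 2^2 = 5, so the penalty is 0.25 * 5 = 1.25.
penalty = proximal_penalty([1.0, 2.0], [0.0, 0.0], mu=0.5)
total = round_objective(1.0, [1.0, 2.0], [0.0, 0.0], mu=0.5)
```

A larger μ keeps each client's block closer to the previous parameters, which limits drift when client data distributions differ.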

Experimental Setup

Datasets

  • CIFAR10/CIFAR100: Classic image classification datasets
  • CINIC10: Extended version of CIFAR10
  • Mini-ImageNet: Small-scale ImageNet
  • FEMNIST: Large-scale FL dataset (805,263 images)

Model Architectures

  • ResNet18/ResNet34: Deep residual networks
  • VGG11 BN: Classic convolutional network
  • SqueezeNet: Lightweight network
  • Vision Transformer (ViT): Transformer architecture

Experimental Environment

  • Hybrid Setup: Simulation and real device testbed
  • Device Configuration: 100 heterogeneous mobile devices, 10% randomly selected per round
  • Memory Budget: 100-1000MB randomly allocated
  • Optimizer: SGD, weight decay 5e-4, local epoch=5
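
The simulated environment can be approximated in a few lines. This sketch assumes budgets are drawn uniformly as whole megabytes (the summary only says "randomly allocated") and uses hypothetical helper names; `participation_rate` shows how an ExclusiveFL-style memory threshold would translate into the share of eligible devices.

```python
import random
from typing import Dict, List, Optional, Sequence


def make_memory_budgets(
    num_devices: int = 100, low_mb: int = 100, high_mb: int = 1000, seed: int = 0
) -> Dict[int, int]:
    """Assign each device a memory budget in MB, drawn uniformly (assumption)."""
    rng = random.Random(seed)
    return {d: rng.randint(low_mb, high_mb) for d in range(num_devices)}


def sample_round(
    device_ids: Sequence[int], fraction: float = 0.1, rng: Optional[random.Random] = None
) -> List[int]:
    """Pick the devices that take part in one round, without replacement."""
    rng = rng or random.Random()
    k = max(1, round(fraction * len(device_ids)))
    return rng.sample(list(device_ids), k)


def participation_rate(budgets: Dict[int, int], required_mb: int) -> float:
    """Share of devices whose budget covers a given training memory footprint."""
    eligible = [d for d, mb in budgets.items() if mb >= required_mb]
    return len(eligible) / len(budgets)


# Toy usage with the settings listed above: 100 devices, 10% per round.
budgets = make_memory_budgets()
selected = sample_round(list(budgets), fraction=0.1, rng=random.Random(1))
```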

Baseline Methods

  1. AllSmall: Scales down global model based on weakest device memory
  2. ExclusiveFL: Only allows devices with sufficient memory to participate
  3. DepthFL: Depth scaling to adapt to heterogeneous devices
  4. HeteroFL: Static channel scaling
  5. FedRolex: Dynamic width scaling
  6. SmartFreeze: Simple sequential block-wise training
  7. ProFL: Decomposed sequential training

Experimental Results

Main Results

Performance in Non-IID Scenarios:

Method | CIFAR10 accuracy (ResNet18 / VGG11 / SqueezeNet) | Participation rate (ResNet18 / VGG11 / SqueezeNet)
AllSmall | 69.5% / 75.1% / 49.6% | 100% / 100% / 100%
ExclusiveFL | 76.8% / 79.3% / 40.6% | 18% / 22% / 11%
SCALEFL | 80.4% / 87.6% / 58.0% | 100% / 100% / 100%

Key Findings:

  1. Significant Performance Improvement: Accuracy gains of 10.9, 12.5, and 8.4 percentage points over AllSmall on ResNet18, VGG11, and SqueezeNet respectively
  2. Full Device Participation: Achieves a 100% device participation rate, while ExclusiveFL only reaches 11-22%
  3. Memory Efficiency: Peak memory usage reduced by up to 50.4%
  4. Training Acceleration: Training accelerated by up to 1.9x
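
The accuracy gains in the first finding follow directly from the CIFAR10 results table. The short check below recomputes them as percentage-point differences; the dictionary layout and function name are illustrative only.

```python
from typing import Dict

# CIFAR10 accuracies (%) copied from the results table above.
ALLSMALL = {"ResNet18": 69.5, "VGG11": 75.1, "SqueezeNet": 49.6}
EXCLUSIVEFL = {"ResNet18": 76.8, "VGG11": 79.3, "SqueezeNet": 40.6}
SCALEFL = {"ResNet18": 80.4, "VGG11": 87.6, "SqueezeNet": 58.0}


def accuracy_gains(ours: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point gain per model, rounded to one decimal place."""
    return {model: round(ours[model] - baseline[model], 1) for model in baseline}


gains_vs_allsmall = accuracy_gains(SCALEFL, ALLSMALL)
gains_vs_exclusive = accuracy_gains(SCALEFL, EXCLUSIVEFL)
```

These are absolute differences in accuracy, not relative improvements, so they are best read as percentage points.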

Scalability Analysis

Robustness Under Different Memory Constraints:

  • ExclusiveFL is completely infeasible in the ResNet34 scenario (0% participation rate)
  • SCALEFL achieves up to 27.4% improvement over other methods

Large-scale Datasets:

  • 3% accuracy improvement over FedAvg on FEMNIST dataset
  • Scales to deployments of 120-500 devices

Transformer Compatibility:

  • On ViT, accuracy is only 2% below the theoretical baseline, which is itself impractical

Hardware Evaluation

Memory Efficiency:

  • Testing on Jetson TX2 shows a 50.4% reduction in peak memory usage

Training Efficiency:

  • Single-round training is 1.84-2.31x faster than end-to-end training
  • Achieved 1.9x acceleration on ViT

Ablation Study

Component Contribution Analysis:

  • Removing Curriculum Mentor: 1.2% accuracy drop in CIFAR100 IID scenario
  • Removing Training Harmonizer: Significant 9.0% accuracy drop
  • Synergistic effect of both components is critical for performance

Related Work

Resource-Constrained FL

  1. Model Heterogeneous Training: Methods like FedMD require public datasets for knowledge distillation
  2. Partial Training: HeteroFL, FedRolex use width scaling; DepthFL, InclusiveFL use depth scaling

Block-wise Training

  1. ProgFed: Progressively introduces new blocks but still requires end-to-end training
  2. SmartFreeze: Sequentially trains each block but ignores information loss
  3. ProFL: Decomposes into shrinking and growth stages but fails to address core challenges

Theoretical Analysis

Convergence Proof

The paper provides convergence analysis for SCALEFL, proving under standard assumptions (smoothness, bounded gradients):

(1/R) Σ E[||∇L^r_t(Θ^r_(g,t))||^2] ≤ Ψ/√R

That is, the average squared gradient norm is bounded by a term that shrinks as 1/√R, so it converges to 0 as the number of rounds R grows, and the model converges to a stationary point.
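
A bound of the form Ψ/√R also gives a rough round budget: requiring Ψ/√R ≤ ε yields R ≥ (Ψ/ε)². The sketch below works through that arithmetic. Ψ is a problem-dependent constant whose value is not given in this summary, so the numbers used are purely illustrative.

```python
import math


def gradient_bound(psi: float, rounds: int) -> float:
    """Right-hand side of the bound: psi / sqrt(R)."""
    return psi / math.sqrt(rounds)


def rounds_for_tolerance(psi: float, eps: float) -> int:
    """Smallest R with psi / sqrt(R) <= eps, i.e. R >= (psi / eps)^2."""
    return math.ceil((psi / eps) ** 2)


# Illustrative values only: with psi = 2.0, reaching eps = 0.5 takes 16 rounds.
needed = rounds_for_tolerance(2.0, 0.5)
bound_at_needed = gradient_bound(2.0, needed)
```

Halving the tolerance ε quadruples the required number of rounds, which is the usual cost of an O(1/√R) rate.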

Conclusions and Discussion

Main Conclusions

  1. SCALEFL successfully addresses the memory wall problem in FL, enabling resource-constrained devices to participate in training
  2. Curriculum Mentor and Training Harmonizer effectively mitigate core challenges of sequential block-wise training
  3. Achieves significant performance improvements and memory savings across multiple datasets and models

Limitations

  1. Block Partitioning Strategy: The paper does not deeply discuss optimal block partitioning methods
  2. Communication Overhead: While reducing memory usage, it may increase communication rounds
  3. Hyperparameter Sensitivity: Setting λt and γt requires careful tuning

Future Directions

  1. Adaptive block partitioning strategies
  2. Integration with other FL optimization techniques
  3. Validation in larger-scale practical deployments

In-Depth Evaluation

Strengths

  1. Problem Importance: Addresses a critical bottleneck in practical FL deployment
  2. Method Novelty: Curriculum-aware training objectives and the parameter co-adaptation training scheme are original
  3. Theoretical Foundation: Analysis based on information bottleneck theory provides solid theoretical support
  4. Experimental Comprehensiveness: Covers multiple models, datasets, and real hardware testing
  5. Practical Value: Significant memory savings and performance improvements have practical application value

Weaknesses

  1. Complexity: The introduction of two components increases system complexity
  2. Hyperparameter Tuning: Parameters like λt, γt require tuning for different scenarios
  3. Communication Analysis: Lacks detailed analysis of communication overhead
  4. Convergence Speed: While single-round training is faster, total convergence rounds may increase

Impact

  1. Academic Contribution: Provides new insights for resource-constrained FL
  2. Practical Value: Can be deployed in resource-limited environments like mobile devices
  3. Reproducibility: Provides detailed experimental settings and parameter configurations

Applicable Scenarios

  1. Mobile Device FL: Memory-constrained scenarios like smartphones and IoT devices
  2. Edge Computing: Environments with limited edge server resources
  3. Large Model Training: Scenarios requiring training large models with limited device resources

References

The paper cites important works in the FL domain, including FedAvg, HeteroFL, FedRolex and other classical methods, as well as theoretical foundations like information bottleneck theory and HSIC, with comprehensive and authoritative references.


Overall Assessment: This is a high-quality federated learning paper that proposes innovative solutions to critical problems in practical deployment. The method design is sound, experimental validation is comprehensive, theoretical analysis is complete, and it possesses significant academic and practical value.