Bridging Memory Gaps: Scaling Federated Learning for Heterogeneous Clients

Wu, Li, Tian et al.

Federated Learning (FL) enables multiple clients to collaboratively train a shared model while preserving data privacy. However, the high memory demand during model training severely limits the deployment of FL on resource-constrained clients. To this end, we propose SCALEFL, a scalable and inclusive FL framework designed to overcome memory limitations through sequential block-wise training. The core idea of SCALEFL is to partition the global model into blocks and train them sequentially, thereby reducing training memory requirements. To mitigate information loss during block-wise training, SCALEFL introduces a Curriculum Mentor that crafts curriculum-aware training objectives for each block to steer their learning process. Moreover, SCALEFL incorporates a Training Harmonizer that designs a parameter co-adaptation training scheme to coordinate block updates, effectively breaking inter-block information isolation. Extensive experiments on both simulation and hardware testbeds demonstrate that SCALEFL significantly improves model performance by up to 84.2%, reduces peak memory usage by up to 50.4%, and accelerates training by up to 1.9×.

Basic Information

Paper ID: 2408.10826
Title: Bridging Memory Gaps: Scaling Federated Learning for Heterogeneous Clients
Authors: Yebo Wu, Jingguang Li, Chunlin Tian, KaHou Tam, Li Li, Chengzhong Xu (University of Macau)
Category: cs.DC (Distributed, Parallel, and Cluster Computing)
Publication Date: August 2024 (arXiv v2: October 2025)
Paper Link: https://arxiv.org/abs/2408.10826v2

Abstract

Federated Learning (FL) enables multiple clients to collaboratively train a shared model while preserving data privacy. However, high memory requirements during model training severely limit FL deployment on resource-constrained clients. To address this, the paper proposes SCALEFL, a scalable and inclusive FL framework that overcomes memory limitations through sequential block-wise training. The core idea of SCALEFL is to partition the global model into blocks and train them sequentially, thereby reducing training memory requirements. To mitigate information loss in block-wise training, SCALEFL introduces a Curriculum Mentor that formulates curriculum-aware training objectives for each block. Additionally, SCALEFL integrates a Training Harmonizer that uses a parameter co-adaptation training scheme to coordinate block updates and break information isolation between blocks.

Research Background and Motivation

Core Problems

Memory Wall Problem: Local training in FL must keep all intermediate activations, model weights, and optimizer states in memory, resulting in high memory consumption. For example, training ResNet34 on ImageNet consumes over 12GB of memory, while typical mobile devices have only 4-12GB of RAM.
Device Heterogeneity: Resource-constrained edge devices cannot participate in local training, preventing their valuable data from contributing to the global model.
Limitations of Existing Methods:
- Model Heterogeneous Training: Requires high-quality public datasets for knowledge distillation, which are difficult to obtain in FL
- Partial Training: Width scaling breaks the model architecture, while depth scaling is limited by the memory capacity of the most constrained client

Research Motivation

As model architectures become deeper and wider to achieve higher analytical capability, the memory problem is further exacerbated. This paper aims to design an FL framework that can significantly reduce memory requirements while maintaining model performance.

Core Contributions

Proposes SCALEFL Framework: Significantly reduces training memory requirements through sequential block-wise training, enabling resource-constrained devices to participate effectively
Designs Two Core Components: Curriculum Mentor and Training Harmonizer jointly shape the learning behavior of each block, promoting coherent structured feature learning
Comprehensive Experimental Validation: Demonstrates the effectiveness and robustness of SCALEFL on multiple benchmark datasets
Theoretical Analysis: Provides convergence analysis proving the theoretical reliability of the method

Method Details

Task Definition

In an FL system containing N clients, each client n holds a local dataset D_n. The goal is to train a global model Θ while satisfying the memory constraint of every client.

Sequential Block-wise Training Paradigm

Basic Procedure:

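Model Construction: Server constructs a sub-model Θ_(g,t) = [θ_(1,F), θ_(2,F), ..., θt, θOp] for the current training stage t, where the subscript F marks blocks frozen in earlier stages and θOp is the output module
Local Training: Clients update only the active block θt and the output module θOp; the frozen blocks are left unchanged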
Model Aggregation: Aggregates parameter updates using weighted averaging
Progress Evaluation: Monitors training progress of block θt and determines convergence
Model Growth: Freezes converged blocks and introduces new blocks
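
The aggregation, progress-evaluation, and growth steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the flat-list parameter representation, and the patience-based convergence test are all assumptions made for the example, since the summary does not specify how convergence of a block is detected.

```python
from typing import List, Tuple


def weighted_average(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Aggregate client parameter vectors, weighting each by its sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(params[i] * n for params, n in updates) / total for i in range(dim)]


def has_converged(history: List[float], patience: int = 3, tol: float = 1e-3) -> bool:
    """Hypothetical criterion: loss improved by less than tol over the last `patience` rounds."""
    if len(history) <= patience:
        return False
    return history[-patience - 1] - history[-1] < tol


def next_stage(
    stage: int, frozen: List[int], history: List[float], num_blocks: int
) -> Tuple[int, List[int], List[float]]:
    """Freeze the active block once it has converged and grow the model by one block."""
    if has_converged(history) and stage < num_blocks:
        # The converged block joins the frozen set; the loss history restarts.
        return stage + 1, frozen + [stage], []
    return stage, frozen, history
```

For example, `next_stage(1, [], [1.0, 0.5, 0.5, 0.5, 0.5], num_blocks=4)` moves training to block 2 with block 1 frozen, while the last block is never "grown" past.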

Core Technical Components

1. Curriculum Mentor

Problem Analysis: Drawing on information bottleneck theory, the paper finds that sequential block-wise training (SBT) causes severe information loss. Dynamic analysis on the nHSIC plane (nHSIC: normalized Hilbert-Schmidt Independence Criterion) shows that SBT loses substantial input information after training the first block, preventing subsequent blocks from extracting critical features.

Solution: Design curriculum-aware training objectives

L_θt = L_CE - λt · nHSIC(X;Zt) - γt · nHSIC(Y;Zt)

Where:

X is the input, Y is the label, and Zt is the representation produced by block t
L_CE is the cross-entropy loss
nHSIC(X;Zt) measures input information retention
nHSIC(Y;Zt) measures task relevance
λt and γt are weights adjusted dynamically according to the training stage

Strategy: Early stages use higher λt and lower γt to emphasize input information retention, while later stages gradually decrease λt and increase γt to shift toward task-specific feature extraction.
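
As a rough illustration of this strategy, the sketch below pairs a stage-dependent schedule with the objective L_θt. The linear schedule, the bounds `lam_max` and `gamma_max`, and the function names are assumptions made for the example; the summary only states that λt decreases and γt increases over stages. The nHSIC values are passed in as precomputed numbers rather than estimated here.

```python
from typing import Tuple


def curriculum_weights(
    stage: int, num_stages: int, lam_max: float = 1.0, gamma_max: float = 1.0
) -> Tuple[float, float]:
    """Assumed linear schedule: lambda_t decays and gamma_t grows with the stage index (1-based)."""
    progress = 0.0 if num_stages == 1 else (stage - 1) / (num_stages - 1)
    lam = lam_max * (1.0 - progress)   # emphasis on retaining input information
    gamma = gamma_max * progress       # emphasis on task-relevant features
    return lam, gamma


def block_objective(
    ce_loss: float, nhsic_xz: float, nhsic_yz: float, lam: float, gamma: float
) -> float:
    """L_theta_t = L_CE - lambda_t * nHSIC(X;Z_t) - gamma_t * nHSIC(Y;Z_t)."""
    return ce_loss - lam * nhsic_xz - gamma * nhsic_yz
```

Because both nHSIC terms enter with a minus sign, minimizing the objective rewards representations that stay dependent on the input and on the label, with the balance set by the stage.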

2. Training Harmonizer

Problem Identification:

Limited Forward Information Flow: Downstream blocks only begin training after upstream blocks converge
Limited Backward Information Flow: Gradients are confined within blocks, causing gradient isolation

Parameter Co-adaptation Scheme:

Dynamic Model Growth: Dynamically orchestrates the learning process of each block in each round, enabling downstream blocks to adapt to upstream block updates in real time
Concurrent Training Strategy: Trains the current block together with the last few layers of the upstream block, so that gradients flow across the block boundary

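Update formula (client n, stage t, local step k):

θ^(k+1)_(n,t) + L^(k+1)_(n,t-1) ← (θ^k_(n,t) + L^k_(n,t-1)) - η · ∂L^k_(n,t)/∂(θ^k_(n,t) + L^k_(n,t-1))

Here θ_(n,t) is the current block, L_(n,t-1) denotes the last few layers of the upstream block t-1, the L^k_(n,t) in the numerator is the training loss, and η is the learning rate. The "+" joins the two parameter sets into one group that is updated together; it is not numeric addition.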
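
A minimal sketch of one such joint step follows, assuming for simplicity that every layer is a single float and that gradients are supplied by the caller. The function name and argument layout are hypothetical. The point it illustrates is that the head of the upstream block stays frozen while its last `num_tail` layers move together with the current block.

```python
from typing import List, Tuple


def concurrent_sgd_step(
    upstream: List[float],
    current: List[float],
    grad_tail: List[float],
    grad_current: List[float],
    num_tail: int,
    lr: float,
) -> Tuple[List[float], List[float]]:
    """One SGD step over the current block plus the last `num_tail` upstream layers."""
    if not 1 <= num_tail <= len(upstream):
        raise ValueError("num_tail must be between 1 and len(upstream)")
    head, tail = upstream[:-num_tail], upstream[-num_tail:]
    # Only the tail of the upstream block is trainable; the head stays frozen.
    new_tail = [p - lr * g for p, g in zip(tail, grad_tail)]
    new_current = [p - lr * g for p, g in zip(current, grad_current)]
    return head + new_tail, new_current
```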

Complete Training Objective

An L2 regularization term that penalizes drift from the previous round's parameters is added to handle data heterogeneity:

L^r_t = L_θt + (μ/2)||θ^r_t - θ^(r-1)_t||^2_2
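
The regularized objective is simple enough to compute by hand. The sketch below does so for flat parameter lists; the function name is hypothetical, and reading θ^(r-1)_t as the block's parameters from the previous round is an interpretation of the formula above rather than something the summary states explicitly.

```python
from typing import List


def regularized_objective(
    block_loss: float, theta: List[float], theta_prev: List[float], mu: float
) -> float:
    """L^r_t = L_theta_t + (mu / 2) * ||theta^r_t - theta^(r-1)_t||_2^2."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(theta, theta_prev))
    return block_loss + 0.5 * mu * squared_distance
```

With `block_loss=1.0`, `theta=[1.0, 3.0]`, `theta_prev=[0.0, 1.0]`, and `mu=0.5`, the squared distance is 5 and the penalty adds 1.25 to the loss. A larger μ keeps local updates closer to the previous round's parameters.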

Experimental Setup

Datasets

CIFAR10/CIFAR100: Classic image classification datasets
CINIC10: Extended version of CIFAR10
Mini-ImageNet: Small-scale ImageNet
FEMNIST: Large-scale FL dataset (805,263 images)

Model Architectures

ResNet18/ResNet34: Deep residual networks
VGG11_BN: Classic convolutional network (VGG11 with batch normalization)
SqueezeNet: Lightweight network
Vision Transformer (ViT): Transformer architecture

Experimental Environment

Hybrid Setup: Simulation and real device testbed
Device Configuration: 100 heterogeneous mobile devices, 10% randomly selected per round
Memory Budget: 100-1000MB randomly allocated
Optimizer: SGD, weight decay 5e-4, local epoch=5
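
For readers who want to reproduce a comparable simulation, the client pool described above might be set up as follows. This is only a sketch: the summary says budgets are "randomly allocated" between 100 and 1000MB but does not name the distribution, so the uniform integer draw, the seeds, and the function names are assumptions.

```python
import random
from typing import Dict, List


def make_clients(
    num_clients: int = 100, low_mb: int = 100, high_mb: int = 1000, seed: int = 0
) -> Dict[int, int]:
    """Assign each client a memory budget in MB (assumed uniform over [low_mb, high_mb])."""
    rng = random.Random(seed)
    return {cid: rng.randint(low_mb, high_mb) for cid in range(num_clients)}


def sample_round(clients: Dict[int, int], fraction: float = 0.1, seed: int = 0) -> List[int]:
    """Pick the participating clients for one round, without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(clients) * fraction))
    return rng.sample(sorted(clients), k)
```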

Baseline Methods

AllSmall: Scales down global model based on weakest device memory
ExclusiveFL: Only allows devices with sufficient memory to participate
DepthFL: Depth scaling to adapt to heterogeneous devices
HeteroFL: Static channel scaling
FedRolex: Dynamic width scaling
SmartFreeze: Simple sequential block-wise training
ProFL: Decomposed sequential training

Experimental Results

Main Results

Performance in Non-IID Scenarios:

Method	CIFAR10 Accuracy (ResNet18 / VGG11 / SqueezeNet)	Participation Rate (ResNet18 / VGG11 / SqueezeNet)
AllSmall	69.5% / 75.1% / 49.6%	100% / 100% / 100%
ExclusiveFL	76.8% / 79.3% / 40.6%	18% / 22% / 11%
SCALEFL	80.4% / 87.6% / 58.0%	100% / 100% / 100%

Key Findings:

Significant Performance Improvement: Accuracy gains of 10.9, 12.5, and 8.4 percentage points over AllSmall on ResNet18, VGG11, and SqueezeNet respectively
Full Device Participation: Achieves a 100% device participation rate, while ExclusiveFL reaches only 11-22%
Memory Efficiency: Peak memory usage reduced by up to 50.4%
Training Acceleration: Training accelerated by up to 1.9x
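
The accuracy gains quoted above follow directly from the table. The short script below recomputes them as differences in percentage points; the numbers are copied from the table, and only the dictionary layout and function name are introduced here.

```python
from typing import Dict

# CIFAR10 accuracy (%) in the Non-IID setting, copied from the table above.
accuracy: Dict[str, Dict[str, float]] = {
    "AllSmall": {"ResNet18": 69.5, "VGG11": 75.1, "SqueezeNet": 49.6},
    "ExclusiveFL": {"ResNet18": 76.8, "VGG11": 79.3, "SqueezeNet": 40.6},
    "SCALEFL": {"ResNet18": 80.4, "VGG11": 87.6, "SqueezeNet": 58.0},
}


def gain_over(baseline: str, method: str = "SCALEFL") -> Dict[str, float]:
    """Accuracy difference in percentage points, rounded to one decimal place."""
    return {
        model: round(accuracy[method][model] - accuracy[baseline][model], 1)
        for model in accuracy[method]
    }


print(gain_over("AllSmall"))     # {'ResNet18': 10.9, 'VGG11': 12.5, 'SqueezeNet': 8.4}
print(gain_over("ExclusiveFL"))  # {'ResNet18': 3.6, 'VGG11': 8.3, 'SqueezeNet': 17.4}
```

The gap to ExclusiveFL is largest on SqueezeNet, where ExclusiveFL also has its lowest participation rate (11%).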

Scalability Analysis

Robustness Under Different Memory Constraints:

ExclusiveFL is completely infeasible in the ResNet34 scenario (0% participation rate), since no device has enough memory to train the full model
SCALEFL achieves up to 27.4% improvement over the other methods

Large-scale Datasets:

3% accuracy improvement over FedAvg on FEMNIST dataset
Supports 120-500 device scale

Transformer Compatibility:

On the ViT model, accuracy is only 2% below the theoretical baseline, which is itself impractical to deploy

Hardware Evaluation

Memory Efficiency:

Testing on Jetson TX2 shows 50.4% reduction in peak memory usage
Single-round training time reduced by 1.84-2.31x

Training Efficiency:

Significantly reduced single-round training time compared to end-to-end training
Achieved 1.9x acceleration on ViT

Ablation Study

Component Contribution Analysis:

Removing Curriculum Mentor: 1.2% accuracy drop in CIFAR100 IID scenario
Removing Training Harmonizer: Significant 9.0% accuracy drop
Synergistic effect of both components is critical for performance

Related Work

Resource-Constrained FL

Model Heterogeneous Training: Methods like FedMD require public datasets for knowledge distillation
Partial Training: HeteroFL, FedRolex use width scaling; DepthFL, InclusiveFL use depth scaling

Block-wise Training

ProgFed: Progressively introduces new blocks but still requires end-to-end training
SmartFreeze: Sequentially trains each block but ignores information loss
ProFL: Decomposes into shrinking and growth stages but fails to address core challenges

Theoretical Analysis

Convergence Proof

The paper provides a convergence analysis for SCALEFL. Under standard assumptions (smoothness, bounded gradients), it shows:

(1/R) Σ E[||∇L^r_t(Θ^r_(g,t))||^2] ≤ Ψ/√R

That is, the expected squared gradient norm, averaged over R rounds, is bounded by a term that shrinks as 1/√R and tends to 0 as R grows, so training converges to a stationary point.
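
To make the bound concrete: requiring Ψ/√R ≤ ε is equivalent to R ≥ (Ψ/ε)², so halving the target tolerance quadruples the number of rounds. The helper below evaluates this, treating Ψ as a constant that does not depend on R. The value of Ψ is not given in this summary, so the numbers in the usage note are purely hypothetical.

```python
import math


def gradient_bound(psi: float, num_rounds: int) -> float:
    """Right-hand side of the bound: psi / sqrt(R)."""
    return psi / math.sqrt(num_rounds)


def rounds_for_tolerance(psi: float, eps: float) -> int:
    """Smallest R with psi / sqrt(R) <= eps, i.e. R >= (psi / eps) ** 2."""
    return math.ceil((psi / eps) ** 2)
```

With a hypothetical Ψ = 2.0, a tolerance of 0.5 needs 16 rounds and a tolerance of 0.25 needs 64.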

Conclusions and Discussion

Main Conclusions

SCALEFL successfully addresses the memory wall problem in FL, enabling resource-constrained devices to participate in training
Curriculum Mentor and Training Harmonizer effectively mitigate core challenges of sequential block-wise training
Achieves significant performance improvements and memory savings across multiple datasets and models

Limitations

Block Partitioning Strategy: The paper does not deeply discuss optimal block partitioning methods
Communication Overhead: While reducing memory usage, it may increase communication rounds
Hyperparameter Sensitivity: Setting λt and γt requires careful tuning

Future Directions

Adaptive block partitioning strategies
Integration with other FL optimization techniques
Validation in larger-scale practical deployments

In-Depth Evaluation

Strengths

Problem Importance: Addresses a critical bottleneck in practical FL deployment
Method Novelty: The curriculum-aware training objectives and the parameter co-adaptation scheme are original
Theoretical Foundation: Analysis based on information bottleneck theory provides solid theoretical support
Experimental Comprehensiveness: Covers multiple models, datasets, and real hardware testing
Practical Value: Significant memory savings and performance improvements have practical application value

Weaknesses

Complexity: The introduction of two components increases system complexity
Hyperparameter Tuning: Parameters like λt, γt require tuning for different scenarios
Communication Analysis: Lacks detailed analysis of communication overhead
Convergence Speed: While single-round training is faster, total convergence rounds may increase

Impact

Academic Contribution: Provides new insights for resource-constrained FL
Practical Value: Can be deployed in resource-limited environments like mobile devices
Reproducibility: Provides detailed experimental settings and parameter configurations

Applicable Scenarios

Mobile Device FL: Memory-constrained scenarios like smartphones and IoT devices
Edge Computing: Environments with limited edge server resources
Large Model Training: Scenarios requiring training large models with limited device resources

References

The paper cites key works in the FL domain, including FedAvg, HeteroFL, and FedRolex, as well as theoretical foundations such as information bottleneck theory and HSIC.

Overall Assessment: This is a high-quality federated learning paper that proposes innovative solutions to critical problems in practical deployment. The method design is sound, experimental validation is comprehensive, theoretical analysis is complete, and it possesses significant academic and practical value.