Bridging Memory Gaps: Scaling Federated Learning for Heterogeneous Clients
Wu, Li, Tian et al.
Federated Learning (FL) enables multiple clients to collaboratively train a shared model while preserving data privacy. However, the high memory demand during model training severely limits the deployment of FL on resource-constrained clients. To this end, we propose SCALEFL, a scalable and inclusive FL framework designed to overcome memory limitations through sequential block-wise training. The core idea of SCALEFL is to partition the global model into blocks and train them sequentially, thereby reducing training memory requirements. To mitigate information loss during block-wise training, SCALEFL introduces a Curriculum Mentor that crafts curriculum-aware training objectives for each block to steer their learning process. Moreover, SCALEFL incorporates a Training Harmonizer that designs a parameter co-adaptation training scheme to coordinate block updates, effectively breaking inter-block information isolation. Extensive experiments on both simulation and hardware testbeds demonstrate that SCALEFL significantly improves model performance by up to 84.2%, reduces peak memory usage by up to 50.4%, and accelerates training by up to 1.9×.
Federated Learning (FL) enables multiple clients to collaboratively train a shared model while preserving data privacy. However, high memory requirements during model training severely limit FL deployment on resource-constrained clients. To address this, the paper proposes SCALEFL, a scalable and inclusive FL framework that overcomes memory limitations through sequential block-wise training. The core idea of SCALEFL is to partition the global model into blocks and train them sequentially, thereby reducing training memory requirements. To mitigate information loss in block-wise training, SCALEFL introduces a Curriculum Mentor that formulates curriculum-aware training objectives for each block. Additionally, SCALEFL integrates a Training Harmonizer whose parameter co-adaptation training scheme coordinates block updates and breaks information isolation between blocks.
Memory Wall Problem: Local training in FL must keep all intermediate activations, model weights, and optimizer states in memory, which results in high memory consumption. For example, training ResNet34 on ImageNet consumes over 12 GB of memory, while typical mobile devices have only 4-12 GB of RAM.
Device Heterogeneity: Resource-constrained edge devices cannot participate in local training, preventing their valuable data from contributing to the global model.
Limitations of Existing Methods:
Model Heterogeneous Training: Requires high-quality public datasets for knowledge distillation, which are difficult to obtain in FL
Partial Training: Width scaling breaks the model architecture, while depth scaling is limited by the memory capacity of the most constrained client
As model architectures become deeper and wider to achieve higher analytical capability, the memory problem is further exacerbated. This paper aims to design an FL framework that can significantly reduce memory requirements while maintaining model performance.
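The memory pressure described above can be illustrated with a back-of-envelope estimate. The sketch below is not from the paper: the function name, the default of two optimizer states per parameter (as with Adam), and all input numbers are assumptions chosen only to show how weights, gradients, optimizer states, and stored activations add up.

```python
def training_memory_bytes(num_params, activations_per_sample, batch_size,
                          optimizer_states_per_param=2, bytes_per_value=4):
    """Rough training-memory estimate in bytes (illustrative, not the paper's model)."""
    # One copy each for weights and gradients, plus the optimizer states
    # (assumed 2 per parameter, as for Adam; 1 for SGD with momentum).
    per_param_copies = 1 + 1 + optimizer_states_per_param
    static_bytes = num_params * per_param_copies * bytes_per_value
    # Activations stored for the backward pass scale linearly with batch size.
    activation_bytes = activations_per_sample * batch_size * bytes_per_value
    return static_bytes + activation_bytes


# Hypothetical toy numbers: a full model versus one block holding a quarter of it.
full_model = training_memory_bytes(num_params=1000, activations_per_sample=100, batch_size=10)
one_block = training_memory_bytes(num_params=250, activations_per_sample=25, batch_size=10)
print(full_model, one_block)
```

Under these assumptions, training a single block only needs gradients, optimizer states, and stored activations for that block, which is the intuition behind block-wise training. Real footprints also depend on framework overhead and on the forward pass through frozen blocks, which this estimate ignores.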
Proposes SCALEFL Framework: Significantly reduces training memory requirements through sequential block-wise training, enabling resource-constrained devices to participate effectively
Designs Two Core Components: Curriculum Mentor and Training Harmonizer jointly shape the learning behavior of each block, promoting coherent structured feature learning
Comprehensive Experimental Validation: Demonstrates the effectiveness and robustness of SCALEFL on multiple benchmark datasets
Theoretical Analysis: Provides convergence analysis proving the theoretical reliability of the method
In an FL system containing N clients, each client n possesses a local dataset Dn. The goal is to train a global model Θ while satisfying memory constraints of all clients.
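One way to picture the memory constraint is as a partitioning problem over layers. The following is a minimal sketch, assuming a greedy rule and abstract per-layer memory costs; the summary does not state how SCALEFL actually partitions the model, so `partition_into_blocks`, the cost units, and the budget are all hypothetical.

```python
def partition_into_blocks(layer_costs, budget):
    """Greedily group consecutive layers so each block's training cost fits the budget.

    layer_costs: per-layer training memory cost (hypothetical units).
    budget: memory available on the most constrained client (same units).
    Returns a list of blocks, each a list of layer indices.
    """
    blocks, current, used = [], [], 0
    for idx, cost in enumerate(layer_costs):
        if cost > budget:
            raise ValueError(f"layer {idx} alone exceeds the memory budget")
        # Close the current block when adding this layer would overflow the budget.
        if current and used + cost > budget:
            blocks.append(current)
            current, used = [], 0
        current.append(idx)
        used += cost
    if current:
        blocks.append(current)
    return blocks


layer_costs = [3, 2, 4, 1, 5]
blocks = partition_into_blocks(layer_costs, budget=6)
# Peak cost when training one block at a time, versus training everything at once.
peak_blockwise = max(sum(layer_costs[i] for i in block) for block in blocks)
peak_full = sum(layer_costs)
print(blocks, peak_blockwise, peak_full)
```

In this toy setting the peak training cost falls from the sum over all layers to the cost of the largest block. The sketch treats frozen upstream blocks as free, which is a simplification.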
Problem Analysis: Drawing on information bottleneck theory, the paper finds that sequential block-wise training (SBT) causes severe information loss. A dynamic analysis on the nHSIC plane (nHSIC: normalized Hilbert-Schmidt Independence Criterion) shows that SBT loses substantial input information after training the first block, which prevents subsequent blocks from extracting critical features.
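To make the nHSIC quantity concrete, the sketch below computes a normalized HSIC score between two sets of samples using Gaussian kernels. It is an illustration only: the normalization used here (centered kernel alignment, which divides by the Frobenius norms of the centered kernel matrices) and the fixed bandwidth `sigma` are assumptions, and the paper's exact nHSIC estimator may differ.

```python
import math


def gaussian_kernel(samples, sigma=1.0):
    """Pairwise Gaussian kernel matrix for a list of equal-length vectors."""
    n = len(samples)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
                      / (2 * sigma ** 2))
             for j in range(n)] for i in range(n)]


def center(K):
    """Double-center a kernel matrix, i.e. compute H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - col_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def frobenius_inner(A, B):
    """Sum of element-wise products; equals trace(A B) for symmetric matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def nhsic(xs, zs, sigma=1.0):
    """Normalized HSIC in [0, 1] (alignment-style normalization, an assumption)."""
    Kc = center(gaussian_kernel(xs, sigma))
    Lc = center(gaussian_kernel(zs, sigma))
    denom = math.sqrt(frobenius_inner(Kc, Kc) * frobenius_inner(Lc, Lc))
    return frobenius_inner(Kc, Lc) / denom if denom > 0 else 0.0


inputs = [[0.0], [1.0], [2.0], [3.0]]
collapsed = [[5.0], [5.0], [5.0], [5.0]]  # a representation that discards the input
print(nhsic(inputs, inputs), nhsic(inputs, collapsed))
```

A representation identical to the input scores the maximum, while a representation that maps every input to the same point scores zero, which mirrors the loss of input information described for SBT.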
Solution: Design curriculum-aware training objectives
L_θt = L_CE - λt · nHSIC(X;Zt) - γt · nHSIC(Y;Zt)
Where:
L_CE is cross-entropy loss
nHSIC(X;Zt) measures input information retention
nHSIC(Y;Zt) measures task relevance
λt and γt are dynamically adjusted according to training stage
Strategy: Early stages use higher λt and lower γt to emphasize input information retention, while later stages gradually decrease λt and increase γt to shift toward task-specific feature extraction.
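The objective and the weighting strategy can be sketched as follows. The linear schedule, the `progress` argument, and the default start and end values are assumptions made for illustration; the summary only states that λt decreases and γt increases as training advances.

```python
def curriculum_weights(progress, lam_start=1.0, lam_end=0.2,
                       gamma_start=0.2, gamma_end=1.0):
    """Return (lambda_t, gamma_t) for a training progress in [0, 1].

    Assumed linear interpolation: lambda_t decays (less emphasis on input
    retention) while gamma_t grows (more emphasis on task relevance).
    """
    p = min(max(progress, 0.0), 1.0)  # clamp progress to [0, 1]
    lam = lam_start + (lam_end - lam_start) * p
    gamma = gamma_start + (gamma_end - gamma_start) * p
    return lam, gamma


def block_objective(ce_loss, nhsic_xz, nhsic_yz, lam, gamma):
    """L = L_CE - lambda_t * nHSIC(X;Z_t) - gamma_t * nHSIC(Y;Z_t)."""
    return ce_loss - lam * nhsic_xz - gamma * nhsic_yz


# Same hypothetical measurements, weighted for an early and a late stage.
early = block_objective(1.0, 0.5, 0.5, *curriculum_weights(0.0))
late = block_objective(1.0, 0.5, 0.5, *curriculum_weights(1.0))
print(early, late)
```

Because both nHSIC terms are subtracted, minimizing the objective rewards representations that retain input information and stay relevant to the labels, with the balance between the two set by the schedule.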
Limited Forward Information Flow: Downstream blocks only begin training after upstream blocks converge
Limited Backward Information Flow: Gradients are confined within blocks, causing gradient isolation
Parameter Co-adaptation Scheme:
Dynamic Model Growth: Dynamically orchestrates the learning process of each block in each round, enabling downstream blocks to adapt to upstream block updates in real time
Concurrent Training Strategy: Trains the current block simultaneously with the last few layers of the upstream block, promoting gradient flow
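The concurrent training idea can be sketched as a rule for choosing which layers receive gradients when a given block is active. This is a hypothetical illustration: the helper name `trainable_layers` and the `overlap` parameter (how many trailing upstream layers are unfrozen) are assumptions, and the sketch does not cover the dynamic model growth schedule.

```python
def trainable_layers(blocks, block_idx, overlap=1):
    """Layer indices updated while training blocks[block_idx].

    blocks: list of blocks, each a list of consecutive layer indices.
    overlap: number of trailing layers of the upstream block trained jointly
             with the current block (assumed hyperparameter).
    """
    current = list(blocks[block_idx])
    if block_idx == 0 or overlap <= 0:
        return current  # the first block has no upstream neighbour
    upstream_tail = list(blocks[block_idx - 1][-overlap:])
    # Gradients can now cross the block boundary through the shared layers.
    return upstream_tail + current


blocks = [[0, 1], [2, 3], [4]]
for idx in range(len(blocks)):
    print(idx, trainable_layers(blocks, idx, overlap=1))
```

Unfreezing the tail of the upstream block lets its last layers keep adapting to the downstream objective, at the cost of holding a few extra layers' gradients and activations in memory.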
The paper cites key works in the FL literature, including FedAvg, HeteroFL, and FedRolex, as well as theoretical foundations such as information bottleneck theory and HSIC.
Overall Assessment: This is a high-quality federated learning paper that proposes innovative solutions to critical problems in practical deployment. The method design is sound, experimental validation is comprehensive, theoretical analysis is complete, and it possesses significant academic and practical value.