2025-11-23T16:10:18.050621

Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning

Li, Wang, Xu et al.
Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.

Basic Information

  • Paper ID: 2507.10348
  • Title: Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning
  • Authors: Yichen Li, Xiuying Wang, Wenchao Xu, Haozhao Wang, Yining Qi, Jiahua Dong, Ruixuan Li
  • Classification: cs.LG cs.AI
  • Publication Time/Venue: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
  • Paper Link: https://arxiv.org/abs/2507.10348

Abstract

Model-heterogeneous federated learning (Hetero-FL) has attracted significant attention for its ability to aggregate knowledge from heterogeneous models while preserving data privacy. To better aggregate client knowledge, ensemble distillation has been widely adopted as an effective technique to enhance global model performance after aggregation. However, simply combining Hetero-FL with ensemble distillation does not always yield satisfactory results and may lead to training instability. The root cause lies in the fact that existing methods primarily rely on logit distillation, which, although model-agnostic through softmax predictions, fails to compensate for knowledge bias introduced by heterogeneous models. To address this challenge, this paper proposes FedFD, a stable and efficient feature distillation method that better integrates heterogeneous model knowledge through orthogonal projection-based alignment of feature information.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is how to effectively aggregate knowledge from client models with different architectures in model-heterogeneous federated learning. Traditional federated learning assumes all clients use identical model architectures, but in practical IoT environments, different devices possess varying computational resources and model training capabilities.

Problem Significance

  1. Practical Necessity: The heterogeneity of IoT devices makes a unified model architecture impractical
  2. Resource Maximization: Full utilization of distributed computational resources is required
  3. Privacy Protection: Knowledge sharing must be achieved while protecting data privacy

Limitations of Existing Methods

Through t-SNE visualization analysis and empirical experiments, the authors identify the following problems with existing logit distillation-based methods:

  1. Ambiguous Representations: Aggregated logit representations exhibit ambiguous classification boundaries
  2. Training Instability: Training oscillations occur in heterogeneous model settings
  3. Knowledge Bias: Inability to handle feature space differences introduced by different model architectures

Research Motivation

Based on in-depth analysis of existing method limitations, the authors propose using feature distillation instead of logit distillation, employing orthogonal projection techniques to address bias issues in heterogeneous model knowledge aggregation.

Core Contributions

  1. In-depth Analysis: Provides comprehensive analysis of model-agnostic federated knowledge distillation, identifying limitations of existing methods that primarily rely on logit distillation in heterogeneous model settings
  2. Novel Framework: Proposes the FedFD framework, a plug-and-play personalization enhancement module that inherits privacy protection and efficiency characteristics of traditional distillation methods
  3. Performance Improvement: Extensive experiments across multiple datasets and settings demonstrate improvements up to 16.09% in test accuracy compared to state-of-the-art methods

Methodology Details

Task Definition

Consider a federated learning problem with K clients, where each client k has access only to its local private dataset $D_k = \{x_k^{(i)}, y_k^{(i)}\}$. The objective is to learn a global model w that minimizes the overall empirical loss:

$$\min_w L(w) = \sum_{k=1}^K \frac{|D_k|}{|D|} L_k(w)$$

where $L_k(w) = \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} L_{CE}(w; x_k^{(i)}, y_k^{(i)})$ is the local cross-entropy loss of client k.
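The weighting in the objective above can be illustrated with a few lines of Python. This is only a sketch: the function name `global_loss`, the example losses, and the dataset sizes are made up for illustration, and $|D|$ is assumed to be the sum of the client dataset sizes.

```python
from typing import Sequence


def global_loss(client_losses: Sequence[float], client_sizes: Sequence[int]) -> float:
    """Weighted average of per-client losses L_k(w) with weights |D_k| / |D|."""
    if len(client_losses) != len(client_sizes):
        raise ValueError("one loss and one dataset size per client are required")
    total = sum(client_sizes)  # |D|, assumed here to be the sum of all |D_k|
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    return sum(n / total * loss for loss, n in zip(client_losses, client_sizes))


# Three hypothetical clients with made-up losses and dataset sizes.
example_losses = [0.9, 0.5, 0.7]
example_sizes = [100, 300, 600]
print(global_loss(example_losses, example_sizes))
```

Clients holding more data contribute proportionally more to the global objective, which is why the third client dominates the result in this example.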

Model Architecture

1. Hierarchical Feature Alignment

FedFD first groups client models by architecture. For each distillation sample x, the feature representation extracted by $w_k^d$ is:

$$e_k^d = f(w_k^d; x), \quad \forall k \in [1, K]$$

Features are then divided into m groups $\{S_1^d, \ldots, S_m^d\}$, where each group contains extractors with identical structure. Feature representations within the same group are aggregated:

$$e^d = \frac{1}{|S^d|} \sum_{i=1}^{|S^d|} e_i^d$$
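The grouping and averaging step can be sketched as follows. The helper name `aggregate_by_architecture` and the architecture labels are hypothetical, and the sketch assumes that clients with the same label produce feature vectors of the same dimension.

```python
from collections import defaultdict

import numpy as np


def aggregate_by_architecture(features, architectures):
    """Average the feature vectors of clients that share an architecture label."""
    groups = defaultdict(list)
    for feat, arch in zip(features, architectures):
        groups[arch].append(np.asarray(feat, dtype=float))
    # One aggregated feature per group: the mean over that group's members.
    return {arch: np.mean(np.stack(feats), axis=0) for arch, feats in groups.items()}


# Made-up features for three clients; two share the hypothetical "small" architecture.
client_features = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0, 30.0]]
client_archs = ["small", "small", "large"]
aggregated = aggregate_by_architecture(client_features, client_archs)
print({arch: vec.tolist() for arch, vec in aggregated.items()})
```

Because averaging happens only inside a group, the server needs one projection layer per architecture rather than one per client.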

2. Orthogonal Projection Technique

To address knowledge conflict issues, an orthogonal projection transformation is employed. A projection layer $M_d$ is generated from an antisymmetric matrix $W_d$ (that is, $W_d^T = -W_d$) through the matrix exponential, which is orthogonal:

$$\exp(W_d) \cdot \exp(W_d)^T = \exp(W_d + W_d^T) = \exp(-W_d^T + W_d^T) = I$$

where:

$$\exp(W_d) = I + W_d + \frac{W_d^2}{2!} + \frac{W_d^3}{3!} + \cdots + \frac{W_d^n}{n!}$$
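The orthogonality argument can be checked numerically. The sketch below is not the authors' implementation: building the antisymmetric matrix as $A - A^T$ from an unconstrained matrix $A$ is an assumption, and the names `skew_symmetric` and `matrix_exp_series` are illustrative. The series is truncated at a fixed number of terms, so orthogonality holds only up to numerical tolerance.

```python
import numpy as np


def skew_symmetric(raw):
    """Build an antisymmetric matrix W = A - A^T from an unconstrained square matrix A."""
    raw = np.asarray(raw, dtype=float)
    return raw - raw.T


def matrix_exp_series(w, terms=20):
    """Truncated power series I + W + W^2/2! + ... + W^terms/terms!."""
    result = np.eye(w.shape[0])
    term = np.eye(w.shape[0])
    for n in range(1, terms + 1):
        term = term @ w / n  # W^n / n!, built from the previous term
        result = result + term
    return result


rng = np.random.default_rng(0)
W = skew_symmetric(0.1 * rng.standard_normal((4, 4)))
M = matrix_exp_series(W)

# M @ M.T should be close to the identity because W.T == -W.
print(np.allclose(M @ M.T, np.eye(4)))
```

Parameterizing the layer through `W` keeps the projection orthogonal throughout training, since any update to the unconstrained matrix still yields an antisymmetric `W`.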

3. Feature Distillation Loss

KL divergence is used to align feature representations:

$$\min_{w, \{M_2, \ldots, M_m\}} \frac{1}{m-1} \sum_{i=2}^{m} KL(M_i(w_x), e^i)$$
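A minimal sketch of this loss is given below. Several details are assumptions rather than facts from the paper: features are turned into distributions with a softmax before the KL divergence is taken, each projection is applied as a plain matrix product, and the KL arguments follow the order written in the formula (projected global feature first). All function names are illustrative.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """Mean over the batch of KL(p || q) for row-wise distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))


def feature_distillation_loss(global_feature, projections, group_targets):
    """Average KL between each projected global feature and its group target."""
    losses = []
    for proj, target in zip(projections, group_targets):
        projected = global_feature @ proj  # M_i applied to the global feature
        losses.append(kl_divergence(softmax(projected), softmax(target)))
    return sum(losses) / len(losses)


# Made-up batch of two 3-dimensional global features and one group target.
global_feature = np.array([[0.2, -1.0, 0.5], [1.5, 0.3, -0.7]])
target = np.array([[0.1, -0.8, 0.6], [1.2, 0.4, -0.5]])
print(feature_distillation_loss(global_feature, [np.eye(3)], [target]))
```

In training, the loss would be minimized jointly over the global model and the projection layers, as the formula indicates; this sketch only evaluates it for fixed inputs.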

Technical Innovations

  1. From Logits to Features: First systematically analyzes problems with logit distillation in heterogeneous models, proposing feature distillation as an alternative
  2. Hierarchical Alignment Strategy: Reduces the number of projection layers through architecture grouping, improving training efficiency
  3. Orthogonal Projection Technique: Uses antisymmetric matrices to generate orthogonal projections, resolving knowledge conflicts while maintaining computational efficiency
  4. Modular Design: Seamlessly integrates with existing FL techniques

Experimental Setup

Datasets

  • CIFAR-10: 10-class image classification, 50,000 training samples, 10,000 test samples
  • CIFAR-100: 100-class image classification, 50,000 training samples, 10,000 test samples
  • Tiny-ImageNet: 200-class image classification, larger-scale dataset

Data heterogeneity is simulated using Dirichlet distribution Dir(α), where smaller α values indicate more non-uniform data distribution.
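One common way to realize a Dir(α) split is to draw, for each class, a proportion vector over clients and divide that class's samples accordingly. The sketch below follows that recipe; it is an assumption that the paper uses this exact procedure, and the function name `dirichlet_partition` and the toy labels are illustrative.

```python
import numpy as np


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert cumulative proportions into split points inside this class.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices


# Toy labels: 10 classes with 50 samples each.
toy_labels = np.repeat(np.arange(10), 50)
for alpha in (1.0, 0.1):
    parts = dirichlet_partition(toy_labels, num_clients=5, alpha=alpha, seed=1)
    print(alpha, [len(p) for p in parts])
```

With a smaller α, each class tends to concentrate on fewer clients, which produces the more skewed label distributions described above.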

Evaluation Metrics

  • Test Accuracy: Classification accuracy of global and local models
  • Communication Efficiency: Number of communication rounds required to reach target accuracy
  • Convergence Stability: Analysis of training learning curves
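The communication-efficiency metric can be read as the first round at which test accuracy reaches a target. A small sketch, with a hypothetical helper name and a made-up accuracy curve:

```python
from typing import Optional, Sequence


def rounds_to_target(accuracy_per_round: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based round at which accuracy first reaches the target, else None."""
    for round_idx, acc in enumerate(accuracy_per_round, start=1):
        if acc >= target:
            return round_idx
    return None


# Made-up accuracy curve (percent) over five communication rounds.
curve = [35.0, 52.5, 71.0, 80.4, 83.1]
print(rounds_to_target(curve, 80.0))
print(rounds_to_target(curve, 90.0))
```

Returning `None` when the target is never reached matches reporting such cases as exceeding the round budget.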

Comparison Methods

  1. Classical FL Methods: HeteroFL, MOON-hetero
  2. Homogeneous FL Methods: FedFusion-hetero, FedGen-hetero, DaFKD-hetero
  3. Heterogeneous FL Methods: FedMD, MSFKD, FedGD

Implementation Details

  • Local training rounds E=10, communication rounds T=200, number of clients K=20, participation rate r=0.4
  • Batch size 64, weight decay 1e-4
  • Distillation learning rate 0.01, local training learning rate 0.001
  • Server model uses ResNet-18, client models have 10 different complexity levels
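For reference, the listed hyperparameters can be collected in a small configuration object. The class name `FedFDConfig` and its field names are hypothetical, and the sketch assumes that the participation rate selects round(K · r) clients in each communication round.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FedFDConfig:
    local_rounds: int = 10            # E
    communication_rounds: int = 200   # T
    num_clients: int = 20             # K
    participation_rate: float = 0.4   # r
    batch_size: int = 64
    weight_decay: float = 1e-4
    distill_lr: float = 0.01
    local_lr: float = 0.001
    server_model: str = "ResNet-18"

    @property
    def clients_per_round(self) -> int:
        # Assumption: round(K * r) clients are sampled per communication round.
        return round(self.num_clients * self.participation_rate)


config = FedFDConfig()
print(config.clients_per_round)  # 8
```

Under that assumption, 8 of the 20 clients take part in each of the 200 communication rounds.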

Experimental Results

Main Results

FedFD achieves the best test accuracy (%) across all datasets and settings:

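| Dataset | α | HeteroFL | FedGD | FedFD | Improvement |
|---|---|---|---|---|---|
| CIFAR-10 | 1.0 | 87.53±0.15 | 87.22±0.13 | 89.64±0.23 | 2.11% |
| CIFAR-10 | 0.1 | 78.02±0.65 | 79.31±0.75 | 82.74±0.58 | 3.43% |
| CIFAR-100 | 1.0 | 57.42±0.12 | 58.03±0.26 | 60.86±0.10 | 2.83% |
| Tiny-ImageNet | 1.0 | 29.88±2.72 | 30.66±1.59 | 34.24±1.13 | 4.36% |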

Communication Efficiency

FedFD also demonstrates superior communication efficiency:

  • CIFAR-10 reaching 80% accuracy: FedFD requires 20 rounds, HeteroFL requires 25 rounds
  • CIFAR-100 reaching 60% accuracy: FedFD requires 60 rounds, other methods require 171-200+ rounds

Ablation Study

The ablation study validates the contribution of each component:

  • Removing feature alignment: Performance drops 0.63-1.56%
  • Removing orthogonal projection: Performance drops 1.68-2.43%
  • Removing both components: Significant performance degradation, reverting to FedFusion level

Stability Analysis

Learning curve comparisons reveal:

  • Homogeneous models: All logit distillation methods converge rapidly and stably
  • Heterogeneous models: Logit distillation methods exhibit training oscillations, while FedFD maintains stable convergence

Scalability Experiments

FedFD remains the best-performing method even under more extreme data heterogeneity (α=0.01) and with different combinations of model architectures.

Related Work

Federated Learning

Federated learning has evolved from FedAvg's homogeneous model aggregation to methods that support heterogeneous models, such as HeteroFL, which aggregates partial parameters, and NeFL, which uses nested structures to accommodate different depths.

Knowledge Distillation

Knowledge distillation falls into two major categories: logit distillation and feature distillation. This paper applies feature distillation to federated learning, combining orthogonal projection with ensemble distillation to overcome the limitations of logit-based approaches.

Federated Distillation

Existing methods primarily rely on logit distillation or require additional proxy datasets. This paper analyzes limitations of these methods in heterogeneous model settings.

Conclusions and Discussion

Main Conclusions

  1. Problem Identification: Logit distillation exhibits knowledge bias and training instability issues in heterogeneous model settings
  2. Solution: Feature distillation combined with orthogonal projection effectively addresses heterogeneous model knowledge aggregation problems
  3. Performance Verification: FedFD achieves significant performance improvements across various settings

Limitations

  1. Computational Overhead: Requires maintaining projection layers for different architectures, increasing server-side computational costs
  2. Architecture Dependency: Method effectiveness may depend on the degree of diversity in client model architectures
  3. Distillation Data: Still requires auxiliary datasets for distillation, though it can be combined with data-free methods

Future Directions

  1. Explore completely data-free feature distillation methods
  2. Investigate more efficient projection layer designs
  3. Extend to additional modalities and task types

In-Depth Evaluation

Strengths

  1. Deep Problem Insights: Clearly identifies fundamental problems with existing methods through visualization and empirical analysis
  2. Reasonable Method Design: Orthogonal projection technique addresses knowledge conflicts while maintaining computational efficiency
  3. Comprehensive Experiments: Covers multiple datasets, varying heterogeneity levels, and ablation studies
  4. Strong Engineering Practicality: Modular design enables easy integration into existing FL frameworks

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for why feature distillation outperforms logit distillation
  2. Limited Computational Complexity Analysis: Insufficient detailed analysis of orthogonal projection computational overhead
  3. Limited Large-Scale Validation: Experiments primarily conducted on medium-scale datasets

Impact

  1. Academic Value: Provides new technical pathways for heterogeneous federated learning
  2. Practical Value: Directly applicable to real-world IoT scenarios
  3. Inspirational Significance: Offers new perspectives for knowledge distillation research in federated learning

Applicable Scenarios

  1. IoT Device Federated Learning: Collaborative training among devices with varying computational capabilities
  2. Cross-Institutional Collaboration: Knowledge sharing when different organizations use different model architectures
  3. Edge Computing: Distributed learning in resource-constrained environments

References

This paper cites important works in federated learning, knowledge distillation, and federated distillation domains, including:

  • FedAvg [34]: Foundational work in federated learning
  • HeteroFL [6]: Representative method in heterogeneous federated learning
  • Knowledge distillation related works [14, 15, 44]: Provide theoretical foundations for this paper
  • Federated distillation methods [33, 49, 58]: Direct comparison baselines

This paper presents important innovations in heterogeneous federated learning. Through in-depth analysis of existing method limitations and proposal of effective solutions, it makes valuable contributions to the field's development. The method's modular design and superior experimental results demonstrate strong practical value.