
Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning

Li, Wang, Xu et al.

Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.

Basic Information

Paper ID: 2507.10348
Title: Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning
Authors: Yichen Li, Xiuying Wang, Wenchao Xu, Haozhao Wang, Yining Qi, Jiahua Dong, Ruixuan Li
Classification: cs.LG cs.AI
Publication Time/Venue: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Paper Link: https://arxiv.org/abs/2507.10348

Abstract

Model-heterogeneous federated learning (Hetero-FL) has attracted significant attention for its ability to aggregate knowledge from heterogeneous models while preserving data privacy. To better aggregate client knowledge, ensemble distillation has been widely adopted as an effective technique to enhance global model performance after aggregation. However, simply combining Hetero-FL with ensemble distillation does not always yield satisfactory results and may lead to training instability. The root cause lies in the fact that existing methods primarily rely on logit distillation, which, although model-agnostic through softmax predictions, fails to compensate for knowledge bias introduced by heterogeneous models. To address this challenge, this paper proposes FedFD, a stable and efficient feature distillation method that better integrates heterogeneous model knowledge through orthogonal projection-based alignment of feature information.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is how to effectively aggregate knowledge from client models with different architectures in model-heterogeneous federated learning. Traditional federated learning assumes all clients use identical model architectures, but in practical IoT environments, different devices possess varying computational resources and model training capabilities.

Problem Significance

Practical Necessity: The heterogeneity of IoT devices makes unified model architecture impractical
Resource Maximization: Full utilization of distributed computational resources is required
Privacy Protection: Knowledge sharing must be achieved while protecting data privacy

Limitations of Existing Methods

Through t-SNE visualization analysis and empirical experiments, the authors identify the following problems with existing logit distillation-based methods:

Ambiguous Representations: Aggregated logit representations exhibit ambiguous classification boundaries
Training Instability: Training oscillations occur in heterogeneous model settings
Knowledge Bias: Inability to handle feature space differences introduced by different model architectures

Research Motivation

Based on in-depth analysis of existing method limitations, the authors propose using feature distillation instead of logit distillation, employing orthogonal projection techniques to address bias issues in heterogeneous model knowledge aggregation.

Core Contributions

In-depth Analysis: Provides comprehensive analysis of model-agnostic federated knowledge distillation, identifying limitations of existing methods that primarily rely on logit distillation in heterogeneous model settings
Novel Framework: Proposes the FedFD framework, a plug-and-play personalization enhancement module that inherits privacy protection and efficiency characteristics of traditional distillation methods
Performance Improvement: Extensive experiments across multiple datasets and settings demonstrate improvements up to 16.09% in test accuracy compared to state-of-the-art methods

Methodology Details

Task Definition

Consider a federated learning problem with K clients, where each client k has access only to its local private dataset $D_k = \{(x_k^{(i)}, y_k^{(i)})\}_{i=1}^{|D_k|}$. The objective is to learn a global model w that minimizes the overall empirical loss:

$\min_w L(w) = \sum_{k=1}^K \frac{|D_k|}{|D|} L_k(w)$

where $L_k(w) = \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} L_{CE}(w; x_k^{(i)}, y_k^{(i)})$ is the average cross-entropy loss on client k's data and $|D| = \sum_{k=1}^K |D_k|$ is the total number of samples.
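The weighting in this objective can be illustrated with a minimal sketch. The function name `global_loss` and the three-client numbers are hypothetical, chosen only to show how each client's loss is scaled by its share of the data.

```python
from typing import Sequence


def global_loss(client_losses: Sequence[float], client_sizes: Sequence[int]) -> float:
    """Size-weighted average of per-client losses L_k(w), as in the objective above."""
    if len(client_losses) != len(client_sizes):
        raise ValueError("need one loss and one dataset size per client")
    total = sum(client_sizes)  # |D|, the total number of samples
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    return sum(size / total * loss for loss, size in zip(client_losses, client_sizes))


# Hypothetical example: three clients with unequal dataset sizes.
losses = [0.9, 0.5, 0.7]
sizes = [100, 300, 600]
print(global_loss(losses, sizes))
```

The client with 600 samples contributes 60% of the global loss, so large clients dominate the objective unless the weighting is changed.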

Model Architecture

1. Hierarchical Feature Alignment

FedFD groups client models by architecture before aggregating their features. For each distillation sample x, the feature representation extracted by client k's extractor $w_k^d$ is: $e_k^d = f(w_k^d; x), \forall k \in [1,K]$

The features are then divided into m groups $\{S_1^d, ..., S_m^d\}$, where each group contains the features from extractors with identical structure. Feature representations within the same group $S^d$ are averaged: $e^d = \frac{1}{|S^d|} \sum_{i=1}^{|S^d|} e_i^d$
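The grouping and averaging step can be sketched as follows. This is an illustration only: the architecture labels (`"cnn-small"`, `"resnet-tiny"`) and the function `aggregate_by_architecture` are hypothetical names, and features are plain lists of floats rather than tensors.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def aggregate_by_architecture(
    client_features: Sequence[Tuple[str, Sequence[float]]]
) -> Dict[str, List[float]]:
    """Average features among clients that share the same extractor architecture."""
    groups: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for architecture, feature in client_features:
        groups[architecture].append(feature)

    aggregated: Dict[str, List[float]] = {}
    for architecture, features in groups.items():
        dim = len(features[0])
        if any(len(f) != dim for f in features):
            raise ValueError(f"inconsistent feature size in group {architecture!r}")
        # Element-wise mean over the group: e^d = (1/|S^d|) * sum_i e_i^d
        aggregated[architecture] = [
            sum(f[j] for f in features) / len(features) for j in range(dim)
        ]
    return aggregated


# Hypothetical features for one distillation sample from three clients.
features = [
    ("cnn-small", [1.0, 2.0]),
    ("cnn-small", [3.0, 4.0]),
    ("resnet-tiny", [0.0, 1.0, 2.0]),
]
print(aggregate_by_architecture(features))
```

Averaging only within a group avoids mixing feature vectors of different sizes or semantics, and it means the server needs one projection layer per group rather than one per client.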

2. Orthogonal Projection Technique

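To address knowledge conflicts, an orthogonal projection is employed. The projection layer is parameterized as $M_d = \exp(W_d)$, where $W_d$ is an antisymmetric matrix ($W_d^T = -W_d$). Because $W_d$ commutes with $W_d^T$, the exponentials combine and $M_d$ is orthogonal: $\exp(W_d) \cdot \exp(W_d)^T = \exp(W_d) \cdot \exp(W_d^T) = \exp(W_d + W_d^T) = \exp(-W_d^T + W_d^T) = \exp(0) = I$

where the matrix exponential is the power series $\exp(W_d) = I + W_d + \frac{W_d^2}{2!} + \frac{W_d^3}{3!} + \cdots + \frac{W_d^n}{n!} + \cdots$, which can be approximated by truncating after a finite number of terms.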
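The orthogonality of $\exp(W_d)$ can be checked numerically. The sketch below is not the authors' implementation: it builds an antisymmetric matrix from an arbitrary 3x3 matrix, evaluates a truncated power series in pure Python, and measures how far $M M^T$ is from the identity. The helper names and the choice of 20 series terms are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def skew_symmetric(a: Matrix) -> Matrix:
    """Return A - A^T, which is antisymmetric for any square A."""
    n = len(a)
    return [[a[i][j] - a[j][i] for j in range(n)] for i in range(n)]


def matrix_exp(w: Matrix, terms: int = 20) -> Matrix:
    """Truncated power series I + W + W^2/2! + ... + W^terms/terms!."""
    n = len(w)
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        # term becomes W^k / k! by multiplying the previous term by W / k.
        term = [[value / k for value in row] for row in matmul(term, w)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


raw = [[0.0, 0.3, -0.2], [0.1, 0.0, 0.5], [0.4, -0.1, 0.0]]  # arbitrary free parameters
w = skew_symmetric(raw)
m = matrix_exp(w)
gram = matmul(m, transpose(m))
max_err = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
print(max_err)  # deviation of M M^T from the identity
```

The appeal of this re-parameterization is that the free parameters (`raw` here) are unconstrained, so ordinary gradient descent can update them while the resulting projection stays orthogonal up to truncation error.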

3. Feature Distillation Loss

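KL divergence is used to align the projected global features with each group's aggregated features: $\min_{w,\{M_2,...,M_m\}} \frac{1}{m-1} \sum_{i=2}^m KL(M_i(w_x), e^i)$

where $w_x$ denotes the global model's feature for distillation sample x, $M_i$ is the projection layer maintained for architecture group i, and $e^i$ is that group's aggregated feature.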
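A minimal sketch of this loss follows, under several assumptions that the summary does not confirm: features are turned into probability vectors with a softmax before the KL divergence is taken, the KL arguments follow the order written in the formula (projected global feature first), and every projection is a square matrix so that all features share one dimension. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def softmax(v: Vector) -> List[float]:
    peak = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Vector, q: Vector, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def project(matrix: Matrix, vector: Vector) -> List[float]:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def feature_distillation_loss(
    global_feature: Vector, projections: Sequence[Matrix], group_features: Sequence[Vector]
) -> float:
    """Average KL between each projected global feature and its group's feature."""
    if not projections or len(projections) != len(group_features):
        raise ValueError("need one projection per architecture group")
    total = 0.0
    for m_i, e_i in zip(projections, group_features):
        projected = softmax(project(m_i, global_feature))  # assumed normalization
        target = softmax(e_i)
        total += kl_divergence(projected, target)
    return total / len(projections)


identity = [[1.0, 0.0], [0.0, 1.0]]
rotation = [[0.0, -1.0], [1.0, 0.0]]  # 90-degree rotation, an orthogonal matrix
feature = [2.0, 0.5]
loss_same = feature_distillation_loss(feature, [identity], [feature])
loss_mixed = feature_distillation_loss(feature, [identity, rotation], [feature, feature])
print(loss_same, loss_mixed)
```

The loss vanishes when the projected global feature already matches the group feature and grows as the projection moves it away, which is the signal used to train both the global model and the projection layers.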

Technical Innovations

From Logits to Features: First systematically analyzes problems with logit distillation in heterogeneous models, proposing feature distillation as an alternative
Hierarchical Alignment Strategy: Reduces the number of projection layers through architecture grouping, improving training efficiency
Orthogonal Projection Technique: Uses antisymmetric matrices to generate orthogonal projections, resolving knowledge conflicts while maintaining computational efficiency
Modular Design: Seamlessly integrates with existing FL techniques

Experimental Setup

Datasets

CIFAR-10: 10-class image classification, 50,000 training samples, 10,000 test samples
CIFAR-100: 100-class image classification, 50,000 training samples, 10,000 test samples
Tiny-ImageNet: 200-class image classification, larger-scale dataset

Data heterogeneity is simulated by partitioning the data with a Dirichlet distribution Dir(α); smaller α values produce more skewed, non-uniform data distributions across clients.
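One common way to realize a Dir(α) split is to draw, for each class, a proportion vector over clients and divide that class's samples accordingly. The sketch below assumes this label-wise scheme, which the summary does not spell out, and uses only the standard library; `dirichlet_partition` and `sample_dirichlet` are hypothetical helper names.

```python
import random
from typing import Dict, List, Sequence


def sample_dirichlet(alpha: float, size: int, rng: random.Random) -> List[float]:
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:  # very small alpha can underflow; fall back to a one-hot vector
        winner = rng.randrange(size)
        return [1.0 if i == winner else 0.0 for i in range(size)]
    return [d / total for d in draws]


def dirichlet_partition(
    labels: Sequence[int], num_clients: int, alpha: float, seed: int = 0
) -> List[List[int]]:
    """Assign every sample index to exactly one client, class by class."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        proportions = sample_dirichlet(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for client in range(num_clients):
            cumulative += proportions[client]
            if client == num_clients - 1:
                end = len(indices)  # last client takes whatever remains
            else:
                end = min(len(indices), max(start, round(cumulative * len(indices))))
            clients[client].extend(indices[start:end])
            start = end
    return clients


labels = [i % 10 for i in range(1000)]  # hypothetical balanced 10-class label list
parts = dirichlet_partition(labels, num_clients=5, alpha=0.1, seed=0)
print([len(p) for p in parts])
```

With a small α most of each class tends to land on one or two clients, while a large α spreads every class nearly evenly.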

Evaluation Metrics

Test Accuracy: Classification accuracy of global and local models
Communication Efficiency: Number of communication rounds required to reach target accuracy
Convergence Stability: Analysis of training learning curves

Comparison Methods

Classical FL Methods: HeteroFL, MOON-hetero
Homogeneous FL Methods: FedFusion-hetero, FedGen-hetero, DaFKD-hetero
Heterogeneous FL Methods: FedMD, MSFKD, FedGD

Implementation Details

Local training rounds E=10, communication rounds T=200, number of clients K=20, participation rate r=0.4
Batch size 64, weight decay 1e-4
Distillation learning rate 0.01, local training learning rate 0.001
Server model uses ResNet-18, client models have 10 different complexity levels
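The listed hyperparameters can be collected in a small configuration object. This is only a convenience sketch: the class `FedConfig` and its field names are hypothetical, and the derived quantity assumes that the participation rate is applied to the total number of clients in each round.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FedConfig:
    local_training_rounds: int = 10   # E
    communication_rounds: int = 200   # T
    num_clients: int = 20             # K
    participation_rate: float = 0.4   # r
    batch_size: int = 64
    weight_decay: float = 1e-4
    distill_lr: float = 0.01
    local_lr: float = 0.001

    def clients_per_round(self) -> int:
        """Number of clients sampled per round, assuming K * r rounded to an integer."""
        return max(1, round(self.num_clients * self.participation_rate))


config = FedConfig()
print(config.clients_per_round())
```

Under that assumption, K=20 and r=0.4 correspond to 8 participating clients per communication round.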

Experimental Results

Main Results

FedFD achieves the highest test accuracy (%) among the compared methods across all reported datasets and settings:

Dataset	α	HeteroFL	FedGD	FedFD	Improvement (percentage points)
CIFAR-10	1.0	87.53±0.15	87.22±0.13	89.64±0.23	2.11
CIFAR-10	0.1	78.02±0.65	79.31±0.75	82.74±0.58	3.43
CIFAR-100	1.0	57.42±0.12	58.03±0.26	60.86±0.10	2.83
Tiny-ImageNet	1.0	29.88±2.72	30.66±1.59	34.24±1.13	4.36
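The improvement column can be re-derived from the mean accuracies in the table. The sketch below recomputes FedFD's margin over each of the two listed baselines. In the first three rows the reported improvement equals the margin over the stronger listed baseline; in the Tiny-ImageNet row the reported 4.36 equals the margin over HeteroFL, while the margin over FedGD is 3.58. Only two baselines are shown here, so this is a consistency check on the table's own numbers, not a restatement of the paper's full results.

```python
from typing import List, Tuple

# (dataset, alpha, HeteroFL, FedGD, FedFD, reported improvement), means only
rows: List[Tuple[str, float, float, float, float, float]] = [
    ("CIFAR-10", 1.0, 87.53, 87.22, 89.64, 2.11),
    ("CIFAR-10", 0.1, 78.02, 79.31, 82.74, 3.43),
    ("CIFAR-100", 1.0, 57.42, 58.03, 60.86, 2.83),
    ("Tiny-ImageNet", 1.0, 29.88, 30.66, 34.24, 4.36),
]


def margins(hetero_fl: float, fed_gd: float, fed_fd: float) -> Tuple[float, float]:
    """FedFD's margin over HeteroFL and over FedGD, in percentage points."""
    return round(fed_fd - hetero_fl, 2), round(fed_fd - fed_gd, 2)


for dataset, alpha, hetero_fl, fed_gd, fed_fd, reported in rows:
    over_hetero, over_gd = margins(hetero_fl, fed_gd, fed_fd)
    print(f"{dataset} (alpha={alpha}): vs HeteroFL {over_hetero:.2f}, "
          f"vs FedGD {over_gd:.2f}, reported {reported:.2f}")
```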

Communication Efficiency

FedFD also demonstrates superior communication efficiency:

CIFAR-10 reaching 80% accuracy: FedFD requires 20 rounds, HeteroFL requires 25 rounds
CIFAR-100 reaching 60% accuracy: FedFD requires 60 rounds, other methods require 171-200+ rounds
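The rounds-to-target metric used above can be computed from a per-round accuracy curve as the first round whose accuracy meets the threshold. In this sketch the curve values are made up for illustration and `rounds_to_target` is a hypothetical helper; a method that never reaches the target within the training budget returns `None`, which corresponds to the "200+" entries.

```python
from typing import Optional, Sequence


def rounds_to_target(accuracies: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based round at which accuracy first reaches the target, if ever."""
    for round_index, accuracy in enumerate(accuracies, start=1):
        if accuracy >= target:
            return round_index
    return None


curve = [35.0, 52.5, 68.0, 79.1, 80.4, 81.0]  # hypothetical accuracy per round (%)
print(rounds_to_target(curve, 80.0))
print(rounds_to_target(curve, 90.0))
```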

Ablation Study

Validates the importance of each component:

Removing feature alignment: Performance drops 0.63-1.56%
Removing orthogonal projection: Performance drops 1.68-2.43%
Removing both components: Significant performance degradation, reverting to FedFusion level

Stability Analysis

Learning curve comparisons reveal:

Homogeneous models: All logit distillation methods converge rapidly and stably
Heterogeneous models: Logit distillation methods exhibit training oscillations, while FedFD maintains stable convergence

Scalability Experiments

FedFD remains the best-performing method even under more extreme data heterogeneity (α=0.01) and with different combinations of model architectures.

Related Work

Federated Learning

Federated learning has evolved from FedAvg's aggregation of homogeneous models to methods that support heterogeneous models, such as HeteroFL, which aggregates partial parameters, and NeFL, which uses nested structures to accommodate different depths.

Knowledge Distillation

Knowledge distillation falls into two major categories: logit distillation and feature distillation. This paper applies feature distillation to federated learning, combining orthogonal projection with ensemble distillation to overcome the limitations of logit-based approaches.

Federated Distillation

Existing methods primarily rely on logit distillation or require additional proxy datasets. This paper analyzes limitations of these methods in heterogeneous model settings.

Conclusions and Discussion

Main Conclusions

Problem Identification: Logit distillation exhibits knowledge bias and training instability issues in heterogeneous model settings
Solution: Feature distillation combined with orthogonal projection effectively addresses heterogeneous model knowledge aggregation problems
Performance Verification: FedFD achieves significant performance improvements across various settings

Limitations

Computational Overhead: Requires maintaining projection layers for different architectures, increasing server-side computational costs
Architecture Dependency: Method effectiveness may depend on the degree of diversity in client model architectures
Distillation Data: Still requires auxiliary datasets for distillation, though can be combined with data-free methods

Future Directions

Explore completely data-free feature distillation methods
Investigate more efficient projection layer designs
Extend to additional modalities and task types

In-Depth Evaluation

Strengths

Deep Problem Insights: Clearly identifies fundamental problems with existing methods through visualization and empirical analysis
Reasonable Method Design: Orthogonal projection technique addresses knowledge conflicts while maintaining computational efficiency
Comprehensive Experiments: Covers multiple datasets, varying heterogeneity levels, and ablation studies
Strong Engineering Practicality: Modular design enables easy integration into existing FL frameworks

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for why feature distillation outperforms logit distillation
Limited Computational Complexity Analysis: Insufficient detailed analysis of orthogonal projection computational overhead
Limited Large-Scale Validation: Experiments primarily conducted on medium-scale datasets

Impact

Academic Value: Provides new technical pathways for heterogeneous federated learning
Practical Value: Directly applicable to real-world IoT scenarios
Inspirational Significance: Offers new perspectives for knowledge distillation research in federated learning

Applicable Scenarios

IoT Device Federated Learning: Collaborative training among devices with varying computational capabilities
Cross-Institutional Collaboration: Knowledge sharing when different organizations use different model architectures
Edge Computing: Distributed learning in resource-constrained environments

References

This paper cites important works in federated learning, knowledge distillation, and federated distillation, including:

FedAvg [34]: Foundational work in federated learning
HeteroFL [6]: Representative method in heterogeneous federated learning
Knowledge distillation related works [14, 15, 44]: Provide theoretical foundations for this paper
Federated distillation methods [33, 49, 58]: Direct comparison baselines

This paper presents important innovations in heterogeneous federated learning. Through in-depth analysis of existing method limitations and proposal of effective solutions, it makes valuable contributions to the field's development. The method's modular design and superior experimental results demonstrate strong practical value.