2025-11-23T16:10:18.050621

Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning

Li, Wang, Xu et al.

Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.


Basic Information

Paper ID: 2507.10348
Title: Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning
Authors: Yichen Li, Xiuying Wang, Wenchao Xu, Haozhao Wang, Yining Qi, Jiahua Dong, Ruixuan Li
Categories: cs.LG, cs.AI
Date/Venue: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Paper link: https://arxiv.org/abs/2507.10348

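Abstract

Model-heterogeneous federated learning (Hetero-FL) has drawn attention because it aggregates knowledge from heterogeneous models while keeping data local and private. To aggregate client knowledge more effectively, ensemble distillation, a widely used and effective technique, is usually applied after global aggregation to strengthen the global model. However, simply combining Hetero-FL with ensemble distillation does not always give good results and can make training unstable. The reason is that existing methods rely mainly on logit distillation, which is model-agnostic through softmax predictions but cannot compensate for the knowledge bias introduced by heterogeneous models. To address this, the paper proposes FedFD, a stable and efficient feature distillation method that incorporates aligned feature information via orthogonal projection to better integrate knowledge from heterogeneous models.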

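Background and Motivation

Problem Definition

The core problem is how to effectively aggregate knowledge from client models with different architectures in model-heterogeneous federated learning. Conventional federated learning assumes that all clients use the same model architecture, but in real IoT environments devices differ in computing resources and in their capacity to train models.

Why the Problem Matters

Real-world need: the heterogeneity of IoT devices makes a single shared architecture unrealistic
Resource utilization: distributed computing resources should be used as fully as possible
Privacy protection: knowledge must be shared while data privacy is preserved

Limitations of Existing Methods

Through t-SNE visualization and empirical experiments, the authors find that existing logit-distillation methods have the following problems:

Ambiguous representations: the aggregated logit representations have blurred classification boundaries
Unstable training: training oscillates in the heterogeneous-model setting
Knowledge bias: differences between the feature spaces of different architectures are not handled

Research Motivation

Based on this analysis of existing methods, the authors propose replacing logit distillation with feature distillation and using orthogonal projection to address the bias that arises when aggregating knowledge from heterogeneous models.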
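
To make the contrast between the two distillation targets concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the function names, the KL form of the logit loss, the mean-squared-error form of the feature loss, and the direction of the projection (global-model features mapped into a client's feature space) are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def logit_distill_loss(student_logits, teacher_logits_list):
    """KL(average teacher softmax || student softmax).

    Model-agnostic: only class probabilities are compared, so any
    architecture with the same label set can act as a teacher.
    """
    teacher_probs = [softmax(t) for t in teacher_logits_list]
    n = len(teacher_probs)
    avg = [sum(p[c] for p in teacher_probs) / n for c in range(len(student_logits))]
    student = softmax(student_logits)
    return sum(p * math.log(p / q) for p, q in zip(avg, student) if p > 0)


def feature_distill_loss(student_feat, teacher_feat, proj):
    """MSE between projected student features and one teacher's features.

    `proj` is a (teacher_dim x student_dim) matrix given as a list of rows;
    it stands in for the per-architecture projection layer (assumed form).
    """
    projected = [sum(w * x for w, x in zip(row, student_feat)) for row in proj]
    return sum((a - b) ** 2 for a, b in zip(projected, teacher_feat)) / len(teacher_feat)
```

The logit loss needs no alignment step because every model outputs scores over the same classes, whereas the feature loss needs a projection because feature dimensions and feature spaces differ between architectures.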

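Core Contributions

In-depth analysis: provides an in-depth analysis of model-agnostic federated knowledge distillation and identifies the limitations, under heterogeneous models, of existing methods' reliance on logit distillation
New framework: proposes FedFD, a plug-and-play personalization-enhancement module that inherits the privacy protection and efficiency of conventional distillation methods
Performance gains: extensive experiments on multiple datasets and settings show test-accuracy improvements of up to 16.09% over state-of-the-art methods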

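Method

Task Definition

Consider a federated learning problem with K clients, where each client k can access only its local private dataset $D_k = \{(x_k^{(i)}, y_k^{(i)})\}_{i=1}^{|D_k|}$. The goal is to learn a global model w that minimizes the overall empirical loss:

$\min_w L(w) = \sum_{k=1}^K \frac{|D_k|}{|D|} L_k(w)$

where $L_k(w) = \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} L_{CE}(w; x_k^{(i)}, y_k^{(i)})$ is the local cross-entropy loss of client k and $|D| = \sum_{k=1}^K |D_k|$ is the total number of samples.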
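
As a worked illustration of the objective above, the following sketch computes the global loss as the data-size-weighted average of client losses. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def local_loss(per_sample_losses):
    """L_k(w): mean per-sample loss on one client's data."""
    return sum(per_sample_losses) / len(per_sample_losses)


def global_loss(client_losses, client_sizes):
    """L(w) = sum_k |D_k| / |D| * L_k(w), with |D| = sum_k |D_k|."""
    total = sum(client_sizes)
    return sum(n / total * loss for loss, n in zip(client_losses, client_sizes))


# Three hypothetical clients: larger clients contribute more to L(w).
losses = [0.5, 1.0, 2.0]
sizes = [100, 300, 600]
print(global_loss(losses, sizes))  # approximately 1.55
```

With these numbers the weights are 0.1, 0.3, and 0.6, so the largest client dominates the objective.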

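From logits to features: the first systematic analysis of the problems of logit distillation under heterogeneous models, with feature distillation proposed as the alternative
Hierarchical alignment strategy: groups clients by architecture to reduce the number of projection layers and improve training efficiency
Orthogonal projection: generates orthogonal projections from skew-symmetric matrices, resolving knowledge conflicts while keeping computation efficient
Modular design: integrates seamlessly with existing FL techniques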
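
The orthogonal-projection item above can be sketched as follows. The summary only says that the projection is generated from a skew-symmetric matrix; the matrix exponential used here is one standard map from skew-symmetric to orthogonal matrices (the Cayley transform is another), and this summary does not say which map the paper uses. The helper names, the architecture keys, and the square 3x3 shape are assumptions for illustration.

```python
def matmul(a, b):
    """Plain list-of-rows matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def skew_symmetric(m):
    """A = M - M^T, so that A^T = -A for any unconstrained square M."""
    n = len(m)
    return [[m[i][j] - m[j][i] for j in range(n)] for i in range(n)]


def expm(a, terms=30):
    """Truncated Taylor series sum_k A^k / k! (adequate for small-norm A)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, a)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def orthogonal_from_raw(m):
    """Re-parameterization: free parameters M -> orthogonal P = exp(M - M^T)."""
    return expm(skew_symmetric(m))


# One projection per client architecture group (keys are hypothetical).
raw_params = {
    "arch_small": [[0.1, 0.5, -0.3], [0.2, 0.0, 0.4], [-0.1, 0.3, 0.2]],
    "arch_large": [[0.0, -0.2, 0.1], [0.4, 0.3, 0.0], [0.2, -0.1, 0.5]],
}
projections = {name: orthogonal_from_raw(m) for name, m in raw_params.items()}
```

Because the exponential of a skew-symmetric matrix is always orthogonal, the raw parameters can be updated by ordinary gradient steps without an explicit orthogonality constraint. In practice client feature dimensions differ, which would call for rectangular (semi-orthogonal) projections rather than the square ones used in this sketch.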

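Experimental Setup

Datasets

CIFAR-10: 10-class image classification, 50,000 training samples, 10,000 test samples
CIFAR-100: 100-class image classification, 50,000 training samples, 10,000 test samples
Tiny-ImageNet: 200-class image classification, a larger-scale dataset

Data heterogeneity is simulated with a Dirichlet distribution Dir(α); a smaller α gives a more skewed data distribution across clients.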
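
A common way to realize a Dir(α) split is to draw, for each class, a vector of client proportions from the Dirichlet distribution and divide that class's samples accordingly. The sketch below follows that recipe with the standard library only; it is an assumed partitioning scheme, not necessarily the exact one used in the paper, and the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dir(alpha) over k outcomes via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Return one list of sample indices per client, split class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = sample_dirichlet(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for c, p in enumerate(props):
            cum += p
            # The last client takes the remainder so every index is assigned once.
            end = len(idxs) if c == num_clients - 1 else int(round(cum * len(idxs)))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Toy labels: 1,000 samples over 10 classes, split across 20 clients.
labels = [i % 10 for i in range(1000)]
skewed = dirichlet_partition(labels, num_clients=20, alpha=0.1)
balanced = dirichlet_partition(labels, num_clients=20, alpha=1.0)
```

With α = 0.1 most of a class tends to land on a few clients, while α = 1.0 spreads each class more evenly.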

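Evaluation Metrics

Test accuracy: classification accuracy of the global model and the local models
Communication efficiency: number of communication rounds needed to reach a target accuracy
Convergence stability: analysis of the learning curves during training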
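
The communication-efficiency metric can be read off an accuracy curve as the first round at which the target is met. A minimal sketch, with a hypothetical function name and made-up accuracy values:

```python
def rounds_to_target(acc_per_round, target):
    """Return the 1-based round at which accuracy first reaches `target`,
    or None if the target is never reached."""
    for round_idx, acc in enumerate(acc_per_round, start=1):
        if acc >= target:
            return round_idx
    return None


# Illustrative curve only; these are not results from the paper.
curve = [50.0, 70.0, 81.0, 85.0]
print(rounds_to_target(curve, 80.0))  # 3
```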

Baselines

Classic FL methods: HeteroFL, MOON-hetero
Homogeneous FL methods: FedFusion-hetero, FedGen-hetero, DaFKD-hetero
Heterogeneous FL methods: FedMD, MSFKD, FedGD

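Implementation Details

Local training epochs E=10, communication rounds T=200, number of clients K=20, participation rate r=0.4
Batch size 64, weight decay 1e-4
Distillation learning rate 0.01, local training learning rate 0.001
The server model is ResNet-18; client models span 10 different complexity levels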
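
The hyperparameters above can be collected in a small configuration object. The class and field names below are hypothetical, and rounding K·r to the nearest integer and sampling clients uniformly without replacement are assumptions; the summary does not state how participating clients are chosen.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FedConfig:
    local_epochs: int = 10           # E
    rounds: int = 200                # T
    num_clients: int = 20            # K
    participation_rate: float = 0.4  # r
    batch_size: int = 64
    weight_decay: float = 1e-4
    distill_lr: float = 0.01
    local_lr: float = 0.001

    @property
    def clients_per_round(self):
        """Number of clients sampled each round (assumed: round(K * r), at least 1)."""
        return max(1, round(self.num_clients * self.participation_rate))


def sample_clients(cfg, rng):
    """Assumed scheme: uniform sampling without replacement."""
    return sorted(rng.sample(range(cfg.num_clients), cfg.clients_per_round))


cfg = FedConfig()
print(cfg.clients_per_round)  # 8
```

With K=20 and r=0.4, eight clients take part in each of the 200 rounds under this reading.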

Experimental Results

Main Results

FedFD achieves the best test accuracy (%) across all datasets and settings:

Dataset	α	HeteroFL	FedGD	FedFD	Improvement
CIFAR-10	1.0	87.53±0.15	87.22±0.13	89.64±0.23	2.11%
CIFAR-10	0.1	78.02±0.65	79.31±0.75	82.74±0.58	3.43%
CIFAR-100	1.0	57.42±0.12	58.03±0.26	60.86±0.10	2.83%
Tiny-ImageNet	1.0	29.88±2.72	30.66±1.59	34.24±1.13	4.36%
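
The improvement column can be recomputed from the accuracies in the table. The sketch below assumes the improvement is the FedFD mean minus one baseline mean, in percentage points, and reports which of the two listed baselines reproduces each reported value; the variable and function names are hypothetical, and other baselines from the paper are not shown in this table.

```python
# (dataset, alpha, HeteroFL, FedGD, FedFD, reported improvement)
rows = [
    ("CIFAR-10", 1.0, 87.53, 87.22, 89.64, 2.11),
    ("CIFAR-10", 0.1, 78.02, 79.31, 82.74, 3.43),
    ("CIFAR-100", 1.0, 57.42, 58.03, 60.86, 2.83),
    ("Tiny-ImageNet", 1.0, 29.88, 30.66, 34.24, 4.36),
]


def matched_baseline(row, tol=1e-9):
    """Name the listed baseline whose margin equals the reported improvement."""
    _, _, heterofl, fedgd, fedfd, reported = row
    margins = {
        "HeteroFL": round(fedfd - heterofl, 2),
        "FedGD": round(fedfd - fedgd, 2),
    }
    return [name for name, m in margins.items() if abs(m - reported) < tol]


for row in rows:
    print(row[0], row[1], matched_baseline(row))
```

Under this reading, the reported values for the second and third rows are margins over FedGD, and those for the first and last rows are margins over HeteroFL. In the Tiny-ImageNet row the margin over FedGD, the stronger of the two listed baselines, is 3.58 points.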