Uncertainty-Aware Dual-Student Knowledge Distillation for Efficient Image Classification

Gore, Dey, Mishra

Knowledge distillation has emerged as a powerful technique for model compression, enabling the transfer of knowledge from large teacher networks to compact student models. However, traditional knowledge distillation methods treat all teacher predictions equally, regardless of the teacher's confidence in those predictions. This paper proposes an uncertainty-aware dual-student knowledge distillation framework that leverages teacher prediction uncertainty to selectively guide student learning. We introduce a peer-learning mechanism where two heterogeneous student architectures, specifically ResNet-18 and MobileNetV2, learn collaboratively from both the teacher network and each other. Experimental results on ImageNet-100 demonstrate that our approach achieves superior performance compared to baseline knowledge distillation methods, with ResNet-18 achieving 83.84% top-1 accuracy and MobileNetV2 achieving 81.46% top-1 accuracy, representing improvements of 2.04% and 0.92% respectively over traditional single-student distillation approaches.

Basic Information

Paper ID: 2511.18826
Title: Uncertainty-Aware Dual-Student Knowledge Distillation for Efficient Image Classification
Authors: Aakash Gore, Anoushka Dey, Aryan Mishra (Indian Institute of Technology Bombay)
Classification: cs.CV, cs.LG
Publication Date: November 24, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.18826

Abstract

Knowledge distillation has emerged as a powerful technique for model compression, enabling the transfer of knowledge from large teacher networks to compact student models. However, conventional knowledge distillation methods treat all teacher predictions uniformly, overlooking the varying confidence levels of the teacher across different predictions. This paper proposes an uncertainty-aware dual-student knowledge distillation framework that leverages the uncertainty in teacher predictions to selectively guide student learning. A peer learning mechanism is introduced, enabling two heterogeneous student architectures (ResNet-18 and MobileNetV2) to learn collaboratively from the teacher network and each other. Experimental results on ImageNet-100 demonstrate that the proposed method outperforms baseline knowledge distillation approaches, achieving 83.84% top-1 accuracy for ResNet-18 and 81.46% top-1 accuracy for MobileNetV2, representing improvements of 2.04% and 0.92% respectively over conventional single-student distillation methods.

Research Background and Motivation

1. Problem Statement

Deep neural networks have achieved remarkable success in computer vision tasks, yet their deployment on resource-constrained devices remains challenging. This paper addresses:

Blindness of conventional knowledge distillation: Existing methods assign equal weight to all teacher predictions, ignoring confidence variations across different samples
Limitations of single-student models: A single student model cannot fully exploit complementary advantages of multiple architectures
Negative knowledge transfer: Uncertain predictions from the teacher may mislead student learning

2. Problem Significance

With the growing demand for complex machine learning models on edge devices, mobile platforms, and embedded systems, model compression has become critical. Knowledge distillation, as a core technique, directly impacts the feasibility of practical deployment.

3. Limitations of Existing Methods

Uniform treatment: Traditional methods (e.g., Hinton et al.'s original KD) apply the same temperature and loss weight to every teacher prediction, without considering prediction reliability
Unidirectional knowledge flow: Knowledge transfer only flows from teacher to students, failing to fully exploit synergistic potential among multiple students
Neglecting uncertainty: High-entropy predictions from teachers at decision boundaries or on ambiguous samples may contain misleading information

4. Research Motivation

Key observations:

Teacher models exhibit significant confidence variations across different samples
High-entropy (uncertain) predictions may contain contradictory information and should have reduced influence
Heterogeneous student architectures can learn complementary representations and mutually enhance each other through peer learning

Core Contributions

Uncertainty-Aware Distillation Framework: Proposes a mechanism that dynamically adjusts teacher guidance weights based on prediction entropy, enabling students to prioritize learning from high-confidence predictions while maintaining robustness through hard label supervision
Dual-Student Peer Learning Architecture: Introduces a collaborative learning mechanism between two heterogeneous models (ResNet-18 and MobileNetV2), enabling mutual knowledge exchange and complementary feature learning
Significant Improvements on ImageNet-100: Validates method effectiveness across student architectures of different capacities and design principles, achieving 2.04% improvement for ResNet-18 and 0.92% for MobileNetV2
In-Depth Analysis of Teacher Confidence Patterns: Provides mechanistic insights into how uncertainty-aware distillation improves performance through detailed ablation studies validating individual component contributions

Method Details

Task Definition

Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathbb{R}^{H \times W \times 3}$ is an input image and $y_i \in \{1, \dots, C\}$ is the ground truth label, the objective is to:

Utilize a pre-trained frozen teacher network $T(\theta_T)$
Simultaneously train two heterogeneous student networks $S_1(\theta_{S1})$ and $S_2(\theta_{S2})$
Achieve high classification accuracy while maintaining significantly lower computational cost

Model Architecture

1. Overall Framework Design

The framework comprises three core components:

Teacher Network: Pre-trained ResNet-50 (25.6M parameters) with frozen weights, serving as the knowledge source
Student 1: ResNet-18 (11.7M parameters), compression ratio 2.19×
Student 2: MobileNetV2 (3.5M parameters), compression ratio 7.31×

2. Uncertainty Estimation Module

For input $x$, the teacher produces logits $z_T = T(x)$, and prediction entropy is computed as the uncertainty measure:

$H(x) = -\sum_{c=1}^{C} p_c \log p_c$

where $p_c = \frac{\exp(z_c^T)}{\sum_{j=1}^C \exp(z_j^T)}$ is the teacher's softmax probability for class $c$ (the superscript $T$ denotes the teacher, not a transpose).

Normalized entropy yields confidence weight:

$w(x) = 1 - \frac{H(x)}{\log C}$

where $\log C$ is the maximum possible entropy for $C$ classes. High-confidence predictions (low entropy) produce $w(x) \approx 1$, while uncertain predictions (high entropy) produce $w(x) \approx 0$.
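The entropy and confidence-weight formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names and the example logits are hypothetical, and the natural logarithm is assumed so that the maximum entropy is $\log C$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy H = -sum_c p_c log p_c (0 log 0 is treated as 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_weight(teacher_logits):
    """w(x) = 1 - H(x) / log C, which lies in [0, 1]."""
    probs = softmax(teacher_logits)
    return 1.0 - entropy(probs) / math.log(len(probs))

# Hypothetical teacher logits for a 4-class problem (illustrative values only).
confident_logits = [10.0, 0.0, 0.0, 0.0]  # sharply peaked distribution
uncertain_logits = [0.0, 0.0, 0.0, 0.0]   # uniform distribution

w_confident = confidence_weight(confident_logits)  # close to 1
w_uncertain = confidence_weight(uncertain_logits)  # 0 up to rounding
```

A peaked teacher distribution keeps almost the full distillation weight, while a uniform one contributes essentially nothing to the teacher term.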

3. Loss Function Design

The total loss for student $S_i$ ($i \in \{1, 2\}$) is a weighted combination of three complementary learning objectives:

$\mathcal{L}_{S_i} = \alpha \mathcal{L}_{\text{hard}} + \beta \mathcal{L}_{\text{teacher}} + \gamma \mathcal{L}_{\text{peer}}$

Hard Label Loss (maintaining ground truth supervision): $\mathcal{L}_{\text{hard}} = \text{CE}(S_i(x), y)$

Uncertainty-Weighted Teacher Loss (selective knowledge transfer): $\mathcal{L}_{\text{teacher}} = w(x) \cdot \tau^2 \cdot \text{KL}(q_{S_i}^\tau \| p_T^\tau)$

where $q_{S_i}^\tau$ and $p_T^\tau$ are the temperature-scaled softmax distributions of student $S_i$ and the teacher at temperature $\tau$, and the factor $\tau^2$ compensates for the $1/\tau^2$ shrinkage of soft-target gradients introduced by temperature scaling.

Peer Learning Loss (knowledge exchange between students): $\mathcal{L}_{\text{peer}} = \tau^2 \cdot \text{KL}(q_{S_i}^\tau \| q_{S_j}^\tau)$

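where $j \neq i$ denotes the peer student. The peer's distribution $q_{S_j}^\tau$ is detached from the computation graph, so gradients from $\mathcal{L}_{\text{peer}}$ update only $\theta_{S_i}$ and no circular dependency arises between the two students.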
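To make the three-term objective concrete, the following sketch evaluates it for a single sample in plain Python. It follows the formulas as written in this summary, including the KL argument order $\text{KL}(q \| p)$ and an untempered softmax for the entropy; the helper names and example logits are hypothetical, and a real implementation would use an autodiff framework and batch averaging.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled, numerically stable softmax."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(q, p):
    """KL(q || p) = sum_c q_c log(q_c / p_c), argument order as in the text."""
    return sum(qc * math.log(qc / pc) for qc, pc in zip(q, p) if qc > 0.0)

def cross_entropy(logits, label):
    """Hard-label loss CE(S_i(x), y) for one sample."""
    return -math.log(softmax(logits)[label])

def student_loss(student_logits, peer_logits, teacher_logits, label,
                 alpha=0.4, beta=0.4, gamma=0.2, tau=4.0):
    # Confidence weight w(x) from the teacher's untempered prediction entropy.
    p_teacher = softmax(teacher_logits)
    h = -sum(p * math.log(p) for p in p_teacher if p > 0.0)
    w = 1.0 - h / math.log(len(p_teacher))

    q_student = softmax(student_logits, tau)
    l_hard = cross_entropy(student_logits, label)
    l_teacher = w * tau ** 2 * kl(q_student, softmax(teacher_logits, tau))
    # Peer logits are plain numbers here, i.e. treated as constants (detached).
    l_peer = tau ** 2 * kl(q_student, softmax(peer_logits, tau))
    return alpha * l_hard + beta * l_teacher + gamma * l_peer

# Hypothetical logits for a 3-class sample with ground truth label 0.
logits = [2.0, 0.5, -1.0]
loss_all_agree = student_loss(logits, logits, logits, label=0)
loss_peer_differs = student_loss(logits, [0.0, 2.0, 0.0], logits, label=0)
```

When student, peer, and teacher agree exactly, both KL terms vanish and only the hard-label term remains; any disagreement with the peer adds a positive penalty.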

4. Training Strategy

Synchronized training procedure:

Teacher Forward Pass: Compute logits $z_T$ and uncertainty weights $w(x)$
Student Forward Pass: Obtain $z_{S1}$ and $z_{S2}$
Loss Computation: Calculate $\mathcal{L}_{S1}$ and $\mathcal{L}_{S2}$ respectively
Independent Optimization: Update $\theta_{S1}$ and $\theta_{S2}$ using independent optimizers
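The four steps above can be illustrated with a deliberately tiny stand-in. In this sketch each "student" is just a vector of free logits for one sample, gradients are central finite differences, and the update is plain gradient descent; all of these are simplifying assumptions made so the example runs without a deep learning framework, and none of the names come from the paper.

```python
import math

# Loss weights and temperature as reported in this summary.
ALPHA, BETA, GAMMA, TAU = 0.4, 0.4, 0.2, 4.0

def softmax(logits, tau=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(q, p):
    """KL(q || p), argument order as written in the loss definitions."""
    return sum(qc * math.log(qc / pc) for qc, pc in zip(q, p) if qc > 0.0)

def confidence_weight(z_teacher):
    p = softmax(z_teacher)
    h = -sum(pc * math.log(pc) for pc in p if pc > 0.0)
    return 1.0 - h / math.log(len(p))

def student_loss(z_student, z_peer, z_teacher, w, label):
    q = softmax(z_student, TAU)
    hard = -math.log(softmax(z_student)[label])
    teacher = w * TAU ** 2 * kl(q, softmax(z_teacher, TAU))
    peer = TAU ** 2 * kl(q, softmax(z_peer, TAU))
    return ALPHA * hard + BETA * teacher + GAMMA * peer

def grad_wrt_own_logits(z_student, z_peer, z_teacher, w, label, eps=1e-5):
    """Finite-difference gradient; z_peer is held fixed, playing the role of detach."""
    grads = []
    for k in range(len(z_student)):
        up = list(z_student)
        down = list(z_student)
        up[k] += eps
        down[k] -= eps
        diff = (student_loss(up, z_peer, z_teacher, w, label)
                - student_loss(down, z_peer, z_teacher, w, label))
        grads.append(diff / (2.0 * eps))
    return grads

def train_step(z_s1, z_s2, z_teacher, label, lr=0.2):
    w = confidence_weight(z_teacher)                           # 1. teacher forward pass
    # 2. student forward passes: the toy students' logits are their parameters
    g1 = grad_wrt_own_logits(z_s1, z_s2, z_teacher, w, label)  # 3. loss for S1
    g2 = grad_wrt_own_logits(z_s2, z_s1, z_teacher, w, label)  #    loss for S2
    new_s1 = [z - lr * g for z, g in zip(z_s1, g1)]            # 4. independent updates
    new_s2 = [z - lr * g for z, g in zip(z_s2, g2)]
    return new_s1, new_s2

# Hypothetical sample: the teacher is confident in class 0, which is also the label.
z_teacher, label = [6.0, 1.0, 0.0], 0
s1, s2 = [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]
w = confidence_weight(z_teacher)
initial = (student_loss(s1, s2, z_teacher, w, label),
           student_loss(s2, s1, z_teacher, w, label))
for _ in range(200):
    s1, s2 = train_step(s1, s2, z_teacher, label)
final = (student_loss(s1, s2, z_teacher, w, label),
         student_loss(s2, s1, z_teacher, w, label))
```

Both gradients are computed from the pre-update logits, so the two students see each other's state from the same step, and each update touches only that student's own parameters.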

Technical Innovations

1. Differences from Baseline

Traditional KD: Uniform weights $\mathcal{L} = \alpha \mathcal{L}_{\text{hard}} + \beta \mathcal{L}_{\text{teacher}}$
Proposed Method: Introduces $w(x)$ for sample-level modulation and adds peer learning term

2. Design Rationale

Entropy as uncertainty: Computationally efficient (single forward pass), intuitively reflects prediction confidence
Heterogeneous student selection: ResNet-18 (deep residual) and MobileNetV2 (depthwise separable convolution) possess different inductive biases
Independent optimization: Allows students of different capacities to converge at their respective optimal rates

3. Problem-Solving Mechanisms

Filtering negative transfer: Reduces weight of uncertain predictions, minimizing misleading information
Complementary learning: ResNet-18 captures fine-grained spatial features while MobileNetV2 learns compact discriminative representations
Robustness guarantee: Hard label loss provides reliable anchor points, preventing over-reliance on teacher

Experimental Setup

Dataset

ImageNet-100:

Scale: 100 classes, approximately 130,000 training images, 5,000 validation images
Categories: Encompasses diverse visual categories including animals, vehicles, objects, and natural scenes
Selection Rationale: Maintains sufficient complexity while enabling faster experimental iterations compared to full ImageNet (1,000 classes, 1.2M images)

Data Preprocessing:

Training Augmentation:
- Random crop to 224×224 pixels
- Horizontal flip with 50% probability
- Color jittering (brightness, contrast, saturation ±0.4)
Validation Preprocessing:
- Resize to 256×256, center crop to 224×224
- Normalize using ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
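As a small worked example of the validation preprocessing, the sketch below computes the center-crop window and normalizes one RGB pixel. It assumes pixel values are first scaled from [0, 255] to [0, 1] before the mean and standard deviation are applied, which is the common convention but is not stated in the summary; the function names are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def center_crop_box(size=256, crop=224):
    """(left, top, right, bottom) of a centered crop inside a size x size image."""
    offset = (size - crop) // 2
    return offset, offset, offset + crop, offset + crop

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

box = center_crop_box()                   # (16, 16, 240, 240)
pixel = normalize_pixel((124, 116, 104))  # close to the dataset mean, so near zero
```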

Evaluation Metrics

Top-1 Accuracy: Proportion of samples whose highest-scoring prediction matches the ground truth label
Top-5 Accuracy: Proportion of samples whose ground truth label is among the five highest-scoring predictions
Training Efficiency: Total training time (hours)
Model Size: Parameter count and compression ratio
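The two accuracy metrics differ only in how many top-ranked classes are checked. A minimal sketch, with a hypothetical function name and made-up scores:

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Hypothetical scores for 3 samples over 4 classes.
scores = [
    [0.1, 0.7, 0.1, 0.1],  # top prediction: class 1
    [0.4, 0.3, 0.2, 0.1],  # top prediction: class 0, runner-up: class 1
    [0.1, 0.2, 0.3, 0.4],  # top prediction: class 3, runner-up: class 2
]
labels = [1, 1, 0]

top1 = topk_accuracy(scores, labels, k=1)  # 1 of 3 correct
top2 = topk_accuracy(scores, labels, k=2)  # 2 of 3 correct
```

Top-5 accuracy on ImageNet-100 corresponds to calling the same function with `k=5` over 100-class score vectors.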

Comparison Methods

Baseline KD (ResNet-18): Conventional knowledge distillation, $\alpha=0.3, \beta=0.7$
Baseline KD (MobileNetV2): Same configuration applied to more compact architecture
Hard Labels Only: Training using only ground truth labels ($\alpha=1$)

Implementation Details

Batch Size: 64
Training Epochs: 50
Optimizer: SGD with momentum 0.9
Learning Rate: Initial 0.1, cosine annealing to 0
Weight Decay: 1×10⁻⁴
Temperature Parameter: $\tau=4.0$
Loss Weights (dual-student): $\alpha=0.4, \beta=0.4, \gamma=0.2$
Hardware: Not explicitly specified; reported training times range from 7.50 to 12.36 hours (Table II)
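The learning-rate schedule listed above can be sketched as follows. This assumes the standard cosine annealing formula applied once per epoch with no warmup; the summary does not say whether the schedule is stepped per epoch or per iteration, so treat that as an assumption.

```python
import math

def cosine_lr(epoch, total_epochs=50, base_lr=0.1, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 to min_lr at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

schedule = [cosine_lr(epoch) for epoch in range(51)]
# schedule[0] is the initial rate 0.1, schedule[25] is half of it,
# and schedule[50] reaches 0.
```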

Experimental Results

Main Results

Table I: ImageNet-100 Performance Comparison

Method	Architecture	Top-1	Top-5
Baseline KD	ResNet-18	81.86%	94.54%
Baseline KD	MobileNetV2	80.54%	94.54%
Proposed Method	ResNet-18	83.84%	96.36%
Proposed Method	MobileNetV2	81.46%	95.54%
Improvement	ResNet-18	+2.04%	+1.82%
Improvement	MobileNetV2	+0.92%	+1.00%

Key Findings:

Consistent Improvements: Both student architectures improve over their single-student baselines, supporting the method's generality across the two architectures
Capacity Sensitivity: ResNet-18 (larger capacity) achieves the greater absolute improvement (2.04% vs 0.92%)
Top-5 Improvements: Top-5 gains of 1.82% and 1.00% indicate the method improves not only the top-ranked prediction but also the overall class ranking

Ablation Study

Table III: Loss Component Ablation Study

Configuration	ResNet-18	MobileNetV2
Hard labels only ($\alpha=1$)	78.2%	76.1%
+ Teacher distillation ($\beta=0.7$)	81.9%	80.5%
+ Uncertainty weighting	82.8%	81.0%
+ Peer learning ($\gamma=0.2$)	83.8%	81.5%

Incremental Contribution Analysis:

Conventional KD: Improves 3.7% over hard labels (ResNet-18) and 4.4% (MobileNetV2), validating soft label value
Uncertainty Weighting: Additional improvement of 0.9% (ResNet-18) and 0.5% (MobileNetV2), demonstrating selective knowledge transfer effectiveness
Peer Learning: Further improvement of 0.5-1.0%, showcasing complementary advantages of heterogeneous collaboration

Cumulative Effect: Three components synergistically achieve total improvement of 5.6% (ResNet-18) and 5.4% (MobileNetV2)
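The per-component gains can be recomputed directly from the Table III values. The snippet below is only arithmetic on the numbers in the table; the variable and function names are hypothetical.

```python
# (configuration, ResNet-18 top-1 %, MobileNetV2 top-1 %) from Table III
ablation = [
    ("Hard labels only", 78.2, 76.1),
    ("+ Teacher distillation", 81.9, 80.5),
    ("+ Uncertainty weighting", 82.8, 81.0),
    ("+ Peer learning", 83.8, 81.5),
]

def incremental_gains(rows, column):
    """Difference between each row and the previous one, in percentage points."""
    values = [row[column] for row in rows]
    return [round(b - a, 1) for a, b in zip(values, values[1:])]

resnet_gains = incremental_gains(ablation, 1)     # [3.7, 0.9, 1.0]
mobilenet_gains = incremental_gains(ablation, 2)  # [4.4, 0.5, 0.5]
resnet_total = round(sum(resnet_gains), 1)        # 5.6
mobilenet_total = round(sum(mobilenet_gains), 1)  # 5.4
```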

Training Dynamics Analysis

Table II: Training Efficiency

Method	Training Time	Epochs
Baseline (ResNet-18)	7.58 hours	50
Baseline (MobileNetV2)	7.50 hours	50
Dual-Student (both)	12.36 hours	50

Efficiency Analysis:

Training time is 1.63× that of the ResNet-18 baseline (12.36 vs. 7.58 hours) rather than 2×, because teacher inference and data loading are shared between the two students
Single training yields two complementary models, providing deployment flexibility
Training cost is one-time investment with no additional inference overhead
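For reference, the ratios implied by Table II can be checked with a few lines of arithmetic. The comparison against the sum of both baseline runs is not stated in the summary; it is derived here from the table values.

```python
baseline_hours = {"ResNet-18": 7.58, "MobileNetV2": 7.50}
dual_student_hours = 12.36

# Cost relative to training a single ResNet-18 baseline.
ratio_vs_resnet18 = round(dual_student_hours / baseline_hours["ResNet-18"], 2)  # 1.63

# Cost relative to training both baselines one after the other.
ratio_vs_both = round(dual_student_hours / sum(baseline_hours.values()), 2)     # 0.82
```

Under these numbers, the joint run takes less time than two separate baseline distillation runs.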

Convergence Characteristics (final epoch):

ResNet-18: Training loss 0.3030, training accuracy 84.88%, validation accuracy 83.84% (generalization gap 1.04%)
MobileNetV2: Training loss 0.3789, training accuracy 79.35%, validation accuracy 81.46% (generalization gap -2.11%, validation exceeds training)

The small generalization gaps suggest that neither student overfits substantially.

Uncertainty Pattern Analysis

Teacher Confidence Statistics:

Average confidence weight: 0.816 (indicating overall teacher confidence)
Average entropy: 4.533 (maximum entropy 4.605 for 100 classes)
Normalized uncertainty: 0.184

Interpretation:

Teacher is well-trained on ImageNet-100, with most predictions high-confidence
The average normalized uncertainty of 0.184 indicates that a meaningful share of teacher predictions carries non-trivial uncertainty
Variability in confidence distribution validates necessity of uncertainty weighting

Model Compression Effects

Table IV: Model Size Comparison

Model	Parameters	Compression Ratio
Teacher (ResNet-50)	25.6M	1.00×
Student 1 (ResNet-18)	11.7M	2.19×
Student 2 (MobileNetV2)	3.5M	7.31×
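The compression ratios in Table IV are the teacher's parameter count divided by each model's parameter count. A short check, with hypothetical variable names:

```python
# Parameter counts in millions, from Table IV.
params_millions = {"ResNet-50": 25.6, "ResNet-18": 11.7, "MobileNetV2": 3.5}

teacher_params = params_millions["ResNet-50"]
compression = {name: round(teacher_params / count, 2)
               for name, count in params_millions.items()}
# {"ResNet-50": 1.0, "ResNet-18": 2.19, "MobileNetV2": 7.31}
```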

Deployment Trade-offs:

MobileNetV2: 7.31× compression, 81.46% accuracy, suitable for mobile devices
ResNet-18: 2.19× compression, 83.84% accuracy, balances accuracy and efficiency
Dual-model approach provides flexible selection based on resource constraints

Related Work

1. Knowledge Distillation

Original KD (Hinton et al., 2015): Temperature-scaled soft labels
Attention Transfer (Zagoruyko & Komodakis, 2017): Matching attention maps
Feature Distillation (Romero et al., 2015): Intermediate representation alignment
Relational Distillation (Park et al., 2019): Preserving sample relationships

Paper Positioning: Builds on output-layer distillation by introducing uncertainty modulation

2. Uncertainty Estimation

Bayesian Neural Networks (Gal & Ghahramani, 2016): Parameter distributions
Deep Ensembles (Lakshminarayanan et al., 2017): Multi-model disagreement
Prediction Entropy (Shannon, 1948): Probability distribution spread

Method Selection: Adopts entropy-based uncertainty for computational efficiency (single forward pass)

3. Multi-Student Distillation

Deep Mutual Learning (Zhang et al., 2018): Peer learning without teacher

Paper Innovation: Combines teacher-student and peer learning with uncertainty weighting

Conclusions and Discussion

Main Conclusions

Uncertainty-Awareness Effectiveness: Selective knowledge transfer based on teacher confidence significantly improves student performance
Peer Learning Benefits: Heterogeneous student collaboration produces complementary advantages benefiting both
Generality Validation: Method proves effective across different capacity architectures (ResNet-18 and MobileNetV2)
Practical Balance: Achieves significant accuracy improvements and deployment flexibility with acceptable training cost increase

Limitations

Increased Training Cost: Dual-student framework requires 1.63× the training time of a single-student baseline, potentially limiting its use when training resources are constrained
Hyperparameter Sensitivity: Loss weights $\alpha, \beta, \gamma$ require careful tuning with optimal configuration dependent on dataset and architecture
Single Uncertainty Measure: Uses only entropy without distinguishing epistemic and aleatoric uncertainty
Limited Evaluation Scope: Validation only on ImageNet-100 image classification; other tasks (detection, segmentation) and domains (NLP) unexplored
Synchronous Training Assumption: Requires both students trained simultaneously from scratch, unsuitable for scenarios with partially trained models

Future Directions

Extended Student Numbers: Richer collaborative learning with three or more heterogeneous students
Advanced Uncertainty Estimation: Monte Carlo Dropout or evidential deep learning
Cross-Domain Applications: NLP, speech recognition, multimodal learning
Dynamic Weight Scheduling: Adaptive adjustment of $\alpha, \beta, \gamma$ during training
Integration with Other Compression: Pruning, quantization, neural architecture search
Uncertainty Pattern Transferability: Study uncertainty consistency across datasets/tasks

Key References

Hinton et al., 2015 - Foundational knowledge distillation work
Gal & Ghahramani, 2016 - Dropout as Bayesian approximation
Zhang et al., 2018 - Deep mutual learning (peer learning pioneer)
Zagoruyko & Komodakis, 2017 - Attention transfer
Park et al., 2019 - Relational knowledge distillation

Summary Evaluation

Dimension	Score (1-5)	Explanation
Novelty	3.5/5	Uncertainty weighting is incremental innovation; peer learning combination shows originality
Technical Depth	3/5	Simple method lacking theoretical analysis; shallow uncertainty measure
Experimental Completeness	3.5/5	Thorough ablation studies but lacks multi-dataset and SOTA comparisons
Practical Value	4/5	Easy implementation, stable results, high deployment flexibility
Writing Quality	4/5	Clear structure, smooth expression, intuitive figures
Overall Assessment	3.6/5	Solid application-oriented work with practical utility but limited novelty

Recommended Audience: Researchers and engineers working on model compression and knowledge distillation, particularly practitioners focused on mobile deployment.