Knowledge distillation has emerged as a powerful technique for model compression, enabling the transfer of knowledge from large teacher networks to compact student models. However, traditional knowledge distillation methods treat all teacher predictions equally, regardless of the teacher's confidence in those predictions. This paper proposes an uncertainty-aware dual-student knowledge distillation framework that leverages teacher prediction uncertainty to selectively guide student learning. We introduce a peer-learning mechanism where two heterogeneous student architectures, specifically ResNet-18 and MobileNetV2, learn collaboratively from both the teacher network and each other. Experimental results on ImageNet-100 demonstrate that our approach achieves superior performance compared to baseline knowledge distillation methods, with ResNet-18 achieving 83.84% top-1 accuracy and MobileNetV2 achieving 81.46% top-1 accuracy, representing improvements of 2.04% and 0.92% respectively over traditional single-student distillation approaches.
- Paper ID: 2511.18826
- Title: Uncertainty-Aware Dual-Student Knowledge Distillation for Efficient Image Classification
- Authors: Aakash Gore, Anoushka Dey, Aryan Mishra (Indian Institute of Technology Bombay)
- Classification: cs.CV, cs.LG
- Publication Date: November 24, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2511.18826
Deep neural networks have achieved remarkable success in computer vision tasks, yet their deployment on resource-constrained devices remains challenging. This paper addresses:
- Blindness of conventional knowledge distillation: Existing methods assign equal weight to all teacher predictions, ignoring confidence variations across different samples
- Limitations of single-student models: A single student model cannot fully exploit complementary advantages of multiple architectures
- Negative knowledge transfer: Uncertain predictions from the teacher may mislead student learning
With the growing demand for complex machine learning models on edge devices, mobile platforms, and embedded systems, model compression has become critical. Knowledge distillation, as a core technique, directly impacts the feasibility of practical deployment.
- Uniform treatment: Traditional methods (e.g., Hinton et al.'s original KD) apply the same temperature and the same weight to every teacher prediction, regardless of how reliable that prediction is
- Unidirectional knowledge flow: Knowledge transfer only flows from teacher to students, failing to fully exploit synergistic potential among multiple students
- Neglecting uncertainty: High-entropy predictions from teachers at decision boundaries or on ambiguous samples may contain misleading information
Key observations:
- Teacher models exhibit significant confidence variations across different samples
- High-entropy (uncertain) predictions may contain contradictory information and should have reduced influence
- Heterogeneous student architectures can learn complementary representations and mutually enhance each other through peer learning
- Uncertainty-Aware Distillation Framework: Proposes a mechanism that dynamically adjusts teacher guidance weights based on prediction entropy, enabling students to prioritize learning from high-confidence predictions while maintaining robustness through hard label supervision
- Dual-Student Peer Learning Architecture: Introduces a collaborative learning mechanism between two heterogeneous models (ResNet-18 and MobileNetV2), enabling mutual knowledge exchange and complementary feature learning
- Significant Improvements on ImageNet-100: Validates method effectiveness across student architectures of different capacities and design principles, achieving 2.04% improvement for ResNet-18 and 0.92% for MobileNetV2
- In-Depth Analysis of Teacher Confidence Patterns: Provides mechanistic insights into how uncertainty-aware distillation improves performance through detailed ablation studies validating individual component contributions
Given a training dataset $D=\{(x_i,y_i)\}_{i=1}^{N}$, where $x_i\in\mathbb{R}^{H\times W\times 3}$ is an input image and $y_i\in\{1,\dots,C\}$ is its ground-truth label, the objective is to:
- Utilize a pre-trained frozen teacher network $T(\theta_T)$
- Simultaneously train two heterogeneous student networks $S_1(\theta_{S_1})$ and $S_2(\theta_{S_2})$
- Achieve high classification accuracy while maintaining significantly lower computational cost
The framework comprises three core components:
- Teacher Network: Pre-trained ResNet-50 (25.6M parameters) whose parameters are frozen; it serves as the knowledge source
- Student 1: ResNet-18 (11.7M parameters), compression ratio 2.19×
- Student 2: MobileNetV2 (3.5M parameters), compression ratio 7.31×
For input $x$, the teacher produces logits $z^T=T(x)$, and prediction entropy is computed as the uncertainty measure:
$$H(x)=-\sum_{c=1}^{C}p_c\log p_c$$
where $p_c=\dfrac{\exp(z_c^T)}{\sum_{j=1}^{C}\exp(z_j^T)}$ is the softmax probability for class $c$.
Normalized entropy yields the confidence weight:
$$w(x)=1-\frac{H(x)}{\log C}$$
where $\log C$ is the maximum possible entropy for $C$ classes. High-confidence predictions (low entropy) produce $w(x)\approx 1$, while uncertain predictions (high entropy) produce $w(x)\approx 0$.
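The weighting rule can be illustrated with a few lines of standard-library Python. This is a minimal sketch of the two formulas above, not the authors' implementation; the function names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_weight(teacher_logits):
    """w(x) = 1 - H(x) / log C, computed from the teacher's logits."""
    probs = softmax(teacher_logits)
    return 1.0 - entropy(probs) / math.log(len(teacher_logits))

# Hypothetical logits: one sharply peaked prediction, one uniform prediction.
peaked = confidence_weight([10.0, 0.0, 0.0, 0.0])
uniform = confidence_weight([1.0, 1.0, 1.0, 1.0])
```

A peaked prediction gives a weight close to 1, while a uniform prediction gives a weight of 0 (up to floating-point error), matching the limiting behaviour described above.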
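The total loss for student $S_i$ ($i\in\{1,2\}$) is a weighted combination of three complementary learning objectives:
$$\mathcal{L}_{S_i}=\alpha\mathcal{L}_{\text{hard}}+\beta\mathcal{L}_{\text{teacher}}+\gamma\mathcal{L}_{\text{peer}}$$
Hard Label Loss (maintaining ground truth supervision):
$$\mathcal{L}_{\text{hard}}=\mathrm{CE}(S_i(x),y)$$
Uncertainty-Weighted Teacher Loss (selective knowledge transfer):
$$\mathcal{L}_{\text{teacher}}=w(x)\cdot\tau^{2}\cdot\mathrm{KL}\left(q_{S_i}^{\tau}\,\|\,p_T^{\tau}\right)$$
where $q_{S_i}^{\tau}$ and $p_T^{\tau}$ are temperature-scaled softmax distributions with temperature $\tau$, and the factor $\tau^{2}$ compensates for the $1/\tau^{2}$ shrinkage of soft-target gradients introduced by temperature scaling.
Peer Learning Loss (knowledge exchange between students):
$$\mathcal{L}_{\text{peer}}=\tau^{2}\cdot\mathrm{KL}\left(q_{S_i}^{\tau}\,\|\,q_{S_j}^{\tau}\right)$$
where $j\neq i$ denotes the peer student. The peer's outputs are detached from the computation graph, so the gradients of $\mathcal{L}_{S_i}$ update only student $i$ and no circular dependency arises.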
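To make the combination of the three terms concrete, the following standard-library sketch evaluates the per-sample loss for one student. It is an illustration under stated assumptions, not the paper's code: the helper names are hypothetical, labels are 0-indexed, the default weights are the dual-student values reported in this summary (α=0.4, β=0.4, γ=0.2, τ=4.0), and the KL arguments follow the order written in the formulas above (student first). Many distillation implementations use the opposite order, so the direction should be treated as an assumption.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau=1 gives the ordinary softmax."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(a, b):
    """KL(a || b) for discrete distributions with strictly positive b."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

def cross_entropy(logits, label):
    """Cross-entropy of the (unscaled) softmax against a 0-indexed label."""
    return -math.log(softmax(logits)[label])

def student_loss(z_student, z_peer, z_teacher, label,
                 alpha=0.4, beta=0.4, gamma=0.2, tau=4.0):
    # Confidence weight w(x) from the teacher's unscaled prediction entropy.
    p_teacher = softmax(z_teacher)
    h = -sum(p * math.log(p) for p in p_teacher if p > 0.0)
    w = 1.0 - h / math.log(len(z_teacher))

    q_student = softmax(z_student, tau)
    l_hard = cross_entropy(z_student, label)
    l_teacher = w * tau ** 2 * kl(q_student, softmax(z_teacher, tau))
    # z_peer is a plain list here, so it is automatically a constant; in an
    # autograd framework this is where the peer output would be detached.
    l_peer = tau ** 2 * kl(q_student, softmax(z_peer, tau))
    return alpha * l_hard + beta * l_teacher + gamma * l_peer

# Hypothetical logits: when student, peer and teacher agree, both KL terms
# vanish and only the hard-label term remains.
z = [2.0, 0.5, -1.0]
agree = student_loss(z, z, z, label=0)
disagree = student_loss([0.0, 0.0, 3.0], z, z, label=0)
```

When all three networks produce identical logits the loss reduces to α times the cross-entropy, which is a convenient sanity check for any implementation of this objective.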
Synchronized training procedure:
- Teacher Forward Pass: Compute logits $z^T$ and uncertainty weights $w(x)$
- Student Forward Pass: Obtain $z^{S_1}$ and $z^{S_2}$
- Loss Computation: Calculate $\mathcal{L}_{S_1}$ and $\mathcal{L}_{S_2}$ respectively
- Independent Optimization: Update $\theta_{S_1}$ and $\theta_{S_2}$ using independent optimizers
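The four steps above can be sketched on a toy problem. In this illustration each "student" is just a vector of logits for a single sample, and a finite-difference gradient stands in for autograd; both are simplifying assumptions made so the example runs with the standard library alone. All names, the learning rate, and the example logits are hypothetical. The point it shows is the ordering: both losses are computed from the same snapshot of the two students, each treats the other as a constant, and the two updates are applied independently.

```python
import math

def softmax(z, tau=1.0):
    m = max(z)
    e = [math.exp((v - m) / tau) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(a, b):
    """KL(a || b) for strictly positive discrete distributions."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))

def student_loss(z_s, z_peer, z_t, w, y,
                 alpha=0.4, beta=0.4, gamma=0.2, tau=4.0):
    q = softmax(z_s, tau)
    hard = -math.log(softmax(z_s)[y])
    teacher = w * tau ** 2 * kl(q, softmax(z_t, tau))
    peer = tau ** 2 * kl(q, softmax(z_peer, tau))
    return alpha * hard + beta * teacher + gamma * peer

def numeric_grad(f, z, eps=1e-5):
    """Central finite-difference gradient (a stand-in for backpropagation)."""
    grads = []
    for k in range(len(z)):
        up = z[:k] + [z[k] + eps] + z[k + 1:]
        down = z[:k] + [z[k] - eps] + z[k + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def train_step(z1, z2, z_t, y, lr=0.1):
    # 1. Teacher forward pass: probabilities and confidence weight w(x).
    p_t = softmax(z_t)
    w = 1.0 + sum(p * math.log(p) for p in p_t) / math.log(len(z_t))
    # 2. Student forward pass: snapshot both students before any update,
    #    so each sees the other as a fixed ("detached") target.
    peer_of_1, peer_of_2 = list(z2), list(z1)
    # 3. Loss computation and gradients, one loss per student.
    g1 = numeric_grad(lambda z: student_loss(z, peer_of_1, z_t, w, y), z1)
    g2 = numeric_grad(lambda z: student_loss(z, peer_of_2, z_t, w, y), z2)
    # 4. Independent optimization: each student takes its own step.
    return ([a - lr * g for a, g in zip(z1, g1)],
            [a - lr * g for a, g in zip(z2, g2)])

teacher_logits = [3.0, 0.0, -1.0]          # hypothetical teacher output
s1, s2 = [0.0, 0.0, 0.0], [0.5, -0.5, 0.2]  # hypothetical initial students
for _ in range(200):
    s1, s2 = train_step(s1, s2, teacher_logits, y=0)
```

In a real setup the logit vectors would be replaced by network outputs on a mini-batch, and the two plain gradient steps by two separate optimizer instances.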
- Traditional KD: Uniform weights, $\mathcal{L}=\alpha\mathcal{L}_{\text{hard}}+\beta\mathcal{L}_{\text{teacher}}$
- Proposed Method: Introduces $w(x)$ for sample-level modulation and adds a peer learning term
- Entropy as uncertainty: Computationally efficient (single forward pass), intuitively reflects prediction confidence
- Heterogeneous student selection: ResNet-18 (deep residual) and MobileNetV2 (depthwise separable convolution) possess different inductive biases
- Independent optimization: Allows students of different capacities to converge at their respective optimal rates
- Filtering negative transfer: Reduces weight of uncertain predictions, minimizing misleading information
- Complementary learning: ResNet-18 captures fine-grained spatial features while MobileNetV2 learns compact discriminative representations
- Robustness guarantee: Hard label loss provides reliable anchor points, preventing over-reliance on teacher
ImageNet-100:
- Scale: 100 classes, approximately 130,000 training images, 5,000 validation images
- Categories: Encompasses diverse visual categories including animals, vehicles, objects, and natural scenes
- Selection Rationale: Maintains sufficient complexity while enabling faster experimental iterations compared to full ImageNet (1,000 classes, 1.2M images)
Data Preprocessing:
- Training Augmentation:
  - Random crop to 224×224 pixels
  - Horizontal flip with 50% probability
  - Color jittering (brightness, contrast, saturation ±0.4)
- Validation Preprocessing:
  - Resize to 256×256, center crop to 224×224
  - Normalize using ImageNet statistics (mean = (0.485, 0.456, 0.406), std = (0.229, 0.224, 0.225))
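The deterministic parts of the validation pipeline reduce to simple arithmetic, sketched below without any imaging library. The helper names are hypothetical; the only assumption beyond the listed settings is that 8-bit pixel values are scaled to [0, 1] before normalization, which is the scale on which the ImageNet mean and standard deviation above are defined.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel ImageNet mean (RGB)
STD = (0.229, 0.224, 0.225)   # per-channel ImageNet standard deviation (RGB)

def center_crop_box(size=256, crop=224):
    """Return (left, top, right, bottom) of a centered square crop."""
    offset = (size - crop) // 2
    return offset, offset, offset + crop, offset + crop

def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to standardized per-channel values."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))

box = center_crop_box()                   # 16-pixel margin on every side
white = normalize_pixel((255, 255, 255))  # brightest possible pixel
```

In practice an image library's resize, crop, and normalize transforms would do the same work on whole tensors.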
- Top-1 Accuracy: Proportion of samples whose highest-scoring predicted class matches the ground truth label
- Top-5 Accuracy: Proportion of ground truth labels within top-5 predictions
- Training Efficiency: Total training time (hours)
- Model Size: Parameter count and compression ratio
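The two accuracy metrics differ only in how many of the highest-scoring classes are checked. A minimal sketch follows; the function name and the toy scores are hypothetical, and labels are 0-indexed.

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Three toy samples over three classes.
scores = [
    [0.1, 0.7, 0.2],  # true class 1 is ranked first
    [0.5, 0.3, 0.2],  # true class 1 is ranked second
    [0.2, 0.3, 0.5],  # true class 0 is ranked last
]
labels = [1, 1, 0]
top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
```

With k=5 and 100 classes the same function gives the top-5 accuracy used in the results tables.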
- Baseline KD (ResNet-18): Conventional knowledge distillation, α=0.3, β=0.7
- Baseline KD (MobileNetV2): Same configuration applied to more compact architecture
- Hard Labels Only: Training using only ground truth labels (α=1)
- Batch Size: 64
- Training Epochs: 50
- Optimizer: SGD with momentum 0.9
- Learning Rate: Initial 0.1, cosine annealing to 0
- Weight Decay: 1×10⁻⁴
- Temperature Parameter: τ=4.0
- Loss Weights (dual-student): α=0.4, β=0.4, γ=0.2
- Hardware: Not specified; reported training times range from 7.50 to 12.36 hours (Table II)
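The learning-rate schedule listed above (initial rate 0.1, cosine annealing to 0 over 50 epochs) can be written out directly. The sketch assumes the rate is updated once per epoch, which the summary does not state; the function name is hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs=50, base_lr=0.1, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (0 <= epoch <= total)."""
    cosine = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine

schedule = [cosine_lr(e) for e in range(51)]
```

The rate starts at 0.1, passes through 0.05 at the midpoint (epoch 25), and reaches 0 at epoch 50.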
Table I: ImageNet-100 Performance Comparison
| Method | Architecture | Top-1 | Top-5 |
|---|---|---|---|
| Baseline KD | ResNet-18 | 81.86% | 94.54% |
| Baseline KD | MobileNetV2 | 80.54% | 94.54% |
| Proposed Method | ResNet-18 | 83.84% | 96.36% |
| Proposed Method | MobileNetV2 | 81.46% | 95.54% |
| Improvement | ResNet-18 | +2.04% | +1.82% |
| Improvement | MobileNetV2 | +0.92% | +1.00% |
Key Findings:
- Consistent Improvements: Both student architectures improve over their baselines in top-1 and top-5 accuracy, supporting the method's generality across the two architectures tested
- Capacity Sensitivity: ResNet-18 (larger capacity) achieves greater absolute improvement (2.04% vs 0.92%)
- Top-5 Improvements: Gains of 1.82 and 1.00 percentage points show that the method improves the ranking of candidate classes, not only the top prediction
Table III: Loss Component Ablation Study
| Configuration | ResNet-18 | MobileNetV2 |
|---|---|---|
| Hard labels only (α=1) | 78.2% | 76.1% |
| + Teacher distillation (β=0.7) | 81.9% | 80.5% |
| + Uncertainty weighting | 82.8% | 81.0% |
| + Peer learning (γ=0.2) | 83.8% | 81.5% |
Incremental Contribution Analysis:
- Conventional KD: Improves on hard-label training by 3.7 percentage points (ResNet-18) and 4.4 points (MobileNetV2), validating the value of soft labels
- Uncertainty Weighting: Adds a further 0.9 points (ResNet-18) and 0.5 points (MobileNetV2), demonstrating the effectiveness of selective knowledge transfer
- Peer Learning: Adds a further 1.0 points (ResNet-18) and 0.5 points (MobileNetV2), showcasing the complementary advantages of heterogeneous collaboration
Cumulative Effect: Together, the three components yield a total improvement of 5.6 points (ResNet-18) and 5.4 points (MobileNetV2) over hard-label training
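The per-component contributions can be recomputed directly from the rows of Table III. The snippet below is a small arithmetic check using only the accuracies in that table; the variable names are hypothetical.

```python
# (configuration, ResNet-18 top-1 %, MobileNetV2 top-1 %) from Table III
ablation = [
    ("Hard labels only", 78.2, 76.1),
    ("+ Teacher distillation", 81.9, 80.5),
    ("+ Uncertainty weighting", 82.8, 81.0),
    ("+ Peer learning", 83.8, 81.5),
]

# Gain of each row over the row before it, in percentage points.
increments = [
    (cur[0], round(cur[1] - prev[1], 1), round(cur[2] - prev[2], 1))
    for prev, cur in zip(ablation, ablation[1:])
]

# Gain of the full method over hard-label training.
total = (
    round(ablation[-1][1] - ablation[0][1], 1),
    round(ablation[-1][2] - ablation[0][2], 1),
)
```

From the table values, teacher distillation contributes 3.7 and 4.4 points, uncertainty weighting 0.9 and 0.5 points, and peer learning 1.0 and 0.5 points (ResNet-18 and MobileNetV2 respectively), for totals of 5.6 and 5.4 points.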
Table II: Training Efficiency
| Method | Training Time | Epochs |
|---|---|---|
| Baseline (ResNet-18) | 7.58 hours | 50 |
| Baseline (MobileNetV2) | 7.50 hours | 50 |
| Dual-Student (both) | 12.36 hours | 50 |
Efficiency Analysis:
- Training time is 1.63× that of a single baseline run (rather than 2×) because teacher inference and data loading are shared between the two students
- A single training run yields two complementary models, providing deployment flexibility
- Training cost is a one-time investment with no additional inference overhead
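The efficiency figures follow from the training times in Table II. The short calculation below also compares the dual-student run with training the two baselines one after the other, a comparison derived here from the table rather than reported in the summary; the variable names are hypothetical.

```python
baseline_hours = {"ResNet-18": 7.58, "MobileNetV2": 7.50}  # Table II
dual_hours = 12.36                                         # Table II

# Dual-student run relative to one ResNet-18 baseline run.
ratio_vs_single = dual_hours / baseline_hours["ResNet-18"]

# Dual-student run relative to running both baselines sequentially.
separate_total = sum(baseline_hours.values())
ratio_vs_separate = dual_hours / separate_total
```

The first ratio is about 1.63, and the second is about 0.82, so joint training takes roughly 18% less time than the two baseline runs combined.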
Convergence Characteristics (final epoch):
- ResNet-18: Training loss 0.3030, training accuracy 84.88%, validation accuracy 83.84% (generalization gap 1.04%)
- MobileNetV2: Training loss 0.3789, training accuracy 79.35%, validation accuracy 81.46% (generalization gap -2.11%, validation exceeds training)
The small gaps between training and validation accuracy suggest that neither student overfits substantially.
Teacher Confidence Statistics:
- Average confidence weight: 0.816 (indicating overall teacher confidence)
- Average entropy: 4.533 (maximum entropy 4.605 for 100 classes)
- Normalized uncertainty: 0.184
Interpretation:
- Teacher is well-trained on ImageNet-100, with most predictions high-confidence
- The mean normalized uncertainty of 0.184 indicates that a meaningful share of teacher predictions carries reduced confidence
- Variability in confidence distribution validates necessity of uncertainty weighting
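Because $w(x)=1-H(x)/\log C$ is linear in the entropy, the average weight and the average entropy determine each other. The check below applies that relation to the reported statistics; it is a reader-side calculation, not something stated in the paper, and the variable names are hypothetical.

```python
import math

num_classes = 100
max_entropy = math.log(num_classes)   # maximum entropy in nats

mean_weight = 0.816                   # reported average confidence weight
implied_mean_entropy = (1.0 - mean_weight) * max_entropy

reported_mean_entropy = 4.533         # reported average entropy
weight_from_reported = 1.0 - reported_mean_entropy / max_entropy
```

An average weight of 0.816 corresponds to a mean entropy of about 0.847 nats, whereas an entropy of 4.533 nats would correspond to a weight of about 0.016. The two reported figures therefore appear to describe different distributions; one possible reading, offered only as an assumption, is that the entropy figure was measured on temperature-softened outputs (τ=4.0) while the weight uses the unscaled softmax.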
Table IV: Model Size Comparison
| Model | Parameters | Compression Ratio |
|---|---|---|
| Teacher (ResNet-50) | 25.6M | 1.00× |
| Student 1 (ResNet-18) | 11.7M | 2.19× |
| Student 2 (MobileNetV2) | 3.5M | 7.31× |
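The compression ratios in Table IV are the teacher's parameter count divided by each model's parameter count, as the following arithmetic check shows (variable names are hypothetical; counts are in millions, taken from the table).

```python
params_millions = {
    "ResNet-50": 25.6,    # teacher
    "ResNet-18": 11.7,    # student 1
    "MobileNetV2": 3.5,   # student 2
}

teacher = params_millions["ResNet-50"]
compression = {name: round(teacher / p, 2) for name, p in params_millions.items()}
```

This reproduces the 2.19× and 7.31× figures for ResNet-18 and MobileNetV2.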
Deployment Trade-offs:
- MobileNetV2: 7.31× compression, 81.46% accuracy, suitable for mobile devices
- ResNet-18: 2.19× compression, 83.84% accuracy, balances accuracy and efficiency
- Dual-model approach provides flexible selection based on resource constraints
- Original KD (Hinton et al., 2015): Temperature-scaled soft labels
- Attention Transfer (Zagoruyko & Komodakis, 2017): Matching attention maps
- Feature Distillation (Romero et al., 2015): Intermediate representation alignment
- Relational Distillation (Park et al., 2019): Preserving sample relationships
Paper Positioning: Builds on output-layer distillation by introducing uncertainty modulation
- Bayesian Neural Networks (Gal & Ghahramani, 2016): Parameter distributions
- Deep Ensembles (Lakshminarayanan et al., 2017): Multi-model disagreement
- Prediction Entropy (Shannon, 1948): Probability distribution spread
Method Selection: Adopts entropy-based uncertainty for computational efficiency (single forward pass)
- Deep Mutual Learning (Zhang et al., 2018): Peer learning without teacher
Paper Innovation: Combines teacher-student and peer learning with uncertainty weighting
- Uncertainty-Awareness Effectiveness: Selective knowledge transfer based on teacher confidence significantly improves student performance
- Peer Learning Benefits: Heterogeneous student collaboration produces complementary advantages benefiting both
- Generality Validation: Method proves effective across different capacity architectures (ResNet-18 and MobileNetV2)
- Practical Balance: Achieves significant accuracy improvements and deployment flexibility with acceptable training cost increase
- Increased Training Cost: Dual-student framework requires 1.63× training time, potentially limiting resource-constrained scenarios
- Hyperparameter Sensitivity: Loss weights α, β, γ require careful tuning, and the optimal configuration depends on dataset and architecture
- Single Uncertainty Measure: Uses only entropy without distinguishing epistemic and aleatoric uncertainty
- Limited Evaluation Scope: Validation only on ImageNet-100 image classification; other tasks (detection, segmentation) and domains (NLP) unexplored
- Synchronous Training Assumption: Requires both students to be trained simultaneously from scratch, which makes the method unsuitable for scenarios with partially trained models
- Extended Student Numbers: Richer collaborative learning with three or more heterogeneous students
- Advanced Uncertainty Estimation: Monte Carlo Dropout or evidential deep learning
- Cross-Domain Applications: NLP, speech recognition, multimodal learning
- Dynamic Weight Scheduling: Adaptive adjustment of α, β, γ during training
- Integration with Other Compression: Pruning, quantization, neural architecture search
- Uncertainty Pattern Transferability: Study uncertainty consistency across datasets/tasks
- Hinton et al., 2015 - Foundational knowledge distillation work
- Gal & Ghahramani, 2016 - Dropout as Bayesian approximation
- Zhang et al., 2018 - Deep mutual learning (peer learning pioneer)
- Zagoruyko & Komodakis, 2017 - Attention transfer
- Park et al., 2019 - Relational knowledge distillation
| Dimension | Score (1-5) | Explanation |
|---|---|---|
| Novelty | 3.5/5 | Uncertainty weighting is incremental innovation; peer learning combination shows originality |
| Technical Depth | 3/5 | Simple method lacking theoretical analysis; shallow uncertainty measure |
| Experimental Completeness | 3.5/5 | Thorough ablation studies but lacks multi-dataset and SOTA comparisons |
| Practical Value | 4/5 | Easy implementation, stable results, high deployment flexibility |
| Writing Quality | 4/5 | Clear structure, smooth expression, intuitive figures |
| Overall Assessment | 3.6/5 | Solid application-oriented work with practical utility but limited novelty |
Recommended Audience: Researchers and engineers working on model compression and knowledge distillation, particularly practitioners focused on mobile deployment.