Towards Characterizing Knowledge Distillation of PPG Heart Rate Estimation Models

Arora, Narayanswamy, Patel et al.
Heart rate estimation from photoplethysmography (PPG) signals generated by wearable devices such as smartwatches and fitness trackers has significant implications for the health and well-being of individuals. Although prior work has demonstrated deep learning models with strong performance in the heart rate estimation task, in order to deploy these models on wearable devices, these models must also adhere to strict memory and latency constraints. In this work, we explore and characterize how large pre-trained PPG models may be distilled to smaller models appropriate for real-time inference on the edge. We evaluate four distillation strategies through comprehensive sweeps of teacher and student model capacities: (1) hard distillation, (2) soft distillation, (3) decoupled knowledge distillation (DKD), and (4) feature distillation. We present a characterization of the resulting scaling laws describing the relationship between model size and performance. This early investigation lays the groundwork for practical and predictable methods for building edge-deployable models for physiological sensing.

Basic Information

  • Paper ID: 2511.18829
  • Title: Towards Characterizing Knowledge Distillation of PPG Heart Rate Estimation Models
  • Authors: Kanav Arora, Girish Narayanswamy, Shwetak Patel, Richard Li (University of Washington)
  • Classification: cs.LG (Machine Learning)
  • Publication Venue/Conference: NeurIPS 2025 Workshop: Learning from Time Series for Health
  • Paper Link: https://arxiv.org/abs/2511.18829

Abstract

Heart rate estimation is an important health monitoring capability on wearable devices (such as smartwatches and fitness trackers) through photoplethysmography (PPG) signals. Although deep learning models demonstrate excellent performance on heart rate estimation tasks, deploying these models on wearable devices requires satisfying strict memory and latency constraints. This study explores and characterizes how to distill large pre-trained PPG models into small models suitable for edge real-time inference. The research evaluates four distillation strategies through comprehensive sweeps of teacher and student model capacities: (1) hard distillation, (2) soft distillation, (3) decoupled knowledge distillation (DKD), and (4) feature distillation. The paper presents scaling law characteristics that describe the relationship between model size and performance. This early-stage research establishes a practical and predictable methodological foundation for building physiological sensing models deployable on edge devices.

Research Background and Motivation

1. Core Problem to Address

Large deep learning models on wearable devices face challenges from limited computational resources. Although large PPG heart rate estimation models achieve excellent performance, their significant computational demands (memory footprint and inference latency) restrict practical deployment on edge devices, hindering the realization of advantages such as real-time feedback and privacy protection.

2. Problem Significance

  • Health Monitoring Needs: PPG signals can be used to assess cardiovascular health, with important applications in exercise feedback and disease screening (such as hypertension)
  • Edge Deployment Advantages: Edge models provide better privacy protection and support real-time feedback
  • Practical Bottleneck: Large sensor models are difficult to run on resource-constrained wearable devices

3. Limitations of Existing Methods

  • Insufficient Knowledge Distillation Application: While knowledge distillation has achieved success in language models (such as DistilBERT) and audio/accelerometer models, exploration in the physiological sensing domain remains limited
  • Lack of Predictability: Existing distillation methods lack systematic characterization, making it difficult to predict distilled model performance
  • Research Gap in Scaling Laws: Scaling laws for language model distillation were only recently established; no similar research exists in the physiological sensing domain

4. Research Motivation

This paper makes the first attempt to establish predictable distillation performance characterization in the physiological sensing domain, providing systematic evaluation of distillation strategies and scaling law analysis for PPG heart rate estimation tasks.

Core Contributions

  1. Systematic Distillation Strategy Evaluation: First comprehensive evaluation of four knowledge distillation strategies (hard distillation, soft distillation, DKD, feature distillation) on PPG heart rate estimation tasks, spanning multiple teacher and student model capacity configurations
  2. Scaling Law Characterization: Discovers and characterizes that distilled model performance follows predictable exponential scaling curves, revealing the relationship between model size and performance
  3. Optimal Strategy Identification: Demonstrates that decoupled knowledge distillation (DKD) outperforms all evaluated strategies, particularly suited for semantically ordered classification tasks
  4. Architecture Impact Analysis: Shows that model architecture choices (ResNet vs MLP) significantly influence distillation scaling behavior, with ResNet student models exhibiting stronger inductive biases
  5. Practical Validation: Demonstrates that distilling a 12-block teacher into a 1-block student cuts inference time by approximately 90% and memory usage by approximately 60%, at the cost of roughly a 30% increase in MAE

Methodology Details

Task Definition

Input: 8-second window of PPG signal (green channel, 25Hz sampling rate, 2-second stride)
Output: Instantaneous heart rate classification (180 classes, corresponding to 30-210 BPM)
Evaluation Metric: Mean Absolute Error (MAE, in BPM)
Constraints: Models must satisfy memory and latency limitations of wearable devices
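
The classification framing above can be illustrated with a minimal sketch. It assumes that the 180 classes are equal-width 1-BPM bins spanning 30-210 BPM and that predictions are decoded to bin centres; the summary does not state the binning, so the helper names and the binning rule are illustrative only.

```python
# Assumption: 180 equal-width bins (1 BPM each) covering 30-210 BPM.
MIN_BPM, MAX_BPM, NUM_CLASSES = 30, 210, 180
BIN_WIDTH = (MAX_BPM - MIN_BPM) / NUM_CLASSES  # 1.0 BPM under this assumption


def bpm_to_class(bpm: float) -> int:
    """Map a heart rate in BPM to a class index, clipped to [0, NUM_CLASSES - 1]."""
    index = int((bpm - MIN_BPM) // BIN_WIDTH)
    return max(0, min(NUM_CLASSES - 1, index))


def class_to_bpm(index: int) -> float:
    """Decode a class index to the centre of its BPM bin."""
    return MIN_BPM + (index + 0.5) * BIN_WIDTH


def mean_absolute_error(pred_bpm, true_bpm) -> float:
    """MAE in BPM between decoded predictions and reference heart rates."""
    return sum(abs(p - t) for p, t in zip(pred_bpm, true_bpm)) / len(true_bpm)
```

Under this binning, quantisation alone contributes at most 0.5 BPM of error per window, which is small compared with the MAE values reported later.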

Model Architecture

Base Architecture: 1D-ResNet

Employs the 1D-ResNet variant used by Meier et al. as the backbone network, controlling model capacity by adjusting the number of residual blocks:

  • Teacher Models: 2-12 residual blocks (33K-864K parameters)
  • Student Models: 1-10 residual blocks (23K-534K parameters)

Four Distillation Strategies

1. Hard Distillation

  • Uses the teacher model's final prediction (argmax output) as training labels for the student model
  • Helps the student model mimic the teacher's discrete decision boundaries
  • Carries the least information of the four strategies and yields the highest MAE in Table 2
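
As a rough illustration of hard distillation, the sketch below trains against the teacher's argmax class with ordinary cross-entropy. The function names are hypothetical and the pure-Python form is for clarity only; it is not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def hard_distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's argmax class."""
    # The teacher's full distribution is discarded; only its top class is kept.
    hard_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return -log_softmax(student_logits)[hard_label]
```

Because only the top class survives, a teacher that is nearly split between two adjacent BPM bins passes on the same label as a fully confident one.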

2. Soft Distillation

  • Student model trained on teacher model's output probability distribution
  • Encodes rich information about inter-class relationships and uncertainty
  • Based on the classical method by Hinton et al.
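
A minimal sketch of the soft distillation loss in the style of Hinton et al. follows: a KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The default temperature of 2.0 is an illustrative assumption, since the summary reports a temperature only for DKD.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl
```

The T^2 factor keeps gradient magnitudes comparable across temperatures, which matters when this term is combined with a cross-entropy loss on the true labels.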

3. Decoupled Knowledge Distillation (DKD)

  • Decomposes teacher output into target class (TCKD) and non-target class (NCKD) distillation components
  • Weights the target-class (TCKD) and non-target-class (NCKD) terms independently in the student loss function
  • Optimal Hyperparameters: α=1, β=8, temperature τ=2, cross-entropy weight CE=1
  • NCKD probability weight is 8 times that of TCKD, particularly suited for semantically ordered classification tasks
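
The decomposition above can be sketched as follows, based on the formulation of Zhao et al. (2022) and using the hyperparameters reported here (α=1, β=8, τ=2, CE weight 1) as defaults. The T^2 scaling follows common practice and is an assumption, as are all function names; this is an illustrative pure-Python version, not the authors' code.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dkd_loss(student_logits, teacher_logits, target,
             alpha=1.0, beta=8.0, temperature=2.0):
    """alpha * TCKD + beta * NCKD on temperature-softened distributions."""
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    rest_s = sum(p for i, p in enumerate(p_s) if i != target)
    rest_t = sum(p for i, p in enumerate(p_t) if i != target)
    # TCKD: binary "target vs. everything else" distributions.
    tckd = kl_divergence([p_t[target], rest_t], [p_s[target], rest_s])
    # NCKD: distributions renormalised over the non-target classes only.
    n_s = [p / rest_s for i, p in enumerate(p_s) if i != target]
    n_t = [p / rest_t for i, p in enumerate(p_t) if i != target]
    nckd = kl_divergence(n_t, n_s)
    return temperature ** 2 * (alpha * tckd + beta * nckd)


def total_loss(student_logits, teacher_logits, target, ce_weight=1.0, **kw):
    """Cross-entropy on the true label plus the DKD term."""
    ce = -math.log(softmax(student_logits)[target])
    return ce_weight * ce + dkd_loss(student_logits, teacher_logits, target, **kw)
```

With β=8, a mismatch in how the teacher spreads probability over neighbouring BPM bins is penalised eight times more heavily than a mismatch in target-class confidence.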

4. Feature Distillation

  • Goes beyond the output layer, training student models to match teacher intermediate feature maps
  • Aligns internal representation spaces
  • Performance falls between soft distillation and DKD for students of up to 6 blocks in Table 2
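
One common way to match intermediate features, in the spirit of FitNets, is to project the student's features to the teacher's width and penalise the mean squared difference. The summary does not say which layers are matched or which loss is used, so the linear projection and the MSE below are assumptions for illustration.

```python
def project(features, weights):
    """Linear map from the student's feature width to the teacher's width.

    `weights` is a hypothetical learned matrix with one row per teacher feature.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def feature_distillation_loss(student_features, teacher_features, weights):
    """Mean squared error between projected student and teacher features."""
    projected = project(student_features, weights)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared) / len(squared)
```

In practice the projection would be trained jointly with the student, and this term is typically added to an output-level loss rather than used alone.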

Technical Innovations

1. Distillation Characterization for Physiological Signals

  • First systematic study of distillation scaling laws in the PPG signal domain
  • Discovers that exponential scaling curves apply to physiological sensing tasks

2. DKD Advantage Mechanism

  • In scenarios where classification bin semantics are ordered, non-target class probabilities contain important information
  • Through 8:1 weight ratio, student models learn richer probability labels
  • Although small models cannot learn rich representations from scratch, they can effectively learn by regressing teacher probability labels

3. Importance of Architecture Inductive Bias

  • Convolutional layers carry inherent inductive biases, such as a natural tendency to act as smoothing filters on the signal
  • Targeted architectural choices such as residual connections enable more sample-efficient learning
  • ResNet students reach a lower error floor than MLP students

Experimental Setup

Datasets

Uses three free-living PPG datasets, totaling 107 hours of sensor signals:

  1. WildPPG: Real-world long continuous recordings
  2. PPG-DaLiA: UCI Machine Learning Repository dataset
  3. GalaxyPPG: Galaxy Watch data collected in semi-natural settings

Preprocessing Pipeline:

  • Uses only PPG sensor green channel
  • Resamples to 25Hz
  • Segments into 8-second windows with 2-second stride
  • Provides heart rate ground truth (BPM) via ECG signal
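
The segmentation step can be sketched as a sliding window over the resampled signal. At 25 Hz, an 8-second window holds 200 samples and a 2-second stride advances 50 samples; the function name and the choice to drop a trailing partial window are assumptions.

```python
SAMPLE_RATE_HZ = 25
WINDOW_S = 8
STRIDE_S = 2


def segment(signal, sample_rate=SAMPLE_RATE_HZ, window_s=WINDOW_S, stride_s=STRIDE_S):
    """Split a 1D signal into overlapping fixed-length windows.

    Trailing samples that do not fill a whole window are dropped (an assumption).
    """
    window = window_s * sample_rate   # 200 samples
    stride = stride_s * sample_rate   # 50 samples
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, stride)]
```

Consecutive windows share 6 of their 8 seconds, so neighbouring training examples are highly correlated, which is one reason the split is made by participant rather than by window.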

Data Split:

  • Participant-independent train-test split (80%-20%)
  • 2-fold cross-validation
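
A participant-independent split assigns whole participants, not individual windows, to the train or test side. The sketch below shows one way to do this for a single 80/20 split; the function name, the seeding, and the rounding rule are assumptions, and the cross-validation procedure itself is not reproduced here.

```python
import random


def participant_split(participant_ids, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with no participant on both sides.

    `participant_ids` holds one participant ID per window.
    """
    unique_ids = sorted(set(participant_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(test_fraction * len(unique_ids)))
    test_ids = set(unique_ids[:n_test])
    train_idx = [i for i, pid in enumerate(participant_ids) if pid not in test_ids]
    test_idx = [i for i, pid in enumerate(participant_ids) if pid in test_ids]
    return train_idx, test_idx
```

Splitting at the window level instead would leak near-duplicate overlapping windows from the same person into both sets and inflate test performance.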

Evaluation Metrics

Mean Absolute Error (MAE): Heart rate prediction error in BPM

Comparison Methods

  • From-Scratch Training Baseline: Same-sized models trained from scratch (without distillation)
  • Different Distillation Strategies: Hard distillation, soft distillation, DKD, feature distillation
  • Different Architectures: ResNet vs MLP student models

Implementation Details

  • Training Epochs: 300
  • Learning Rate: 5×10⁻⁴
  • Loss Function: Cross-entropy loss
  • Classification Setup: 180 classes (30-210 BPM)
  • Hardware: Nvidia RTX 2080-Ti GPU (for benchmarking)

Experimental Results

Main Results

1. Distilled Models Outperform From-Scratch Training

As shown in Figure 1 (soft distillation results):

  • Baseline Performance: From-scratch trained models align with the results reported by Meier et al. (the 8-block model reaches a comparable MAE)
  • Distillation Advantage: All distillation configurations outperform same-sized from-scratch trained models
  • Teacher Size Impact: Larger teacher models typically yield better student performance, though excessively large teachers may overfit and degrade performance

2. DKD Strategy Performs Optimally

Table 2 shows performance comparison with fixed 12-block teacher model:

Values are MAE in BPM (lower is better).

Student Model Size | Hard Distillation | Soft Distillation | DKD   | Feature Distillation
1 block (23K)      | 11.734            | 10.380            | 8.899 | 9.397
2 blocks (34K)     | 10.418            | 7.703             | 6.772 | 7.200
6 blocks (139K)    | 6.983             | 6.801             | 6.291 | 6.800
10 blocks (534K)   | 6.493             | 6.327             | 5.759 | 6.409

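Performance Ranking: DKD > Feature Distillation > Soft Distillation > Hard Distillation for students of up to 6 blocks (feature and soft distillation are nearly tied at 6 blocks); at 10 blocks, soft distillation (6.327) slightly outperforms feature distillation (6.409)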

Key Findings:

  • DKD performs best across all model configurations
  • Hard distillation performs worst due to insufficient information in discrete labels
  • DKD's advantage stems from flexible weighting of true and incorrect label probabilities
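
To make the gap between strategies concrete, the following sketch computes the relative MAE reduction of DKD over hard distillation for each student size, using the Table 2 values for the 12-block teacher. The variable and function names are illustrative.

```python
# Table 2 values (MAE in BPM): student size -> (hard distillation, DKD).
TABLE_2 = {
    "1 block (23K)": (11.734, 8.899),
    "2 blocks (34K)": (10.418, 6.772),
    "6 blocks (139K)": (6.983, 6.291),
    "10 blocks (534K)": (6.493, 5.759),
}


def relative_reduction(baseline_mae, improved_mae):
    """Percentage by which `improved_mae` is lower than `baseline_mae`."""
    return 100.0 * (baseline_mae - improved_mae) / baseline_mae


dkd_gain = {size: relative_reduction(hard, dkd)
            for size, (hard, dkd) in TABLE_2.items()}

for size, gain in dkd_gain.items():
    print(f"{size}: DKD lowers MAE by {gain:.1f}% vs. hard distillation")
```

The reduction is largest for the smallest students (about 24% and 35% at 1 and 2 blocks) and shrinks to roughly 10-11% at 6 and 10 blocks.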

3. Predictable Scaling Laws

Figure 2 shows scaling behavior under DKD strategy:

  • Exponential Curve Fitting: Consistent with language model distillation scaling laws, performance follows predictable exponential curves
  • Performance Saturation Point: Student models begin saturating at 6 residual blocks (139K parameters)
  • Strategy Differences: Soft distillation and feature distillation also follow this curve, but hard distillation shows more abrupt saturation at smaller models
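
Since no explicit formula for the scaling curve is given in this summary, the sketch below assumes a saturating exponential, MAE(N) = c + a·exp(−b·N), and fits it to the four DKD points from Table 2. Both the functional form and the fitting procedure (a scan over the floor c with a log-linear least-squares solve for a and b) are assumptions for illustration; with only four points the fit is indicative at best.

```python
import math

# DKD column of Table 2: (student parameters in thousands, MAE in BPM).
POINTS = [(23, 8.899), (34, 6.772), (139, 6.291), (534, 5.759)]


def fit_exponential(points, steps=200):
    """Fit mae = c + a * exp(-b * n); returns (a, b, c)."""
    xs = [n for n, _ in points]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    min_mae = min(m for _, m in points)
    best = None
    for k in range(steps):
        c = min_mae * k / steps  # candidate error floor, strictly below min MAE
        ys = [math.log(m - c) for _, m in points]
        y_mean = sum(ys) / len(ys)
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        a, b = math.exp(y_mean - slope * x_mean), -slope
        sse = sum((c + a * math.exp(-b * n) - m) ** 2 for n, m in points)
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


def predict(n_thousands, a, b, c):
    """Predicted MAE for a student with `n_thousands` thousand parameters."""
    return c + a * math.exp(-b * n_thousands)


a, b, c = fit_exponential(POINTS)
print(f"fitted floor c = {c:.3f} BPM, decay rate b = {b:.5f} per thousand parameters")
```

The fitted floor c plays the role of the saturation level: once a student's predicted MAE is close to c, adding residual blocks is unlikely to pay for its extra latency and memory.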

4. Architecture Impact on Scaling

Figure 3 compares ResNet and MLP student architectures:

  • ResNet Advantage: ResNet students significantly outperform MLP students across all parameter scales
  • Error Lower Bound: ResNet students reach a lower error floor than MLP students
  • Scaling Efficiency: ResNet demonstrates superior scaling efficiency
  • Universality: MLP also exhibits predictable scaling, though specific behavior varies by architecture

Ablation Studies

Teacher Model Size Impact

  • Larger teachers (222K → 534K → 864K parameters) typically yield better student performance
  • However, diminishing returns exist, with excessively large teachers potentially overfitting

DKD Hyperparameter Analysis

Through hyperparameter search:

  • α=1, β=8: NCKD weight is 8 times TCKD weight
  • Temperature τ=2: Controls probability distribution smoothness
  • CE Weight=1: Balances distillation loss and original task loss
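
The effect of the temperature can be seen on a toy example: dividing the logits by τ=2 before the softmax flattens the distribution, so more probability mass reaches the non-target classes. The logits below are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


toy_logits = [4.0, 2.0, 1.0, 0.0]   # hypothetical teacher logits
sharp = softmax(toy_logits, temperature=1.0)
smooth = softmax(toy_logits, temperature=2.0)
print("T=1:", [round(p, 3) for p in sharp])
print("T=2:", [round(p, 3) for p in smooth])
```

A flatter teacher distribution gives the NCKD term more signal to work with, which fits the emphasis the β=8 setting places on non-target classes.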

Computational Efficiency Analysis

Table 3 shows system benchmarking results:

Model Size | Inference Time (s) | Memory Usage (MB)
1 block    | 0.512±0.025        | 9.468
6 blocks   | 2.622±0.167        | 11.275
12 blocks  | 4.758±0.130        | 23.483

Distillation Benefits (12 blocks → 1 block):

  • Inference time reduction: ~90% (4.758s → 0.512s)
  • Memory usage reduction: ~60% (23.483MB → 9.468MB)
  • Performance loss: ~30% MAE increase (per-size student MAE values are listed in Table 2)
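
The rounded percentages above follow directly from the Table 3 values, as this short calculation shows. The function name is illustrative.

```python
def percent_reduction(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Table 3: 12-block model vs. 1-block model.
time_reduction = percent_reduction(4.758, 0.512)      # inference time in seconds
memory_reduction = percent_reduction(23.483, 9.468)   # memory usage in MB

print(f"inference time: {time_reduction:.1f}% lower")
print(f"memory usage:   {memory_reduction:.1f}% lower")
```

The exact figures are about 89.2% and 59.7%, which the summary rounds to 90% and 60%.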

Experimental Findings

  1. Universal Distillation Effectiveness: Distillation consistently outperforms from-scratch training across all configurations
  2. Strategy Selection Importance: With the 12-block teacher, DKD lowers MAE relative to hard distillation by roughly 10% to 35%, depending on student size (Table 2)
  3. Existence of Scaling Laws: Physiological sensing tasks also follow predictable exponential scaling curves
  4. Critical Role of Architecture Design: Inductive biases significantly impact distillation effectiveness
  5. Practical Trade-offs: Distillation enables substantial computational efficiency gains with moderate performance loss

Related Work

Knowledge Distillation Foundations

  • Hinton et al. (2015): Proposed classical soft distillation method, softening probability distributions through temperature parameter
  • Zhao et al. (2022): Proposed Decoupled Knowledge Distillation (DKD), separating target and non-target class information
  • Romero et al. (2015): Proposed FitNets feature distillation method

Domain Applications

  • Language Models: DistilBERT successfully optimizes BERT for edge deployment
  • Audio Processing: Peplinski et al. (2020) distill audio models for mobile devices
  • Activity Recognition: Tang et al. (2021) distill accelerometer models for human activity recognition

Scaling Law Research

  • Busbridge et al. (2025): First to establish scaling laws for language model distillation
  • This Paper's Contribution: Extends scaling law research to physiological sensing domain

PPG Heart Rate Estimation

  • Meier et al. (2024): Provides WildPPG dataset and ResNet baseline
  • Narayanswamy et al. (2024): Proposes scaling research for wearable foundation models
  • Pillai et al. (2024), Saha et al. (2025): Develop PPG foundation models

Research Gaps

This paper addresses the lack of systematic distillation characterization and predictable scaling laws in the physiological sensing domain.

Conclusions and Discussion

Main Conclusions

  1. Distillation Effectiveness: Knowledge distillation can successfully compress large PPG heart rate estimation models into small models suitable for edge deployment
  2. Strategy Superiority: DKD outperforms all evaluated strategies, particularly suited for semantically ordered classification tasks
  3. Scaling Predictability: Distilled model performance follows exponential scaling curves, consistent with language model findings
  4. Practical Trade-offs: Achieves 90% inference time and 60% memory reduction with moderate performance loss
  5. Architecture Importance: Model architecture choice significantly influences distillation scaling behavior

Limitations

1. Dataset Generalization

  • Current Approach: Uses simple cross-validation, mixing samples from three datasets
  • Limitation: Insufficient evaluation of cross-dataset generalization (training on one dataset, testing on another)
  • Reference Direction: Cross-dataset research methods by Kasnesis et al. (2025)

2. Model Architecture Limitations

  • Current Choice: Uses simple ResNet backbone and supervised learning
  • Improvement Space:
    • Explore larger self-supervised pre-trained models
    • Leverage richer features learned through contrastive learning methods
    • Authors mention forthcoming open-source models available for future research

3. Distillation Strategy Exploration

  • Current Work: Evaluates four baseline strategies from literature
  • Future Direction: Develop new distillation methods specifically optimized for physiological sensing tasks

4. Hardware Evaluation Limitations

  • Benchmark Platform: Uses Nvidia RTX 2080-Ti GPU for testing
  • Real Scenarios: Wearable devices use microprocessors with different performance characteristics
  • Requirement: Evaluation on actual target hardware

Future Directions

  1. Cross-Dataset Generalization Research: Systematically evaluate distilled model transfer capabilities across different datasets
  2. Self-Supervised Teacher Models: Leverage contrastive learning and other methods to train stronger teacher models
  3. Customized Distillation Strategies: Develop distillation methods specifically tailored to PPG signal characteristics
  4. Real Hardware Deployment: Validate and optimize models on actual wearable devices
  5. Multi-Task Extension: Extend research to other physiological metric estimation tasks like heart rate variability

In-Depth Evaluation

Strengths

1. High Research Value

  • Fills Research Gap: First systematic study of distillation scaling laws in physiological sensing domain
  • Practical Orientation: Directly addresses real deployment needs of wearable devices
  • Theoretical Contribution: Extends scaling law research from language models to time series health data

2. Rigorous Experimental Design

  • Comprehensive Comparison: Evaluates four distillation strategies across multiple model capacity configurations
  • Multi-Dataset Validation: Uses three independent PPG datasets (107 hours of data)
  • Cross-Validation: Employs 2-fold cross-validation to enhance result reliability
  • Participant-Independent Split: Avoids data leakage, ensures generalization assessment

3. Insightful Findings

  • DKD Advantage Mechanism: Deeply explains why 8:1 weight ratio suits ordered classification
  • Architecture Inductive Bias: Reveals fundamental differences between ResNet and MLP
  • Scaling Law Verification: Confirms applicability of exponential curves in new domain
  • Saturation Point Identification: Identifies 139K parameters as critical performance-efficiency balance point

4. Clear Writing

  • Logical Structure: Clear progression from motivation through methods to results
  • Effective Visualization: Heatmaps in Figure 1, scaling curves in Figures 2 and 3 are intuitive
  • Honest Presentation: Clearly labels the work as an "early investigation"

Weaknesses

1. Limited Experimental Scale

  • Teacher Model Capacity: The largest teacher has only 864K parameters; larger-scale models remain unexplored
  • Data Volume: 107 hours relatively modest for modern large-scale research
  • Architecture Diversity: Only compares ResNet and MLP, excludes modern architectures like Transformers

2. Insufficient Theoretical Analysis

  • Scaling Law Form: No specific mathematical formula provided
  • Fitting Parameters: Specific parameters and goodness-of-fit for exponential curves not reported
  • Theoretical Explanation: Lacks theoretical derivation for why exponential curves apply

3. Incomplete Practical Validation

  • Hardware Platform: Only GPU testing, lacks real wearable device evaluation
  • Power Analysis: Neglects energy consumption, critical for edge devices
  • Real-Time Verification: Lacks validation in actual application scenarios

4. Insufficient Generalization Analysis

  • Cross-Dataset Evaluation: Authors acknowledge this as main limitation
  • Different Physiological Tasks: Focuses only on heart rate; other physiological metrics remain unexplored
  • Population Diversity: Lacks analysis of performance differences across populations (age, health status)

5. DKD Hyperparameter Sensitivity

  • Hyperparameter Selection: β=8 choice lacks sufficient ablation
  • Task Dependency: Robustness across different task settings remains unexplored
  • Automatic Tuning: No systematic method provided for hyperparameter selection

Impact

1. Academic Contribution

  • Pioneering: First to establish distillation scaling laws in physiological sensing domain
  • Methodological Value: Provides systematic evaluation framework for future research
  • Cross-Domain Inspiration: Generalizable to other time series health data tasks

2. Practical Value

  • Industry Application: Directly supports smartwatch and fitness tracker development
  • Performance-Efficiency Trade-off: 90% inference time reduction provides viable deployment path
  • Predictability: Scaling laws enable more scientific model design

3. Limitations

  • Early-Stage Research: Authors explicitly position as "early investigation" requiring further validation
  • Reproducibility Challenge: Uses public datasets, but makes no commitment to release code
  • Deployment Gap: Distance remains from GPU benchmarks to wearable devices

Applicable Scenarios

Most Suitable Scenarios

  1. Resource-Constrained Wearables: Smartwatches, fitness trackers, etc.
  2. Real-Time Heart Rate Monitoring: Exercise feedback, health tracking applications
  3. Privacy-Sensitive Scenarios: Edge inference avoids cloud data transmission
  4. Early Model Design: Use scaling laws to predict and plan model capacity

Scenarios Requiring Caution

  1. Medical-Grade Accuracy Requirements: Current performance may be insufficient for clinical diagnosis
  2. Extreme Environments: Intense exercise, low temperature, and other insufficiently tested scenarios
  3. Cross-Device Generalization: Different sensor hardware may require retraining
  4. Multi-Modal Fusion: The study considers only the single PPG modality

Extension Potential

  1. Other Physiological Signals: Heart rate variability, blood oxygen saturation, blood pressure estimation
  2. Multi-Modal Sensing: Combine accelerometer, gyroscope and other sensors
  3. Personalized Models: Model fine-tuning for specific users
  4. Disease Screening: Applications like arrhythmia detection, sleep apnea screening

References

Key Cited Works

  1. Busbridge et al. (2025) - Distillation Scaling Laws: First to establish mathematical scaling laws for language model distillation, important theoretical foundation for this paper
  2. Hinton et al. (2015) - Knowledge Distillation Foundational Work: Proposed soft distillation method and temperature parameter concept
  3. Zhao et al. (2022) - Decoupled Knowledge Distillation (DKD): Original paper for this work's best-performing strategy
  4. Meier et al. (2024) - WildPPG Dataset: Primary dataset source and baseline model reference for this paper
  5. Sanh et al. (2019) - DistilBERT: Successful language model distillation case, demonstrating feasibility of distillation for large-scale models
  6. Kasnesis et al. (2025) - PPG Knowledge Distillation Application: Authors' referenced cross-dataset generalization research

Together, these works provide the theoretical and methodological context for the paper.


Overall Evaluation: This is a well-positioned, rigorously executed preliminary research paper. While limited in experimental scale and theoretical depth, it pioneers scaling law research in the physiological sensing domain and provides a practical, predictable methodological framework for wearable device model optimization. The superior performance of the DKD strategy and the discovery of exponential scaling curves offer important practical guidance. Future validation on larger-scale data, more diverse architectures, and actual hardware would strengthen its impact on wearable health monitoring technology.