
A metrological framework for uncertainty evaluation in machine learning classification models

Bilson, Cox, Pustogvar et al.

Machine learning (ML) classification models are increasingly being used in a wide range of applications where it is important that predictions are accompanied by uncertainties, including in climate and earth observation, medical diagnosis and bioaerosol monitoring. The output of an ML classification model is a type of categorical variable known as a nominal property in the International Vocabulary of Metrology (VIM). However, concepts related to uncertainty evaluation for nominal properties are not defined in the VIM, nor is such evaluation addressed by the Guide to the Expression of Uncertainty in Measurement (GUM). In this paper we propose a metrological conceptual uncertainty evaluation framework for nominal properties. This framework is based on probability mass functions and summary statistics thereof, and it is applicable to ML classification. We also illustrate its use in the context of two applications that exemplify the issues and have significant societal impact, namely, climate and earth observation and medical diagnosis. Our framework would enable an extension of the GUM to uncertainty for nominal properties, which would make both applicable to ML classification models.

Basic Information

Paper ID: 2504.03359
Title: A metrological framework for uncertainty evaluation in machine learning classification models
Authors: Samuel Bilson, Maurice Cox, Anna Pustogvar, Andrew Thompson (National Physical Laboratory, UK)
Classification: cs.LG (Machine Learning)
Publication Date: October 15, 2025 (arXiv v3)
Paper Link: https://arxiv.org/abs/2504.03359

Abstract

Machine learning classification models are increasingly deployed in critical application domains such as climate observation, medical diagnosis, and bioaerosol monitoring, where predictions must be accompanied by uncertainty assessments. The output of an ML classification model is a categorical variable, referred to as a nominal property in the International Vocabulary of Metrology (VIM). However, neither the VIM nor the Guide to the Expression of Uncertainty in Measurement (GUM) defines concepts for uncertainty evaluation of nominal properties. This paper proposes a metrological framework for uncertainty evaluation of nominal properties based on probability mass functions and their summary statistics, applicable to ML classification. The framework is illustrated through two application case studies with significant societal impact: climate observation and medical diagnosis. The framework would enable the GUM to be extended to uncertainty evaluation of nominal properties, making both applicable to ML classification models.

Research Background and Motivation

Problem Background

Growing Application Demand: ML classification models are increasingly applied in critical domains including climate observation, medical diagnosis, and bioaerosol monitoring, where predictions must be accompanied by credible uncertainty assessments.
Absence of Metrological Standards: Existing metrological standards (VIM and GUM) are primarily designed for quantitative variables and lack a framework for uncertainty evaluation of nominal properties, which are the outputs of classification models.
Multiple Uncertainty Sources: ML classification models involve multiple sources of uncertainty including training data uncertainty, class assignment uncertainty, model selection uncertainty, model parameter uncertainty, and new input data uncertainty.

Research Motivation

Establish a standardized uncertainty evaluation framework enabling ML classification models to be integrated into metrological traceability chains
Provide credible prediction uncertainty for high-risk applications such as medical diagnosis
Extend the existing GUM framework to encompass nominal properties

Limitations of Existing Approaches

GUM is primarily applicable to continuous quantitative variables and cannot be directly applied to classification outputs
Existing conformity assessment methods apply only to rule-based binary classification and are unsuitable for trained ML models
Lack of standardized methods for nominal property uncertainty propagation

Core Contributions

Proposed a metrological uncertainty evaluation framework for nominal properties: Based on probability mass functions (PMF) and summary statistics, providing a systematic uncertainty evaluation methodology for ML classification models.
Established uncertainty propagation mechanisms: Demonstrated how to propagate nominal property uncertainty through PMF in multi-stage measurement models, supporting both analytical and Monte Carlo methods.
Systematically compared uncertainty statistics: Evaluated the characteristics and applicability of multiple uncertainty summary statistics, including the Wilcox Variation Ratio (WVR), information entropy, and the Index of Qualitative Variation (IQV).
Validated framework practicality: Through two important application case studies—land cover classification and atrial fibrillation detection—demonstrated the framework's effectiveness in real-world problems.
Laid the foundation for a GUM extension: The framework enables the GUM to be extended to uncertainty evaluation of nominal properties, closing a gap in the metrological standards system.

Methodology Details

Task Definition

This paper addresses the uncertainty evaluation task for ML classification models:

Input: Set of input variables X (may include quantitative and categorical variables)
Output: Categorical variable $Y \in C_K = \{c_1, \dots, c_K\}$, where $K$ is the number of classes
Objective: Evaluate the uncertainty of classification prediction y = f(x)

Theoretical Framework

1. Probability Mass Function (PMF)

For nominal variables, complete uncertainty information is expressed by the PMF:

$$p : C_K \to [0, 1], \qquad c_k \mapsto p_k := p(c_k),$$

satisfying the normalization condition $\sum_{k=1}^{K} p_k = 1$.
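
As a minimal sketch of how such a PMF might be represented and checked in code, the following uses a plain dictionary from class label to probability. The helper names (`make_pmf`, `softmax`, `modal_class`) and the raw scores are hypothetical, and the use of a softmax to turn classifier scores into probabilities is an assumption for illustration; the summary does not say how the paper's models produce their PMFs.

```python
import math

def make_pmf(classes, probs, tol=1e-9):
    """Validate (classes, probs) as a PMF and return it as {class: probability}."""
    if len(classes) != len(probs):
        raise ValueError("classes and probs must have the same length")
    if len(set(classes)) != len(classes):
        raise ValueError("class labels must be unique")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("each probability must lie in [0, 1]")
    if not math.isclose(sum(probs), 1.0, abs_tol=tol):
        raise ValueError("probabilities must sum to 1")
    return dict(zip(classes, probs))

def softmax(scores):
    """Map raw scores to probabilities that sum to 1 (assumed link function)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def modal_class(pmf):
    """Return the class with the highest probability (the point prediction)."""
    return max(pmf, key=pmf.get)

# Hypothetical scores for the four land cover classes named in Case Study 1.
classes = ["forest", "farmland", "grassland", "residential"]
pmf = make_pmf(classes, softmax([2.0, 0.5, 0.1, -1.0]))
print(modal_class(pmf), pmf)
```

Keeping the full PMF rather than only the modal class is what allows the summary statistics and the propagation step described in this section to be computed later.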

2. Uncertainty Statistics

The paper systematically evaluated seven uncertainty statistics:

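Wilcox Variation Ratio (WVR):

$$u_{\mathrm{WVR}}(p) = 1 - \frac{K\hat{p} - 1}{K - 1}$$

Information Entropy:

$$H(p) = -\sum_{k=1}^{K} p_k \log_K p_k$$

Index of Qualitative Variation (IQV):

$$u_{\mathrm{IQV}}(p) = \frac{K}{K-1}\left(1 - \sum_{k=1}^{K} p_k^2\right)$$

where $\hat{p} = \max_k p_k$ is the modal probability (highest class probability). Each of these three statistics equals 0 when all probability sits on a single class and 1 for the uniform PMF.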
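
The three formulas above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, a PMF is passed as a plain list of probabilities, and terms with zero probability are skipped in the entropy sum (the usual convention that 0 log 0 = 0).

```python
import math

def wvr(p):
    """Wilcox Variation Ratio: 1 - (K * p_hat - 1) / (K - 1)."""
    k = len(p)
    return 1.0 - (k * max(p) - 1.0) / (k - 1.0)

def entropy_k(p):
    """Information entropy with logarithm base K, so the maximum is 1."""
    k = len(p)
    return -sum(pk * math.log(pk, k) for pk in p if pk > 0.0)

def iqv(p):
    """Index of Qualitative Variation: K / (K - 1) * (1 - sum of squares)."""
    k = len(p)
    return k / (k - 1.0) * (1.0 - sum(pk * pk for pk in p))

# Compare a confident, an ambiguous, and a uniform four-class PMF.
for p in ([0.97, 0.01, 0.01, 0.01], [0.5, 0.3, 0.1, 0.1], [0.25] * 4):
    print(p, round(wvr(p), 4), round(entropy_k(p), 4), round(iqv(p), 4))
```

All three functions require K >= 2, since the normalizing factor K - 1 would otherwise be zero.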

3. Uncertainty Propagation

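For a measurement model $z = g(x, y)$ with nominal input $y$, the expected value and variance of the output follow from the laws of total expectation and total variance:

$$E[z] = \sum_{k=1}^{K} p_k \mu_k$$
$$\mathrm{Var}[z] = \sum_{k=1}^{K} p_k \left(\sigma_k^2 + \mu_k^2\right) - \left(\sum_{k=1}^{K} p_k \mu_k\right)^2$$

where $\mu_k$ and $\sigma_k^2$ denote the mean and variance of $z$ conditional on $y = c_k$.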
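
The two propagation routes mentioned in this summary (analytical and Monte Carlo) can be compared with a short sketch. Here the per-class means and variances are read as the mean and variance of z given each class, and the Monte Carlo version additionally assumes that z given the class is Gaussian; that distributional choice, the function names, and the numbers are illustrative assumptions, not taken from the paper.

```python
import random

def propagate_analytical(p, mu, sigma2):
    """Mean and variance of z from the mixture formulas for E[z] and Var[z]."""
    mean = sum(pk * mk for pk, mk in zip(p, mu))
    second_moment = sum(pk * (s2 + mk * mk) for pk, mk, s2 in zip(p, mu, sigma2))
    return mean, second_moment - mean * mean

def propagate_monte_carlo(p, mu, sigma2, n=100_000, seed=0):
    """Sample a class from the PMF, then z given that class (assumed Gaussian)."""
    rng = random.Random(seed)
    ks = rng.choices(range(len(p)), weights=p, k=n)
    draws = [rng.gauss(mu[k], sigma2[k] ** 0.5) for k in ks]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / (n - 1)
    return mean, var

# Hypothetical three-class example: PMF, conditional means, conditional variances.
p = [0.7, 0.2, 0.1]
mu = [1.0, 2.0, 4.0]
sigma2 = [0.1, 0.2, 0.5]
print("analytical :", propagate_analytical(p, mu, sigma2))
print("monte carlo:", propagate_monte_carlo(p, mu, sigma2))
```

The analytical route needs only the first two conditional moments, whereas the Monte Carlo route needs a full conditional distribution but also works when g is not available in closed form.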

Uncertainty Source Identification

The paper identified five major uncertainty sources in ML classification:

Training Data Uncertainty: Measurement uncertainty inherent in the training data itself
Class Assignment Uncertainty: Classification ambiguity inherent to the task
Model Selection Uncertainty: Uncertainty in the choice of model type
Model Parameter Uncertainty: Uncertainty in parameter estimation and optimization
New Input Data Uncertainty: Measurement uncertainty of input data during prediction

Experimental Setup

Case Study 1: Land Cover Classification

Dataset:

Sentinel-2 satellite imagery
20 km × 20 km region in Scotland
189,142 pixels with four classes: forest, farmland, grassland, residential areas
Data from 2020 and 2021

Method: Bayesian Quadratic Discriminant Analysis (BQDA)

Generative modeling approach
Explicitly models multiple uncertainty sources
Multivariate Gaussian distribution assumption
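
To illustrate how a generative Gaussian classifier yields a PMF over classes, the sketch below fits one Gaussian per class and applies Bayes' rule. It is deliberately simplified and is not the paper's BQDA: it is univariate rather than multivariate, and it plugs in point estimates of the class parameters instead of treating them in a Bayesian way. The feature values and function names are hypothetical.

```python
import math
from statistics import fmean, variance

def fit_gaussian_classes(xs, ys):
    """Estimate (prior, mean, variance) per class from labeled 1-D features."""
    n = len(xs)
    params = {}
    for c in sorted(set(ys)):
        xc = [x for x, y in zip(xs, ys) if y == c]
        params[c] = (len(xc) / n, fmean(xc), variance(xc))
    return params

def posterior_pmf(x, params):
    """Class PMF at x via Bayes' rule with Gaussian class-conditional densities."""
    log_joint = {}
    for c, (prior, mu, var) in params.items():
        log_lik = -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        log_joint[c] = math.log(prior) + log_lik
    m = max(log_joint.values())  # log-sum-exp shift for numerical stability
    unnorm = {c: math.exp(v - m) for c, v in log_joint.items()}
    total = sum(unnorm.values())
    return {c: u / total for c, u in unnorm.items()}

# Hypothetical 1-D feature (for example, a vegetation index) for two classes.
xs = [0.10, 0.20, 0.15, 0.30, 0.70, 0.80, 0.75, 0.90]
ys = ["grassland"] * 4 + ["forest"] * 4
params = fit_gaussian_classes(xs, ys)
for x in (0.15, 0.50, 0.80):
    print(x, posterior_pmf(x, params))
```

Because each class keeps its own variance, the decision boundary is quadratic in x, which is the defining property of quadratic discriminant analysis.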

Evaluation Metrics:

Classification loss (misclassification rate)
Expected Cross-Entropy loss (EXE)
Expected Brier Score (EBS)
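
The three metrics can be sketched with their standard textbook definitions, averaged over a test set. This is an assumption: the summary does not give the paper's exact definitions of EXE and EBS, so the logarithm base (natural log here) and the Brier normalization (sum over classes, so values lie in [0, 2]) may differ from the paper's. Labels are integer class indices, and the function names are hypothetical.

```python
import math

def argmax(p):
    """Index of the largest probability (first index wins ties)."""
    return max(range(len(p)), key=p.__getitem__)

def misclassification_rate(pmfs, labels):
    """Fraction of samples whose modal class differs from the true label."""
    return sum(1 for p, y in zip(pmfs, labels) if argmax(p) != y) / len(labels)

def mean_cross_entropy(pmfs, labels, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(pmfs, labels)) / len(labels)

def mean_brier_score(pmfs, labels):
    """Average squared distance between the PMF and the one-hot true label."""
    total = 0.0
    for p, y in zip(pmfs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)

# Hypothetical predictions for three samples and their true class indices.
pmfs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]
labels = [0, 1, 1]
print(misclassification_rate(pmfs, labels),
      mean_cross_entropy(pmfs, labels),
      mean_brier_score(pmfs, labels))
```

Unlike the misclassification rate, cross-entropy and the Brier score depend on the whole PMF, so they penalize confident wrong predictions more heavily than hesitant ones.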

Case Study 2: Atrial Fibrillation Detection

Dataset:

DeepBeat PPG dataset
134 patients, over 100,000 signal segments
25-second duration, 32 Hz sampling rate
Binary classification task (AF/non-AF)

Method: Convolutional Neural Network + Monte Carlo Dropout

Discriminative modeling approach
xresnet1d50 variant architecture
Captures aleatoric and epistemic uncertainty
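
The idea behind Monte Carlo Dropout is to keep dropout active at prediction time, run several stochastic forward passes, and average the resulting PMFs. The toy sketch below replaces the xresnet1d50 network with a three-weight logistic model, so everything about the model (weights, inputs, drop rate) is hypothetical. The entropy-based split into aleatoric and epistemic parts is a commonly used decomposition and is shown here as an assumption; the summary does not state which decomposition the paper uses.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def stochastic_forward(x, weights, bias, drop_rate, rng):
    """One pass of a toy logistic model with dropout left ON at inference."""
    keep = 1.0 - drop_rate
    a = bias
    for xi, wi in zip(x, weights):
        if rng.random() < keep:       # this input unit survives dropout
            a += wi * xi / keep       # inverted-dropout rescaling
    p_af = sigmoid(a)
    return [p_af, 1.0 - p_af]         # PMF over (AF, non-AF)

def entropy_bits(p):
    """Shannon entropy in bits; for two classes this is at most 1."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0.0)

def mc_dropout_pmf(x, weights, bias, drop_rate=0.2, passes=200, seed=0):
    """Average the PMFs of many stochastic passes and split the entropy."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, weights, bias, drop_rate, rng)
               for _ in range(passes)]
    mean_pmf = [sum(s[k] for s in samples) / passes for k in range(2)]
    total = entropy_bits(mean_pmf)                             # predictive entropy
    aleatoric = sum(entropy_bits(s) for s in samples) / passes  # expected entropy
    return mean_pmf, total, aleatoric, total - aleatoric       # last: epistemic

mean_pmf, total, aleatoric, epistemic = mc_dropout_pmf(
    x=[0.8, -0.3, 1.2], weights=[1.5, -2.0, 0.7], bias=-0.2)
print(mean_pmf, total, aleatoric, epistemic)
```

The epistemic term is non-negative because entropy is concave, so the entropy of the averaged PMF can never be smaller than the average of the per-pass entropies.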

Experimental Results

Land Cover Classification Results

Classification Performance:

2020 test: loss=0.012, EXE=0.079, EBS=0.031
2021 test: loss=0.057, EXE=0.567, EBS=0.151
Significant cross-year performance degradation reflects distribution shift effects

Uncertainty Statistics Performance (2020):

The median and mean differ by orders of magnitude, indicating a highly right-skewed distribution (most values lie near zero, with a long upper tail)
Information entropy H is most sensitive to small value changes
UVR is least sensitive to small value changes
WVR, SDM, CNV show equivalent performance in high-confidence predictions

Atrial Fibrillation Detection Results

Classification Performance:

Classification loss: 0.209
EXE: 0.874
EBS: 0.622

Uncertainty Statistics:

Because classification performance is lower than in the land cover task, uncertainty statistic values are generally higher
In binary classification, WVR, SDM, CNV are mathematically equivalent
Information entropy remains the most sensitive statistic

Key Findings

Sensitivity Ranking of Statistics: Information Entropy > IQV > WVR/SDM/CNV > UVR
Binary Classification Equivalence: WVR, SDM, CNV are mathematically equivalent in binary classification
High-Confidence Approximation: Several statistics become approximately equivalent for high-confidence multi-class predictions
Performance-Uncertainty Relationship: Lower classification performance correlates with higher uncertainty statistic values
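
The ordering of entropy, IQV, and WVR in the sensitivity ranking can be checked with a short calculation for a high-confidence prediction. This is an illustrative derivation under the assumption that the modal probability is $1 - \varepsilon$ with small $\varepsilon$; it is not taken from the paper, and it does not cover SDM, CNV, or UVR, whose definitions are not given in this summary. The index $\hat{k}$ denotes the modal class.

```latex
\begin{align*}
\hat{p} &= 1 - \varepsilon, \qquad 0 < \varepsilon \ll 1, \\
u_{\mathrm{WVR}}(p) &= 1 - \frac{K(1-\varepsilon) - 1}{K - 1}
  = \frac{K}{K-1}\,\varepsilon, \\
u_{\mathrm{IQV}}(p) &= \frac{K}{K-1}\Bigl(1 - (1-\varepsilon)^2
  - \sum_{k \neq \hat{k}} p_k^2\Bigr)
  = \frac{K}{K-1}\bigl(2\varepsilon + O(\varepsilon^2)\bigr)
  \approx 2\,u_{\mathrm{WVR}}(p), \\
H(p) &\geq -\sum_{k \neq \hat{k}} p_k \log_K p_k
  \geq \varepsilon \log_K \frac{1}{\varepsilon}.
\end{align*}
```

The last bound uses $p_k \leq \varepsilon$ for every non-modal class. WVR and IQV therefore shrink linearly in $\varepsilon$, with IQV roughly twice WVR, while entropy shrinks more slowly by a factor of order $\log_K(1/\varepsilon)$, which is why it remains the largest of the three for confident predictions.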

Related Work

Metrological Standards

GUM Suite: Primarily addresses uncertainty evaluation for quantitative variables
VIM: Defines the concept of nominal properties but lacks uncertainty evaluation methods
Conformity Assessment: Applies only to rule-based binary classification

ML Uncertainty Evaluation

Bayesian Methods: Such as Bayesian neural networks, variational inference
Ensemble Methods: Such as Monte Carlo Dropout, deep ensembles
Probability Calibration: Improves credibility of predicted probabilities

Related Domains

Clinical Laboratory Science: IFCC-IUPAC nominal property vocabulary
Qualitative Chemical Analysis: EURACHEM/CITAC guidelines
Reference Materials: ISO 33406:2024 standard

Conclusions and Discussion

Main Conclusions

The PMF is a Complete Expression of Nominal Property Uncertainty: Analogous to the PDF for continuous variables, the PMF provides complete information about classification prediction uncertainty.
Multiple Statistics Have Distinct Advantages: Information entropy is most sensitive but potentially oversensitive; modal probability-based statistics like WVR are more intuitive; selection should be based on specific application requirements.
Framework is Practically Applicable: Two case studies demonstrate the framework's applicability across different domains and model types.
Supports Uncertainty Propagation: PMF enables uncertainty propagation of nominal properties through multi-stage models.

Limitations

i.i.d. Assumption: The framework assumes training and test data are independently and identically distributed; distribution shift affects reliability.
Computational Complexity: Some methods (such as full Bayesian inference) have high computational costs.
Model Selection Uncertainty: Most methods do not adequately account for uncertainty in model architecture selection.
Input Uncertainty Modeling: Explicit modeling of input uncertainty in deep learning methods remains challenging.

Future Directions

GUM Extension: Formally incorporate nominal property uncertainty evaluation into the GUM framework
Standardization: Develop international standards for ML classification model uncertainty evaluation
Method Improvement: Develop more efficient uncertainty quantification methods
Application Expansion: Validate framework effectiveness in additional critical application domains

In-Depth Evaluation

Strengths

Fills an Important Gap: Presents the first systematic metrological uncertainty evaluation framework for ML classification models, addressing a significant gap in the GUM and VIM standards.
Theoretical Rigor: Based on probability theory foundations, establishes a complete theoretical system from PMF to summary statistics, maintaining consistency with existing metrological standards.
Strong Practicality: Two case studies spanning different application domains, data types, and model architectures demonstrate broad framework applicability.
Systematic Comparison: Comprehensive comparison of seven uncertainty statistics provides guidance for practical application selection.
Forward-Looking: Provides important support for credible deployment of ML technology in high-risk applications.

Limitations

Limited Uncertainty Sources: While five uncertainty sources are identified, not all are modeled in practical cases, particularly model selection uncertainty.
Assumptions: The i.i.d. assumption is frequently violated in practical applications, but the paper provides insufficient discussion of this.
Computational Efficiency: Computational complexity of some methods (such as full Bayesian inference) limits practical application.
Limited Validation: Only two case studies; framework effectiveness requires validation across more domains and scenarios.

Impact

Standards Development: Likely to promote updates to international metrological standards, incorporating ML classification into formal frameworks.
Industrial Application: Provides credibility assurance for ML applications in critical domains such as healthcare and environmental monitoring.
Academic Value: Bridges metrology and machine learning disciplines, promoting interdisciplinary collaboration.
Reproducibility: Provides clear theoretical framework and implementation details, facilitating adoption by other researchers.

Applicable Scenarios

High-Risk Applications: Medical diagnosis, safety monitoring, and other scenarios with stringent reliability requirements
Regulatory Environments: Industrial and research applications requiring compliance with metrological standards
Multi-Stage Systems: Complex systems where classification results propagate to subsequent processing steps
Quality Assurance: Production and service systems requiring quantification of prediction credibility

References

The paper cites 86 references covering metrological standards, machine learning theory, uncertainty quantification methods, and specific application domains, providing a solid theoretical foundation and broad application context. Key references include GUM documentation, VIM vocabulary, Bayesian machine learning methods, and uncertainty quantification techniques.