A metrological framework for uncertainty evaluation in machine learning classification models
Bilson, Cox, Pustogvar et al.
Machine learning (ML) classification models are increasingly being used in a wide range of applications where it is important that predictions are accompanied by uncertainties, including in climate and earth observation, medical diagnosis and bioaerosol monitoring. The output of an ML classification model is a type of categorical variable known as a nominal property in the International Vocabulary of Metrology (VIM). However, concepts related to uncertainty evaluation for nominal properties are not defined in the VIM, nor is such evaluation addressed by the Guide to the Expression of Uncertainty in Measurement (GUM). In this paper we propose a metrological conceptual uncertainty evaluation framework for nominal properties. This framework is based on probability mass functions and summary statistics thereof, and it is applicable to ML classification. We also illustrate its use in the context of two applications that exemplify the issues and have significant societal impact, namely, climate and earth observation and medical diagnosis. Our framework would enable an extension of the GUM to uncertainty for nominal properties, which would make both applicable to ML classification models.
Machine learning classification models are increasingly deployed in critical application domains such as climate observation, medical diagnosis, and bioaerosol monitoring, where predictions must be accompanied by uncertainty assessments. The output of an ML classification model is a categorical variable, referred to as a nominal property in the International Vocabulary of Metrology (VIM). However, neither the VIM nor the Guide to the Expression of Uncertainty in Measurement (GUM) defines concepts for uncertainty evaluation of nominal properties. This paper proposes a metrological framework for uncertainty evaluation of nominal properties based on probability mass functions and their summary statistics, applicable to ML classification. The framework is illustrated through two application case studies with significant societal impact: climate observation and medical diagnosis. The framework would enable the GUM to be extended to uncertainty evaluation of nominal properties, making both applicable to ML classification models.
Growing Application Demand: ML classification models are increasingly applied in critical domains including climate observation, medical diagnosis, and bioaerosol monitoring, where predictions must be accompanied by credible uncertainty assessments.
Absence of Metrological Standards: Existing metrological standards (VIM and GUM) are primarily designed for quantitative variables and lack a framework for uncertainty evaluation of nominal properties, which are the outputs of classification models.
Multiple Uncertainty Sources: ML classification models involve multiple sources of uncertainty including training data uncertainty, class assignment uncertainty, model selection uncertainty, model parameter uncertainty, and new input data uncertainty.
Proposed a metrological uncertainty evaluation framework for nominal properties: Based on probability mass functions (PMFs) and their summary statistics, it provides a systematic uncertainty evaluation methodology for ML classification models.
Established uncertainty propagation mechanisms: Demonstrated how to propagate nominal property uncertainty through PMFs in multi-stage measurement models, supporting both analytical and Monte Carlo methods.
Systematically compared uncertainty statistics: Evaluated the characteristics and applicability of multiple ways of expressing uncertainty, including the Wilcox Variation Ratio (WVR), information entropy, and the Index of Qualitative Variation (IQV).
Validated framework practicality: Demonstrated the framework's effectiveness on real-world problems through two application case studies, land cover classification and atrial fibrillation detection.
Laid foundation for GUM extension: The framework enables the GUM to be extended to uncertainty evaluation of nominal properties, closing a gap in the existing metrological standards.
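The summary statistics named above can be illustrated with a minimal Python sketch. It assumes commonly used textbook definitions that are not taken from the paper: a variation ratio normalized as K/(K-1) * (1 - p_mode), Shannon entropy normalized by log K, and the IQV as K/(K-1) * (1 - sum of squared probabilities). The paper's exact definitions and normalizations may differ, and the function names and the example PMF are hypothetical.

```python
import math


def _check_pmf(pmf, tol=1e-9):
    """Raise ValueError unless pmf is a valid PMF over at least two classes."""
    if len(pmf) < 2:
        raise ValueError("need at least two classes")
    if any(p < 0 for p in pmf) or abs(sum(pmf) - 1.0) > tol:
        raise ValueError("probabilities must be non-negative and sum to 1")


def variation_ratio(pmf):
    """Normalized variation ratio (assumed form): depends only on the modal probability."""
    _check_pmf(pmf)
    k = len(pmf)
    return k / (k - 1) * (1.0 - max(pmf))


def normalized_entropy(pmf):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    _check_pmf(pmf)
    h = -sum(p * math.log(p) for p in pmf if p > 0)
    return h / math.log(len(pmf))


def iqv(pmf):
    """Index of Qualitative Variation (assumed form): uses every class probability."""
    _check_pmf(pmf)
    k = len(pmf)
    return k / (k - 1) * (1.0 - sum(p * p for p in pmf))


# Hypothetical classifier output over three classes (illustrative values only).
example_pmf = [0.7, 0.2, 0.1]
for name, stat in [("variation ratio", variation_ratio),
                   ("normalized entropy", normalized_entropy),
                   ("IQV", iqv)]:
    print(f"{name}: {stat(example_pmf):.3f}")
```

Under these assumed definitions, all three statistics are 0 for a PMF that puts all mass on one class and 1 for a uniform PMF. They differ in between because the variation ratio looks only at the modal probability, while entropy and the IQV use the whole PMF.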
The PMF Is a Complete Expression of Nominal Property Uncertainty: Analogous to the PDF for continuous variables, the PMF provides complete information about classification prediction uncertainty.
Multiple Statistics Have Distinct Advantages: Information entropy is the most sensitive of the statistics but potentially oversensitive; statistics based on the modal probability, such as WVR, are more intuitive; the choice should be driven by the requirements of the specific application.
Framework is Practically Applicable: Two case studies demonstrate the framework's applicability across different domains and model types.
Supports Uncertainty Propagation: The PMF enables the uncertainty of nominal properties to be propagated through multi-stage models.
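As a rough illustration of propagating a PMF through a two-stage model, the sketch below compares an analytical calculation with a Monte Carlo estimate. It is a generic application of the law of total probability, not the paper's measurement model: the first stage is assumed to output a PMF over classes, and the second stage is assumed to be described by a table of conditional probabilities P(Z = j | Y = i). All names and numbers are hypothetical.

```python
import random
from collections import Counter


def propagate_analytical(pmf_y, cond):
    """Output PMF via the law of total probability: p_Z[j] = sum_i p_Y[i] * cond[i][j]."""
    n_out = len(cond[0])
    return [sum(pmf_y[i] * cond[i][j] for i in range(len(pmf_y)))
            for j in range(n_out)]


def propagate_monte_carlo(pmf_y, cond, n_draws=50_000, seed=0):
    """Estimate the output PMF by sampling stage 1, then stage 2 given the stage-1 draw."""
    rng = random.Random(seed)
    n_out = len(cond[0])
    # Draw stage-1 classes according to the input PMF.
    ys = rng.choices(range(len(pmf_y)), weights=pmf_y, k=n_draws)
    # For each draw, sample the stage-2 class from the matching conditional row.
    counts = Counter(rng.choices(range(n_out), weights=cond[y])[0] for y in ys)
    return [counts[j] / n_draws for j in range(n_out)]


# Hypothetical stage-1 PMF over three classes (illustrative values only).
pmf_y = [0.7, 0.2, 0.1]
# Hypothetical conditional table: row i is P(Z | Y = i) over two output classes.
cond = [[0.9, 0.1],
        [0.3, 0.7],
        [0.5, 0.5]]

print("analytical :", propagate_analytical(pmf_y, cond))
print("monte carlo:", propagate_monte_carlo(pmf_y, cond))
```

For these illustrative inputs the analytical result is approximately [0.74, 0.26], and the Monte Carlo estimate approaches it as the number of draws grows. The sampling route becomes useful when a later stage is too complex to write down as a conditional table.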
Fills Important Gap: Provides the first systematic metrological uncertainty evaluation framework for ML classification models, addressing a significant gap in the GUM and VIM.
Theoretical Rigor: Grounded in probability theory, the framework runs from PMFs to their summary statistics and remains consistent with existing metrological standards.
Strong Practicality: Two case studies spanning different application domains, data types, and model architectures demonstrate broad framework applicability.
Systematic Comparison: Comprehensive comparison of seven uncertainty statistics provides guidance for practical application selection.
Forward-Looking: Provides important support for credible deployment of ML technology in high-risk applications.
Limited Uncertainty Sources: While five uncertainty sources are identified, not all are modeled in practical cases, particularly model selection uncertainty.
Restrictive Assumptions: The assumption that data are independent and identically distributed (i.i.d.) is frequently violated in practice, but the paper does not discuss this sufficiently.
Computational Efficiency: The computational cost of some methods (such as full Bayesian inference) limits their practical use.
Limited Validation: Only two case studies; framework effectiveness requires validation across more domains and scenarios.
The paper cites 86 references covering metrological standards, machine learning theory, uncertainty quantification methods, and specific application domains, providing a solid theoretical foundation and broad application context. Key references include GUM documentation, VIM vocabulary, Bayesian machine learning methods, and uncertainty quantification techniques.