Introduction: Accounting for missing data by imputing or weighting conditional on covariates relies on the variable with missingness being observed at least some of the time for all unique covariate values. This requirement is referred to as positivity and positivity violations can result in bias. Here, we review a novel approach to addressing positivity violations in the context of systolic blood pressure. Methods: To illustrate the proposed approach, we estimate the mean systolic blood pressure among children and adolescents aged 2-17 years old in the United States using data from the 2017-2018 National Health and Nutrition Examination Survey (NHANES). As blood pressure was not measured for those aged 2-7, there exists a positivity violation by design. Using a recently proposed synthesis of statistical and mathematical models, we integrate external information with NHANES to address our motivating question. Results: With the synthesis model, the estimated mean systolic blood pressure was 100.5 (95% confidence interval: 99.9, 101.0), which is notably lower than either a complete-case analysis or extrapolation from a statistical model. The synthesis results were supported by a diagnostic comparing the performance of the mathematical model in the positive region. Discussion: Positivity violations pose a threat to quantitative medical research, and standard approaches to addressing nonpositivity rely on restrictive untestable assumptions. Using a synthesis model, like the one detailed here, offers a viable alternative.
Accounting for Missing Data in Public Health Research Using a Synthesis of Statistical and Mathematical Models
- Paper ID: 2503.02789
- Title: Accounting for Missing Data in Public Health Research Using a Synthesis of Statistical and Mathematical Models
- Authors: Paul N Zivich, Bonnie E Shook-Sa, Stephen R Cole, Eric T Lofgren, Jessie K Edwards
- Classification: stat.AP (Applied Statistics), stat.ME (Statistical Methods)
- Publication Date: October 16, 2025
- Paper Link: https://arxiv.org/abs/2503.02789
This study proposes an approach that combines statistical and mathematical models to address violations of the positivity assumption when handling missing data in public health research. As a motivating example, it estimates mean systolic blood pressure in U.S. children and adolescents aged 2-17 using data from the 2017-2018 National Health and Nutrition Examination Survey (NHANES). By design, NHANES did not measure blood pressure in children aged 2-7, which produces a design-based positivity violation. By integrating external information with NHANES data, the synthesis model estimated mean systolic blood pressure at 100.5 mmHg (95% CI: 99.9, 101.0), notably lower than the results from either a complete-case analysis or extrapolation from a statistical model.
- Importance of the Positivity Assumption: Imputation or weighting conditional on covariates relies on the positivity assumption: for every unique covariate value, the variable with missingness must be observed at least some of the time.
- Prevalence of Positivity Violations: When certain covariate combinations completely lack observations of the target variable, positivity violations occur, leading to bias.
- Limitations of Existing Methods: Traditional approaches to handling non-positivity either modify the research question or rely on restrictive, untestable modeling assumptions.
- Theoretical Significance: Provides a new theoretical framework for handling positivity violations, avoiding restrictive assumptions of traditional methods.
- Practical Value: Offers feasible solutions for missing data problems in public health and clinical research.
- Methodological Innovation: First systematic integration of statistical and mathematical models to address non-positivity issues.
- Proposed a Synthesis Model Framework: Divides the data into a region that satisfies positivity and a region that violates it, applying a statistical model in the former and a mathematical model in the latter.
- Developed a Resampling Algorithm: Provides variance estimation methods considering uncertainty from both models.
- Constructed Model Diagnostic Procedures: Validates method effectiveness by comparing statistical and mathematical model performance within positivity regions.
- Provided Complete Implementation: Includes R and Python code, enhancing method reproducibility and practical utility.
Estimate the parameter $\mu = E[Y]$, where $Y$ is systolic blood pressure. $Y$ is completely missing for certain values of the covariates $X$, which violates the positivity assumption $\Pr(R=1 \mid X=x) > 0$, where $R=1$ indicates that $Y$ was observed.
Data is divided into two regions:
- Positivity Region ($X^*=1$): Ages 8-17, with systolic blood pressure observations present
- Non-Positivity Region ($X^*=0$): Ages 2-7, with systolic blood pressure completely missing
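The split into regions can be illustrated with a small empirical check. This is a minimal sketch on made-up toy records, not the paper's code: the function names and the `(age, sbp)` record layout are assumptions for illustration. In the NHANES example the non-positivity region is known from the survey design rather than discovered from data.

```python
from collections import defaultdict


def observed_fraction_by_stratum(records):
    """Return {x: share of records with an observed outcome} for each covariate value x."""
    total = defaultdict(int)
    observed = defaultdict(int)
    for x, y in records:
        total[x] += 1
        if y is not None:
            observed[x] += 1
    return {x: observed[x] / total[x] for x in total}


def split_regions(records):
    """Split covariate values into those with and without any observed outcome."""
    fractions = observed_fraction_by_stratum(records)
    positive = sorted(x for x, f in fractions.items() if f > 0)
    nonpositive = sorted(x for x, f in fractions.items() if f == 0)
    return positive, nonpositive


# Hypothetical toy records: (age in years, systolic blood pressure or None if missing)
toy = [(5, None), (5, None), (7, None), (8, 98.0), (8, None), (12, 105.0)]
positive_ages, nonpositive_ages = split_regions(toy)
print(positive_ages, nonpositive_ages)  # [8, 12] [5, 7]
```

Age 8 stays in the positivity region even though one of its records is missing, because the outcome is observed at least some of the time for that age.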
The parameter can be rewritten as:
$$E[Y] = E[Y \mid X^*=1]\Pr(X^*=1) + E[Y \mid X^*=0]\Pr(X^*=0)$$
In the positivity region, a saturated model is used:
$$E[Y \mid X, R=1, X^*=1; \beta] = \beta_8 I(X=8) + \beta_9 I(X=9) + \cdots + \beta_{17} I(X=17)$$
Using g-computation method:
- Fit regression model based on complete data
- Predict systolic blood pressure for all observations
- Calculate sample-weighted average
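The three g-computation steps above can be sketched as follows. This is an illustrative reading, not the authors' implementation: with a saturated model in age, fitting reduces to a per-age mean among observed records. The row layout, the function names, and the choice to apply sampling weights during fitting are assumptions.

```python
from collections import defaultdict


def fit_saturated(rows):
    """Per-age weighted mean of sbp among records where sbp is observed."""
    num = defaultdict(float)
    den = defaultdict(float)
    for r in rows:
        if r["sbp"] is not None:
            num[r["age"]] += r["weight"] * r["sbp"]
            den[r["age"]] += r["weight"]
    return {age: num[age] / den[age] for age in den}


def g_computation_mean(rows):
    """Predict sbp for every record (observed or not), then take the weighted mean."""
    beta = fit_saturated(rows)
    total_weight = sum(r["weight"] for r in rows)
    return sum(r["weight"] * beta[r["age"]] for r in rows) / total_weight


# Hypothetical toy rows from the positivity region only
rows = [
    {"age": 8, "sbp": 100.0, "weight": 1.0},
    {"age": 8, "sbp": None, "weight": 1.0},   # missing outcome, still receives a prediction
    {"age": 9, "sbp": 104.0, "weight": 2.0},
    {"age": 9, "sbp": 108.0, "weight": 2.0},
]
mean_sbp = g_computation_mean(rows)
print(mean_sbp)  # 104.0
```

In this toy data a weighted complete-case mean would be 104.8, so the example also shows how predicting for records with missing outcomes shifts the estimate.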
Based on external published information on U.S. children and adolescent systolic blood pressure distribution:
- Uses age, sex, and height percentile-specific distributions
- Assumes normal distribution with mean equal to median
- Standard deviation approximated from 90th percentile
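One natural way to turn a published median and 90th percentile into a normal distribution is sketched below. The summary does not give the exact formula used in the paper, so the relation `sd = (p90 - median) / z_0.90` is an assumption, and the numeric inputs are hypothetical rather than values from Flynn et al.

```python
from statistics import NormalDist

# Standard normal quantile at 0.90 (about 1.2816)
Z90 = NormalDist().inv_cdf(0.90)


def normal_from_percentiles(median, p90):
    """Normal distribution with mean = median and sd implied by the 90th percentile."""
    sd = (p90 - median) / Z90
    return NormalDist(mu=median, sigma=sd)


# Hypothetical reference values for one age, sex, and height-percentile stratum (mmHg)
dist = normal_from_percentiles(median=90.0, p90=103.0)

# Simulated outcomes for individuals in the non-positivity region
simulated = dist.samples(5, seed=1)
```

By construction, the 90th percentile of `dist` reproduces the input value, which is a quick sanity check on the approximation.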
- Avoids Extrapolation Assumptions: Unlike traditional linear extrapolation, does not assume the 8-17 age relationship extends to ages 2-7.
- Flexible Model Selection: Positivity region can use non-parametric methods; non-positivity region integrates external information.
- Uncertainty Quantification: Resampling algorithm simultaneously considers uncertainty from statistical model parameter estimation and mathematical model distribution.
- Primary Data: 2017-2018 NHANES, n=2,572 children and adolescents aged 2-17
- External Information: Published U.S. children and adolescent systolic blood pressure distribution data by Flynn et al.
- Missing Pattern: Systolic blood pressure completely missing in ages 2-7 (design-based missing), 8% missing in ages 8-17
- Outcome Variable: Systolic blood pressure (mmHg), average of up to 3 measurements
- Covariates: Age (years), height (cm), weight (kg), sex
- Sampling Weights: NHANES sampling weights applied for U.S. population inference
- Complete Case Analysis: Uses only observations with systolic blood pressure measurements
- Linear Extrapolation: Fits linear model based on ages 8-17, extrapolates to ages 2-7
- Sensitivity Analysis: Boundary analysis with mean systolic blood pressure for ages 2-7 ranging from 70-120 mmHg
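The boundary analysis can be read as plugging extreme values for the non-positivity region into the law of total expectation. The sketch below assumes that reading; the function names and the input values (`mu_pos`, `p_nonpos`) are hypothetical and are not the paper's estimates.

```python
def synthesis_mean(mu_pos, mu_nonpos, p_nonpos):
    """E[Y] as a mixture of the region-specific means, weighted by region proportion."""
    return mu_pos * (1 - p_nonpos) + mu_nonpos * p_nonpos


def boundary_analysis(mu_pos, p_nonpos, low=70.0, high=120.0):
    """Bounds on E[Y] when the non-positivity mean is only known to lie in [low, high]."""
    return (
        synthesis_mean(mu_pos, low, p_nonpos),
        synthesis_mean(mu_pos, high, p_nonpos),
    )


# Hypothetical inputs: positivity-region mean of 105 mmHg, 40% of the population aged 2-7
lower, upper = boundary_analysis(mu_pos=105.0, p_nonpos=0.4)
print(round(lower, 1), round(upper, 1))
```

The width of the bounds equals `(high - low) * p_nonpos`, so the bounds are wide whenever the non-positivity region makes up a large share of the target population.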
- Resampling Iterations: 10,000
- Confidence Intervals: 95% CI constructed using 2.5% and 97.5% quantiles
- Point Estimate: Median used as point estimate
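A simplified sketch of how these settings fit together is given below. It is not the paper's algorithm: it ignores sampling weights and missingness within the positivity region, holds the region proportions fixed, and all names are assumptions. Each iteration bootstraps the positivity-region data (statistical-model uncertainty) and simulates outcomes from the mathematical model (mathematical-model uncertainty).

```python
import random
from statistics import NormalDist, median, quantiles


def resample_synthesis(pos_sbp, nonpos_dists, n_iter=10_000, seed=0):
    """Return (median estimate, (2.5% quantile, 97.5% quantile)) over resampled means."""
    rng = random.Random(seed)
    n_pos, n_non = len(pos_sbp), len(nonpos_dists)
    n = n_pos + n_non
    draws = []
    for _ in range(n_iter):
        # Bootstrap the observed outcomes in the positivity region
        boot = rng.choices(pos_sbp, k=n_pos)
        mu_pos = sum(boot) / n_pos
        # Simulate one outcome per person in the non-positivity region
        sim = [rng.gauss(d.mean, d.stdev) for d in nonpos_dists]
        mu_non = sum(sim) / n_non
        # Combine the two regions by their sample proportions
        draws.append(mu_pos * n_pos / n + mu_non * n_non / n)
    cuts = quantiles(draws, n=40, method="inclusive")  # cut points every 2.5%
    return median(draws), (cuts[0], cuts[-1])


# Hypothetical toy inputs; fewer iterations than the 10,000 used in the paper, for speed
data_rng = random.Random(42)
pos_sbp = [data_rng.gauss(105.0, 10.0) for _ in range(200)]
nonpos_dists = [NormalDist(92.0, 9.0)] * 100
estimate, (ci_low, ci_high) = resample_synthesis(pos_sbp, nonpos_dists, n_iter=500)
```

Using the median of the resampled values as the point estimate and the 2.5% and 97.5% quantiles as the interval mirrors the settings listed above.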
| Method | Mean Systolic BP (mmHg) | 95% CI |
|---|---|---|
| Complete Case Analysis | 104.7 | (104.1, 105.3) |
| Linear Extrapolation | 101.6 | (100.8, 102.4) |
| Synthesis Model | 100.5 | (99.9, 101.0) |
| Boundary Analysis | 92.7-109.9 | (91.9, 110.5) |
- Synthesis Model Yields the Lowest Point Estimate: 1.1 mmHg lower than linear extrapolation, a difference of 2.9 times the standard error of the extrapolation estimate.
- Differences Between Methods Are Large Relative to Uncertainty: The synthesis model interval (99.9, 101.0) does not overlap the complete-case interval (104.1, 105.3) and overlaps the linear extrapolation interval (100.8, 102.4) only slightly.
- Boundary Analysis Supports Results: The synthesis model estimate falls within the bounds from the boundary analysis (92.7 to 109.9 mmHg).
Comparing statistical and mathematical model performance within the positivity region:
- Reasonable overlap in systolic blood pressure distributions predicted by both models
- Age-specific mean differences near zero, though statistical model results slightly lower than mathematical model for ages 15-17
- Overall supports mathematical model validity in positivity region
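The age-specific comparison can be sketched as a difference of means within the positivity region, where both models can be evaluated. The function name and the numeric values below are hypothetical, chosen only to show the shape of the diagnostic.

```python
def age_specific_differences(stat_means, math_means):
    """Statistical-model mean minus mathematical-model mean, for ages present in both."""
    return {
        age: stat_means[age] - math_means[age]
        for age in sorted(stat_means)
        if age in math_means
    }


# Hypothetical age-specific mean systolic blood pressure (mmHg) in the positivity region
stat_means = {8: 99.0, 12: 104.0, 15: 108.0}
math_means = {8: 99.5, 12: 104.0, 15: 110.0}

diffs = age_specific_differences(stat_means, math_means)
largest_gap = max(abs(d) for d in diffs.values())
print(diffs, largest_gap)  # {8: -0.5, 12: 0.0, 15: -2.0} 2.0
```

Differences near zero suggest the mathematical model tracks the data where it can be checked; this does not by itself verify its behavior in the non-positivity region.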
Results considering additional covariates (sex, height, weight) in appendices:
- Synthesis model results remain stable: 100.5 (99.9, 101.0)
- Extrapolation results move closer to the synthesis model, though with a wider interval: 100.8 (97.7, 103.8)
- Augmented inverse probability weighted estimator results similar
- Imputation Methods: Multiple imputation, maximum likelihood estimation
- Weighting Methods: Inverse probability weighting
- Doubly Robust Methods: Augmented inverse probability weighted estimators
- Problem Modification: Restricting study population to positivity-satisfying regions
- Parametric Extrapolation: Using restrictive modeling assumptions for extrapolation
- Boundary Analysis: Providing sensitivity analysis ranges
- First systematic integration of statistical and mathematical models
- Avoids problem modification or strong parametric assumptions
- Provides practical uncertainty quantification methods
- Synthesis Model Effectiveness: Estimates population parameters for a target population that includes the non-positivity region.
- Method Advantages: Avoids restrictive assumptions of traditional methods, providing more reasonable estimates.
- Practical Value: Offers feasible solutions for design-based or systematic missing data.
- Variance Estimation: Does not account for NHANES cluster sampling design, potentially underestimating uncertainty.
- Mathematical Model Complexity: Current use of relatively simple models; complex cases may require intermediate process modeling.
- External Information Dependence: Method effectiveness depends on accuracy and applicability of external information.
- Multivariate Non-Positivity: Application when multiple variables simultaneously exhibit non-positivity requires further research.
- Complex Mathematical Models: Develop models for complex processes like drug concentrations and physiological responses.
- Improved Variance Estimation: Extend resampling algorithms to accommodate complex sampling designs like clustering.
- Multidimensional Non-Positivity: Investigate cases where multiple variables simultaneously exhibit non-positivity.
- Enhanced Diagnostic Methods: Develop more comprehensive model validity diagnostic procedures.
- Strong Methodological Innovation: First systematic integration of statistical and mathematical models for non-positivity.
- Solid Theoretical Foundation: Based on causal inference and missing data theory fundamentals.
- Outstanding Practical Utility: Provides complete implementation code and detailed algorithm descriptions.
- Sufficient Validation: Verified through multiple comparison methods and diagnostic procedures.
- External Information Requirements: Method success depends on availability of high-quality external information.
- Computational Complexity: Resampling procedures increase computational burden.
- Limited Applicability: Primarily applicable when reliable external information is available.
- Theoretical Guarantees: Lacks theoretical analysis of method asymptotic properties.
- Academic Contribution: Provides important methodological contributions to statistics and epidemiology.
- Practical Value: Directly applicable to common design-based missing data problems in public health research.
- Reproducibility: Provided code and detailed descriptions ensure method reproducibility.
- Generalization Potential: Framework generalizable to other research areas with non-positivity.
- Design-Based Missing Data: Systematic missingness that arises by design, for example from age restrictions or ethical considerations.
- Rich External Information: Availability of reliable external studies or prior knowledge.
- Parameter Estimation: Primarily applicable to population parameter estimation rather than individual prediction.
- Public Health Research: Particularly suitable for missing data in large-scale epidemiological surveys.
The paper cites important literature in related fields, including:
- Cole et al. on missing outcome data in epidemiological research
- Westreich and Cole on positivity in practice
- Petersen et al. on diagnosing and addressing positivity assumption violations
- Flynn et al. on blood pressure screening and management clinical practice guidelines for children and adolescents