Every year in the United States, 800,000 individuals suffer a stroke - one person every 40 seconds, with a death occurring every four minutes. While individual factors vary, certain predictors are more prevalent in determining stroke risk. As strokes are the second leading cause of death and disability worldwide, predicting stroke likelihood based on lifestyle factors is crucial. Showing individuals their stroke risk could motivate lifestyle changes, and machine learning offers solutions to this prediction challenge. Neural networks excel at predicting outcomes based on training features like lifestyle factors; however, they are not the only option. Logistic regression models can also effectively compute the likelihood of binary outcomes based on independent variables, making them well-suited for stroke prediction. This analysis compares neural networks (dense and convolutional) and logistic regression models for stroke prediction, examining their pros, cons, and differences to develop the most effective predictor that minimizes false negatives.
Stroke Prediction using Clinical and Social Features in Machine Learning
- Paper ID: 2501.00048
- Title: Stroke Prediction using Clinical and Social Features in Machine Learning
- Author: Aidan Chadha (Virginia Tech)
- Classification: cs.LG cs.AI
- Publication Date/Venue: 2025 Preprint
- Paper Link: https://arxiv.org/abs/2501.00048
- Code Link: https://github.com/Aidan7757/stroke_prediction_using_clinical_social_features
Approximately 800,000 people in the United States suffer from stroke annually, with one person experiencing a stroke every 40 seconds and one person dying from stroke every 4 minutes. As the second leading cause of death and disability globally, predicting stroke likelihood based on lifestyle factors is critically important. This study compares the performance of neural networks (dense and convolutional) with logistic regression models in stroke prediction, aiming to develop the most effective predictor while minimizing false negatives.
Stroke prediction is a critical healthcare problem involving multiple internal and external factors:
- External Factors: Marital status, occupation type, residential environment, etc.
- Internal Factors: History of heart disease, BMI, age, blood glucose levels, etc.
- Public Health Impact: Stroke is the second leading cause of death and disability globally
- Prevention Value: Early risk assessment can motivate lifestyle changes
- Clinical Application: Real-time risk assessment can be integrated into routine physical examinations
- Lack of comprehensive predictive models that effectively combine clinical and social features
- The harm of false negatives in medical settings has not received sufficient attention
- Limited comparative studies of different machine learning methods in stroke prediction
- Multi-Model Comparison Framework: Systematically compares the performance of logistic regression, dense neural networks, and convolutional neural networks in stroke prediction
- Medical-Oriented Evaluation Strategy: Focuses on minimizing false negatives, aligning with practical requirements in medical settings
- Comprehensive Feature Analysis: Integrates clinical indicators and social factors for comprehensive risk assessment
- Practical Multi-Model System Recommendations: Proposes a hierarchical prediction pipeline combining advantages of multiple models
- Input: Patient data containing 10 features (age, gender, hypertension, heart disease, marital status, occupation type, residence type, average blood glucose level, BMI, smoking status)
- Output: Binary classification result (0: no stroke, 1: stroke)
- Constraint: Minimize false negatives while balancing precision and recall
- Preprocessing: Feature standardization using StandardScaler, categorical variable encoding using Label Encoder
- Regularization: L2 regularization to prevent overfitting
- Optimization: Maximum iterations set to 10,000 to ensure convergence
- Decision Boundary: 0.5 probability threshold (adjustable)
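The logistic regression setup listed above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the paper's code: the data is synthetic, the three columns and the `predict_with_threshold` helper are hypothetical, and the 0.3 screening threshold is only an example of the "adjustable" decision boundary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the Kaggle dataset): two numeric columns, one categorical.
n = 400
age = rng.uniform(20, 90, size=n)
glucose = rng.normal(105, 30, size=n)
smoking = rng.choice(["never", "formerly", "smokes"], size=n)
# Hypothetical label rule: risk rises with age so the toy problem is learnable.
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(age - 70) / 8))).astype(int)

# Encode the categorical column as integers, then standardize every column.
# In a real workflow the scaler would be fit on the training split only.
smoking_codes = LabelEncoder().fit_transform(smoking)
X = np.column_stack([age, glucose, smoking_codes]).astype(float)
X_scaled = StandardScaler().fit_transform(X)

# L2 regularization is scikit-learn's default penalty; max_iter follows the summary.
clf = LogisticRegression(max_iter=10_000)
clf.fit(X_scaled, y)


def predict_with_threshold(model, features, threshold=0.5):
    """Label a sample positive when its predicted probability meets the threshold."""
    return (model.predict_proba(features)[:, 1] >= threshold).astype(int)


default_preds = predict_with_threshold(clf, X_scaled, threshold=0.5)
screening_preds = predict_with_threshold(clf, X_scaled, threshold=0.3)
```

Lowering the threshold can only keep or increase the number of positive predictions, which is the lever for trading precision for recall in a screening setting.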
Dense Neural Network (DNN):
- Input layer: 10 features
- Hidden layers: Including Batch Normalization and Dropout
- Activation function: ReLU
- Output layer: Single neuron with Sigmoid activation
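A dense network matching this description might look like the PyTorch sketch below. The summary does not give layer counts, widths, or dropout rates, so `hidden_sizes=(64, 32)` and `dropout=0.3` are assumptions chosen only for illustration, as is the class name `DenseStrokeNet`.

```python
import torch
from torch import nn


class DenseStrokeNet(nn.Module):
    """MLP: 10 inputs -> hidden blocks (Linear, BatchNorm, ReLU, Dropout) -> 1 sigmoid output."""

    def __init__(self, n_features=10, hidden_sizes=(64, 32), dropout=0.3):
        super().__init__()
        layers = []
        in_size = n_features
        for out_size in hidden_sizes:  # hidden sizes are assumed, not from the paper
            layers += [
                nn.Linear(in_size, out_size),
                nn.BatchNorm1d(out_size),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            in_size = out_size
        layers += [nn.Linear(in_size, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, n_features) -> stroke probabilities of shape (batch,)
        return self.net(x).squeeze(-1)


dnn = DenseStrokeNet()
dnn.eval()  # use running BatchNorm statistics and disable Dropout
with torch.no_grad():
    dnn_probs = dnn(torch.randn(8, 10))
```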
Convolutional Neural Network (CNN):
- Similar architecture but using convolutional layers for feature processing
- Includes pooling layers and fully connected layers
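One plausible way to apply convolutions to tabular input is to treat the 10 features as a length-10 sequence with a single channel. The sketch below takes that approach; the kernel size, channel count, and hidden width are assumptions, and the summary does not state how the paper actually shapes its CNN input.

```python
import torch
from torch import nn


class ConvStrokeNet(nn.Module):
    """1D CNN over the feature vector: conv -> pool -> fully connected -> sigmoid."""

    def __init__(self, n_features=10, channels=16, dropout=0.3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),  # keeps length n_features
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),  # halves the length
        )
        pooled_len = n_features // 2
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * pooled_len, 32),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(32, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x.unsqueeze(1)  # (batch, n_features) -> (batch, 1, n_features)
        return self.classifier(self.features(x)).squeeze(-1)


cnn = ConvStrokeNet()
cnn.eval()
with torch.no_grad():
    cnn_probs = cnn(torch.randn(8, 10))
```

Note that a convolution assumes neighbouring features are related, which depends on an arbitrary column order in tabular data; this is one reason the comparison with a dense network is informative.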
Training Parameters:
- Loss function: Cross-entropy loss (class-weighted to address class imbalance)
- Optimizer: Adam (adaptive learning rate)
- Training epochs: 400
- Regularization: Dropout + Batch Normalization
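The training parameters above can be combined into a short loop like the one below. It is a sketch under assumptions: the data is random, the model is a deliberately tiny stand-in, the learning rate is a guess, and weighting positives by the negative-to-positive ratio is one common choice rather than the paper's confirmed scheme. The loop runs 20 epochs instead of the reported 400 to stay quick.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic imbalanced stand-in data: roughly 5% positives.
X = torch.randn(512, 10)
y = (torch.rand(512) < 0.05).float()
y[0] = 1.0  # guarantee at least one positive so the class weight is finite

model = nn.Sequential(
    nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(32, 1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption

# Assumed weighting: each positive counts as much as (negatives / positives) negatives.
pos_weight = (y == 0).sum() / (y == 1).sum()
sample_weights = torch.where(y == 1, pos_weight, torch.tensor(1.0))

losses = []
model.train()
for epoch in range(20):  # the summary reports 400 epochs
    optimizer.zero_grad()
    probs = model(X).squeeze(-1)
    loss = F.binary_cross_entropy(probs, y, weight=sample_weights)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```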
- Multi-Architecture Comparison: First systematic comparison of CNN and DNN performance on tabular stroke prediction data
- Medical-Oriented Design: Uses weighted loss functions to address class imbalance
- Feature Importance Analysis: Analyzes biological factors' predictive contribution through logistic regression coefficients
- Statistical Robustness Validation: Computes 95% confidence intervals using Bootstrap resampling
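Reading feature importance from logistic regression coefficients, as mentioned above, can be illustrated as follows. The data is synthetic and constructed so that age dominates; the feature names and the generating rule are assumptions for demonstration, not results from the paper. Coefficients are comparable only because the features are standardized first.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
feature_names = ["age", "bmi", "avg_glucose_level"]

# Synthetic data: the outcome depends strongly on age, weakly on glucose, not on BMI.
n = 2000
X = np.column_stack([
    rng.uniform(20, 90, size=n),
    rng.normal(28, 6, size=n),
    rng.normal(105, 30, size=n),
])
logit = 0.08 * (X[:, 0] - 60) + 0.01 * (X[:, 2] - 105) - 1.0
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize so each coefficient is the change in log-odds per standard deviation.
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=10_000).fit(X_scaled, y)

# Rank features by absolute coefficient size.
ranking = sorted(
    zip(feature_names, clf.coef_[0]),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, coef in ranking:
    print(f"{name}: {coef:+.3f}")
```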
- Source: Kaggle Stroke Prediction Dataset
- Scale: Approximately 5,000 samples
- Class Distribution: Highly imbalanced (only 5-6% stroke cases)
- Split: 80% training set, 20% test set
- Features: 10 clinical and social features
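An 80/20 split on data this imbalanced is often stratified so that both partitions keep a similar share of stroke cases. The summary does not say whether the paper stratifies, so the `stratify=y` argument below is an assumption, and the arrays are synthetic placeholders with about 5% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Placeholder data shaped like the described dataset: ~5,000 rows, 10 features.
n_samples = 5000
X = rng.normal(size=(n_samples, 10))
y = (rng.uniform(size=n_samples) < 0.05).astype(int)

# 80% training, 20% test; stratification preserves the positive rate in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"train positive rate: {y_train.mean():.4f}")
print(f"test positive rate:  {y_test.mean():.4f}")
```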
- Accuracy: Overall correctness rate
- Recall: Ability to identify actual stroke cases (primary focus)
- Precision: Accuracy of predicted stroke cases
- F1-Score: Harmonic mean of precision and recall
- AUC-ROC: Discriminative ability across different thresholds
- Confusion Matrix: Detailed classification error analysis
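The threshold-based metrics in this list all follow from the four confusion matrix counts. The sketch below shows the arithmetic; the counts are hypothetical (a 1,000-patient test set with 50 strokes) and are not taken from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision, and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # share of real strokes caught
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # share of alarms that are real
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts: 40 strokes caught, 10 missed, 200 false alarms, 750 correct negatives.
metrics = classification_metrics(tp=40, fp=200, fn=10, tn=750)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 0.79 while precision is only about 0.17, which shows how a high-recall screener on imbalanced data can look weak on precision even when it misses few true cases.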
- Logistic Regression (Sklearn implementation)
- Dense Neural Network (PyTorch implementation)
- Convolutional Neural Network (PyTorch implementation)
- Framework: PyTorch (neural networks), Sklearn (logistic regression)
- Hardware: Standard computing environment
- Reproducibility: Fixed random seeds, open-source code
| Model | Accuracy | Recall | Precision | F1-Score |
|---|---|---|---|---|
| Logistic Regression | 74.95% | 75.81% | 16.31% | - |
| Dense Neural Network | 86.50% | 43.55% | 20.77% | - |
| Convolutional Neural Network | 78.67% | 53.23% | - | - |
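The F1-Score column is blank in the table, but for the two rows that report both precision and recall it can be derived as their harmonic mean. The values computed below are derived from the table, not reported by the paper, and the CNN row is skipped because its precision is missing.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above, as fractions.
reported = {
    "Logistic Regression": (0.1631, 0.7581),
    "Dense Neural Network": (0.2077, 0.4355),
}

derived_f1 = {
    name: f1_from_precision_recall(p, r) for name, (p, r) in reported.items()
}
for name, value in derived_f1.items():
    print(f"{name}: derived F1 = {value:.4f}")
```

Both derived values come out near 0.27 to 0.28, so by F1 the two models are close despite their very different accuracy and recall.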
- Accuracy vs. Recall Trade-off:
  - Dense neural network achieves the highest accuracy (86.50%) but lower recall (43.55%)
  - Logistic regression achieves the highest recall (75.81%) but lower precision (16.31%)
  - CNN falls between the two (78.67% accuracy, 53.23% recall)
- Feature Importance Analysis:
  - Age is the most important predictor (consistent with medical knowledge)
  - BMI importance is lower than expected (inconsistent with existing research)
- Training Dynamics:
  - CNN converges slowly after 50 epochs
  - DNN continues to improve throughout 400 training epochs
  - No obvious overfitting observed
Bootstrap resampling (1,000 iterations) computes 95% confidence intervals:
- DNN Accuracy: 86.50% [84.32%, 88.68%]
- DNN Recall: 43.55% [39.87%, 47.23%]
- Logistic Regression Accuracy: 74.95% [72.63%, 77.27%]
- Logistic Regression Recall: 75.81% [72.14%, 79.48%]
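A percentile bootstrap of this kind can be sketched in plain Python as below. The summary does not say which bootstrap variant the paper uses, so the percentile method is an assumption, and the labels and predictions are hypothetical toy values rather than the paper's test set.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_for_positives) / len(preds_for_positives) if preds_for_positives else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample test cases with replacement and re-score each time."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[round((alpha / 2) * (n_iterations - 1))]
    upper = scores[round((1 - alpha / 2) * (n_iterations - 1))]
    return lower, upper


# Toy test set: 50 positives (30 caught) and 950 negatives (100 false alarms).
y_true = [1] * 50 + [0] * 950
y_pred = [1] * 30 + [0] * 20 + [1] * 100 + [0] * 850

acc_ci = bootstrap_ci(y_true, y_pred, accuracy)
rec_ci = bootstrap_ci(y_true, y_pred, recall)
print("accuracy", accuracy(y_true, y_pred), acc_ci)
print("recall", recall(y_true, y_pred), rec_ci)
```

Because recall depends only on the small number of positive cases, its interval is typically much wider than the accuracy interval on the same data.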
The paper cites multiple relevant studies:
- Shao et al. (2024): Emphasizes importance of BMI and age as biological predictors
- Gupta et al. (2025): Neural network-based stroke prediction models
- Zhang et al. (2022): Application of multilayer perceptrons in stroke prediction
Advantages compared to existing work:
- Systematic comparison of multiple machine learning methods
- Focus on minimizing false negatives
- Comprehensive analysis combining clinical and social features
- Model Selection Depends on Application Scenario:
  - Logistic Regression: Suitable for initial screening (high recall, strong interpretability)
  - Dense Neural Network: Suitable for precise assessment (high accuracy, low false positives)
  - CNN: Balanced performance, suitable for validation tool
- Multi-Model System Recommendations:
  - Stage 1: Logistic regression for initial screening
  - Stage 2: DNN for precise assessment of high-risk patients
  - Stage 3: CNN for verification and balance
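The three-stage recommendation could be wired together roughly as follows. The summary names the stages but not how their outputs are combined, so the control flow here is an assumption: the screener clears low-risk patients, the dense network confirms high risk, and the CNN acts as a tie-breaker when the first two disagree. The function name, thresholds, and constant scorers are all hypothetical.

```python
from typing import Callable, Dict, Sequence

# A scorer maps one patient's feature vector to a stroke probability in [0, 1].
Scorer = Callable[[Sequence[float]], float]


def staged_risk_assessment(
    features: Sequence[float],
    screen: Scorer,   # stage 1, e.g. logistic regression
    assess: Scorer,   # stage 2, e.g. dense neural network
    verify: Scorer,   # stage 3, e.g. convolutional neural network
    screen_threshold: float = 0.5,
    decision_threshold: float = 0.5,
) -> Dict[str, object]:
    # Stage 1: a high-recall screener clears patients it scores as low risk.
    if screen(features) < screen_threshold:
        return {"high_risk": False, "decided_at_stage": 1}
    # Stage 2: the more precise model confirms flagged patients.
    if assess(features) >= decision_threshold:
        return {"high_risk": True, "decided_at_stage": 2}
    # Stage 3: stages 1 and 2 disagree, so a third model breaks the tie (assumed rule).
    return {"high_risk": verify(features) >= decision_threshold, "decided_at_stage": 3}


# Usage with stand-in scorers that return fixed probabilities.
patient = [0.0] * 10
result = staged_risk_assessment(
    patient,
    screen=lambda f: 0.8,
    assess=lambda f: 0.3,
    verify=lambda f: 0.6,
)
print(result)
```

Under this rule a false negative can only arise if the screener misses the patient or if both later models reject a flagged one, which keeps the emphasis on stage 1 recall.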
- Data Imbalance: Only 5-6% positive cases limit model learning capacity
- Anomalous Feature Importance: Lower-than-expected BMI importance may affect prediction accuracy
- Generalization Ability: Single dataset may limit model universality
- Sample Size: 5,000 samples relatively small, particularly for positive cases
- Data Augmentation: Collect more real stroke patient data to alleviate class imbalance
- Feature Engineering: Re-evaluate and optimize feature selection strategy
- Model Ensemble: Develop more sophisticated multi-model fusion methods
- Clinical Validation: Validate model effectiveness in actual medical environments
- Practical Orientation: Clearly addresses practical need to minimize false negatives in medical settings
- Comprehensive Methodology: Systematically compares traditional machine learning and deep learning methods
- Statistical Rigor: Uses Bootstrap method to verify result robustness
- Reproducibility: Provides complete code and data with MIT open-source license
- Clinical Relevance: Integrates risk factors recognized by the medical field
- Data Quality: Severe class imbalance not adequately addressed
- Model Depth: Relatively simple neural network architecture, insufficient exploration of deep learning potential
- Insufficient Feature Engineering: Anomalous BMI importance suggests potential feature processing issues
- Evaluation Limitations: Lacks comparison with existing clinical risk assessment tools
- Experimental Scale: Single dataset, lacks cross-dataset validation
- Academic Contribution: Provides practical multi-model comparison framework for medical AI
- Clinical Value: Proposed hierarchical prediction system has practical application potential
- Methodological Significance: Emphasizes importance of false negative control in medical AI
- Extensibility: Methods generalizable to other medical prediction tasks
- Primary Healthcare: Logistic regression model suitable for community health screening
- Specialty Hospitals: Dense neural network suitable for precise risk assessment
- Health Management: Can be integrated into personal health monitoring applications
- Clinical Research: Provides tools for stroke risk factor research
- CDC. Preventing stroke deaths. https://www.cdc.gov/vitalsigns/pdf/2017-09-vitalsigns.pdf
- Shao, Y., et al. (2024). Link between triglyceride-glucose-body mass index and future stroke risk in middle-aged and elderly Chinese. Cardiovascular Diabetology.
- Gupta, A., et al. (2025). Predicting stroke risk: An effective stroke prediction model based on neural networks. Journal of Neurorestoratology.
Overall Assessment: This study provides valuable multi-model comparative analysis on the important medical problem of stroke prediction, with particular emphasis on false negative control reflecting practical requirements of medical AI. Despite limitations such as data imbalance, the proposed multi-model system architecture has practical application value and provides a good reference framework for similar research in the medical AI field.