Stroke Prediction using Clinical and Social Features in Machine Learning

Chadha

Every year in the United States, 800,000 individuals suffer a stroke: one person every 40 seconds, with a death occurring every four minutes. While individual factors vary, certain predictors are more prevalent in determining stroke risk. As strokes are the second leading cause of death and disability worldwide, predicting stroke likelihood from lifestyle factors is crucial. Showing individuals their stroke risk could motivate lifestyle changes, and machine learning offers solutions to this prediction challenge. Neural networks excel at predicting outcomes from training features such as lifestyle factors, but they are not the only option. Logistic regression models can also compute the likelihood of a binary outcome from independent variables, which makes them well suited to stroke prediction. This analysis compares neural networks (dense and convolutional) with logistic regression models for stroke prediction, examining their pros, cons, and differences to develop the most effective predictor that minimizes false negatives.

Basic Information

Paper ID: 2501.00048
Title: Stroke Prediction using Clinical and Social Features in Machine Learning
Author: Aidan Chadha (Virginia Tech)
Classification: cs.LG cs.AI
Publication Date/Venue: 2025 Preprint
Paper Link: https://arxiv.org/abs/2501.00048
Code Link: https://github.com/Aidan7757/stroke_prediction_using_clinical_social_features

Abstract

Approximately 800,000 people in the United States suffer from stroke annually, with one person experiencing a stroke every 40 seconds and one person dying from stroke every 4 minutes. As the second leading cause of death and disability globally, predicting stroke likelihood based on lifestyle factors is critically important. This study compares the performance of neural networks (dense and convolutional) with logistic regression models in stroke prediction, aiming to develop the most effective predictor while minimizing false negatives.

Research Background and Motivation

Problem Definition

Stroke prediction is a critical healthcare problem involving multiple internal and external factors:

External Factors: Marital status, occupation type, residential environment, etc.
Internal Factors: History of heart disease, BMI, age, blood glucose levels, etc.

Significance

Public Health Impact: Stroke is the second leading cause of death and disability globally
Prevention Value: Early risk assessment can motivate lifestyle changes
Clinical Application: Real-time risk assessment can be integrated into routine physical examinations

Existing Limitations

Lack of comprehensive predictive models that effectively combine clinical and social features
The harm of false negatives in medical settings has not received sufficient attention
Limited comparative studies of different machine learning methods in stroke prediction

Core Contributions

Multi-Model Comparison Framework: Systematically compares the performance of logistic regression, dense neural networks, and convolutional neural networks in stroke prediction
Medical-Oriented Evaluation Strategy: Focuses on minimizing false negatives, aligning with practical requirements in medical settings
Comprehensive Feature Analysis: Integrates clinical indicators and social factors for comprehensive risk assessment
Practical Multi-Model System Recommendations: Proposes a hierarchical prediction pipeline combining advantages of multiple models

Methodology

Task Definition

Input: Patient data containing 10 features (age, gender, hypertension, heart disease, marital status, occupation type, residence type, average blood glucose level, BMI, smoking status)
Output: Binary classification result (0: no stroke, 1: stroke)
Constraint: Minimize false negatives while balancing precision and recall

Model Architecture

1. Logistic Regression Model

Preprocessing: Feature standardization with StandardScaler; categorical variables encoded with LabelEncoder
Regularization: L2 regularization to prevent overfitting
Optimization: Maximum iterations set to 10,000 to ensure convergence
Decision Boundary: 0.5 probability threshold (adjustable)
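The preprocessing and the adjustable decision boundary listed above can be illustrated with a small standard-library sketch. It is not the paper's code: the single age feature, the weight, the bias, and the 0.2 screening threshold are made-up values chosen only to show how lowering the threshold flags more patients and so trades false negatives for false positives.

```python
import math

def standardize(column):
    """Z-score a list of numbers (the per-feature operation StandardScaler applies)."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # guard against a constant column
    return [(x - mean) / std for x in column]

def predict_proba(features, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(prob, threshold=0.5):
    """Label 1 (stroke) when the probability reaches the threshold."""
    return 1 if prob >= threshold else 0

# Hypothetical inputs and coefficients, for illustration only.
ages = [34.0, 51.0, 67.0, 80.0]
scaled_ages = standardize(ages)
weights, bias = [1.2], -1.0

probs = [predict_proba([a], weights, bias) for a in scaled_ages]
default_labels = [classify(p) for p in probs]                   # threshold 0.5
screening_labels = [classify(p, threshold=0.2) for p in probs]  # lower threshold
```

With these made-up numbers the default threshold flags only the oldest patient, while the lower screening threshold also flags the 67-year-old.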

2. Neural Network Models

Dense Neural Network (DNN):

Input layer: 10 features
Hidden layers: Including Batch Normalization and Dropout
Activation function: ReLU
Output layer: Single neuron with Sigmoid activation

Convolutional Neural Network (CNN):

Similar architecture but using convolutional layers for feature processing
Includes pooling layers and fully connected layers
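The summary does not give kernel sizes or layer counts for the CNN, so the following is only a sketch of the general idea: treating one patient's 10 features as a length-10 sequence and applying a 1D convolution, ReLU, and max pooling. The feature values and the width-3 filter are hypothetical, and plain Python is used here instead of PyTorch to keep the arithmetic visible.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid 1D cross-correlation, the operation deep learning libraries call convolution."""
    k = len(kernel)
    return [
        bias + sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

# One patient as a length-10 vector of already-scaled features (made-up numbers).
patient = [0.8, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, -0.3, 0.5, 2.0]
kernel = [0.5, -0.25, 0.5]  # hypothetical learned filter of width 3

feature_map = relu(conv1d(patient, kernel))  # length 10 - 3 + 1 = 8
pooled = max_pool1d(feature_map, size=2)     # length 8 // 2 = 4
```

Because the column order of tabular data is arbitrary, each filter mixes whichever features happen to be adjacent, which is worth keeping in mind when comparing the CNN with the dense network.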

Training Parameters:

Loss function: Cross-entropy loss (class-weighted to address class imbalance)
Optimizer: Adam (adaptive learning rate)
Training epochs: 400
Regularization: Dropout + Batch Normalization
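The summary mentions a weighted loss for class imbalance without giving the weights. A minimal sketch of class-weighted binary cross-entropy, assuming the positive class is weighted by the negative-to-positive ratio (the same convention as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`; whether the paper's code does this is not stated):

```python
import math

def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Toy batch with 5% positives, roughly the imbalance reported for the dataset.
labels = [1] + [0] * 19
pos_weight = labels.count(0) / labels.count(1)  # assumed weighting: 19 negatives per positive

# A model that predicts low risk for everyone, including the one true stroke case.
probs = [0.1] * 20

plain = weighted_bce(probs, labels, pos_weight=1.0)
weighted = weighted_bce(probs, labels, pos_weight=pos_weight)
```

The weighted loss penalizes the missed stroke case far more heavily than the unweighted one, which pushes training away from the "predict no stroke for everyone" solution.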

Technical Innovations

Multi-Architecture Comparison: First systematic comparison of CNN and DNN performance on tabular stroke prediction data
Medical-Oriented Design: Uses weighted loss functions to address class imbalance
Feature Importance Analysis: Analyzes biological factors' predictive contribution through logistic regression coefficients
Statistical Robustness Validation: Computes 95% confidence intervals using Bootstrap resampling

Experimental Setup

Dataset

Source: Kaggle Stroke Prediction Dataset
Scale: Approximately 5,000 samples
Class Distribution: Highly imbalanced (only 5-6% stroke cases)
Split: 80% training set, 20% test set
Features: 10 clinical and social features
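With only 5-6% positive cases, a purely random 80/20 split can leave the test set with noticeably more or fewer stroke cases than expected. The summary does not say whether the split was stratified; the sketch below assumes a stratified split and uses synthetic labels, and the function name `stratified_split` is illustrative.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels: 5% stroke cases, mirroring the reported imbalance.
labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(labels)

test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Both partitions keep the 5% positive rate, so recall on the test set is computed over a predictable number of stroke cases.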

Evaluation Metrics

Accuracy: Overall correctness rate
Recall: Ability to identify actual stroke cases (primary focus)
Precision: Accuracy of predicted stroke cases
F1-Score: Harmonic mean of precision and recall
AUC-ROC: Discriminative ability across different thresholds
Confusion Matrix: Detailed classification error analysis
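The threshold-based metrics above all derive from the four confusion-matrix counts. A short sketch on toy labels (not the paper's predictions) shows the arithmetic; AUC-ROC is left out because it needs predicted scores, not hard labels.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means stroke."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, recall, precision, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

# Toy example: 4 stroke cases, 6 non-stroke cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
```

Here one missed stroke case (a false negative) lowers recall to 0.75, while the two false alarms lower precision to 0.6.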

Comparison Methods

Logistic Regression (Sklearn implementation)
Dense Neural Network (PyTorch implementation)
Convolutional Neural Network (PyTorch implementation)

Implementation Details

Framework: PyTorch (neural networks), Sklearn (logistic regression)
Hardware: Standard computing environment
Reproducibility: Fixed random seeds, open-source code

Experimental Results

Main Results

Model	Accuracy	Recall	Precision	F1-Score
Logistic Regression	74.95%	75.81%	16.31%	-
Dense Neural Network	86.50%	43.55%	20.77%	-
Convolutional Neural Network	78.67%	53.23%	-	-

Key Findings

Accuracy vs. Recall Trade-off:
- Dense neural network achieves highest accuracy (86.50%) but lower recall (43.55%)
- Logistic regression achieves highest recall (75.81%) but lower precision (16.31%)
- CNN falls between the two (78.67% accuracy, 53.23% recall)
Feature Importance Analysis:
- Age is the most important predictor (consistent with medical knowledge)
- BMI importance lower than expected (inconsistent with existing research)
Training Dynamics:
- CNN converges slowly after 50 epochs
- DNN continues to improve throughout 400 training epochs
- No obvious overfitting observed

Statistical Significance

Bootstrap resampling (1,000 iterations) computes 95% confidence intervals:

DNN Accuracy: 86.50% [84.32%, 88.68%]
DNN Recall: 43.55% [39.87%, 47.23%]
Logistic Regression Accuracy: 74.95% [72.63%, 77.27%]
Logistic Regression Recall: 75.81% [72.14%, 79.48%]
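A percentile bootstrap is one common way to obtain intervals like these; the summary does not say which bootstrap variant the paper uses, so the sketch below is an assumption. It resamples test-set predictions with replacement 1,000 times and takes the 2.5th and 97.5th percentiles of the metric. The labels and predictions are synthetic.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric on paired labels."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper

# Synthetic test set: 10 stroke cases (6 detected), 90 non-stroke cases (5 false alarms).
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 6 + [0] * 4 + [1] * 5 + [0] * 85

point = accuracy(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, accuracy)
```

For recall, a resample can occasionally contain no positive cases at all, so a metric function used with this helper needs to handle a zero denominator.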

Related Work

The paper cites multiple relevant studies:

Shao et al. (2024): Emphasizes importance of BMI and age as biological predictors
Gupta et al. (2025): Neural network-based stroke prediction models
Zhang et al. (2022): Application of multilayer perceptrons in stroke prediction

Advantages compared to existing work:

Systematic comparison of multiple machine learning methods
Focus on minimizing false negatives
Comprehensive analysis combining clinical and social features

Conclusions and Discussion

Main Conclusions

Model Selection Depends on Application Scenario:
- Logistic Regression: Suitable for initial screening (high recall, strong interpretability)
- Dense Neural Network: Suitable for precise assessment (high accuracy, low false positives)
- CNN: Balanced performance, suitable for validation tool
Multi-Model System Recommendations:
- Stage 1: Logistic regression for initial screening
- Stage 2: DNN for precise assessment of high-risk patients
- Stage 3: CNN for verification and balance
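The summary names the three stages but not how their outputs are combined. One possible reading, sketched below, is a cascade in which the CNN acts as a tie-breaker when logistic regression flags a patient and the DNN does not. The function `cascade_predict`, the thresholds, and the stand-in models (lambdas that read precomputed scores from a dict) are all hypothetical.

```python
def cascade_predict(patient, screen, assess, verify,
                    screen_threshold=0.3, assess_threshold=0.5, verify_threshold=0.5):
    """Three-stage risk decision; each model maps a patient to a stroke probability."""
    # Stage 1: high-recall screen (logistic regression in the proposal).
    if screen(patient) < screen_threshold:
        return {"flagged": False, "stage": 1}
    # Stage 2: more precise assessment of screened-in patients (DNN).
    if assess(patient) >= assess_threshold:
        return {"flagged": True, "stage": 2}
    # Stage 3: verification when stages 1 and 2 disagree (CNN); assumed tie-break rule.
    return {"flagged": verify(patient) >= verify_threshold, "stage": 3}

# Stand-in models: read made-up scores instead of running real networks.
screen = lambda p: p["lr"]
assess = lambda p: p["dnn"]
verify = lambda p: p["cnn"]

low_risk = {"lr": 0.10, "dnn": 0.90, "cnn": 0.90}
high_risk = {"lr": 0.80, "dnn": 0.70, "cnn": 0.10}
borderline = {"lr": 0.60, "dnn": 0.20, "cnn": 0.55}

results = [cascade_predict(p, screen, assess, verify)
           for p in (low_risk, high_risk, borderline)]
```

Under this reading a patient screened out at stage 1 is never seen by the later models, so the overall false-negative rate cannot be lower than that of the screening stage. This is one reason to keep the screening threshold low.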

Limitations

Data Imbalance: Only 5-6% of cases are positive, which limits what the models can learn about the stroke class
Anomalous Feature Importance: Lower-than-expected BMI importance may affect prediction accuracy
Generalization Ability: Single dataset may limit model universality
Sample Size: Roughly 5,000 samples is relatively small, particularly for the positive class

Future Directions

Data Augmentation: Collect more real stroke patient data to alleviate class imbalance
Feature Engineering: Re-evaluate and optimize feature selection strategy
Model Ensemble: Develop more sophisticated multi-model fusion methods
Clinical Validation: Validate model effectiveness in actual medical environments

In-Depth Evaluation

Strengths

Practical Orientation: Clearly addresses practical need to minimize false negatives in medical settings
Comprehensive Methodology: Systematically compares traditional machine learning and deep learning methods
Statistical Rigor: Uses Bootstrap method to verify result robustness
Reproducibility: Provides complete code and data with MIT open-source license
Clinical Relevance: Integrates risk factors recognized by the medical field

Weaknesses

Data Quality: Severe class imbalance not adequately addressed
Model Depth: Relatively simple neural network architecture, insufficient exploration of deep learning potential
Insufficient Feature Engineering: Anomalous BMI importance suggests potential feature processing issues
Evaluation Limitations: Lacks comparison with existing clinical risk assessment tools
Experimental Scale: Single dataset, lacks cross-dataset validation

Impact

Academic Contribution: Provides practical multi-model comparison framework for medical AI
Clinical Value: Proposed hierarchical prediction system has practical application potential
Methodological Significance: Emphasizes importance of false negative control in medical AI
Extensibility: Methods generalizable to other medical prediction tasks

Applicable Scenarios

Primary Healthcare: Logistic regression model suitable for community health screening
Specialty Hospitals: Dense neural network suitable for precise risk assessment
Health Management: Can be integrated into personal health monitoring applications
Clinical Research: Provides tools for stroke risk factor research

References

CDC. Preventing stroke deaths. https://www.cdc.gov/vitalsigns/pdf/2017-09-vitalsigns.pdf
Shao, Y., et al. (2024). Link between triglyceride-glucose-body mass index and future stroke risk in middle-aged and elderly Chinese. Cardiovascular Diabetology.
Gupta, A., et al. (2025). Predicting stroke risk: An effective stroke prediction model based on neural networks. Journal of Neurorestoratology.

Overall Assessment: This study provides valuable multi-model comparative analysis on the important medical problem of stroke prediction, with particular emphasis on false negative control reflecting practical requirements of medical AI. Despite limitations such as data imbalance, the proposed multi-model system architecture has practical application value and provides a good reference framework for similar research in the medical AI field.