Stroke Prediction using Clinical and Social Features in Machine Learning

Chadha
Every year in the United States, about 800,000 people suffer a stroke: one person every 40 seconds, with a death occurring every four minutes. While individual factors vary, certain predictors are more prevalent in determining stroke risk. As strokes are the second leading cause of death and disability worldwide, predicting stroke likelihood from lifestyle factors is crucial. Showing individuals their stroke risk could motivate lifestyle changes, and machine learning offers ways to make this prediction. Neural networks excel at predicting outcomes from training features such as lifestyle factors, but they are not the only option. Logistic regression models can also compute the likelihood of a binary outcome from independent variables, which makes them well suited to stroke prediction. This analysis compares neural networks (dense and convolutional) with logistic regression models for stroke prediction, examining their pros, cons, and differences to develop the most effective predictor that minimizes false negatives.

Basic Information

Abstract

Approximately 800,000 people in the United States suffer from stroke annually, with one person experiencing a stroke every 40 seconds and one person dying from stroke every 4 minutes. As the second leading cause of death and disability globally, predicting stroke likelihood based on lifestyle factors is critically important. This study compares the performance of neural networks (dense and convolutional) with logistic regression models in stroke prediction, aiming to develop the most effective predictor while minimizing false negatives.

Research Background and Motivation

Problem Definition

Stroke prediction is a critical healthcare problem involving multiple internal and external factors:

  • External Factors: Marital status, occupation type, residential environment, etc.
  • Internal Factors: History of heart disease, BMI, age, blood glucose levels, etc.

Significance

  1. Public Health Impact: Stroke is the second leading cause of death and disability globally
  2. Prevention Value: Early risk assessment can motivate lifestyle changes
  3. Clinical Application: Real-time risk assessment can be integrated into routine physical examinations

Existing Limitations

  • Lack of comprehensive predictive models that effectively combine clinical and social features
  • The harm of false negatives in medical settings has not received sufficient attention
  • Limited comparative studies of different machine learning methods in stroke prediction

Core Contributions

  1. Multi-Model Comparison Framework: Systematically compares the performance of logistic regression, dense neural networks, and convolutional neural networks in stroke prediction
  2. Medical-Oriented Evaluation Strategy: Focuses on minimizing false negatives, aligning with practical requirements in medical settings
  3. Comprehensive Feature Analysis: Integrates clinical indicators and social factors for comprehensive risk assessment
  4. Practical Multi-Model System Recommendations: Proposes a hierarchical prediction pipeline combining advantages of multiple models

Methodology

Task Definition

  • Input: Patient data containing 10 features (age, gender, hypertension, heart disease, marital status, occupation type, residence type, average blood glucose level, BMI, smoking status)
  • Output: Binary classification result (0: no stroke, 1: stroke)
  • Constraint: Minimize false negatives while balancing precision and recall

Model Architecture

1. Logistic Regression Model

  • Preprocessing: Feature standardization using StandardScaler, categorical variable encoding using LabelEncoder
  • Regularization: L2 regularization to prevent overfitting
  • Optimization: Maximum iterations set to 10,000 to ensure convergence
  • Decision Boundary: 0.5 probability threshold (adjustable)
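
The logistic regression setup above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: the data is synthetic, categorical encoding is omitted because the stand-in features are already numeric, and `predict_with_threshold` is a hypothetical helper added to show the adjustable decision boundary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 10-feature patient table (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Hypothetical labels: positives are rare and driven by the first feature.
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 1.5).astype(int)

# Standardize, then fit logistic regression.
# The L2 penalty is scikit-learn's default, so it is not passed explicitly.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=10_000),
)
model.fit(X, y)


def predict_with_threshold(fitted_model, features, threshold=0.5):
    """Flag a stroke (1) when the predicted probability reaches the threshold."""
    stroke_prob = fitted_model.predict_proba(features)[:, 1]
    return (stroke_prob >= threshold).astype(int)


default_preds = predict_with_threshold(model, X)          # 0.5 cut-off
sensitive_preds = predict_with_threshold(model, X, 0.2)   # lower cut-off
```

Lowering the threshold can only add positive predictions, which is one way to trade precision for fewer false negatives.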

2. Neural Network Models

Dense Neural Network (DNN):

  • Input layer: 10 features
  • Hidden layers: Including Batch Normalization and Dropout
  • Activation function: ReLU
  • Output layer: Single neuron with Sigmoid activation
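
A dense network matching this description might look like the sketch below. The source does not give the number of hidden layers, their widths, or the dropout rate, so `hidden=32`, the two-layer depth, and `dropout=0.3` are hypothetical choices, as is the class name `DenseStrokeNet`.

```python
import torch
from torch import nn


class DenseStrokeNet(nn.Module):
    """Dense binary classifier; layer widths and dropout rate are hypothetical."""

    def __init__(self, n_features=10, hidden=32, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, hidden // 2),
            nn.BatchNorm1d(hidden // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden // 2, 1),  # single output neuron
            nn.Sigmoid(),               # squashes the output to a probability
        )

    def forward(self, x):
        # x: (batch, n_features) -> (batch,) stroke probabilities
        return self.net(x).squeeze(1)


torch.manual_seed(0)
dnn = DenseStrokeNet()
dnn.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    probs = dnn(torch.randn(4, 10))  # four random stand-in patients
```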

Convolutional Neural Network (CNN):

  • Similar architecture but using convolutional layers for feature processing
  • Includes pooling layers and fully connected layers
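
One plausible reading of a CNN on tabular data is to treat the 10 features as a one-channel sequence of length 10. The sketch below assumes that reading; the channel count, kernel size, and layer widths are hypothetical, and the class name `ConvStrokeNet` is illustrative. Because column order in a table is arbitrary, which features end up adjacent under the convolution is itself an assumption.

```python
import torch
from torch import nn


class ConvStrokeNet(nn.Module):
    """1-D CNN over the feature vector; all sizes here are hypothetical."""

    def __init__(self, n_features=10, channels=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),  # keeps length 10
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),                       # halves the length
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * (n_features // 2), 16),       # fully connected
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(16, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x.unsqueeze(1)  # (batch, n_features) -> (batch, 1, n_features)
        return self.classifier(self.features(x)).squeeze(1)


torch.manual_seed(0)
cnn = ConvStrokeNet()
cnn.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    cnn_probs = cnn(torch.randn(4, 10))  # four random stand-in patients
```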

Training Parameters:

  • Loss function: Cross-entropy loss, weighted to address class imbalance
  • Optimizer: Adam (adaptive learning rate)
  • Training epochs: 400
  • Regularization: Dropout + Batch Normalization
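
The source mentions cross-entropy and, elsewhere, weighted loss functions, but does not give the weighting scheme. The sketch below assumes a single weight on the positive class, set to the negative-to-positive ratio; `weighted_bce` and the example values are hypothetical.

```python
import math


def weighted_bce(probs, labels, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with positive examples scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical weight: negatives per positive at roughly 5% prevalence.
pos_weight = 0.95 / 0.05

probs = [0.10, 0.20, 0.30]  # stand-in model outputs
labels = [0, 0, 1]          # the last patient had a stroke

plain = weighted_bce(probs, labels)
weighted = weighted_bce(probs, labels, pos_weight)
```

With the weight applied, the missed stroke case dominates the loss, so training is pushed harder toward reducing false negatives.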

Technical Innovations

  1. Multi-Architecture Comparison: First systematic comparison of CNN and DNN performance on tabular stroke prediction data
  2. Medical-Oriented Design: Uses weighted loss functions to address class imbalance
  3. Feature Importance Analysis: Analyzes biological factors' predictive contribution through logistic regression coefficients
  4. Statistical Robustness Validation: Computes 95% confidence intervals using Bootstrap resampling

Experimental Setup

Dataset

  • Source: Kaggle Stroke Prediction Dataset
  • Scale: Approximately 5,000 samples
  • Class Distribution: Highly imbalanced (only 5-6% stroke cases)
  • Split: 80% training set, 20% test set
  • Features: 10 clinical and social features
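
With so few positive cases, a purely random 80/20 split can leave the test set with an unrepresentative share of strokes. The source does not say whether the split was stratified; the sketch below assumes it was, and uses a hypothetical label vector rather than the real dataset.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with every class split at the same ratio."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 5,000 patients, 5% positive.
labels = [1] * 250 + [0] * 4750
train_idx, test_idx = stratified_split(labels)
test_positives = sum(labels[i] for i in test_idx)
```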

Evaluation Metrics

  • Accuracy: Overall correctness rate
  • Recall: Ability to identify actual stroke cases (primary focus)
  • Precision: Accuracy of predicted stroke cases
  • F1-Score: Harmonic mean of precision and recall
  • AUC-ROC: Discriminative ability across different thresholds
  • Confusion Matrix: Detailed classification error analysis
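
The threshold-based metrics above all follow from the four confusion-matrix counts. The sketch below uses hypothetical counts, not the paper's results, and omits AUC-ROC because it needs the full set of predicted scores.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical test set: 1,000 patients, 50 of whom had a stroke.
model_metrics = classification_metrics(tp=30, fp=120, tn=830, fn=20)

# A classifier that never predicts stroke on the same hypothetical test set.
always_negative = classification_metrics(tp=0, fp=0, tn=950, fn=50)
```

The always-negative classifier reaches 95% accuracy while catching no strokes, which is why recall is treated as the primary metric on imbalanced data.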

Comparison Methods

  • Logistic Regression (Sklearn implementation)
  • Dense Neural Network (PyTorch implementation)
  • Convolutional Neural Network (PyTorch implementation)

Implementation Details

  • Framework: PyTorch (neural networks), Sklearn (logistic regression)
  • Hardware: Standard computing environment
  • Reproducibility: Fixed random seeds, open-source code

Experimental Results

Main Results

Model                          Accuracy   Recall   Precision   F1-Score
Logistic Regression            74.95%     75.81%   16.31%      -
Dense Neural Network           86.50%     43.55%   20.77%      -
Convolutional Neural Network   78.67%     53.23%   -           -
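
The F1-Score entries are not reported, but where both precision and recall are given, F1 follows as their harmonic mean. The values computed below are derived from the reported figures, not taken from the source. The CNN is left out because its precision is missing.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported in the results table.
reported = {
    "Logistic Regression": (0.1631, 0.7581),
    "Dense Neural Network": (0.2077, 0.4355),
}

derived_f1 = {name: f1_from(p, r) for name, (p, r) in reported.items()}
```

This gives roughly 26.8% for logistic regression and 28.1% for the dense network, so the two models are closer on F1 than their accuracy and recall figures suggest.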

Key Findings

  1. Accuracy vs. Recall Trade-off:
    • Dense neural network achieves highest accuracy (86.50%) but lower recall (43.55%)
    • Logistic regression achieves highest recall (75.81%) but lower precision (16.31%)
    • CNN falls between the two (78.67% accuracy, 53.23% recall)
  2. Feature Importance Analysis:
    • Age is the most important predictor (consistent with medical knowledge)
    • BMI importance lower than expected (inconsistent with existing research)
  3. Training Dynamics:
    • CNN converges slowly after 50 epochs
    • DNN continues to improve throughout 400 training epochs
    • No obvious overfitting observed

Statistical Significance

Bootstrap resampling (1,000 iterations) computes 95% confidence intervals:

  • DNN Accuracy: 86.50% [84.32%, 88.68%]
  • DNN Recall: 43.55% [39.87%, 47.23%]
  • Logistic Regression Accuracy: 74.95% [72.63%, 77.27%]
  • Logistic Regression Recall: 75.81% [72.14%, 79.48%]
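
A percentile bootstrap is one common way to obtain intervals like these. The source does not say whether the test predictions were resampled or the models retrained, so the sketch below assumes resampling of test-set predictions. The function names and data are hypothetical.

```python
import random


def bootstrap_ci(y_true, y_pred, metric, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        # Resample patient indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictions for 200 test patients (10 strokes).
y_true = [1] * 10 + [0] * 190
y_pred = [1] * 5 + [0] * 5 + [1] * 20 + [0] * 170

point_estimate = accuracy(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, accuracy)
```

For recall, a resample that contains no positive cases would need special handling, since the denominator would be zero.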

Related Work

The paper cites multiple relevant studies:

  1. Shao et al. (2024): Emphasizes importance of BMI and age as biological predictors
  2. Gupta et al. (2025): Neural network-based stroke prediction models
  3. Zhang et al. (2022): Application of multilayer perceptrons in stroke prediction

Advantages compared to existing work:

  • Systematic comparison of multiple machine learning methods
  • Focus on minimizing false negatives
  • Comprehensive analysis combining clinical and social features

Conclusions and Discussion

Main Conclusions

  1. Model Selection Depends on Application Scenario:
    • Logistic Regression: Suitable for initial screening (high recall, strong interpretability)
    • Dense Neural Network: Suitable for precise assessment (highest accuracy, fewer false positives than logistic regression)
    • CNN: Balanced performance, suitable for validation tool
  2. Multi-Model System Recommendations:
    • Stage 1: Logistic regression for initial screening
    • Stage 2: DNN for precise assessment of high-risk patients
    • Stage 3: CNN for verification and balance

Limitations

  1. Data Imbalance: Only 5-6% positive cases limit model learning capacity
  2. Anomalous Feature Importance: Lower-than-expected BMI importance may affect prediction accuracy
  3. Generalization Ability: Single dataset may limit model universality
  4. Sample Size: 5,000 samples relatively small, particularly for positive cases

Future Directions

  1. Data Augmentation: Collect more real stroke patient data to alleviate class imbalance
  2. Feature Engineering: Re-evaluate and optimize feature selection strategy
  3. Model Ensemble: Develop more sophisticated multi-model fusion methods
  4. Clinical Validation: Validate model effectiveness in actual medical environments

In-Depth Evaluation

Strengths

  1. Practical Orientation: Clearly addresses practical need to minimize false negatives in medical settings
  2. Comprehensive Methodology: Systematically compares traditional machine learning and deep learning methods
  3. Statistical Rigor: Uses Bootstrap method to verify result robustness
  4. Reproducibility: Provides complete code and data with MIT open-source license
  5. Clinical Relevance: Integrates risk factors recognized by the medical field

Weaknesses

  1. Data Quality: Severe class imbalance not adequately addressed
  2. Model Depth: Relatively simple neural network architecture, insufficient exploration of deep learning potential
  3. Insufficient Feature Engineering: Anomalous BMI importance suggests potential feature processing issues
  4. Evaluation Limitations: Lacks comparison with existing clinical risk assessment tools
  5. Experimental Scale: Single dataset, lacks cross-dataset validation

Impact

  1. Academic Contribution: Provides practical multi-model comparison framework for medical AI
  2. Clinical Value: Proposed hierarchical prediction system has practical application potential
  3. Methodological Significance: Emphasizes importance of false negative control in medical AI
  4. Extensibility: Methods generalizable to other medical prediction tasks

Applicable Scenarios

  1. Primary Healthcare: Logistic regression model suitable for community health screening
  2. Specialty Hospitals: Dense neural network suitable for precise risk assessment
  3. Health Management: Can be integrated into personal health monitoring applications
  4. Clinical Research: Provides tools for stroke risk factor research

References

  1. CDC. Preventing stroke deaths. https://www.cdc.gov/vitalsigns/pdf/2017-09-vitalsigns.pdf
  2. Shao, Y., et al. (2024). Link between triglyceride-glucose-body mass index and future stroke risk in middle-aged and elderly Chinese. Cardiovascular Diabetology.
  3. Gupta, A., et al. (2025). Predicting stroke risk: An effective stroke prediction model based on neural networks. Journal of Neurorestoratology.

Overall Assessment: This study provides valuable multi-model comparative analysis on the important medical problem of stroke prediction, with particular emphasis on false negative control reflecting practical requirements of medical AI. Despite limitations such as data imbalance, the proposed multi-model system architecture has practical application value and provides a good reference framework for similar research in the medical AI field.