Stroke Prediction Model

This project implements a machine learning pipeline for predicting stroke risk from tabular patient data. Several models are trained and compared so that the best-performing one can be selected. The sections below explain how each key consideration was implemented.

Data Set

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

Attribute Information

  1. id: unique identifier
  2. gender: "Male", "Female" or "Other"
  3. age: age of the patient
  4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
  5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
  6. ever_married: "No" or "Yes"
  7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
  8. Residence_type: "Rural" or "Urban"
  9. avg_glucose_level: average glucose level in blood
  10. bmi: body mass index
  11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown" (smoking information unavailable for the patient)
  12. stroke: 1 if the patient had a stroke or 0 if not

Key Considerations Implementation

Data Cleaning

Drop id column

The id column is dropped as it serves as a unique identifier for each row but does not contribute to the predictive power of the model.

Remove missing values

Rows with a missing 'bmi' value are removed. Because these rows are few in number, dropping them has little effect on model accuracy.
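
The two cleaning steps above could look roughly like the following pandas sketch. The helper name `clean` and the small inline frame are assumptions made for illustration (the values are invented), not the project's actual code or data.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the id column and any rows where bmi is missing."""
    # 'id' only identifies rows, so it carries no predictive signal.
    df = df.drop(columns=["id"])
    # Remove rows whose bmi value is missing.
    df = df.dropna(subset=["bmi"])
    return df.reset_index(drop=True)


# Toy stand-in for the real dataset (made-up values).
raw = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "age": [67.0, 61.0, 80.0],
        "bmi": [36.6, np.nan, 32.5],
        "stroke": [1, 1, 0],
    }
)
cleaned = clean(raw)
```

It is worth checking how many rows are dropped on the real data before relying on this step, since the claim that the loss is negligible depends on that count.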

Feature Engineering

Binary Encoding

Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models:

  • ever_married: Encoded as 0 for “No” and 1 for “Yes”.
  • Residence_type: Encoded as 0 for “Rural” and 1 for “Urban”.
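
A minimal sketch of this binary encoding, assuming the data is held in a pandas DataFrame. The function name `binary_encode` and the `BINARY_MAPS` constant are hypothetical; only the column names and the 0/1 mappings come from the description above.

```python
import pandas as pd

# Mappings taken from the encoding described above.
BINARY_MAPS = {
    "ever_married": {"No": 0, "Yes": 1},
    "Residence_type": {"Rural": 0, "Urban": 1},
}


def binary_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the two-valued columns mapped to 0/1."""
    df = df.copy()
    for column, mapping in BINARY_MAPS.items():
        # Values not present in the mapping would become NaN.
        df[column] = df[column].map(mapping)
    return df


# Toy frame for illustration only.
toy = pd.DataFrame(
    {"ever_married": ["Yes", "No"], "Residence_type": ["Urban", "Rural"]}
)
encoded = binary_encode(toy)
```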

One-Hot Encoding for Multi-Class Categorical Features

  • For features with more than two categories, such as gender, work_type, and smoking_status, apply one-hot encoding to create separate binary columns for each category.
  • The onehot_encode function is assumed to handle the transformation, creating additional columns for each category while dropping the original column.
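
The README does not show how onehot_encode is written, so the version below is only one plausible sketch built on `pandas.get_dummies`. Its signature (`df`, `column`, `prefix`) is an assumption, and the toy frame is invented for illustration.

```python
import pandas as pd


def onehot_encode(df: pd.DataFrame, column: str, prefix: str) -> pd.DataFrame:
    """Replace `column` with one 0/1 indicator column per category."""
    df = df.copy()
    # One indicator column per distinct value, named '<prefix>_<value>'.
    dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
    df = pd.concat([df, dummies], axis=1)
    # Drop the original categorical column, as described above.
    return df.drop(columns=[column])


# Toy frame for illustration only.
toy = pd.DataFrame({"gender": ["Male", "Female", "Male"], "age": [67, 61, 80]})
result = onehot_encode(toy, column="gender", prefix="gender")
```

The same call would be repeated for work_type and smoking_status.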

Split Dataset into Features and Target

  • Separate the target variable (stroke) from the features:
  • X: Contains all feature columns used as input for the model.
  • y: Contains the target column, which indicates whether a stroke occurred.

Train-Test Split

  • Split the dataset into training and testing sets to evaluate model performance effectively. Testing on unseen data gives an honest estimate of how well the model generalizes and helps detect overfitting.
  • The specific split ratio (e.g., 70% train, 30% test) can be customized as needed.
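
The feature/target separation and the train-test split can be sketched together with scikit-learn's `train_test_split`. The 70/30 ratio follows the example above. The `stratify=y` argument and the `random_state` value are assumptions added here (stratifying keeps the stroke rate similar in both sets), and the toy frame is invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned, encoded dataset (made-up values).
df = pd.DataFrame(
    {
        "age": [34, 45, 52, 29, 61, 38, 70, 81, 77, 68],
        "avg_glucose_level": [85, 92, 101, 78, 110, 88, 190, 210, 175, 160],
        "stroke": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    }
)

# y is the target column; X holds every other column.
y = df["stroke"]
X = df.drop(columns=["stroke"])

# 70% train, 30% test; stratify is an optional assumption, not from the README.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```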

Model Selection

The following models are evaluated:

  • Logistic Regression
  • K-Nearest Neighbors
  • Support Vector Machine (Linear Kernel)
  • Support Vector Machine (RBF Kernel)
  • Neural Network
  • Gradient Boosting

Each model is assessed against the following criteria:

  • Handles both numerical and categorical features
  • Resistant to overfitting
  • Provides feature importance
  • Good performance on imbalanced data
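
One way to compare the listed models is a simple fit-and-score loop, sketched below with scikit-learn. Everything beyond the model names is an assumption: the hyperparameters, the synthetic imbalanced data from `make_classification`, and the use of plain accuracy as the comparison metric. On imbalanced data such as stroke records, recall or F1 on the positive class may be a more informative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Candidate models; the settings here are illustrative defaults.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine (Linear Kernel)": SVC(kernel="linear"),
    "Support Vector Machine (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Synthetic, imbalanced stand-in for the stroke data (about 10% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

best_name = max(scores, key=scores.get)
```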

Software Engineering Best Practices

A. Logging

Comprehensive logging system:

import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

Logging features:

  • Timestamp for each operation
  • Different log levels (INFO, ERROR)
  • Operation tracking
  • Error capture and reporting
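
As an illustration of how these features might be used around a pipeline step, here is a small sketch. The logger name `stroke_pipeline` and the `run_step` wrapper are hypothetical, not part of the project. `logger.exception` is a standard-library call that logs at ERROR level and includes the traceback.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("stroke_pipeline")  # hypothetical logger name


def run_step(name, func, *args):
    """Run one pipeline step, logging its start, end, and any failure."""
    logger.info("Starting %s", name)
    try:
        result = func(*args)
    except Exception:
        # Logged at ERROR level together with the traceback, then re-raised.
        logger.exception("Step %s failed", name)
        raise
    logger.info("Finished %s", name)
    return result


# Example: wrap any callable; here the built-in sum stands in for a real step.
total = run_step("sum", sum, [1, 2, 3])
```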

B. Documentation

  • Docstrings for all classes and methods
  • Clear code structure with comments
  • This README file
  • Logging outputs for tracking