Stroke Prediction Model

This project implements a machine learning pipeline for predicting stroke risk from tabular patient data. Several models are trained and compared to select the best performer. The sections below explain how each key consideration was implemented.

Data Set

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides relevant information about one patient.

Attribute Information

id: unique identifier
gender: "Male", "Female" or "Other"
age: age of the patient
hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
ever_married: "No" or "Yes"
work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
Residence_type: "Rural" or "Urban"
avg_glucose_level: average glucose level in blood
bmi: body mass index
smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"
stroke: 1 if the patient had a stroke or 0 if not

Key Considerations Implementation

Data Cleaning

Drop id column

The id column is dropped as it serves as a unique identifier for each row but does not contribute to the predictive power of the model.

Remove missing values

Rows with a missing bmi value are removed. They are few in number, so dropping them is expected to have little impact on model accuracy.
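
The two cleaning steps above could be sketched with pandas roughly as follows. The helper name clean_data and the sample rows are made up for illustration and are not taken from the project code.

```python
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the identifier column and any row whose bmi is missing."""
    df = df.drop(columns=["id"])
    df = df.dropna(subset=["bmi"])
    return df.reset_index(drop=True)


# Tiny made-up sample standing in for the real patient data.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, None, 32.5],
    "stroke": [1, 1, 0],
})
cleaned = clean_data(sample)
```

Before dropping rows on the real data, it may be worth checking how many are affected, for example with df["bmi"].isna().sum().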

Feature Engineering

Binary Encoding

Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models:

ever_married: Encoded as 0 for “No” and 1 for “Yes”.
Residence_type: Encoded as 0 for “Rural” and 1 for “Urban”.
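
A minimal sketch of this encoding, assuming the data is held in a pandas DataFrame. The helper name binary_encode is hypothetical and the sample values are invented.

```python
import pandas as pd


def binary_encode(df: pd.DataFrame, column: str, positive_value: str) -> pd.DataFrame:
    """Return a copy where `column` is 1 for positive_value and 0 otherwise."""
    df = df.copy()
    df[column] = (df[column] == positive_value).astype(int)
    return df


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Rural"],
})
encoded = binary_encode(sample, "ever_married", "Yes")
encoded = binary_encode(encoded, "Residence_type", "Urban")
```

Note that this sketch maps every value other than the positive one to 0, including missing values, so it fits only columns that really have two categories.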

One-Hot Encoding for Multi-Class Categorical Features

For features with more than two categories, such as gender, work_type, and smoking_status, apply one-hot encoding to create separate binary columns for each category.
The onehot_encode function is assumed to handle the transformation, creating additional columns for each category while dropping the original column.
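
The project's onehot_encode implementation is not shown here. One plausible version, built on pandas.get_dummies, might look like the sketch below. The signature and the column prefix are assumptions.

```python
import pandas as pd


def onehot_encode(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace `column` with one 0/1 indicator column per category."""
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "age": [67.0, 49.0, 31.0],
    "smoking_status": ["smokes", "never smoked", "Unknown"],
})
encoded = onehot_encode(sample, "smoking_status")
```

Each new column is named after the original column and the category, such as smoking_status_smokes.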

Split Dataset into Features and Target

Separate the target variable (stroke) from the features:
X: Contains all feature columns used as input for the model.
y: Contains the target column, which indicates whether a stroke occurred.
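
With pandas, this separation could be written as in the sketch below. The sample frame is invented and much smaller than the real data.

```python
import pandas as pd

# Made-up, already-encoded rows for illustration only.
sample = pd.DataFrame({
    "age": [67.0, 49.0, 31.0, 80.0],
    "hypertension": [0, 1, 0, 1],
    "stroke": [1, 0, 0, 1],
})

X = sample.drop(columns=["stroke"])  # every column except the target
y = sample["stroke"]                 # the target column only
```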

Train-Test Split

Split the dataset into training and testing sets so that model performance is measured on data the model has not seen during training. This makes overfitting visible, since an overfit model scores well on the training set but poorly on the test set.
The specific split ratio (e.g., 70% train, 30% test) can be customized as needed.
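
A sketch of a 70/30 split using scikit-learn's train_test_split on synthetic data. The stratify argument and the random_state value are assumptions added here: stratifying keeps the share of stroke cases the same in both sets, which matters when positives are rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, 10% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42
)
```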

Model Selection

The following models are evaluated:

Logistic Regression
K-Nearest Neighbors
Support Vector Machine (Linear Kernel)
Support Vector Machine (RBF Kernel)
Neural Network
Gradient Boosting

Each model is evaluated against the following criteria:

Handles both numerical and categorical features
Resistant to overfitting
Provides feature importance
Good performance on imbalanced data
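
The comparison could be run along the lines of the sketch below, which fits each listed model with scikit-learn on synthetic imbalanced data. The hyperparameters, the synthetic data, and the choice of balanced accuracy as the metric are all assumptions, since the metric used by the project is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the encoded patient features.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine (Linear Kernel)": SVC(kernel="linear"),
    "Support Vector Machine (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = balanced_accuracy_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)
```

Plain accuracy can look high on this kind of data even when a model never predicts a stroke, which is why a metric that weights both classes is used in the sketch.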

Software Engineering Best Practices

A. Logging

Logging is configured once with Python's standard logging module:

import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

Logging features:

Timestamp for each operation
Different log levels (INFO, ERROR)
Operation tracking
Error capture and reporting
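
As an illustration of operation tracking and error capture, a pipeline step could be wrapped as in the sketch below. The logger name stroke_pipeline and the helper run_step are hypothetical and do not come from the project code.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("stroke_pipeline")


def run_step(name, func, *args):
    """Run one pipeline step, logging its start, finish, and any failure."""
    logger.info("Starting %s", name)
    try:
        result = func(*args)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("%s failed", name)
        raise
    logger.info("Finished %s", name)
    return result


doubled = run_step("double", lambda x: x * 2, 21)
```

The exception is re-raised after logging so that a failed step still stops the pipeline.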

B. Documentation

Docstrings for all classes and methods
Clear code structure with comments
This README file
Logging outputs for tracking
