Final Project in AI Engineering, DVAE26, ht24

Instructions

This project focuses on developing a machine learning model for predictive analytics using a healthcare dataset. It encompasses the entire machine learning lifecycle, from data quality assessment and preprocessing to model training, evaluation, and deployment.

Objective: Predict stroke occurrences using tabular healthcare data.

Key Considerations:

  • Data preprocessing: Handle missing values, encode categorical variables, and normalize features.
  • Model selection: Train, evaluate, and compare multiple machine learning models.
  • Evaluation metrics: Use accuracy, ROC AUC, and the confusion matrix to measure model performance.
  • SE best practices: Apply modular code, version control, hyperparameter tuning, and interpretability techniques.

Dataset: Kaggle Stroke Prediction Dataset
Model Development: Build a pipeline to process data, train models, and evaluate their performance.
Deployment: Deploy the best-performing model on Hugging Face.
Deliverables: A report summarizing the workflow, key findings, and insights.


Workflow Pipeline

1. Data Acquisition

The dataset was sourced from Kaggle's Stroke Prediction Dataset. It includes 5110 records with 12 features, such as age, gender, hypertension, and smoking status.
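
For reference, loading the data is a one-liner with pandas; a minimal sketch is shown below (the file name is assumed from the Kaggle download and may differ):

```python
import pandas as pd

# Load the Kaggle stroke dataset (file name assumed from the Kaggle download).
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

print(df.shape)         # expected: (5110, 12)
print(df.isna().sum())  # reveals the missing BMI values handled in preprocessing
```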

2. Data Preprocessing

The following steps were performed to ensure data quality and readiness for modeling:

  • Handling Missing Values: Imputed missing BMI values using the mean strategy.
  • Encoding Categorical Features: Used one-hot encoding for variables like gender and residence type.
  • Feature Scaling: Standardized continuous variables to ensure uniformity across features.
  • Balancing the Dataset: Addressed class imbalance using SMOTE (Synthetic Minority Oversampling Technique).
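
A minimal sketch of how these steps could be combined with scikit-learn and imbalanced-learn follows; the column names are assumed from the Kaggle dataset, and details such as the random seed are illustrative rather than the project's exact configuration.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column groups (names assumed from the Kaggle dataset).
numeric_cols = ["age", "avg_glucose_level", "bmi"]
categorical_cols = ["gender", "ever_married", "work_type",
                    "Residence_type", "smoking_status"]

# Mean-impute and standardize numeric features; one-hot encode categoricals;
# pass binary flags such as hypertension and heart_disease through unchanged.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
], remainder="passthrough")

df = pd.read_csv("healthcare-dataset-stroke-data.csv")
X = preprocess.fit_transform(df.drop(columns=["id", "stroke"]))
y = df["stroke"]

# Rebalance the minority (stroke) class with SMOTE.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```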

3. Model Development

Several machine learning models were developed and evaluated:

  1. Linear Regression
  2. Logistic Regression
  3. Random Forest Classifier + hyperparameter tuning
  4. Decision Tree Classifier
  5. Gradient Boosting Classifier + hyperparameter tuning
  6. K-Nearest Neighbors (KNN)
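
A sketch of how these models might be trained and compared is given below. It reuses the resampled features from the preprocessing sketch; the split ratio, parameter grids, and scoring choices are assumptions, and the linear regression baseline is omitted because it requires thresholding continuous predictions into class labels.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# X_res, y_res come from the preprocessing/SMOTE sketch above.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.25, stratify=y_res, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    # The ensembles are wrapped in a small grid search for hyperparameter tuning.
    "Random Forest": GridSearchCV(
        RandomForestClassifier(random_state=42),
        {"n_estimators": [100, 300], "max_depth": [None, 10]},
        scoring="roc_auc", cv=5),
    "Gradient Boosting": GridSearchCV(
        GradientBoostingClassifier(random_state=42),
        {"learning_rate": [0.05, 0.1], "max_depth": [3, 5]},
        scoring="roc_auc", cv=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name,
          accuracy_score(y_test, preds),
          roc_auc_score(y_test, proba),
          confusion_matrix(y_test, preds).tolist())
```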

Results

The table below summarizes the performance of the models:

Model               | Accuracy | ROC AUC  | Confusion Matrix
--------------------|----------|----------|--------------------------
Linear Regression   | 0.775632 | 0.837220 | [[478, 195], [107, 566]]
Logistic Regression | 0.776374 | 0.837297 | [[492, 181], [120, 553]]
Random Forest       | 0.942793 | 0.990407 | [[629, 44], [33, 640]]
Decision Tree       | 0.892273 | 0.892273 | [[588, 85], [60, 613]]
Gradient Boosting   | 0.957652 | 0.988327 | [[651, 22], [35, 638]]
KNN                 | 0.880386 | 0.946411 | [[531, 142], [19, 654]]

Key Insights

  • Gradient Boosting achieved the highest accuracy (95.76%) and maintained an excellent ROC AUC score (0.988).
  • Random Forest provided comparable ROC AUC performance (0.990), with slightly lower accuracy (94.28%).
  • Simpler models such as Logistic Regression and Linear Regression performed adequately but trailed the tree-based ensembles by a wide margin.
  • KNN and Decision Tree showed strong results but were outperformed by Gradient Boosting and Random Forest.

Evaluation Metrics

  1. Accuracy: Gradient Boosting demonstrated the best predictive accuracy, correctly identifying 95.76% of instances.
  2. ROC AUC: Random Forest achieved the highest ROC AUC score, indicating excellent separability between stroke and non-stroke cases.
  3. Confusion Matrix: Gradient Boosting produced the fewest total misclassifications (22 false positives and 35 false negatives, assuming scikit-learn's [[TN, FP], [FN, TP]] layout), demonstrating robust performance.

Deployment

The best-performing model (Gradient Boosting) was deployed on Hugging Face. It can be accessed publicly via the following link: ML stroke prediction.

Deployment highlights:

  • Platform: Hugging Face
  • Usage: Users can input patient data to predict stroke risk interactively.
  • Versioning: Deployment includes metadata for reproducibility.
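
The deployed app itself is not reproduced here; as an illustration only, a Hugging Face Space of this kind is typically a small Gradio script along the following lines. The artifact name, the reduced feature set, and the function names are assumptions, not the actual deployment code.

```python
import gradio as gr
import joblib
import pandas as pd

# Hypothetical app.py for a Hugging Face Space. "stroke_pipeline.joblib" is an
# assumed artifact name; a real app would expose every model input.
model = joblib.load("stroke_pipeline.joblib")

def predict(age, avg_glucose_level, bmi, hypertension):
    row = pd.DataFrame([{
        "age": age,
        "avg_glucose_level": avg_glucose_level,
        "bmi": bmi,
        "hypertension": hypertension,
    }])
    # Return the predicted probability of the positive (stroke) class.
    return float(model.predict_proba(row)[:, 1][0])

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Number(label="Age"),
            gr.Number(label="Average glucose level"),
            gr.Number(label="BMI"),
            gr.Number(label="Hypertension (0 or 1)")],
    outputs=gr.Number(label="Predicted stroke probability"),
)

if __name__ == "__main__":
    demo.launch()
```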

Data Quality Analysis

Analysis of Data Quality

  • The dataset contained features with widely differing scales and types, so preprocessing was required before the data was ready for training.
  • Missing values in BMI were a key challenge, addressed using mean imputation.

Challenges Faced

  1. Class Imbalance: Stroke cases were significantly underrepresented in the dataset, potentially biasing model predictions.
  2. Feature Correlation: Certain features (e.g., age and hypertension) required analysis to ensure they contributed meaningfully to the prediction task.

Measures Taken to Ensure Data Integrity

  • Visualized data distribution and checked for anomalies before preprocessing.
  • Validated preprocessing steps using unit tests.
  • Conducted exploratory data analysis (EDA) to confirm correlations and feature relevance.
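
For illustration, the class-imbalance and correlation checks can be as simple as the sketch below (column names assumed from the Kaggle dataset):

```python
import pandas as pd

df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Class imbalance: stroke cases are a small minority of the records.
print(df["stroke"].value_counts(normalize=True))

# Correlation of numeric/binary features with the target, to confirm that
# features such as age and hypertension contribute to the prediction task.
cols = ["age", "hypertension", "heart_disease", "avg_glucose_level", "bmi", "stroke"]
print(df[cols].corr()["stroke"].sort_values(ascending=False))
```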

SE Best Practices

Key Practices Implemented:

  • Modularization: Encapsulated preprocessing, training, and evaluation in separate functions for reusability.
  • Version Control: Maintained a Git repository with detailed commit messages.
  • Hyperparameter Tuning: Experimented with different learning rates, tree depths, and ensemble configurations.
  • Documentation: Created comprehensive project documentation, including a README file.
  • Testing: Added unit tests for data preprocessing and model evaluation steps.
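
As an example of the kind of unit test used for preprocessing, the sketch below checks that imputation removes missing BMI values. `build_preprocessor` is a hypothetical factory returning the ColumnTransformer from the earlier sketch, not a function from the actual repository; the test is meant to be run with pytest.

```python
# test_preprocessing.py -- illustrative unit test, run with pytest.
import numpy as np
import pandas as pd

from preprocessing import build_preprocessor  # hypothetical helper


def test_no_missing_values_after_imputation():
    df = pd.DataFrame({
        "age": [50.0, 60.0],
        "avg_glucose_level": [100.0, 120.0],
        "bmi": [25.0, np.nan],              # missing BMI should be mean-imputed
        "gender": ["Male", "Female"],
        "ever_married": ["Yes", "No"],
        "work_type": ["Private", "Private"],
        "Residence_type": ["Urban", "Rural"],
        "smoking_status": ["never smoked", "smokes"],
        "hypertension": [0, 1],
        "heart_disease": [0, 0],
    })
    X = build_preprocessor().fit_transform(df)
    X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
    assert not np.isnan(X.astype(float)).any()
```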

Key Outcomes

Results:

  • Gradient Boosting emerged as the best-performing model for stroke prediction.
  • The project showcased the impact of robust preprocessing and careful model selection on achieving high accuracy and ROC AUC scores.

Personal Insights:

  • It was an extremely interesting project for me, as I had never worked on anything like it before, and I learned a lot.
