EDA of Stroke Dataset and Prediction of Strokes using Selected ML Algorithms
- Introduction
- Setup
- Basic EDA and Data Preparation
- Train and Evaluate a Dummy Model
- Model Selection
- Model Tuning
- Final Model and Predictions
- Summary and Conclusion
Introduction
According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
In this project, we'll try to predict whether a patient is likely to get a stroke based on input parameters such as gender, age, various diseases, and smoking status. To do this, we'll use the Stroke Prediction Dataset provided by fedesoriano on Kaggle.
Each row in the dataset provides relevant information about a patient, such as age, smoking status, gender, heart disease, bmi, and work type, and finally whether the patient suffered a stroke. This last column will be our target, which we'll try to predict using information from the other columns.
The steps that we'll take:
- Setup and import the dataset
- Perform basic EDA and prepare the dataset for training
- Train and evaluate a dummy model
- Train multiple ML models and make predictions
- Evaluate and compare their performance
Setup
import numpy as np # Linear algebra
import pandas as pd # Data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Data Visualization
import seaborn as sns # Data Visualization
from imblearn.over_sampling import SMOTE # Oversampling imbalanced classes
from sklearn.impute import SimpleImputer, MissingIndicator # Handle missing values
from sklearn.model_selection import train_test_split, cross_val_predict, cross_validate, GridSearchCV, StratifiedKFold # Model validation
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, RocCurveDisplay, PrecisionRecallDisplay, confusion_matrix, ConfusionMatrixDisplay # Metrics
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, FunctionTransformer # Data preparation
from imblearn.pipeline import make_pipeline # Build pipeline
from sklearn.compose import make_column_transformer # Combine data preparation steps
# Models for prediction
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# ignore warnings
import warnings
warnings.filterwarnings("ignore")
# get the file path
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
file_path = "/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv"
# file size of the dataset
!ls -lh {file_path}
# look at the first few rows of the dataset
!head {file_path}
Observations
- File size is 310 KB, thus it is safe to import the whole dataset.
- The delimiter is `,`
- `id` is the index column
- `stroke` is the prediction class in the last column
# Load the dataset into pandas DataFrame
df = pd.read_csv(file_path, index_col = ["id"])
df.head()
Basic EDA and Data Preparation
First, it is really important to separate the test data from the train data at this point, so that the transformers and models cannot learn from the test data. Before making the split, it is worth looking at the distribution of the prediction class to see if there is an imbalance and whether we will need to stratify the split.
# Distribution of prediction class
stroke_val_count = df.stroke.value_counts()
print(f"Value Count in the prediction class - Stroke:\n{stroke_val_count}\n\n")
sns.barplot(x = stroke_val_count.index, y = stroke_val_count)
plt.show()
As we can see, the prediction class is highly imbalanced. Therefore, we'll need to stratify the split. Also, after making the split, it would be worthwhile to generate artificial samples of the minority class in the training dataset to help the ML models distinguish better between the two categories of the prediction class.
# Separate the prediction class from the training features
X = df.copy()
y = X.pop("stroke")
# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 42)
# Look at the dataset
X_train.head()
X_train.info()
Observations
- There are 4088 entries in the train dataset.
- There are 10 features in total which we can use to predict the occurrence of stroke.
- There are some categorical features, like `gender`, `work_type`, and `Residence_type`, of dtype `object`, which will need to be one-hot encoded.
- Numerical features will need to be scaled.
- There are some missing values in the `bmi` column.
- There are some missing values in the bmi column.
# Null values
X_train.isnull().sum()
# Distribution of null values in the dataset
plt.figure(figsize = (10, 8))
sns.heatmap(X_train.isnull(), cmap = "rocket", yticklabels = False, cbar = False)
null_bmi = y_train.loc[X_train.bmi.isna()]
not_null_bmi = y_train.loc[X_train.bmi.notnull()]
print(f"""Non null bmi values:-
Stroke-no stroke ratio: {null_bmi.sum()/len(null_bmi)}
Null bmi values:-
Stroke-no stroke ratio: {not_null_bmi.sum()/len(not_null_bmi)}
""")
Observations
Although null values look to be evenly distributed in the heatmap, the rate of occurrence of stroke is significantly different in the entries with null bmi values. Thus, instead of dropping the null values, it would be better to impute them with the median bmi value and also encode the presence of a null value in a separate column. This may help in better prediction of stroke.
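As a quick illustration of this idea (a minimal sketch, separate from the full pipeline built later, which uses a standalone MissingIndicator), SimpleImputer can produce the imputed values and the missing-value flag in one step via its add_indicator option:
# Illustrative sketch only: median-impute bmi and keep a missing-value flag,
# so the "bmi was missing" signal stays available to the models.
bmi_imputer = SimpleImputer(strategy = "median", add_indicator = True)
bmi_imputed = bmi_imputer.fit_transform(X_train[["bmi"]])
print(bmi_imputed[:5])  # column 0: imputed bmi, column 1: 1.0 where bmi was missing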
# Distribution of Categorical Features
cat_cols = ['gender', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
print("Value Counts of all categorical columns...\n")
for i, col in enumerate(cat_cols):
    val_count = X_train[col].value_counts()
    print(f"Values:-\n{val_count}\n\n")
    plt.figure(i)
    sns.barplot(x = val_count.index, y = val_count)
Observations
The categories in all the categorical features look fine, although most of them are unevenly distributed. Thus, we'll just one-hot encode these columns.
# Look at the basic statistics
num_cols = ["age", "avg_glucose_level", "bmi"]
X_train[num_cols].describe()
Observations
- `age` appears to be slightly negatively skewed
- `avg_glucose_level` appears to be positively skewed
- The minimum age of 0.08 suggests that age is stored in fractions, which needs further inspection.
# Distribution of Numerical (Continuous) Features
for i, col in enumerate(num_cols):
    plt.figure(i)
    sns.violinplot(x = X_train[col])
    plt.legend([f"skewness: {X_train[col].skew():.2f}\nkurtosis: {X_train[col].kurtosis():.2f}"])
Observations
- `age` is very slightly negatively skewed with negative kurtosis.
- `avg_glucose_level` and `bmi` are positively skewed with sharp peaks (positive kurtosis).
Although some ML models assume normally distributed data, they can work fine with data that has small skewness and kurtosis values. Therefore, we'll just scale these features using MinMaxScaler.
# inspect the `age` column
print(sorted(X_train.age.unique()))
Observations
Only values smaller than 2 are stored as floats. Thus, we can change the whole column to the int datatype to make this feature uniform.
Build Column Transformer
Now we can combine all the observations that we made to build a data preparation pipeline with a column transformer. We'll use the transformer to:
- impute missing values
- change the `age` column dtype to `int`
- add a missing value indicator
- one-hot encode categorical columns
- scale numerical features
# Columns
age_col = ["age"]
num_cols_without_age = ["avg_glucose_level", "bmi"]
cat_cols = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]
# Scale the data
scaler = MinMaxScaler()
# handle missing values in `bmi`
imputer = SimpleImputer(strategy = "median")
# change dtype of `age`
def to_int(x):
    return pd.DataFrame(x).astype(int)
int_tr = FunctionTransformer(to_int)
# build transformer for `age` separately
age_transformer = make_pipeline(int_tr, scaler)
# build transformer for numeric cols without age
num_transformer = make_pipeline(imputer, scaler)
# missing_indicator
miss_ind = MissingIndicator()
# build transformer for categorical cols
cat_transformer = OneHotEncoder(drop = "first", handle_unknown = "ignore", sparse = False)
# combine transformers to make a single column transformer
ct = make_column_transformer(
    (age_transformer, age_col),
    (num_transformer, num_cols_without_age),
    (miss_ind, num_cols_without_age),
    (cat_transformer, cat_cols),
    remainder = "passthrough")
ct.fit_transform(X_train).shape
# SMOTE to oversample the minority (stroke) class in the training folds
sm = SMOTE(random_state = 42)
sm
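As a quick sanity check (illustrative only; in the modelling pipelines below, SMOTE is applied inside cross-validation rather than up front), we can resample the transformed training data once and compare the class counts before and after:
# Illustrative check: apply SMOTE to the transformed training data
# and compare class counts before and after oversampling.
X_res, y_res = sm.fit_resample(ct.fit_transform(X_train), y_train)
print("Before SMOTE:", y_train.value_counts().to_dict())
print("After SMOTE: ", pd.Series(y_res).value_counts().to_dict())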
Train and Evaluate a Dummy Model
A dummy model, which doesn't use any of the features to make predictions, gives a baseline score that the future ML models should at least beat. This score will also help to identify errors in model training and evaluation if a model performs worse than it.
# Build imblearn pipeline with the DummyClassifier
dummy_model = DummyClassifier(strategy = 'prior')
dummy_pipe = make_pipeline(ct, sm, dummy_model)
# specify KFold strategy
cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
# Scores on multiple metrics from the DummyClassifier
cross_validate(dummy_pipe, X_train, y_train, cv = cv, n_jobs = -1, scoring = ["accuracy", "recall", "roc_auc", "f1"])
The model accuracy comes out to be really high, at about 0.95. Although that would generally be a good score to achieve, in this domain a good accuracy score can be deceiving. We can confirm this by looking at the test recall scores: they are all 0, which means the model failed to catch even a single True Positive. This is almost never what we want in the medical sphere. If the test isn't prohibitively expensive or risky for the patient, which in this case it isn't, it should aim for a high recall score and not a high accuracy score. That is why we'll judge our models primarily on the roc_auc score, which captures the relationship between the True Positive Rate (TPR, i.e. recall) and the False Positive Rate (FPR) of the model across thresholds. For this dummy model, the roc_auc score comes out to be 0.5, which means the model isn't able to distinguish between the prediction classes at any threshold, exactly as we would expect from a dummy model. We can also look at the ROC curve of the model.
# ROC curve of DummyClassifier
dummy_preds_prob = cross_val_predict(dummy_pipe, X_train, y_train, cv = cv, n_jobs = -1, method = "predict_proba")[:, 1]
RocCurveDisplay.from_predictions(y_train, dummy_preds_prob)
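As a small illustration of why accuracy alone is misleading here (a toy example, not drawn from the dataset), a predictor that always answers "no stroke" on a 95:5 imbalanced sample scores high accuracy while catching no strokes at all:
# Toy example: an "always no stroke" predictor on 95 negatives and 5 positives
toy_true = np.array([0] * 95 + [1] * 5)
toy_pred = np.zeros_like(toy_true)
print("Accuracy:", accuracy_score(toy_true, toy_pred))  # 0.95
print("Recall:  ", recall_score(toy_true, toy_pred))    # 0.0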
Model Selection
# Specify ML models
models = {"Logistic_Regression": LogisticRegression(random_state = 42),
"Ridge_Classification": RidgeClassifier(random_state = 42),
"SVC": SVC(random_state = 42),
"GaussianNB": GaussianNB(),
"KNClassifier": KNeighborsClassifier(n_neighbors = 5),
"RandomForestClassifier": RandomForestClassifier(max_depth = 10, n_jobs = -1, random_state = 42),
"XGBClassifier": XGBClassifier(n_estimators = 50, learning_rate = 0.03, n_jobs = -1,
objective = "binary:logistic", eval_metric = "auc", tree_method = "hist", random_state = 42)}
# Model comparison on multiple metrics
scores = {}
for model_name, model in models.items():
    model_pipe = make_pipeline(ct, sm, model)
    cross_val = cross_validate(model_pipe, X_train, y_train, cv = cv, scoring = ["accuracy", "recall", "precision", "roc_auc", "f1"])
    del cross_val["fit_time"]
    del cross_val["score_time"]
    print(model_name + " :\n")
    for score_name, score_vals in cross_val.items():
        score_mean = score_vals.mean()
        score_std = score_vals.std()
        cross_val[score_name] = score_mean
        print(f"{score_name}: Mean: {(score_mean * 100):.2f} % Std: {(score_std * 100):.2f}\n")
    print("\n", "-" * 50, "\n")
    scores[model_name] = cross_val
scores_df = pd.DataFrame.from_dict(scores, orient = 'index')
scores_df.sort_values(by = "test_roc_auc", ascending = False)
From the scores we get, the linear models, although lacking in accuracy, performed much better than the others on the recall, roc_auc, and f1 metrics. Their roc_auc scores above 0.80 combined with low precision scores suggest that they could do well with threshold tuning. The tree-based models, RandomForestClassifier and XGBClassifier, also achieved satisfactory performance with high accuracy and moderate roc_auc scores. KNeighborsClassifier and SVC performed the worst and may require some hyperparameter tuning. However, time would be better spent tuning the linear models for even better performance than tuning the other ML models.
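As a rough sketch of what threshold tuning could look like (illustrative only; the 0.8 recall target below is an assumed requirement, not something derived from the data), cross-validated probabilities can be compared against cut-offs other than the default 0.5:
# Illustrative sketch: tune the decision threshold of LogisticRegression using
# cross-validated probabilities instead of the default 0.5 cut-off.
from sklearn.metrics import precision_recall_curve
log_reg_pipe = make_pipeline(ct, sm, LogisticRegression(random_state = 42))
probs = cross_val_predict(log_reg_pipe, X_train, y_train, cv = cv, n_jobs = -1, method = "predict_proba")[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_train, probs)
# pick the highest threshold that still keeps recall at or above 0.8
idx = np.where(recalls[:-1] >= 0.8)[0][-1]
print(f"threshold: {thresholds[idx]:.2f}, precision: {precisions[idx]:.2f}, recall: {recalls[idx]:.2f}")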
Model Tuning
# Initialize models and specify param_grid for GridSearchCV
models_and_params = {"LogisticRegression": [LogisticRegression(random_state = 42),
{"logisticregression__class_weight": [{0:1, 1: 1}, {0:1, 1:3}]}],
"RidgeClassification": [RidgeClassifier(random_state = 42),
{"ridgeclassifier__alpha": [1, 2, 3], "ridgeclassifier__class_weight": [{0:1, 1: 1}, {0:1, 1:3}]}]}
# Run GridSearchCV and store its results
tuning_scores = []
for model_name in models_and_params:
    model, params = models_and_params[model_name]
    model_pipe = make_pipeline(ct, sm, model)
    grid_cv = GridSearchCV(model_pipe, params, scoring = ["accuracy", "recall", "precision", "roc_auc", "f1"], n_jobs = -1, cv = cv, refit = False)
    grid_cv.fit(X_train, y_train)
    tuning_scores.append(grid_cv)
# Results for LogisticRegression
logistic_regression_grid_result = pd.DataFrame.from_dict(tuning_scores[0].cv_results_)
logistic_regression_grid_result = logistic_regression_grid_result[[
"param_logisticregression__class_weight",
"mean_test_accuracy",
"mean_test_precision",
"mean_test_recall",
"mean_test_roc_auc",
"mean_test_f1"]]
logistic_regression_grid_result.sort_values(by = "mean_test_roc_auc", ascending = False)
# Results for RidgeClassifier
ridge_classifier_grid_result = pd.DataFrame.from_dict(tuning_scores[1].cv_results_)
ridge_classifier_grid_result = ridge_classifier_grid_result[[
"param_ridgeclassifier__alpha",
"param_ridgeclassifier__class_weight",
"mean_test_accuracy",
"mean_test_precision",
"mean_test_recall",
"mean_test_roc_auc",
"mean_test_f1"]]
ridge_classifier_grid_result.sort_values(by = "mean_test_roc_auc", ascending = False)
From this grid search, we have found that the logistic regression and ridge classifier models are really close in performance on the roc_auc score, which is our primary metric, and that increasing the alpha value in the ridge classifier does help improve the scores. Changing the class weights improves test recall a lot, but because of the trade-off between recall and precision, test precision takes a slight hit, along with a large hit to test accuracy. Now we have to make a call on which of test precision and test recall is more important to us. With roc_auc, the primary metric, still high, we choose to prefer better test precision and test accuracy.
From this, we have identified our final ML model: RidgeClassifier with the default class weights of 1:1 and an alpha value of 3. Now we will train this final model, make predictions on the test dataset, and then evaluate those predictions.
Final Model and Predictions
# Model Training
ridge = RidgeClassifier(random_state = 42, alpha = 3)
ridge_pipe = make_pipeline(ct, sm, ridge)
ridge_pipe.fit(X_train, y_train)
# Predictions on test dataset
preds = ridge_pipe.predict(X_test)
ConfusionMatrixDisplay.from_predictions(y_test, preds)
# Metrics for test predictions
RocCurveDisplay.from_estimator(ridge_pipe, X_test, y_test, name = "RidgeClassifier")
print("Accuracy Score:", accuracy_score(y_test, preds))
print("Precision Score:", precision_score(y_test, preds))
print("Recall Score:", recall_score(y_test, preds))
The classifier has predicted 723 True Negatives and 40 True Positives. There are also 259 misclassifications, of which 249 are False Positives. Considering that the prediction class was highly imbalanced to begin with, the model has done really well at making correct classifications. Its roc_auc score on the test dataset is 0.85, similar to what we found on the train dataset. With more data, we could improve this score further.
Summary and Conclusion
In this project, we used the Stroke Prediction Dataset available on Kaggle to predict whether a patient would suffer a stroke. First, we split the data into train and test sets using train_test_split. Then we explored the data and identified where it needed cleaning and preparation. With this knowledge, we built imblearn pipelines to prepare the data. We then explored multiple ML models and studied their performance through multiple metrics, focusing primarily on roc_auc scores. This choice of metric was made with the knowledge that in the medical domain, correctly catching True Positives is much more valuable than avoiding False Positives. We chose two linear models from this step for further tuning and selection. RidgeClassifier with the default class weights and an alpha value of 3 turned out to perform the best. We then trained this model and made predictions on the test data that we had set aside earlier. On these predictions, we achieved respectable scores of 0.8 on recall and 0.14 on precision.