EDA and Prediction of Mushroom Edibility using Select ML Algorithms
- Introduction
- Setup
- EDA and Data Preparation
- Train Hardcoded Model and Evaluate Results
- Model Selection
- Train Final Model and Make Predictions
- Summary and Conclusion
Today, we'll work on a classification problem. The dataset we have chosen is the mushroom-classification dataset available on Kaggle, originally donated to the UCI Machine Learning Repository nearly three decades ago. It includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; in this dataset the last group is combined with the poisonous class, so the task is a binary classification problem.
Our task of making successful predictions begins with setting up the environment for training.
# Import the required libraries and get the file path
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV # validation
from sklearn.preprocessing import OneHotEncoder, LabelEncoder # data preparation
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, precision_score, recall_score, accuracy_score, f1_score # metrics
from sklearn.pipeline import make_pipeline # build pipeline
# ML models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# ignore warnings
import warnings
warnings.filterwarnings("ignore")
# get file path
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
file_dir = "/kaggle/input/mushroom-classification/mushrooms.csv"
# inspect file size
!ls -lh {file_dir}
# inspect first 5 rows of dataset
!head {file_dir}
Observations
- The dataset file size is 366 KB.
- At this size, it is safe to load the whole dataset into memory.
- The prediction class is the first column.
- There appears to be no index column.
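As an optional sanity check before loading, we can also count the number of rows in the file (one extra shell command, in the same style as the cells above):
# optional: count rows in the file, including the header line
!wc -l {file_dir}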
# read file
df = pd.read_csv(file_dir)
# view all columns
pd.set_option("display.max_columns", None)
df.head()
# split datasets for training and testing
X = df.copy()
y = X.pop("class")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42) # fixed seed for reproducibility
X_train.shape
X_train.info()
X_train.describe()
Observations
- The dataset has 22 features.
- The training split has 6499 entries.
- All features are categorical in nature.
- Most features have fewer than 10 unique values.
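To back this up, a quick count of unique values per feature in the training split could look like this:
# number of unique values in each feature, largest first
X_train.nunique().sort_values(ascending = False)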
class_vc = df["class"].value_counts()
class_vc
sns.barplot(x = class_vc.index, y = class_vc)
Observations
- The class counts are only mildly imbalanced.
for i, cols in enumerate(df):
    feature_vc = df[cols].value_counts()
    print(feature_vc, "\n_________\n")
    plt.figure(i)
    sns.barplot(x = feature_vc.index, y = feature_vc)
Observations
- All the features are categorical in nature.
- All category labels are anonymized single-letter codes.
- We'll need to OneHotEncode the data.
- Because no feature has an excessive number of categories, we can safely OneHotEncode every feature (see the sketch after this list).
- The feature veil-type has only one category, so it will effectively need to be dropped.
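A programmatic check of both points (no feature has too many categories, and only veil-type is constant) might look like this:
# number of categories per feature
n_categories = X.nunique()
print(n_categories.max())                    # largest number of categories in any single feature
print(n_categories[n_categories == 1].index) # features with a single category (veil-type)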
for i, feature in enumerate(X_train):
    plt.figure(i)
    sns.countplot(x = feature, hue = y_train, data = X_train)
Observations
- Almost all of the features differ between and help distinguish the two target classes.
- Their distributions are clearly different for the two target classes.
This suggests that models may achieve very high scores when classifying the two target classes.
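For a concrete example of this separation, we can cross-tabulate a single feature against the target (here the odor column, one of the 22 features):
# distribution of one feature's categories across the two target classes
pd.crosstab(df["odor"], df["class"])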
df.isnull().sum()
Observations
- There are no null values in the dataset.
- No imputation or missing-value handling is needed.
# OneHotEncode categorical features; drop = "first" removes one redundant column per feature
# and leaves zero columns for the single-category veil-type feature, effectively dropping it
ohe = OneHotEncoder(drop = "first", handle_unknown = "ignore", sparse = False)
ohe_train_data = ohe.fit_transform(X_train)
ohe_test_data = ohe.transform(X_test)
ohe_train_data
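If we want to inspect which original feature each encoded column comes from, recent scikit-learn versions can report the generated names (older versions expose get_feature_names instead):
# names of the one-hot encoded columns produced from the original features
ohe.get_feature_names_out()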
# Encode the target in train dataset
le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
# Encode target in test dataset
y_test_le = le.transform(y_test)
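It is worth confirming which integer was assigned to each label; classes_ lists the original labels in encoded order (index 0, index 1):
# the original labels in the order of their encoded integer values
le.classes_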
So, we can now move on to training ML models.
mf_model = DummyClassifier(random_state = 42, strategy = "most_frequent")
mf_cross_val = cross_validate(mf_model, ohe_train_data, y_train_le, scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"])
mf_cross_val
Observations
- The test roc_auc score is 0.5 in every fold, which means the model cannot distinguish between the two target classes at all.
- This model also gives a test accuracy of about 0.52, which simply mirrors the majority class share (confirmed below).
We'll need to at least improve on these scores.
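The baseline accuracy can be confirmed from the class proportions in the training target:
# share of each class in the encoded training target; the larger value is the most-frequent baseline accuracy
pd.Series(y_train_le).value_counts(normalize = True)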
models = {"LogisticRegression": LogisticRegression(random_state = 42),
"RidgeClassification": RidgeClassifier(random_state = 42),
"GaussianNB": GaussianNB(),
"RandomForestClassifier": RandomForestClassifier(n_estimators = 70, random_state = 42),
"XGBClassifier": XGBClassifier(n_estimators = 70, objective = "binary:logistic", learning_rate = 0.05, n_jobs = -1, scoring = "auc", random_state = 42)}
model_scores = {}
# cross validate all models
for model_name, model in models.items():
    cross_val = cross_validate(model, ohe_train_data, y_train_le, n_jobs = -1, scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"])
    del cross_val["fit_time"]
    del cross_val["score_time"]
    model_scores[model_name] = cross_val
# put results into a dataframe
model_scores_df = pd.DataFrame.from_dict(model_scores)
model_scores_df = model_scores_df.applymap(np.mean)
model_scores_df
Observations
- Surprisingly, all of the models performed extremely well, with near-perfect cross-validation scores.
- Out of these, RandomForestClassifier performed best, with a perfect score of 1.0 on every metric.
Thus, we'll train the RandomForestClassifier to make the final predictions.
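To make the comparison easier to scan, the score table can also be transposed and sorted (an optional extra step):
# one row per model, sorted by mean cross-validated ROC AUC
model_scores_df.T.sort_values("test_roc_auc", ascending = False)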
# train model
forest_clas = RandomForestClassifier(random_state = 42)
forest_clas.fit(ohe_train_data, y_train_le)
# make predictions
preds = forest_clas.predict(ohe_test_data)
# decode predictions into their original labels
preds_in = le.inverse_transform(preds)
preds_in
# plot results
ConfusionMatrixDisplay.from_predictions(y_test, preds_in)
accuracy_score(y_test, preds_in)
recall_score(y_test_le, preds)
f1_score(y_test_le, preds)
RocCurveDisplay.from_estimator(forest_clas, ohe_test_data, y_test_le)
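precision_score was imported earlier but not used above; for completeness it can be computed on the same predictions:
# precision of the final model on the test set
precision_score(y_test_le, preds)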
Observations
- As seen in the model selection phase, RandomForestClassifier achieved perfect accuracy in classification.
- This gave it perfect scores in other metrics as well such as roc_auc, recall and f1 score.
In this project we worked on a classification problem: classifying mushrooms by edibility into either edible or poisonous. The dataset had 22 categorical features available for making predictions. As observed in the EDA section, the features had clearly different distributions across the two target classes, which made the ML training phase easier. As a result, RandomForestClassifier achieved perfect scores in the model selection phase and, as expected, it achieved perfect scores on the held-out test set as well.
Thanks for reading.