Introduction

Today, we'll work on a classification problem. The dataset we have chosen is the mushroom-classification dataset available on Kaggle, originally provided by the UCI Machine Learning Repository nearly three decades ago. The dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended.

Our task of making successful predictions begins with setting up the environment for training.

Setup

# Import the required libraries and get the file path

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV # validation
from sklearn.preprocessing import OneHotEncoder, LabelEncoder # data preparation
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, precision_score, recall_score, accuracy_score, f1_score # metrics
from sklearn.pipeline import make_pipeline # build pipeline

# ML models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# get file path
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/mushroom-classification/mushrooms.csv
file_dir = "/kaggle/input/mushroom-classification/mushrooms.csv"
# inspect file size
!ls -lh {file_dir}
-rw-r--r-- 1 nobody nogroup 366K Nov 21 08:43 /kaggle/input/mushroom-classification/mushrooms.csv
# inspect first 5 rows of dataset
!head {file_dir}
class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e,b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e,b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g

Observations

  • The dataset file size is 366 KB.
  • It is safe to load the whole dataset into memory.
  • The prediction class is the first column.
  • There appears to be no index column.
# read file
df = pd.read_csv(file_dir)

# view all columns
pd.set_option("display.max_columns", None)

df.head()
class cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color stalk-shape stalk-root stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat
0 p x s n t p f c n k e e s s w w p w o p k s u
1 e x s y t a f c b k e c s s w w p w o p n n g
2 e b s w t l f c b n e c s s w w p w o p n n m
3 p x y w t p f c n n e e s s w w p w o p k s u
4 e x s g f n f w b k t e s s w w p w o e n a g

EDA and Data Preparation

# split datasets for training and testing
X = df.copy()
y = X.pop("class")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
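
The split above is random on each run. If reproducibility matters, or you want the class ratio preserved exactly in both splits, a stratified, seeded variant looks like this (a minimal sketch, not the split used for the outputs below; the random_state value is arbitrary):

# alternative: reproducible, stratified split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 42)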

Preliminary Analysis

X_train.shape
(6499, 22)
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6499 entries, 3777 to 767
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   cap-shape                 6499 non-null   object
 1   cap-surface               6499 non-null   object
 2   cap-color                 6499 non-null   object
 3   bruises                   6499 non-null   object
 4   odor                      6499 non-null   object
 5   gill-attachment           6499 non-null   object
 6   gill-spacing              6499 non-null   object
 7   gill-size                 6499 non-null   object
 8   gill-color                6499 non-null   object
 9   stalk-shape               6499 non-null   object
 10  stalk-root                6499 non-null   object
 11  stalk-surface-above-ring  6499 non-null   object
 12  stalk-surface-below-ring  6499 non-null   object
 13  stalk-color-above-ring    6499 non-null   object
 14  stalk-color-below-ring    6499 non-null   object
 15  veil-type                 6499 non-null   object
 16  veil-color                6499 non-null   object
 17  ring-number               6499 non-null   object
 18  ring-type                 6499 non-null   object
 19  spore-print-color         6499 non-null   object
 20  population                6499 non-null   object
 21  habitat                   6499 non-null   object
dtypes: object(22)
memory usage: 1.1+ MB
X_train.describe()
cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color stalk-shape stalk-root stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat
count 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499 6499
unique 6 4 10 2 9 2 2 2 12 2 5 4 4 9 9 1 4 3 5 9 6 7
top x y n f n f c b b t b s s w w p w o p w v d
freq 2970 2583 1817 3780 2822 6333 5448 4511 1360 3675 3036 4148 3969 3559 3523 6499 6343 5989 3191 1884 3224 2530

Observations

  • The dataset has 22 features.
  • There are 6499 entries.
  • All features are categorical in nature.
  • Most features have fewer than 10 unique values (verified below).
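
To verify the cardinality claim directly, a quick one-liner sketch:

# number of unique values per feature, highest first
X_train.nunique().sort_values(ascending = False)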

Target Class

class_vc = df["class"].value_counts()
class_vc
e    4208
p    3916
Name: class, dtype: int64
sns.barplot(x = class_vc.index, y = class_vc)
<AxesSubplot:ylabel='class'>

Observations

  • The class counts are only mildly imbalanced (quantified below).
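
To quantify the balance, the class proportions can be computed directly from the value counts; a short sketch:

# share of each class: roughly 52% edible vs 48% poisonous
class_vc / class_vc.sum()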

Distribution of Features

# value counts and a bar plot for every column (including the target)
for i, col in enumerate(df):
    feature_vc = df[col].value_counts()
    print(feature_vc, "\n_________\n")

    plt.figure(i)
    sns.barplot(x = feature_vc.index, y = feature_vc)
e    4208
p    3916
Name: class, dtype: int64 
_________

x    3656
f    3152
k     828
b     452
s      32
c       4
Name: cap-shape, dtype: int64 
_________

y    3244
s    2556
f    2320
g       4
Name: cap-surface, dtype: int64 
_________

n    2284
g    1840
e    1500
y    1072
w    1040
b     168
p     144
c      44
u      16
r      16
Name: cap-color, dtype: int64 
_________

f    4748
t    3376
Name: bruises, dtype: int64 
_________

n    3528
f    2160
y     576
s     576
a     400
l     400
p     256
c     192
m      36
Name: odor, dtype: int64 
_________

f    7914
a     210
Name: gill-attachment, dtype: int64 
_________

c    6812
w    1312
Name: gill-spacing, dtype: int64 
_________

b    5612
n    2512
Name: gill-size, dtype: int64 
_________

b    1728
p    1492
w    1202
n    1048
g     752
h     732
u     492
k     408
e      96
y      86
o      64
r      24
Name: gill-color, dtype: int64 
_________

t    4608
e    3516
Name: stalk-shape, dtype: int64 
_________

b    3776
?    2480
e    1120
c     556
r     192
Name: stalk-root, dtype: int64 
_________

s    5176
k    2372
f     552
y      24
Name: stalk-surface-above-ring, dtype: int64 
_________

s    4936
k    2304
f     600
y     284
Name: stalk-surface-below-ring, dtype: int64 
_________

w    4464
p    1872
g     576
n     448
b     432
o     192
e      96
c      36
y       8
Name: stalk-color-above-ring, dtype: int64 
_________

w    4384
p    1872
g     576
n     512
b     432
o     192
e      96
c      36
y      24
Name: stalk-color-below-ring, dtype: int64 
_________

p    8124
Name: veil-type, dtype: int64 
_________

w    7924
n      96
o      96
y       8
Name: veil-color, dtype: int64 
_________

o    7488
t     600
n      36
Name: ring-number, dtype: int64 
_________

p    3968
e    2776
l    1296
f      48
n      36
Name: ring-type, dtype: int64 
_________

w    2388
n    1968
k    1872
h    1632
r      72
u      48
o      48
y      48
b      48
Name: spore-print-color, dtype: int64 
_________

v    4040
y    1712
s    1248
n     400
a     384
c     340
Name: population, dtype: int64 
_________

d    3148
g    2148
p    1144
l     832
u     368
m     292
w     192
Name: habitat, dtype: int64 
_________

Observations

  • All the features are categorical in nature.
  • All category values are anonymized single-letter codes.
  • We'll need to OneHotEncode the data.
  • Since no feature has an excessively large number of categories, we can safely OneHotEncode every feature.
  • Feature veil-type has only one class, so it carries no information and will need to be dropped (see the sketch below).
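
A minimal sketch of dropping constant features explicitly (not applied here; with drop = "first", the encoder used below reduces a single-category feature to zero columns on its own anyway):

# drop any feature with only one observed category, e.g. veil-type
constant_cols = [col for col in X_train.columns if X_train[col].nunique() == 1]
X_train = X_train.drop(columns = constant_cols)
X_test = X_test.drop(columns = constant_cols)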

Feature Distribution against Target Class

for i, feature in enumerate(X_train):
    plt.figure(i)
    sns.countplot(x = feature, hue = y_train, data = X_train)

Observations

  • Almost all the features distinguish between the two target classes.
  • Their distributions differ noticeably between the target classes.

This suggests that even simple models may achieve very high scores in classifying the two target classes.
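
For example, a single feature already separates the classes to a large degree; a quick cross-tabulation sketch against odor:

# how each odor category splits between edible (e) and poisonous (p)
pd.crosstab(X_train["odor"], y_train)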

Null Values

Now, we'll look at the number of null values in the data.

df.isnull().sum()
class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

Observations

  • There are no null values in the dataset.
  • No imputation or transformation is needed, though note the caveat below.
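
One caveat: isnull() only catches true NaN values. As the value counts above showed, the stalk-root feature uses the literal string "?" for 2480 rows, which is effectively a missing-value marker. A quick check (sketch):

# count literal "?" entries per column
(df == "?").sum().sort_values(ascending = False).head()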

Data Preparation

There isn't much to prepare in this data. We just need to OneHotEncode all the categorical features, and we'll also LabelEncode the target class.

# OneHotEncoding Categorical features
ohe = OneHotEncoder(drop = "first", handle_unknown = "ignore", sparse = False)
ohe_train_data = ohe.fit_transform(X_train)
ohe_test_data = ohe.transform(X_test)
ohe_train_data
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.]])
# Encode the target in train dataset
le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
# Encode target in test dataset
y_test_le = le.transform(y_test)

So, we can now move on to training ML models.
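
As an aside, the encoding and modeling steps could also be chained with make_pipeline (imported above), which keeps the per-fold fit/transform bookkeeping inside cross-validation automatically. A minimal sketch, not used in the runs below:

# hypothetical pipeline: encoder + classifier fitted as one estimator
pipe = make_pipeline(OneHotEncoder(drop = "first", handle_unknown = "ignore", sparse = False),
                     LogisticRegression(random_state = 42))
# pipe.fit(X_train, y_train_le) would then encode and train in one call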

Train Hardcoded Model and Evaluate Results

First, we'll train a hardcoded model that always predicts the most frequent target class, which is 'edible'. This gives us a baseline score that our future models should at least beat, and it helps surface errors in the training setup.

mf_model = DummyClassifier(random_state = 42, strategy = "most_frequent")
mf_cross_val = cross_validate(mf_model, ohe_train_data, y_train_le, scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"])
mf_cross_val
{'fit_time': array([0.00430179, 0.00245929, 0.00171161, 0.0016439 , 0.0016706 ]),
 'score_time': array([0.00662088, 0.00684905, 0.00624609, 0.00612545, 0.00620723]),
 'test_accuracy': array([0.51769231, 0.51769231, 0.51692308, 0.51692308, 0.51732102]),
 'test_precision': array([0., 0., 0., 0., 0.]),
 'test_recall': array([0., 0., 0., 0., 0.]),
 'test_f1': array([0., 0., 0., 0., 0.]),
 'test_roc_auc': array([0.5, 0.5, 0.5, 0.5, 0.5])}

Observations

  • The test roc_auc score is 0.5 in every fold, which means the model cannot distinguish between the two target classes at all.
  • It also gives a test accuracy of about 0.52, which is simply the share of the majority class.

We'll need to at least improve on these scores.

Model Selection

In this section, we'll train multiple ML models and compare their cross-validation performance. Then we'll choose the best-performing one for final training.

models = {"LogisticRegression": LogisticRegression(random_state = 42),
         "RidgeClassification": RidgeClassifier(random_state = 42),
         "GaussianNB": GaussianNB(),
         "RandomForestClassifier": RandomForestClassifier(n_estimators = 70, random_state = 42),
         "XGBClassifier": XGBClassifier(n_estimators = 70, objective = "binary:logistic", learning_rate = 0.05, n_jobs = -1, scoring = "auc", random_state = 42)}
model_scores = {}

# cross validate all models
for model_name, model in models.items():
    
    cross_val = cross_validate(model, ohe_train_data, y_train_le, n_jobs = -1, scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"])
    del cross_val["fit_time"]
    del cross_val["score_time"]
    
    model_scores[model_name] = cross_val
# put results into a dataframe
model_scores_df = pd.DataFrame.from_dict(model_scores)
model_scores_df = model_scores_df.applymap(np.mean)
model_scores_df
LogisticRegression RidgeClassification GaussianNB RandomForestClassifier XGBClassifier
test_accuracy 0.999538 0.999538 0.947377 1.0 1.0
test_precision 1.000000 1.000000 0.902635 1.0 1.0
test_recall 0.999044 0.999044 0.999044 1.0 1.0
test_f1 0.999521 0.999521 0.948331 1.0 1.0
test_roc_auc 1.000000 1.000000 0.996180 1.0 1.0

Observations

  • Surprisingly, all of the models performed extremely well, with near-perfect cross-validation scores.
  • RandomForestClassifier and XGBClassifier both achieved a perfect 1.0 on every metric.

Of the two, we'll train the RandomForestClassifier to make final predictions.

Train Final Model and make predictions

# train model
forest_clas = RandomForestClassifier(random_state = 42)

forest_clas.fit(ohe_train_data, y_train_le)
RandomForestClassifier(random_state=42)
# make predictions
preds = forest_clas.predict(ohe_test_data)

# decode predictions into their original labels
preds_in = le.inverse_transform(preds) 
preds_in
array(['e', 'p', 'e', ..., 'e', 'p', 'p'], dtype=object)
# plot results
ConfusionMatrixDisplay.from_predictions(y_test, preds_in)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fae14daf950>
accuracy_score(y_test, preds_in)
1.0
recall_score(y_test_le, preds)
1.0
f1_score(y_test_le, preds)
1.0
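Precision (imported earlier) can be checked the same way; given the perfect confusion matrix above, it must also be 1.0:
# precision on the encoded test labels
precision_score(y_test_le, preds)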
RocCurveDisplay.from_estimator(forest_clas, ohe_test_data, y_test_le)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7fae14cd1510>

Observations

  • As seen in the model selection phase, RandomForestClassifier achieved perfect accuracy in classification.
  • This gave it perfect scores in other metrics as well such as roc_auc, recall and f1 score.
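
Perfect scores invite a sanity check. One quick probe is to see which encoded features the forest actually leans on; a sketch (get_feature_names_out requires a reasonably recent scikit-learn; older versions expose get_feature_names instead):

# rank encoded feature columns by importance in the trained forest
importances = pd.Series(forest_clas.feature_importances_,
                        index = ohe.get_feature_names_out(X_train.columns))
print(importances.sort_values(ascending = False).head(10))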

Summary and Conclusion

In this project, we worked on a classification problem: classifying mushrooms by their edibility into either edible or poisonous. The dataset had 22 categorical features to use for making predictions. As observed in the EDA sections, the features had clearly different distributions across the target classes, which made the ML training phase easier. Accordingly, we achieved perfect scores in the model selection phase with RandomForestClassifier, and, as expected, RandomForestClassifier achieved perfect scores in the final training section too.

Thanks for reading.