Determining the edibility of wild mushrooms


Note: the following content can also be viewed as a Jupyter notebook, or in my GitHub repository.

Is it possible to tell whether a mushroom is edible or not, just by looking at its physical characteristics? We will explore this question using this dataset from the UCI Machine Learning Repository.

Dataset description:

This dataset contains 8124 entries corresponding to 23 species of gilled mushrooms from North America. Each species is identified as definitely edible (e), definitely poisonous (p), or of unknown edibility and not recommended (also labelled p). Each entry has 22 features describing the physical characteristics of the mushroom. The feature labels are explained in the file labels.txt. (Data source: The Audubon Society Field Guide to North American Mushrooms.)

Importing all libraries

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Dataset: loading and initial inspection

df = pd.read_csv('dataset.csv')
df.head()
|   | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |

5 rows × 23 columns

df.describe()
|        | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count  | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | ... | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | ... | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top    | e | x | y | n | f | n | f | c | b | b | ... | s | w | w | p | w | o | p | w | v | d |
| freq   | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | ... | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |

4 rows × 23 columns

We notice that the column veil-type has only 1 unique value - that is, all 8124 mushroom instances have the same veil-type.

It is therefore an irrelevant feature, so we proceed to remove it:

df.drop(['veil-type'], axis=1, inplace=True)
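
For reference, the same result can be obtained more generally by scanning for constant columns. This is a small sketch (run it instead of the drop above, on the freshly loaded DataFrame):

# For reference only: detect and drop *all* constant columns automatically
constant_cols = df.columns[df.nunique() == 1]   # would yield ['veil-type'] here
df = df.drop(columns=constant_cols)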

Converting categorical data to numerical

Most Machine Learning algorithms require numerical features. However, our dataset is composed of categorical features. We now proceed to convert these to numerical.

Label Encoding

A typical approach is to perform Label Encoding. This is nothing more than assigning a number to each category, that is:

(cat_a, cat_b, cat_c, etc.) → (0, 1, 2, etc.)

This technique works:

  • When the features are binary (only have 2 unique values)
  • When the features are ordinal categorical (that is, when the categories can be ranked). A good example would be a feature called t-shirt size with 3 unique values small, medium or large, which have an intrinsic order.
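
As a quick toy illustration (made-up values, not from our dataset), note that scikit-learn's LabelEncoder numbers the categories in sorted order, so a genuinely ordinal feature like t-shirt size would still need a manual mapping to preserve its ranking:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Categories are numbered alphabetically: large -> 0, medium -> 1, small -> 2,
# which does not match the natural small < medium < large order
print(le.fit_transform(['small', 'medium', 'large', 'small']))
# [2 1 0 2]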

However, in our case, only some of our features have 2 unique values (most of them have more), and none of them are ordinal categorical (in fact, they are nominal categorical, which means they have no intrinsic order).

Therefore, we will only apply Label Encoding to those features with a binary set of values:

# Label-encode only the binary columns (this includes our class label)
for col in df.columns:
    if df[col].nunique() == 2:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
df.head()
|   | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | x | s | n | 1 | p | 1 | 0 | 1 | k | ... | s | s | w | w | w | o | p | k | s | u |
| 1 | 0 | x | s | y | 1 | a | 1 | 0 | 0 | k | ... | s | s | w | w | w | o | p | n | n | g |
| 2 | 0 | b | s | w | 1 | l | 1 | 0 | 0 | n | ... | s | s | w | w | w | o | p | n | n | m |
| 3 | 1 | x | y | w | 1 | p | 1 | 0 | 1 | n | ... | s | s | w | w | w | o | p | k | s | u |
| 4 | 0 | x | s | g | 0 | n | 1 | 1 | 0 | k | ... | s | s | w | w | w | o | e | n | a | g |

5 rows × 22 columns

We can see that it has converted some of the features to values of 0 or 1. More importantly, our labels (the class column) are now 0=e and 1=p.
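
As a quick sanity check, the encoded class counts match the frequencies we saw in describe() (4208 edible out of 8124):

print(df['class'].value_counts())
# 0 (edible)       4208
# 1 (poisonous)    3916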

One Hot Encoding

For the remaining features, we can use a technique called One Hot Encoding.

Essentially, this consists of creating a new binary feature for each category. For instance, from the feature cap-surface, which has 4 unique values (f, g, y and s), we create 4 binary features (cap-surface_f, cap-surface_g, cap-surface_y and cap-surface_s) indicating whether the instance belongs to that category or not. This means that, for any given instance (row), exactly one of these 4 features will be equal to 1, and the other 3 equal to 0.
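
Here is a minimal toy example (a made-up column, not the real dataset) of what this looks like:

import pandas as pd

toy = pd.DataFrame({'cap-surface': ['f', 'g', 'y', 's']})
# One binary column per unique value (0/1 in older pandas, True/False in newer versions)
print(pd.get_dummies(toy))
#    cap-surface_f  cap-surface_g  cap-surface_s  cap-surface_y
# 0              1              0              0              0
# 1              0              1              0              0
# 2              0              0              0              1
# 3              0              0              1              0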

One Hot Encoding is really simple to perform with the pandas package:

df = pd.get_dummies(df)
df.head()
|   | class | bruises | gill-attachment | gill-spacing | gill-size | stalk-shape | cap-shape_b | cap-shape_c | cap-shape_f | cap-shape_k | ... | population_s | population_v | population_y | habitat_d | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u | habitat_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

5 rows × 112 columns

Separating labels from features

X will now contain our features, and y our labels (0 for edible and 1 for poisonous/unknown).

y = df['class'].to_frame()
X = df.drop('class', axis=1)
y.head()
|   | class |
|---|-------|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
X.head()
|   | bruises | gill-attachment | gill-spacing | gill-size | stalk-shape | cap-shape_b | cap-shape_c | cap-shape_f | cap-shape_k | cap-shape_s | ... | population_s | population_v | population_y | habitat_d | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u | habitat_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

5 rows × 111 columns

Standardising our features

It is generally considered good practice to standardise our features (transform them so that each has zero mean and unit variance). Most of the time the difference will be small, but it rarely hurts to do so. (Strictly speaking, the scaler should be fitted on the training set only, so that no information from the test set leaks into it; for this dataset the effect is negligible.)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
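
A quick (optional) check confirms the transformation worked: every column of X_scaled should now have mean ~0 and standard deviation ~1.

print(np.allclose(X_scaled.mean(axis=0), 0))  # True: all columns centred
print(np.allclose(X_scaled.std(axis=0), 1))   # True: all columns unit variance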

Creating training and test sets

We will separate our data into a training set (70%) and a test set (30%). This is a very standard approach in Machine Learning.

The stratify option ensures that the ratio of edible to poisonous mushrooms in our dataset remains the same in both training and test sets. The random_state parameter is simply a seed for the algorithm to use (if we didn't specify one, it would create different training and test sets every time we run it).

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, stratify=y, random_state=19)
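
We can verify the stratification by comparing the fraction of poisonous mushrooms across the splits; all three should be about 0.482 (3916 poisonous out of 8124):

for name, labels in [('full', y), ('train', y_train), ('test', y_test)]:
    # Mean of the 0/1 class column = fraction of poisonous mushrooms
    print(name, labels['class'].mean().round(3))
# full 0.482, train 0.482, test 0.482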

Logistic Regression

Since this is now a supervised learning binary classification problem, it makes perfect sense to start by running a simple logistic regression.

A logistic regression simply predicts the probability of an instance (row) belonging to the positive class, which can then be thresholded into a 0 or 1 classification. Off we go.

logreg = LogisticRegression()
logreg.fit(X_train, y_train.values.ravel())
y_pred_test = logreg.predict(X_test)
print('Accuracy of Logistic Regression classifier on the test set: {:.2f}'.format(accuracy_score(y_test, y_pred_test)))
Accuracy of Logistic Regression classifier on the test set: 1.00
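
Under the hood, predict simply thresholds the predicted probability at 0.5, which we can reproduce with predict_proba (a quick check, not required for the analysis):

proba = logreg.predict_proba(X_test)[:, 1]   # P(poisonous) for each test row
manual_pred = (proba > 0.5).astype(int)      # threshold at 0.5
print((manual_pred == y_pred_test).all())    # True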

It seems like the logistic regression achieved the maximum accuracy possible: 100%.

I have to admit that this made me go back and check my code and logical reasoning a couple of times. But no, it simply means that the given features are a really good indicator of the edibility of mushrooms.

Still, we should run the logistic regression again, but this time using cross-validation, to ensure that we are not overfitting the data. Ten stratified random splits (via StratifiedShuffleSplit) should do.

scores = cross_val_score(logreg, X_train, y_train.values.ravel(), cv=StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=19), scoring='accuracy')
print('Accuracy of Logistic Regression classifier using cross-validation: {}'.format(scores.mean()))
Accuracy of Logistic Regression classifier using cross-validation: 0.9997655334114889

This time it doesn’t achieve the perfect score, but it’s pretty damn close.

Thus, it seems like the relationship between the features and the edibility of the mushrooms is essentially linear (the two classes are almost linearly separable in this feature space), so there is little point in trying models more complex than this logistic regression.

What we can do is investigate which features are the most important in deciding whether a mushroom is edible or not.

Most relevant features

features_coeffs = pd.DataFrame(logreg.coef_, columns=X.columns, index=['coefficients'])
features_coeffs.sort_values('coefficients', axis=1, ascending=False, inplace=True)
features_coeffs.T.head()
|                     | coefficients |
|---------------------|--------------|
| odor_p              | 1.301698 |
| odor_c              | 1.248420 |
| odor_f              | 1.215397 |
| spore-print-color_r | 1.186042 |
| spore-print-color_h | 1.101121 |
features_coeffs.T.tail()
|                     | coefficients |
|---------------------|--------------|
| gill-spacing        | -0.777519 |
| odor_a              | -0.783296 |
| spore-print-color_n | -0.819762 |
| odor_l              | -0.827127 |
| odor_n              | -1.719140 |
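
If we want a single ranking rather than the two extremes, we can also sort the coefficients by absolute magnitude (a small variation on the above):

abs_coeffs = features_coeffs.T['coefficients'].abs().sort_values(ascending=False)
print(abs_coeffs.head())  # odor_n, odor_p, odor_c, odor_f, spore-print-color_r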

Interesting. Seems like odor and spore-print-color play an important role in deciding whether a mushroom is edible or not. Let’s confirm this:

def plot_features_containing(feature_name):
    # Select all one-hot columns derived from the given feature
    categories = X.columns[X.columns.str.contains(feature_name)]
    edible_num = []
    poisonous_num = []
    for cat in categories:
        # Count edible (class 0) and poisonous (class 1) mushrooms in this category
        edible_count = ((X[cat] == 1) & (y['class'] == 0)).sum()
        poisonous_count = (X[cat] == 1).sum() - edible_count
        edible_num.append(edible_count)
        poisonous_num.append(poisonous_count)
    counts_df = pd.DataFrame({'edible': edible_num, 'poisonous': poisonous_num},
                             index=categories)
    counts_df.plot(kind='bar')  # the index (category names) becomes the x-axis

plot_features_containing('odor')

Mushrooms by odor - graph

odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s

Very interesting! It seems like, at least in our dataset:

  • All mushrooms with an almond or anise odor are edible
  • All mushrooms with a creosote, fishy, musty, pungent or spicy odor are poisonous (or unknown edibility)
  • Most mushrooms with no odor are edible. But not all of them!

Of course, this is just what our dataset tells us. It doesn't necessarily mean that any new mushroom we find out there will obey these rules.
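
We can confirm these counts numerically with a crosstab on the raw (pre-encoding) data; here I reload the original CSV just for this check:

raw = pd.read_csv('dataset.csv')
# Rows = odor codes, columns = class (e/p), cells = mushroom counts
print(pd.crosstab(raw['odor'], raw['class']))
# e.g. rows 'a' and 'l' should have zero poisonous counts,
# while 'n' is mostly (but not entirely) edible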

plot_features_containing('spore-print-color')

Mushrooms by spore print color - graph

For spore-print-color we have quite a similar picture, although perhaps not as extreme as with odor. This is what we expected, since these are the 2 features with the largest coefficients (in absolute value) in our logistic regression.

In fact, if we do the same for a feature other than these two, the distribution will probably be much more balanced.

Let’s check this.

plot_features_containing('cap-color')

Mushrooms by cap color - graph

Indeed, we see a much more balanced distribution, which suggests that cap-color does not play such an important role in determining the edibility of a mushroom.

Conclusion

  • We fitted a logistic regression model and achieved near perfect accuracy, so there was no need to try with more complex models.

  • Our algorithm identified specific traits (particularly regarding odor) that seem to heavily influence the chance that a mushroom is edible or not.

  • Even though experts have determined that there is no simple set of rules to determine whether a mushroom is edible or not, it seems like with this algorithm we can get pretty close.

Nevertheless, it is important to keep in mind that these results apply only to this dataset; they don't necessarily generalise to every mushroom out there.

So, if you’re ever lost and stranded in a forest, don’t attempt to eat anything just because a machine tells you to do so! Stay safe out there.