Determining the edibility of wild mushrooms


Note: the following content can also be viewed as a Jupyter notebook, or in my GitHub repository.

Is it possible to tell whether a mushroom is edible or not, just by looking at its physical characteristics? We will explore this question using this dataset from the UCI Machine Learning Repository.

Dataset description:

This dataset contains 8124 entries corresponding to 23 species of gilled mushrooms from North America. Each species is identified as definitely edible (e), definitely poisonous (p), or of unknown edibility and not recommended (also labelled p). Each entry has 22 features related to the physical characteristics of the mushroom. The feature labels are explained in the file labels.txt. (Data source: The Audubon Society Field Guide to North American Mushrooms).

Importing all libraries

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Dataset: loading and initial inspection

df = pd.read_csv('dataset.csv')

DataFrame preview - table (5 rows × 23 columns)

Summary statistics - table (4 rows × 23 columns)

We notice that the column veil-type has only 1 unique value - that is, all 8124 mushroom instances have the same veil-type.
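Constant columns like this are easy to spot by counting the unique values in each column. Here is a minimal sketch on a hypothetical toy frame (the real check runs on df):

```python
import pandas as pd

# Hypothetical stand-in for the mushroom DataFrame
toy = pd.DataFrame({
    'cap-shape': ['x', 'b', 'x', 'f'],
    'veil-type': ['p', 'p', 'p', 'p'],  # constant column
})

# nunique() counts distinct values per column; 1 means the column is constant
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()
print(constant_cols)  # ['veil-type']
```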

It is thus an irrelevant feature, so we proceed to remove it:

df.drop(['veil-type'], axis=1, inplace=True)

Converting categorical data to numerical

Most Machine Learning algorithms require numerical features. However, our dataset is composed of categorical features. We now proceed to convert these to numerical.

Label Encoding

A typical approach is to perform Label Encoding. This is nothing more than just assigning a number to each category, that is:

(cat_a, cat_b, cat_c, etc.) → (0, 1, 2, etc.)

This technique works:

  • When the features are binary (only have 2 unique values)
  • When the features are ordinal categorical (that is, when the categories can be ranked). A good example would be a feature called t-shirt size with 3 unique values small, medium or large, which have an intrinsic order.

However, in our case, only some of our features have 2 unique values (most of them have more), and none of them are ordinal categorical (in fact, they are nominal categorical, which means their categories have no intrinsic order).
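As a quick toy illustration of what LabelEncoder does (note it assigns codes in alphabetical order of the categories, which is why e ends up as 0 and p as 1):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted alphabetically, so 'e' -> 0 and 'p' -> 1
encoded = le.fit_transform(['p', 'e', 'p', 'e'])
print(list(le.classes_))  # ['e', 'p']
print(list(encoded))      # [1, 0, 1, 0]
```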

Therefore, we will only apply Label Encoding to those features with a binary set of values:

for col in df.columns:
    if len(df[col].value_counts()) == 2:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])

5 rows × 22 columns

We can see how it has converted some of the features to values of 0 or 1. More importantly, our labels (the class column) are now 0=e, and 1=p.

One Hot Encoding

For the remaining features, we can use a technique called One Hot Encoding.

Essentially, this consists of creating a new binary feature for each category. For instance, from the feature cap-surface, which has 4 unique values (f, g, y and s), we create 4 binary features (cap-surface_f, cap-surface_g, cap-surface_y and cap-surface_s), each indicating whether the instance belongs to that category or not. This means that, for any given instance (row), exactly one of these 4 features will equal 1, and the other 3 will equal 0.
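This exactly-one-hot property is easy to verify on a hypothetical mini-frame (a stand-in for the real cap-surface column, using only 3 of its values):

```python
import pandas as pd

toy = pd.DataFrame({'cap-surface': ['f', 'y', 's', 'f']})
dummies = pd.get_dummies(toy)

# One new binary column per observed category
print(dummies.columns.tolist())
# ['cap-surface_f', 'cap-surface_s', 'cap-surface_y']

# Each row activates exactly one of the new columns
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1]
```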

One Hot Encoding is really simple to perform with the pandas package:

df = pd.get_dummies(df)

5 rows × 112 columns

Separating labels from features

X will now contain our features, and y our labels (0 for edible and 1 for poisonous/unknown).

y = df['class'].to_frame()
X = df.drop('class', axis=1)

5 rows × 111 columns

Standardising our features

It is generally considered good practice to standardise our features (convert them to have zero mean and unit variance). Most of the time the difference will be small but, in any case, it never hurts to do so.

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
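As a sanity check on what StandardScaler does, each column of a toy array ends up with (approximately) zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
toy_scaled = StandardScaler().fit_transform(toy_X)

# Column means are ~0 and column standard deviations are ~1
print(np.round(toy_scaled.mean(axis=0), 6))  # [0. 0.]
print(np.round(toy_scaled.std(axis=0), 6))   # [1. 1.]
```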

Creating training and test sets

We will separate our data into a training set (70%) and a test set (30%). This is a very standard approach in Machine Learning.

The stratify option ensures that the ratio of edible to poisonous mushrooms in our dataset remains the same in both training and test sets. The random_state parameter is simply a seed for the algorithm to use (if we didn't specify one, it would create different training and test sets every time we ran it).

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, stratify=y, random_state=19)
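To see stratification in action, here is a toy check on synthetic labels (not the mushroom data): with a 60/40 class mix, both resulting sets keep exactly that ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 40 zeros and 60 ones
y_toy = np.array([0] * 40 + [1] * 60)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=19)

# The fraction of ones is preserved in both sets
print(y_tr.mean(), y_te.mean())  # 0.6 0.6
```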

Logistic Regression

Since this is now a supervised learning binary classification problem, it makes perfect sense to start by running a simple logistic regression.

A logistic regression simply predicts the probability of an instance (row) belonging to the positive class (here, poisonous), which can then be snapped to a 0 or 1 classification. Off we go.
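That snapping step is just a 0.5 threshold on the sigmoid output. A minimal sketch, using hypothetical linear scores z rather than anything fitted to our data:

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])   # hypothetical linear scores
probs = sigmoid(z)
preds = (probs >= 0.5).astype(int)
print(preds)  # [0 1 1]
```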

logreg = LogisticRegression()
logreg.fit(X_train, y_train.values.ravel())
y_pred_test = logreg.predict(X_test)
print('Accuracy of Logistic Regression classifier on the test set: {:.2f}'.format(accuracy_score(y_test, y_pred_test)))
Accuracy of Logistic Regression classifier on the test set: 1.00

It seems like the logistic regression achieved the maximum accuracy possible: 100%.

I have to admit that this made me go back and check my code and logical reasoning a couple of times. But no, it simply means that the given features are a really good indicator of the edibility of mushrooms.

Still, we should run the logistic regression again, but this time using cross-validation, to ensure that we are not overfitting the data. Ten stratified shuffle splits, each holding out 30% of the training data, should do.
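For reference, StratifiedShuffleSplit draws independent stratified train/validation splits (unlike k-fold, the validation sets may overlap), but it serves the same purpose here. A quick sketch on synthetic data from sklearn's make_classification, not our mushroom set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X_toy, y_toy = make_classification(n_samples=200, random_state=0)
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=19)
scores = cross_val_score(LogisticRegression(), X_toy, y_toy,
                         cv=cv, scoring='accuracy')
print(len(scores))  # one accuracy value per split -> 10
```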

scores = cross_val_score(logreg, X_train, y_train.values.ravel(), cv=StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=19), scoring='accuracy')
print('Accuracy of Logistic Regression classifier using 10-split cross-validation: {}'.format(scores.mean()))
Accuracy of Logistic Regression classifier using 10-split cross-validation: 0.9997655334114889

This time it doesn’t achieve the perfect score, but it’s pretty damn close.

Thus, it seems like the relationship between the features and the edibility of the mushrooms is essentially linear, so there is little point in trying models more complex than this logistic regression.

What we can do is investigate which features are the most important in deciding whether a mushroom is edible or not.

Most relevant features

features_coeffs = pd.DataFrame(logreg.coef_, columns=X.columns, index=['coefficients'])
features_coeffs.sort_values('coefficients', axis=1, ascending=False, inplace=True)
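On the real data this table ranks all 111 one-hot features by coefficient. As a self-contained illustration of the idea, here is a synthetic frame (hypothetical columns a, b and c) where the label depends only on one column, which then receives the largest coefficient:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
y_toy = (X_toy['b'] > 0).astype(int)  # label depends only on 'b'

model = LogisticRegression().fit(X_toy, y_toy)
coefs = pd.DataFrame(model.coef_, columns=X_toy.columns,
                     index=['coefficients'])
coefs = coefs.sort_values('coefficients', axis=1, ascending=False)
print(coefs.columns[0])  # 'b' gets the largest coefficient
```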

Interesting. Seems like odor and spore-print-color play an important role in deciding whether a mushroom is edible or not. Let’s confirm this:

def plot_features_containing(feature_name):
    categories = X.columns[X.columns.str.contains(feature_name)]
    edible_num = []
    poisonous_num = []
    for cat in categories:
        edible_count = sum((y[X[cat] == 1] == 0).values[:, 0])
        edible_num.append(edible_count)
        poisonous_num.append(sum(X[cat] == 1) - edible_count)
    counts_df = pd.DataFrame(index=categories, columns=['edible', 'poisonous'])
    counts_df.edible = edible_num
    counts_df.poisonous = poisonous_num
    counts_df.plot(kind='bar')

Mushrooms by odor - graph

odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s

Very interesting! It seems like, at least in our dataset:

  • All mushrooms with an almond or anise odor are edible
  • All mushrooms with a creosote, fishy, musty, pungent or spicy odor are poisonous (or unknown edibility)
  • Most mushrooms with no odor are edible. But not all of them!

Of course, this is just what our dataset tells us. It doesn't necessarily mean that any new mushroom we find out there will obey these rules.


Mushrooms by spore print color - graph

For spore-print-color we have quite a similar picture, although perhaps not as extreme as with odor. This is what we expected, since these are the 2 features with the highest coefficients in our logistic regression.

In fact, if we do the same for a feature other than these two, the distribution will probably not be as extreme.

Let’s check this.


Mushrooms by cap color - graph

Indeed, we see a much more balanced distribution, which suggests that cap-color does not play such an important role in determining the edibility of a mushroom.


Conclusions

  • We fitted a logistic regression model and achieved near-perfect accuracy, so there was no need to try more complex models.

  • Our algorithm identified specific traits (particularly regarding odor) that seem to heavily influence the chance that a mushroom is edible or not.

  • Even though experts have determined that there is no simple set of rules to decide whether a mushroom is edible or not, it seems like with this algorithm we can get pretty close.

Nevertheless, it is important to keep in mind that these results apply only to this dataset; there may well be mushrooms out there that don't follow these rules.

So, if you’re ever lost and stranded in a forest, don’t attempt to eat anything just because a machine tells you to do so! Stay safe out there.