Exploring the Data

Ideally, you want to identify large splits in the data, as these can indicate the largest differences.

The top grey node of the tree shows the overall number of mushrooms, in this case 8124, of which 52% are edible. This node represents the current situation before any characteristics are applied and always shows the largest probability, in this example 52% Edible compared to 48% Poisonous.

The first split shows that the most significant characteristic is Spore print colour. When you click the first node in the left-hand branch, a ‘Node Information’ box is displayed in the top-left corner. From this information you can see that 448 mushrooms with a Black, Brown or Buff spore print colour are poisonous, so there is an 11.52% chance of picking a poisonous mushroom out of the 3888 in the Black-Brown group. The same node shows that the remaining 3440 of those 3888 mushrooms are edible.
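The percentages in the ‘Node Information’ box can be checked by hand from the counts quoted above. The following is a minimal sketch of that arithmetic only; the variable names are illustrative and are not part of pi Predicts.

```python
# Counts taken from the 'Node Information' box described in the text.
poisonous = 448
edible = 3440
node_total = poisonous + edible  # 3888 mushrooms in this node

# Probability within the node: share of each outcome among the node's mushrooms.
p_poisonous = poisonous / node_total
p_edible = edible / node_total

print(f"Poisonous in node: {p_poisonous:.2%}")  # prints 11.52%
print(f"Edible in node: {p_edible:.2%}")        # prints 88.48%
```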

The Probability (Total) value on the right-hand side of the ‘Node Information’ box indicates the likelihood of picking an Edible or Poisonous mushroom in this node from the total population of the diagram. Using the example above, for the first left-hand node there is a 42.34% chance of choosing an Edible mushroom with a Black-Brown spore print colour from all 8124 mushrooms (3440 out of 8124).
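The difference from the in-node probability is only the denominator: Probability (Total) divides by the whole population rather than by the node's own count. A small sketch of that calculation, assuming the counts quoted in the text (the function name `probability_total` is hypothetical and chosen for illustration):

```python
TOTAL_POPULATION = 8124  # all mushrooms in the diagram (top grey node)


def probability_total(count_in_node: int, total: int = TOTAL_POPULATION) -> float:
    """Share of the whole population that has this outcome in this node."""
    return count_in_node / total


# 3440 edible mushrooms sit in the first left-hand node.
edible_total = probability_total(3440)
print(f"Probability (Total), Edible: {edible_total:.2%}")  # prints 42.34%
```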

To close the ‘Node Information’ box, click the ‘Close’ icon in the top-right corner or click another node. 

If you explore the model further and look at the right-hand branch, you can see that ‘Ring Number’ is the next characteristic.

This shows that you can increase the odds of eating an edible mushroom by choosing mushrooms with two or more rings. 

Although the odds improve as you follow the branches, you can increase your chances even more by continuing down the chart.

If you look just below the Ring Number characteristic, you can see that Spore print colour comes into play again. For the first time, there is a 100% chance of getting an edible mushroom based on the spore print colour.

So how do you distinguish between the first node’s spore print colour and this node’s? Open the tooltip information and click the small plus icon next to the characteristic.

As you can see by combining the information, the lower node contains 528 mushrooms with a 100% probability of being edible, provided they have two rings and a Purple, White or Yellow spore print colour.
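Reading a node this way amounts to requiring every condition on the path to hold at once. The sketch below illustrates that idea with the two conditions named above. The field names and the record layout are assumptions made for the example, and it does not reflect how pi Predicts stores or evaluates its rules.

```python
# Hypothetical field names; the allowed values come from the node described above.
PATH_CONDITIONS = {
    "ring_number": {"two"},
    "spore_print_colour": {"purple", "white", "yellow"},
}


def matches_path(mushroom: dict, conditions: dict = PATH_CONDITIONS) -> bool:
    """Return True only if the mushroom satisfies every condition on the path."""
    return all(mushroom.get(name) in allowed for name, allowed in conditions.items())


sample = {"ring_number": "two", "spore_print_colour": "white"}
other = {"ring_number": "two", "spore_print_colour": "black"}

print(matches_path(sample))  # True: both conditions hold
print(matches_path(other))   # False: the spore print colour is outside the set
```

A record that lacks one of the fields is treated as not matching, which is the cautious choice when the outcome is whether something is safe to eat.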

However, these are not the only edible mushrooms out of the 8124. If you look back at the table, you can see that many branches filter down to give 100% certainty of Edible mushrooms with different characteristics.

So now you have a model that predicts, with 99% accuracy, which mushrooms you can and can’t eat.

You can see this information by clicking the Chart information icon in the Chart Tools menu.

Please remember that this is a model, and you should therefore not base your decisions entirely on the model.

What matters here is whether the information in the model is useful. In summary, it has achieved a couple of very important things:

  1. It has challenged your preconception. Spore print colour is no longer a primary consideration.

  2. It has given you a model that you can use.

There is something very special about pi Predicts. It brings together the power of machine learning and the people who hold the domain knowledge.

However, one reason you may not yet be ready to make a decision about eating mushrooms is that you probably still don’t know very much about them.

If you tap into domain knowledge, then something exciting happens. The model may not be helpful, but if you show it to a mushroom expert, or get them to build it for themselves, they will tell you very quickly how useful the model is, and why it is right or wrong.

They may tell you why your sample is no good: maybe your data comes from America and you’re trying to eat European mushrooms. They can quickly tell you things about the model that data scientists struggle to see.

Statistics tells us about correlation but not causation. It cannot tell us why something is happening, but a domain expert can.

You built your first model a certain way. What if the expert knows that spore print colour is just not useful, because some people have difficulty distinguishing colours? You could remove that characteristic and rebuild the model.