Random Forest: The Math of Intelligence
We’re going to talk about building and evaluating random forests.
Random forests are built from decision trees.
Decision trees are easy to build, easy to use and easy to interpret, but in practice they are not that awesome…
That quote is from The Elements of Statistical Learning.
The good news is that random forests combine the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy.
So let’s illustrate a Random Forest!
Step 1.
Create a Bootstrap Dataset:
Imagine that these 4 samples are the entire dataset. We are going to build a tree from this original dataset.
To create a bootstrap dataset that is the same size as the original, we just randomly select samples from the original dataset. The important detail is that we're allowed to pick the same sample more than once. This is the first sample that we randomly select:
This is the second randomly selected sample from the original dataset.
So it's the second sample in our bootstrap dataset:
Lastly, here's the fourth randomly selected sample. Note that it's the same as the third:
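As a rough sketch, bootstrapping is just sampling with replacement, which the Python standard library can do directly (the sample names below are made up for illustration):

```python
import random

random.seed(0)  # for a reproducible example

# Hypothetical stand-ins for the 4 samples in the original dataset
original = ["sample_1", "sample_2", "sample_3", "sample_4"]

# random.choices samples WITH replacement, so the same sample
# can be picked more than once -- that is the key detail
bootstrap = random.choices(original, k=len(original))

print(bootstrap)  # same size as the original, possibly with repeats
```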
Step 2.
Create a decision tree using the bootstrap dataset, but only consider a random subset of variables at each step.
Note, we’ll talk more about how to determine the optimal number of variables to consider later.
Thus, instead of considering all four variables to figure out how to split the root node, in this case we randomly selected Good Blood Circulation and Blocked Arteries as candidates for the root node. Just for the sake of the example, assume that Good Blood Circulation did the best job separating the samples.
Since we used Good Blood Circulation, I'm going to gray it out so that we can focus on the remaining variables.
Now we need to figure out how to split the samples at this node. Just like for the root, we randomly select two variables as candidates, instead of all three remaining columns.
We just build the tree as usual, but only considering a random subset of variables at each step:
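A minimal sketch of that per-node variable selection, assuming the variable names from the example (random.sample picks without replacement, so the two candidates at a node are distinct):

```python
import random

random.seed(1)  # reproducible example

# The four variables from the example above
all_variables = ["chest_pain", "good_blood_circulation",
                 "blocked_arteries", "weight"]

# Root node: consider only 2 randomly chosen candidates, not all 4
root_candidates = random.sample(all_variables, k=2)

# Suppose good_blood_circulation gave the best split; gray it out
# and pick 2 candidates from the 3 remaining variables for the next node
remaining = [v for v in all_variables if v != "good_blood_circulation"]
next_candidates = random.sample(remaining, k=2)

print(root_candidates, next_candidates)
```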
Here’s the tree we just made.
Now go back to step one and repeat:
Make a new bootstrap data set and build a tree considering a subset of variables at each step.
Using a bootstrap sample and considering only a subset of variables at each step results in a wide variety of trees.
The variety is what makes random forests more effective than individual decision trees.
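Putting the two steps together, here is a toy end-to-end sketch. To keep it short, each "tree" is simplified to a one-split stump, and the data, variable names, and labels are all invented for illustration. Real random forests grow full trees, but the loop structure (bootstrap, then build while restricting candidate variables, then let the trees vote) is the same:

```python
import random
from collections import Counter

random.seed(42)

FEATURES = ["chest_pain", "good_circulation", "blocked_arteries", "weight_high"]

# Toy rows: (binary feature dict, binary label) -- purely illustrative
rows = [
    ({"chest_pain": 1, "good_circulation": 0, "blocked_arteries": 1, "weight_high": 1}, 1),
    ({"chest_pain": 0, "good_circulation": 1, "blocked_arteries": 0, "weight_high": 0}, 0),
    ({"chest_pain": 1, "good_circulation": 1, "blocked_arteries": 1, "weight_high": 0}, 1),
    ({"chest_pain": 0, "good_circulation": 0, "blocked_arteries": 0, "weight_high": 1}, 0),
]

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_stump(sample):
    # Step 2: consider only a random subset (2 of 4) of the variables
    candidates = random.sample(FEATURES, k=2)

    def misclassified(feat):
        # Count errors if each side of the split predicts its majority label
        err = 0
        for value in (0, 1):
            side = [y for x, y in sample if x[feat] == value]
            if side:
                err += len(side) - Counter(side).most_common(1)[0][1]
        return err

    feat = min(candidates, key=misclassified)
    left = [y for x, y in sample if x[feat] == 0]
    right = [y for x, y in sample if x[feat] == 1]
    return feat, (majority(left) if left else 0), (majority(right) if right else 1)

# Repeat Steps 1 and 2 to grow a varied forest
forest = []
for _ in range(25):
    bootstrap = random.choices(rows, k=len(rows))  # Step 1: bootstrap
    forest.append(build_stump(bootstrap))          # Step 2: build a tree

def predict(forest, x):
    # Each tree votes; the forest returns the majority vote
    votes = [right if x[feat] == 1 else left for feat, left, right in forest]
    return majority(votes)

print(predict(forest, rows[0][0]))
```

Because each stump saw a different bootstrap sample and a different pair of candidate variables, the stumps disagree with each other, and the majority vote is what smooths out their individual mistakes.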