This Article Will Discuss:
1- Why Random Forest?
2- Random Forest Hyperparameters
Random Forest is able to overcome two main issues with decision trees:
1- Not all features of the data set are used.
2- The root is the same even if we manipulate the decision tree parameters like max_depth, criterion, etc.
The key idea of random forest is to randomly pick a subset of features at each potential split.
This allows us to explore all aspects of our data, producing more generalized models,
unlike a decision tree, where a single feature may dominate.
Random forests shine in situations where the dataset is noisy or has a lot of irrelevant features.
So, having more than one tree means we may get different predictions for our label. Which one do we choose?
In a classification task => we tally up the results (majority vote).
In a regression task => we take the average.
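As a minimal sketch of this aggregation step (the per-tree predictions below are made up for illustration):

```python
import numpy as np
from collections import Counter

# Hypothetical predictions from 5 trees for a single sample.
tree_class_preds = ["cat", "dog", "cat", "cat", "dog"]   # classification
tree_reg_preds = [3.1, 2.8, 3.4, 3.0, 2.9]               # regression

# Classification: tally the votes and keep the majority class.
majority_class = Counter(tree_class_preds).most_common(1)[0][0]
print(majority_class)        # -> "cat"

# Regression: average the individual tree outputs.
average_pred = np.mean(tree_reg_preds)
print(average_pred)          # -> 3.04
```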
* NOTE
While creating a random forest you will notice that you're basically setting up all the rules of
a decision tree (e.g., max_depth, criterion, min_samples_split), so what additional parameters make it a random forest?
1- n_estimators : How many decision trees will the random forest have?
It can be determined by CV. Setting it to 1 will make the model behave poorly compared to a plain decision tree, as each tree is only trained with a subset of features.
Adding more trees won't overfit the model; however, setting this parameter to a very large value will produce duplicated trees.
( Duplicated trees don't imply overfitting. )
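A rough sketch of choosing n_estimators with cross-validation (the dataset and the candidate values here are just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Compare a few candidate forest sizes by their mean CV accuracy.
for n_trees in [1, 10, 50, 100, 300]:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"n_estimators={n_trees}: mean CV accuracy = {score:.3f}")
```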
2- n_features : you can decide the size of the random feature subset based on the task the model is performing:
1- Classification => √n + 1
2- Regression => n/3
Note
This can be used as a starting point; we don't have to stick to it.
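In scikit-learn this subset size is controlled by the max_features parameter; a small sketch using the rules of thumb above (note that a float value is read as a fraction of the features):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: roughly sqrt(n_features) candidates at each split.
clf = RandomForestClassifier(max_features="sqrt", random_state=0)

# Regression: roughly n_features / 3 candidates at each split
# (a float max_features is interpreted as a fraction of the features).
reg = RandomForestRegressor(max_features=1/3, random_state=0)
```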
3- bootstrapping : random sampling with replacement, True by default.
Instead of training each decision tree with the entire set of data points, we randomly select rows of the data set (the same row can appear more than once, which is why it's called "with replacement").
This makes the model less sensitive to the training data.
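A quick sketch of what sampling with replacement looks like on a toy set of row indices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_rows = 10
row_indices = np.arange(n_rows)

# Bootstrap sample: same size as the data, drawn with replacement,
# so some rows appear more than once and others are left out entirely.
bootstrap_idx = rng.choice(row_indices, size=n_rows, replace=True)
print(sorted(bootstrap_idx))                   # duplicates are expected
print(set(row_indices) - set(bootstrap_idx))   # rows that were never drawn
```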
Important Note
Notice how we have two levels of randomization now:
1- Random feature selection => trees are less correlated.
2- Random data point selection => reduces overfitting, more generalization.
4- OOB (out-of-bag error) :
Bootstrapping leaves some data points out; the data points a tree didn't train on are its OOB samples.
For each data point, the forest checks which trees didn't use it, then uses those trees to predict the label for that data point.
Note
It only works when bootstrap is set to True.
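In scikit-learn the OOB estimate is exposed through oob_score (which indeed requires bootstrap=True); a minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only
# the trees that did not see that sample during training.
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=42)
forest.fit(X, y)
print(forest.oob_score_)   # out-of-bag accuracy estimate
```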