Today, let’s study the ensemble-of-trees algorithms: Random Forest and Gradient Boosted Trees (GBTs).
I. Background Knowledge
1. Random Forest builds several estimators independently and then averages their predictions. Its main idea is to reduce variance.
2. GBTs combine several weak models to produce a powerful ensemble, which aims to reduce bias (see the scikit-learn sketch after this list).
3. FYI: in MLlib, GBTs don’t support multiclass classification yet.
4. Here’s a good tutorial from Databricks.
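Before diving into the notebook, here is a minimal scikit-learn sketch contrasting the two ensembles. The synthetic dataset and hyperparameters below are illustrative assumptions, not the ones used in the actual session:

```python
# Minimal sketch: Random Forest (variance reduction) vs. GBT (bias reduction).
# The synthetic dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random Forest: many trees grown independently on bootstrap samples,
# then averaged -- this mainly reduces variance.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Gradient boosting: shallow trees added sequentially, each one fitting
# the errors of the current ensemble -- this mainly reduces bias.
gbt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
gbt.fit(X_train, y_train)

print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))
print("GBT accuracy:          ", accuracy_score(y_test, gbt.predict(X_test)))
```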
II. IPython Code (GitHub)
Let’s look at the IPython code here. We are using the same dataset and procedure as in the decision tree session.
We will then find out whether our dataset is more bias-oriented (high bias) or more variance-oriented (high variance); a sketch of the corresponding MLlib calls follows below.
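For reference, here is a sketch of the MLlib (RDD-based) API the notebook relies on. It assumes an existing SparkContext `sc` and a LibSVM-format data file; the path and parameter values are illustrative assumptions, not the notebook’s exact setup:

```python
# Sketch of the RDD-based MLlib tree-ensemble API. Assumes a SparkContext `sc`;
# the data path and parameter values are illustrative assumptions.
from pyspark.mllib.tree import RandomForest, GradientBoostedTrees
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
train, test = data.randomSplit([0.7, 0.3], seed=42)

# Random Forest: binary classifier averaging 30 independently grown trees.
rf_model = RandomForest.trainClassifier(
    train, numClasses=2, categoricalFeaturesInfo={},
    numTrees=30, impurity="gini", maxDepth=4)

# GBTs: sequentially boosted trees (binary classification only in MLlib).
gbt_model = GradientBoostedTrees.trainClassifier(
    train, categoricalFeaturesInfo={}, numIterations=30, maxDepth=3)

def test_error(model):
    # Pair each true label with its prediction and compute the error rate.
    preds = model.predict(test.map(lambda p: p.features))
    labels_and_preds = test.map(lambda p: p.label).zip(preds)
    return labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())

print("Random Forest test error:", test_error(rf_model))
print("GBT test error:          ", test_error(gbt_model))
```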
In conclusion, both the scikit-learn random forest model and the MLlib one perform better than the GBTs. We can therefore conclude that the dataset we used is more variance-oriented. Moreover, the agreement between the scikit-learn and MLlib random forest results suggests that both implementations are reliable.