Study Apache Spark MLlib on IPython—Regression & Classification—Random Forest & GBTs

Hi guys,

Today, let’s study two ensembles-of-trees algorithms: Random Forest and Gradient Boosted Trees (GBTs).

I. Background Knowledge 

1. Random Forest builds several estimators independently and then averages their predictions. Its main idea is to reduce variance (see the sketch after this list).

2. GBTs combine several weak models, trained sequentially, to produce a powerful ensemble; their aim is to reduce bias.

FYI: in MLlib, GBTs do not support multiclass classification yet.

3. Difference between bias and variance

Bias is the part of the error that comes from overly simple assumptions: a high-bias model underfits and misses real patterns. Variance is the part that comes from sensitivity to the particular training sample: a high-variance model overfits and learns noise. Reference 4 walks through this trade-off in detail.

4. Here’s a good tutorial from Databricks
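To make the averaging intuition in item 1 concrete, here is a minimal sketch (an illustration added here, not from the original notebook): the variance of the mean of n independent predictions is about 1/n of a single prediction's variance, which is exactly what a Random Forest exploits.

    import numpy as np

    rng = np.random.RandomState(0)
    true_value = 1.0

    # 1,000 trials of a single noisy estimator: prediction variance is about 1.0.
    single = true_value + rng.randn(1000)

    # 1,000 trials of an "ensemble" that averages 50 independent estimators:
    # the variance of the average shrinks by a factor of about 50.
    ensemble = (true_value + rng.randn(1000, 50)).mean(axis=1)

    print("single estimator variance:", single.var())        # roughly 1.0
    print("50-estimator average variance:", ensemble.var())  # roughly 0.02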

II. IPython Code (GitHub)

Let’s look at the IPython code here. We use the same dataset and procedure as in the decision tree session.

We will also find out whether our dataset is dominated more by bias or by variance.

In scikit-learn
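The original notebook cells here were images and did not survive extraction, so below is a minimal, runnable sketch of the scikit-learn comparison. The built-in breast cancer dataset is only a stand-in for the decision tree session's dataset, and the hyperparameters (100 trees/iterations) are illustrative assumptions, not the original settings.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Stand-in data: the original post reused the decision tree session's dataset.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Random Forest: 100 trees grown independently, predictions averaged.
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)

    # GBTs: 100 shallow trees grown sequentially, each correcting the last.
    gbt = GradientBoostingClassifier(n_estimators=100, random_state=42)
    gbt.fit(X_train, y_train)

    print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))
    print("GBT accuracy:", accuracy_score(y_test, gbt.predict(X_test)))

Note that scikit-learn's GradientBoostingClassifier does handle multiclass problems; the binary-only limitation mentioned above applies to MLlib's GBTs.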

In PySpark

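The PySpark cell is likewise lost; this is a hedged sketch using the MLlib RDD API, assuming a binary-labeled dataset in LIBSVM format ("data.txt" is a placeholder path, not the original file). The evaluation pattern follows the MLlib ensembles guide (References 2 and 3).

    from pyspark import SparkContext
    from pyspark.mllib.tree import GradientBoostedTrees, RandomForest
    from pyspark.mllib.util import MLUtils

    sc = SparkContext(appName="TreeEnsembles")

    # Placeholder path: substitute the dataset from the decision tree session.
    data = MLUtils.loadLibSVMFile(sc, "data.txt")
    trainingData, testData = data.randomSplit([0.7, 0.3], seed=42)

    # Random Forest: 100 independently grown trees, combined by majority vote.
    rfModel = RandomForest.trainClassifier(
        trainingData, numClasses=2, categoricalFeaturesInfo={},
        numTrees=100, featureSubsetStrategy="auto",
        impurity="gini", maxDepth=4, maxBins=32)

    # GBTs: 100 sequential boosting iterations (binary labels only in MLlib).
    gbtModel = GradientBoostedTrees.trainClassifier(
        trainingData, categoricalFeaturesInfo={}, numIterations=100)

    def testError(model):
        # Predict on the feature RDD, pair predictions with true labels,
        # and count the fraction of mismatches.
        predictions = model.predict(testData.map(lambda lp: lp.features))
        labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
        return labelsAndPredictions.filter(
            lambda lp: lp[0] != lp[1]).count() / float(testData.count())

    print("Random Forest test error:", testError(rfModel))
    print("GBT test error:", testError(gbtModel))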

In conclusion, both the scikit-learn random forest model and the MLlib one perform better than the GBTs. We can therefore conclude that the dataset we used is more variance-dominated. Besides, the mutual agreement between the scikit-learn random forest model and MLlib's suggests that both are reliable.

Reference

1. http://scikit-learn.org/stable/modules/ensemble.html

2. https://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests

3. https://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts

4. http://scott.fortmann-roe.com/docs/BiasVariance.html
