Study Apache Spark MLlib on IPython—Regression & Classification—Decision Tree

Hi guys,

Today, let’s study the Decision Tree algorithm and see how to use it in Python with scikit-learn and MLlib. The decision tree is also the foundation of ensemble algorithms such as Random Forest and Gradient Boosted Trees.

I. Background Knowledge

Here are some background links on the basic concepts behind decision trees.

1. What is ID3 (keyword: information gain)

2. What is C4.5 (keyword: information gain ratio)

3. What is CART (keyword: Gini impurity)

4. Difference between ID3, C4.5 and CART

5. Debate: MLlib's decision tree uses ID3 together with CART

6. Apache Spark MLlib decision tree (note that the decision tree supports both regression and classification)
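
The three keywords above (information gain, information gain ratio, Gini impurity) are all ways of scoring a candidate split. As a rough illustration, here is a minimal pure-Python sketch of the textbook formulas; the function names and the toy labels are made up for this example and are not taken from MLlib or scikit-learn.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels (the CART criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children (ID3)."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain divided by the split information (C4.5)."""
    n = len(parent)
    split_info = -sum(
        len(child) / n * math.log2(len(child) / n) for child in children if child
    )
    return information_gain(parent, children) / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): 5 "yes" and 5 "no", split into two groups of 5.
parent = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

parent_entropy = entropy(parent)
parent_gini = gini(parent)
gain = information_gain(parent, split)
ratio = gain_ratio(parent, split)
```

A perfectly mixed two-class parent has the maximum entropy and Gini impurity for two classes; the split is better the more it reduces them.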

II. LibSVM Datatype

1. In MLlib

MLlib can read LIBSVM, the default text format used by LIBSVM and LIBLINEAR. Each line represents a labeled sparse feature vector in the following format (refer to the Apache Spark MLlib documentation):

label index1:value1 index2:value2 ...

For more details, please refer to this GitHub dataset.
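
To make the format concrete, here is a small sketch that parses one such line by hand. The helper names and the sample line are hypothetical; in practice you would use the loaders that MLlib and scikit-learn provide. Note that indices in the file are one-based, while the dense list built here is zero-based.

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 index2:value2 ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)  # index is one-based, as in the file
    return label, features


def to_dense(features, num_features):
    """Expand the sparse {index: value} mapping into a zero-based dense list."""
    dense = [0.0] * num_features
    for index, value in features.items():
        dense[index - 1] = value  # shift the one-based index to zero-based
    return dense


# Hypothetical sample line: label 1, three non-zero features out of 10.
label, features = parse_libsvm_line("1 3:0.5 7:1.25 10:2")
dense = to_dense(features, 10)
```

Only the non-zero features are stored in the file, which is why the format suits sparse data.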

2. In sklearn

Please refer to scikit-learn's load_svmlight_file.

Note that we use a Python decorator to cache the loaded LibSVM file, which speeds up repeated loads.
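
The scikit-learn documentation for load_svmlight_file shows this pattern with joblib's Memory, which caches on disk. The sketch below only illustrates the decorator idea and swaps in the standard library's functools.lru_cache (an in-memory cache) with a tiny hand-written loader, so it runs without extra packages. The loader, the counter, and the sample file are all made up for this example.

```python
import functools
import os
import tempfile

load_count = 0  # counts how often the file is actually read from disk


@functools.lru_cache(maxsize=None)
def load_libsvm(path):
    """Parse a LibSVM file into a tuple of (label, ((index, value), ...)) rows."""
    global load_count
    load_count += 1
    rows = []
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            label = float(parts[0])
            pairs = tuple(
                (int(index), float(value))
                for index, value in (item.split(":") for item in parts[1:])
            )
            rows.append((label, pairs))
    return tuple(rows)


# Write a tiny hypothetical LibSVM file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 1:0.5 3:2.0\n0 2:1.5\n")
    sample_path = tmp.name

first = load_libsvm(sample_path)   # reads and parses the file
second = load_libsvm(sample_path)  # served from the cache, no second read
os.remove(sample_path)
```

The second call returns the cached result without touching the file, which is the whole point of the decorator.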

III. Code

Let’s look at the IPython code here: GitHub source code.

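Note that zip() pairs each element of the first RDD with the element at the same position in the second RDD and returns the result as (key, value) pairs. Both RDDs must have the same number of partitions and the same number of elements in each partition.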
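
Here is a small local sketch of that pairing, using Python's built-in zip() on plain lists as a stand-in for RDD.zip() so it runs without a Spark cluster. The labels and predictions are invented for illustration; the commented line shows roughly how the same step looks in PySpark, assuming an RDD of LabeledPoint called test_data and an RDD of predictions.

```python
# PySpark equivalent (assumed variable names, not run here):
#   labels_and_preds = test_data.map(lambda lp: lp.label).zip(predictions)

# Hypothetical true labels and model predictions, in the same order.
labels = [1.0, 0.0, 1.0, 1.0]
predictions = [1.0, 0.0, 0.0, 1.0]

# Pair element i of the first sequence with element i of the second.
labels_and_preds = list(zip(labels, predictions))

# Fraction of pairs where the prediction differs from the label.
test_error = sum(
    1 for label, pred in labels_and_preds if label != pred
) / len(labels_and_preds)
accuracy = 1.0 - test_error
```

Once labels and predictions are paired, the test error is just the share of pairs that disagree.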

Note that the MLlib model achieves very high accuracy on this dataset.

The decision tree PDF we generated looks like this: CART_decision_tree

References

1. http://scikit-learn.org/stable/modules/tree.html

2. http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html#example-tree-plot-iris-py

3. https://spark.apache.org/docs/latest/mllib-decision-tree.html#classification

4. http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html

5. http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html#sklearn.datasets.load_svmlight_file

2 comments

  1. Doug · September 29, 2015

    100% accuracy, don't you think it's overfitting?
