Today, let’s study the Decision Tree algorithm and see how to use it in Python with scikit-learn and Spark MLlib. Decision Tree is also the foundation of ensemble algorithms such as Random Forest and Gradient Boosted Trees.
I. Background Knowledge
For decision trees, here are some background links covering the basic concepts.
1. What is ID3 (keyword: information gain)
2. What is C4.5 (keyword: information gain ratio)
3. What is CART (keyword: Gini impurity)
4. Differences between ID3, C4.5 and CART
5. Whether the MLlib decision tree uses ID3 or CART (a debated point)
6. Apache Spark MLlib decision tree (note that the decision tree supports both regression and classification)
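To make the split criteria behind those links concrete, here is a small sketch computing entropy, information gain (ID3/C4.5), and Gini impurity (CART) on a made-up label list; the 9-yes/5-no example values are only for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list; the basis of ID3's information gain."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a label list; the split criterion used by CART."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy reduction after splitting parent into subsets (ID3 criterion)."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

# Illustrative labels: 9 positives, 5 negatives.
labels = ["yes"] * 9 + ["no"] * 5
print(round(entropy(labels), 3))  # 0.94
print(round(gini(labels), 3))     # 0.459
```

C4.5 normalizes information gain by the split's own entropy (gain ratio) to avoid favoring many-valued attributes.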
II. LibSVM Datatype
1. In MLlib
label index1:value1 index2:value2 ...
For more details, please refer to this GitHub dataset.
2. In sklearn
Please refer to scikit-learn's load_svmlight_file.
Note that we cache the parsed LibSVM data in memory with a Python decorator, so repeated calls do not reload the file, which improves performance.
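The format and the caching idea can both be sketched in a few lines. The parser below is a hypothetical minimal reader for the `label index1:value1 index2:value2 ...` layout shown above (not the scikit-learn or MLlib loader), memoized with the standard-library `functools.lru_cache` decorator:

```python
import tempfile
from functools import lru_cache

@lru_cache(maxsize=None)  # cache by file path so repeated calls skip re-parsing
def load_libsvm(path):
    """Parse a LibSVM file into (labels, rows); rows are {index: value} dicts."""
    labels, rows = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            labels.append(float(parts[0]))
            rows.append({int(i): float(v)
                         for i, v in (p.split(":") for p in parts[1:])})
    return labels, rows

# Tiny usage example with a temporary two-line LibSVM file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("1 1:0.5 3:2.0\n0 2:1.5\n")
    path = f.name

labels, rows = load_libsvm(path)
print(labels)      # [1.0, 0.0]
print(rows[0][3])  # 2.0
```

Because of the decorator, a second `load_libsvm(path)` call returns the cached result without touching the file.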
Let’s look at the IPython notebook code here: GitHub source code.
Note that zip() pairs the elements of the former RDD with those of the latter one element-wise, returning an RDD of (key, value) pairs.
Note that the MLlib model achieves very high accuracy on this dataset.
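In MLlib, a common way to get that accuracy is to zip the label RDD with the prediction RDD and count mismatches; RDD.zip behaves like Python's builtin zip(). A minimal stand-in using plain Python lists (the label and prediction values here are invented for illustration):

```python
# Hypothetical labels and model predictions, standing in for two RDDs.
labels = [1.0, 0.0, 1.0, 1.0, 0.0]
predictions = [1.0, 0.0, 0.0, 1.0, 0.0]

# zip() pairs elements positionally, just like RDD.zip.
labels_and_predictions = list(zip(labels, predictions))

# Test error = fraction of pairs where label != prediction.
test_err = sum(1 for l, p in labels_and_predictions if l != p) / len(labels)
print(test_err)  # 0.2 (one mismatch out of five)
```

With real RDDs the pattern is the same, except the count is done with RDD actions instead of a list comprehension.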
The decision tree PDF we generated looks like this: CART_decision_tree
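A figure like that can be produced from a fitted scikit-learn tree with export_graphviz, which emits Graphviz dot text. A sketch on the bundled iris dataset (the dataset choice and depth are my assumptions, not from the original notebook; rendering the dot file to PDF requires the separate Graphviz `dot` tool):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Fit a small CART-style tree on the iris dataset.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_graphviz returns dot text when out_file=None;
# render it with e.g. `dot -Tpdf tree.dot -o tree.pdf`.
dot = export_graphviz(clf, out_file=None,
                      feature_names=iris.feature_names,
                      class_names=list(iris.target_names))
with open("tree.dot", "w") as f:
    f.write(dot)
print(dot[:8])  # digraph
```

scikit-learn's tree implementation is CART-based, which is why the exported figure is labeled a CART decision tree.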