A brief introduction and summary of MLlib

MLlib is Apache Spark's Machine Learning Library, a convenient toolkit that bundles many commonly used machine learning and data analysis algorithms.

Let's go through the algorithms MLlib currently offers.

[Figure: ml_method, an overview diagram of MLlib's methods]

1. Regression (Supervised)

1.1 Linear regression

Linear regression is one of the most common methods for regression, predicting the output variable as a linear combination of the features.
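As a minimal sketch (assuming `sc` is an existing SparkContext and using the RDD-based MLlib API from Spark 1.x), training and using a linear model looks roughly like this:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Toy data: the label is roughly 2 * feature.
val data = sc.parallelize(Seq(
  LabeledPoint(2.0, Vectors.dense(1.0)),
  LabeledPoint(4.1, Vectors.dense(2.0)),
  LabeledPoint(5.9, Vectors.dense(3.0))
))

// Train with stochastic gradient descent for 100 iterations.
val model = LinearRegressionWithSGD.train(data, 100)

// Predict the output for a new feature vector.
println(model.predict(Vectors.dense(4.0)))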

2. Classification (Supervised)

2.1 Logistic regression (Binary Classification)

Logistic regression is a binary classification method that identifies a linear separating plane between positive and negative examples. In MLlib, it takes LabeledPoints with label 0 or 1 and returns a LogisticRegressionModel that can predict new points.
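A minimal sketch (the spam classifier chapter below walks through the same API end to end; `sc` is assumed to be an existing SparkContext):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Labels must be 0 or 1 for binary logistic regression.
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -2.0))
))

val model = LogisticRegressionWithSGD.train(data, 100)
// Points near the positive example should be predicted as 1.0.
println(model.predict(Vectors.dense(1.5, 0.5)))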

2.2 Linear SVM (Binary Classification)

Support Vector Machines, or SVMs, are another binary classification method with linear separating planes, again expecting labels of 0 or 1. They are available through the SVMWithSGD class, with similar parameters to linear and logistic regression. The returned SVMModel uses a threshold for prediction like LogisticRegressionModel.
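A hedged sketch of the same pattern with SVMWithSGD (toy data; `sc` assumed):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -1.0))
))

// Train a linear SVM; the returned SVMModel thresholds its raw margin at 0 by default.
val svmModel = SVMWithSGD.train(data, 100)
println(svmModel.predict(Vectors.dense(0.5, 0.5)))

// clearThreshold() makes predict() return raw scores instead of 0/1 labels.
svmModel.clearThreshold()
println(svmModel.predict(Vectors.dense(0.5, 0.5)))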

2.3 Naive Bayes (Multiclass Classification)

Naive Bayes is a multiclass classification algorithm that scores how well each point belongs in each class based on a linear function of the features. It is commonly used in text classification with TF-IDF features, among other applications. MLlib implements Multinomial Naive Bayes, which expects nonnegative frequencies (e.g., word frequencies) as input features.
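A small multiclass sketch (the feature values stand in for word counts; `sc` assumed):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Three classes (0, 1, 2); features must be nonnegative counts or frequencies.
val docs = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 3.0, 0.0)),
  LabeledPoint(2.0, Vectors.dense(1.0, 0.0, 4.0))
))

// lambda is the additive (Laplace) smoothing parameter.
val nbModel = NaiveBayes.train(docs, lambda = 1.0)
println(nbModel.predict(Vectors.dense(0.0, 2.0, 0.0)))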

3. Regression and Classification (Trees and Ensemble Methods)

3.1 Decision trees

Decision trees are a flexible model that can be used for both classification and regression. They represent a tree of nodes, each of which makes a binary decision based on a feature of the data (e.g., is a person’s age greater than 20?), and where the leaf nodes in the tree contain a prediction (e.g., is the person likely to buy a product?). Decision trees are attractive because the models are easy to inspect and because they support both categorical and continuous features.
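A rough sketch of training a tree classifier (the categoricalFeaturesInfo map marks feature 1 as categorical with two possible values; `sc` assumed):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Feature 0 is a continuous age; feature 1 is a 0/1 categorical flag.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(18.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(20.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(35.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(50.0, 1.0))
))

val treeModel = DecisionTree.trainClassifier(
  data,
  2,                      // numClasses
  Map(1 -> 2),            // categoricalFeaturesInfo: feature 1 has 2 categories
  "gini",                 // impurity ("gini" or "entropy")
  5,                      // maxDepth
  32)                     // maxBins

// The learned tree is easy to inspect.
println(treeModel.toDebugString)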

3.2 Random forests

Random forests train an ensemble of decision trees; a single decision tree is just the special case of a forest with one tree. In MLlib, training a random forest returns a WeightedEnsembleModel that contains several trees (in the weakHypotheses field, weighted by weakHypothesisWeights) and can predict() an RDD or a Vector. It also includes a toDebugString method to print all the trees.
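A sketch of training a small forest (same kind of LabeledPoint input as above; `sc` assumed):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(18.0)),
  LabeledPoint(0.0, Vectors.dense(22.0)),
  LabeledPoint(1.0, Vectors.dense(45.0)),
  LabeledPoint(1.0, Vectors.dense(60.0))
))

val forestModel = RandomForest.trainClassifier(
  data,
  2,                      // numClasses
  Map[Int, Int](),        // categoricalFeaturesInfo: none here
  10,                     // numTrees
  "auto",                 // featureSubsetStrategy
  "gini",                 // impurity
  4,                      // maxDepth
  32,                     // maxBins
  12345)                  // seed

// Prints every tree in the ensemble.
println(forestModel.toDebugString)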

3.3 Gradient-Boosted Trees (GBT)

Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

MLlib supports GBTs for binary classification and for regression, using both continuous and categorical features. MLlib implements GBTs using the existing decision trees implementation. Please see the decision tree guide for more information on trees.

Note: GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or random forest.
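A minimal sketch following the pattern in the MLlib docs (`sc` assumed; the parameter values are illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0)),
  LabeledPoint(1.0, Vectors.dense(5.0)),
  LabeledPoint(1.0, Vectors.dense(6.0))
))

// Start from the default binary-classification strategy and tune a few knobs.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 10          // number of trees in the ensemble
boostingStrategy.treeStrategy.maxDepth = 3

val gbtModel = GradientBoostedTrees.train(data, boostingStrategy)
println(gbtModel.predict(Vectors.dense(5.5)))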

4. Clustering (Unsupervised)

Clustering is the unsupervised learning task that involves grouping objects into clusters of high similarity. Unlike the supervised tasks seen before, where data is labeled, clustering can be used to make sense of unlabeled data. It is commonly used in data exploration (to find what a new dataset looks like) and in anomaly detection (to identify points that are far from any cluster).

4.1 K-means

MLlib includes the popular K-means algorithm for clustering, as well as a variant called K-means|| that provides better initialization in parallel environments. K-means|| is similar to the K-means++ initialization procedure often used in single-node settings.
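A quick sketch (`sc` assumed; K-means|| is the default initialization mode in KMeans.train):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Two obvious clusters, around (0, 0) and (10, 10).
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 10.0), Vectors.dense(10.0, 9.0)
))

val kmeansModel = KMeans.train(points, 2, 20)   // k = 2, maxIterations = 20

println(kmeansModel.clusterCenters.mkString(", "))
println(kmeansModel.predict(Vectors.dense(0.5, 0.5)))   // index of the nearest cluster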

4.2 Collaborative Filtering

Collaborative filtering is a technique for recommender systems wherein users’ ratings and interactions with various products are used to recommend new ones. Collaborative filtering is attractive because it only needs to take in a list of user/product interactions: either “explicit” interactions (i.e., ratings on a shopping site) or “implicit” ones (e.g., a user browsed a product page but did not rate the product). Based solely on these interactions, collaborative filtering algorithms learn which products are similar to each other (because the same users interact with them) and which users are similar to each other, and can make new recommendations. While the MLlib API talks about “users” and “products,” you can also use collaborative filtering for other applications, such as recommending users to follow on a social network, tags to add to an article, or songs to add to a radio station.

MLlib implements collaborative filtering with Alternating Least Squares (ALS), which works by determining a feature vector for each user and product such that the dot product of a user's vector and a product's vector is close to their score.
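A sketch of the explicit-feedback API (for implicit feedback there is a separate ALS.trainImplicit; `sc` assumed):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// (user, product, rating) triples.
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 5.0), Rating(1, 20, 1.0),
  Rating(2, 10, 4.0), Rating(2, 30, 5.0)
))

// rank = size of the latent feature vectors, then number of iterations and regularization lambda.
val alsModel = ALS.train(ratings, 10, 10, 0.01)

println(alsModel.predict(1, 30))                    // predicted rating for user 1, product 30
alsModel.recommendProducts(2, 2).foreach(println)   // top-2 products for user 2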

5. Dimensionality Reduction

5.1 PCA

The main technique for dimensionality reduction used by the machine learning community is principal component analysis (PCA). In this technique, the mapping to the lower-dimensional space is done such that the variance of the data in the low dimensional representation is maximized, thus ignoring non-informative dimensions. To compute the mapping, the normalized correlation matrix of the data is constructed and the singular vectors and values of this matrix are used. The singular vectors that correspond to the largest singular values are used to reconstruct a large fraction of the variance of the original data.
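In MLlib this is exposed on RowMatrix; a sketch (`sc` assumed):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(2.0, 4.1, 6.2),
  Vectors.dense(3.0, 5.9, 9.1)
))
val mat = new RowMatrix(rows)

// Top 2 principal components, returned as a local 3 x 2 matrix.
val pc = mat.computePrincipalComponents(2)

// Project the rows into the 2-dimensional principal subspace.
val projected = mat.multiply(pc)
projected.rows.collect().foreach(println)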

5.2 SVD

MLlib also provides the lower-level singular value decomposition (SVD) primitive. The SVD factorizes an m × n matrix A into three matrices, A ≈ UΣVᵀ, where:

• U is an orthonormal matrix, whose columns are called left singular vectors.
• Σ is a diagonal matrix with nonnegative entries in descending order, whose diagonal entries are called singular values.
• V is an orthonormal matrix, whose columns are called right singular vectors.
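A sketch of computing a truncated SVD on a RowMatrix (`sc` assumed):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 0.0),
  Vectors.dense(0.0, 2.0, 0.0),
  Vectors.dense(0.0, 0.0, 3.0)
)))

// Keep the top 2 singular values/vectors; computeU = true also materializes U.
val svd = mat.computeSVD(2, computeU = true)
println(svd.s)                           // singular values, in descending order
println(svd.V)                           // right singular vectors (local matrix)
svd.U.rows.collect().foreach(println)    // left singular vectors (distributed rows)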

Reference:

1. https://spark.apache.org/docs/1.2.1/mllib-guide.html

2. Holden Karau et al., Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly, 2015.

Study the Spam Classifier Code with MLlib in IntelliJ

Hi guys, I hope you are doing well today. In this chapter, let's study and run the spam classifier code from Databricks.

1. Download the source code from GitHub; refer to the Databricks examples.

I have modified some of it; you can download the whole project from my GitHub.

2. Download the test files

spam sample

https://github.com/databricks/learning-spark/blob/master/files/spam.txt

non-spam sample

https://github.com/databricks/learning-spark/blob/master/files/ham.txt

3. Create a project following the procedure we discussed here.

4. Change build.sbt

We need to use Scala 2.10.x for Spark 1.4.0; see here.

name := "spam_classifier"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.4.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.4.0"

5. Let's study the code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object spam_classifier{

  def main(args: Array[String]) {
    //change it to the local standalone mode
    val conf = new SparkConf()
                   .setMaster("local[2]")
                   .setAppName("spam_classifier")
                   .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)

    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("spam.txt")
    val ham = sc.textFile("ham.txt")

    //(1)Feature extraction
    // Create a HashingTF instance to map email text to vectors of 100 features.
    val tf = new HashingTF(numFeatures = 100)
    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))

    //(2)Create Training Set
    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

    //(3) Training
    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)

    //(4) Create Testing Set
    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
    val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))

    //(5) Test results
    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${model.predict(negTestExample)}")

    sc.stop()
  }
}

5.1 About HashingTF
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
TFIDF(t, d, D) = TF(t, d) · IDF(t, D)
Term frequency TF(t, d) is the number of times that term t appears in document d, while document frequency DF(t, D) is the number of documents that contain term t. IDF(t, D) is a numerical measure of how much information a term provides; Spark computes it as IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), where |D| is the total number of documents in the corpus.
The product of these values, TF-IDF, shows how relevant a term is to a specific document (i.e., if it is common in that document but rare in the whole corpus).
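As a sketch of how these pieces connect (assuming `sc` is a SparkContext), HashingTF produces the TF vectors and the separate IDF class rescales them:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// A tiny corpus: each document is a sequence of words.
val documents: RDD[Seq[String]] = sc.parallelize(Seq(
  "spark is fast".split(" ").toSeq,
  "mllib makes machine learning on spark easy".split(" ").toSeq
))

val hashingTF = new HashingTF(numFeatures = 1000)
val tf: RDD[Vector] = hashingTF.transform(documents)

// IDF needs two passes: fit() computes document frequencies, transform() rescales the TF vectors.
tf.cache()
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)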

5.2 HashingTF.transform

val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF
Transforms the input document to term frequency vectors.

5.3 LabeledPoint

The LabeledPoint class resides in MLlib's mllib.regression package. A LabeledPoint consists simply of a label (which is always a Double value, but can be set to discrete integers for classification) and a features vector.
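For example (dense and sparse feature vectors both work):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A positive (spam) example with a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 3.0))

// A negative (ham) example with a sparse vector of size 3 where only index 2 is set.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(2), Array(1.0)))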

5.4 LogisticRegressionWithSGD
https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression

6. Let's build the project (see the screenshot below)

[Screenshot: building the project in IntelliJ]

7. Let's run it

[Screenshot: program output in the IntelliJ console]

val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))

val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))

We can see that posTestExample is classified as spam and labeled 1.0, while negTestExample is labeled 0.0. Our classifier is working!

Congratulations!

Reference:

Holden Karau et al., Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly, 2015.

Step by Step: Installing Apache Spark on Apache Hadoop

Hi guys,

This time, I am going to install Apache Spark on our existing Apache Hadoop 2.7.0.

Environment versions

OS: Ubuntu 15.04

Scala: 2.11.7

Spark: spark-1.4.0-bin-hadoop2.6.tgz

1. Install Scala (download the scala-2.11.7.deb package first; refer to this)
sudo apt-get remove scala-library scala
sudo dpkg -i scala-2.11.7.deb
sudo apt-get update
sudo apt-get install scala
2. Install Spark
wget http://apache.mirrors.ionfish.org/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
tar -zxvf spark-1.4.0-bin-hadoop2.6.tgz
mv spark-1.4.0-bin-hadoop2.6 /usr/local/spark
3. Get the Hadoop version
hadoop version
It should show 2.7.0
4. Add SPARK_HOME
sudo vi ~/.bashrc
Add the line:
export SPARK_HOME=/usr/local/spark
Then reload it:
source ~/.bashrc
5. Spark version
Since spark-1.4.0-bin-hadoop2.6.tgz is a prebuilt package for Hadoop 2.6.0 and later, it also works with Hadoop 2.7.0. Thus, we don't need to rebuild Spark with sbt or Maven, which is fairly involved. If you download the source code from the Apache Spark site and build it with the command
build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -DskipTests clean package
you are likely to hit lots of build-tool dependency problems.
So don't bother building Spark from source.
6. Let's verify our installation
cd $SPARK_HOME
7. Launch the Spark shell (refer to this)
./bin/spark-shell
[Screenshot: Spark shell startup output]
This means the Spark shell is running.
8. Test the Spark shell
scala> sc.parallelize(1 to 100).count()
Background info:
sc: the SparkContext, the main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
parallelize: distributes a local Scala collection to form an RDD.
count: returns the number of elements in the dataset.
[Screenshot: result of the count in the Spark shell]
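To see these pieces working together, here are a few more lines you can try in the same shell (a small sketch; map is a standard RDD transformation and count/reduce are actions):

scala> val nums = sc.parallelize(1 to 100)
scala> val squares = nums.map(x => x * x)   // map is lazy, nothing runs yet
scala> squares.count()                      // action: returns 100
scala> squares.reduce(_ + _)                // action: sum of the first 100 squares = 338350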
scala> exit
9. Let's try another typical example
bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[*] lib/spark-example* 10
The last argument, 10, is passed to the application's main method; here it is the number of slices used to calculate Pi.
[Screenshot: SparkPi output]
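For intuition, SparkPi estimates Pi with a Monte Carlo method; a simplified sketch of what the bundled example does (the slices argument controls how many partitions the random samples are split across):

// Throw n random darts at the square [-1, 1] x [-1, 1] and count how many
// land inside the unit circle; the fraction approaches Pi / 4.
val slices = 10
val n = 100000 * slices
val count = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / n}")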

Congratulations! We have finished the Spark installation, and next we can start using this powerful tool to perform data analysis and many other fun things.

Step by Step: Installing Single-Node YARN on Ubuntu

Hi guys,
I am back, and today let's talk about installing single-node YARN on Ubuntu. Some of you may have heard of Hadoop but may not know about YARN. Here's the explanation from Apache Hadoop YARN: “The fundamental idea of MRv2 (YARN, Yet Another Resource Negotiator) is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.”

Environment versions

OS: Ubuntu 15.04

JDK: 8

Hadoop: 2.7.0

1. Download VMware here and install it.

VMware lets you run a virtual machine in which you can install the Ubuntu Linux operating system. There is another similar product, VirtualBox from Oracle, which may cause some issues when loading the Ubuntu ISO. So let's use VMware here, and feel free to use VirtualBox if it works on your machine.

2. Go to the BIOS and turn on Virtualization Technology

It is a CPU feature that supports virtual machines, so whether you are using VMware or VirtualBox, you have to turn it on.

Here’s an example

[Screenshot: BIOS virtualization setting]

3. Download the Ubuntu Image.

You don't need to be afraid of the AMD64 suffix if you are using an Intel CPU; read this.

Ubuntu download

Either 14.04.2 or 15.04 is fine. I am using 15.04.

4. Load the .iso file

Here's a good tutorial about loading an image in VMware.

5. Ethernet connection issues

If you see the image below, especially the Ethernet connection highlighted in the red rectangle, your connection is fine. The virtual machine uses a virtual Ethernet adapter to connect through your host. If Ubuntu shows as disconnected while your host machine (Windows 7, for example) has an internet connection, simply choose "Disconnect" and then "Enable Networking" in Ubuntu. This reconnection sometimes resolves the issue.

[Screenshot: Ubuntu network indicator showing the Ethernet connection]

Let's start with Hadoop!

There are some very good tutorials on YouTube that install Hadoop 2.7.0, which already includes YARN.

I prefer the following one, which is simple and good enough for us.

I have copied and modified Mr. Chaalpritam's procedure from the second video and his blog.

1. Update packages in Ubuntu

sudo apt-get update

2. Install the Java Development Kit (JDK)

sudo apt-get install openjdk-8-jdk

3. Check Java version

java -version

4. Install the remote access tool SSH

sudo apt-get install ssh

5. Install rsync

The following command installs rsync, which is used for keeping copies of a file on two computer systems in sync. Ubuntu usually already ships the latest version.

sudo apt-get install rsync

6. Generate an SSH DSA key pair.
Some background information is here. If you read the link, you will see that your private key is saved as id_dsa and the public key as id_dsa.pub.

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

7. Append the public key id_dsa.pub to authorized_keys

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

8. Download hadoop 2.7.0

wget -c http://www.interior-dsgn.com/apache/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

9. Extract the binary tarball

sudo tar -zxvf hadoop-2.7.0.tar.gz

10. Move hadoop-2.7.0 to /usr/local/hadoop

sudo mv hadoop-2.7.0  /usr/local/hadoop

11. Find the JDK installation path (update-alternatives manages the java symlink; note the path for JAVA_HOME below)

update-alternatives --config java

12. Set up the Hadoop environment variables

sudo nano ~/.bashrc

Put the following lines at the end of .bashrc:

#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

13. Activate bashrc

source ~/.bashrc

14. Go to the Hadoop configuration directory

cd /usr/local/hadoop/etc/hadoop

15. Configure java home in hadoop-env.sh

open hadoop-env.sh

sudo nano hadoop-env.sh

Change export JAVA_HOME=${JAVA_HOME} to the following line:
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

16. Configure core-site.xml
This file holds the site-specific configuration for a given Hadoop installation.

sudo nano core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

17. Modify YARN configuration options

sudo nano yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

18. Change MapReduce configuration options

sudo cp mapred-site.xml.template mapred-site.xml

sudo nano mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

19. Adjust hdfs-site.xml
Set the HDFS configuration (namenode directory, datanode directory, and replication).

We are running a single-node cluster, so set the replication factor to 1 here.

sudo nano hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
</configuration>

20. Make the namenode and datanode directories

sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode

sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode

21. Change ownership of hadoop folder

This allows your local user to read and write the Hadoop directories (including the HDFS data).

$USER expands to your current username; you can also type the username explicitly.

sudo chown $USER:$USER -R /usr/local/hadoop

22. Format the namenode

hdfs namenode -format

23. Start all components.

start-all.sh

24. Check the components

jps

If you see all the components show up as in the following image, congratulations!

[Screenshot: jps output listing the Hadoop daemons]

Usually the SecondaryNameNode, ResourceManager, and NameNode run on the master machine, while the DataNode and NodeManager run on the slave machines. However, since we are setting up a single-node cluster, all the components run on one machine.

25. Let's run the classic WordCount MapReduce job on YARN.
25.1 Download and unzip the WordCount sample input file (refer to this).

wget http://www.gutenberg.org/files/4300/4300.zip

unzip 4300.zip

25.2 go to hadoop directory

cd /usr/local/hadoop

25.3 Create the /user folder in HDFS

bin/hdfs dfs -mkdir /user

If you see this warning:

[Screenshot: warning message]

Don't worry; you can read this for an explanation.

25.4 Create your user directory

bin/hdfs dfs -mkdir /user/$USERNAME

You can verify it by browsing the NameNode web UI on port 50070. My $USERNAME is robin.

[Screenshot: NameNode web UI showing /user/robin]

25.5 Put the input file from the local machine into HDFS (the Hadoop Distributed File System)

bin/hdfs dfs -put  /home/robin/Downloads/4300.txt inputfile

25.6 Use the prebuilt .jar to run the WordCount example.

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount inputfile outputfile

[Screenshot: the WordCount MapReduce job completing]

If you see this, congratulations! Our YARN installation is successful!

25.7 Get the output from HDFS to your local machine

bin/hdfs dfs -get /user/robin/outputfile    /home/robin/Downloads/wordcount_output

25.8 Review the result

[Screenshot: contents of the wordcount_output folder]

The wordcount_output folder contains:

_SUCCESS: On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to check whether a result set is complete just by inspecting HDFS.

part-r-00000: The output files are by default named part-x-yyyyy, where:

  • x is either ‘m’ or ‘r’, depending on whether the file was written by a map task or a reduce task
  • yyyyy is the mapper or reducer task number (zero based)

refer to this

Congratulations! We have successfully installed YARN on Ubuntu!

Step by Step: Building a Scala SBT Project in IntelliJ

Hi guys,

Let's talk about the first topic: how to build a simple sbt project in IntelliJ. I am actually a hardware engineer, so I will use newbie-level terms; feel free to skip ahead if you are already experienced.

Although Maven (a Java build tool) and Gradle are quite powerful build tools, sbt is the usual choice for Scala projects. So let's build an sbt project, on top of which we can write some Scala code for Apache Spark.

Platform versions

OS: Windows 7

Java: JDK 1.8.0_45

Scala: 2.11.7

IntelliJ: Community 14.1.4

Spark: 1.4.0

1. Install the Java JDK

http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

and set JAVA_HOME as shown in the following image.

[Screenshot: setting the JAVA_HOME environment variable]

Open cmd from the Start menu search box and type

java -version 

It should print the Java version.

[Screenshot: java -version output]

2. Install IntelliJ

IntelliJ is a powerful IDE, similar to Eclipse, which most of you may have heard of. However, IntelliJ has better support for Scala and Android development, so we choose IntelliJ here.

Please download and install it here,

https://www.jetbrains.com/idea/download/,

You will want the free Community edition, which is enough for us.

3. Build a Hello World sbt project

When you first open IntelliJ after installing it, the latest version (14.1.4) will offer to install the Scala plugin; select it. Then open IntelliJ and let's build a simple sbt project without Spark to verify that the installation is correct.

4. Choose an sbt project

[Screenshot: choosing an sbt project in IntelliJ]

5. Name the project

[Screenshot: naming the project]

6. Add the source code under src/main/scala/

[Screenshot: adding the source code under src/main/scala/]

7. Type in our first Scala App

[Screenshot: our first Scala App]
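If the screenshot isn't visible, a minimal first App along these lines works (the file and object names are just examples):

// src/main/scala/HelloWorld.scala
object HelloWorld extends App {
  println("Hello, world!")
}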

8. Run the code

[Screenshot: running the code]

9. Let’s see the result

[Screenshot: the program output]

10. Sometimes we may encounter version issues or missing Java class issues, which can often be resolved by clearing the IntelliJ cache. But remember to save your files before clearing the cache.

[Screenshot: clearing the IntelliJ cache]

Congratulations, we have finished our first Scala App in IntelliJ!