Study the Spam Classifier Code with MLlib on IntelliJ

Hi everyone, I hope you are doing well today. In this chapter, let's study and run the spam classifier code from Databricks.

1. Download the source code from GitHub; refer to the Databricks examples.

I have modified some of the files, so you can download the whole project from my GitHub.

2. Download the test files

spam sample

https://github.com/databricks/learning-spark/blob/master/files/spam.txt

non-spam sample

https://github.com/databricks/learning-spark/blob/master/files/ham.txt

3. Create a project following the procedure we discussed here.

4. Change build.sbt

We need to use a Scala 2.10.x version for Spark 1.4.0; see here.

name := "spam_classifier"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.4.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.4.0"

5. Let's study the code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object spam_classifier {

  def main(args: Array[String]) {
    // Run Spark in local standalone mode with 2 threads
    val conf = new SparkConf()
                   .setMaster("local[2]")
                   .setAppName("spam_classifier")
                   .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)

    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("spam.txt")
    val ham = sc.textFile("ham.txt")

    //(1)Feature extraction
    // Create a HashingTF instance to map email text to vectors of 100 features.
    val tf = new HashingTF(numFeatures = 100)
    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))

    //(2)Create Training Set
    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

    //(3) Training
    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)

    //(4) Create Testing Set
    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
    val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))

    //(5) Test results
    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${model.predict(negTestExample)}")

    sc.stop()
  }
}

5.1 About HashingTF
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
TFIDF(t, d, D) = TF(t, d) · IDF(t, D)

Term frequency TF(t, d) is the number of times that term t appears in document d, while document frequency DF(t, D) is the number of documents that contain term t. IDF(t, D), defined in MLlib as log((|D| + 1) / (DF(t, D) + 1)), is a numerical measure of how much information a term provides.

The product of these values, TFIDF, shows how relevant a term is to a specific document (i.e., it is high when the term is common in that document but rare in the whole corpus).
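Our classifier only uses the raw term frequencies from HashingTF, but MLlib can compute the full TF-IDF weighting as well. Here is a minimal sketch, assuming the same sc and spam.txt as in the program above:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Tokenize each email into a sequence of words.
val documents: RDD[Seq[String]] = sc.textFile("spam.txt").map(_.split(" ").toSeq)

// Term frequencies, hashed into 100 buckets as in the classifier.
val tf: RDD[Vector] = new HashingTF(numFeatures = 100).transform(documents)
tf.cache() // IDF makes two passes over the data: one to fit, one to transform.

// Scale the term frequencies by the inverse document frequencies.
val idfModel = new IDF().fit(tf)
val tfidf: RDD[Vector] = idfModel.transform(tf)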

5.2 HashingTF.transform

val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF
Transforms the input document to term frequency vectors.
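To get a feel for the hashing trick, here is a tiny sketch on a single made-up document (the words are just for illustration):

import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF(numFeatures = 100)

// Each word is hashed to a bucket index in [0, 100); a repeated word
// adds to the count in its bucket.
val vec = tf.transform(Seq("money", "money", "free"))

println(tf.indexOf("money")) // the bucket index that "money" hashes to
println(vec)                 // a sparse vector of bucket counts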

5.3 LabeledPoint

The LabeledPoint class in MLlib resides in the mllib.regression package. A LabeledPoint consists simply of a label (which is always a Double value, but can be set to discrete integers such as 0 and 1 for classification) and a feature vector.
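For example, labeled points can be constructed by hand (a sketch with made-up feature values):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A positive (spam) example with a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0))

// A negative (ham) example with a sparse feature vector:
// 3 entries in total, 1.0 at index 0 and 3.0 at index 2.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))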

5.4 LogisticRegressionWithSGD
https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
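The linked page also documents LogisticRegressionWithLBFGS, which usually converges faster than plain SGD. A short sketch of both learners, assuming the trainingData RDD from our program:

import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}

// Equivalent to new LogisticRegressionWithSGD().run(trainingData),
// with the number of SGD iterations made explicit.
val sgdModel = LogisticRegressionWithSGD.train(trainingData, 100)

// The LBFGS-based learner typically needs fewer iterations to converge.
val lbfgsModel = new LogisticRegressionWithLBFGS().run(trainingData)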

6. Let's build the project in IntelliJ


7. Let's run it


val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))

val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))

We can see that posTestExample is a spam email and is labeled 1.0, while negTestExample is labeled 0.0. Our classifier is working!
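As a quick sanity check beyond two hand-picked emails, we can also measure accuracy on the training set itself. Here is a sketch reusing model and trainingData from the program above (a real evaluation would use a held-out test set):

// Fraction of training examples the model labels correctly.
val accuracy = trainingData
  .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
  .mean()
println(s"Training accuracy: $accuracy")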

Congratulations!

Reference

Holden Karau et al., Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly Media, 2015
