My Learning Curve of Big Data and Data Analysis

Hi Guys,

Starting this year, I stepped into big data and began to self-study Hadoop. After taking some online courses and reading several books, I would really like to share these resources with you, and hopefully they will be beneficial to your big data/data mining study.

I. Linux

As an ASIC verification engineer, I work on Linux OS every day. Still, I am always trying to sharpen my Linux skills, so I read these two books.


1. Unix and Linux System Administration Handbook (4th Edition) by Evi Nemeth

A pretty good one for Linux beginners; I highly recommend it.

pdf version download

2. Linux Kernel Development (3rd Edition) by Robert Love

Frankly speaking, I believe this one is only suitable for those working on kernel or OS development.

II. Hadoop & Yarn

1. Books

Still, if you have time, starting from Hadoop rather than Spark and understanding its key features and architecture will strengthen the foundation of your big data knowledge. However, if you don’t have time, just start from Apache Spark, which we will talk about later. The best book about Hadoop for beginners is still this one.

Hadoop: The Definitive Guide, 4th Edition by Tom White.

I would say that if you read it twice and work through some hands-on exercises, you will beat Cloudera CCD410 (Cloudera Certified Developer for Apache Hadoop Exam). (The book’s pdf version download)


I did read it twice, but the reason I didn’t pursue the CCD410 certification is that I found out, “WOW”, there is a new tool named “Apache Spark”, with which you don’t need to write specific, ugly MapReduce Java code by hand; all of that is wrapped up and packaged inside Spark! We will talk about Spark later.

2. Online Courses/Introductions
(1 Coursera


Introduction to Data Science (University of Washington)

I didn’t take this one, because it’s a new course. Actually, it was the only result when I searched for the keyword “hadoop” on Coursera. But it looks good and you may like to try it.

(2 Udemy

There are plenty of courses on Udemy about Hadoop. If you like, pick the most enrolled and best-reviewed one. Try to search for some promotion codes, which can bring a course down to only 10 to 20 dollars.

(3 MapR Academy School


Highly recommended. Easy for beginners; you can even have popcorn while watching.

(4 Yahoo’s Hadoop introduction

Highly recommended. It’s a must read if you don’t have time to go through the Hadoop books.

3. About Sandbox
(1 Hortonworks Sandbox

Highly recommended for Hadoop beginners. They have set up the whole Hadoop ecosystem on it (including Hive, Oozie, Sqoop, Zookeeper…), which makes things easier for beginners. Moreover, there are a number of tutorials based on it.

(2 Hortonworks tutorials

There are a bunch of tutorials based on their Sandbox, and they are very easy to learn from and practice with.

(3 Cloudera QuickStart VM

This one is also decent and friendly to the Hadoop beginners.

(4 MapR Sandbox

Forget it, it needs at least 8GB of RAM!

4. What you really need to practice

(1 You really need to install a clean Ubuntu or similar Linux OS on VMware or VirtualBox and try to install single-node or multi-node Hadoop on it. There are plenty of tutorials on YouTube about installing Hadoop on Ubuntu.

(2 Alternatively, you can try to install Hadoop on AWS EC2, which makes it easier to build a multi-node cluster.

(3 Run the wordcount example on it and understand the basics of MapReduce Java code.

(4 Remember that the latest version of Hadoop uses MapReduce v2.0, called YARN. So try to install Hadoop 2.x.

(5 Hortonworks YARN tutorials
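To get a feel for what the wordcount example in step (3) actually computes, here is a minimal sketch of the map and reduce phases in plain Scala (no Hadoop needed; the input lines are made up for illustration):

```scala
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("hello big data", "hello hadoop")

    // "map" phase: emit a (word, 1) pair for every word in every line
    val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // "shuffle" groups the pairs by key; "reduce" sums the counts per word
    val counts = pairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }

    counts.toSeq.sortBy(_._1).foreach { case (word, count) => println(s"$word $count") }
  }
}
```

In real Hadoop the map and reduce steps run as separate tasks across the cluster, but the data flow is the same as this one-machine sketch.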

III. Spark

(1 Learning Spark: Lightning-Fast Big Data Analysis


Highly recommended. The whole book mainly uses Python and Scala, which makes it easier, and you can go through it very quickly if you already know Hadoop. The difficult part is RDD (Resilient Distributed Dataset) programming and manipulation.
I found Udemy more suitable for me than Coursera, because it is quicker to learn from (unlike Coursera, where you have to follow along for two months) and more practical (most of the lecturers are experienced engineers). Coursera’s advantages are that all the courses are highly organized and a persuasive certification is provided. There is no course on Coursera covering Apache Spark yet.
(2 Learn Spark from Scratch (Udemy)
Again, it is not really from scratch, lol. It is not as easy as you might think if you haven’t used IntelliJ, don’t know Hadoop at all, don’t understand stream processing, have no idea about SQL, databases and streaming, and haven’t used any build tools such as Maven, Gradle or SBT. Moreover, the tutorial is somewhat disorganized. But after resolving all these issues yourself, you will definitely learn a lot from this class.
(3 Apache Spark Official Website
For Apache Spark, I sincerely like its official website, which is well categorized and reduced my learning curve a lot, especially for MLlib. Moreover, they provide sample code in all three languages: Scala, Java and Python.
(4 Spark Summit 2014 materials
Some very valuable training materials from Spark Summit 2014.

IV. Scala

This new functional programming language is getting hot because parallel programming is well suited to cloud computing frameworks such as Hadoop and Spark. I am interested in Scala because Spark itself is written in Scala, so Spark applications can be written more concisely and can perform better.

Coursera courses

1. Functional Programming Principles in Scala — basic Scala

2. Principles of Reactive Programming — advanced Scala

These are decent Scala courses. Well, I have to admit that Scala (functional programming) is not easy to learn. But the point of studying a language is to practice, so getting ready to do some Spark coding in Scala would be an efficient way to master it.
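Since Spark’s API leans heavily on higher-order functions, a tiny warm-up in plain Scala covers most of what you will use day to day: map, filter, and fold over immutable collections (the numbers here are made up for illustration):

```scala
object FunctionalWarmup {
  def main(args: Array[String]): Unit = {
    val nums = List(1, 2, 3, 4, 5)

    // filter keeps a subset; map transforms every element
    val squaresOfEvens = nums.filter(_ % 2 == 0).map(n => n * n)
    println(squaresOfEvens) // List(4, 16)

    // foldLeft collapses a collection into a single value, starting from 0
    val sum = nums.foldLeft(0)(_ + _)
    println(sum) // 15
  }
}
```

These are exactly the shapes you will meet again as Spark’s RDD transformations and actions.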

V. Data analysis & Machine Learning

1. Coursera–Machine Learning by Andrew Ng
This online course has nearly become a must-take for machine learning beginners. Dr. Andrew Ng describes each algorithm in great detail and keeps the mathematical derivations relatively simple.
2. Udemy—Learning Python for Data Analysis and Visualization by Jose Portilla
Mr. Jose Portilla introduces many Python packages including NumPy, Pandas, Scikit-learn, Matplotlib, etc. His lecturing is extremely specific, and you will learn a lot about how to analyze datasets and data frames in the IPython Notebook.
3. Coursera—Data Science Certification by Johns Hopkins University
Johns Hopkins University provides a well-known data science certification on Coursera, which requires you to complete all 9 courses and costs about $430; however, it’s still free to access the courses without the certification. In addition, this is more like an expanded version of Introduction to Data Science (University of Washington). I recommend going through all the courses if you have time. I haven’t studied them myself, as some of them hadn’t started at the beginning of this year, but I plan to take a look when the classes begin, lol.
4. University of California, Berkeley Online Master’s Degree in Data Science Program
If you have enough funding or haven’t taken a Master’s degree yet, you could consider this online Master’s program, which takes 18 to 20 months and costs $60,000. Here are some reviews about it.
5. Udemy—Applied Data Science with R by V2 Maestros
I recently learned R on Udemy through this course, which not only provides a simple and quick introduction to the statistical programming language R for data science, but is also pretty well organized and suitable for R beginners like me.

VI. Not the End

These are all the materials I learned from in the first half of 2015. I am continuing my study and will very likely keep sharing my learning path with you. Thank you so much for reading this long article, and good luck with your big data learning!

Step by Step of Configuring Apache Spark to Connect with Cassandra

Hi guys,

This time, we will discuss how to access data/tables in Cassandra from Spark.

I. Prerequisites


Cassandra installed (tutorial)

Apache Spark installed (tutorial)

Intellij installed (tutorial)

Scala version 2.10.x (Apache Spark 1.4.0 only supports 2.10.x; refer to spark.apache.org)

If you haven’t installed these tools, feel free to refer to my tutorials in the previous sessions.

Note that we installed Cassandra and Spark on an Ubuntu machine in VMware Player, but our IntelliJ is installed on the local Win7 machine.

II. Apache Cassandra configurations

1. Open a terminal in Ubuntu and type

ifconfig

and you will find your IP address, e.g. 192.168.30.154.


2. Configure the Cassandra server

vi /usr/local/cassandra/conf/cassandra.yaml

This is the main configuration file (spec in DataStax).

There are two addresses we need to configure for our standalone mode run.

listen_address: 192.168.30.154

(192.168.30.154 is an example, you may need to change to your IP)

The other nodes in the cluster communicate with this node via listen_address.

rpc_address: 192.168.30.154

(192.168.30.154 is an example, you may need to change to your IP)

rpc_address specifies the IP or host name through which clients communicate.
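Putting the two settings together, the relevant part of cassandra.yaml looks like this (substitute your own IP):

```yaml
# cassandra.yaml — single-node, standalone setup
listen_address: 192.168.30.154   # inter-node communication
rpc_address: 192.168.30.154      # client (cqlsh / driver) connections
```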

3. Let’s start the Cassandra server.

cd /usr/local/cassandra/

bin/cassandra -f

4. Configure cqlsh

vi /usr/local/cassandra/bin/cqlsh

Search for DEFAULT_HOST and change the address to your local IP, e.g.

DEFAULT_HOST = '192.168.30.154'

5. Add some values to our previous Cassandra table (please refer to this)

We have a keyspace named “demo” and a table named users. Let’s insert some values into it.

INSERT INTO demo.users(user_name, birth_year) VALUES('Robin', 1987);

which means we have inserted the value Robin into the user_name column of the users table in the demo keyspace, and likewise the value 1987 into birth_year.
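You can double-check the insert from cqlsh before touching Spark at all. A quick sketch (assuming the demo.users schema from the earlier post, with user_name as the partition key):

```sql
-- connect with: bin/cqlsh 192.168.30.154
SELECT user_name, birth_year FROM demo.users;

-- or fetch a single row by its partition key
SELECT * FROM demo.users WHERE user_name = 'Robin';
```

If the row comes back here, any problem you hit later is in the Spark side of the setup, not in Cassandra.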

III. Build an sbt Spark project

1. Build a Spark project in IntelliJ (refer to here)

SBT dependency (build.sbt)

name := "SparkCassandra"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.4.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.4.0-M1"

2. A simple query program looks like this (refer to this)

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object SparkCassandra extends App {

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("SparkCassandra")
    //set Cassandra host address as your local address
    .set("spark.cassandra.connection.host", "192.168.30.154")
  val sc = new SparkContext(conf)
  //get the table from the keyspace as an RDD
  val rdd = sc.cassandraTable("demo", "users")
  //collect() would dump the whole RDD to the driver node (here, our machine),
  //which may crash it for a big table, so only take the first 100 rows
  //(our table is small, so this is just a precaution)
  val firstRows = rdd.take(100)
  firstRows.foreach(println)

  sc.stop()
}

3. The result is what we expected


Since we only populated user_name and birth_year, the rest of the columns return ‘null’.
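In Scala, the idiomatic way to deal with possibly-missing column values is Option rather than null checks. Here is a minimal sketch in plain Scala (no connector needed; the row is simulated as a Map, and the connector also offers typed Option getters that you should verify against its docs):

```scala
object NullColumns {
  def main(args: Array[String]): Unit = {
    // Simulate a row where some columns were never populated (null) or don't exist
    val row: Map[String, String] = Map("user_name" -> "Robin", "email" -> null)

    // Wrap nullable lookups in Option: Option(null) collapses to None
    def column(name: String): Option[String] = row.get(name).flatMap(Option(_))

    println(column("user_name").getOrElse("<null>")) // Robin
    println(column("email").getOrElse("<null>"))     // <null>
    println(column("state").getOrElse("<null>"))     // <null>
  }
}
```

The same pattern lets downstream code pattern-match on Some/None instead of sprinkling null checks everywhere.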

References:

1. http://blog.knoldus.com/2015/06/23/apache-spark-cassandra-basic-steps-to-install-and-configure-cassandra-and-use-it-with-apache-spark-with-example/

2. http://www.datastax.com/dev/blog/accessing-cassandra-from-spark-in-java

3. http://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10