My Learning Curve of Spark and Data Mining II

Hi guys,

I am back. Sorry I haven’t posted anything since September; I have been focused on my current job as a Django developer, working from the back end to the front end and even using some D3.js, lol.

Anyway, I am trying to keep studying big data and data mining in my free time, so here are the resources I have gone through in the past half year, especially on Apache Spark.

1. Data Mining

1.1 Web Data Mining pdf and Programming Collective Intelligence pdf


Although these two books are relatively old, they give a decent introduction to data mining on the web and to machine learning algorithms in Python, respectively, and are worth a quick look.

1.2 Stanford University Class CS246

The course presents machine learning algorithms very formally, with pdf downloads available. But since it concentrates entirely on algorithms and derivations, it can be a boring read.

1.3 Text Processing

Text Retrieval and Search Engines & Text Mining and Analytics


These two courses explain text processing solidly, with good explanations of nearly all the popular NLP algorithms. I strongly recommend going through them if you are interested in NLP.


2. Apache Spark

I passed the Apache Spark certification by Databricks and O’Reilly yesterday. It was not too hard (considering that I am currently a web developer, not a Spark developer), but many questions were still pretty puzzling. I am not allowed to share the specific questions, but I will recommend the public materials that are useful for preparing for it.

Please don’t take the test online: the testing software is extremely hard to install, and the laptop’s built-in camera is not allowed, so you have to buy a separate camera. I strongly recommend taking the test on site.

2.1  Learning Spark pdf


This book is still the bible of Apache Spark. You had better read it at least twice!

2.2 RDD original paper pdf

RDD is the core of Spark, and this is the paper in which RDDs were originally published. I strongly recommend reading Section 1 through Section 6.5.
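To give a rough feel for the main idea in the paper before you read it, here is a toy sketch in plain Python. It is not Spark and not the Spark API: `ToyRDD` and all of its methods are hypothetical names I made up for illustration. It only shows two things the paper describes: transformations are lazy, and each dataset remembers its lineage (how it was derived) so the data can be recomputed instead of stored.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class).

    It stores HOW to compute its data (parent + operation), not the data.
    """

    def __init__(self,
                 source: Optional[List] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[Iterable], Iterable]] = None,
                 name: str = "source"):
        self._source = source
        self._parent = parent
        self._op = op
        self.name = name

    def map(self, f: Callable) -> "ToyRDD":
        # Transformation: only records the step, runs nothing yet.
        return ToyRDD(parent=self, op=lambda it: (f(x) for x in it), name="map")

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: only records the step, runs nothing yet.
        return ToyRDD(parent=self, op=lambda it: (x for x in it if pred(x)),
                      name="filter")

    def lineage(self) -> List[str]:
        # Walk back through the parents to list how this dataset was derived.
        chain = []
        node: Optional["ToyRDD"] = self
        while node is not None:
            chain.append(node.name)
            node = node._parent
        return list(reversed(chain))

    def _compute(self) -> Iterable:
        # Rebuild the data from the source by replaying the lineage.
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._compute())

    def collect(self) -> List:
        # Action: this is the point where the recorded steps actually run.
        return list(self._compute())


calls = []  # records every element that square() is applied to


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(source=[1, 2, 3, 4, 5])
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)

print(calls)                    # [] because nothing has been computed yet
print(evens_squared.lineage())  # ['source', 'map', 'filter']
print(evens_squared.collect())  # [4, 16]
```

Calling `collect()` a second time replays the whole chain from the source again. In this toy version that is the only way to get the data back; the paper explains how real RDDs combine this kind of lineage with partitioning and optional in-memory persistence.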

2.3 Spark Summit 2014 to 2015

It is worth going through the slides (ppt and pdf) on the following websites. They should be easy to follow once you have read the Learning Spark book.

Another very good tutorial from Summit 2015 pdf


2.4 UC Berkeley CS 190 and CS 100

I suggest going through CS 100, Introduction to Big Data with Apache Spark.

2.5 UC Berkeley AMP Camp

It is worth taking a quick look at the training materials from past camps.

2.6 IBM Big Data University

Although this website’s UI design is lame and the completion certificates cannot be shared on LinkedIn, the following courses are all good enough. If you are only interested in Spark, Spark Fundamentals I and II are the ones for you. However, the sandbox from IBM requires over 10 GB of RAM...

Spark Fundamentals II – (BD097EN)

Spark Fundamentals I – (BD095EN)

Text Analytics Essentials (BD085EN)

Developing Distributed Applications Using ZooKeeper (BD065EN)

Hadoop Fundamentals I (BD001EN)

2.7 Databricks Spark Knowledge Git-Books

Spark Base pdf

Spark Reference Application pdf

2.8 Apache Spark API by La Trobe University pdf

Although the official Spark docs illustrate the API and its usage very well, this pdf document from La Trobe University explains each API method in great detail. I strongly recommend reading every example.

2.9 Advanced Analytics with Spark pdf


This book provides production-quality code; go through it if you have time.

2.10 Coursera: Hadoop Platform and Application Framework


I am taking this course from UCSD. It seems decent enough, although it mainly introduces Hadoop.


For Chinese Resources,



——No Commercial Use for all the Links and pdf provided above——

Hopefully all of these materials are beneficial to you. From my past half year as a web developer, I feel that although big data and data analysis are extremely hot right now, we have to consider the whole picture of the product: first understand the product and the market need, and only then think about how to apply data analysis techniques, or even big data.



