This year I stepped into big data and began to self-study Hadoop. After taking some online courses and reading several books, I would like to share these resources with you guys, and hopefully they will be beneficial to your big data/data mining study.
As an ASIC verification engineer, I already work on Linux. Still, I wanted to sharpen my Linux skills, so I read these two books.
1. Unix and Linux System Administration Handbook (4th Edition) by
A pretty good one for Linux beginners, and I highly recommend it.
2. Linux Kernel Development (3rd Edition) by Robert Love
Frankly speaking, I believe this one is only suitable for people who work on kernel or OS development.
II. Hadoop & Yarn
Still, if you have time, start from Hadoop rather than Spark; understanding its key features and architecture will solidify the foundation of your big data knowledge. However, if you don’t have time, just start from Apache Spark, which we will talk about later. The best Hadoop book for beginners is still this one.
Hadoop: The Definitive Guide (4th Edition) by Tom White.
I would say that if you read it twice and work through some hands-on exercises, you will beat the Cloudera CCD410 (Cloudera Certified Developer for Apache Hadoop) exam. (The book’s PDF version can be downloaded.)
I did read it twice, but the reason I didn’t pursue the CCD410 certification is that I found out, “WOW”, there is a new tool named “Apache Spark”, where you don’t need to write that ugly MapReduce-specific Java code yourself, because it is wrapped up and packaged inside Spark! We will talk about Spark later.
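To give a feel for why that is attractive, here is a rough sketch of word count written as one short chained pipeline. It is plain Python, not Spark code: it only imitates the flatMap / map / reduceByKey shape you would use on a Spark RDD, and the function name `word_count_pipeline` is something I made up for illustration.

```python
from collections import Counter
from itertools import chain


def word_count_pipeline(lines):
    """Count words in an iterable of text lines (plain-Python imitation, not Spark)."""
    # Like flatMap: split every line and flatten the results into one stream of words.
    words = chain.from_iterable(line.split() for line in lines)
    # Like map + reduceByKey: normalize each word, then tally occurrences per word.
    return dict(Counter(word.lower() for word in words))


if __name__ == "__main__":
    print(word_count_pipeline(["hello hadoop", "hello spark"]))
```

The whole job fits in two statements, compared with the separate mapper, reducer and driver classes you write for a classic MapReduce job.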
2. Online Courses/Introductions
Introduction to Data Science (University of Washington)
I didn’t take this one, because it’s a new course. Actually, it is the only course that came up when I searched Coursera for the keyword “hadoop”. But it looks good and you may like to try it.
There are plenty of courses on Udemy about Hadoop. If you like, pick the one with the most enrollments and reviews. Try to search for promotion codes, which can bring the cost of a course down to 10 to 20 dollars.
(3 MapR Academy School
Highly recommended. Easy for beginners; you can even have popcorn while watching.
Highly recommended. It’s a must-read if you don’t have time to go through the Hadoop books.
3. About Sandbox
Highly recommended for Hadoop beginners. They have set up the Hadoop ecosystem on it (including Hive, Oozie, Sqoop, Zookeeper…), which makes things easier for beginners. Moreover, there are a number of tutorials for it.
(2 Hortonworks tutorials
There are a bunch of tutorials based on their Sandbox, and they are very easy to learn from and practice with.
(3 Cloudera QuickStart VM
This one is also decent and friendly to Hadoop beginners.
(4 MapR Sandbox
Forget it, it needs at least 8GB RAM!
4. What you really need to practice
(1 You really need to install a clean Ubuntu or similar Linux OS on VMware or VirtualBox and try to install Hadoop on it in single-node or multi-node mode. There are plenty of tutorials on YouTube about installing Hadoop on Ubuntu.
(2 Alternatively, you can try to install Hadoop on AWS EC2, where it is easier to build a multi-node cluster.
(3 Run the wordcount example on them and understand the basics of the MapReduce Java code.
(4 Remember that the latest version of Hadoop is v2.0, where MapReduce v2 runs on YARN. So try to install Hadoop v2.0.
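Before reading the Java wordcount code, it may help to see the three phases it goes through. The following is only a minimal sketch in plain Python, not Hadoop code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and simply mirror what the mapper, the framework's shuffle, and the reducer do.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum all the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["hello hadoop", "hello yarn"]))
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network, but the logic per record is about this simple.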
(1 Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau
(2 Learn Spark from Scratch (Udemy)
Scala, a relatively new functional programming language, is getting hot because parallel programming is well suited to cloud computing platforms such as Hadoop and Spark. I am interested in Scala because Spark itself is written in Scala, so Spark applications can be written more efficiently in it and may perform better.
1. Functional Programming Principles in Scala (basic Scala)
2. Principles of Reactive Programming (advanced Scala)
Those are decent Scala courses. Well, I have to admit that Scala (functional programming) is not easy to learn. But the point of studying a language is to practice, so doing some Spark coding in Scala would be an efficient way to master it.