ASIC Verification Summarization

Since I am moving toward big data/data mining, I would like to summarize my experience with ASIC verification and share some resources that may be helpful to you guys.

1. ASIC_Verilog_Interview_Questions

Here is my summary of Verilog interview questions if you want to be an ASIC designer.

2. SystemVerilog Interview Questions

My summary of SystemVerilog interview questions if you want to be an ASIC verification engineer.

3. UVM interview questions

The best resource for now is simply to review this website.

4. Other resources

4.1 Websites

test-bench

ASIC world

Basic-UVM by Mentor Graphics

Advanced-UVM by Mentor Graphics

4.2 Books


1) SystemVerilog for Verification: A Guide to Learning the Test-bench Language Features

It's a must-read to get into the ASIC verification world. If you read it twice, you can crush any SystemVerilog interview question. (For the pdf download, press "3".)

2) The UVM Primer: A Step-by-Step Introduction to the Universal Verification Methodology

A decent entry-level UVM book with source code provided; you can build and learn from a complete, simple UVM test-bench.

My Learning Curve of Big Data and Data Analysis

Hi Guys,

Starting this year, I stepped into big data and began self-studying Hadoop. After taking some online courses and reading several books, I would really like to share these resources with you guys, and hopefully they will be beneficial to your big data/data mining study.

I. Linux

As an ASIC verification engineer, I work on a Linux OS; however, I am still trying to sharpen my Linux skills, so I read these two books.


1. Unix and Linux System Administration Handbook (4th Edition) by  Evi Nemeth

A pretty good one for Linux beginners; I highly recommend it.

pdf version download

2. Linux Kernel Development (3rd Edition) by Robert Love

Frankly speaking, I believe this one is only suitable for those who are working on kernel or OS development.

II. Hadoop & Yarn

1. Books

Still, if you have time, starting from Hadoop rather than Spark and understanding its key features and architecture will strengthen the foundation of your big data knowledge. However, if you don't have time, just start from Apache Spark, which we will talk about later. The best Hadoop book for beginners is still this one.

Hadoop the definitive guide 4th Edition by Tom White.

I would say that if you read it twice and work through some hands-on exercises, you will beat the Cloudera CCD410 (Cloudera Certified Developer for Apache Hadoop) exam. (The book's pdf version download)


I did read it twice, but the reason I didn't pursue the CCD410 certification is that I found out, "WOW", there is a new tool named Apache Spark, with which you don't need to write ugly MapReduce Java code by hand, because all of that is wrapped up and packaged inside Spark! We will talk about Spark later.

2. Online Courses/Introductions
(1 Coursera


Introduction to Data Science (University of Washington)

I didn't take this one because it's a new course. Actually, it is the only result when searching the keyword "hadoop" on Coursera, but it looks good and you may want to try it.

(2 Udemy

There are plenty of courses about Hadoop on Udemy. If you like, pick the one with the most enrollments and reviews. Try searching for promotion codes, which can bring the cost of a course down to 10 to 20 dollars.

(3 MapR Academy School


Highly recommended. Easy for beginners; you can even have popcorn while watching.

(4 Hadoop Yahoo introduction

Highly recommended. It's a must-read if you don't have time to go through the Hadoop books.

3. About Sandbox
(1 Hortonworks Sandbox

Highly recommended for Hadoop beginners. The Hadoop ecosystem (including Hive, Oozie, Sqoop, ZooKeeper, …) is already set up on it, which makes things easier for beginners. Moreover, there are a number of tutorials built around it.

(2 Hortonworks tutorials

There are a bunch of tutorials based on their Sandbox, and they are very easy to learn from and practice with.

(3 Cloudera QuickStart VM

This one is also decent and friendly to the Hadoop beginners.

(4 MapR Sandbox

Forget it; it needs at least 8 GB of RAM!

4. What you really need to practice

(1 You really need to install a clean Ubuntu or similar Linux OS on VMware or VirtualBox and try to install Hadoop in single-node or multi-node mode on it. There are plenty of tutorials on YouTube about installing Hadoop on Ubuntu.

(2 Alternatively, you can try to install Hadoop on AWS EC2, which makes it easier to build a multi-node cluster.

(3 Run the wordcount example on it and understand the basics of the MapReduce code (a minimal Python sketch follows this list).

(4 Remember that the latest version of Hadoop uses MapReduce v2.0, called YARN, so try to install Hadoop 2.x.

(5 Hortonworks YARN tutorials
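As promised in item (3), here is a minimal word-count sketch that uses Hadoop Streaming with Python instead of hand-written Java MapReduce. This is a hedged example: the script names are my own, and you submit them through the hadoop-streaming jar that ships under share/hadoop/tools/lib in a Hadoop 2.x distribution.

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- sum the counts per word (Hadoop sorts the mapper output by key first)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))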

III. Spark

(1 Learning Spark: Lightning-Fast Big Data Analysis


Highly recommended. The whole book mainly uses Python and Scala, which makes it easier, and you can go through it very quickly if you already know Hadoop. The difficult part is RDD (Resilient Distributed Dataset) programming and manipulation.
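To give a feel for what that RDD programming looks like, here is a tiny PySpark sketch (made-up numbers, not an example from the book): transformations such as filter and map are lazy, and nothing runs until an action like collect is called.

from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDBasics")
numbers = sc.parallelize(list(range(1, 11)))              # build an RDD from a local list

evens_squared = (numbers.filter(lambda x: x % 2 == 0)     # transformation: keep even numbers
                        .map(lambda x: x * x))            # transformation: square them

print(evens_squared.collect())                            # action: triggers the computation
sc.stop()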
I found Udemy more suitable for me than Coursera, because it is quicker to learn from (unlike Coursera, where you have to follow a course for about two months) and more practical (most of the lecturers are experienced engineers). Coursera's advantages are that all of its courses are highly organized and a persuasive certification is provided. There is no course on Coursera covering Apache Spark yet.
(2 Learn Spark from Scratch (Udemy)
Again, it is not really from scratch, lol. It is not as easy as you might think if you have never used IntelliJ, don't know Hadoop at all, don't understand stream processing, have no idea about SQL, databases, and streaming, and haven't used any build tools such as Maven, Gradle, or SBT. Moreover, the tutorial is somewhat loosely organized. But after resolving all these issues yourself, you will definitely learn a lot from this class.
(3 Apache Spark Official Website
For Apache Spark, I sincerely like its official website, which is well categorized and flattened my learning curve a lot, especially for MLlib. Moreover, they provide sample code in all three languages: Scala, Java, and Python.
(4 Summit 2014 materials
Some very valuable training materials from Spark Summit 2014.

IV. Scala

This relatively new functional programming language is getting hot because parallel programming is well suited to cloud computing frameworks such as Hadoop and Spark. I am interested in Scala because Spark is written in Scala, so Spark applications can be written more concisely and can achieve better performance.

Coursera courses

1. Functional Programming Principles in Scala —–Basic Scala

2. Principles of Reactive Programming —-Advanced Scala

Those are decent Scala courses. Well, I have to admit that Scala (functional programming) is not easy to learn, but the point of studying a language is to practice, so getting ready to do some Spark coding in Scala would be an efficient way to master it.

V. Data analysis & Machine Learning

1. Coursera–Machine Learning by Andrew Ng
This online course has nearly become a MUST-take for machine learning beginners. Dr. Ng describes each algorithm in great detail and keeps the mathematical derivations relatively simple.
2. Udemy—Learning Python for Data Analysis and Visualization by Jose Portilla
Mr. Portilla introduces many Python packages, including NumPy, pandas, scikit-learn, Matplotlib, etc. His lecturing is extremely specific, and you will learn a lot about how to analyze datasets and data frames in the IPython Notebook.
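As a taste of that workflow, here is a tiny pandas sketch (made-up data, not from the course) of the kind of DataFrame analysis you end up doing in the notebook:

import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["SF", "SF", "LA", "LA"],
                   "temp": [58, 61, 75, np.nan]})   # NaN shows how missing data is handled

print(df.describe())                                # quick summary statistics
print(df.groupby("city")["temp"].mean())            # split-apply-combine on the DataFrame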
3. Coursera—Data Science Certification by Johns Hopkins University
Johns Hopkins University provides a well-known data science certification on Coursera, which requires you to complete all 9 courses and costs about $430, while it is still free to access the courses without the certification. In addition, this is more like an expanded version of Introduction to Data Science (University of Washington). I recommend going through all the courses if you have time. I haven't studied these courses, as some of them had not started at the beginning of this year, but I plan to take a look when classes begin, lol.
4. University of California, Berkeley Online Master's Degree in Data Science Program
If you have enough funding or haven't earned a master's degree yet, you could consider this online master's degree program, which takes 18 to 20 months and costs $60,000. Some reviews about it.
5. Udemy—Applied Data Science with R by V2 Maestros
I recently learned R on Udemy through this course, which not only provides a simple and quick introduction to the statistical programming language R for data science, but is also pretty well organized and suitable for R beginners like me.

VI. Not the End

These are all the materials I studied in the first half of 2015. I am continuing my study and will very likely keep sharing my learning path with you guys. Thank you so much for reading this long article, and good luck with your big data learning!

Step by Step of Configuring Apache Spark to Connect with Cassandra

Hi guys,

This time, we will discuss how to access data and tables in Cassandra from Spark.

I. Prerequisites

Cassandra installed (tutorial)

Apache Spark installed (tutorial)

Intellij installed (tutorial)

Scala version 2.10.x (Apache Spark 1.4.0 only supports Scala 2.10.x; refer to spark.apache.org)

If you haven't installed these tools, feel free to refer to my tutorials in the previous posts.

Note that we installed Cassandra and Spark on an Ubuntu machine running in VMware Player, while IntelliJ is installed on the local Windows 7 machine.

II. Apache Cassandra configurations

1. Open a terminal in Ubuntu and type

ifconfig

You will find your IP address, e.g., 192.168.30.154.


2. Configure the Cassandra server

vi /usr/local/cassandra/conf/cassandra.yaml

This is the main configuration file (spec at DataStax).

There are two addresses we need to configure for our standalone mode run.

listen_address: 192.168.30.154

(192.168.30.154 is an example, you may need to change to your IP)

The other nodes in the cluster communicate with this node using listen_address.

rpc_address: 192.168.30.154

(192.168.30.154 is an example, you may need to change to your IP)

rpc_address specifies the IP or host name through which clients communicate.

3. Let's start the Cassandra server.

cd /usr/local/cassandra/

bin/cassandra -f

4. Configure cqlsh

vi /usr/local/cassandra/bin/cqlsh

Search for DEFAULT_HOST and change this address to your local IP, e.g.,

DEFAULT_HOST = '192.168.30.154'

5. Add some values to our previous Cassandra table (please refer to this).

We have a keyspace named “demo” and a table named users. Let’s insert some values into it.

INSERT INTO demo.users(user_name, birth_year) VALUES('Robin', 1987);

This inserts the value 'Robin' into the column user_name of the table users in the keyspace demo, and likewise the value 1987 into birth_year.

III. Build sbt spark project

1. Build a Spark Project on Intellij (refer to here)

SBT dependency (build.sbt)

name := "SparkCassandra"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.4.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.4.0-M1"

2. A simple query looks like this (refer to this):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object SparkCassandra extends App {

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("SparkCassandra")
    //set Cassandra host address as your local address
    .set("spark.cassandra.connection.host", "192.168.30.154")
  val sc = new SparkContext(conf)
  // load the users table from the demo keyspace as an RDD
  val rdd = sc.cassandraTable("demo", "users")
  // collect() would pull the entire RDD back to the driver node (our machine), which
  // could crash it for a large table, so we only take the first 100 rows here
  // (our table is tiny, so this is just a safe habit)
  val firstRows = rdd.take(100)
  firstRows.foreach(println)

  sc.stop()
}

3. The result is what we expected


Since we only populated user_name and birth_year, the rest of the columns return null.

Reference,

1. http://blog.knoldus.com/2015/06/23/apache-spark-cassandra-basic-steps-to-install-and-configure-cassandra-and-use-it-with-apache-spark-with-example/

2. http://www.datastax.com/dev/blog/accessing-cassandra-from-spark-in-java

3. http://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10

Study Apache Spark MLlib on IPython—Clustering—GMM

Hi Guys,

Today, let's talk about the GMM (Gaussian Mixture Model) and how to use it in scikit-learn and Spark.

I. Background Knowledge

Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model’s parameters.

The expectation-maximization (EM) algorithm is an iterative method for finding maximum-likelihood estimates when the model depends on unobserved latent variables. The EM iteration alternates between an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current parameter estimates, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

GMM

1. Objective function: maximize the log-likelihood, using EM as the framework.
2. EM algorithm:
          E-step: compute the posterior probability of membership for each point (a soft assignment).
          M-step: optimize the model parameters given those memberships.
The concrete update equations are sketched just below.
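For reference, here are the standard textbook GMM updates behind those two steps (this is the generic form, not MLlib's specific implementation); \gamma_{ik} denotes the posterior probability that point x_i belongs to component k:

E-step: \gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

M-step: N_k = \sum_{i=1}^{N} \gamma_{ik}, \quad \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} x_i, \quad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T, \quad \pi_k = \frac{N_k}{N}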
Here’s a good youtube tutorial to tell the difference between K-means, GMM and EM.

Some definitions and terms from wiki

1. N random variables corresponding to observations, each assumed to be distributed according to a mixture of K components, with each component belonging to the same parametric family of distributions (e.g., all normal, all Zipfian, etc.) but with different parameters
2. N corresponding random latent variables specifying the identity of the mixture component of each observation, each distributed according to a K-dimensional categorical distribution
 3. Normal (Gaussian) distribution: the notation is \mathcal{N}(\mu,\,\sigma^2), where \mu \in \mathbb{R} is the mean (location) and \sigma^2 > 0 is the variance (squared scale). Therefore, if the GMM model returns \mu and \sigma^2 for each component, we can plot the mixed distribution functions.

II. Mathematics Proof

Here are some very detailed and relatively easy-to-understand videos illustrating the GMM and EM algorithms:

GMM

EM-part1

EM_part2

EM_part3

EM_part4

III. IPython Code (Github)

(Notebook screenshots omitted. The contour plot in the notebook indicates the envelope of the two GMM component distributions; the second half of the code is based on PySpark.)
We can see that both the scikit-learn and the PySpark models indicate a similar mixture of GMM components.
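For readers who skip the notebook, here is a hedged sketch of the two fits side by side; the data file name is a placeholder, and depending on your scikit-learn version the estimator may be called GMM instead of GaussianMixture.

import numpy as np
from sklearn.mixture import GaussianMixture          # called GMM in older scikit-learn
from pyspark import SparkContext
from pyspark.mllib.clustering import GaussianMixture as MLlibGMM

data = np.loadtxt("gmm_data.txt")                    # placeholder: one 2-D point per line

# scikit-learn fit
skl = GaussianMixture(n_components=2).fit(data)
print(skl.weights_, skl.means_)

# PySpark MLlib fit on the same points
sc = SparkContext("local[*]", "GMMExample")
rdd = sc.parallelize(data.tolist())
mllib_model = MLlibGMM.train(rdd, k=2)
for w, g in zip(mllib_model.weights, mllib_model.gaussians):
    print(w, g.mu, g.sigma)                          # weight, mean vector, covariance matrix
sc.stop()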
Reference
1. http://blog.csdn.net/zouxy09/article/details/8537620 (in Chinese)
2. http://stackoverflow.com/questions/23546349/loading-text-file-containing-both-float-and-string-using-numpy-loadtxt
3. https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
4. http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_pdf.html#example-mixture-plot-gmm-pdf-py
5. https://github.com/apache/spark/blob/master/data/mllib/gmm_data.txt
6. https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model

Study Apache Spark MLlib on IPython—Clustering—K-Means

Hi guys,

Today, let's study K-means, which is the most commonly used clustering method.

I. Clustering vs Classification

For those puzzled about the difference between classification and clustering (refer to):

Classification– The task of assigning instances to pre-defined classes. (Supervised)
–E.g. Deciding whether a particular patient record can be associated with a specific disease.

Clustering – The task of grouping related data points together without labeling them. (Unsupervised)
–E.g. Grouping patient records with similar symptoms without knowing what the symptoms indicate.

II. Types of Clustering in MLlib

1. Partitioning Approach

Partition the dataset into clusters directly and iteratively refine the partition under the same evaluation criterion until it converges.

K-Means

2. Model Based Approach

Estimate a distribution model and then find the best fit.

GMM (Gaussian Mixture Model)

LDA (Latent Dirichlet allocation—text processing)

3. Dimensionality Reduction Approach

Reduce the dimensionality first and then cluster.

PIC (Power Iteration Clustering – graph processing)

4. K-Means

Suppose that we have n example feature vectors x_1, x_2, …, x_n, all from the same class, and we know that they fall into k compact clusters, k < n. Let m_i be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if ||x - m_i|| is the minimum of all the k distances. This suggests the following procedure for finding the k means (reference):

  • Make initial guesses for the means m_1, m_2, …, m_k
  • Until there are no changes in any mean
    • Use the estimated means to classify the examples into clusters
    • For i from 1 to k
      • Replace m_i with the mean of all of the examples for cluster i
    • end_for
  • end_until
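The quantity this procedure drives down is the within-set sum of squared errors, the WSSSE metric compared at the end of this post:

WSSSE = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - m_i \|^2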

1. Good and short tutorial about K-Means

2. scikit-learn K-means

3. Apache MLlib K-means

Let’s look at the IPython notebook. Github source code


We can see that these two models have a similar WSSSE (Within Set Sum of Squared Errors) value: scikit-learn gives 2470.602, whereas the MLlib result is 2450.403.
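A hedged sketch of how such a comparison could be set up (the data file and k below are placeholders, not the notebook's exact configuration):

import numpy as np
from sklearn.cluster import KMeans
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans as MLlibKMeans

X = np.loadtxt("kmeans_data.txt")              # placeholder: one point per line

# scikit-learn: inertia_ is exactly the within-set sum of squared errors
skl = KMeans(n_clusters=3).fit(X)
print("scikit-learn WSSSE:", skl.inertia_)

# PySpark MLlib: computeCost() (Spark 1.4+) returns the same WSSSE metric
sc = SparkContext("local[*]", "KMeansExample")
rdd = sc.parallelize(X.tolist())
model = MLlibKMeans.train(rdd, k=3, maxIterations=20)
print("MLlib WSSSE:", model.computeCost(rdd))
sc.stop()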

Reference,

1. https://spark.apache.org/docs/latest/mllib-clustering.html#k-means

2. http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html

3. http://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/k_means.htm

Study Apache Spark MLlib on IPython—Regression & Classification—Random Forest & GBTs

Hi guys,

Today, let's study the ensemble-of-trees algorithms: Random Forest and Gradient Boosted Trees (GBTs).

I. Background Knowledge 

1. Random Forest builds several estimators independently and then averages their predictions. Its main idea is to reduce variance.

2. GBTs combine several weak models to produce a powerful ensemble, which aims to reduce bias.

FYI: in MLlib, GBTs don't support multiclass classification yet.

3. Difference between variance and bias


4. Here’s a good tutorial from DataBricks

II. IPython Code (Github)

Let's look at the IPython code here. We are using the same dataset and procedure as in the decision tree post.

And we will find out whether our dataset suffers more from bias or from variance.

(scikit-learn notebook screenshots omitted.)

In PySpark:

(PySpark notebook screenshot omitted.)

In conclusion, both the scikit-learn random forest model and the MLlib one perform better than the GBTs. Therefore, we can conclude that the dataset we used is more variance-prone. Besides, the fact that the scikit-learn random forest model and the MLlib model agree with each other suggests that both results are reliable.
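For reference, a hedged PySpark-only sketch of this comparison (the libsvm file name is a placeholder; the tree counts and iterations are arbitrary small values, not the notebook's settings):

from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest, GradientBoostedTrees
from pyspark.mllib.util import MLUtils

sc = SparkContext("local[*]", "EnsembleComparison")
data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
train, test = data.randomSplit([0.7, 0.3])

def test_error(model):
    # predict on the features only, then pair the predictions back with the true labels
    preds = model.predict(test.map(lambda p: p.features))
    pairs = test.map(lambda p: p.label).zip(preds)
    return pairs.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())

rf = RandomForest.trainClassifier(train, numClasses=2,
                                  categoricalFeaturesInfo={}, numTrees=10)
gbt = GradientBoostedTrees.trainClassifier(train, categoricalFeaturesInfo={},
                                           numIterations=10)
print("Random forest test error:", test_error(rf))
print("GBT test error:", test_error(gbt))
sc.stop()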

Reference

1. http://scikit-learn.org/stable/modules/ensemble.html

2. https://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests

3. https://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts

4. http://scott.fortmann-roe.com/docs/BiasVariance.html


Study Apache Spark MLlib on IPython—Regression & Classification—Decision Tree

Hi guys,

Today, let's study the Decision Tree algorithm and see how to use it in Python with scikit-learn and MLlib. Decision trees are also the foundation of ensemble algorithms such as Random Forest and Gradient Boosted Trees.

I. Background Knowledge

For decision trees, here are some basic concept background links.

1. What is ID3 (KeyWord: Information gain)

2. What is C4.5 (KeyWord: information gain ratio)

3. What is CART (Keyword: Gini coefficient)

4. Difference between ID3, C4.5 and CART

5. MLlib's decision tree uses ID3 with CART; see this debate

6. Apache Spark MLlib decision tree (note that decision trees support both regression and classification)

II. LibSVM Datatype

1. In MLlib

LibSVM is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format (refer to Apache MLlib):

label index1:value1 index2:value2 ...

For more detail info, please refer to this github dataset

2. In sklearn

Please refer to scikit-learn load_libsvm

Note that we cache the parsed libsvm file with a Python decorator to increase performance.
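A hedged sketch of that caching pattern (the cache directory and file name are placeholders; in older scikit-learn versions Memory is imported from sklearn.externals.joblib):

from joblib import Memory
from sklearn.datasets import load_svmlight_file

mem = Memory("./libsvm_cache")        # on-disk cache directory

@mem.cache
def load_data(path):
    # the expensive libsvm parsing runs once; later calls hit the cache
    return load_svmlight_file(path)

X, y = load_data("sample_libsvm_data.txt")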

III. Code

Let’s look at the IPython Code here. Github source code

Note that the functionality of RDD.zip() is to pair up the elements of the former RDD with those of the latter one and return them as (key, value) pairs.
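A hedged sketch of how that zip() pattern is typically used with an MLlib decision tree (the libsvm path and tree parameters are placeholders, not the notebook's exact values):

from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

sc = SparkContext("local[*]", "DecisionTreeExample")
data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
train, test = data.randomSplit([0.7, 0.3])

model = DecisionTree.trainClassifier(train, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     impurity="gini", maxDepth=5)

# zip() pairs each true label with the corresponding prediction
predictions = model.predict(test.map(lambda p: p.features))
labels_and_preds = test.map(lambda p: p.label).zip(predictions)
accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
print("Test accuracy:", accuracy)
sc.stop()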


Note that the MLlib model gives nearly ideal accuracy.

The decision tree pdf we generated looks like this: CART_decision_tree


Reference

1. http://scikit-learn.org/stable/modules/tree.html

2. http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html#example-tree-plot-iris-py

3. https://spark.apache.org/docs/latest/mllib-decision-tree.html#classification

4. http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html

5. http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html#sklearn.datasets.load_svmlight_file

Study Apache Spark MLlib on IPython—Classification—Naive Bayes

Hi guys,

Today, let's study Naive Bayes. Bayes' formula is well known to all of us, but the way it is applied to classification may puzzle you. Here's a very brief tutorial about it from Andrew Ng, which I highly recommend you take a look at (only 10 minutes!).

1. Background on the different types of naive Bayes classifiers, which mainly differ in the assumed distribution of P(x_i \mid y).

2. MLlib supports multinomial naive Bayes and Bernoulli naive Bayes; they are typically used for document classification.

3. One point of common sense: the "naive" in naive Bayes refers to assuming that every pair of features is conditionally independent.

Multinomial naive Bayes: in the document-classification context, each observation is a document and each feature represents a term whose value is the frequency of that term.

Bernoulli naive Bayes: each feature is a zero or one indicating whether the term appears in the document.

Let's look at the IPython code. The Github repo is here.

We will still use the classic Iris dataset here with all three categories.


Next, let's use the MLlib model.


We can see that the accuracy of the MLlib naive Bayes model is similar to that of the Bernoulli naive Bayes in the scikit-learn package, while the multinomial one gets better accuracy. This is mostly because the Iris dataset features are not integers; if we converted them to integers, the accuracy would improve.
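For reference, a hedged scikit-learn-only sketch of that comparison (the split ratio and random seed are arbitrary; in older scikit-learn versions train_test_split lives in sklearn.cross_validation):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

for clf in (MultinomialNB(), BernoulliNB()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, "accuracy:", clf.score(X_test, y_test))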

Reference,

1. https://spark.apache.org/docs/latest/mllib-naive-bayes.html

2. http://matplotlib.org/api/pyplot_api.html?highlight=scatter#matplotlib.pyplot.scatter

3. http://scikit-learn.org/stable/modules/naive_bayes.html

Step by Step of Installing Apache Cassandra on Ubuntu in Standalone Mode

Hi Guys,

Today, let's take a look at a NoSQL database, Apache Cassandra, which is an open-source distributed database management system initially developed at Facebook.

I. Apache Cassandra Introduction

Very good study material from DataStax:

1) Some basic concepts we should pay attention to

commit log – please see the "Writing and Reading" section below.

memtable – please see the "Writing and Reading" section below.

SSTable – please see the "Writing and Reading" section below.

Bloom filter – checks the probability of an SSTable having the needed data.

sharding – relational databases and some NoSQL systems require manual, developer-driven methods for distributing data across the multiple machines of a database cluster.

partitioner – determines how data is distributed across the nodes that make up a database cluster. In short, a partitioner is a hashing mechanism that takes a table row's primary key, computes a numerical token for it, and then assigns it to one of the nodes in the cluster in a way that is predictable and consistent.

keyspaces – analogous to Microsoft SQL Server and MySQL databases, or Oracle schemas.

replication – configured at the keyspace level, allowing different keyspaces to have different replication models.

replication factor – the total number of data copies that are replicated. A replication factor of 1 means that there is only one copy of each row in the cluster, whereas a replication factor of 3 means three copies of the data are stored across the cluster.

2) Architecture

In Cassandra, all nodes play an identical role; there is no concept of a master node, with all nodes communicating with each other via a distributed, scalable protocol called “gossip.”

Cassandra's built-for-scale architecture means that it is capable of handling large amounts of data and thousands of concurrent users or operations per second, even across multiple data centers, as easily as it can manage much smaller amounts of data and user traffic.

Cassandra’s architecture also means that, unlike other master-slave or sharded systems, it has no single point of failure and therefore is capable of offering true continuous availability and uptime.

Here's an example image (refer to the DataStax pdf).

3) Writing and Reading

Data written to a Cassandra node is first recorded in an on-disk commit log and then written to a memory-based structure called a memtable.

When a memtable’s size exceeds a configurable threshold, the data is written to an immutable file on disk called an SSTable.

Buffering writes in memory in this way allows writes to always be fully sequential operations, with many megabytes of disk I/O happening at the same time rather than one write at a time over a long period.

Write Path

For a read request, Cassandra consults an in-memory data structure called a Bloom filter that checks the probability of an SSTable having the needed data.

If the answer is a tentative yes, Cassandra consults another layer of in-memory caches and then fetches the compressed data from disk. If the answer is no, Cassandra doesn't bother reading that SSTable at all and moves on to the next.

Read Path

4) About CQL (Cassandra Query Language)

Cassandra Query Language (CQL) is the primary API used for interacting with a Cassandra cluster. CQL resembles the standard SQL used by relational databases, so it will feel familiar if you already know SQL.

5) Here are some well-known NoSQL database comparisons

1. A comparison of eleven kinds of NoSQL databases.

2. Cassandra vs MongoDB vs Neo4j

Note that Cassandra supports MapReduce, which makes it well suited to Hadoop and Spark, and it uses a token-ring architecture, which is more reliable compared to HBase, the basic column-family database.

II. Let’s start our installation

ENV version

OS- Ubuntu-15.04

Apache Cassandra-2.1.7

1. Download and unzip Apache Cassandra

wget http://www.apache.org/dyn/closer.cgi?path=/cassandra/2.1.7/apache-cassandra-2.1.7-bin.tar.gz

tar -xvzf apache-cassandra-2.1.7-bin.tar.gz

mv apache-cassandra-2.1.7 /usr/local/cassandra

2. Two options to grant access to the data/log folders.

Both of these methods are based on studying conf/cassandra.yaml, which is the storage configuration file.

2.1 Create three data folders under the Cassandra directory (refer to here; these are related to the architecture we discussed above):

sudo mkdir /usr/local/cassandra/commitlog

sudo mkdir /usr/local/cassandra/data

sudo mkdir /usr/local/cassandra/saved_caches

2.2 make directories and grant access (refer to this)

sudo mkdir /var/lib/cassandra

sudo mkdir /var/log/cassandra

sudo chown -R $USER:$GROUP /var/lib/cassandra

sudo chown -R $USER:$GROUP /var/log/cassandra

3. Add environment variables to ~/.bashrc

export CASSANDRA_HOME=/usr/local/cassandra

export PATH=$PATH:$CASSANDRA_HOME/bin

4. Per-thread stack size issue

For Cassandra 2.1.7, the per-thread stack size in conf/cassandra-env.sh is already increased to 256k, which is good enough for us.

5. Start Cassandra

cd /usr/local/cassandra

bin/cassandra -f       # -f means start in the foreground


If the terminal hangs here, Cassandra has started successfully.

6. Let’s open another terminal and try some CQL statement

cd /usr/local/cassandra

bin/cqlsh


Notice that this shell is connected to 127.0.0.1:9042, which is exactly the CQL client address shown in the previous step.

7. Basic CQL

Note: remember to run "use <keyspace>" first before creating tables.

The code below is adapted from here:

CREATE KEYSPACE Demo WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};
use Demo;
CREATE TABLE users (user_name varchar PRIMARY KEY, password varchar, gender varchar, session_token varchar, state varchar, birth_year bigint);
select * from system.schema_keyspaces;
describe tables;
describe keyspaces;
describe table users;


Congratulations~, we have studied the basic architecture and installation of Apache Cassandra!

Reference

1. https://www.youtube.com/watch?v=ZK0hQl9VPBY

2. https://www.digitalocean.com/community/tutorials/how-to-install-cassandra-and-run-a-single-node-cluster-on-a-ubuntu-vps

3. http://www.datastax.com/doc-source/pdf/cassandra11.pdf

Step by Step of Installing Apache Kafka and Communicating with Spark

Hi Guys,

Up to now, we have learned YARN and Hadoop, mainly focused on Spark, and practiced several machine learning algorithms, either with the scikit-learn package in Python or with MLlib in PySpark. Today, let's take a break from Spark and MLlib and learn something about Apache Kafka.

I. Background

Essentially, Apache Kafka is a distributed, partitioned, replicated, real-time commit log service. It provides the functionality of a messaging system, but with a unique design. Here's the LinkedIn Kafka paper.

1. Here are some very basic concepts you need to understand:

  • A stream of messages of a particular type is defined as a Topic. A Message is defined as a payload of bytes, and a Topic is a category or feed name to which messages are published.
  • A Producer can be anyone who can publish messages to a Topic.
  • The published messages are then stored at a set of servers called Brokers or Kafka Cluster.
  • A Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers.

2. Usage of ZooKeeper in Kafka: ZooKeeper is used for the coordination and facilitation of distributed systems, and that is exactly why Kafka uses it: to manage and coordinate the Kafka brokers. Notice that in the Hadoop ecosystem, ZooKeeper is also used for Hadoop cluster management. So we can say that ZooKeeper mainly solves the problem of reliable distributed coordination.

II. Install Apache Kafka on Ubuntu

Basically, the following procedure is based on this YouTube lecture, with my own comments and the points of confusion I ran into when I tried it myself.

Env versions

Ubuntu 15.04

Apache Kafka 0.8.2.0 (Scala 2.10 build, i.e., kafka_2.10-0.8.2.0)

1. Download and unzip the Kafka tar file from here.

tar -zxvf  kafka_2.10-0.8.2.0.tgz
mv kafka_2.10-0.8.2.0 /usr/local/kafka

2. Start Zookeeper

ZooKeeper is complicated to install manually, so Kafka ships with integrated shell scripts to start and stop a ZooKeeper server.

Start the ZooKeeper server:

bin/zookeeper-server-start.sh config/zookeeper.properties


Note: if the terminal is stuck at the last statement "INFO binding to port 0.0.0.0/0.0.0.0:2181", don't worry. This means the ZooKeeper server started correctly and is waiting for your next move.

3. Start Kafka Broker

We can open another terminal and start the broker.

bin/kafka-server-start.sh config/server.properties


If the terminal hangs at "INFO new leader is 0", the Kafka broker/server has started correctly, and you will see that it is listening on port 9092.

4. Topic Creation

Let's create a topic named "test" in another terminal.

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test


5. List the topics

bin/kafka-topics.sh --list --zookeeper localhost:2181


Since we have created only one topic, it will show just "test".

6. Start the Producer

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test


Note: the port is 9092, which is the port the Kafka broker listens on.

Now the producer is waiting for your input.

7. Start the Consumer

We can start another terminal to kick off the consumer

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning


Note: the consumer connects to port 2181, which is the port the ZooKeeper server runs on.

8. Interactive, real-time testing

You can type whatever text you want in the producer terminal and press "Enter" when you are finished. The same message will be delivered to the consumer terminal. Isn't it fun?!


9. Try WordCount Streaming example with Spark

9.1 Open a new terminal and go to your Spark directory.

cd /usr/local/spark

9.2 In Spark, run the command shown below.

"test" is the topic and "4" is the number of consumer threads (about Kafka consumer threads, please refer to this); normally, each consumer gets one thread. (The code is here.)

bin/run-example org.apache.spark.examples.streaming.KafkaWordCount localhost:2181 testgroup test 4
9.3 The final result of the Kafka producer communicating with Spark is shown below.
It clearly shows that Spark is doing streaming processing on the input typed at the Kafka producer side and word-counting the input words.
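For comparison, here is a hedged PySpark Streaming sketch of the same word count (the Scala KafkaWordCount example is what the command above actually runs). It assumes the spark-streaming-kafka package/assembly for your Spark version is on the classpath; the topic, group, and ZooKeeper address mirror the ones used in this post.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "PyKafkaWordCount")
ssc = StreamingContext(sc, 2)                        # 2-second micro-batches

# connect through ZooKeeper (port 2181), consumer group "testgroup", topic "test" with 4 threads
kafka_stream = KafkaUtils.createStream(ssc, "localhost:2181", "testgroup", {"test": 4})
lines = kafka_stream.map(lambda kv: kv[1])           # messages arrive as (key, value) pairs

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()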
Congratulations! Now you know how to install Apache Kafka and even how to make it communicate with Apache Spark!
Reference

1. http://www.infoq.com/articles/apache-kafka

2. http://tech.lalitbhatt.net/2014/07/apache-kafka-tutorial.html

3. http://www.sparkfu.com/2015/01/spark-kafka-cassandra-part-1.html