Till now, We have learned Yarn, hadoop, and mainly focused on Spark and practise several of Machine learning Algorithms either with Scikit-learn Packages in Python or with MLlib in PySpark. Today, let’s take a break from Spark and MLlib and learn something with Apache Kafka.
Mainly, Apache Kafka is distributed, partitioned, replicated and real time commit log service. It provides the functionality of a messaging system, but with a unique design. Here’s the Linkedin Kafka paper
1. Here’s some very basic concepts you need to understand
- A stream of messages of a particular type is defined as a topic. A Messagei s defined as a payload of bytes and a Topic is a category or feed name to which messages are published.
- A Producer can be anyone who can publish messages to a Topic.
- The published messages are then stored at a set of servers called Brokers or Kafka Cluster.
- A Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers.
2. Usage of Zookepper in Kafka: As for coordination and facilitation of distributed system ZooKeeper is used, for the same reason Kafka is using it. ZooKeeper is used for managing, coordinating Kafka broker. Notice, in Hadoop ecosystem, Zookeeper is also used for cluster management for Hadoop. Thus, we have to say Zookeeper is mainly solving the problem of reliable distributed coordination.
II. Install the Apache Kafka on Ubnutu
Basically, the following procedure is based on this youtube lecture with my comment and confusion I met when I tries this by myself.
1. Download and un-zip the kafka tar file from here
tar -zxvf kafka_2.10-0.8.2.0.tgz
mv kafka_2.10-0.8.2.0 /usr/local/kafka
2. Start Zookeeper
Zookeeper is complicated when installing manually so that Kafka has it’s server stop and start shell integrated.
Start zookeeper server command:
Note: if you are stucking at the last statment “INFO binding to port 0.0.0.0/0.0.0.0:2181” don’t worry. This means the zookeeper server starts correctly and waiting for your next move.
3. Start Kafka Broker
We can open another terminal and start the broker.
If the terminal hangs at “INFO new leader is 0”, the Kafka broker/server is started correctly. And you will see the connection is at port 9092.
4. Topic Creation
Let’s create a topic named “test” in another terminal
bin/kafka-topics.sh –create –zookeeper localhost:2181 –replication-factor 1 –partitions 1 –topic test
5. List the topics
bin/kafka-topics.sh –list –zookeeper localhost:2181
Since we have created only one topic, it will show up “test”
6. Start the Producer
bin/kafka-console-producer.sh –broker-list localhost:9092 –topic test
Note: the localhost port is 9092 which is the port we created for Kafka broker.
Now producer is waiting for your move.
7. Start the Consumer
We can start another terminal to kick off the consumer
bin/kafka-console-consumer.sh –zookeeper localhost:2181 –topic test –from-beginning
Note: the localhost of consumer is 2181 which is the port zookeeper server runs on.
8. Interactively and real-timely testing
You can type whatever text you want at the producer terminal and press “Enter” when you finished. The same message will be deliver to the consumer terminal. Isn’t it fun?!
9. Try WordCount Streaming example with Spark
9.1 open a new terminal and go to your spark directory
9.2 In Spark run command shown below.