Step by Step: Configuring Apache Spark to Connect with Cassandra

Hi guys,

This time, we will discuss how to access data and tables in Cassandra from Apache Spark.

I. Prerequisites

Cassandra installed (tutorial)

Apache Spark installed (tutorial)

Intellij installed (tutorial)

Scala version 2.10.x (Apache Spark 1.4.0 only supports Scala 2.10.x; refer to spark.apache.org)

If you haven’t installed these tools, feel free to refer to my tutorials from previous sessions.

Note: we installed Cassandra and Spark on an Ubuntu machine running in VMware Player, while IntelliJ is installed on the local Windows 7 machine.

II. Apache Cassandra configurations

1. Open a terminal in Ubuntu and type

ifconfig

You will find your IP address, e.g. 192.168.30.154.


2. Configure the Cassandra server

vi /usr/local/cassandra/conf/cassandra.yaml

This is the main configuration file (spec at DataStax).

There are two addresses we need to configure for our standalone-mode run.

listen_address: 192.168.30.154

(192.168.30.154 is an example; change it to your own IP)

Other nodes in the cluster communicate with this node through listen_address.

rpc_address: 192.168.30.154

(192.168.30.154 is an example; change it to your own IP)

rpc_address specifies the IP or hostname through which clients communicate.
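
Putting the two together, the relevant lines in cassandra.yaml end up looking like the snippet below (the address is an example; substitute your own IP):

# cassandra.yaml (example values)
listen_address: 192.168.30.154
rpc_address: 192.168.30.154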

3. Let’s start the Cassandra server.

cd /usr/local/cassandra/

bin/cassandra -f

The -f flag keeps Cassandra running in the foreground so you can watch its logs.

4. Configure cqlsh

vi /usr/local/cassandra/bin/cqlsh

Search for DEFAULT_HOST and change this address to your local IP, e.g.

DEFAULT_HOST = '192.168.30.154'
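
To verify, you can also pass the host to cqlsh explicitly, since it accepts a host (and optionally a port) as a command-line argument:

bin/cqlsh 192.168.30.154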

5. Add some values to our previous Cassandra table (please refer to this)

We have a keyspace named “demo” and a table named “users”. Let’s insert some values into it.

INSERT INTO demo.users(user_name, birth_year) VALUES('Robin', 1987);

This inserts the value 'Robin' into the user_name column of the users table in the demo keyspace, and likewise 1987 into birth_year.
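
To confirm the row is there, a quick check from cqlsh (this assumes the demo.users schema from the earlier tutorial):

SELECT user_name, birth_year FROM demo.users;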

III. Build an sbt Spark project

1. Build a Spark project in IntelliJ (refer to here)

sbt dependencies (build.sbt):

name := "SparkCassandra"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.4.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.4.0-M1"

2. A simple query looks like this (refer to this):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object SparkCassandra extends App {

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("SparkCassandra")
    // set the Cassandra host address to your local address
    .set("spark.cassandra.connection.host", "192.168.30.154")
  val sc = new SparkContext(conf)
  // read the table from the keyspace as an RDD
  val rdd = sc.cassandraTable("demo", "users")
  // collect() would dump the whole RDD to the driver node (here, our machine),
  // which may crash it on a large table, so fetch only the first 100 rows
  // (our table is small, so this is plenty)
  val first_rows = rdd.take(100)
  first_rows.foreach(println)

  sc.stop()
}

3. The result is what we expected


Since we only populated user_name and birth_year, the rest of the columns return null.
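
If you later want to filter on the Cassandra side instead of pulling the whole table, the connector also offers select and where, which push the projection and predicate down to Cassandra, and saveToCassandra, which writes an RDD back to a table. Below is a minimal sketch under a few assumptions: it reuses the same demo.users table, user_name is the partition key (Cassandra only allows a where clause on key or indexed columns), and the "Sara" row is made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object SparkCassandraPushdown extends App {

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("SparkCassandraPushdown")
    .set("spark.cassandra.connection.host", "192.168.30.154")
  val sc = new SparkContext(conf)

  // select/where are pushed down to Cassandra, so only the matching
  // columns and rows cross the network
  val robins = sc.cassandraTable("demo", "users")
    .select("user_name", "birth_year")
    .where("user_name = ?", "Robin")
  robins.collect().foreach(println)

  // write an RDD of tuples back to the table; SomeColumns maps the
  // tuple fields to the listed columns ("Sara" is a hypothetical row)
  sc.parallelize(Seq(("Sara", 1990)))
    .saveToCassandra("demo", "users", SomeColumns("user_name", "birth_year"))

  sc.stop()
}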

References:

1. http://blog.knoldus.com/2015/06/23/apache-spark-cassandra-basic-steps-to-install-and-configure-cassandra-and-use-it-with-apache-spark-with-example/

2. http://www.datastax.com/dev/blog/accessing-cassandra-from-spark-in-java

3. http://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10

Comments

  1. Srinu · September 17, 2015

    Hi,
    I followed this tutorial and am trying to communicate, but I am getting an error.
    On Windows I installed IntelliJ and created a project SparkCassandra, and I imported the four jar files into the project, but I am unable to run the Spark code from Windows.
    Can you help me with this?

    Thanks
    Srinu


  2. Srinu · September 18, 2015

    Thanks much for your reply, Chong.
    I ran the spam classifier and it works for me.
    Then I tried to run the Spark Cassandra example…not sure where I am going wrong.
    I came across the errors below.
    I set up sbt just as you did for this project. Can you please shed some light on how to overcome these errors?

    Error:(5, 12) object apache is not a member of package org
    import org.apache.spark.{SparkConf, SparkContext}
    ^
    Error:(6, 12) object datastax is not a member of package com
    import com.datastax.spark.connector._
    ^
    Error:(10, 18) not found: type SparkConf
    val conf = new SparkConf()
    ^
    Error:(15, 16) not found: type SparkContext
    val sc = new SparkContext(conf)
    ^

    Thanks
    Srinu


    • cyrobin · September 18, 2015

      Hi Srinu,
      If you are getting errors on the import of org.apache.spark, it is most likely a dependency issue. Spark 1.5 was just released, and I am not expert enough to dig into how the Spark, sbt, and Scala versions cooperate with each other. I kindly suggest you search for this dependency issue online.

      Thanks,
      -Chong


  3. Meenakshi · March 30, 2017

    Hi Chong,
    I followed your steps and was able to connect to Cassandra on my local Ubuntu machine through Spark using an IntelliJ project.
    How can I create Spark jobs as jar files and run them against my live Cassandra cluster, which is not local but on another server, for example a production server?

