Step-by-Step Installation of Apache Spark on Apache Hadoop

Hi guys,

This time, I am going to install Apache Spark on our existing Apache Hadoop 2.7.0.

Env versions

OS: Ubuntu 15.04

Scala: 2.11.7

Spark: spark-1.4.0-bin-hadoop2.6.tgz

1. Install Scala (refer to this)
sudo apt-get remove scala-library scala
# the scala-2.11.7.deb package needs to be downloaded first (it is available from the scala-lang.org archive)
sudo dpkg -i scala-2.11.7.deb
sudo apt-get update
sudo apt-get install scala
2. Install Spark
wget http://apache.mirrors.ionfish.org/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
tar -zxvf spark-1.4.0-bin-hadoop2.6.tgz
sudo mv spark-1.4.0-bin-hadoop2.6 /usr/local/spark
3. Get the Hadoop version
hadoop version
It should show 2.7.0
4. Add SPARK_HOME
vi ~/.bashrc
and add:
export SPARK_HOME=/usr/local/spark
# optionally, put Spark's scripts on the PATH as well
export PATH=$PATH:$SPARK_HOME/bin
then reload it:
source ~/.bashrc
5. Spark version
Since spark-1.4.0-bin-hadoop2.6.tgz is a pre-built package for Hadoop 2.6.0 and later, it also works with Hadoop 2.7.0. Thus, we don't need to rebuild Spark with sbt or Maven, which is fairly involved. If you download the source code from the Apache Spark site and build it with
build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -DskipTests clean package
you are likely to hit lots of build-tool dependency problems.
So, don't bother building Spark yourself.
6. Let’s verify our installation
cd $SPARK_HOME
7. Launch the Spark shell (refer to this)
./bin/spark-shell
If you see the scala> prompt, the Spark shell is up and running.
8. Test the Spark shell
scala> sc.parallelize(1 to 100).count()
Background info:
sc: the Spark context, the main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster.
parallelize: distribute a local Scala collection to form an RDD.
count: return the number of elements in the dataset.
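If you want to poke around a bit more before exiting, a couple of other RDD operations follow the same pattern (a quick sketch reusing the same 1 to 100 range):
scala> val rdd = sc.parallelize(1 to 100)
scala> rdd.filter(_ % 2 == 0).count()   // count the even numbers: 50
scala> rdd.map(_ * 2).reduce(_ + _)     // double every element and sum them: 10100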
scala> exit
9. Let’s try another typical example
bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[*] lib/spark-example* 10
The last argument, 10, is passed to the application's main method; here it is the number of slices used to calculate Pi.
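For background, SparkPi estimates Pi with a simple Monte Carlo simulation: it throws random points into a square and counts how many land inside the inscribed circle. A simplified sketch of that idea (not the exact example source) can be pasted into spark-shell:
val slices = 10                          // same value as the last argument above
val n = 100000 * slices                  // total number of random points
val inside = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random * 2 - 1            // random point in the 2 x 2 square
  val y = math.random * 2 - 1
  if (x * x + y * y < 1) 1 else 0        // 1 if it falls inside the unit circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * inside / n)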

Congratulations! We have finished the Spark installation, and next we can start using this powerful tool to perform data analysis and many other fun things.

4 comments

  1. AMK · October 17, 2015

Do we have to repeat this process on all the nodes, including the namenode and datanodes?


    • cyrobin · November 24, 2015

You have to repeat most of the process.


  2. Biswo · February 26, 2017

Hi,
I followed the suggested steps to install spark-2.1.0-bin-hadoop2.7.tgz, but I am not getting the Hadoop version that should show 2.7.0 (Step 3).
I am getting the error below:
/usr/local/spark$ hadoop version
hadoop: command not found


  3. ghandzhipeng · August 7, 2017

This is not building Spark on top of an existing Hadoop cluster. Essentially, you need to set HADOOP_CONF_DIR in spark-env.sh.

