This time, I am going to install Apache Spark on our existing Apache Hadoop 2.7.0.
1. Install Scala (refer to this)
sudo apt-get remove scala-library scala
sudo dpkg -i scala-2.11.7.deb
sudo apt-get update
sudo apt-get install scala
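To confirm the Scala install took effect (a quick check not listed in the original steps), ask for the version:

```shell
# Should report the installed Scala version, e.g. 2.11.x
scala -version
```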
2. Install Spark
tar -zxvf spark-1.4.0-bin-hadoop2.6.tgz
sudo mv spark-1.4.0-bin-hadoop2.6 /usr/local/spark
3. Get the Hadoop version
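The command for this step isn't listed above; the standard Hadoop CLI check is:

```shell
# Prints the Hadoop release and build information
hadoop version
```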
It should show 2.7.0.
4. Add SPARK_HOME
sudo vi ~/.bashrc
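The exact lines to put in ~/.bashrc aren't shown above; a minimal sketch, assuming Spark was moved to /usr/local/spark in step 2:

```shell
# Append to ~/.bashrc, then run `source ~/.bashrc` to pick up the changes
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
```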
5. Spark Version
Since spark-1.4.0-bin-hadoop2.6.tgz is a prebuilt package for Hadoop 2.6.0 and later, it also works with Hadoop 2.7.0. Thus, we don't need to rebuild Spark with sbt or Maven, which is genuinely complicated. If you download the source code from the Apache Spark site and build it with the command
build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -DskipTests clean package
you are likely to run into lots of build-tool dependency clashes.
So, don't bother building Spark yourself.
6. Let's verify our installation
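The original doesn't list a command for this step; two quick checks, assuming the layout from the steps above, are:

```shell
# The unpacked distribution should contain bin/, lib/, conf/, etc.
ls /usr/local/spark
# spark-submit can report the Spark version it ships with
/usr/local/spark/bin/spark-submit --version
```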
7. Launch the Spark shell (refer to this)
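The launch command itself isn't shown above; assuming the install location from step 2:

```shell
# Start an interactive Spark shell with a local master
/usr/local/spark/bin/spark-shell
```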
Once you see the scala> prompt, the Spark shell is running.
8. Test the Spark shell
scala> sc.parallelize(1 to 100).count()
sc—spark context, Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
parallelize—Distribute a local Scala collection to form an RDD.
count—Return the number of elements in the dataset.
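The same one-liner can also be run non-interactively by piping it into the shell (a sketch; the path assumes the install location above):

```shell
# Feed the expression to spark-shell; the shell exits when input ends
echo 'sc.parallelize(1 to 100).count()' | /usr/local/spark/bin/spark-shell
```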
9. Let's try another typical example
bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[*] lib/spark-example* 10
The last argument, 10, is passed to the application's main method. Here it is the number of slices used to compute Pi.
Congratulations! We have finished installing Spark, and next we can start using this powerful tool to perform data analysis and many other fun things.