Step by Step: Installing Tachyon in Standalone Mode and Using It with Apache Spark

Hi Guys,

Today let’s talk about a brand new technology, Tachyon. It is not the particle that travels faster than light, lol, but a “memory-centric distributed storage system. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, thereby avoiding going to disk to load datasets that are frequently read.” (quoted from the official Tachyon documentation)

I. Prerequisites

Apache Spark installed (tutorial)

II. Tachyon installation

We follow the Tachyon exercise from UC Berkeley AMPCamp (reference 1).

1. Download and configure

$ wget https://github.com/amplab/tachyon/releases/download/v0.7.1/tachyon-0.7.1-bin.tar.gz

$ tar xvfz tachyon-0.7.1-bin.tar.gz

$ mv tachyon-0.7.1 /usr/local/tachyon

$ cd /usr/local/tachyon

$ cp conf/tachyon-env.sh.template conf/tachyon-env.sh

(Prefix the mv command with sudo if /usr/local is not writable by your user.)

Set TACHYON_UNDERFS_ADDRESS in conf/tachyon-env.sh to the folder you want Tachyon to use as its under filesystem; for me it is ~/Downloads.
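If you would rather script the config change than edit the file by hand, here is a minimal sketch. It assumes the env file contains an uncommented line starting with "export TACHYON_UNDERFS_ADDRESS=", which you should verify in your own copy. The set_underfs helper and the sample file are made up for illustration and are not part of Tachyon.

```shell
#!/bin/sh
# Sketch: point TACHYON_UNDERFS_ADDRESS at a local folder.
# Assumption: the env file has an uncommented line starting with
# "export TACHYON_UNDERFS_ADDRESS=". set_underfs is a hypothetical helper.

set_underfs() {
  # $1 = path to the env file, $2 = desired under filesystem folder
  env_file=$1
  underfs=$2
  tmp=$(mktemp)
  # Rewrite only the uncommented export line. '|' is the sed delimiter so
  # slashes in the path need no escaping (paths containing '|' or '&'
  # would still need escaping).
  sed "s|^export TACHYON_UNDERFS_ADDRESS=.*|export TACHYON_UNDERFS_ADDRESS=$underfs|" "$env_file" > "$tmp"
  # Copy back with cat so the original file keeps its permissions.
  cat "$tmp" > "$env_file"
  rm -f "$tmp"
}

# Demo against a throwaway sample file (not a real Tachyon config).
sample=$(mktemp)
printf '%s\n' \
  '#export TACHYON_UNDERFS_ADDRESS=hdfs://localhost:9000' \
  'export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs' > "$sample"
underfs_dir="$HOME/Downloads"
set_underfs "$sample" "$underfs_dir"
grep '^export TACHYON_UNDERFS_ADDRESS=' "$sample"
```

For the real file you would call set_underfs conf/tachyon-env.sh "$HOME/Downloads" from the Tachyon folder. Commented-out lines are left untouched.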

2. Format the system

$ ./bin/tachyon format

3. Start Tachyon

$ ./bin/tachyon-start.sh local

4. Verify the installation

$ ./bin/tachyon runTest Basic CACHE_THROUGH

5. Copy the LICENSE file from the local disk to TFS (the Tachyon file system) as a sample file

$ ./bin/tachyon tfs copyFromLocal LICENSE /LICENSE

6. Local file folder

Your under filesystem folder (~/Downloads in my case) now contains two subfolders, data and workers. The file named 2 in the data folder is our LICENSE file.

III. Run Spark on Tachyon

As the quote above suggests, Tachyon serves as a cache layer between the Spark engine and the underlying storage such as HDFS.

Here are two examples from AMPCamp.

1. I/O of Tachyon

Open the Spark shell from your Spark installation folder:

$ ./bin/spark-shell

Here’s the word count on the LICENSE file; the output is written to the /result folder in TFS.

sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("tachyon://localhost:19998/result")
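To sanity-check what the Spark job computes, you can run roughly the same word count on the local LICENSE file with standard Unix tools. This is only a sketch, and wordcount is a made-up helper name. It mirrors the flatMap / map / reduceByKey steps above with one difference: it drops empty tokens, whereas split(" ") in Spark can produce empty strings when a line has consecutive spaces, so the two results may differ slightly.

```shell
#!/bin/sh
# Local word count sketch (hypothetical helper, not part of Spark or Tachyon).
wordcount() {
  # $1 = text file.
  # tr puts every space-separated token on its own line (like flatMap),
  # grep drops empty tokens, sort groups equal tokens together,
  # uniq -c counts each group (like reduceByKey),
  # awk prints "word<TAB>count".
  tr ' ' '\n' < "$1" | grep -v '^$' | sort | uniq -c | awk '{print $2 "\t" $1}'
}

# Demo on a small throwaway file.
sample=$(mktemp)
printf '%s\n' 'to be or not to be' 'that is the question' > "$sample"
wordcount "$sample"
```

From the Tachyon folder, wordcount LICENSE gives you counts to compare against the part files that Spark wrote to /result.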

2. OFF_HEAP mode

This mode is the most powerful part of Tachyon: caching the data off heap increases the data transfer rate and reduces GC (garbage collection) overhead.

sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
counts.take(10)
counts.take(10)

In this example, the first counts.take(10) takes much longer than the second one, because the first call has to compute the RDD while the second call reads the persisted result.

2.1 The first counts.take(10) takes 11 seconds.

2.2 The second one takes only 1 second!!!

This is not a rigorous test, but it gives you an idea of how much faster things can be with the Tachyon tricks.

IV. Web browser UI

Tachyon has an integrated, very friendly web UI running at http://localhost:19999. The following images give you a brief idea.

1. The home page of Tachyon

[Screenshot: Tachyon home page]

2. The files in the TFS

[Screenshot: file listing in TFS]

3. Binary files saved in the TFS

[Screenshot: binary files in TFS]
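If these pages do not load in your browser, a quick check from the terminal can tell you whether anything is answering on the web UI port. This is a minimal sketch that assumes curl is installed and that the UI answers plain HTTP on port 19999. The tachyon_ui_up helper is a made-up name.

```shell
#!/bin/sh
# Sketch: check whether something answers HTTP on the Tachyon web UI port.
tachyon_ui_up() {
  # $1 = host, $2 = port.
  # -s silences progress output, -f makes HTTP error statuses fail,
  # -o /dev/null discards the page, --max-time bounds the wait in seconds.
  curl -s -f -o /dev/null --max-time 5 "http://$1:$2/"
}

if tachyon_ui_up localhost 19999; then
  echo "Tachyon web UI is reachable"
else
  echo "Tachyon web UI is not reachable; is the master running?"
fi
```

A failure here usually means Tachyon was not started or did not come up cleanly, so check the previous steps first.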

V. Conclusion

Tachyon is a fairly new tool that is good at caching data between upper-level compute engines and lower-level on-disk file systems, while easing Spark’s GC pressure and helping with fault tolerance. It is easy to use and compatible with the latest version of Spark. I believe Tachyon will become widely used.

References

1. http://ampcamp.berkeley.edu/5/exercises/tachyon.html

2. http://tachyon-project.org/documentation/Running-Tachyon-Locally.html

3. http://tachyon-project.org/documentation/index.html
