Tue 03 Nov 2015

Installing Spark 1.5 on CDH 5.4

If you have not tried processing data with Spark yet, you should. It's the hot framework of the moment, built around running workloads up to 100x faster than Hadoop MapReduce while leveraging the existing Hadoop components (HDFS and YARN). Since Spark is evolving rapidly, in most cases you will want to run the latest version released by the Spark community, rather than the version packaged with your Hadoop distribution. This guide will walk you through what it takes to get the latest version of Spark running on your cluster.

At the time of this writing, Spark 1.5 is the latest released version, so that is the version I will focus on.

Requirements: Existing install of CDH with YARN

Step 1 - Download Spark binaries

Navigate to Spark's Downloads page and download the package prebuilt for Hadoop 2.6 and later (CDH 5.4 ships with Hadoop 2.6).
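For example, to pull the archive down from the command line (assuming the 1.5.1 release and the Apache archive as the mirror; use whatever mirror URL the Downloads page hands you):

wget https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz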

Step 2 - Uncompress the archive

On your gateway node(s), uncompress the archive you downloaded to a desired install directory.
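A minimal sketch, assuming the 1.5.1 archive from Step 1 and /opt as the install directory (both are just examples, adjust to your environment):

tar -xzf spark-1.5.1-bin-hadoop2.6.tgz -C /opt
cd /opt/spark-1.5.1-bin-hadoop2.6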

Step 3 - Configure

Add the following line to conf/spark-env.sh

# Path to Cluster's Hadoop configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
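As a sanity check, that directory should hold your cluster's client configuration files; on CDH you would expect to see core-site.xml, hdfs-site.xml and yarn-site.xml in there:

ls /etc/hadoop/conf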

Step 4 - Test the install

Run spark-shell from your install folder's bin directory, then run a test job that reads a file and counts the number of lines in it:

sc.textFile("/path/to/file").count()
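By default the shell runs with a local master. To verify the YARN integration specifically, launch the shell against the cluster (a minimal invocation; yarn-client is the client-mode master string in Spark 1.5, and the path is relative to your install directory):

./bin/spark-shell --master yarn-client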

TIP: Make sure SPARK_HOME is NOT set in your environment to an older version's install directory, as that will prevent you from running jobs on the newer version.
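A quick way to check (and clear) it before launching the shell:

echo $SPARK_HOME
unset SPARK_HOME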

You are done! (Kind of, see below)

Other components you will want to upgrade along with Spark are the History Server and the external shuffle service. The History Server provides an interface for reviewing prior job runs, which helps with troubleshooting and runtime analysis. The external shuffle service is very useful if you plan to use Dynamic Allocation, as it allows executors to be removed safely without losing the shuffle data they have generated. Both services should be backwards compatible in case you need to run multiple Spark versions side by side. I am going to cover the upgrade of both components in a future blog post.
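For reference, on the Spark side Dynamic Allocation boils down to two standard properties in conf/spark-defaults.conf (a sketch only; the shuffle service itself also has to be deployed on the NodeManagers, which is what the future post will cover):

spark.dynamicAllocation.enabled    true
spark.shuffle.service.enabled      true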

Tags: spark, hadoop, cdh
