Installing Spark 1.5 on CDH 5.4
If you have not tried processing data with Spark yet, you should. It is the framework everyone is talking about, built to process data up to 100x faster than Hadoop MapReduce while leveraging the existing Hadoop components (HDFS and YARN). Since Spark is evolving rapidly, in most cases you will want to run the latest version released by the Spark community rather than the version packaged with your Hadoop distribution. This guide walks you through what it takes to get the latest version of Spark running on your cluster.
At the time of this writing, Spark 1.5 is the latest release, so that is the version I will focus on.
Requirements: Existing install of CDH with YARN
Step 1 - Download Spark binaries.
Navigate to Spark’s Downloads page and download the package prebuilt for Hadoop 2.6 and later (CDH 5.4 ships with Hadoop 2.6).
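If you prefer the command line, something like the following should work on the gateway node; the exact mirror URL and file name depend on the build you pick from the Downloads page, so treat this as an illustration:
# Illustrative direct download of the Spark 1.5.0 build for Hadoop 2.6 from the Apache archive
wget https://archive.apache.org/dist/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz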
Step 2 - Uncompress the archive
On your gateway node(s), uncompress the downloaded archive into your desired install directory.
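For example, assuming the archive from Step 1 and /opt as the install directory (both are just illustrative choices):
# Extract the archive and create a convenient symlink (paths are illustrative)
tar -xzf spark-1.5.0-bin-hadoop2.6.tgz -C /opt
ln -s /opt/spark-1.5.0-bin-hadoop2.6 /opt/spark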
Step 3 - Configure
Add the following line to conf/spark-env.sh
# Path to Cluster's Hadoop configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
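A fresh download does not include spark-env.sh, only a template, so you may need to create the file first:
# Create spark-env.sh from the template shipped with Spark, then add the export above
cp conf/spark-env.sh.template conf/spark-env.sh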
Step 4 - Test the install
Run spark-shell from the bin folder of your install directory, then run a test job that reads a file and counts the number of lines in it:
sc.textFile("/path/to/file").count()
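For example, launching the shell against YARN in client mode might look like this (the install path is the illustrative one from Step 2):
# Start the Spark 1.5 shell on YARN; the driver runs on the gateway node
/opt/spark/bin/spark-shell --master yarn-client
At the scala> prompt, paste the count line above; it should return the number of lines in the HDFS file.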
TIP: Make sure SPARK_HOME is NOT set in your environment to an older install, as that will prevent you from running jobs on the newer version.
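A quick way to check on the gateway node where you run the shell (illustrative commands):
# Show where SPARK_HOME currently points, if anywhere, and clear it for this session
echo "$SPARK_HOME"
unset SPARK_HOME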
You are done! (Kind of, see below)
Other components that you will want to upgrade along with Spark are the History Server and the shuffle service. The History Server provides an interface for reviewing prior job runs, which helps with troubleshooting and runtime analysis. The external shuffle service is very useful if you plan to use Dynamic Allocation, as it allows executors to be removed safely without losing the shuffle data they have generated (details). Both services should be backwards compatible in case you need to run multiple Spark versions side by side. I am going to cover the upgrade of both components in a future blog post.
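For reference, Dynamic Allocation and the external shuffle service it depends on are switched on through a handful of Spark properties; a minimal sketch of conf/spark-defaults.conf might look like this (the executor bounds are illustrative values, not recommendations):
# Enable dynamic allocation and the external shuffle service it requires
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
# Illustrative bounds on the number of executors
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   20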