Spark

Spark is a cluster computing platform that includes a compute engine and libraries for parallel data processing. Spark is written in Scala but also provides APIs for other languages such as Java, Python, and R; the Scala API is the most complete and powerful. A particularly appealing feature of Spark is that you can run it locally on a single computer or on a cluster of thousands of machines, and from the user’s perspective using Spark is the same either way. This uniformity is a big part of Spark’s appeal: you can learn it on your laptop, and when you later need to work with truly big data – datasets so large that they must be distributed across a cluster of computers – the skills you learned locally transfer directly.
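
For instance, the only thing that changes between a local session and a cluster session is the --master URL you pass when starting the shell (or submitting an application). A quick sketch – the cluster addresses below are placeholders, not something this guide sets up:

# Local mode: everything runs in a single JVM on this machine, using 2 threads.
spark-shell --master local[2]

# The same shell against a (hypothetical) Spark standalone cluster.
spark-shell --master spark://some-cluster-host:7077

# The same shell against a YARN cluster; requires HADOOP_CONF_DIR (or
# YARN_CONF_DIR) to point at the cluster's configuration.
spark-shell --master yarn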

Downloading and Installing Spark

Spark uses Hadoop’s client libraries for HDFS (Hadoop Distributed Filesystem) and YARN (Yet Another Resource Negotiator). Since we will use the latest version of Spark (2.4.3 as of May 2019) compiled with Scala 2.12, which is distributed without bundled Hadoop, we first need to download and install Hadoop separately.

Note: these instructions will work on Unix-like systems such as Linux and macOS. If you are on Windows, seriously, do yourself a favor and get a Mac or install Linux.

Installing Hadoop

  1. Download the latest Hadoop binary from Hadoop’s downloads page.

    • Note that Hadoop 3.x officially supports only Java 8. Go to https://jdk.java.net/8/ and download the JDK build for your operating system.
      • If you need another version of Java, JDK 8 can be installed alongside it.
  2. Unpack the Hadoop archive to a location on your hard disk. I suggest creating a ~/opt/hadoop/ directory. Unpacking the Hadoop archive there will create a ~/opt/hadoop/hadoop-3.2.0 directory.

  3. Create a symlink to the Hadoop directory with ln -s ~/opt/hadoop/hadoop-3.2.0 ~/opt/hadoop/current.

  4. Edit ~/opt/hadoop/current/etc/hadoop/hadoop-env.sh and add the following line (for macOS – modify appropriately for Linux):

     export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home
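
  5. Optionally, verify the installation. The commands below are a sketch assuming the layout from the previous steps and a JDK 8 install; adjust the paths if yours differ.

     # On macOS, print the home directory of the installed JDK 8 (if any).
     /usr/libexec/java_home -v 1.8

     # Print the Hadoop version and build information (PATH is not set up yet,
     # so use the full path).
     ~/opt/hadoop/current/bin/hadoop version

     # Print the Hadoop client classpath; Spark will pick this up later via
     # SPARK_DIST_CLASSPATH.
     ~/opt/hadoop/current/bin/hadoop classpath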
    

Installing Spark

  1. Download the latest Spark 2.4 binary from Spark’s downloads page – choose the one for Scala 2.12 without bundled Hadoop.

  2. Unpack the Spark archive to a location on your hard disk. I suggest creating a ~/opt/spark/ directory. Unpacking the Spark archive there will create a ~/opt/spark/spark-2.4.3-bin-without-hadoop-scala-2.12 directory.

  3. Create a symlink to the Spark directory with ln -s ~/opt/spark/spark-2.4.3-bin-without-hadoop-scala-2.12 ~/opt/spark/current.

  4. Create ~/opt/spark/current/conf/spark-env.sh (for example by copying the provided spark-env.sh.template) and add the following lines (for macOS – modify appropriately for Linux). Note that the second line runs hadoop classpath, so the hadoop command must be on your PATH by the time you launch Spark – we set that up in the next section:

     export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home
     export SPARK_DIST_CLASSPATH=$(hadoop classpath)
    

System-wide Environment Variables

After completing these steps your ~/opt directory will look something like this:

[chris@nijinsky ~/opt]
$ tree -L 2
.
├── hadoop
│   ├── current -> hadoop-3.2.0
│   └── hadoop-3.2.0
└── spark
    ├── current -> spark-2.4.3-bin-without-hadoop-scala-2.12
    └── spark-2.4.3-bin-without-hadoop-scala-2.12

To make it easier to run Spark (and Hadoop), add the following to your .bash_profile (modify for your shell if you don’t use Bash):

# Hadoop
export PATH=$PATH:~/opt/hadoop/current/bin

# Spark
export PATH=$PATH:~/opt/spark/current/bin
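
After reloading your shell configuration you can check that everything is wired up. The session below is a sketch that assumes the layout and file names used above:

# Pick up the new PATH in the current shell (Bash; adapt for your shell).
source ~/.bash_profile

# Both commands should resolve to the symlinked "current" installations.
which hadoop spark-shell

# Run one of the examples bundled with Spark as a smoke test; it prints an
# approximation of Pi among its log output.
run-example SparkPi 10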

Now you can run Spark from any directory. Start the Spark shell locally, with two worker threads, by running:

spark-shell --master local[2]
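
The [2] simply asks for two local worker threads and is an arbitrary choice for this guide. A couple of common variations, again as a sketch:

# Use one worker thread per available CPU core instead of a fixed number.
spark-shell --master local[*]

# Pipe a one-line Scala expression through the shell as a quick smoke test;
# it should print 500500.0 (the sum of 1 to 1000) before the shell exits.
echo 'sc.parallelize(1 to 1000).sum' | spark-shell --master local[2]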