Homework 0 - Spark

Introduction

This assignment gets you started with the basic tools you will need to complete all of your homework projects in Spark using Scala. This project will ensure that you have correctly installed Scala, SBT, Spark and IntelliJ.

Problem Description

You are a student who needs to install all the tools necessary to get started in CS4641.

Solution Description

In this assignment you will set up your computer with the tools listed below.

Scala and Spark

  1. Install Scala for system-wide use on your computer by downloading the appropriate distribution from the bottom of https://www.scala-lang.org/download/

  2. Download and install a programmer’s text editor (you can also use IntelliJ as a general text editor, but it can be awkward for quick file editing). In this course we will primarily use IntelliJ, but it’s important to be comfortable with general-purpose text editors too.

  3. Install Spark using the Spark instructions on the course web site.

  4. Install SBT for your operating system using the instructions linked on the Getting Started with Scala and SBT on the Command Line page on docs.scala-lang.org.

  5. Create a directory for your CS4641 coursework somewhere on your hard disk – we suggest cs4641.
    • You can do this on the command line by navigating to the directory that will contain the cs4641 folder (using the cd command).
      • Create the folder with the command mkdir cs4641.
      • Enter the new folder with the command cd cs4641.
      • Note: avoid putting spaces in file and directory names, since doing so complicates the use of some command line tools.
  6. Create a subdirectory of your cs4641 directory named hw0.
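     Steps 5 and 6 together look like this in a Unix-style shell (starting from whatever parent directory you chose for your coursework):

     ```shell
     mkdir cs4641        # step 5: the course directory
     cd cs4641
     mkdir hw0           # step 6: the homework subdirectory
     cd hw0
     pwd                 # confirm the working directory ends in cs4641/hw0
     ```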

  7. On the command line, make sure you are in the hw0 folder. Enter these commands (remember that ‘$’ represents the shell prompt, which looks something like ‘C:\cs4641\hw0>’ on Windows; don’t type the prompt characters):

     $ scalac -version > hw0-output.txt
     $ scala -version 2>> hw0-output.txt
    

    Please note what is happening here:

    > redirects a program’s standard output (stdout) to a file. 2> (or 2>>) redirects stderr, which many tools use for diagnostics such as version strings. The first command creates the hw0-output.txt file, and the second (with the extra >) appends more text to it rather than overwriting.

    Because > (and 2>) overwrite the file, if you go back and repeat the first command, you’ll need to repeat all the later steps that add content to the file as well.
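    As a quick illustration of the three operators (using echo output and an ls error as hypothetical stand-ins for the version strings):

    ```shell
    echo "created"  >  demo.txt        # '>'  creates (or overwrites) the file
    echo "appended" >> demo.txt        # '>>' appends to the existing file
    ls no-such-file 2>> demo.txt || true  # '2>>' appends stderr (the error message)
    cat demo.txt                       # shows all three lines
    ```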

  8. Open your text editor and create the following files and directories (substitute your loginID for loginID):

     .
     ├── build.sbt
     └── src
         └── main
             └── scala
                 └── edu
                     └── gatech
                         └── cs4641
                             └── loginID
                                 └── hw0
                                     └── HelloSpark.scala
        
     8 directories, 2 files
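     On a Unix-style command line, the same tree can be created in one step with mkdir -p and touch (run from inside your hw0 folder; substitute your loginID as before — the trailing hw0 directory matches the package declaration in the next step):

     ```shell
     mkdir -p src/main/scala/edu/gatech/cs4641/loginID/hw0   # creates all parent directories at once
     touch build.sbt                                          # empty files to fill in next
     touch src/main/scala/edu/gatech/cs4641/loginID/hw0/HelloSpark.scala
     ```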
    
  9. In HelloSpark.scala enter the following Scala code (substitute your loginID for loginID):

     package edu.gatech.cs4641.loginID.hw0
        
     import org.apache.spark.sql.SparkSession
        
     object HelloSpark {
       def main(args: Array[String]): Unit = {
         val spark = SparkSession.builder.appName("Hello Spark").getOrCreate()
         println(s"Spark version: ${spark.version}")
         spark.stop()
       }
     }
    
  10. In build.sbt enter these contents:

    name := "Hello Spark"
        
    version := "1.0"
        
    scalaVersion := "2.12.8"
        
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"
    
  11. Compile and package your HelloSpark application with the following command (the first time you run it may take several minutes):

    sbt package
    
  12. Run your application by submitting it to Spark (Note: you can use --master local[2] if you have at least 2 cores):

    spark-submit --class "edu.gatech.cs4641.loginID.hw0.HelloSpark" --master local[1] target/scala-2.12/hello-spark_2.12-1.0.jar
    

    Lots of output will appear on the console.

  13. Run the application again and append its output to hw0-output.txt by entering

    Unix/Linux:

    spark-submit --class "edu.gatech.cs4641.loginID.hw0.HelloSpark" --master local[1] target/scala-2.12/hello-spark_2.12-1.0.jar >> hw0-output.txt
    

    Don’t forget the double arrows in >>!

    Most of the same output will still appear on the console, except one line – the output of your println – which now goes into hw0-output.txt instead.

  14. Examine your hw0-output.txt file to ensure that it contains the scalac version string, the scala version string, and the output of running your HelloSpark program.

Double-Check your hw0-output.txt File

At this point your hw0-output.txt file should contain the scalac version string, the scala version string, and the output of running your HelloSpark program.

If your hw0-output.txt file is missing any of those elements you should redo all the steps that add content to hw0-output.txt in each of the previous sections.
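A quick way to verify is to grep for each element. The sketch below fabricates a stand-in file just to show the checks (the exact version strings on your machine may differ; run the same greps against your real hw0-output.txt):

```shell
# Stand-in contents; your real hw0-output.txt is produced by the steps above
printf '%s\n' \
  'Scala compiler version 2.12.8 -- Copyright 2002-2018, LAMP/EPFL' \
  'Scala code runner version 2.12.8 -- Copyright 2002-2018, LAMP/EPFL' \
  'Spark version: 2.4.3' > sample-output.txt

# Each grep exits 0 only if that element is present
grep -q 'compiler version' sample-output.txt && echo 'scalac version: present'
grep -q 'code runner'      sample-output.txt && echo 'scala version: present'
grep -q 'Spark version'    sample-output.txt && echo 'HelloSpark output: present'
```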

Turn-in Procedure

Submit your hw0-output.txt file on Canvas as an attachment. When you’re ready, double-check that you have submitted and not just saved a draft.

Verify the Success of Your Submission to Canvas

Practice safe submission! Verify that your HW files were truly submitted correctly, the upload was successful, and that your program runs with no syntax or runtime errors. It is solely your responsibility to turn in your homework and practice this safe submission safeguard. NOTE: Unlike TSquare, Canvas will not send an email indicating that your assignment has been submitted successfully. Follow the steps outlined below to ensure you have submitted correctly.