Introduction
If you’d like to set up Apache Spark to experiment with, but you don’t want to use a premade ISO or roll your own, then I’m going to show you how. This will be a minimal configuration on a Linux operating system; I’m going to use Ubuntu, so adjust the install commands for your package manager. I’m going to assume that you’ve set up the hosts and their networking, and have some way to configure and deploy them. There are options like Puppet or Salt, but I’ll be avoiding those and leaving them up to you.
Installation
I have a script that does all of this, but we’re going to go over it so you understand each part, and then I’ll attach the bash script at the end. To start with, we’re going to need Java since Spark depends on it. Log in to each host - master and slaves - and run:
apt update && apt install openjdk-8-jre openjdk-8-jdk -y
Remember to adjust this for your own Linux distribution.
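If you want to double-check that the right Java landed, java -version should report an OpenJDK 1.8 build (the exact patch level will vary by distribution), and on Ubuntu update-alternatives can show which JDKs are registered:
java -version
update-alternatives --list java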
Figure out which directory you want to install Apache Spark into; normally you’d use /opt, so that’s what we’re going to use. Download the archive from the website:
wget -O /opt/spark-2.4.6-bin-hadoop2.7.tgz https://downloads.apache.org/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
If they obsolete the download link then you can find all the versions online here. I’m using version 2.4.6, pre-built for Apache Hadoop 2.7, but feel free to experiment. Once the file is downloaded, you’ll want to unpack the archive:
tar -xzvf /opt/spark-2.4.6-bin-hadoop2.7.tgz -C /opt
rm /opt/spark-2.4.6-bin-hadoop2.7.tgz
Now you’ll have the files ready for use. Next you’ll want to add the environment variables so that Linux knows where to find the Spark binaries when you call them:
echo -e "\nexport SPARK_HOME=/opt/spark-2.4.6-bin-hadoop2.7\nexport PATH=/opt/spark-2.4.6-bin-hadoop2.7/bin:$PATH" | tee -a ~/.bashrc
export SPARK_HOME=/opt/spark-2.4.6-bin-hadoop2.7
export PATH=/opt/spark-2.4.6-bin-hadoop2.7/bin
Do this for all the hosts that you’ll need to run Spark on.
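To confirm the variables took, open a new shell (or source ~/.bashrc) and ask Spark for its version; spark-submit lives in the bin directory we just put on the PATH:
source ~/.bashrc
echo $SPARK_HOME
spark-submit --version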
Configuration Files
It is best to create a master copy of these next few configuration files to copy to each host in turn. This way you only need to edit each file once and then copy them all to the appropriate hosts.
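As a sketch of what that push step could look like once the editing below is done - slave1 and slave2 are placeholder hostnames, and this assumes you have SSH access and the same install path on every host:
for host in slave1 slave2; do
  scp $SPARK_HOME/conf/* ${host}:$SPARK_HOME/conf/
done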
For your master server, you’ll need to update two files: spark-defaults.conf and spark-env.sh. Both live inside the conf directory of your Spark home. A fresh download only ships the .template versions, so copy those into place first; the untouched templates double as your backups:
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
Next you’ll want to open them and add the master information. For the defaults file, simply uncomment and modify the line to look like:
spark.master spark://<host or IP Addr>:7077
… and in the env file you’ll want to look for the line:
SPARK_MASTER_HOST='<host or IP Addr>'
You shouldn’t need to change anything else in this file so long as you have full control of the systems you’re using. In my case I don’t, so I changed where the shuffle data, worker logs and PID files get written to stop them from filling up the OS partition and crashing the host. If you need to worry about that too, update SPARK_LOCAL_DIRS, SPARK_WORKER_DIR and SPARK_PID_DIR to point somewhere on the system that won’t fill up the partition.
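Purely as an illustration, the relocated lines in spark-env.sh could look something like this; /data/spark is a made-up mount point, so substitute whatever partition on your hosts actually has the room:
# in $SPARK_HOME/conf/spark-env.sh - example locations only
SPARK_LOCAL_DIRS=/data/spark/scratch   # shuffle and spill space
SPARK_WORKER_DIR=/data/spark/work      # per-application worker logs and jars
SPARK_PID_DIR=/data/spark/pids         # daemon pid files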
Next you’ll want to collect the hostnames or IP addresses of all the hosts in your cluster and add them to the slaves file in the same conf directory as the others (create it from slaves.template if it isn’t there yet). Make sure to test the connectivity of your hosts using ping or something similar to confirm they can actually talk to one another!
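For reference, the slaves file is just one hostname or IP address per line; slave1 and slave2 here stand in for your own hosts:
slave1
slave2
A cheap way to run that connectivity check from the master is to loop over the same file (skipping any commented lines):
for host in $(grep -v '^#' $SPARK_HOME/conf/slaves); do ping -c 1 $host; done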
Datastore
Now we’re going to work around not having a Hadoop cluster. How this works is that we’re going to create a shared folder on all of the hosts which references the master as the source of truth. First, create a folder in your Spark home to hold the data:
mkdir $SPARK_HOME/Data
Go ahead and create a file in there for future usage: touch $SPARK_HOME/Data/turtles
Next you’ll want to install a package called sshfs, which lets you mount a folder from one host onto another over SSH:
sudo apt install sshfs
Repeat this for all the hosts in your cluster. Once that is done, you’ll connect the slaves to the master using:
sshfs <username>@<master-address>:/opt/spark-2.4.6-bin-hadoop2.7/Data /opt/spark-2.4.6-bin-hadoop2.7/Data
Now you should be able to see the turtles file we created earlier if you list the contents of the Data directory:
ls $SPARK_HOME/Data
If you see the file then feel free to move on! If not, double back and troubleshoot the connection between those two machines; it could also be a permissions issue.
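One caveat worth knowing: an sshfs mount doesn’t survive a reboot, and a dropped SSH connection takes the share down with it. Mounting with the reconnect option helps with flaky links, and fusermount lets you unmount cleanly if the share ever gets stuck:
# on each slave: remount with automatic reconnection
sshfs -o reconnect <username>@<master-address>:/opt/spark-2.4.6-bin-hadoop2.7/Data /opt/spark-2.4.6-bin-hadoop2.7/Data
# unmount if the share gets wedged or you want to redo it
fusermount -u /opt/spark-2.4.6-bin-hadoop2.7/Data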
Connect the Dots, Start the Services
Now that we’ve got it all connected together, go ahead and run the appropriate commands on the master and slaves to start the services:
# master:
$SPARK_HOME/sbin/start-master.sh
# slaves:
$SPARK_HOME/sbin/start-slave.sh spark://<master-Addr>:7077
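Before firing up a shell it’s worth confirming the daemons actually came up. jps (it ships with the JDK) should show a Master process on the master and a Worker on each slave, and the master’s web UI listens on port 8080 by default:
# on the master (repeat on each slave and look for Worker instead)
jps
# the master web UI should answer on port 8080
curl -sf http://<master-Addr>:8080 > /dev/null && echo "master UI is up"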
Success!
Now try running pyspark on the master:
username@HOST:~# $SPARK_HOME/bin/pyspark
Python 2.7.12 (default, Apr 15 2020, 17:07:12)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
20/10/13 05:19:50 WARN Utils: Your hostname, HOST.localdomain resolves to a loopback address: 127.0.0.1; using <address> instead (on interface eth0)
20/10/13 05:19:50 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/10/13 05:19:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/
Using Python version 2.7.12 (default, Apr 15 2020 17:07:12)
SparkSession available as 'spark'.
>>>
You should see something like the above.
Now you can transfer data into that directory and read from it with whichever spark.read.* function you need. Note that copying Big Data into that directory is not a good idea; if you’re looking at terabytes or petabytes worth of data then you’ll definitely need a real cluster. But I’ve already made some interesting observations in this limited environment.
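To make that concrete, here’s a rough sketch of reading a file back out of the shared folder; sample.csv is a made-up name standing in for whatever you actually drop into Data, and because every host mounts the same path, a plain file:// URI resolves on the workers too:
# assumes a hypothetical sample.csv has already been copied into $SPARK_HOME/Data
cat > read_sample.py <<'EOF'
from pyspark.sql import SparkSession

# spark.master is picked up from spark-defaults.conf, so no URL is needed here
spark = SparkSession.builder.appName("read-sample").getOrCreate()

# every host sees the same path thanks to the sshfs mount
df = spark.read.csv("file:///opt/spark-2.4.6-bin-hadoop2.7/Data/sample.csv", header=True)
df.show(5)

spark.stop()
EOF

$SPARK_HOME/bin/spark-submit read_sample.py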