- Configuration Files
If you'd like to set up Apache Spark to experiment with, but you don't want to use a premade ISO or build everything yourself, then I'm going to show you how. This will be a minimal configuration on Linux; I'm using Ubuntu, so adjust the install commands for your package manager. I'm going to assume that you've set up the hosts and their networking, and have some way to configure and deploy them. There are options like Puppet or Salt, but I'll be avoiding those and leaving them up to you.
I have a script that does all of this, but we're going to go over it so you understand each part, and then I'll attach the bash script at the end. To start with, we're going to need Java, since Spark depends on it. Log in to each host - master and slaves - and run:
apt update && apt install openjdk-8-jre openjdk-8-jdk -y
Remember to adjust this for your own Linux distribution.
Figure out which directory you want to install Apache Spark into; normally you'd use /opt, so that's what we're going to be using. Download the archive from the website:
wget -O /opt/spark-2.4.6-bin-hadoop2.7.tgz https://downloads.apache.org/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
If the download link goes stale, you can find the versions online here. I'm using version 2.4.6, pre-built for Apache Hadoop 2.7, but feel free to experiment. Once the file is downloaded, you'll want to unpack the archive:
tar -xzvf /opt/spark-2.4.6-bin-hadoop2.7.tgz -C /opt
rm /opt/spark-2.4.6-bin-hadoop2.7.tgz
Now you'll have the files ready for use. Next, you'll want to add the environment variables so that Linux knows where to look for the binaries when you call them:
echo -e "\nexport SPARK_HOME=/opt/spark-2.4.6-bin-hadoop2.7\nexport PATH=\$SPARK_HOME/bin:\$PATH" | tee -a ~/.bashrc
export SPARK_HOME=/opt/spark-2.4.6-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
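If you're scripting this across all your hosts, a guarded append keeps repeated runs from stacking duplicate lines in ~/.bashrc. This is only a sketch, not the script attached at the end, and add_spark_env is a name I've made up for illustration:

```shell
# add_spark_env FILE - append the Spark exports to FILE unless they are
# already there, so re-running the setup is safe.
# (add_spark_env is a hypothetical helper, not something Spark ships with.)
add_spark_env() {
  grep -q 'SPARK_HOME=/opt/spark-2.4.6-bin-hadoop2.7' "$1" 2>/dev/null && return 0
  cat >> "$1" <<'EOF'
export SPARK_HOME=/opt/spark-2.4.6-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
EOF
}

add_spark_env ~/.bashrc
```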
Do this for all the hosts that you'll need to run Spark on.
For your master server, you'll need to update two files: spark-defaults.conf and spark-env.sh. Both of these are found inside the conf directory of your Spark home.
Make sure you make a backup of them before you do anything else:
cp spark-defaults.conf spark-defaults.conf.bkp
cp spark-env.sh spark-env.sh.bkp
Next you'll want to open them and add the master information. For the defaults file, simply uncomment and modify the line to look like:
spark.master spark://<host or IP Addr>:7077
... and in the env file you'll want to look for the line:
SPARK_MASTER_HOST='<host or IP Addr>'
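Put together, the two edits might look like this (10.0.0.1 here is just a placeholder for your own master's address); everything else in both files can stay commented out:

```
# conf/spark-defaults.conf
spark.master    spark://10.0.0.1:7077

# conf/spark-env.sh
SPARK_MASTER_HOST='10.0.0.1'
```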
You shouldn't need to change anything else in this file so long as you have full control of the systems you're using. In my case, I don't, so I changed where the shuffle data, worker logs, and PID files are written to keep them from filling up the OS partition and crashing the host. If you need to worry about that too, then update lines like SPARK_LOCAL_DIRS, SPARK_WORKER_DIR, and SPARK_PID_DIR to point to somewhere on the system which won't fill up the partition.
Next you'll want to collect the names or IP addresses of all the hosts in your cluster and add them to the slaves file in the conf directory, right where the other files are.
Make sure to test the connectivity of your hosts using ping or something else to confirm they can actually talk to one another!
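To check them all in one go, you can loop over the slaves file. check_hosts below is a hypothetical helper for illustration, not anything Spark ships with:

```shell
# check_hosts FILE - ping each non-comment, non-empty line of FILE once.
# Prints "<host> ok" or "<host> FAIL", and returns non-zero if any host failed.
# (check_hosts is a made-up helper name.)
check_hosts() {
  rc=0
  while IFS= read -r host; do
    case "$host" in ''|\#*) continue ;; esac
    if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
      echo "$host ok"
    else
      echo "$host FAIL"
      rc=1
    fi
  done < "$1"
  return $rc
}
```

Run it as check_hosts "$SPARK_HOME/conf/slaves" on each host once the slaves file is in place.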
Now we're going to work around not having a Hadoop cluster. The way this works is that we're going to create a shared folder on all of the hosts which references the master as the source of truth. First, create a folder in your Spark home to hold the data:

mkdir /opt/spark-2.4.6-bin-hadoop2.7/Data

Go ahead and create a file in there for future use:

touch /opt/spark-2.4.6-bin-hadoop2.7/Data/turtles
Next you'll go ahead and install a package called sshfs, which is used to remotely mount a folder from one host onto another:
sudo apt install sshfs
Repeat this for all the hosts in your cluster. Once that is done, you'll connect the slaves to the master using:
sshfs <username>@<master-address>:/opt/spark-2.4.6-bin-hadoop2.7/Data /opt/spark-2.4.6-bin-hadoop2.7/Data
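Before moving on, it's worth confirming that the mount actually registered with the kernel. mounted_at is another hypothetical helper; it just scans /proc/mounts:

```shell
# mounted_at DIR - succeed if some filesystem is mounted exactly at DIR.
# Handy for verifying that the sshfs mount above actually took effect.
# (mounted_at is a made-up helper name.)
mounted_at() {
  awk -v d="$1" '$2 == d { found = 1 } END { exit !found }' /proc/mounts
}
```

On a slave, mounted_at /opt/spark-2.4.6-bin-hadoop2.7/Data should succeed once the sshfs command has run.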
Now you should be able to see the turtles file we created earlier if you list the files in the Data directory.
If you see the file then feel free to move on! If not, double back and troubleshoot the connection between those two hosts. It could also be a permissions issue or something like that!
Now that we've got it all connected together, go ahead and run the appropriate commands on the master and slaves to start them all up:
# master:
$SPARK_HOME/sbin/start-master.sh
# slaves:
$SPARK_HOME/sbin/start-slave.sh spark://<master-Addr>:7077
Now try and run it on the master:
username@HOST:~# $SPARK_HOME/bin/pyspark
Python 2.7.12 (default, Apr 15 2020, 17:07:12)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
20/10/13 05:19:50 WARN Utils: Your hostname, HOST.localdomain resolves to a loopback address: 127.0.0.1; using <address> instead (on interface eth0)
20/10/13 05:19:50 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/10/13 05:19:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Python version 2.7.12 (default, Apr 15 2020 17:07:12)
SparkSession available as 'spark'.
>>>
That should give you the above.
Now you can transfer data into that directory and read from it using whichever spark.read.* function you need.
Note that copying big data into that directory is not a good idea. If you're looking at terabytes or petabytes worth of data, then you'll definitely need a real cluster. But I've already made some interesting observations in this limited environment.