9 issues I’ve encountered when setting up a Hadoop/Spark cluster for the first time
Constantin Lungu
Senior Data Engineer · Contract / Freelancer · AWS Certified · Opinions my own
In a previous article, we’ve discussed how to prepare the setup of a Hadoop/Spark cluster. Preparing the cluster, however, was only the beginning. While there are plenty of resources about setting up Hadoop on a cluster, as a beginner I found it confusing at times and spent a good dozen hours putting it all together. In this article I’ll therefore document the issues I encountered while setting up the Hadoop/Spark cluster and how to solve them. If you wish to give it a try yourself, please review the tutorial I followed to set this all up, which I wholeheartedly recommend.
Now, to give you some context, my setup consists of a laptop (which will serve both as the name node and a data node) and two Raspberry Pis acting as additional data nodes.
Note that I will be running Spark 2.4.5 and Hadoop 3.2.1.
So, we’ve downloaded, unpacked and moved Hadoop to /opt/hadoop.
Let’s check that it runs:
hadoop version
Oops!
JAVA_HOME_NOT_SET
First of all, please do check you have Java installed. If you’d like Spark down the road, keep in mind that the current stable Spark version is not compatible with Java 11.
sudo apt install openjdk-8-jre-headless -y
Then, let’s look at the environment variables. The tutorial (built for Raspbian) recommended setting this environment variable to something like
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
whereas the following, inspired by here, made it work for me (on Ubuntu):
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
Please also note the environment variables I’ve added to .bashrc:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-arm64
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Also, we need to add JAVA_HOME as above to /opt/hadoop/etc/hadoop/hadoop-env.sh.
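Concretely, that means adding (or uncommenting and setting) a line like this in hadoop-env.sh, with amd64 on the laptop and arm64 on the Pis:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64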
Let’s try again now:
hadoop version
Works! Moving forward.
Permission denied (publickey,password).
So, everything is ready to spin up the Hadoop cluster. The only thing is, when running
start-dfs.sh && start-yarn.sh
I kept receiving the following error:
localhost: Permission denied (publickey,password).
Turns out, there were multiple things happening here. First, my user had a different name than the one on the slave nodes, so my SSH authentication was failing. To sort that out, on the master, I had to set up a config file at ~/.ssh/config as follows:
Host rpi-3
    HostName rpi-3
    User ubuntu
    IdentityFile ~/.ssh/id_rsa

Host rpi-4
    HostName rpi-4
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
This made it clear which user to use when connecting over SSH.
But this wasn’t all. More importantly, it turns out I also needed to be able to SSH into localhost before start-dfs.sh and start-yarn.sh would run.
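For reference, a minimal sketch of enabling passwordless SSH to localhost (assuming the default key location and that no key pair exists yet):
ssh-keygen -t rsa          # generate a key pair
ssh-copy-id localhost      # authorize your own public key
ssh localhost              # accept the host key and confirm it works without a password
Moving forward.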
Hadoop cluster setup — java.net.ConnectException: Connection refused
The next error on our list took me a good evening to debug. When trying to start the cluster with start-dfs.sh, I was receiving a connection refused error. At first, without thinking too much about it, I followed the advice from here and set the default file system address to 0.0.0.0.
The correct answer was in fact to set it to my name node’s address (in core-site.xml) AND to make sure there isn’t an entry in /etc/hosts tying that hostname to 127.0.0.1 or localhost. Hadoop doesn’t like that, and I had been warned.
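For illustration, the relevant property in core-site.xml ends up looking roughly like this, with the hostname and port being placeholders for your own name node:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://your-name-node:9000</value>
  </property>
</configuration>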
Nodes not showing up
In a multi-node setup, missing nodes can stem from a wide array of causes: wrong configuration, differences between environments and even firewalls. Here are a couple of tips on how to make head or tail of it.
Check the UIs first
There’s a number of web interfaces exposed in a typical Hadoop stack. Two of them are the YARN UI, exposed on port 8088, and the Name Node Information UI, exposed on port 9870. In my case, I had a different number of nodes in HDFS than in YARN, which pointed me to investigate and find the issue with my YARN settings.
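If you prefer the command line, the same counts can be cross-checked from the name node:
hdfs dfsadmin -report   # data nodes registered with HDFS
yarn node -list         # node managers registered with YARN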
Use jps to see which services are running
Jps is a Java tool, but you can use it to see which Hadoop services are up on a particular machine.
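For illustration, on a healthy worker node the output would look something along these lines (the process IDs are arbitrary, and a worker that also hosts the name node would additionally show NameNode and ResourceManager):
2113 DataNode
2245 NodeManager
2671 Jps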
Is the DataNode up on the given node? Let’s move forward.
Dive into hadoop logs
Hadoop logs are located in the logs subfolder, in my case /opt/hadoop/logs. Don’t see a particular data node? SSH into that machine and analyze the logs for that particular service.
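Assuming the default log file naming, something like this shows the latest DataNode output (the user and hostname parts of the file name will differ per machine):
tail -n 100 /opt/hadoop/logs/hadoop-*-datanode-*.log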
Check the configuration files
That’s where most issues stem from. Review the logs, read up on what the correct configuration should be and apply it. The configuration files are located in /opt/hadoop/etc/hadoop. Start with the file corresponding to the service you’re trying to debug.
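For orientation, the main files you’ll be dealing with in that directory are:
core-site.xml: the default file system (fs.defaultFS) and other common settings
hdfs-site.xml: NameNode and DataNode settings
yarn-site.xml: ResourceManager and NodeManager settings
mapred-site.xml: MapReduce settings
workers: the list of worker node hostnames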
OK, so we’re more or less fine with Hadoop now. What about Spark? I’ve downloaded Spark (without Hadoop) and unpacked it. Now what? Smooth sailing? Not so fast.
Spark fails to start
If you’ve downloaded the “without Hadoop” Spark build, chances are you’ll bump into the same issue I did, so be sure to add the following to conf/spark-env.sh:
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
Credits for this solution go here.
Pyspark error — Unsupported class file major version 55
Seriously, I’ve warned you above that the current stable Spark version is only compatible with Java 8. More details on that.
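If more than one JDK is installed, a quick way to check and switch (on Ubuntu) is:
java -version                           # should report version 1.8.x
sudo update-alternatives --config java  # select the Java 8 entry if it doesn't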
java.lang.NumberFormatException: For input string: "0x100"
Next, this harmless but annoying message that you’d see when starting the spark-shell can be easily fixed by adding an environment variable to .bashrc:
export TERM=xterm-color
Read more about it here.
Spark UI has broken CSS
When starting a spark-shell or submitting a Spark job, a Spark context Web UI is made available on port 4040 on the name node. In my case, the UI’s CSS was broken, which made using it impossible.
Once again, StackOverflow was my friend. To sort this one out, start a spark-shell and run the following command:
sys.props.update("spark.ui.proxyBase", "")
Credits to the solution here.
Property spark.yarn.jars
When running Spark in cluster mode on top of YARN, the Spark jars need to be shipped across to the other nodes, which don’t have Spark installed. It probably took me more tinkering to sort out than it should have, but in my case the solution was to package the jars, upload them to a location in HDFS and let the nodes fetch them from there when needed.
Creating the archive:
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
Upload it to HDFS:
hdfs dfs -put spark-libs.jar /some/path/
Then, in spark-defaults.conf, set:
spark.yarn.archive hdfs://your-name-node:9000/some/path/spark-libs.jar
Credits to the solution and explanation here.
Conclusion
This concludes our recap of some errors encountered while setting up Hadoop and Spark as a beginner. We’ll give our cluster a spin, test it out and report back in a future article. Thanks for reading and stay tuned!
This article first appeared on my Medium blog at https://medium.com/@cnstlungu/9-issues-ive-encountered-when-setting-up-a-hadoop-spark-cluster-for-the-first-time-87b023624a43