How I've set up my first Hadoop / Spark cluster: Preparation
Constantin Lungu
Senior Data Engineer · Contract / Freelancer · AWS Certified · Opinions my own
I’ve been envisaging setting up a cluster for Big Data practice for quite some time now. Coming from a traditional Business Intelligence background into a Software Engineering role (albeit still a heavily data-related one), it was only a matter of time before Big Data technologies caught my eye.
From an outsider’s perspective, the Big Data landscape sounds like a collection of buzzwords and a myriad of products, starting from the more familiar Hadoop and extending to the wide array of components that fit into its ecosystem: HDFS, YARN, Spark, Pig, Hive, Airflow or NiFi. I’ve got to admit that some of them have really funny names, but how do they all fit into the picture?
Well, I was bound to start finding out, but this sort of thing is not something you’d pick up by watching a video tutorial. Moreover, learning something new is always exciting in the beginning but pretty discouraging when you stumble upon a roadblock. I had tried setting up Hadoop at least twice before: once locally and another time using a Cloudera QuickStart VM. The first time, I gave up after a few minutes of trying to set up my local Java environment (I know, I know), whereas the second time I felt that, given its single-node, sandbox setup, it wasn’t as close to the ‘real thing’ as I’d wanted.
This article, the first in a series of posts, is essentially a compilation of my travel notes from this journey. It’s not really intended to be a tutorial, as I’ll be referencing other comprehensive and well-written resources that cover 90% of the way. Its purpose is to reduce the time the reader (or future me) needs for the other 10%, or perhaps to serve as a soft introduction to the topic. This has been an awfully long introduction, so let’s get to work!
The Preparation
Before doing anything, I looked at the different resources available online, searching for a tutorial that would guide me along the way.
I found this well-documented and simply outstanding series, written by a guy named Andrew based in Dublin, which explained basically everything I needed to know before starting. There were a couple of adjustments I had to make here and there.
Hardware
First, I wanted to go the real-hardware route as opposed to setting up a fleet of VMs, so I decided to build a Raspberry Pi cluster. I already had a 1GB Raspberry Pi 3B (bought a couple of years ago for £35) and bought another 2GB Raspberry Pi 4B plus a cluster case with cooling fans. A nice thing about it: I got them delivered in under 24 hours from the UK to Bucharest. In addition to the two Pi nodes, I will be setting up another node on my laptop, which will also be the master.
While there are better options for network performance and looks (PoE, for instance), since my devices have WiFi connectivity I’ve decided to let them communicate wirelessly and be powered using standard (phone-like) chargers. Both Pis have a 16GB microSD card as storage.
Plan for my cluster
Operating system
Next, setting up the Pis. While they do come with NOOBS — an easy operating system installer that allows installing Raspbian, the Debian-based Linux distribution that is the most popular choice for Raspberry Pis — I’ve decided to go with Ubuntu for this platform. I downloaded the Ubuntu Server image and burned it to a microSD card on my PC (using an SD adapter) with the Raspberry Pi Imager — a straightforward process that took about 3 minutes.
Raspberry Pi Imager
Then I set up the wireless connectivity for my newly set-up Raspberry Pi using the instructions available here.
/etc/netplan/50-cloud-init.yaml
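For reference, a minimal version of that file could look like the sketch below. The SSID and password are placeholders to fill in, and wlan0 is the usual wireless interface name on a Pi, though yours may differ; running sudo netplan apply afterwards activates the configuration.
network:
  version: 2
  wifis:
    wlan0:
      dhcp4: true
      optional: true
      access-points:
        "YOUR_SSID":
          password: "YOUR_PASSWORD"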
Hostnames
Next, I’ve set up hostnames and hosts on each machine.
/etc/hostname and /etc/hosts for one of the Pis
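To illustrate, assuming the node names used in this article and placeholder addresses for a typical home network, the two files could look like this:
# /etc/hostname on the Pi 3 node: just the machine's name
rpi-3

# /etc/hosts on every machine, mapping each name to its (placeholder) IP
127.0.0.1       localhost
192.168.1.100   master
192.168.1.101   rpi-3
192.168.1.102   rpi-4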
SSH
Another important step is the SSH setup, which will allow us to connect to and run commands on the machines in our cluster. We start by enabling SSH on each machine:
sudo apt update
sudo apt install openssh-server
We can test that the service is up with:
sudo systemctl status ssh
Now, say ubuntu is our user and rpi-3 is the machine we want to connect to. We can execute the following command, type in the password and be connected:
ssh ubuntu@rpi-3
There are a couple more things we need to take care of. To enable passwordless login, let’s create a public-private key pair on the master:
ssh-keygen
This will generate a public and a private key, with the default location being ~/.ssh.
We’ll now append our public key to the slaves’ authorized_keys:
cat ~/.ssh/id_rsa.pub | ssh ubuntu@rpi-3 'cat >> .ssh/authorized_keys'
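Note that this assumes the ~/.ssh directory already exists on the Pi; if it doesn’t, creating it first with ssh ubuntu@rpi-3 'mkdir -p .ssh' avoids an error, or ssh-copy-id ubuntu@rpi-3 takes care of both steps at once.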
Then, on the master, we’ll set up a config file at ~/.ssh/config as follows:
Host rpi-3
    HostName rpi-3
    User ubuntu
    IdentityFile ~/.ssh/id_rsa

Host rpi-4
    HostName rpi-4
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
Now, connecting to another machine via SSH is as simple as:
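ssh rpi-3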
Security and utility
The previously referenced tutorial also took steps towards securing the cluster (disabling password logins and the potentially compromising message of the day) — important advice which I believe should be followed. It also suggested utility functions that make it possible to execute a command once across all machines in the cluster. For instance, I’ve added the following to my .bashrc so that one command can be issued from the master and executed on all nodes. I’ve slightly modified mine so that otherpis looks up hosts that have “rpi-” in their name and clustercmd does not execute the command on my main machine.
function otherpis {
  grep "rpi-" /etc/hosts | awk '{print $2}' | grep -v "$(hostname)"
}

function clustercmd {
  for pi in $(otherpis); do ssh "$pi" "$@"; done
}
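With the functions sourced (for example by running source ~/.bashrc), a quick sanity check could be:
clustercmd date
which should print the current date once per Pi node.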
Conclusion
Now that we have a working set of machines that we can execute commands on, we can proceed to setting up Hadoop and Spark. I will be detailing the problems I’ve encountered in the process in an upcoming article. Thank you for reading and see you soon.
This article first appeared on my Medium blog: https://medium.com/@cnstlungu/how-i-set-up-my-first-hadoop-spark-cluster-preparation-663893486d12