HDFS Clustering Through Docker in CentOS
Overview
This guide shows how to deploy the Hadoop Distributed File System (HDFS). We will first build a Docker image, and from that image we will create 1 NameNode (Master) and 3 DataNodes (Slaves).
Docker Image Generation
Run the commands below to create a CentOS container in which to install HDFS.
docker run -d -t --privileged --network host --name hdfs centos:7 /sbin/init
docker exec -it hdfs bash
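If you want to confirm the container came up correctly before exec-ing into it, a quick optional check (not part of the original steps) is:
# The hdfs container should be listed with status "Up"
docker ps --filter name=hdfs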
Create an install_hdfs.sh file and add the contents below.
# Install Java and Needed Packages
yum update -y
yum install wget -y
yum install vim -y
yum install openssh-server openssh-clients openssh-askpass -y
yum install java-1.8.0-openjdk-devel.x86_64 -y
# Make SSH keys so the nodes can communicate without a password prompt
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-keygen -f /etc/ssh/ssh_host_rsa_key -t rsa -N ""
ssh-keygen -f /etc/ssh/ssh_host_ecdsa_key -t ecdsa -N ""
ssh-keygen -f /etc/ssh/ssh_host_ed25519_key -t ed25519 -N ""
# Make a directory where Hadoop will be located
mkdir /hadoop_home
cd /hadoop_home
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar -xvzf hadoop-2.7.7.tar.gz
# Add these environment variables to the ~/.bashrc file
echo "export JAVA_HOME=\$(readlink -f /usr/bin/javac | xargs dirname | xargs dirname)" >> ~/.bashrc
echo "export HADOOP_HOME=/hadoop_home/hadoop-2.7.7" >> ~/.bashrc
echo "export HADOOP_CONFIG_HOME=\$HADOOP_HOME/etc/hadoop" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_HOME/bin" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_HOME/sbin" >> ~/.bashrc
# Update Core and Map Reduce Setting Files
# (fs.default.name and hadoop.tmp.dir go in core-site.xml so the HDFS daemons can see them;
#  mapred.job.tracker stays in mapred-site.xml)
cp /hadoop_home/hadoop-2.7.7/etc/hadoop/mapred-site.xml.template /hadoop_home/hadoop-2.7.7/etc/hadoop/mapred-site.xml
sed -i '/<\/configuration>/i \
<property>\
<name>mapred.job.tracker<\/name>\
<value>nn:9001<\/value>\
<\/property>' /hadoop_home/hadoop-2.7.7/etc/hadoop/mapred-site.xml
sed -i '/<\/configuration>/i \
<property>\
<name>hadoop.tmp.dir<\/name>\
<value>\/hadoop_home\/temp<\/value>\
<\/property>\
<property>\
<name>fs.default.name<\/name>\
<value>hdfs:\/\/nn:9000<\/value>\
<final>true<\/final>\
<\/property>' /hadoop_home/hadoop-2.7.7/etc/hadoop/core-site.xml
# Update HDFS Setting File
sed -i '/<\/configuration>/i \
<property>\
<name>dfs.replication<\/name>\
<value>2<\/value>\
<final>true<\/final>\
<\/property>\
\
<property>\
<name>dfs.namenode.name.dir<\/name>\
<value>\/hadoop_home\/namenode_dir<\/value>\
<final>true<\/final>\
<\/property>\
\
<property>\
<name>dfs.datanode.data.dir<\/name>\
<value>\/hadoop_home\/datanode_dir<\/value>\
<final>true<\/final>\
<\/property>' /hadoop_home/hadoop-2.7.7/etc/hadoop/hdfs-site.xml
# Make directories for the NameNode, the DataNodes, and temporary files
mkdir /hadoop_home/temp
mkdir /hadoop_home/namenode_dir
mkdir /hadoop_home/datanode_dir
Run the commands below to install everything and format the NameNode.
chmod +x install_hdfs.sh && ./install_hdfs.sh
source ~/.bashrc && hadoop namenode -format
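Because start-all.sh later logs into every node over SSH, you may also want to make sure sshd is enabled and host-key prompts are suppressed before committing the image. A minimal sketch (assuming systemd is available, which it is here since the container was started with /sbin/init):
# Enable and start the SSH daemon so the Hadoop start scripts can log in
systemctl enable sshd
systemctl start sshd
# Skip the interactive host-key confirmation prompt
printf "Host *\n  StrictHostKeyChecking no\n" >> ~/.ssh/config
# Sanity check: the hadoop binary should now be on the PATH
hadoop version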
Exit the container and commit it as an image.
docker commit hdfs centos:hdfs
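You can confirm the image exists before moving on:
# The centos repository should now show an hdfs tag
docker images centos:hdfs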
Set Up Your Cluster
Once the container has been committed as an image, we can create the cluster containers:
sudo docker run -it -h nn --restart always --privileged=true --tmpfs /run --name nn -p 50070:50070 centos:hdfs
sudo docker run -it -h dn1 --restart always --privileged=true --tmpfs /run --name dn1 --link nn:nn centos:hdfs
sudo docker run -it -h dn2 --restart always --privileged=true --tmpfs /run --name dn2 --link nn:nn centos:hdfs
sudo docker run -it -h dn3 --restart always --privileged=true --tmpfs /run --name dn3 --link nn:nn centos:hdfs
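Note that -it attaches your terminal to each container. If you would rather start all four in the background, one option (an adjustment, not part of the original commands) is to add -d, for example:
# Same NameNode container, but started detached so the terminal is freed
sudo docker run -d -it -h nn --restart always --privileged=true --tmpfs /run --name nn -p 50070:50070 centos:hdfs
# Attach a shell later when you need one
docker exec -it nn bash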
Next, we need to find the IP address of each container. After retrieving them, add the DataNode entries to the /etc/hosts file inside the NameNode container.
docker inspect nn | grep IPAddress \
; docker inspect dn1 | grep IPAddress \
; docker inspect dn2 | grep IPAddress \
; docker inspect dn3 | grep IPAddress
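If you prefer to pull out just the addresses, docker inspect's -f/--format flag can do it in one command (a convenience, not required):
# Prints one IP address per container (default bridge network)
docker inspect -f '{{ .NetworkSettings.IPAddress }}' nn dn1 dn2 dn3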
The IP addresses below are examples of the entries that should end up in the file; substitute the addresses you obtained above.
echo "172.17.0.3 dn1" >> /etc/hosts
echo "172.17.0.4 dn2" >> /etc/hosts
echo "172.17.0.5 dn3" >> /etc/hosts
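If you want to script this step from the host instead of typing the entries inside the NameNode container, a rough sketch (assuming the default bridge network and the container names used above):
# Append each DataNode's IP and hostname to the NameNode's /etc/hosts
for dn in dn1 dn2 dn3; do
  ip=$(docker inspect -f '{{ .NetworkSettings.IPAddress }}' $dn)
  docker exec nn bash -c "echo '$ip $dn' >> /etc/hosts"
done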
Also, make sure to add the slaves to the slaves file. The exact path of the file is $HADOOP_CONFIG_HOME/slaves.
echo "dn1" >> $HADOOP_CONFIG_HOME/slaves
echo "dn2" >> $HADOOP_CONFIG_HOME/slaves
echo "dn3" >> $HADOOP_CONFIG_HOME/slaves
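One thing to be aware of: the slaves file that ships with the Hadoop 2.7.7 tarball already contains a localhost entry, so appending dn1, dn2, and dn3 will also start a DataNode on the NameNode itself. If that is not what you want, you could overwrite the file instead (a sketch):
# Replace the default contents (including "localhost") so only dn1-dn3 run DataNodes
printf "dn1\ndn2\ndn3\n" > $HADOOP_CONFIG_HOME/slaves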
After following all the steps above, you can start Hadoop from the NameNode container with the following command.
start-all.sh
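To check that everything came up, you can run jps on the NameNode and ask HDFS for a report; with three DataNodes the report should show three live nodes. The NameNode web UI should also be reachable on port 50070, which we published earlier.
# List the Java daemons running in this container (you should at least see NameNode)
jps
# Cluster summary; with all DataNodes registered it should report three live datanodes
hdfs dfsadmin -report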
Thanks for reading this guide, and if you like this kind of content, don't forget to follow me for more. Thank you :D