HDFS Clustering Through Docker in CentOS

Overview

This guide shows how to deploy the Hadoop Distributed File System (HDFS). We will first build a Docker image and, from that image, create one NameNode (master) and three DataNodes (slaves).

Docker Image Generation

Run the commands below to create a CentOS container in which HDFS will be installed.

docker run -d -t --privileged --network host --name hdfs centos:7 /sbin/init
docker exec -it hdfs bash        
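If the docker exec command fails, check from the host that the container is actually running:

docker ps --filter "name=hdfs" --format "{{.Names}}\t{{.Status}}"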

Create a file named install_hdfs.sh and add the contents below.

# Install Java and Needed Packages
yum update -y
yum install wget -y
yum install vim -y
yum install openssh-server openssh-clients openssh-askpass -y
yum install java-1.8.0-openjdk-devel.x86_64 -y

# Generate keys so the nodes can communicate without being prompted for a password
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-keygen -f /etc/ssh/ssh_host_rsa_key -t rsa -N ""
ssh-keygen -f /etc/ssh/ssh_host_ecdsa_key -t ecdsa -N ""
ssh-keygen -f /etc/ssh/ssh_host_ed25519_key -t ed25519 -N ""
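
# Optional: start-dfs.sh logs into every node over SSH, and the first
# connection normally pauses to confirm each host key. Relaxing host-key
# checking for the cluster hostnames avoids that interactive prompt
# (adjust this to your own security requirements).
cat >> ~/.ssh/config <<'EOF'
Host nn dn1 dn2 dn3
    StrictHostKeyChecking no
EOF
chmod 600 ~/.ssh/config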

# Make a Directory where hadoop will be located
mkdir /hadoop_home
cd /hadoop_home
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar -xvzf hadoop-2.7.7.tar.gz

# Add all these environment variables to the ~/.bashrc file
echo "export JAVA_HOME=\$(readlink -f /usr/bin/javac | xargs dirname | xargs dirname)" >> ~/.bashrc
echo "export HADOOP_HOME=/hadoop_home/hadoop-2.7.7" >> ~/.bashrc
echo "export HADOOP_CONFIG_HOME=\$HADOOP_HOME/etc/hadoop" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_HOME/bin" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_HOME/sbin" >> ~/.bashrc

# Update Core Setting File (the NameNode and DataNodes read
# fs.default.name and hadoop.tmp.dir from core-site.xml)
sed -i '/<\/configuration>/i \
    <property>\
        <name>hadoop.tmp.dir<\/name>\
        <value>\/hadoop_home\/temp<\/value>\
    <\/property>\
    <property>\
        <name>fs.default.name<\/name>\
        <value>hdfs:\/\/nn:9000<\/value>\
        <final>true<\/final>\
    <\/property>' /hadoop_home/hadoop-2.7.7/etc/hadoop/core-site.xml

# Update Map Reduce Setting File
cp /hadoop_home/hadoop-2.7.7/etc/hadoop/mapred-site.xml.template /hadoop_home/hadoop-2.7.7/etc/hadoop/mapred-site.xml
sed -i '/<\/configuration>/i \
    <property>\
        <name>mapred.job.tracker<\/name>\
        <value>nn:9001<\/value>\
    <\/property>' /hadoop_home/hadoop-2.7.7/etc/hadoop/mapred-site.xml

# Update HDFS Setting File
sed -i '/<\/configuration>/i \
    <property>\
        <name>dfs.replication<\/name>\
        <value>2<\/value>\
        <final>true<\/final>\
    <\/property>\
    <property>\
        <name>dfs.namenode.name.dir<\/name>\
        <value>\/hadoop_home\/namenode_dir<\/value>\
        <final>true<\/final>\
    <\/property>\
    <property>\
        <name>dfs.datanode.data.dir<\/name>\
        <value>\/hadoop_home\/datanode_dir<\/value>\
        <final>true<\/final>\
    <\/property>' /hadoop_home/hadoop-2.7.7/etc/hadoop/hdfs-site.xml

# Make directories for the NameNode, the DataNodes, and temporary files
mkdir /hadoop_home/temp
mkdir /hadoop_home/namenode_dir
mkdir /hadoop_home/datanode_dir        

Run the commands below to install all the packages and format the NameNode.

chmod +x install_hdfs.sh && ./install_hdfs.sh
source ~/.bashrc && hadoop namenode -format        
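Before committing, it is worth a quick sanity check that the installation and the NameNode format both succeeded. For example:

java -version                             # OpenJDK 1.8 should be reported
hadoop version                            # should print Hadoop 2.7.7
ls /hadoop_home/namenode_dir/current/     # a VERSION file appears after a successful format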

Exit the container and commit it as an image.

docker commit hdfs centos:hdfs        
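You can confirm that the new image exists with:

docker images centos:hdfs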

Set Up Your Cluster

Now that the container has been committed as an image, we can create the cluster environment:

  • NameNode —> nn
  • DataNode1 —> dn1
  • DataNode2 —> dn2
  • DataNode3 —> dn3

sudo docker run -it -h nn --restart always --privileged=true --tmpfs /run --name nn -p 50070:50070 centos:hdfs
sudo docker run -it -h dn1 --restart always --privileged=true --tmpfs /run --name dn1 --link nn:nn centos:hdfs
sudo docker run -it -h dn2 --restart always --privileged=true --tmpfs /run --name dn2 --link nn:nn centos:hdfs
sudo docker run -it -h dn3 --restart always --privileged=true --tmpfs /run --name dn3 --link nn:nn centos:hdfs        
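Assuming each container was started in its own terminal (or left running in the background), you can check from the host that all four are up:

docker ps --filter "ancestor=centos:hdfs" --format "{{.Names}}\t{{.Status}}"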

Next, we need to extract the IP addresses of the DataNode containers. After extracting them, add them to the /etc/hosts file inside the NameNode container.

docker inspect nn | grep IPAddress \
; docker inspect dn1 | grep IPAddress \
; docker inspect dn2 | grep IPAddress \
; docker inspect dn3 | grep IPAddress
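If you only want the addresses themselves, docker inspect also accepts a Go template; an equivalent, more compact query (assuming the containers sit on the default bridge network) is:

for c in nn dn1 dn2 dn3; do
    echo -n "$c "
    docker inspect -f '{{.NetworkSettings.IPAddress}}' "$c"
done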

The IP addresses below are examples of the entries that should end up in the file; your addresses may differ.

echo "172.17.0.3      dn1" >> /etc/hosts
echo "172.17.0.4      dn2" >> /etc/hosts
echo "172.17.0.5      dn3" >> /etc/hosts
        

Also, make sure to add the slaves to the slaves file. The exact path of the file is $HADOOP_CONFIG_HOME/slaves. Note that the file ships with a localhost entry by default; remove that line unless you also want a DataNode running on the NameNode itself.

echo "dn1" >> $HADOOP_CONFIG_HOME/slaves
echo "dn2" >> $HADOOP_CONFIG_HOME/slaves
echo "dn3" >> $HADOOP_CONFIG_HOME/slaves
        

After following all the steps above, you can start Hadoop with the following command.

start-all.sh
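To check that the cluster actually came up, a couple of quick checks on the NameNode are:

jps                     # should list the NameNode (and SecondaryNameNode) processes
hdfs dfsadmin -report   # should report the live DataNodes

The NameNode web UI should also be reachable on port 50070, which is why that port was published when the nn container was started.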
        

Thanks for reading this guide. If you like this kind of content, don't forget to follow me for more. Thank you :D



