HDFS Clustering Through Docker in CentOS
Overview
This guide shows how to deploy the Hadoop Distributed File System (HDFS). We will first build a Docker image, and from that image we will create 1 NameNode (Master) and 3 DataNodes (Slaves).
Docker Image Generation
Run the commands below to create a CentOS container in which to install HDFS.
docker run -d -t --privileged --network host --name hdfs centos:7 /sbin/init
docker exec -it hdfs bash
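If you want to confirm the container came up correctly before exec-ing into it, a quick optional check (not part of the original steps) is:
# The hdfs container should be listed with status "Up"
docker ps --filter name=hdfs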
Create an install_hdfs.sh file and add the contents below.
# Install Java and Needed Packages
yum update -y
yum install wget -y
yum install vim -y
yum install openssh-server openssh-clients openssh-askpass -y
yum install java-1.8.0-openjdk-devel.x86_64 -y
# Make SSH keys so the nodes can communicate without a password prompt
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-keygen -f /etc/ssh/ssh_host_rsa_key -t rsa -N ""
ssh-keygen -f /etc/ssh/ssh_host_ecdsa_key -t ecdsa -N ""
ssh-keygen -f /etc/ssh/ssh_host_ed25519_key -t ed25519 -N ""
# Make a directory where Hadoop will be located
mkdir /hadoop_home
cd /hadoop_home
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar -xvzf hadoop-2.7.7.tar.gz
# Add these environment variables to the ~/.bashrc file
echo "export JAVA_HOME=\$(readlink -f /usr/bin/javac | xargs dirname | xargs dirname)" >> ~/.bashrc
echo "export HADOOP_HOME=/hadoop_home/hadoop-2.7.7" >> ~/.bashrc
echo "export HADOOP_CONFIG_HOME=\$HADOOP_HOME/etc/hadoop" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_HOME/bin" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_HOME/sbin" >> ~/.bashrc
# Update Core and Map Reduce Setting Files
# (fs.default.name and hadoop.tmp.dir go in core-site.xml so the HDFS daemons can see them;
#  mapred.job.tracker stays in mapred-site.xml)
cp /hadoop_home/hadoop-2.7.7/etc/hadoop/mapred-site.xml.template /hadoop_home/hadoop-2.7.7/etc/hadoop/mapred-site.xml
sed -i '/<\/configuration>/i \
<property>\
<name>mapred.job.tracker<\/name>\
<value>nn:9001<\/value>\
<\/property>' /hadoop_home/hadoop-2.7.7/etc/hadoop/mapred-site.xml
sed -i '/<\/configuration>/i \
<property>\
<name>hadoop.tmp.dir<\/name>\
<value>\/hadoop_home\/temp<\/value>\
<\/property>\
<property>\
<name>fs.default.name<\/name>\
<value>hdfs:\/\/nn:9000<\/value>\
<final>true<\/final>\
<\/property>' /hadoop_home/hadoop-2.7.7/etc/hadoop/core-site.xml
# Update HDFS Setting File
sed -i '/<\/configuration>/i \
<property>\
<name>dfs.replication<\/name>\
<value>2<\/value>\
<final>true<\/final>\
<\/property>\
\
<property>\
<name>dfs.namenode.name.dir<\/name>\
<value>\/hadoop_home\/namenode_dir<\/value>\
<final>true<\/final>\
<\/property>\
\
<property>\
<name>dfs.datanode.data.dir<\/name>\
<value>\/hadoop_home\/datanode_dir<\/value>\
<final>true<\/final>\
<\/property>' /hadoop_home/hadoop-2.7.7/etc/hadoop/hdfs-site.xml
# Make directories for the NameNode, the DataNodes, and temporary files
mkdir /hadoop_home/temp
mkdir /hadoop_home/namenode_dir
mkdir /hadoop_home/datanode_dir
Run the commands below to install everything and format the NameNode.
chmod +x install_hdfs.sh && ./install_hdfs.sh
source ~/.bashrc && hadoop namenode -format
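Because start-all.sh later logs into every node over SSH, you may also want to make sure sshd is enabled and host-key prompts are suppressed before committing the image. A minimal sketch (assuming systemd is available, which it is here since the container was started with /sbin/init):
# Enable and start the SSH daemon so the Hadoop start scripts can log in
systemctl enable sshd
systemctl start sshd
# Skip the interactive host-key confirmation prompt
printf "Host *\n  StrictHostKeyChecking no\n" >> ~/.ssh/config
# Sanity check: the hadoop binary should now be on the PATH
hadoop version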
Exit the container and commit it as an image.
docker commit hdfs centos:hdfs
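You can confirm the image exists before moving on:
# The centos repository should now show an hdfs tag
docker images centos:hdfs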
Set Up Your Cluster
Once the container has been committed as an image, we can create the cluster containers:
sudo docker run -it -h nn --restart always --privileged=true --tmpfs /run --name nn -p 50070:50070 centos:hdfs
sudo docker run -it -h dn1 --restart always --privileged=true --tmpfs /run --name dn1 --link nn:nn centos:hdfs
sudo docker run -it -h dn2 --restart always --privileged=true --tmpfs /run --name dn2 --link nn:nn centos:hdfs
sudo docker run -it -h dn3 --restart always --privileged=true --tmpfs /run --name dn3 --link nn:nn centos:hdfs
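Note that -it attaches your terminal to each container. If you would rather start all four in the background, one option (an adjustment, not part of the original commands) is to add -d, for example:
# Same NameNode container, but started detached so the terminal is freed
sudo docker run -d -it -h nn --restart always --privileged=true --tmpfs /run --name nn -p 50070:50070 centos:hdfs
# Attach a shell later when you need one
docker exec -it nn bash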
Next, we need to find the IP address of each container. After retrieving them, add the DataNode entries to the /etc/hosts file inside the NameNode container.
docker inspect nn | grep IPAddress \
; docker inspect dn1 | grep IPAddress \
; docker inspect dn2 | grep IPAddress \
; docker inspect dn3 | grep IPAddress
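If you prefer to pull out just the addresses, docker inspect's -f/--format flag can do it in one command (a convenience, not required):
# Prints one IP address per container (default bridge network)
docker inspect -f '{{ .NetworkSettings.IPAddress }}' nn dn1 dn2 dn3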
The IP addresses below are examples of the entries that should end up in the file; substitute the addresses you obtained above.
echo "172.17.0.3 dn1" >> /etc/hosts
echo "172.17.0.4 dn2" >> /etc/hosts
echo "172.17.0.5 dn3" >> /etc/hosts
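If you want to script this step from the host instead of typing the entries inside the NameNode container, a rough sketch (assuming the default bridge network and the container names used above):
# Append each DataNode's IP and hostname to the NameNode's /etc/hosts
for dn in dn1 dn2 dn3; do
  ip=$(docker inspect -f '{{ .NetworkSettings.IPAddress }}' $dn)
  docker exec nn bash -c "echo '$ip $dn' >> /etc/hosts"
done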
Also, make sure to add the slaves to the slaves file. The exact path of the file is $HADOOP_CONFIG_HOME/slaves.
echo "dn1" >> $HADOOP_CONFIG_HOME/slaves
echo "dn2" >> $HADOOP_CONFIG_HOME/slaves
echo "dn3" >> $HADOOP_CONFIG_HOME/slaves
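One thing to be aware of: the slaves file that ships with the Hadoop 2.7.7 tarball already contains a localhost entry, so appending dn1, dn2, and dn3 will also start a DataNode on the NameNode itself. If that is not what you want, you could overwrite the file instead (a sketch):
# Replace the default contents (including "localhost") so only dn1-dn3 run DataNodes
printf "dn1\ndn2\ndn3\n" > $HADOOP_CONFIG_HOME/slaves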
After following all the steps above, you can start Hadoop from the NameNode container with the following command.
start-all.sh
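To check that everything came up, you can run jps on the NameNode and ask HDFS for a report; with three DataNodes the report should show three live nodes. The NameNode web UI should also be reachable on port 50070, which we published earlier.
# List the Java daemons running in this container (you should at least see NameNode)
jps
# Cluster summary; with all DataNodes registered it should report three live datanodes
hdfs dfsadmin -report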
Thanks for reading this guide, and if you like this kind of content, don't forget to follow me for more. Thank you :D