Configuring Hadoop (NameNode/DataNode) via Ansible
Before getting hands-on with any practical implementation, it is always good to know the terminology.
What is Apache Hadoop?
* Hadoop is a tool made by the Apache community to solve big data problems by pooling storage and compute (RAM/CPU) from different computers that act as DataNodes and work under a master computer, the NameNode.
* DataNodes provide the NameNode with a particular amount of storage and contribute that storage by sharing it over a network.
* Big data challenges such as volume, velocity and veracity can be addressed by using a big data handling tool like Hadoop.
What is Ansible?
* Ansible is a DevOps tool from Red Hat with which we can push almost any configuration we will ever need onto a managed device. It is important to note that Ansible is primarily meant for configuration; although it can perform other tasks such as provisioning an OS, that feature exists to automate configuration, not to launch the OS itself.
* Ansible is an agentless management tool that works on a push mechanism, which means no agent needs to be set up on the managed nodes for Ansible to work.
* Ansible uses a declarative approach: we just have to tell Ansible what to do, and how to do it is taken care of by the smart modules that Ansible uses.
* Ansible also provides idempotency.
Idempotency means Ansible does not blindly re-run a task every time we trigger it: it first goes to the managed node, checks the current state of the system, and only then decides whether to apply the change or report that the desired state is already achieved. A short example follows.
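As a minimal sketch of idempotency (the host group web and the httpd package are placeholders for illustration, not part of the cluster setup): running this play twice changes the system only on the first run, while the second run simply reports "ok" because the desired state already exists.

- hosts: web
  tasks:
    - name: Ensure the Apache web server package is installed
      package:
        name: httpd        # the desired state is declared, not the steps to reach it
        state: present     # a re-run reports "ok" once the state is already achieved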
Why do we need Ansible here?
Ansible is used here to automate the configuration of the Hadoop cluster, which matters especially when Hadoop is deployed in larger environments.
A lot of manual work, which often leads to errors, can thus be avoided by using automation scripts.
Therefore, assuming the reader has a basic knowledge of both Ansible and Hadoop, let's look at the Ansible playbook.
Assumptions:
* It is assumed that you want to set up your Hadoop cluster with only the basic properties: dfs.name.dir for the NameNode and dfs.data.dir for the DataNodes, plus fs.default.name on both.
* The variables nndir, nnport, dndir and nnip can be changed according to your environment.
* The folder named files contains the templates used to copy the basic layout of hdfs-site.xml and core-site.xml (an assumed skeleton is shown right after this list).
* The playbook is intended for Red Hat Enterprise Linux 8; it may be used in other environments after some modification.
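For reference, the two templates in the files/ folder are assumed here to contain only the standard Hadoop configuration header, since the playbook appends its own <configuration> block with echo; your actual templates may differ.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- files/hdfs-site.xml and files/core-site.xml (assumed skeleton):
     the playbook tasks append the <configuration> properties after this. -->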
- hosts: namenode
  vars:
    nndir: "/nn"
    nnport: 9001
  tasks:
    - name: Making folder for Redhat DVD
      file:
        path: /dvd
        state: directory
        #mode: 0755
    - name: Mounting Redhat DVD
      mount:
        src: /dev/cdrom
        path: /dvd
        fstype: iso9660
        state: present
    - name: Making Repository for Redhat Disk AppStream
      yum_repository:
        name: App1
        description: "Redhat DVD App List 1"
        baseurl: "file:///dvd/AppStream"
        file: redhatdvd
        gpgcheck: no
    - name: Making Repository for Redhat Disk BaseOS
      yum_repository:
        name: App2
        description: "Redhat DVD App List 2"
        baseurl: "file:///dvd/BaseOS"
        file: redhatdvd
        gpgcheck: no
    - name: Installing wget if not Available
      package:
        name: wget
        state: present
    - name: Installing Java JDK
      package:
        name: jdk
        state: present
    - name: Downloading Hadoop Software
      command: "wget -c https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm"
    - name: Installing Hadoop Software
      command: "rpm -i --force hadoop-1.2.1-1.x86_64.rpm"
    - name: Deleting preexisting folder for NameNode if present
      file:
        path: "{{ nndir }}"
        state: absent
    - name: Making folder for NameNode
      file:
        path: "{{ nndir }}"
        state: directory
    - name: Copying hdfs-site.xml file from Controller Node
      copy:
        src: files/hdfs-site.xml
        dest: /etc/hadoop/
    - name: Copying core-site.xml file from Controller Node
      copy:
        src: files/core-site.xml
        dest: /etc/hadoop/
    - name: Adding dfs.name.dir property
      shell: |
        echo '<configuration>
        <property>
        <name>dfs.name.dir</name>
        <value>{{ nndir }}</value>
        </property>
        </configuration>' >> /etc/hadoop/hdfs-site.xml
    - name: Adding fs.default.name property
      shell: |
        echo '<configuration>
        <property>
        <name>fs.default.name</name>
        <value>hdfs://{{ ansible_facts['default_ipv4']['address'] }}:{{ nnport }}</value>
        </property>
        </configuration>' >> /etc/hadoop/core-site.xml
    - name: Checking process running
      command: "pidof /usr/java/default/bin/java"
      register: x
      failed_when: false
      ignore_errors: yes
    - name: Killing NameNode process if already running
      shell: "kill `pidof /usr/java/default/bin/java`"
      when: x.rc == 0
    - name: Formatting the namenode directory
      shell: "echo 'Y' | hadoop namenode -format"
    - name: Starting Namenode
      command: "hadoop-daemon.sh start namenode"
One of the important steps in the above tasks is checking for an already running Java process, which may exist because of previously installed Hadoop software; this check is also what makes our setup idempotent in nature.
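After the play finishes, the NameNode can be verified manually on the managed node; assuming the JDK's jps utility is on the PATH, a quick check looks like this (the PIDs shown are only an example):

# jps lists running Java processes; a healthy start shows a NameNode entry
jps
# 4821 NameNode
# 4902 Jps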
- hosts: datanodes
  vars:
    dndir: "/dn"
    nnip: "{{ groups.namenode[0] }}"
    nnport: 9001
  tasks:
    - name: Making folder for Redhat DVD
      file:
        path: /dvd
        state: directory
    - name: Mounting Redhat DVD
      mount:
        src: /dev/cdrom
        path: /dvd
        fstype: iso9660
        state: present
    - name: Making Repository for Redhat Disk AppStream
      yum_repository:
        name: App1
        description: "Redhat DVD App List 1"
        baseurl: "file:///dvd/AppStream"
        file: redhatdvd
        gpgcheck: no
    - name: Making Repository for Redhat Disk BaseOS
      yum_repository:
        name: App2
        description: "Redhat DVD App List 2"
        baseurl: "file:///dvd/BaseOS"
        file: redhatdvd
        gpgcheck: no
    - name: Installing wget if not Available
      package:
        name: wget
        state: present
    - name: Installing Java JDK
      package:
        name: jdk
        state: present
    - name: Downloading Hadoop Software
      command: "wget -c https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm"
    - name: Installing Hadoop Software
      command: "rpm -i --force hadoop-1.2.1-1.x86_64.rpm"
    - name: Deleting preexisting folder for DataNode if present
      file:
        path: "{{ dndir }}"
        state: absent
    - name: Making folder for DataNode
      file:
        path: "{{ dndir }}"
        state: directory
    - name: Copying hdfs-site.xml file from Controller Node
      copy:
        src: files/hdfs-site.xml
        dest: /etc/hadoop/
    - name: Copying core-site.xml file from Controller Node
      copy:
        src: files/core-site.xml
        dest: /etc/hadoop/
    - name: Adding dfs.data.dir property
      shell: |
        echo '<configuration>
        <property>
        <name>dfs.data.dir</name>
        <value>{{ dndir }}</value>
        </property>
        </configuration>' >> /etc/hadoop/hdfs-site.xml
    - name: Adding fs.default.name property
      shell: |
        echo '<configuration>
        <property>
        <name>fs.default.name</name>
        <value>hdfs://{{ nnip }}:{{ nnport }}</value>
        </property>
        </configuration>' >> /etc/hadoop/core-site.xml
    - name: Checking DataNode process if running already
      shell: "pidof /usr/java/default/bin/java"
      register: x
      failed_when: false
      ignore_errors: yes
    - name: Killing previously running process
      shell: "kill `pidof /usr/java/default/bin/java`"
      when: x.rc == 0
    - name: Starting DataNode
      command: "hadoop-daemon.sh start datanode"
Time to apply the playbook
To run the playbook, we first have to create an inventory file and put the IPs of the NameNode and the DataNodes into their respective groups; a sample inventory is shown after the note below.
* Note: It is not recommended to use the root user.
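A minimal sketch of such an inventory (the IP addresses, user name and privilege-escalation settings are placeholders; the group names must match the hosts: lines of the two plays):

[namenode]
192.168.1.10   ansible_user=devops   ansible_become=yes

[datanodes]
192.168.1.11   ansible_user=devops   ansible_become=yes
192.168.1.12   ansible_user=devops   ansible_become=yes

Because the DataNode play derives nnip from groups.namenode[0], list the NameNode by its IP address rather than an alias, so that fs.default.name resolves correctly on the DataNodes.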
It is always good practice to check the connection to the managed nodes before running any Ansible commands, and after that we can finally run our playbook.
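Assuming the inventory above is saved as inventory.txt and both plays live in a file named hadoop-cluster.yml (both file names are placeholders), the connectivity check and the run look like this:

# Verify that every managed node is reachable
ansible all -i inventory.txt -m ping

# Apply both plays (the NameNode play runs first, then the DataNodes play)
ansible-playbook -i inventory.txt hadoop-cluster.yml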
The snapshot below shows the playbook running.
Finally, we can use the Hadoop health site (the NameNode web UI) to see the connected nodes as well as their last contact time.
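The same report is also available from the command line on the NameNode; in Hadoop 1.x the dfsadmin -report subcommand prints the cluster capacity plus a per-DataNode section that includes the last contact time (the NameNode web UI itself listens on port 50070 by default).

# Run on the NameNode to list connected DataNodes and their last contact
hadoop dfsadmin -report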
So that's a start after so many days of getting back to work; let us continue and keep on learning day by day.