Automating Hadoop Using Ansible

Hello everyone, back with another article! In this one you will see how we can automate Hadoop using the Linux automation tool Red Hat Ansible, on top of AWS.

What is Hadoop?

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

What is a Hadoop Cluster?

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets.

What is a NameNode?

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster each file's data is kept; it does not store the data of these files itself. The NameNode is a single point of failure for the HDFS cluster.

What is a DataNode?

A DataNode stores data in the Hadoop Distributed File System (HDFS). A functional filesystem has more than one DataNode, with data replicated across them. A DataNode responds to requests from the NameNode for filesystem operations, and client applications can talk to it directly once the NameNode has provided the location of the data.

What is Ansible?

Ansible is a software tool that provides simple but powerful automation for cross-platform computer support. It is primarily intended for IT professionals, who use it for application deployment, updates on workstations and servers, cloud provisioning, configuration management, intra-service orchestration, and nearly anything a systems administrator does on a weekly or daily basis. Ansible doesn't depend on agent software and has no additional security infrastructure, so it's easy to deploy.

How Ansible works

In Ansible, there are two categories of computers: the control node and managed nodes. The control node is a computer that runs Ansible. There must be at least one control node, although a backup control node may also exist. A managed node is any device being managed by the control node.

Ansible works by connecting to nodes (clients, servers, or whatever you're configuring) on a network, and then sending a small program called an Ansible module to that node. Ansible executes these modules over SSH and removes them when finished. The only requirement for this interaction is that your Ansible control node has login access to the managed nodes. SSH Keys are the most common way to provide access, but other forms of authentication are also supported.
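For example, once the control node can SSH into the managed nodes, a single ad-hoc command is enough to push Ansible's built-in ping module to every host in the inventory and confirm connectivity (all here is the host pattern, not a hostname):

ansible all -m ping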

Ansible playbooks

While modules provide the means of accomplishing a task, the way you use them is through an Ansible playbook. A playbook is a configuration file written in YAML that provides instructions for what needs to be done in order to bring a managed node into the desired state. Playbooks are meant to be simple, human-readable, and self-documenting. They are also idempotent, meaning that a playbook can be run on a system at any time without having a negative effect upon it. If a playbook is run on a system that's already properly configured and in its desired state, then that system should still be properly configured after a playbook runs.
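As a minimal illustration of this idempotence (the webservers group and httpd package here are hypothetical, not part of this setup), the following playbook ensures a package is installed; running it a second time reports no changes:

- name: "Install a web server"
  hosts: webservers
  tasks:
  - name: "Ensure httpd is present"
    yum:
      name: httpd
      state: present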

So, let's carry out this practical!

To carry out the above task, I have launched two EC2 virtual machines on the AWS cloud, and my controller node runs on Oracle VirtualBox.

Here is the Ansible configuration file:

ansible.cfg
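A minimal sketch of such an ansible.cfg for this setup; the inventory path, remote user, key location, and privilege-escalation settings are assumptions:

[defaults]
# inventory file and the user/key used to reach the EC2 instances (assumed values)
inventory = /root/ip.txt
remote_user = ec2-user
private_key_file = /root/awskey.pem
host_key_checking = false

[privilege_escalation]
# yum/rpm need root on the managed nodes
become = true
become_method = sudo
become_user = root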

Here is the Ansible inventory file:

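A minimal sketch of the inventory, assuming the two instances are grouped to match the hosts: lines of the playbook (the IPs below are placeholders, not the real ones):

[namenode]
192.0.2.10

[datanode]
192.0.2.20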

Configuring hdfs-site

Instead of editing the existing files through the playbook, a more efficient way is to copy the file from the controller after making the required changes. Create the hdfs-site.xml file on the controller and write it with Jinja syntax so Ansible can parse it:

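A sketch of what the template contains, under the assumption that the HDFS property name is assembled from the node variable (so it becomes dfs.name.dir on the master and dfs.data.dir on the slave) and the storage path comes from hdfs_dir:

<configuration>
  <property>
    <!-- node is prompted at run time: "name" for the master, "data" for the slave -->
    <name>dfs.{{ node }}.dir</name>
    <value>{{ hdfs_dir }}</value>
  </property>
</configuration>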

Here node and hdfs_dir are variables we create in the main playbook. They let us reuse the same template for both the master (namenode) and slave (datanode) nodes.

We use the template module rather than copy to transfer the files, so that the file gets parsed by Ansible (Jinja2) during task execution.

Configuring core-site

Similarly, we configure the core-site.xml file as follows:

core-site.xml
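A sketch of the core-site.xml template, assuming the namenode address is built from the prompted ip_addr variable (port 9001 is a common choice for Hadoop 1.x setups like this, but it is an assumption here):

<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- every node points at the namenode; ip_addr is prompted at run time -->
    <value>hdfs://{{ ip_addr }}:9001</value>
  </property>
</configuration>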

Here is the complete playbook for configuring the target nodes as namenode and datanode:

hadoop.yml

- name: "Namenode configuration"
  hosts: namenode

  vars_prompt:

  - name: "hdfs_dir"

    prompt: "Enter Namenode Directory"

    private: no


  - name: "node"

    prompt: "Enter node"

    private: no
 

  - name: "ip_addr"

    prompt: "Enter the Ip Address"

    private: no
 

  tasks:

  - name: "Copying JDK"

    copy:

       src: "/root/jdk-8u171-linux-x64.rpm"

       dest: /home/ec2-user/

    register: jdk
 

  - name: "Copying Hadoop"

    copy:

       src: "/root/hadoop-1.2.1-1.x86_64.rpm"

       dest: /home/ec2-user/

    register: hadoop

 
  - name: "Installing JDK"

    yum:

       name: "/home/ec2-user/jdk-8u171-linux-x64.rpm"

       state: present

    when: jdk.failed==false

    register: ijdk

 

  - name: "Installing Hadoop"

    command: "rpm -i /home/ec2-user/hadoop-1.2.1-1.x86_64.rpm --force"

    when: hadoop.failed=false

    register: ihadoop

    when: ijdk.failed==false

 

  - name: "Deleting Directory"

    shell: "rm -rf  {{ hdfs_dir }}"

    ignore_errors: yes

 

  - name: "Creating directory"

    file:

       state: directory

       path: "{{ hdfs_dir }}"

 

  - name: "Configuring hdfs-site"

    template:

       src: "/AnsibleWS/hdfs-site.xml"

       dest: "/etc/hadoop/hdfs-site.xml"

    when: ihadoop.failed==false

 

  - name: "Configuring core-site"

    template:

       src: "/AnsibleWS/core-site.xml"

       dest: "/etc/hadoop/core-site.xml"

    when: ihadoop.failed==false

 

  - name: "Formatting the Namenode"

    shell: "echo Y | hadoop namenode -format"

    register: format

 

  - debug:

       var: format.stdout

 

  - name: "stopping the namenode"

    command: hadoop-daemon.sh stop namenode

    ignore_errors: yes

 

  - name: "starting the namenode server"

    command: hadoop-daemon.sh start namenode

    when: format.failed==false

    register: startnn

 

  - debug:

       var: startnn.stdout

 

  - name: "checking status"

    command: jps

    register: jps

    when: format.failed==false and startnn.failed==false

 

  - debug:

       var: jps

 

 ############################################################

 
- name: "Datanode configuration"

  hosts: datanode

  vars_prompt:

  - name: "hdfs_dir"

    prompt: "Enter Datanode Directory"

    private: no

 

  - name: "node"

    prompt: "Enter node"

    private: no

 

  - name: "ip_addr"

    prompt: "Enter the namenode Ip Address"

    private: no

 

  tasks:

  - name: "Copying JDK"

    copy:

       src: "/root/jdk-8u171-linux-x64.rpm"

       dest: /home/ec2-user/

    register: jdk

 

  - name: "Copying Hadoop"

    copy:

       src: "/root/hadoop-1.2.1-1.x86_64.rpm"

       dest: /home/ec2-user/

    register: hadoop

 

  - name: "Installing JDK"

    yum:

       name: "/home/ec2-user/jdk-8u171-linux-x64.rpm"

       state: present

    when: jdk.failed==false

    register: ijdk

 

  - name: "Deleting Directory"

    shell: "rm -rf  {{ hdfs_dir }}"

    ignore_errors: yes

 

  - name: "Creating directory"

    file:

       state: directory

       path: "{{ hdfs_dir }}"

 

 

  - name: "Installing Hadoop"

    command: "rpm -i /home/ec2-user/hadoop-1.2.1-1.x86_64.rpm --force"

    when: hadoop.failed=false

    register: ihadoop

    when: ijdk.failed==false

 

  - name: "Configuring hdfs-site"

    template:

       src: hdfs-site.xml

       dest: /etc/hadoop/hdfs-site.xml

    when: ihadoop.failed==false

    register: hdfs

 

  - name: "Configuring core-site"

    template:

       src: core-site.xml

       dest: /etc/hadoop/core-site.xml

    when: ihadoop.failed==false

    register: core

 

  - name: "stopping the datanode server"

    command: hadoop-daemon.sh stop datanode

    ignore_errors: yes

 

  - name: "starting the datanode server"

    command: hadoop-daemon.sh start datanode

    when: ihadoop.failed==false

    register: startdn

 

  - debug:

       var: startdn.stdout

 

  - name: "checking status"

    command: jps

    register: jps

    when: ihadoop.failed==false and startdn.failed==false

 

  - debug:

       var: jps.stdout

 

  - name: Pause for 15 seconds to build cache

    pause:

       seconds: 15

 

  - name: "Checking Report"

    shell: "hadoop dfsadmin -report"

    register: report

  - debug:

          var: report.stdout

We now have our playbook configured. Run it to automate the configuration:

ansible-playbook hadoop.yml

Now, from the namenode, we can see the cluster report for confirmation:

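This is the output of the same command the playbook's last task runs, and it can be re-run manually on the namenode at any time:

hadoop dfsadmin -report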

We can also check it from the Hadoop WebUI; for Hadoop 1.x, the NameNode UI is served on port 50070 by default.


Thank you! Keep learning, keep sharing!

Have a good day!
