Automating Hadoop Using Ansible
Surayya Shaikh
1x RedHat Certified | ARTH LEARNER | RHCE | Kubernetes | DevOps | Docker | Linux | Python | AWS
Hello everyone, back with another article. In this one you will see how we can automate a Hadoop cluster using the Linux automation tool Red Hat Ansible, on top of AWS...
What is Hadoop?
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
What is a Hadoop Cluster?
A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets.
What is a NameNode?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. The NameNode is a single point of failure for the HDFS cluster.
What is a DataNode?
A DataNode stores data in the Hadoop Distributed File System (HDFS). A functional filesystem has more than one DataNode, with data replicated across them. It responds to requests from the NameNode for filesystem operations. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
What is Ansible?
Ansible is a software tool that provides simple but powerful automation for cross-platform computer support. It is primarily intended for IT professionals, who use it for application deployment, updates on workstations and servers, cloud provisioning, configuration management, intra-service orchestration, and nearly anything a systems administrator does on a weekly or daily basis. Ansible doesn't depend on agent software and has no additional security infrastructure, so it's easy to deploy.
How Ansible works
In Ansible, there are two categories of computers: the control node and managed nodes. The control node is a computer that runs Ansible. There must be at least one control node, although a backup control node may also exist. A managed node is any device being managed by the control node.
Ansible works by connecting to nodes (clients, servers, or whatever you're configuring) on a network, and then sending a small program called an Ansible module to that node. Ansible executes these modules over SSH and removes them when finished. The only requirement for this interaction is that your Ansible control node has login access to the managed nodes. SSH Keys are the most common way to provide access, but other forms of authentication are also supported.
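For example, once the control node can SSH into the managed nodes, a quick ad-hoc run of the ping module confirms connectivity (the group names here are just illustrative placeholders):

    # Ad-hoc check: run the "ping" module on every host in the inventory
    ansible all -m ping

    # Or target a single inventory group (group name is illustrative)
    ansible namenode -m ping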
Ansible playbooks
While modules provide the means of accomplishing a task, the way you use them is through an Ansible playbook. A playbook is a configuration file written in YAML that provides instructions for what needs to be done in order to bring a managed node into the desired state. Playbooks are meant to be simple, human-readable, and self-documenting. They are also idempotent, meaning that a playbook can be run on a system at any time without having a negative effect upon it. If a playbook is run on a system that's already properly configured and in its desired state, then that system should still be properly configured after a playbook runs.
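As a tiny illustration of idempotency (a generic example, not part of this article's Hadoop setup), the following playbook ensures a package is present; running it a second time reports no changes because the system is already in the desired state:

    - name: "Idempotency demo"
      hosts: all
      tasks:
        - name: "Ensure the tree package is installed"
          yum:
            name: tree
            state: present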
So, let’s carry out this practical!
To carry out the above task, I have launched two EC2 virtual machines on top of the AWS cloud, and my controller node runs on Oracle VirtualBox...
Here is the Ansible configuration file (ansible.cfg):
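The original screenshot is not reproduced here, but a minimal ansible.cfg along these lines works for this setup; the inventory path, remote user and key file are assumptions based on the EC2 instances being used:

    [defaults]
    # Path to the inventory file (assumed location)
    inventory         = /root/ip.txt
    # EC2 instances are reached as ec2-user with the instance key pair
    remote_user       = ec2-user
    private_key_file  = /root/mykey.pem
    host_key_checking = False

    [privilege_escalation]
    # Become root on the managed nodes to install packages and edit /etc/hadoop
    become          = true
    become_method   = sudo
    become_user     = root
    become_ask_pass = false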
Here is the Ansible inventory file:
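Again, a sketch of what the inventory looks like; the IP addresses are placeholders for the public IPs of the two EC2 instances:

    [namenode]
    # Public IP of the EC2 instance acting as the namenode (placeholder)
    13.233.xx.xx

    [datanode]
    # Public IP of the EC2 instance acting as the datanode (placeholder)
    65.0.xx.xx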
Configuring hdfs-site
Instead of editing the existing files on the managed nodes through the playbook, a more efficient approach is to prepare the file on the controller with the required changes and copy it over. Create the hdfs-site.xml file on the controller and write it as follows so that it can be parsed with Jinja:
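A minimal version of that template looks like this (the exact layout in the original screenshot may differ; the point is that the property name and directory come from the playbook variables):

    <!-- /AnsibleWS/hdfs-site.xml (Jinja2 template) -->
    <configuration>
      <property>
        <!-- "node" is prompted in the playbook: "name" on the namenode, "data" on the datanode -->
        <name>dfs.{{ node }}.dir</name>
        <!-- "hdfs_dir" is the directory prompted in the playbook -->
        <value>{{ hdfs_dir }}</value>
      </property>
    </configuration>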
Here, node and hdfs_dir are the variables we created in the main playbook (through vars_prompt). They let us reuse the same template for both the master and slave nodes.
We use the template module (rather than copy) so that Ansible parses the file and substitutes the Jinja variables during task execution.
Configuring core-site
Similarly, we configure the core-site.xml file as follows...
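A corresponding core-site.xml template, again as a sketch; the port shown is the one commonly used in Hadoop 1.x setups like this and is an assumption here:

    <!-- /AnsibleWS/core-site.xml (Jinja2 template) -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <!-- "ip_addr" is the namenode IP prompted in the playbook; port 9001 is assumed -->
        <value>hdfs://{{ ip_addr }}:9001</value>
      </property>
    </configuration>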
Here is the complete playbook for configuring the target nodes as namenode and datanode:
hadoop.yml
- name: "Namenode configuration" hosts: namenode vars_prompt: - name: "hdfs_dir" prompt: "Enter Namenode Directory" private: no - name: "node" prompt: "Enter node" private: no - name: "ip_addr" prompt: "Enter the Ip Address" private: no tasks: - name: "Copying JDK" copy: src: "/root/jdk-8u171-linux-x64.rpm" dest: /home/ec2-user/ register: jdk - name: "Copying Hadoop" copy: src: "/root/hadoop-1.2.1-1.x86_64.rpm" dest: /home/ec2-user/ register: hadoop - name: "Installing JDK" yum: name: "/home/ec2-user/jdk-8u171-linux-x64.rpm" state: present when: jdk.failed==false register: ijdk - name: "Installing Hadoop" command: "rpm -i /home/ec2-user/hadoop-1.2.1-1.x86_64.rpm --force" when: hadoop.failed=false register: ihadoop when: ijdk.failed==false - name: "Deleting Directory" shell: "rm -rf {{ hdfs_dir }}" ignore_errors: yes - name: "Creating directory" file: state: directory path: "{{ hdfs_dir }}" - name: "Configuring hdfs-site" template: src: "/AnsibleWS/hdfs-site.xml" dest: "/etc/hadoop/hdfs-site.xml" when: ihadoop.failed==false - name: "Configuring core-site" template: src: "/AnsibleWS/core-site.xml" dest: "/etc/hadoop/core-site.xml" when: ihadoop.failed==false - name: "Formatting the Namenode" shell: "echo Y | hadoop namenode -format" register: format - debug: var: format.stdout - name: "stopping the namenode" command: hadoop-daemon.sh stop namenode ignore_errors: yes - name: "starting the namenode server" command: hadoop-daemon.sh start namenode when: format.failed==false register: startnn - debug: var: startnn.stdout - name: "checking status" command: jps register: jps when: format.failed==false and startnn.failed==false - debug: var: jps ############################################################ - name: "Datanode configuration" hosts: datanode vars_prompt: - name: "hdfs_dir" prompt: "Enter Datanode Directory" private: no - name: "node" prompt: "Enter node" private: no - name: "ip_addr" prompt: "Enter the namenode Ip Address" private: no tasks: - name: "Copying JDK" copy: src: "/root/jdk-8u171-linux-x64.rpm" dest: /home/ec2-user/ register: jdk - name: "Copying Hadoop" copy: src: "/root/hadoop-1.2.1-1.x86_64.rpm" dest: /home/ec2-user/ register: hadoop - name: "Installing JDK" yum: name: "/home/ec2-user/jdk-8u171-linux-x64.rpm" state: present when: jdk.failed==false register: ijdk - name: "Deleting Directory" shell: "rm -rf {{ hdfs_dir }}" ignore_errors: yes - name: "Creating directory" file: state: directory path: "{{ hdfs_dir }}" - name: "Installing Hadoop" command: "rpm -i /home/ec2-user/hadoop-1.2.1-1.x86_64.rpm --force" when: hadoop.failed=false register: ihadoop when: ijdk.failed==false - name: "Configuring hdfs-site" template: src: hdfs-site.xml dest: /etc/hadoop/hdfs-site.xml when: ihadoop.failed==false register: hdfs - name: "Configuring core-site" template: src: core-site.xml dest: /etc/hadoop/core-site.xml when: ihadoop.failed==false register: core - name: "stopping the datanode server" command: hadoop-daemon.sh stop datanode ignore_errors: yes - name: "starting the datanode server" command: hadoop-daemon.sh start datanode when: ihadoop.failed==false register: startdn - debug: var: startdn.stdout - name: "checking status" command: jps register: jps when: ihadoop.failed==false and startdn.failed==false - debug: var: jps.stdout - name: Pause for 15 seconds to build cache pause: seconds: 15 - name: "Checking Report" shell: "hadoop dfsadmin -report" register: report - debug: var: report.stdout
We now have our playbook ready. Run it to automate the configuration:
ansible-playbook hadoop.yml
Now, from the namenode, we can check the cluster report for confirmation:
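The screenshot of the report is not reproduced here, but the command behind it is the same one the playbook runs at the end:

    # Run on the namenode (or any node with the Hadoop client configured)
    hadoop dfsadmin -report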
We can also check it from the Hadoop Web UI (for Hadoop 1.x this is served by the namenode, on port 50070 by default)...
Thank you... Keep learning, keep sharing!!!
Have a good day!!