Configuring Hadoop (NameNode/DataNode) via Ansible
Before getting hands-on with any practical implementation, it is always good to know the terminology.
What is Apache Hadoop?
* Hadoop is a tool made by the Apache community to solve big data problems by pooling storage and compute (RAM/CPU) from different computers that act as DataNodes and work under a master computer, the NameNode.
* DataNodes provide the NameNode with a particular amount of storage and contribute that storage by sharing it over a network.
* Big data challenges such as volume, velocity and veracity can be addressed by using a big data handling tool like Hadoop.
What is Ansible?
* Ansible is a DevOps tool from Red Hat with which we can push almost any configuration we will ever need onto a managed device. It is important to note that Ansible is primarily meant for configuration; although it can perform other tasks such as provisioning an OS, that feature exists to automate configuration, not to launch the OS itself.
* Ansible is an agentless management tool that works on a push mechanism, which means no agent needs to be set up on the managed nodes for Ansible to work.
* Ansible uses a declarative approach: we just have to tell Ansible what to do, and how to do it is taken care of by the smart modules that Ansible uses.
* Ansible also provides idempotency.
Idempotency means Ansible does not blindly re-run a task every time we trigger it: it first goes to the managed node, checks the current state of the system, and only then decides whether to apply the change or report that the desired state is already achieved. A short example follows.
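As a minimal sketch of idempotency (the host group web and the httpd package are placeholders for illustration, not part of the cluster setup): running this play twice changes the system only on the first run, while the second run simply reports "ok" because the desired state already exists.

- hosts: web
  tasks:
    - name: Ensure the Apache web server package is installed
      package:
        name: httpd        # the desired state is declared, not the steps to reach it
        state: present     # a re-run reports "ok" once the state is already achieved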
Why do we need Ansible here?
Ansible is used here to automate the configuration of the Hadoop cluster, which matters especially when Hadoop is deployed in larger environments.
A lot of manual work, which often leads to errors, can thus be avoided by using automation scripts.
Therefore, assuming the reader has a basic knowledge of both Ansible and Hadoop, let's look at the Ansible playbook.
Assumptions:
* It is assumed that you want to set up your Hadoop cluster with only the basic properties: dfs.name.dir for the NameNode and dfs.data.dir for the DataNodes, plus fs.default.name on both.
* The variables nndir, nnport, dndir and nnip can be changed according to your environment.
* The folder named files contains the templates used to copy the basic layout of hdfs-site.xml and core-site.xml (an assumed skeleton is shown right after this list).
* The playbook is intended for Red Hat Enterprise Linux 8; it may be used in other environments after some modification.
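For reference, the two templates in the files/ folder are assumed here to contain only the standard Hadoop configuration header, since the playbook appends its own <configuration> block with echo; your actual templates may differ.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- files/hdfs-site.xml and files/core-site.xml (assumed skeleton):
     the playbook tasks append the <configuration> properties after this. -->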
- hosts: namenode
  vars:
    nndir: "/nn"
    nnport: 9001
  tasks:
    - name: Making folder for Redhat DVD
      file:
        path: /dvd
        state: directory
        #mode: 0755
    - name: Mounting Redhat DVD
      mount:
        src: /dev/cdrom
        path: /dvd
        fstype: iso9660
        state: present
    - name: Making Repository for Redhat Disk AppStream
      yum_repository:
        name: App1
        description: "Redhat DVD App List 1"
        baseurl: "file:///dvd/AppStream"
        file: redhatdvd
        gpgcheck: no
    - name: Making Repository for Redhat Disk BaseOS
      yum_repository:
        name: App2
        description: "Redhat DVD App List 2"
        baseurl: "file:///dvd/BaseOS"
        file: redhatdvd
        gpgcheck: no
    - name: Installing wget if not Available
      package:
        name: wget
        state: present
    - name: Installing Java JDK
      package:
        name: jdk
        state: present
    - name: Downloading Hadoop Software
      command: "wget -c https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm"
    - name: Installing Hadoop Software
      command: "rpm -i --force hadoop-1.2.1-1.x86_64.rpm"
    - name: Deleting preexisting folder for NameNode if present
      file:
        path: "{{ nndir }}"
        state: absent
    - name: Making folder for NameNode
      file:
        path: "{{ nndir }}"
        state: directory
    - name: Copying hdfs-site.xml file from Controller Node
      copy:
        src: files/hdfs-site.xml
        dest: /etc/hadoop/
    - name: Copying core-site.xml file from Controller Node
      copy:
        src: files/core-site.xml
        dest: /etc/hadoop/
    - name: Adding dfs.name.dir property
      shell: |
        echo '<configuration>
        <property>
        <name>dfs.name.dir</name>
        <value>{{ nndir }}</value>
        </property>
        </configuration>' >> /etc/hadoop/hdfs-site.xml
    - name: Adding fs.default.name property
      shell: |
        echo '<configuration>
        <property>
        <name>fs.default.name</name>
        <value>hdfs://{{ ansible_facts['default_ipv4']['address'] }}:{{ nnport }}</value>
        </property>
        </configuration>' >> /etc/hadoop/core-site.xml
    - name: Checking process running
      command: "pidof /usr/java/default/bin/java"
      register: x
      failed_when: false
      ignore_errors: yes
    - name: Killing NameNode process if already running
      shell: "kill `pidof /usr/java/default/bin/java`"
      when: x.rc == 0
    - name: Formatting the namenode directory
      shell: "echo 'Y' | hadoop namenode -format"
    - name: Starting Namenode
      command: "hadoop-daemon.sh start namenode"
One of the important steps in the above tasks is checking for an already running Java process, which may exist because of previously installed Hadoop software; this check is also what makes our setup idempotent in nature.
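After the play finishes, the NameNode can be verified manually on the managed node; assuming the JDK's jps utility is on the PATH, a quick check looks like this (the PIDs shown are only an example):

# jps lists running Java processes; a healthy start shows a NameNode entry
jps
# 4821 NameNode
# 4902 Jps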
- hosts: datanodes
  vars:
    dndir: "/dn"
    nnip: "{{ groups.namenode[0] }}"
    nnport: 9001
  tasks:
    - name: Making folder for Redhat DVD
      file:
        path: /dvd
        state: directory
    - name: Mounting Redhat DVD
      mount:
        src: /dev/cdrom
        path: /dvd
        fstype: iso9660
        state: present
    - name: Making Repository for Redhat Disk AppStream
      yum_repository:
        name: App1
        description: "Redhat DVD App List 1"
        baseurl: "file:///dvd/AppStream"
        file: redhatdvd
        gpgcheck: no
    - name: Making Repository for Redhat Disk BaseOS
      yum_repository:
        name: App2
        description: "Redhat DVD App List 2"
        baseurl: "file:///dvd/BaseOS"
        file: redhatdvd
        gpgcheck: no
    - name: Installing wget if not Available
      package:
        name: wget
        state: present
    - name: Installing Java JDK
      package:
        name: jdk
        state: present
    - name: Downloading Hadoop Software
      command: "wget -c https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm"
    - name: Installing Hadoop Software
      command: "rpm -i --force hadoop-1.2.1-1.x86_64.rpm"
    - name: Deleting preexisting folder for DataNode if present
      file:
        path: "{{ dndir }}"
        state: absent
    - name: Making folder for DataNode
      file:
        path: "{{ dndir }}"
        state: directory
    - name: Copying hdfs-site.xml file from Controller Node
      copy:
        src: files/hdfs-site.xml
        dest: /etc/hadoop/
    - name: Copying core-site.xml file from Controller Node
      copy:
        src: files/core-site.xml
        dest: /etc/hadoop/
    - name: Adding dfs.data.dir property
      shell: |
        echo '<configuration>
        <property>
        <name>dfs.data.dir</name>
        <value>{{ dndir }}</value>
        </property>
        </configuration>' >> /etc/hadoop/hdfs-site.xml
    - name: Adding fs.default.name property
      shell: |
        echo '<configuration>
        <property>
        <name>fs.default.name</name>
        <value>hdfs://{{ nnip }}:{{ nnport }}</value>
        </property>
        </configuration>' >> /etc/hadoop/core-site.xml
    - name: Checking DataNode process if running already
      shell: "pidof /usr/java/default/bin/java"
      register: x
      failed_when: false
      ignore_errors: yes
    - name: Killing previously running process
      shell: "kill `pidof /usr/java/default/bin/java`"
      when: x.rc == 0
    - name: Starting DataNode
      command: "hadoop-daemon.sh start datanode"
Time to apply the playbook
To run the playbook, we first have to create an inventory file and put the IPs of the NameNode and the DataNodes into their respective groups; a sample inventory is shown after the note below.
* Note: It is not recommended to use the root user.
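A minimal sketch of such an inventory (the IP addresses, user name and privilege-escalation settings are placeholders; the group names must match the hosts: lines of the two plays):

[namenode]
192.168.1.10   ansible_user=devops   ansible_become=yes

[datanodes]
192.168.1.11   ansible_user=devops   ansible_become=yes
192.168.1.12   ansible_user=devops   ansible_become=yes

Because the DataNode play derives nnip from groups.namenode[0], list the NameNode by its IP address rather than an alias, so that fs.default.name resolves correctly on the DataNodes.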
It is always good practice to check the connection to the managed nodes before running any Ansible commands, and after that we can finally run our playbook.
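Assuming the inventory above is saved as inventory.txt and both plays live in a file named hadoop-cluster.yml (both file names are placeholders), the connectivity check and the run look like this:

# Verify that every managed node is reachable
ansible all -i inventory.txt -m ping

# Apply both plays (the NameNode play runs first, then the DataNodes play)
ansible-playbook -i inventory.txt hadoop-cluster.yml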
The snapshot below shows the playbook running.
Finally, we can use the Hadoop health site (the NameNode web UI) to see the connected nodes as well as their last contact time.
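The same report is also available from the command line on the NameNode; in Hadoop 1.x the dfsadmin -report subcommand prints the cluster capacity plus a per-DataNode section that includes the last contact time (the NameNode web UI itself listens on port 50070 by default).

# Run on the NameNode to list connected DataNodes and their last contact
hadoop dfsadmin -report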
So that's a start after so many days of getting back to work; let us continue and keep on learning day by day.