Configuration of HDFS Cluster with Ansible

Hello Connections,

Here is the documentation to configure the HDFS Cluster with Ansible.

This article documents how to:

  1. Create a role to configure the NameNode in an HDFS cluster
  2. Create a role to configure the DataNodes in an HDFS cluster

Hadoop:


Hadoop is a framework for building distributed storage and distributed compute clusters, and it addresses a major challenge of Big Data. In a Hadoop cluster, the master is called the NameNode, and the nodes that provide the storage are called DataNodes.

To know more about the Big Data challenge and Hadoop, read this article:

Let's say we have a single file of 300 petabytes. Performing I/O on such a huge file takes a lot of time, it is hard to manage, and we can't store it in a single location.

With an HDFS cluster, the 300 petabytes of data can be split into blocks and stored across different DataNodes in parallel, increasing I/O performance. The small blocks created from the 300-petabyte file are distributed across the nodes in the cluster.
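To get a feel for the scale (a rough back-of-the-envelope figure, assuming the Hadoop 1.x default block size of 64 MB): 300 PB ÷ 64 MB ≈ 5 × 10^9, so a 300-petabyte file would be split into roughly five billion blocks, which HDFS spreads across the DataNodes in the cluster.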

Configuring Hadoop Manually:

Let's say we need to configure Hadoop manually:

  1. Install Java and Hadoop
  2. Configure the core-site.xml and hdfs-site.xml files in /etc/hadoop/
  3. Start the NameNode or DataNode service

It's quite easy if we are configuring a single node. But imagine you have 500 nodes to configure as Hadoop DataNodes: is it practical to configure each node by running the commands manually, one by one? This is where a tool called Ansible makes our life easy.


Ansible is an open-source automation and configuration management tool. Its intelligence comes from modules written in Python, and its configuration files are written in YAML.


Ansible Workspace:


The workspace has:

1. Two roles: one to configure the Hadoop NameNode and another to configure the DataNode

2. The main.yml file to launch instances and apply the roles (a minimal sketch is shown after the note below)

3. The public_ip file, which stores the public IP of the NameNode

4. The secure.yml file, a vault to store secure variables

  • To know more about dynamic inventory, initializing roles, and launching instances in AWS with Ansible, check those blogs. I am assuming you have that knowledge and am documenting from this point onward.
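As a point of reference, the top-level main.yml simply applies the two roles to the right host groups. This is only a minimal sketch, assuming the inventory group names namenode and datanode; the play that launches the EC2 instances, covered in the earlier blogs, is omitted here:

  # main.yml - minimal sketch; "namenode" and "datanode" are assumed
  # inventory group names, and the instance-launch play is omitted
  - hosts: namenode
    become: true
    roles:
      - NameNode

  - hosts: datanode
    become: true
    roles:
      - DataNode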
NameNode Role
  • The directory structure of the NameNode role
├── NameNode
│   ├── defaults
│   │   └── main.yml
│   ├── files
│   ├── handlers
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── README.md
│   ├── tasks
│   │   ├── back.md
│   │   └── main.yml
│   ├── templates
│   ├── tests
│   │   ├── inventory
│   │   └── test.yml
│   └── vars
│       ├── main.yml
│       └── vault.yml

  • NameNode/tasks/main.yml is the playbook of the main tasks to run on the managed node.
  • Include the vault present in vars/vault.yml in the playbook; its variables are needed to download the packages from S3.
  - include_vars: vault.yml
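For reference, vars/vault.yml holds the AWS credentials used by the aws_s3 tasks below and is encrypted with ansible-vault (which is why the playbook run later uses --ask-vault-pass). A minimal sketch of its decrypted content, with placeholder values only:

  # vars/vault.yml (decrypted view) - the values below are placeholders, not real keys
  access_key: "AKIAXXXXXXXXXXXXXXXX"
  secret_key: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"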

Download the Java and Hadoop packages. Here I am going to download them from my S3 bucket. For that, the managed node needs the boto and boto3 Python modules.

  • Install python-pip, boto, and boto3
  - name: Install pip
    package:
      name: python-pip


  - name: Install boto3
    pip:
      name:
      - boto
      - boto3

  • Download Hadoop and Java

I am going to download the packages from the S3 bucket I created earlier; it's common to store custom packages in S3. I created a vault in the vars directory and included it in the tasks above. You can customize this part. The aws_s3 module with mode: get downloads the objects.

  - name: Download Hadoop
    aws_s3:
      bucket: hadoop-software-ansible
      object: /hadoop-1.2.1.rpm
      dest: /home/ec2-user/hadoop-1.2.1.rpm
      mode: get
      aws_access_key: "{{ access_key }}"
      aws_secret_key: "{{ secret_key }}"
      region: "ap-south-1"


  - name: Download Dependencies - Java
    aws_s3:
      bucket: hadoop-software-ansible
      object: /jdk-8u171.rpm
      dest: /home/ec2-user/jdk-8u171.rpm
      mode: get
      aws_access_key: "{{ access_key }}"
      aws_secret_key: "{{ secret_key }}"
      region: "ap-south-1"

  • Install Java and Hadoop

The package or yum module can be used to install packages; we just specify the name and whether to install or remove via the state. To install hadoop-1.2.1 we have to use rpm with --force.

  - name: Installing Hadoop Dependencies
    yum:
      name:  /home/ec2-user/jdk-8u171.rpm
      state: present

  - name: Installing Hadoop
    shell: rpm -i  /home/ec2-user/hadoop-1.2.1.rpm --force

  • Create Directory

We can create a directory with the file module; with state we specify whether it is a file or a directory. The result is registered in the variable namenode_dir_created for a later check.

  - name: Create Directory
    file:
      path: /namenode
      state: directory
    register: namenode_dir_created

  • Ensure the core-site configuration: we have to check the existing configuration before writing to /etc/hadoop/core-site.xml. Here, the second task has an interesting condition: the configuration is added only when it is not already present.

"'word_to_search' in file". I felt the power of Ansible here because of the in keyword.

  - name: Ensure core-site configuration
    shell: cat /etc/hadoop/core-site.xml
    register: core_site_file


  - name: Configure core-site.xml
    lineinfile:
      path: /etc/hadoop/core-site.xml
      insertafter: '^<configuration>'
      line: "<property>\n<name>fs.default.name</name>\n<value>hdfs://0.0.0.0:9001</value>\n</property>"
    notify: Refresh
    when: "'<property>\n<name>fs.default.name</name>\n<value>hdfs://0.0.0.0:9001</value>\n</property>' not in core_site_file.stdout"

  • Ensure hdfs-site file configuration,
  - name: Ensure hdfs-site configuration
    shell: cat /etc/hadoop/hdfs-site.xml
    register: hdfs_site_file


  - name: Configuring hdfs-site.xml
    lineinfile:
      path: /etc/hadoop/hdfs-site.xml
      insertafter: '^<configuration>'
      line: "<property>\n<name>dfs.name.dir</name>\n<value>/namenode</value>\n</property>"
    notify: Refresh
    when: "'<property>\n<name>dfs.name.dir</name>\n<value>/namenode</value>\n</property>' not in hdfs_site_file.stdout"

  • Format the directory we created before. The Hadoop format command needs interactive input to complete, so we take the help of a shell pipe to achieve the task. The directory is formatted only when it was newly created.
  - name: Format Name Node
    shell: echo Y | hadoop namenode -format
    when: namenode_dir_created.changed == true

  • Start Hadoop. Before that, we have to check whether the NameNode is already running.
  - name: Ensure Hadoop is running
    shell: jps
    become: true
    register: hadoop_namenode_state


  - name: Start Service
    shell: hadoop-daemon.sh start namenode
    when: "'NameNode' not in hadoop_namenode_state.stdout"

  • Save the public IP of the NameNode to a file in the Ansible directory so that we can use that IP in the core-site.xml configuration of the DataNodes.
  - name: Save Public IP
    shell:  curl ifconfig.me
    register: public_ip_namenode


  - name: Copy Public IP in file
    copy:
      content: "{{ public_ip_namenode.stdout }}"
      dest: public_ip
    delegate_to: localhost

---
DataNode Role
  • The configuration of the DataNode is mostly the same as that of the NameNode; we just need to put the public IP of the NameNode into the core-site.xml configuration of the DataNode. The remaining steps of the role are sketched after the tasks below.

  - name: Public IP of Master
    shell: cat public_ip
    register: public_ip_namenode
    delegate_to: localhost


  - name: Ensure core-site configuration
    shell: cat /etc/hadoop/core-site.xml
    register: core_site_file


  - name: Configure core-site.xml
    lineinfile:
      path: /etc/hadoop/core-site.xml
      insertafter: '^<configuration>'
      line: "<property>\n<name>fs.default.name</name>\n<value>hdfs://{{ public_ip_namdenode.stdout }}:9001</value>\n</property>"
    notify: Refresh
    when: "'<property>\n<name>fs.default.name</name>\n' not in core_site_file.stdout"

Finally, Let's Run the Code

Run the playbook:

ansible-playbook  main.yml  --ask-vault-pass

Everything looks cool. Let's check the cluster.

Name Node configuration


Data Node Configuration:


dfsadmin -report: there are two active nodes


jps commands:


GitHub Link:


Thank you for reading. Please drop a message if you have any questions about this article. Happy to help!
