Configuration of HDFS Cluster with Ansible
Hello Connections,
Here is the documentation to configure an HDFS cluster with Ansible.
This article documents how to:
- Create a role to configure the NameNode in an HDFS cluster
- Create a role to configure the DataNodes in an HDFS cluster
Hadoop:
Hadoop is a framework for building distributed storage and distributed compute clusters, and it addresses a major challenge of Big Data. A Hadoop cluster has a master called the NameNode, while the nodes that contribute storage to the cluster are called DataNodes.
To know more about the Big Data challenge and Hadoop, read this article;
Let's say we have a single file of 300 petabytes. Performing I/O on such a huge file takes a lot of time, it is hard to manage, and it cannot be stored in a single location.
With an HDFS cluster, the 300 petabytes of data can be split into blocks that are written to different DataNodes in parallel, improving I/O performance. The small blocks cut from the 300-petabyte file are stored on different nodes in the cluster.
Configuring Hadoop manually
Let's say we need to configure Hadoop manually:
- Install Java and Hadoop
- Configure the core-site and hdfs-site files in /etc/hadoop/
- Start the NameNode or DataNode
That is quite easy to do once, right? But imagine you have 500 nodes to configure as Hadoop DataNodes: is it practical to configure each node by typing the commands manually, one by one? This is where a tool called Ansible makes our life easy.
Ansible is an open-source automation and configuration management tool. Its intelligence comes from its modules, which are written in Python, and its configuration files are written in YAML.
Ansible Workspace:
The workspace has:
1. Two roles: one to configure the Hadoop NameNode and another to configure the DataNodes
2. The main.yml file to launch instances and run the roles
3. The public_ip file, which stores the public IP of the NameNode
4. The secure.yml file, a vault that stores secure variables
- To know more about dynamic inventory, initializing a role, and launching instances in AWS with Ansible, check those blogs; I am assuming you already have that knowledge and am documenting from there onwards.
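Since the article does not show main.yml itself, here is a minimal sketch of the role-invocation part of it; the host group names namenode and datanode are assumptions, and the plays that launch the EC2 instances are omitted:

# main.yml - minimal sketch of the role invocations (host group names assumed; instance-launch plays omitted)
- name: Configure the Hadoop NameNode
  hosts: namenode
  become: true
  roles:
    - NameNode

- name: Configure the Hadoop DataNodes
  hosts: datanode
  become: true
  roles:
    - DataNode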
NameNode Role
- The directory structure of the NameNode role
├── NameNode
│   ├── defaults
│   │   └── main.yml
│   ├── files
│   ├── handlers
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── README.md
│   ├── tasks
│   │   ├── back.md
│   │   └── main.yml
│   ├── templates
│   ├── tests
│   │   ├── inventory
│   │   └── test.yml
│   └── vars
│       ├── main.yml
│       └── vault.yml
- NameNode/tasks/main.yml is the playbook with the main tasks that configure the managed node.
- Include the vault in vars/vault.yml in the playbook, so the credentials needed to download the packages from S3 are available
- include_vars: vault.yml
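The vault simply holds the AWS credentials that later tasks reference as access_key and secret_key. A minimal sketch of vars/vault.yml, with placeholder values, could look like this (encrypt it afterwards with ansible-vault encrypt vars/vault.yml):

# vars/vault.yml - sketch with placeholder values (assumed content)
access_key: "AKIAXXXXXXXXXXXXXXXX"
secret_key: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"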
Download the Java and Hadoop packages. Here I am going to download them from my S3 bucket; for that, the managed node needs the boto and boto3 Python modules.
- Install python-pip, boto, and boto3
- name: Install pip
  package:
    name: python-pip
    state: present

- name: Install boto3
  pip:
    name:
      - boto
      - boto3
- Download Hadoop and Java
I am going to download the packages from an S3 bucket I created earlier; it's common to store custom packages in S3. The vault created in the vars directory was included earlier in the tasks, and this part you can customize. The aws_s3 module downloads the objects.
- name: Download Hadoop
  aws_s3:
    bucket: hadoop-software-ansible
    object: /hadoop-1.2.1.rpm
    dest: /home/ec2-user/hadoop-1.2.1.rpm
    mode: get
    aws_access_key: "{{ access_key }}"
    aws_secret_key: "{{ secret_key }}"
    region: "ap-south-1"

- name: Download Dependencies - Java
  aws_s3:
    bucket: hadoop-software-ansible
    object: /jdk-8u171.rpm
    dest: /home/ec2-user/jdk-8u171.rpm
    mode: get
    aws_access_key: "{{ access_key }}"
    aws_secret_key: "{{ secret_key }}"
    region: "ap-south-1"
- Install Java and Hadoop
The package or yum module can be used to install packages; we just need to specify the name and, with state, whether to install or remove. To install hadoop-1.2.1 we have to use rpm with --force.
- name: Installing Hadoop Dependencies
  yum:
    name: /home/ec2-user/jdk-8u171.rpm
    state: present

- name: Installing Hadoop
  shell: rpm -i /home/ec2-user/hadoop-1.2.1.rpm --force
- Create Directory
We can create a directory with the file module; with state we specify whether it is a file or a directory. The result is registered in the variable namenode_dir_created for a later check.
- name: Create Directory
  file:
    path: /namenode
    state: directory
  register: namenode_dir_created
- Ensure the core-site configuration. We have to check the existing configuration before we write to /etc/hadoop/core-site.xml. Here the second task has an interesting condition: the configuration is written only when it is not already present.
"'word_to_search' in file". I felt the power of Ansible here because of the in keyword.
- name: Ensure core-site configuration
  shell: cat /etc/hadoop/core-site.xml
  register: core_site_file

- name: Configure core-site.xml
  lineinfile:
    path: /etc/hadoop/core-site.xml
    insertafter: '^<configuration>'
    line: "<property>\n<name>fs.default.name</name>\n<value>hdfs://0.0.0.0:9001</value>\n</property>"
  notify: Refresh
  when: "'<property>\n<name>fs.default.name</name>\n<value>hdfs://0.0.0.0:9001</value>\n</property>' not in core_site_file.stdout"
- Ensure the hdfs-site file configuration:
- name: Ensure hdfs-site configuration
  shell: cat /etc/hadoop/hdfs-site.xml
  register: hdfs_site_file

- name: Configuring hdfs-site.xml
  lineinfile:
    path: /etc/hadoop/hdfs-site.xml
    insertafter: '^<configuration>'
    line: "<property>\n<name>dfs.name.dir</name>\n<value>/namenode</value>\n</property>"
  notify: Refresh
  when: "'<property>\n<name>dfs.name.dir</name>\n<value>/namenode</value>\n</property>' not in hdfs_site_file.stdout"
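Both tasks above notify a handler named Refresh, which lives in NameNode/handlers/main.yml but is not shown in the article. A minimal sketch, assuming the handler simply restarts the NameNode daemon, could look like this:

# handlers/main.yml - minimal sketch (assumed content): restart the NameNode daemon when notified
- name: Refresh
  shell: hadoop-daemon.sh stop namenode; hadoop-daemon.sh start namenode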
- Format the directory we created before. The Hadoop format command needs interactive input to complete, so we take the help of the shell (piping an echo) to achieve the task. Only a newly created directory is formatted here.
- name: Format Name Node
  shell: echo Y | hadoop namenode -format
  when: namenode_dir_created.changed == true
- Start Hadoop. Before that, we have to ensure Hadoop is not already running:
- name: Ensure Hadoop is running
  shell: jps
  become: true
  register: hadoop_namenode_state

- name: Start Service
  shell: hadoop-daemon.sh start namenode
  when: "'NameNode' not in hadoop_namenode_state.stdout"
- Save the public IP of the Name Node to a file in the Ansible directory so that we can use that IP in the Data Node's core-site.xml configuration
- name: Save Public IP
  shell: curl ifconfig.me
  register: public_ip_namenode

- name: Copy Public IP in file
  copy:
    content: "{{ public_ip_namenode.stdout }}"
    dest: public_ip
  delegate_to: localhost
---
DataNode
- The configuration of the Data Node is mostly the same as the Name Node's; we just need to add the public IP of the Name Node to the core-site.xml configuration of the Data Node. The remaining Data Node-specific tasks are sketched after the snippet below.
- name: Public IP of Master
  shell: cat public_ip
  register: public_ip_namenode
  delegate_to: localhost

- name: Ensure core-site configuration
  shell: cat /etc/hadoop/core-site.xml
  register: core_site_file

- name: Configure core-site.xml
  lineinfile:
    path: /etc/hadoop/core-site.xml
    insertafter: '^<configuration>'
    line: "<property>\n<name>fs.default.name</name>\n<value>hdfs://{{ public_ip_namenode.stdout }}:9001</value>\n</property>"
  notify: Refresh
  when: "'<property>\n<name>fs.default.name</name>\n' not in core_site_file.stdout"
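The rest of the role mirrors the NameNode role: install the packages, create a data directory, point dfs.data.dir at it, and start the DataNode daemon. The article does not list those tasks, so this is only a minimal sketch; the /datanode path and the task names are assumptions:

# Sketch of the remaining DataNode-specific tasks (the /datanode path and task names are assumed)
- name: Create Data Directory
  file:
    path: /datanode
    state: directory

- name: Configuring hdfs-site.xml
  lineinfile:
    path: /etc/hadoop/hdfs-site.xml
    insertafter: '^<configuration>'
    line: "<property>\n<name>dfs.data.dir</name>\n<value>/datanode</value>\n</property>"

- name: Check running daemons
  shell: jps
  register: hadoop_datanode_state

- name: Start Service
  shell: hadoop-daemon.sh start datanode
  when: "'DataNode' not in hadoop_datanode_state.stdout"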
Finally, Let's Run the Code
Run the playbook:
ansible-playbook main.yml --ask-vault-pass
Everything looks cool. Let's check the cluster.
- Name Node configuration
- Data Node configuration
- hadoop dfsadmin -report: there are two active nodes
- jps output on both nodes
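If you would rather verify from the Ansible controller instead of logging into the instances, a small check play can run the same command; this is a minimal sketch, assuming the namenode host group used earlier:

# check.yml - minimal sketch of a verification play (file name and host group are assumptions)
- name: Verify the HDFS cluster
  hosts: namenode
  tasks:
    - name: Collect the cluster report
      shell: hadoop dfsadmin -report
      register: report

    - name: Show the report
      debug:
        var: report.stdout_lines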
GitHub Link: https://github.com/kethavathsivanaik/AnsibleConf-HDFSCluster
Thank you for reading. Please drop a message if you have any questions about this article. Happy to help!