Configuration of HDFS Cluster with Ansible
Hello Connections,
Here is the documentation to configure an HDFS cluster with Ansible.
This article documents how to:
- Create a role to configure the NameNode in an HDFS cluster
- Create a role to configure the DataNodes in an HDFS cluster
Hadoop:
Hadoop is a framework for building distributed storage and distributed compute clusters, and it addresses a major challenge of Big Data. A Hadoop cluster has a master called the NameNode, while the nodes that contribute storage to the cluster are called DataNodes.
To know more about the Big Data challenge and Hadoop, read this article;
Let's say we have a single file of 300 petabytes. Performing I/O on such a huge file takes a lot of time, it is hard to manage, and it cannot be stored in a single location.
With an HDFS cluster, the 300 petabytes of data can be split into blocks that are written to different DataNodes in parallel, improving I/O performance. The small blocks cut from the 300-petabyte file are stored on different nodes in the cluster.
Configuring Hadoop manually
Let's say we need to configure Hadoop manually:
- Install Java and Hadoop
- Configure the core-site and hdfs-site files in /etc/hadoop/
- Start the NameNode or DataNode
That is quite easy to do once, right? But imagine you have 500 nodes to configure as Hadoop DataNodes: is it practical to configure each node by typing the commands manually, one by one? This is where a tool called Ansible makes our life easy.
Ansible is an open-source automation and configuration management tool. Its intelligence comes from its modules, which are written in Python, and its configuration files are written in YAML.
Ansible Workspace:
The workspace has:
1. Two roles: one to configure the Hadoop NameNode and another to configure the DataNodes
2. The main.yml file to launch instances and run the roles
3. The public_ip file, which stores the public IP of the NameNode
4. The secure.yml file, a vault that stores secure variables
- To know more about dynamic inventory, initializing a role, and launching instances in AWS with Ansible, check those blogs; I am assuming you already have that knowledge and am documenting from there onwards.
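Since the article does not show main.yml itself, here is a minimal sketch of the role-invocation part of it; the host group names namenode and datanode are assumptions, and the plays that launch the EC2 instances are omitted:

# main.yml - minimal sketch of the role invocations (host group names assumed; instance-launch plays omitted)
- name: Configure the Hadoop NameNode
  hosts: namenode
  become: true
  roles:
    - NameNode

- name: Configure the Hadoop DataNodes
  hosts: datanode
  become: true
  roles:
    - DataNode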
NameNode Role
- The directory structure of the NameNode role
├── NameNode
│   ├── defaults
│   │   └── main.yml
│   ├── files
│   ├── handlers
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── README.md
│   ├── tasks
│   │   ├── back.md
│   │   └── main.yml
│   ├── templates
│   ├── tests
│   │   ├── inventory
│   │   └── test.yml
│   └── vars
│       ├── main.yml
│       └── vault.yml
- NameNode/tasks/main.yml is the playbook with the main tasks that configure the managed node.
- Include the vault in vars/vault.yml in the playbook, so the credentials needed to download the packages from S3 are available
- include_vars: vault.yml
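The vault simply holds the AWS credentials that later tasks reference as access_key and secret_key. A minimal sketch of vars/vault.yml, with placeholder values, could look like this (encrypt it afterwards with ansible-vault encrypt vars/vault.yml):

# vars/vault.yml - sketch with placeholder values (assumed content)
access_key: "AKIAXXXXXXXXXXXXXXXX"
secret_key: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"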
Download the Java and Hadoop packages. Here I am going to download them from my S3 bucket; for that, the managed node needs the boto and boto3 Python modules.
- Install python-pip, boto, and boto3
- name: Install pip
  package:
    name: python-pip
    state: present

- name: Install boto3
  pip:
    name:
      - boto
      - boto3
- Download Hadoop and Java
I am going to download the packages from an S3 bucket I created earlier; it's common to store custom packages in S3. The vault created in the vars directory was included earlier in the tasks, and this part you can customize. The aws_s3 module downloads the objects.
- name: Download Hadoop
  aws_s3:
    bucket: hadoop-software-ansible
    object: /hadoop-1.2.1.rpm
    dest: /home/ec2-user/hadoop-1.2.1.rpm
    mode: get
    aws_access_key: "{{ access_key }}"
    aws_secret_key: "{{ secret_key }}"
    region: "ap-south-1"

- name: Download Dependencies - Java
  aws_s3:
    bucket: hadoop-software-ansible
    object: /jdk-8u171.rpm
    dest: /home/ec2-user/jdk-8u171.rpm
    mode: get
    aws_access_key: "{{ access_key }}"
    aws_secret_key: "{{ secret_key }}"
    region: "ap-south-1"
- Install Java and Hadoop
The package or yum module can be used to install packages; we just need to specify the name and, with state, whether to install or remove. To install hadoop-1.2.1 we have to use rpm with --force.
- name: Installing Hadoop Dependencies
  yum:
    name: /home/ec2-user/jdk-8u171.rpm
    state: present

- name: Installing Hadoop
  shell: rpm -i /home/ec2-user/hadoop-1.2.1.rpm --force
- Create Directory
We can create a directory with the file module; with state we specify whether it is a file or a directory. The result is registered in the variable namenode_dir_created for a later check.
- name: Create Directory
  file:
    path: /namenode
    state: directory
  register: namenode_dir_created
- Ensure the core-site configuration. We have to check the existing configuration before we write to /etc/hadoop/core-site.xml. Here the second task has an interesting condition: the configuration is written only when it is not already present.
"'word_to_search' in file". I felt the power of Ansible here because of the in keyword.
- name: Ensure core-site configuration
  shell: cat /etc/hadoop/core-site.xml
  register: core_site_file

- name: Configure core-site.xml
  lineinfile:
    path: /etc/hadoop/core-site.xml
    insertafter: '^<configuration>'
    line: "<property>\n<name>fs.default.name</name>\n<value>hdfs://0.0.0.0:9001</value>\n</property>"
  notify: Refresh
  when: "'<property>\n<name>fs.default.name</name>\n<value>hdfs://0.0.0.0:9001</value>\n</property>' not in core_site_file.stdout"
- Ensure the hdfs-site file configuration:
- name: Ensure hdfs-site configuration
  shell: cat /etc/hadoop/hdfs-site.xml
  register: hdfs_site_file

- name: Configuring hdfs-site.xml
  lineinfile:
    path: /etc/hadoop/hdfs-site.xml
    insertafter: '^<configuration>'
    line: "<property>\n<name>dfs.name.dir</name>\n<value>/namenode</value>\n</property>"
  notify: Refresh
  when: "'<property>\n<name>dfs.name.dir</name>\n<value>/namenode</value>\n</property>' not in hdfs_site_file.stdout"
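Both tasks above notify a handler named Refresh, which lives in NameNode/handlers/main.yml but is not shown in the article. A minimal sketch, assuming the handler simply restarts the NameNode daemon, could look like this:

# handlers/main.yml - minimal sketch (assumed content): restart the NameNode daemon when notified
- name: Refresh
  shell: hadoop-daemon.sh stop namenode; hadoop-daemon.sh start namenode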
- Format the directory we created before. The Hadoop format command needs interactive input to complete, so we take the help of the shell (piping an echo) to achieve the task. Only a newly created directory is formatted here.
- name: Format Name Node
  shell: echo Y | hadoop namenode -format
  when: namenode_dir_created.changed == true
- Start Hadoop. Before that, we have to ensure Hadoop is not already running:
- name: Ensure Hadoop is running
  shell: jps
  become: true
  register: hadoop_namenode_state

- name: Start Service
  shell: hadoop-daemon.sh start namenode
  when: "'NameNode' not in hadoop_namenode_state.stdout"
- Save the public IP of the Name Node to a file in the Ansible directory so that we can use that IP in the Data Node's core-site.xml configuration
- name: Save Public IP
  shell: curl ifconfig.me
  register: public_ip_namenode

- name: Copy Public IP in file
  copy:
    content: "{{ public_ip_namenode.stdout }}"
    dest: public_ip
  delegate_to: localhost
---
DataNode
- The configuration of the Data Node is mostly the same as the Name Node's; we just need to add the public IP of the Name Node to the core-site.xml configuration of the Data Node. The remaining Data Node-specific tasks are sketched after the snippet below.
- name: Public IP of Master
  shell: cat public_ip
  register: public_ip_namenode
  delegate_to: localhost

- name: Ensure core-site configuration
  shell: cat /etc/hadoop/core-site.xml
  register: core_site_file

- name: Configure core-site.xml
  lineinfile:
    path: /etc/hadoop/core-site.xml
    insertafter: '^<configuration>'
    line: "<property>\n<name>fs.default.name</name>\n<value>hdfs://{{ public_ip_namenode.stdout }}:9001</value>\n</property>"
  notify: Refresh
  when: "'<property>\n<name>fs.default.name</name>\n' not in core_site_file.stdout"
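The rest of the role mirrors the NameNode role: install the packages, create a data directory, point dfs.data.dir at it, and start the DataNode daemon. The article does not list those tasks, so this is only a minimal sketch; the /datanode path and the task names are assumptions:

# Sketch of the remaining DataNode-specific tasks (the /datanode path and task names are assumed)
- name: Create Data Directory
  file:
    path: /datanode
    state: directory

- name: Configuring hdfs-site.xml
  lineinfile:
    path: /etc/hadoop/hdfs-site.xml
    insertafter: '^<configuration>'
    line: "<property>\n<name>dfs.data.dir</name>\n<value>/datanode</value>\n</property>"

- name: Check running daemons
  shell: jps
  register: hadoop_datanode_state

- name: Start Service
  shell: hadoop-daemon.sh start datanode
  when: "'DataNode' not in hadoop_datanode_state.stdout"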
Finally, Let's Run the Code
Run the playbook:
ansible-playbook main.yml --ask-vault-pass
Everything looks cool. Let's check the cluster.
- Name Node configuration
- Data Node configuration
- hadoop dfsadmin -report: there are two active nodes
- jps output on both nodes
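If you would rather verify from the Ansible controller instead of logging into the instances, a small check play can run the same command; this is a minimal sketch, assuming the namenode host group used earlier:

# check.yml - minimal sketch of a verification play (file name and host group are assumptions)
- name: Verify the HDFS cluster
  hosts: namenode
  tasks:
    - name: Collect the cluster report
      shell: hadoop dfsadmin -report
      register: report

    - name: Show the report
      debug:
        var: report.stdout_lines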
GitHub Link: https://github.com/kethavathsivanaik/AnsibleConf-HDFSCluster
Thank you for reading. Please drop a message if you have any questions about this article. Happy to help!