Configuring Hive with HDFS & MapReduce Cluster backend
Hello readers, I hope you are all doing great. Now that you are here, let's get started.
The focus of this article is to walk you through an automation script written in Terraform, which excels at provisioning cloud resources as infrastructure as code, and to use that code to launch Apache Hive on the AWS cloud.
So let's first go through some basic terminology and a brief introduction to every technology we are going to use.
Apache Hadoop
Hadoop is open-source software that aims to solve the Big Data problems faced by large enterprises.
Hadoop brings many advantages to the Big Data world, which is characterized by the 3 Vs: Volume, Velocity, and Variety. But the main reason we need this technology is that there is a practical limit to how large a single hard disk can be.
Companies like Facebook produce data at a huge scale (think files of about 500 GB in size), and a hard disk, once attached, can't simply be removed, since users should be able to retrieve their data at any time. This is why we need software like Hadoop.
Apache Hive
Hive is open-source software created to perform CRUD-style operations on a Hadoop cluster. Hive works like any other SQL-based database, such as MySQL, so data analysts who are already comfortable with databases don't need to learn a new language to work on Hadoop.
In practice, Hadoop is used to provision a Data Lake: a company typically puts every relevant document into the cluster. Now, suppose we want to analyze one particular file that may be gigabytes in size.
It would be a waste of time and resources to first download such a huge file and only then analyze it. It is always better to do the analysis in the same location where the data is stored, and this is what gave rise to Hive.
Terraform
Terraform is also open-source: an Infrastructure as Code tool from HashiCorp that can provision almost anything you need across many cloud providers, AWS being the one we are going to use today.
The best part of Terraform is that you don't have to keep worrying about the state of your infrastructure; once the code is written, Terraform refreshes its state and reconciles the real infrastructure with it, which is also what makes it idempotent.
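For reference, the usual Terraform workflow we will rely on throughout this article looks like this (generic commands, not specific to the code below):

# Download the provider plugins declared in the configuration
terraform init
# Preview what would be created, changed, or destroyed
terraform plan
# Create or update the real infrastructure to match the code
terraform apply
# Tear everything down when you are done
terraform destroy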
Why do we need to set up Hadoop to get Hive?
This is exactly why Hive was developed in the first place. The main need was to use Hadoop as a Data Lake, but it turned out that the compute of the MapReduce cluster and the storage of the HDFS cluster could be combined with a SQL-style DBMS interface, so that data analysis could be performed in a user-friendly way.
SQL and NoSQL databases were already popular before Hadoop was launched, and this use case is what gave rise to Hive.
Behind the scenes, Hive actually runs the equivalent Hadoop commands when we perform operations such as creating a database, modifying it, or querying it.
Now that we have the basic idea behind the setup, we are ready to start building it, so let's do this.
1) Starting with setting up AWS as the cloud provider and defining some variables
#Setting up aws as provider
provider "aws" {
  region  = "ap-south-1"
  profile = "default"
}

#Defining Variables
variable "vpc_id" {
  type = string
}

variable "aws_key_name" {
  type = string
}
The variables vpc_id and aws_key_name will be prompted for when the terraform apply command is executed, because we need to provide the key pair name and the ID of the VPC where the setup will be launched.
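If you prefer not to type the values at the prompt, Terraform also accepts them on the command line or from a terraform.tfvars file. The values below are placeholders, so substitute your own VPC ID and key name:

# Pass the variables inline (placeholder values)
terraform apply -var="vpc_id=vpc-0123456789abcdef0" -var="aws_key_name=mykey"

# Or keep them in terraform.tfvars, which terraform apply loads automatically:
# vpc_id       = "vpc-0123456789abcdef0"
# aws_key_name = "mykey"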
2) Setting up the template scripts for the HDFS NameNode and DataNodes, and the JobTracker and TaskTrackers
Make sure you have all the bash scripts ready in the same folder as your Terraform code. These scripts will be used as instance user data, rendered from templates.
We load all the scripts as templates, where namenode_ip and jobtracker_ip are the two variables I use to render the IP addresses of the NameNode and the JobTracker directly into the templates.
data "template_file" "hdfs_master" { template = "${file("${path.cwd}/hdfs_master.sh")}" } data "template_file" "hdfs_slave" { template = "${file("${path.cwd}/hdfs_slave.sh")}" vars = { namenode_ip = "${aws_instance.hdfs_master.private_ip}" } depends_on = [ aws_instance.hdfs_master ] } data "template_file" "mapred_jobtracker" { template = "${file("${path.cwd}/mapred_jobtracker.sh")}" vars = { namenode_ip = "${aws_instance.hdfs_master.private_ip}" } } data "template_file" "mapred_tasktracker" { template = "${file("${path.cwd}/mapred_tasktracker.sh")}" vars = { jobtracker_ip = "${aws_instance.mapred_jobtracker.private_ip}" } depends_on = [ aws_instance.mapred_jobtracker ] } data "template_file" "hadoop_client" { template = "${file("${path.cwd}/hadoop_client.sh")}" vars = { namenode_ip = "${aws_instance.hdfs_master.private_ip}" jobtracker_ip = "${aws_instance.mapred_jobtracker.private_ip}" } }
Since each of the scripts is similar, I will focus on describing only one of them: the hdfs_slave.sh shell script.
hdfs_slave.sh
#!/usr/bin/bash
instanceip=$(hostname -i)
sudo yum install wget -y
sudo wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" https://download.oracle.com/otn-pub/java/jdk/8u141-b15/336fa29ff2bb4ef291e347e091f7f4a7/jdk-8u141-linux-x64.rpm
sudo yum install jdk-8u141-linux-x64.rpm -y
sudo wget -c https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm
sudo rpm -i --force hadoop-1.2.1-1.x86_64.rpm
sudo rm -rf /dn
sudo mkdir /dn
sudo chmod 677 /etc/hadoop/hdfs-site.xml
sudo cat <<EOF > /etc/hadoop/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
EOF
sudo chmod 677 /etc/hadoop/core-site.xml
sudo cat <<EOF > /etc/hadoop/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${namenode_ip}:9001</value>
  </property>
</configuration>
EOF
if pidof /usr/java/default/bin/java
then
  sudo kill `pidof /usr/java/default/bin/java`
fi
sudo chmod 777 /etc/rc.d/rc.local
sudo cat <<EOF >> /etc/rc.d/rc.local
sudo hadoop-daemon.sh start datanode
EOF
sudo hadoop-daemon.sh start datanode
- In the script above, the shell variable instanceip captures the IP address of the current instance.
- The yum and wget commands download and install the required software (wget, the JDK, and the Hadoop 1.2.1 RPM).
- We also need to create a /nn or /dn folder to serve as the directory for the NameNode or DataNode respectively (a NameNode counterpart of this script is sketched after this list).
- The cat command with a here-document writes out the configuration files hdfs-site.xml and core-site.xml.
- One of the most useful features of Terraform templates is that ${namenode_ip} is a variable that Terraform fills in on the fly when rendering.
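For completeness, here is what the NameNode counterpart (hdfs_master.sh) might look like. This is only a sketch that follows the same pattern as the DataNode script above, not the author's exact file: it uses /nn and dfs.name.dir instead of /dn and dfs.data.dir, formats the NameNode, and starts the namenode daemon. The Java and Hadoop installation steps are identical to hdfs_slave.sh and are summarized in a comment.

#!/usr/bin/bash
# Install the JDK and the Hadoop 1.2.1 RPM exactly as in hdfs_slave.sh (omitted for brevity),
# then configure and start the NameNode.
sudo rm -rf /nn
sudo mkdir /nn
sudo chmod 677 /etc/hadoop/hdfs-site.xml
sudo cat <<EOF > /etc/hadoop/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF
sudo chmod 677 /etc/hadoop/core-site.xml
sudo cat <<EOF > /etc/hadoop/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$(hostname -i):9001</value>
  </property>
</configuration>
EOF
# Format the NameNode non-interactively and start the daemon
echo Y | sudo hadoop namenode -format
sudo hadoop-daemon.sh start namenode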
3) Creating Security Groups for all the nodes of the HDFS and MapReduce clusters
Make sure you allow all the traffic that the Hadoop HDFS and MapReduce clusters need to work properly. Here I have allowed all inbound and outbound traffic so that there won't be any firewall issues.
#Creating Security Groups
resource "aws_security_group" "hdfs_master_sg" {
  name        = "hdfs_master_sg"
  description = "Allow traffic from security group"
  vpc_id      = var.vpc_id

  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  tags = {
    Name = "hdfs_master_sg"
  }
}

resource "aws_security_group" "hdfs_slave_sg" {
  name        = "hdfs_slave_sg"
  description = "Allow traffic from security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  tags = {
    Name = "hdfs_slave_sg"
  }
}

resource "aws_security_group_rule" "hdfs_master_ingress" {
  type              = "ingress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  security_group_id = "${aws_security_group.hdfs_master_sg.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  ipv6_cidr_blocks  = ["::/0"]
}
For the sake of convenience, I have used the full CIDR range (0.0.0.0/0) so that no traffic is blocked.
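In a production setup you would normally narrow this down to the ports Hadoop actually uses. As one illustrative example (reusing the security groups above and port 9001, which the scripts configure as the NameNode RPC port), traffic can be limited to the cluster's own security groups instead of the whole internet; note that a real cluster also needs the DataNode and web UI ports opened in a similar way:

# Allow only the HDFS slaves' security group to reach the NameNode RPC port (9001)
resource "aws_security_group_rule" "hdfs_master_from_slaves" {
  type                     = "ingress"
  from_port                = 9001
  to_port                  = 9001
  protocol                 = "tcp"
  security_group_id        = aws_security_group.hdfs_master_sg.id
  source_security_group_id = aws_security_group.hdfs_slave_sg.id
}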
4) Launching the NameNode and JobTracker along with the desired number of DataNodes and TaskTrackers
At this stage we attach a different user_data script and a different Name tag to each of the instances on which the scripts need to run.
Launching the NameNode with the hdfs_master.sh script
#Launching HDFS Master Instance
resource "aws_instance" "hdfs_master" {
  ami             = "ami-052c08d70def0ac62"
  instance_type   = "t2.micro"
  security_groups = ["hdfs_master_sg"]
  key_name        = "${var.aws_key_name}"
  user_data       = "${data.template_file.hdfs_master.template}"

  tags = {
    Name = "hdfs-master"
  }
}
Launching the DataNodes with the hdfs_slave.sh script
#Launching HDFS Slaves
resource "aws_instance" "hdfs_slave1" {
  ami                         = "ami-052c08d70def0ac62"
  instance_type               = "t2.micro"
  security_groups             = ["hdfs_slave_sg"]
  key_name                    = "${var.aws_key_name}"
  associate_public_ip_address = false
  user_data                   = "${data.template_file.hdfs_slave.rendered}"

  tags = {
    Name = "hdfs-slave1"
  }

  depends_on = [aws_instance.hdfs_master]
}
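The block above launches a single slave (the testing section later shows three DataNodes, so the full code presumably repeats this resource for hdfs_slave2 and hdfs_slave3). Alternatively, Terraform's count meta-argument can launch the desired number of identical DataNodes from one block; a sketch, with 3 used as an example count:

resource "aws_instance" "hdfs_slave" {
  count                       = 3
  ami                         = "ami-052c08d70def0ac62"
  instance_type               = "t2.micro"
  security_groups             = ["hdfs_slave_sg"]
  key_name                    = var.aws_key_name
  associate_public_ip_address = false
  user_data                   = data.template_file.hdfs_slave.rendered

  tags = {
    Name = "hdfs-slave${count.index + 1}"
  }

  depends_on = [aws_instance.hdfs_master]
}

The same pattern works for the TaskTracker instances below.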
Launching the JobTracker with the mapred_jobtracker.sh script
#Launching MapReduce master
resource "aws_instance" "mapred_jobtracker" {
  ami             = "ami-052c08d70def0ac62"
  instance_type   = "t2.micro"
  security_groups = ["mapred_jobtracker_sg"]
  key_name        = "${var.aws_key_name}"
  user_data       = "${data.template_file.mapred_jobtracker.rendered}"

  tags = {
    Name = "mapred_jobtracker"
  }
}
Launching the TaskTrackers with the mapred_tasktracker.sh script
#Launching MapReduce slaves
resource "aws_instance" "mapred_tasktracker1" {
  ami                         = "ami-052c08d70def0ac62"
  instance_type               = "t2.micro"
  security_groups             = ["mapred_tasktracker_sg"]
  key_name                    = "${var.aws_key_name}"
  associate_public_ip_address = false
  user_data                   = "${data.template_file.mapred_tasktracker.rendered}"

  tags = {
    Name = "mapred_tasktracker1"
  }

  depends_on = [aws_instance.mapred_jobtracker]
}
5) Printing the IP addresses of the required instances as outputs
You can print as many outputs as you need; I have printed only the ones that are useful for the Hive cluster.
#Creating Output
output "hdfs_master_public_IP" {
  value = aws_instance.hdfs_master.public_ip
}

output "hdfs_master_private_IP" {
  value = aws_instance.hdfs_master.private_ip
}

output "mapred_jobtracker_public_IP" {
  value = aws_instance.mapred_jobtracker.public_ip
}

output "mapred_jobtracker_private_IP" {
  value = aws_instance.mapred_jobtracker.private_ip
}

output "Hadooop_client_public_IP" {
  value = aws_instance.hadoop_client.public_ip
}
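After an apply, these values can be read back at any time with the terraform output command, for example:

# Print all outputs
terraform output
# Print a single value, e.g. the NameNode's public IP
terraform output hdfs_master_public_IP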
Execution:
Finally, it's time to execute the code, and for this we run the following command:
terraform apply
Here you have to provide the name of the AWS key pair you want to associate with all these instances, as well as the ID of the VPC where the setup should be launched.
Then, at the confirmation prompt, type yes to approve the plan.
Once the code runs successfully, the output will look something like the snapshot below, with a green line showing how many resources were added.
Great! With just a single command, the whole setup is ready almost before you realize it. So let's check whether things are working or not.
Testing
To check everything, I logged into the client node. Here we can see that three instances are configured as DataNodes under their NameNode.
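The screenshot is not reproduced here, but one way to confirm this from the client node (using the Hadoop 1.2.1 CLI installed by the scripts above) is the dfsadmin report, which lists the live DataNodes:

hadoop dfsadmin -report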
The following picture shows that there are three nodes available as TaskTrackers working under their JobTracker.
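Similarly, the active TaskTrackers can be listed from the command line (again assuming the Hadoop 1.2.1 client configured above):

hadoop job -list-active-trackers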
Woohoo! At least the NameNode and DataNodes are configured now. Okay, let's move on to the remaining details.
Run the following command to check whether the hive command is working or not:
bash
hive
** Note: It's important to use the bash command to start a new shell before calling hive, since the previously used shell might not have the HIVE_HOME environment variable and the updated PATH loaded.
Let's check what exactly the hive command does.
In the picture above, the following command is used to check what is present in the Hadoop HDFS cluster that is running behind the scenes.
hadoop fs -ls /
Let's try to create a database in Hive and find out what happens in the HDFS cluster.
The following command proves that a database named ssjnm has been created in the cluster.
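The screenshots are not reproduced here, but inside the hive shell the commands look roughly like this (database name as used in the article):

CREATE DATABASE ssjnm;
SHOW DATABASES;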
Then let's check what effect this has on the HDFS cluster.
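Assuming Hive's default warehouse location of /user/hive/warehouse (your installation may differ), the new database appears in HDFS as a ssjnm.db directory, which can be checked with:

hadoop fs -ls /user/hive/warehouse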
Here's the final conclusion: as soon as we create a database in Hive, that database is actually created in the HDFS cluster, and everything we add to it keeps getting added to the HDFS cluster as well.
Great! The result we wanted to achieve, i.e. performing analysis directly on the Hadoop cluster, has been accomplished.
Hope you liked the setup. Thanks to all the tech enthusiasts who patiently read this.