Configuring Hive with HDFS & MapReduce Cluster backend

Hello reader, I hope you are all doing great. Now that you are here, let's get started.

The focus of this article is to walk through an automation script written in Terraform, one of the best tools for provisioning cloud resources as Infrastructure as Code, that launches Apache Hive on top of a Hadoop cluster in the AWS cloud.

Before that, let's go through some basic terminology and a brief introduction to every technology we are going to use.

Apache Hadoop

Hadoop is open-source software that aims to solve the Big Data problems faced by large companies.



Hadoop brings many advantages to the Big Data world, usually summed up by the 3Vs: Volume, Velocity and Variety. But the main reason we need this technology is that there is a limit to how large a single hard disk can be built.

For companies like Facebook that produce data on the scale of hundreds of gigabytes and beyond, a hard disk cannot simply be removed once it is attached, because users expect to retrieve their data whenever they want. This is why we need software like Hadoop, which spreads storage and processing across many machines.

Apache Hive

Hive is open-source software created to let us run SQL-style operations on data stored in a Hadoop cluster. Hive's query language (HiveQL) looks like that of any other SQL-based database such as MySQL, so data analysts who are already comfortable with databases do not need to learn a new language to work on Hadoop.
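
For a feel of how familiar it is, here is a tiny illustrative sketch run from the shell of a configured client node; the table and column names (logs, status) are hypothetical:

# HiveQL reads just like everyday SQL (hypothetical table and column names)
hive -e "SELECT status, COUNT(*) AS hits FROM logs GROUP BY status;"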



In practice, Hadoop is used to build a data lake: a company typically puts every relevant piece of data into the cluster. Now suppose we want to analyze one particular file that may be several gigabytes in size.

Downloading such a huge file first and only then analyzing it would waste both time and resources. It is always better to run the analysis in the same place where the data already lives, and that is exactly the need that gave rise to Hive.

Terraform

Terraform is an open-source Infrastructure as Code tool from HashiCorp that can provision almost anything you need across a large number of cloud providers. AWS, which we are going to use today, is one of them.


The best part of Terraform is that you don't have to keep track of the state of your infrastructure yourself: once the code is written, Terraform compares its recorded state with what actually exists and only applies the difference. This is what makes it idempotent.
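
A quick sketch of that idempotence from the command line (the exact wording of the messages varies between Terraform versions):

terraform apply     # first run creates the resources described in the code
terraform plan      # a second look compares the state file with the real infrastructure
# If nothing was changed by hand in the meantime, the plan reports that no changes are needed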

Why do we need to set up Hadoop to get Hive?

This goes back to why Hive was developed in the first place. The main need was to use Hadoop as a data lake, and it turned out that the compute of the MapReduce cluster and the storage of the HDFS cluster could be paired with an SQL-style interface, so that data analysis could be performed in a user-friendly way.

SQL and NoSQL databases were already popular before the launch of Hadoop, and building on that familiarity is what gave rise to Hive.

Behind the scenes, Hive runs the corresponding Hadoop commands and jobs whenever we perform operations such as creating a database, modifying it, querying it and so on.

Now that we have the basic idea behind the setup, we are ready to build it, so let's do this.

1) Starting with setting up AWS as the cloud provider and defining some variables

#Setting up aws as provider
provider "aws" {
  region     = "ap-south-1"
  profile    = "default"   
} 


#Defining Variables
variable "vpc_id" {
  type = string
}


variable "aws_key_name" {
  type = string
}

Terraform will prompt for the variables vpc_id and aws_key_name when the terraform apply command is executed, since we have to provide the key pair name and the VPC in which the setup should be launched.
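
If you would rather not type the values at the prompt, they can be passed on the command line instead; a minimal sketch with placeholder values (substitute your own VPC ID and key pair name):

# Passing the two variables non-interactively (placeholder values)
terraform apply \
  -var="vpc_id=vpc-0123456789abcdef0" \
  -var="aws_key_name=my-keypair"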

2) Setting up the template scripts for the HDFS NameNode and DataNodes, and the MapReduce JobTracker and TaskTrackers

Make sure all the bash scripts are present in the same folder as your Terraform code. These scripts will be used as instance user data, loaded in the form of templates.

We therefore load all the scripts as templates, where namenode_ip and jobtracker_ip are the two variables used to render the private IP addresses of the NameNode and the JobTracker directly into the templates.

data "template_file" "hdfs_master" {
  template = "${file("${path.cwd}/hdfs_master.sh")}"
}




data "template_file" "hdfs_slave" {
  template = "${file("${path.cwd}/hdfs_slave.sh")}"
  vars     = {
    namenode_ip = "${aws_instance.hdfs_master.private_ip}"
  }
  depends_on = [ aws_instance.hdfs_master ]
}


data "template_file" "mapred_jobtracker" {
  template = "${file("${path.cwd}/mapred_jobtracker.sh")}"
  vars     = {
    namenode_ip = "${aws_instance.hdfs_master.private_ip}"
  }
}




data "template_file" "mapred_tasktracker" {
  template = "${file("${path.cwd}/mapred_tasktracker.sh")}"
  vars     = {
    jobtracker_ip = "${aws_instance.mapred_jobtracker.private_ip}"
  }
  depends_on = [ aws_instance.mapred_jobtracker ]
}




data "template_file" "hadoop_client" {
  template = "${file("${path.cwd}/hadoop_client.sh")}"
  vars     = {
    namenode_ip = "${aws_instance.hdfs_master.private_ip}"
    jobtracker_ip = "${aws_instance.mapred_jobtracker.private_ip}"
  }
}

Since the scripts are all similar, I will describe only one of them here: the hdfs_slave.sh shell script.

hdfs_slave.sh

#!/usr/bin/bash


instanceip=$(hostname -i)
sudo yum install wget -y
sudo wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" https://download.oracle.com/otn-pub/java/jdk/8u141-b15/336fa29ff2bb4ef291e347e091f7f4a7/jdk-8u141-linux-x64.rpm
sudo yum install jdk-8u141-linux-x64.rpm -y


sudo wget -c https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm
sudo rpm -i --force hadoop-1.2.1-1.x86_64.rpm
sudo rm -rf /dn
sudo mkdir /dn


sudo chmod 677 /etc/hadoop/hdfs-site.xml
sudo cat <<EOF > /etc/hadoop/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<configuration>
<property>
<name>dfs.data.dir</name>
<value>/dn</value>
</property>
</configuration>
EOF


sudo chmod 677 /etc/hadoop/core-site.xml
sudo cat <<EOF > /etc/hadoop/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://${namenode_ip}:9001</value>
</property>
</configuration>
EOF


if pidof /usr/java/default/bin/java
then sudo kill `pidof /usr/java/default/bin/java`
fi


sudo chmod 777 /etc/rc.d/rc.local
sudo cat <<EOF >> /etc/rc.d/rc.local
sudo hadoop-daemon.sh start datanode
EOF


sudo hadoop-daemon.sh start datanode
  • In the above script, the variable instanceip captures the IP address of the current instance.
  • The yum and wget commands download and install the required software (JDK 8 and the Hadoop 1.2.1 RPM).
  • We also create the /nn or /dn directory, which serves as the storage directory of the NameNode or DataNode respectively.
  • The cat heredocs then write the Hadoop configuration files hdfs-site.xml and core-site.xml.
  • One of the most useful features of Terraform templates is that ${namenode_ip} is a variable that Terraform fills in on the fly; a quick way to verify the rendered result is shown right after this list.
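
To confirm that the variable really was rendered into the script, you can inspect the user data that EC2 handed to the instance; a small sketch, run over SSH on any of the launched nodes:

# Print the user data this instance booted with;
# the ${namenode_ip} placeholder should already be replaced by a real private IP
curl -s http://169.254.169.254/latest/user-data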

3) Creating Security Groups for all the nodes of the HDFS and MapReduce clusters

Make sure you allow all the traffic needed for the Hadoop HDFS and MapReduce clusters to work properly. Here I have allowed all inbound and outbound traffic so that there is no firewall challenge at all.

#Creating Security Groups

resource "aws_security_group" "hdfs_master_sg" {
  name        = "hdfs_master_sg"
  description = "Allow traffic from security group"
  vpc_id      = var.vpc_id


  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }


  tags = {
    Name = "hdfs_master_sg"
  }
} 


resource "aws_security_group" "hdfs_slave_sg" {
  name        = "hdfs_slave_sg"
  description = "Allow traffic from security group"
  vpc_id      = var.vpc_id


  ingress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }


  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }


  tags = {
    Name = "hdfs_slave_sg"
  }
}


resource "aws_security_group_rule" "hdfs_master_ingress" {
  type              = "ingress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  security_group_id = "${aws_security_group.hdfs_master_sg.id}"
  cidr_blocks      = ["0.0.0.0/0"]
  ipv6_cidr_blocks = ["::/0"]
}

For the sake of convenience, I have used the catch-all CIDR range 0.0.0.0/0 so that no traffic is blocked. In a production setup you would normally restrict ingress to the VPC CIDR and the specific Hadoop ports instead.

4) Launching the desired number of Data Nodes and Task Trackers

At this stage we attach a different user_data script and a different Name tag to each of the instances on which the scripts need to run.

Launching Name Node with hdfs_master.sh script

#Launching HDFS Master Instance


resource "aws_instance" "hdfs_master" {
  ami               = "ami-052c08d70def0ac62"
  instance_type     = "t2.micro"
  security_groups   = [ "hdfs_master_sg" ]
  key_name          = "${var.aws_key_name}"
  user_data = "${data.template_file.hdfs_master.template}"
  tags = {
    Name = "hdfs-master"
  }

}

Launching Data Node with hdfs_slave.sh script

#Launching HDFS Slaves


resource "aws_instance" "hdfs_slave1" {
  ami              = "ami-052c08d70def0ac62"
  instance_type    = "t2.micro"
  security_groups  = [ "hdfs_slave_sg" ]
  key_name         = "${var.aws_key_name}"
  associate_public_ip_address = false
  user_data = "${data.template_file.hdfs_slave.rendered}"
  tags = {
    Name = "hdfs-slave1"
  }


  depends_on = [ aws_instance.hdfs_master ]
}

Launching Jobtracker with mapred_jobtracker.sh script

#Launching MapReduce master


resource "aws_instance" "mapred_jobtracker" {
  ami               = "ami-052c08d70def0ac62"
  instance_type     = "t2.micro"
  security_groups   = [ "mapred_jobtracker_sg" ]
  key_name          = "${var.aws_key_name}"
  user_data = "${data.template_file.mapred_jobtracker.rendered}"
  tags = {
    Name = "mapred_jobtracker"
  }


}

Launching Tasktracker with mapred_tasktracker.sh script

#Launching MapReduce slaves


resource "aws_instance" "mapred_tasktracker1" {
  ami              = "ami-052c08d70def0ac62"
  instance_type    = "t2.micro"
  security_groups  = [ "mapred_tasktracker_sg" ]
  key_name         = "${var.aws_key_name}"
  associate_public_ip_address = false
  user_data = "${data.template_file.mapred_tasktracker.rendered}"
  tags = {
    Name = "mapred_tasktracker1"
  }


  depends_on = [ aws_instance.mapred_jobtracker ]
}

5) Printing the IP addresses of the required instances as outputs

You can print as many outputs as you need; I have printed only the ones that are useful with respect to the Hive cluster.

#Creating Output

output "hdfs_master_public_IP" {
  value = aws_instance.hdfs_master.public_ip
}


output "hdfs_master_private_IP" {
  value = aws_instance.hdfs_master.private_ip
}




output "mapred_jobtracker_public_IP" {
  value = aws_instance.mapred_jobtracker.public_ip
}


output "mapred_jobtracker_private_IP" {
  value = aws_instance.mapred_jobtracker.private_ip
}




output "Hadooop_client_public_IP" {
  value = aws_instance.hadoop_client.public_ip
}
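
After an apply, any of these values can be read back from the state at any time without re-running the whole plan; a small sketch:

# Re-print a single output value, e.g. the NameNode's public IP
terraform output hdfs_master_public_IP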

Execution:

Finally it is time to execute the code, and for this we have to run the following command:

terraform apply

Here you have to give the name of the AWS key pair that you want to associate with all these instances, as well as the ID of the VPC in which the setup should be launched.
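
For a fresh working directory, a typical run might look like the sketch below; the prompts appear because the two variables have no default values:

terraform init        # downloads the AWS and template providers used by this code
terraform validate    # optional sanity check of the configuration
terraform apply       # prompts for var.aws_key_name and var.vpc_id, shows the plan, then builds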


Then, at the confirmation prompt, type yes to let Terraform proceed.

Once the code has run successfully, Terraform prints a green summary line showing how many resources were added.


Great, with just a single command the whole setup is ready before you even realize it. So let's check whether things are working or not.

Testing

To check everything, I logged into the client node. From there we can see that three instances are configured as DataNodes under their NameNode.


Likewise, three nodes are available as TaskTrackers working under their JobTracker; the commands below can be used to confirm both.
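
A small sketch of the checks, assuming the client node has the same Hadoop 1.2.1 RPM installed and its configuration pointing at the NameNode and JobTracker:

# List the DataNodes currently registered with the NameNode
hadoop dfsadmin -report

# List the TaskTrackers currently registered with the JobTracker
hadoop job -list-active-trackers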


Woohoo! The NameNode and DataNodes are now configured. Let's move on and look at the remaining details.

Run the following commands to check whether the hive command is working or not:

bash
hive

** Note: It is important to start a new bash shell with the bash command before calling hive, since the shell you were already using may not have the HIVE_HOME variable and the updated PATH attached to it.
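
If hive is still not found, the environment can be loaded or set by hand; a sketch under the assumption that the client script installed Hive under /usr/local/hive and wrote its exports into the system profile (adjust both paths to wherever your own script put them):

# Reload the login-shell profile (assumed location of the exports written by the client script)
source /etc/profile

# Or set the variables manually; /usr/local/hive is an assumed install directory
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
hive    # should now drop you into the Hive CLI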

Let's check what exactly the hive command does.


The following command is used to check what is stored in the Hadoop HDFS cluster that is running behind the scenes.

hadoop fs -ls /

Let's try to create a database in Hive and find out what happens in the HDFS cluster.

The commands below create a database named ssjnm in the cluster and confirm that it exists.
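
A minimal sketch of the HiveQL that was used, run from the client node (it can also be typed interactively inside the hive shell):

# Create the database and list all databases to confirm it is there
hive -e "CREATE DATABASE ssjnm; SHOW DATABASES;"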


Then let's check what effect this has on the HDFS cluster.
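
By default Hive keeps its data under the warehouse directory in HDFS (hive.metastore.warehouse.dir, normally /user/hive/warehouse), so the new database should appear there as a <name>.db directory; a sketch of the check:

# The new database shows up as a directory in the Hive warehouse path
hadoop fs -ls /user/hive/warehouse/
# Expect an entry like /user/hive/warehouse/ssjnm.db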


Here's the final conclusion: as soon as we create a database in Hive, that database is actually created inside the HDFS cluster, and everything we add to it keeps being stored in the HDFS cluster as well.

Great! The outcome we wanted, performing the analysis directly on top of the Hadoop cluster, has been achieved.

Hope you liked the setup. Thanks to all the tech enthusiasts who patiently read through this.

