Hive Tutorial
If you have had a look at the Hadoop ecosystem, you may have noticed the yellow elephant logo that says HIVE, but do you know what Hive is all about and what it does? At a high level, Hive's main purpose is querying and analyzing large datasets stored in HDFS. It supports easy data summarization, ad-hoc queries, and analysis of vast volumes of data stored in the various databases and file systems that integrate with Hadoop. In other words, in the world of big data, Hive is huge.
In this Hive tutorial, let's start by understanding why Hive came into existence.
History of Hive
Hive has a fascinating history related to the world's largest social networking site: Facebook. Facebook adopted the Hadoop framework to manage their big data. If you have read our previous blogs, you would know that big data is nothing but massive amounts of data that cannot be stored, processed, and analyzed by traditional systems.
As we know, Hadoop uses MapReduce to process data. With MapReduce, users were required to write long and extensive Java code. Not all users were well-versed in Java or other programming languages. Users who were comfortable writing queries in SQL (Structured Query Language) wanted a language similar to SQL. Enter HiveQL. The idea was to incorporate the concepts of tables and columns, just like SQL.
Hive is a data warehouse system that is used to query and analyze large datasets stored in HDFS. Hive uses a query language called HiveQL, which is similar to SQL.
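As a quick taste of that similarity, an aggregate query in HiveQL reads just like SQL (the employees table here is hypothetical, purely for illustration):

select department, count(*) as headcount
from employees
group by department; -- groups the rows by department and counts each group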
As seen from the image below, the user first sends out the Hive queries. These queries are converted into MapReduce jobs, which are then submitted to the Hadoop MapReduce system.
In the next section of the Hive tutorial, let's now take a look at the architecture of the Hive.
Architecture of Hive
The architecture of Hive is shown below. We start with the Hive client, which could be a programmer proficient in SQL looking up the data that is needed.
The Hive client supports different types of client applications in different languages for performing queries. Thrift is a software framework for scalable cross-language services; the Hive Server is based on Thrift, so it can serve requests from all programming languages that support Thrift.
Next, we have the JDBC (Java Database Connectivity) application, which connects through the JDBC driver, and the ODBC (Open Database Connectivity) application, which connects through the ODBC driver. All these client requests are submitted to the Hive server.
In addition to the above, we also have the Hive web interface, or GUI, where programmers execute Hive queries. Commands can also be executed directly in the CLI. Next up is the Hive driver, which is responsible for all the queries submitted. It performs three steps internally:
Compiler: The driver passes the query to the compiler, which parses it, checks it against the metadata, and generates an execution plan.
Optimizer: The execution plan is transformed into an optimized series of tasks.
Executor: The tasks are run in the proper order against the underlying execution engine.
Metastore is a repository for Hive metadata. It stores the metadata for Hive tables, and you can think of it as your schema. By default, it lives in an embedded Apache Derby database. Hive uses the MapReduce framework to process queries. Finally, we have distributed storage, which is HDFS. If you have read our other Hadoop blogs, you'll know that HDFS runs on commodity machines and is linearly scalable, which means it's very affordable.
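To see the metadata the metastore keeps for a table, you can ask Hive directly from the shell; a minimal sketch, assuming a table named employee already exists:

describe formatted employee; -- Prints the columns plus metastore details such as location, owner, and storage format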
In this Hive tutorial, let's now understand how data flows in Hive.
Data Flow in Hive
Data flow in Hive spans the Hive and Hadoop systems. Underneath the user interface, we have the driver, the compiler, the execution engine, and the metastore. All of these feed into MapReduce and the Hadoop file system.
The data flows in the following sequence:
1. The user interface submits the query to the driver.
2. The driver sends the query to the compiler to generate an execution plan.
3. The compiler requests the required metadata from the metastore, which sends it back.
4. The compiler returns the completed plan to the driver.
5. The driver hands the plan to the execution engine.
6. The execution engine runs the plan as a MapReduce job on Hadoop, reading from HDFS.
7. The results are fetched and returned through the driver to the user interface.
Hive Data Modeling
That was how data flows in Hive. Let's now take a look at Hive data modeling, which consists of tables, partitions, and buckets (a short HiveQL sketch of these follows the figure below):
Tables: Hive tables are similar to RDBMS tables, and each table maps to a directory in HDFS.
Partitions: A table can be divided into partitions based on the values of a column (for example, date or country), so queries only scan the relevant slices of data.
Buckets: The data in a table or partition can be further divided into buckets based on the hash of a column, with each bucket stored as a file.
Fig: Hive Data Modelling
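To make these concepts concrete, here is a minimal sketch of a table definition that uses all three; the table and column names are illustrative, not from the demo:

create table page_views (
user_id int,
url string)
partitioned by (country string) -- each country value becomes its own subdirectory
clustered by (user_id) into 8 buckets; -- rows are hashed on user_id into 8 files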
Now that you have understood Hive data modeling, let us dive into the Hive data types in this Hive tutorial.
Hive Data Types
Now that you know how data is modeled in Hive, let us look into the different Hive data types. These are classified as primitive and complex data types.
Primitive Data Types:
Numeric types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
String types: STRING, VARCHAR, CHAR
Date/time types: TIMESTAMP, DATE
Miscellaneous types: BOOLEAN, BINARY
Complex Data Types:
ARRAY: an ordered collection of elements of the same type
MAP: a collection of key-value pairs
STRUCT: a record with named fields, possibly of different types
UNIONTYPE: a value that can hold exactly one of several specified types
A short sketch of these types in a table definition follows.
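As a minimal sketch, here is how the complex types can appear in a table definition (the names are illustrative):

create table employee_details (
name string, -- primitive type
phone_numbers array<string>, -- ARRAY: an ordered list of strings
deductions map<string,float>, -- MAP: key-value pairs
address struct<street:string, city:string, zip:int> -- STRUCT: named fields
);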
Different Modes of Hive
Next, let us move on to the modes Hive operates in. Hive operates in two modes, depending on the number of data nodes and the size of the data:
Local mode: used when Hadoop has a single data node and the data is small; processing runs on the local machine, which is fast for small datasets.
MapReduce mode: used when Hadoop has multiple data nodes and the data is spread across them; queries run as distributed MapReduce jobs and can process large datasets in parallel.
A sketch of the relevant setting follows.
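As a sketch, the choice of mode can be influenced from the Hive shell; hive.exec.mode.local.auto is a standard Hive property, and the behavior described assumes its default size thresholds:

SET hive.exec.mode.local.auto=true; -- lets Hive run small jobs locally instead of launching cluster MapReduce jobs
SET hive.exec.mode.local.auto=false; -- forces queries through MapReduce mode (the default on a cluster)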
Difference Between Hive and RDBMS
RDBMS, which stands for Relational Database Management System, also works with tables. But what is the difference between Hive and RDBMS?
Hive:
Enforces schema on read; data is checked when it is queried, not when it is loaded.
Designed for data warehousing, where data is written once and read many times.
Suited to static, analytical data at very large (petabyte) scale.
Classic Hive does not support record-level updates or OLTP; it is built for OLAP-style analysis.
RDBMS:
Enforces schema on write; data is validated as it is inserted.
Designed for transactional workloads, with frequent reads and writes.
Suited to dynamic data, typically up to the terabyte range.
Supports record-level inserts, updates, and deletes, and OLTP.
Features of Hive
Now that we have learned about the architecture of Hive, the different Hive data types, and Hive data modeling, let us look into Hive's various features:
It provides HiveQL, a SQL-like query language, so users do not have to write MapReduce code by hand.
Tables and databases are created first, and data is then loaded into the tables.
It supports partitioning and bucketing to speed up queries.
It is designed for querying and managing structured data stored in HDFS.
It is scalable and extensible, for example through user-defined functions (UDFs).
Hive Demo
Finally, we will go through a quick Hive demo, which will help us understand how HiveQL works. Before diving into the demo, you can have a quick look at the Hive website, hive.apache.org. Hortonworks provides a useful Hive cheat sheet, too. It shows the different HiveQL commands and various data types.
Now, let's run our Hive demo on a Hadoop cluster. We will use the Cloudera QuickStart VM, which has Hadoop set up on a single node. Hadoop and Hive are designed to run across a cluster of computers, but here we are working with a single node. Therefore, we will start with the Cloudera QuickStart.
When you are in Cloudera, you can access Hive in two ways. The first is by using Hue, which is more visual than the command line. The screen will look like this:
Once you click on Hive, as seen above, you can start writing queries in the query space. The downside of Hue is that it can be slow. Now, we will move on to the Linux terminal window and start writing commands. Start by typing hive; this launches the Hive shell. Your screen will now look like the following:
Now, type this:
create database office; -- Creates a database called office
show databases; -- Lists the existing databases
drop database office; -- Drops the office database; this works because it is empty
drop database office cascade; -- CASCADE drops a database along with its tables when it is not empty
create database office; -- Recreates the office database
use office; -- Sets office as the default database
Then, open another terminal window and type the following:
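As a sketch, assuming Employee.csv sits in the current directory, the commands would be:

ls # Lists the files in the current directory; Employee.csv should appear
cat Employee.csv # Prints the contents of the file, header row first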
From the above image, you can see that we already have a file named Employee.csv, which we will be using. The content of the file is as seen above: the data has a header row, and below it all the values are given, separated by commas. You will then type the following:
pwd # Displays the path of the present working directory
gedit Employee.csv # Opens the file so you can edit the content and remove any extra spaces
Go back to the Hive shell and enter the following command:
create table employee -- No ';' yet, as we don't want to execute the statement before adding the schema
In addition to this, type the following schema for our table. If you already have it ready, you can paste it.
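The exact columns depend on Employee.csv; as a minimal sketch, assuming the file has ID, Name, Department, and Salary columns plus a header row, the full statement might look like this:

create table employee (
id int,
name string,
department string,
salary int)
row format delimited
fields terminated by ','
tblproperties ("skip.header.line.count"="1"); -- skips the header row when querying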
We put a semicolon at the end and run these lines. Then type show tables; and you will see employee displayed. You can also type describe employee; to see the columns and their types. After this, go back to the Linux terminal, copy the path, and enter the following commands:
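As a sketch, the path is whatever pwd printed; on the Cloudera QuickStart VM it is typically /home/cloudera:

load data local inpath '/home/cloudera/Employee.csv' into table employee; -- copies the local file into the table's warehouse directory in HDFS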
In the above image, you can see that we have loaded the content into the table. Then we can type:
select * from employee; -- Displays all the rows in the table
select count(*) from employee; -- Runs a MapReduce job and displays the number of rows
select * from office.employee where salary > 25000; -- Displays only the rows where the salary exceeds 25000
Hive also gives you the ability to alter a table and rename it (a sketch follows). You can then have a look at the renamed table:
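As a minimal sketch (the new name employee_data is just an illustration):

alter table employee rename to employee_data; -- Renames the table
show tables; -- The new name now appears in place of the old one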
Now, go back to loading data and look at the tables by navigating back to the terminal window. Then, go ahead and use the cat command to display the data:
We just completed the above operation to join the different datasets.
If we completed the above steps correctly, we should be able to select the data and complete the following steps:
Now, to find some specific information related to these orders, type the following:
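The exact query depends on the datasets loaded above; as a sketch, assuming a customers table and an orders table that share a customer id, a typical HiveQL join looks like this:

select c.name, o.order_date, o.amount
from customers c
join orders o on (c.id = o.customer_id); -- matches each order with its customer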
You will now have the final result, as displayed below:
This is very helpful if you have to find particular information. This is how a join works, and it is one of the most widespread uses of HiveQL. After this, you can also go ahead and perform the drop operation, along with cascade. You can also try out a few of Hive's built-in functions. Let's have a look at a few of them:
hive> SELECT round(2.3) from temp; -- Rounds to the nearest integer: 2.3 becomes 2
hive> SELECT floor(2.3) from temp; -- Rounds down to the largest integer not greater than the value: 2.3 becomes 2
hive> SELECT ceil(2.3) from temp; -- Rounds up to the smallest integer not less than the value: 2.3 becomes 3
That was the Hive demo; as you saw, HiveQL queries are very similar to SQL, and the commands are easy to understand and run.
Conclusion
We hope this has helped you gain a better understanding of Apache Hive. You have learned about the importance of Hive, what Hive does, the various data types in Hive, the different modes in which Hive operates, and the differences between Hive and RDBMS. You also learned how Hive works through a short demo.