Hive Tutorial
If you have had a look at the Hadoop ecosystem, you may have noticed the yellow elephant logo that says HIVE, but do you know what Hive is all about and what it does? At a high level, Hive's main purpose is querying and analyzing large datasets stored in HDFS. It supports easy data summarization, ad-hoc queries, and analysis of vast volumes of data stored in the various databases and file systems that integrate with Hadoop. In other words, in the world of big data, Hive is huge.
In this Hive tutorial, let's start by understanding why Hive came into existence.
History of Hive
Hive has a fascinating history related to the world's largest social networking site: Facebook. Facebook adopted the Hadoop framework to manage their big data. If you have read our previous blogs, you would know that big data is nothing but massive amounts of data that cannot be stored, processed, and analyzed by traditional systems.
As we know, Hadoop uses MapReduce to process data. With MapReduce, users were required to write long and extensive Java code. Not all users were well-versed in Java or other programming languages. Users who were comfortable writing queries in SQL (Structured Query Language) wanted a language similar to SQL. Enter HiveQL. The idea was to incorporate the concepts of tables and columns, just like SQL.
Hive is a data warehouse system that is used to query and analyze large datasets stored in HDFS. Hive uses a query language called HiveQL, which is similar to SQL.
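As a quick taste of that similarity, an aggregate query in HiveQL reads just like SQL (the employees table here is hypothetical, purely for illustration):

select department, count(*) as headcount
from employees
group by department; -- groups the rows by department and counts each group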
As seen from the image below, the user first sends out the Hive queries. These queries are converted into MapReduce jobs, which are then submitted to the Hadoop MapReduce system.
In the next section of the Hive tutorial, let's now take a look at the architecture of the Hive.
Architecture of Hive
The architecture of Hive is shown below. We start with the Hive client, which could be a programmer proficient in SQL looking up the data that is needed.
The Hive client supports different types of client applications in different languages for performing queries. Thrift is a software framework for scalable cross-language services; the Hive Server is based on Thrift, so it can serve requests from all programming languages that support Thrift.
Next, we have the JDBC (Java Database Connectivity) application, which connects through the JDBC driver, and the ODBC (Open Database Connectivity) application, which connects through the ODBC driver. All these client requests are submitted to the Hive server.
In addition to the above, we also have the Hive web interface, or GUI, where programmers execute Hive queries. Commands can also be executed directly in the CLI. Next up is the Hive driver, which is responsible for all the queries submitted. It performs three steps internally:
Compiler: The driver passes the query to the compiler, which parses it, checks it against the metadata, and generates an execution plan.
Optimizer: The execution plan is transformed into an optimized series of tasks.
Executor: The tasks are run in the proper order against the underlying execution engine.
Metastore is a repository for Hive metadata. It stores the metadata for Hive tables, and you can think of it as your schema. By default, it lives in an embedded Apache Derby database. Hive uses the MapReduce framework to process queries. Finally, we have distributed storage, which is HDFS. If you have read our other Hadoop blogs, you'll know that HDFS runs on commodity machines and is linearly scalable, which means it's very affordable.
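To see the metadata the metastore keeps for a table, you can ask Hive directly from the shell; a minimal sketch, assuming a table named employee already exists:

describe formatted employee; -- Prints the columns plus metastore details such as location, owner, and storage format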
In this Hive tutorial, let's now understand how data flows in Hive.
Data Flow in Hive
Data flow in Hive spans the Hive and Hadoop systems. Underneath the user interface, we have the driver, the compiler, the execution engine, and the metastore. All of these feed into MapReduce and the Hadoop file system.
The data flows in the following sequence:
1. The user interface submits the query to the driver.
2. The driver sends the query to the compiler to generate an execution plan.
3. The compiler requests the required metadata from the metastore, which sends it back.
4. The compiler returns the completed plan to the driver.
5. The driver hands the plan to the execution engine.
6. The execution engine runs the plan as a MapReduce job on Hadoop, reading from HDFS.
7. The results are fetched and returned through the driver to the user interface.
Hive Data Modeling
That was how data flows in Hive. Let's now take a look at Hive data modeling, which consists of tables, partitions, and buckets (a short HiveQL sketch of these follows the figure below):
Tables: Hive tables are similar to RDBMS tables, and each table maps to a directory in HDFS.
Partitions: A table can be divided into partitions based on the values of a column (for example, date or country), so queries only scan the relevant slices of data.
Buckets: The data in a table or partition can be further divided into buckets based on the hash of a column, with each bucket stored as a file.
Fig: Hive Data Modelling
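To make these concepts concrete, here is a minimal sketch of a table definition that uses all three; the table and column names are illustrative, not from the demo:

create table page_views (
user_id int,
url string)
partitioned by (country string) -- each country value becomes its own subdirectory
clustered by (user_id) into 8 buckets; -- rows are hashed on user_id into 8 files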
Now that you have understood Hive data modeling, let us dive into the Hive data types in this Hive tutorial.
Hive Data Types
Now that you know how data is modeled in Hive, let us look into the different Hive data types. These are classified as primitive and complex data types.
Primitive Data Types:
Numeric types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
String types: STRING, VARCHAR, CHAR
Date/time types: TIMESTAMP, DATE
Miscellaneous types: BOOLEAN, BINARY
Complex Data Types:
ARRAY: an ordered collection of elements of the same type
MAP: a collection of key-value pairs
STRUCT: a record with named fields, possibly of different types
UNIONTYPE: a value that can hold exactly one of several specified types
A short sketch of these types in a table definition follows.
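As a minimal sketch, here is how the complex types can appear in a table definition (the names are illustrative):

create table employee_details (
name string, -- primitive type
phone_numbers array<string>, -- ARRAY: an ordered list of strings
deductions map<string,float>, -- MAP: key-value pairs
address struct<street:string, city:string, zip:int> -- STRUCT: named fields
);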
Different Modes of Hive
Next, let us move on to the modes Hive operates in. Hive operates in two modes, depending on the number of data nodes and the size of the data:
Local mode: used when Hadoop has a single data node and the data is small; processing runs on the local machine, which is fast for small datasets.
MapReduce mode: used when Hadoop has multiple data nodes and the data is spread across them; queries run as distributed MapReduce jobs and can process large datasets in parallel.
A sketch of the relevant setting follows.
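As a sketch, the choice of mode can be influenced from the Hive shell; hive.exec.mode.local.auto is a standard Hive property, and the behavior described assumes its default size thresholds:

SET hive.exec.mode.local.auto=true; -- lets Hive run small jobs locally instead of launching cluster MapReduce jobs
SET hive.exec.mode.local.auto=false; -- forces queries through MapReduce mode (the default on a cluster)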
Difference Between Hive and RDBMS
RDBMS, which stands for Relational Database Management System, also works with tables. But what is the difference between Hive and RDBMS?
Hive:
Enforces schema on read; data is checked when it is queried, not when it is loaded.
Designed for data warehousing, where data is written once and read many times.
Suited to static, analytical data at very large (petabyte) scale.
Classic Hive does not support record-level updates or OLTP; it is built for OLAP-style analysis.
RDBMS:
Enforces schema on write; data is validated as it is inserted.
Designed for transactional workloads, with frequent reads and writes.
Suited to dynamic data, typically up to the terabyte range.
Supports record-level inserts, updates, and deletes, and OLTP.
Features of Hive
Now that we have learned about the architecture of Hive, the different Hive data types, and Hive data modeling, let us look into Hive's various features:
It provides HiveQL, a SQL-like query language, so users do not have to write MapReduce code by hand.
Tables and databases are created first, and data is then loaded into the tables.
It supports partitioning and bucketing to speed up queries.
It is designed for querying and managing structured data stored in HDFS.
It is scalable and extensible, for example through user-defined functions (UDFs).
Hive Demo
Finally, we will go through a quick Hive demo, which will help us understand how HiveQL works. Before diving into the demo, you can have a quick look at the Hive website, hive.apache.org. Hortonworks provides a useful Hive cheat sheet, too. It shows the different HiveQL commands and various data types.
Now, let's run our Hive demo on a Hadoop cluster. We will use the Cloudera QuickStart VM, which has Hadoop set up on a single node. Hadoop and Hive are designed to run across a cluster of computers, but here we are working with a single node. Therefore, we will start with the Cloudera QuickStart.
When you are in Cloudera, you can access Hive in two ways. The first is by using Hue, which is more visual than the command line. The screen will look like this:
Once you click on Hive, as seen above, you can start writing queries in the query space. The downside of Hue is that it can be slow. Now, we will move on to the Linux terminal window and start writing commands. Start by typing hive; this launches the Hive shell. Your screen will now look like the following:
Now, type this:
create database office; -- Creates a database called office
show databases; -- Lists the existing databases
drop database office; -- Drops the office database; this works because it is empty
drop database office cascade; -- CASCADE drops a database along with its tables when it is not empty
create database office; -- Recreates the office database
use office; -- Sets office as the default database
Then, open another terminal window and type the following:
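As a sketch, assuming Employee.csv sits in the current directory, the commands would be:

ls # Lists the files in the current directory; Employee.csv should appear
cat Employee.csv # Prints the contents of the file, header row first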
From the above image, you can see that we already have a file named Employee.csv, which we will be using. The content of the file is as seen above: the data has a header row, and below it all the values are given, separated by commas. You will then type the following:
pwd # Displays the path of the present working directory
gedit Employee.csv # Opens the file so you can edit the content and remove any extra spaces
Go back to the Hive shell and enter the following command:
create table employee -- No ';' yet, as we don't want to execute the statement before adding the schema
In addition to this, type the following schema for our table. If you already have it ready, you can paste it.
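The exact columns depend on Employee.csv; as a minimal sketch, assuming the file has ID, Name, Department, and Salary columns plus a header row, the full statement might look like this:

create table employee (
id int,
name string,
department string,
salary int)
row format delimited
fields terminated by ','
tblproperties ("skip.header.line.count"="1"); -- skips the header row when querying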
We put a semicolon at the end and run these lines. Then type show tables; and you will see employee displayed. You can also type describe employee; to see the columns and their types. After this, go back to the Linux terminal, copy the path, and enter the following commands:
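As a sketch, the path is whatever pwd printed; on the Cloudera QuickStart VM it is typically /home/cloudera:

load data local inpath '/home/cloudera/Employee.csv' into table employee; -- copies the local file into the table's warehouse directory in HDFS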
In the above image, you can see that we have loaded the content into the table. Then we can type:
select * from employee; -- Displays all the rows in the table
select count(*) from employee; -- Runs a MapReduce job and displays the number of rows
select * from office.employee where salary > 25000; -- Displays only the rows where the salary exceeds 25000
Hive also gives you the ability to alter a table and rename it (a sketch follows). You can then have a look at the renamed table:
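As a minimal sketch (the new name employee_data is just an illustration):

alter table employee rename to employee_data; -- Renames the table
show tables; -- The new name now appears in place of the old one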
Now, go back to loading data and look at the tables by navigating back to the terminal window. Then, go ahead and use the cat command to display the data:
We just completed the above operation to join the different datasets.
If we completed the above steps correctly, we should be able to select the data and complete the following steps:
Now, to find some specific information related to these orders, type the following:
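The exact query depends on the datasets loaded above; as a sketch, assuming a customers table and an orders table that share a customer id, a typical HiveQL join looks like this:

select c.name, o.order_date, o.amount
from customers c
join orders o on (c.id = o.customer_id); -- matches each order with its customer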
You will now have the final result, as displayed below:
This is very helpful if you have to find particular information. This is how a join works, and it is one of the most widespread uses of HiveQL. After this, you can also go ahead and perform the drop operation, along with cascade. You can also try out a few of Hive's built-in functions. Let's have a look at a few of them:
hive> SELECT round(2.3) from temp; -- Rounds to the nearest integer: 2.3 becomes 2
hive> SELECT floor(2.3) from temp; -- Rounds down to the largest integer not greater than the value: 2.3 becomes 2
hive> SELECT ceil(2.3) from temp; -- Rounds up to the smallest integer not less than the value: 2.3 becomes 3
That was the Hive demo; as you saw, HiveQL queries are very similar to SQL, and the commands are easy to understand and run.
Conclusion
We hope this has helped you gain a better understanding of Apache Hive. You have learned about the importance of Hive, what Hive does, the various data types in Hive, the different modes in which Hive operates, and the differences between Hive and RDBMS. You also learned how Hive works through a short demo.