Hive Architecture in Depth

Apache Hive is an ETL and data warehousing tool built on top of Hadoop for summarizing, analyzing and querying large datasets on the open-source Hadoop platform. Tables in Hive are similar to tables in a relational database, and data can be organized from coarser to more granular units with the help of partitioning and bucketing.

In this blog post, I will explain how the architecture works when a Hive query is executed, covering details such as query execution and the format, location and schema of a Hive table inside the Metastore.

There are four main components in the Hive architecture:

  1. Hadoop core components (HDFS, MapReduce)
  2. Metastore
  3. Driver
  4. Hive Clients

Let's start off with each of the components individually.

  1. Hadoop core components:

i) HDFS: When we load data into a Hive table, it is internally stored on an HDFS path, by default under the Hive warehouse directory.

The default warehouse location is set by the hive.metastore.warehouse.dir property in hive-site.xml and defaults to /user/hive/warehouse on HDFS.
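You can confirm the value from the Hive shell; the output shown assumes the default and may differ on your cluster:

    hive> SET hive.metastore.warehouse.dir;
    hive.metastore.warehouse.dir=/user/hive/warehouse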

We can create a Hive table and load data into it as shown below.
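A minimal sketch, assuming a comma-delimited input file; the table name, columns and file path are illustrative:

    CREATE TABLE employee (
      id INT,
      name STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';

    LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;

Since employee is a managed table, the loaded file ends up under the warehouse directory, e.g. /user/hive/warehouse/employee/.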

ii) MapReduce: When we run a query such as the one below, Hive converts or compiles the query into Java class files, builds a jar, and executes that jar as a MapReduce job.
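For example, an aggregate query like this triggers a MapReduce job (a plain SELECT * can often be answered with a simple fetch instead); the table name is the illustrative one from above:

    SELECT COUNT(*) FROM employee;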

2. Metastore: The Metastore is a namespace for tables. It is a crucial part of Hive, as all the metadata, such as details about tables, columns, partitions and locations, is kept in it. The Metastore is usually hosted in a relational database, e.g. MySQL.

You can check the database configuration in hive-site.xml.
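The relevant JDBC settings look like this for a MySQL-backed Metastore; the URL, driver and username values below are illustrative:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>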

The Metastore details can be found as shown below.
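For example, connecting directly to a MySQL-backed Metastore (the database name metastore is illustrative and depends on your installation):

    mysql> USE metastore;
    mysql> SHOW TABLES;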

The Metastore has 51 tables in all describing various properties of tables, but out of these 51 the three below provide the majority of the information that helps Hive infer the schema properties needed to execute Hive commands on a table (a sample query against them follows the list).

i) TBLS — stores all table information (table name, owner, table type: managed or external)

ii) DBS — Database information (Location of database, Database name, Owner)

iii) COLUMNS_V2 — column names and their datatypes
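As a sketch of how this metadata fits together, the three tables can be joined to recover a table's schema directly from the Metastore database (run this against the metastore RDBMS, not Hive). Note that COLUMNS_V2 links to TBLS through the storage descriptor table SDS; the table name employee is the illustrative one from earlier:

    SELECT d.NAME AS db_name,
           t.TBL_NAME,
           t.TBL_TYPE,
           c.COLUMN_NAME,
           c.TYPE_NAME
    FROM   TBLS t
    JOIN   DBS  d ON t.DB_ID = d.DB_ID
    JOIN   SDS  s ON t.SD_ID = s.SD_ID
    JOIN   COLUMNS_V2 c ON s.CD_ID = c.CD_ID
    WHERE  t.TBL_NAME = 'employee';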

Note: In production, access to the Metastore is restricted to the admin and specific users.

3. Driver: The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore. The execution plan created by the compiler is a DAG of stages. 
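You can inspect this plan with EXPLAIN, which prints the stages of the DAG along with their dependencies:

    EXPLAIN SELECT COUNT(*) FROM employee;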

A set of jar files that ship with the Hive package helps convert these HiveQL queries into equivalent MapReduce (Java) jobs and execute them on MapReduce.

To check whether Hive can talk to the appropriate cluster, i.e. which cluster Hive will interact with when it queries or executes jobs, you can check the details in core-site.xml.
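The cluster Hive points to is determined by the fs.defaultFS property; the host and port below are illustrative:

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode-host:8020</value>
    </property>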

You can verify the same from Hive as well.
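From the Hive shell, SET echoes the property value and dfs commands run against that same cluster:

    hive> SET fs.defaultFS;
    hive> dfs -ls /user/hive/warehouse;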

4. Hive Clients: These are the interfaces through which we submit Hive queries. The Hive CLI and Beeline are terminal interfaces; web interfaces such as Hue and Ambari can be used for the same purpose.
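For example, Beeline connects to HiveServer2 over JDBC; the host, port and username here are illustrative:

    beeline -u jdbc:hive2://localhost:10000 -n hive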

On connecting to Hive via the CLI or a web interface, a connection to the Metastore is established as well.

The commands and queries related to this post are available in my Git account.

If you enjoyed reading this, you can click the like and share buttons to let others know about it. If you would like me to add anything else, please feel free to leave a response.
