HIVE - Partitioning and Bucketing with examples


Partitioning:

Partitioning is used to slice the data horizontally, over the entire range or over a smaller range of values, using one or more columns. The partition concept is well known in RDBMS as well, so if you know databases you should already be aware of it; if not, that is not a problem either, because you will be after reading this article. For example, if you have a hundred student records for a university and you want to divide them into male and female students, that is nothing but partitioning, with GENDER as the column used to split the records. Again, as mentioned, more than one column can be used to split the records.

In a real-world scenario, if you want to analyze log files of user activity on the internet, it helps to keep the data organized by date and geography. There are two types of partitioning in HIVE:

1. Static Partitioning
2. Dynamic Partitioning

The table DDL statement is the same for both types of partitioning, as shown below:

[Click Here for complete source code]

I have created a table T_USER_LOG with DT and COUNTRY as my partitioning columns. I have used Hive script mode, where HivePartition.hql is my script file; you can use the Hive shell as well, or whichever way is feasible for you. As highlighted in the image, the partition columns appear in the table schema like normal table columns.
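Since the DDL itself is only visible in the screenshot, here is a minimal sketch of what it looks like. The regular columns (USER_ID, URL, HIT_TIME) are illustrative assumptions; the key point is that DT and COUNTRY appear only in the PARTITIONED BY clause, not in the column list:

CREATE TABLE T_USER_LOG (
  USER_ID  STRING,
  URL      STRING,
  HIT_TIME STRING
)
PARTITIONED BY (DT STRING, COUNTRY STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;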

Static Partitioning: In static partitioning, we know the partition column values beforehand. So far so good; it is when we load the data that the difference appears.

LOAD DATA LOCAL INPATH [path_name] OVERWRITE INTO TABLE [table_name] PARTITION (partition_column='value', ...). Here we have to give the partition column values explicitly whenever we want to create a new partition, as shown below:

LOAD SCRIPTS.
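The exact paths and partition values are only visible in the screenshot, so the statements below are a sketch with illustrative values; the pattern of one LOAD per partition is what matters:

LOAD DATA LOCAL INPATH '/home/hduser/UserLog.txt' OVERWRITE INTO TABLE T_USER_LOG PARTITION (DT='2016-01-01', COUNTRY='US');
LOAD DATA LOCAL INPATH '/home/hduser/UserLog.txt' OVERWRITE INTO TABLE T_USER_LOG PARTITION (DT='2016-01-01', COUNTRY='IN');
LOAD DATA LOCAL INPATH '/home/hduser/UserLog.txt' OVERWRITE INTO TABLE T_USER_LOG PARTITION (DT='2016-01-02', COUNTRY='US');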

As you can see, we have mentioned the partition column value in each load.

OUTPUT: Since our table T_USER_LOG is a managed table, the data is loaded under the Hive warehouse path, i.e. /user/hive/warehouse/t_user_log.

Here you can check all the other partitions as well; each will have the file UserLog.txt. There are two levels of partitioning in our example, first on DT and then on COUNTRY, and the final data is stored inside. All partitions in Hive exist as directories. Loading in Hive is an instantaneous process and it does not trigger a Map/Reduce job; that is why our file is stored as UserLog.txt instead of a 000000_0 file. Please follow the article, as I will show in dynamic partitioning how we LOAD a table using another table, where a Map/Reduce job is triggered.
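Assuming the illustrative partition values used in the load sketch above, the warehouse directory layout would look roughly like this, with one sub-directory per partition value and the loaded file kept as-is inside the leaf directory:

/user/hive/warehouse/t_user_log/dt=2016-01-01/country=US/UserLog.txt
/user/hive/warehouse/t_user_log/dt=2016-01-01/country=IN/UserLog.txt
/user/hive/warehouse/t_user_log/dt=2016-01-02/country=US/UserLog.txt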

DYNAMIC PARTITIONING: Let us now see the load script for dynamic partitioning. We will create a new table T_USER_LOG_DYN for the dynamic partitions and, as mentioned earlier, we will load this table using another table, so let's also create T_USER_LOG_SRC. Below is the data of the source table:

[Click Here for complete source code]

Let us check the Hive script for the table DDL.
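Again, the actual DDL is in the screenshot, so the sketch below only assumes plausible column names. The structure is what the article describes: the source table T_USER_LOG_SRC carries DT and COUNTRY as ordinary columns, while T_USER_LOG_DYN declares them as partition columns:

CREATE TABLE T_USER_LOG_SRC (
  USER_ID  STRING,
  URL      STRING,
  HIT_TIME STRING,
  DT       STRING,
  COUNTRY  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE T_USER_LOG_DYN (
  USER_ID  STRING,
  URL      STRING,
  HIT_TIME STRING
)
PARTITIONED BY (DT STRING, COUNTRY STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;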

We will first see the negative scenario and then the final load scripts. We have to set two Hive properties, as below:

1. SET hive.exec.dynamic.partition=true;

2. SET hive.exec.dynamic.partition.mode=nonstrict;

Sorry for the small screenshot, but it shows the error you get if you try to load the table and achieve partitioning dynamically without setting the above two properties. While loading the table, partitions will be created dynamically on all partition columns if hive.exec.dynamic.partition.mode=nonstrict is set. If the mode is strict [which is the default], at least one partition column value must be specified statically in the load script. Now let us see the final load script below:

LOAD SCRIPT: Below is the screenshot for loading the table T_USER_LOG_DYN using the data from T_USER_LOG_SRC and creating dynamic partitions. LoadDynPartition.hql is my script to load the table dynamically, as shown below:
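LoadDynPartition.hql is shown only as a screenshot, so here is a minimal sketch of what such a script contains, using the illustrative column names assumed above. Note that the partition columns (DT, COUNTRY) must come last in the SELECT list:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE T_USER_LOG_DYN PARTITION (DT, COUNTRY)
SELECT USER_ID, URL, HIT_TIME, DT, COUNTRY
FROM T_USER_LOG_SRC;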

hive -f LoadDynPartition.hql is used to execute the hive script.

OUTPUT: Once the script is run, the file will be created as below:

As you can see, here we do not need to load the table multiple times to create multiple partitions, as was the case in static partitioning. All partitions in T_USER_LOG_DYN are created dynamically at run time, based on the number of distinct DT and COUNTRY values in the T_USER_LOG_SRC table. Also, as mentioned earlier, since the data is loaded from another table and a map/reduce job is executed by the framework, the output file is named 000000_0 rather than keeping its original name.

Advantage of Partitioning: Partitioning has real benefits when used in HIVE. It helps organize the data in a logical fashion, and when we query a partitioned table using the partition columns, Hive can skip all but the relevant sub-directories and files. This can lead to a tremendous reduction in the data that must be read and filtered in the initial map stage, which reduces the number of mappers, the I/O operations, and the response time of the query.
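For example, a query that filters on the partition columns (illustrative values again) lets Hive read only the single dt=.../country=... directory instead of scanning the whole table:

SELECT COUNT(*)
FROM T_USER_LOG
WHERE DT = '2016-01-01'
  AND COUNTRY = 'US';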

BUCKETING in HIVE: When we write data into a bucketed table in Hive, it places the data in distinct buckets as files. Hive applies a hash function to the bucketing column and, based on the hash result modulo the number of buckets N [as mentioned in the DDL], the data is placed in a particular bucket as a file. Let's create a bucketed table T_USER_LOG_BUCKET with DT as a partition column and 4 buckets. We specify the bucketing column in the CLUSTERED BY (column_name) clause of the table DDL, as shown below in the Hive script files:
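The exact DDL is in the script screenshot; a minimal sketch, assuming USER_ID as the bucketing column (the same column used in the sampling example later) and the same illustrative regular columns, looks like this:

CREATE TABLE T_USER_LOG_BUCKET (
  USER_ID  STRING,
  URL      STRING,
  HIT_TIME STRING,
  COUNTRY  STRING
)
PARTITIONED BY (DT STRING)
CLUSTERED BY (USER_ID) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;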

LOAD SCRIPT: [Click Here for complete source code]

The screenshots below show the Hive script execution to create and load the bucketed table:
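The load itself is also only visible in the screenshot; a sketch of the kind of INSERT it runs is below. On older Hive versions, hive.enforce.bucketing must be set so that the number of reduce tasks matches the number of buckets, and the dynamic partition settings are needed because DT is populated dynamically:

SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE T_USER_LOG_BUCKET PARTITION (DT)
SELECT USER_ID, URL, HIT_TIME, COUNTRY, DT
FROM T_USER_LOG_SRC;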

As we know, the number of reduce tasks determines the number of output files; since we have defined the table with four buckets, the number of reduce tasks is 4, as highlighted in the above screenshot.

OUTPUT: Here the output is partitioned on the DT column and each partition contains the four buckets as files, so all three partitions have four files each, as shown in the screenshots below.

Advantage of Bucketing:

Sampling: When we want to test a table that holds a huge amount of data, or draw some patterns from it, or compute aggregations [where accuracy is not our top priority], we need to sample, i.e. run the query on a smaller, evenly distributed subset of the table's data. There are two types of sampling:

1. Bucket Sampling: e.g. SELECT * FROM T_USER_LOG_BUCKET TABLESAMPLE (BUCKET 1 OUT OF 4 ON USER_ID); It selects the data from the first bucket of each partition of the T_USER_LOG_BUCKET table; a normal SELECT cannot pick data in such an evenly distributed manner.

2. Block Sampling: e.g. SELECT * FROM T_USER_LOG_BUCKET TABLESAMPLE (20 PERCENT);

Optimize Join Operations: We can optimize joins by bucketing both tables on the join column, so that rows with the same column values land in corresponding buckets. During a bucket-to-bucket join, Hive can then minimize the processing steps and reduce the data it needs to scan and compare for the join.
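As a sketch, if we had a second table (say a hypothetical T_USER_ATTR_BUCKET, with a hypothetical ATTRIBUTE column) bucketed on the same USER_ID column into a compatible number of buckets, Hive can be asked to use a bucket map join so that each bucket of one table is joined only against the matching bucket of the other:

SET hive.optimize.bucketmapjoin=true;

SELECT a.USER_ID, a.URL, b.ATTRIBUTE
FROM T_USER_LOG_BUCKET a
JOIN T_USER_ATTR_BUCKET b
  ON a.USER_ID = b.USER_ID;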

Please refer to the internet and books for more details on sampling.

 [Click Here for complete source code]

Happy Learning.......... Gaurav Singh

Dinesh Yamali

Telecom Mediation (EMM | Nokia | CompTel)| Data Analytics | Continuous Learner

5y

Good article. Can you clarify how files are distributed to buckets (assume 2 buckets) if the column has a date data type?

Carnot Antonio Romero

Product Management - Data Management, Data Access Governance

5y

Definitely useful. I've been trying to sort out for myself how these two things relate to each other. I'm working on a system that uses Hive (and Presto) to manage big data but that doesn't use standard Hadoop/Hive partitioning or bucketing... what we have is more like hash-based bucketing, but we call it partitioning, and this is confusing our users. PS: probably the first time I've seen screenshots that were really pictures of a screen :)

DEEPESH NEMA

Big Data | GCP | HADOOP | SPARK Consultant

6y

Hi Gaurav.. Thanks for this article!! I have a query regarding data retrieval with high performance: will bucketing give faster performance than Impala query execution?

Kedar Divekar

Senior Specialist at LTI - Larsen & Toubro Infotech

6y

What if we do not use the OVERWRITE option? Will there be 4 files or 8?

SURESH J

BIG DATA ARCHITECT ★ CLOUD & APPLICATION DESIGNER ★ IT QA AUTOMATION ★ SOFTWARE ENGINEER & DEVELOPER

6y

Great job Gaurav.. I have one quick question: how many files will there be if I insert the same data into this table again? Will it create another 4 files, or will the data be distributed across the 4 existing files?

要查看或添加评论,请登录

Gaurav Kumar Singh的更多文章

社区洞察

其他会员也浏览了