HIVE - Partitioning and Bucketing with examples


Partitioning:

Partitioning is used to slice the data horizontally, over the entire range or over a smaller range of values, using one or more columns. The partition concept is well known in RDBMS as well, so if you know databases you should already be aware of it; if not, that is not a problem either, because you will be after reading this article. For example, if you have a hundred student records for a university and you want to divide them into male and female students, that is nothing but partitioning, with GENDER as the column used to split the records. Again, as mentioned, more than one column can be used to split the records.

In a real-world scenario, if you want to analyze log files of user activity on the internet, it helps to keep the data organized by date and geography. There are two types of partitioning in HIVE:

1. Static Partitioning
2. Dynamic Partitioning

The table DDL statement is the same for both types of partitioning, as shown below:

[Click Here for complete source code]

I have created a table T_USER_LOG with DT and COUNTRY as my partitioning columns. I have used Hive script mode, where HivePartition.hql is my script file; you can use the Hive shell as well, or whichever way is feasible for you. As highlighted in the image, the partition columns appear in the table schema like normal table columns.
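Since the DDL itself is only visible in the screenshot, here is a minimal sketch of what it looks like. The regular columns (USER_ID, URL, HIT_TIME) are illustrative assumptions; the key point is that DT and COUNTRY appear only in the PARTITIONED BY clause, not in the column list:

CREATE TABLE T_USER_LOG (
  USER_ID  STRING,
  URL      STRING,
  HIT_TIME STRING
)
PARTITIONED BY (DT STRING, COUNTRY STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;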

Static Partitioning: In static partitioning, we know the partition column values beforehand. So far so good; it is when we load the data that the difference appears.

LOAD DATA LOCAL INPATH [path_name] OVERWRITE INTO TABLE [table_name] PARTITION (partition_column='value', ...). Here we have to give the partition column values explicitly whenever we want to create a new partition, as shown below:

LOAD SCRIPTS.
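The exact paths and partition values are only visible in the screenshot, so the statements below are a sketch with illustrative values; the pattern of one LOAD per partition is what matters:

LOAD DATA LOCAL INPATH '/home/hduser/UserLog.txt' OVERWRITE INTO TABLE T_USER_LOG PARTITION (DT='2016-01-01', COUNTRY='US');
LOAD DATA LOCAL INPATH '/home/hduser/UserLog.txt' OVERWRITE INTO TABLE T_USER_LOG PARTITION (DT='2016-01-01', COUNTRY='IN');
LOAD DATA LOCAL INPATH '/home/hduser/UserLog.txt' OVERWRITE INTO TABLE T_USER_LOG PARTITION (DT='2016-01-02', COUNTRY='US');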

As you can see, we have mentioned the partition column value in each load.

OUTPUT: Since our table T_USER_LOG is a managed table, the data is loaded under the Hive warehouse path, i.e. /user/hive/warehouse/t_user_log.

Here you can check all the other partitions as well; each will have the file UserLog.txt. There are two levels of partitioning in our example, first on DT and then on COUNTRY, and the final data is stored inside. All partitions in Hive exist as directories. Loading in Hive is an instantaneous process and it does not trigger a Map/Reduce job; that is why our file is stored as UserLog.txt instead of a 000000_0 file. Please follow the article, as I will show in dynamic partitioning how we LOAD a table using another table, where a Map/Reduce job is triggered.
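Assuming the illustrative partition values used in the load sketch above, the warehouse directory layout would look roughly like this, with one sub-directory per partition value and the loaded file kept as-is inside the leaf directory:

/user/hive/warehouse/t_user_log/dt=2016-01-01/country=US/UserLog.txt
/user/hive/warehouse/t_user_log/dt=2016-01-01/country=IN/UserLog.txt
/user/hive/warehouse/t_user_log/dt=2016-01-02/country=US/UserLog.txt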

DYNAMIC PARTITIONING: Let us now see the load script for dynamic partitioning. We will create a new table T_USER_LOG_DYN for the dynamic partitions and, as mentioned earlier, we will load this table using another table, so let's also create T_USER_LOG_SRC. Below is the data of the source table:

[Click Here for complete source code]

Let us check the Hive script for the table DDL.
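Again, the actual DDL is in the screenshot, so the sketch below only assumes plausible column names. The structure is what the article describes: the source table T_USER_LOG_SRC carries DT and COUNTRY as ordinary columns, while T_USER_LOG_DYN declares them as partition columns:

CREATE TABLE T_USER_LOG_SRC (
  USER_ID  STRING,
  URL      STRING,
  HIT_TIME STRING,
  DT       STRING,
  COUNTRY  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE T_USER_LOG_DYN (
  USER_ID  STRING,
  URL      STRING,
  HIT_TIME STRING
)
PARTITIONED BY (DT STRING, COUNTRY STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;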

We will first see the negative scenario and then the final load scripts. We have to set two Hive properties, as below:

1. SET hive.exec.dynamic.partition=true;

2. SET hive.exec.dynamic.partition.mode=nonstrict;

Sorry for the small screenshot, but it shows the error you get if you try to load the table and achieve partitioning dynamically without setting the above two properties. While loading the table, partitions will be created dynamically on all partition columns if hive.exec.dynamic.partition.mode=nonstrict is set. If the mode is strict [which is the default], at least one partition column value must be specified statically in the load script. Now let us see the final load script below:

LOAD SCRIPT: Below is the screenshot for loading the table T_USER_LOG_DYN using the data from T_USER_LOG_SRC and creating dynamic partitions. LoadDynPartition.hql is my script to load the table dynamically, as shown below:
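LoadDynPartition.hql is shown only as a screenshot, so here is a minimal sketch of what such a script contains, using the illustrative column names assumed above. Note that the partition columns (DT, COUNTRY) must come last in the SELECT list:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE T_USER_LOG_DYN PARTITION (DT, COUNTRY)
SELECT USER_ID, URL, HIT_TIME, DT, COUNTRY
FROM T_USER_LOG_SRC;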

hive -f LoadDynPartition.hql is used to execute the hive script.

OUTPUT: Once the script is run, the file will be created as below:

As you can see, here we do not need to load the table multiple times to create multiple partitions, as was the case in static partitioning. All partitions in T_USER_LOG_DYN are created dynamically at run time, based on the number of distinct DT and COUNTRY values in the T_USER_LOG_SRC table. Also, as mentioned earlier, since the data is loaded from another table and a map/reduce job is executed by the framework, the output file is named 000000_0 rather than keeping its original name.

Advantage of Partitioning: Partitioning has real benefits when used in HIVE. It helps organize the data in a logical fashion, and when we query a partitioned table using the partition columns, Hive can skip all but the relevant sub-directories and files. This can lead to a tremendous reduction in the data that must be read and filtered in the initial map stage, which reduces the number of mappers, the I/O operations, and the response time of the query.
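For example, a query that filters on the partition columns (illustrative values again) lets Hive read only the single dt=.../country=... directory instead of scanning the whole table:

SELECT COUNT(*)
FROM T_USER_LOG
WHERE DT = '2016-01-01'
  AND COUNTRY = 'US';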

BUCKETING in HIVE: When we write data into a bucketed table in Hive, it places the data in distinct buckets as files. Hive applies a hash function to the bucketing column and, based on the hash result modulo the number of buckets N [as mentioned in the DDL], the data is placed in a particular bucket as a file. Let's create a bucketed table T_USER_LOG_BUCKET with DT as a partition column and 4 buckets. We specify the bucketing column in the CLUSTERED BY (column_name) clause of the table DDL, as shown below in the Hive script files:
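The exact DDL is in the script screenshot; a minimal sketch, assuming USER_ID as the bucketing column (the same column used in the sampling example later) and the same illustrative regular columns, looks like this:

CREATE TABLE T_USER_LOG_BUCKET (
  USER_ID  STRING,
  URL      STRING,
  HIT_TIME STRING,
  COUNTRY  STRING
)
PARTITIONED BY (DT STRING)
CLUSTERED BY (USER_ID) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;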

LOAD SCRIPT: [Click Here for complete source code]

The screenshots below show the Hive script execution to create and load the bucketed table:
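The load itself is also only visible in the screenshot; a sketch of the kind of INSERT it runs is below. On older Hive versions, hive.enforce.bucketing must be set so that the number of reduce tasks matches the number of buckets, and the dynamic partition settings are needed because DT is populated dynamically:

SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE T_USER_LOG_BUCKET PARTITION (DT)
SELECT USER_ID, URL, HIT_TIME, COUNTRY, DT
FROM T_USER_LOG_SRC;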

As we know, the number of reduce tasks determines the number of output files; since we have defined the table with four buckets, the number of reduce tasks is 4, as highlighted in the above screenshot.

OUTPUT: Here the output is partitioned on the DT column and each partition contains the four buckets as files, so all three partitions have four files each, as shown in the screenshots below.

Advantage of Bucketing:

Sampling: When we want to test a table that holds a huge amount of data, or draw some patterns from it, or compute aggregations [where accuracy is not our top priority], we need to sample, i.e. run the query on a smaller, evenly distributed subset of the table's data. There are two types of sampling:

1. Bucket Sampling: e.g. SELECT * FROM T_USER_LOG_BUCKET TABLESAMPLE (BUCKET 1 OUT OF 4 ON USER_ID); It selects the data from the first bucket of each partition of the T_USER_LOG_BUCKET table; a normal SELECT cannot pick data in such an evenly distributed manner.

2. Block Sampling: e.g. SELECT * FROM T_USER_LOG_BUCKET TABLESAMPLE (20 PERCENT);

Optimize Join Operations: We can optimize joins by bucketing both tables on the join column, so that rows with the same column values land in corresponding buckets. During a bucket-to-bucket join, Hive can then minimize the processing steps and reduce the data it needs to scan and compare for the join.
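As a sketch, if we had a second table (say a hypothetical T_USER_ATTR_BUCKET, with a hypothetical ATTRIBUTE column) bucketed on the same USER_ID column into a compatible number of buckets, Hive can be asked to use a bucket map join so that each bucket of one table is joined only against the matching bucket of the other:

SET hive.optimize.bucketmapjoin=true;

SELECT a.USER_ID, a.URL, b.ATTRIBUTE
FROM T_USER_LOG_BUCKET a
JOIN T_USER_ATTR_BUCKET b
  ON a.USER_ID = b.USER_ID;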

Please refer to the internet and books for more details on sampling.

 [Click Here for complete source code]

Happy Learning.......... Gaurav Singh

Dinesh Yamali

Telecom Mediation (EMM | Nokia | CompTel)| Data Analytics | Continuous Learner

5y

Good article. Can you clarify how files are distributed to buckets (assume 2 buckets) if the column has a date data type?

Carnot Antonio Romero

Product Management - Data Management, Data Access Governance

5y

Definitely useful. I've been trying to sort out for myself how these two things relate to each other. I'm working on a system that uses Hive (and Presto) to manage big data but that doesn't use standard Hadoop/Hive partitioning or bucketing... what we have is more like hash-based bucketing, but we call it partitioning, and this is confusing our users. PS: probably the first time I've seen screenshots that were really pictures of a screen :)

DEEPESH NEMA

Big Data | GCP | HADOOP | SPARK Consultant

6y

Hi Gaurav.. Thanks for this article!! I have a query regarding data retrieval with high performance: will bucketing give faster performance than Impala query execution?

Kedar Divekar

Senior Specialist at LTI - Larsen & Toubro Infotech

6y

What if we do not use the OVERWRITE option? Will there be 4 files or 8?

SURESH J

BIG DATA ARCHITECT ★ CLOUD & APPLICATION DESIGNER ★ IT QA AUTOMATION ★ SOFTWARE ENGINEER & DEVELOPER

6y

Great job Gaurav.. I have one quick question: how many files will there be if I insert the same data into this table again? Will it create another 4 files, or will the data be distributed across the 4 existing files?

要查看或添加评论,请登录

Gaurav Kumar Singh的更多文章

社区洞察

其他会员也浏览了