登录查看更多内容

Strict Mode HIVE

Prateek K.

Data Science

发布日期: 2017年3月24日

We have the partition in Hive to save lots of our time whereas performing analysis on the area only we want and not the entire dataset. Although it's true, partitioned off tables usually hold terribly massive data sets that may be growing quickly. And Hive needs to search the entire dataset before giving a result. AN unrestricted partition would possibly consume intolerably large resources over such an oversized table.

But what if the instruction itself is restricted to search partial dataset then provide a result.

This will increase the execution time tremendously. Users are prevented from queries which will scan all partitions.

Yes, this could be done by turning STRICT mode up.

Strict mode is a setting in Hive that forestalls users from issuing queries that would have

unintended and undesirable effects.

In different words, if you'd prefer to run a query across all partitions(full table scan) you'll have to set the mode to 'nonstrict'. Use this property with care, as a result, it triggers huge MapReduce jobs.

Setting the property hive.mapred.mode to strict disables 3 varieties of queries.

queries on partitioned tables are not permitted unless they include a partition filter in the WHERE clause
It restricts ORDER BY operation without a LIMIT clause ( since it uses a single reducer which can choke your processing if not handled properly.
The third type of query prevented is a Cartesian product

We will see these 3 restriction benefit with handson ahead in this blog.

We can do the setting in 2 ways that (a)Hive CLI(set hive.mapred.mode=strict) OR (b)Property file(Hive-default.xml)

To understand strict mode we need a partitioned table.Let’s create one now.

Commands to create partition table(strict) and inserting some data

create table par_user (first_name varchar(60),lastname varchar(60),roll_no varchar(60)) partitioned by(country varchar(60),state varchar(60)) row format delimited fields terminated by ',' stored as textfile;

load data local inpath '/home/acadgild/usr_info' into table par_user partition (country = 'us',state = 'fl');

load data local inpath '/home/acadgild/usr_info1' into table par_user partition (country = 'ca',state = 'au');

Readers can download data working in above code HERE.

Before the property is set to strict, Below in screenshot we see the data present in my table par_user.

We find our query execution without any interruption, Just imagine the time to execute the same on BigData(which is huge).

FIRST

Queries on partitioned tables are not permitted unless they include a partition

filter in the WHERE clause, limiting their scope.,

If your partitioned table is very large, you could block any full table scan queries by putting Hive into strict mode using the set hive.mapred.mode=strict command. In this mode, when users submit a query that would result in a full table scan (i.e. queries without any partitioned columns) an error is issued.

Commands

hive>set hive.mapred.mode=strict;

hive>select * from par_user where roll_no=4;

error

The following enhancement adds a partition filter—the table partitions—to the

WHERE clause:

hive> select * from par_user where roll_no=4 and state='au';

normal result

The second type of restricted query are those with ORDER BY clauses, but no LIMIT clause.

Because ORDER BY sends all results to a single reducer to perform the ordering, forcing the user to specify a LIMIT clause prevents the reducer from executing for an extended period of time:

hive>select * from par_user where state='au' order by roll_no;

error

To issue this query, add a LIMIT clause:

hive>select * from par_user where state='au' order by roll_no limit 3;

normal result

The third and final type of query prevented is a Cartesian product.

To perform this JOIN operation I have replicated the data in par_user to another table named par_user2

hive>select * from par_user JOIN par_user2;

error

Here is a properly constructed query with JOIN and ON clauses:

hive>set hive.mapred.mode=nonstrict;

hive>select * from par_user JOIN par_user2;

Hope you got the concept and when to use the strict option in hive to reduce execution time. This lead to optimization of hive query.

Srinivasarao Vasa

Senior Software Engineer at United Health Group Information Services India LTD.

7 年

Also I need some more clarification on Bucketing concept. As per my understand if the bucket size is enough to persist in memory will increase the query performance when we join the two tables which are bucketed on the same key column or column is that true??

Srinivasarao Vasa

Senior Software Engineer at United Health Group Information Services India LTD.

7 年

Excellent prateek. Thanks for your effort to clarify the concept.

查看更多评论

要查看或添加评论，请登录

Prateek K.的更多文章

Why Becoming a Data Scientist is the Next Logical Move?

2018年1月17日

Why Becoming a Data Scientist is the Next Logical Move?

Amongst the variety of job profiles, Data Scientist might not be the first job to come to your mind. But big companies…
Frequently Asked Hadoop Interview Questions in 2017 Part - 2

2017年9月20日

Frequently Asked Hadoop Interview Questions in 2017 Part - 2

Before going through this Hadoop interview questions part-2, we recommend our users to go through our previous post on…

2 条评论
Hive UseCase: Breast Cancer Data Analysis

2017年9月18日

Hive UseCase: Breast Cancer Data Analysis

This blog will help readers to learn data analytics using HIVE. The very common problem present in whole world coming…
Data Serialization with Avro in Hive

2017年8月16日

Data Serialization with Avro in Hive

This blog focuses on providing in-depth information of Avro in Hive. Here we have discussed the importance and…
File Formats in Apache HIVE

2017年7月31日

File Formats in Apache HIVE

This Blog aims at discussing the different file formats available in Apache Hive. After reading this Blog you will get…
Writable and WritableComparable in Hadoop

2017年7月28日

Writable and WritableComparable in Hadoop

This blog helps those people who want to build their own custom types in Hadoop which is possible only with Writable…

1 条评论
Frequently Asked Hadoop Interview Questions in 2017 – Part 1

2017年7月19日

Frequently Asked Hadoop Interview Questions in 2017 – Part 1

In this first Part of Hadoop interview Questions, we would be discussing various questions related to Big Data Hadoop…

1 条评论
Static vs dynamic partition in hive

2017年4月14日

Static vs dynamic partition in hive

In our Last post, we mentioned regarding Introduction to Hive Partition therein post we clearly mentioned the most…

1 条评论
Solving the Unstructured Data Dilemma

2017年3月12日

Solving the Unstructured Data Dilemma

A web browser generates tons of data every day. All of which may not be stored in a structured format.

See all articles

Strict Mode HIVE

Prateek K.

Data Science

Prateek K.的更多文章

社区洞察

其他会员也浏览了

The Mystery of NULL Values: Why They Matter and How to Tackle Them

Branches & Tags: Comparing Iceberg, Hudi, and Delta Lake Tables

Improve your Data Science workflow with Optimus

Data Wrangling of Electoral Data

A Beginner's Guide to Subqueries and CTEs

Clustered vs Non-Clustered Indexes

Linear Trees in InnoDB

A way to avoid the "void data type" in PySpark and Delta

A week in Kusto and SQL

Prateek K.的更多文章

Why Becoming a Data Scientist is the Next Logical Move?

Frequently Asked Hadoop Interview Questions in 2017 Part - 2

Hive UseCase: Breast Cancer Data Analysis

Data Serialization with Avro in Hive

File Formats in Apache HIVE

Writable and WritableComparable in Hadoop

Frequently Asked Hadoop Interview Questions in 2017 – Part 1

Static vs dynamic partition in hive

Solving the Unstructured Data Dilemma

社区洞察

其他会员也浏览了

The Mystery of NULL Values: Why They Matter and How to Tackle Them

Branches & Tags: Comparing Iceberg, Hudi, and Delta Lake Tables

Improve your Data Science workflow with Optimus

Data Wrangling of Electoral Data

A Beginner's Guide to Subqueries and CTEs

Clustered vs Non-Clustered Indexes

Linear Trees in InnoDB

A way to avoid the "void data type" in PySpark and Delta

A week in Kusto and SQL