Quick Card On - Apache Hive Joins !
Kumar Chinnakali
Reimagining contact center as a hands-on architect bridging users, clients, developers, and business executives in their context.
Thank you for your valuable time; it is much appreciated. This time I would like to share a blog called “Quick Card On - Apache Hive Joins !” – a handy Apache Hive joins reference card, or cheat sheet.
An SQL JOIN clause combines rows from two or more tables based on a common field between them. Along the same lines, Hive Query Language (HQL or HiveQL) has joins, and they are a key factor in the optimization and performance of Hive queries. Choosing the right join for the data and the business need is a key principle for improving Hive query performance.
Let’s try to understand how a join works during Hive execution. In general, a join operation is compiled to a MapReduce task that involves a map stage and a reduce stage. A mapper reads from the join tables and emits join key/join value pairs into an intermediate file. Hadoop sorts and merges these pairs in what’s called the shuffle stage. The reducer takes the sorted results as input and does the actual join work. The shuffle stage is very expensive, so avoiding it improves task performance. In short, a join is a clause that combines the records of two tables (or data sets).
Shuffle Join / Common Join:
When:
- It’s a default choice and it always works
How:
- Reads from part of one of the tables
- Buckets and sorts on Join key
- Sends one bucket to each reducer
- Join is done on the Reduce side
Pointers:
- Works every time, regardless of data size
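As a minimal sketch, a plain equi-join with no hints and auto-conversion disabled runs as a shuffle (common) join: mappers emit (join key, row) pairs, the shuffle sorts by key, and reducers do the actual joining. The `orders` and `customers` tables here are hypothetical examples, not from the original post:

```sql
-- Plain equi-join; by default this compiles to a shuffle join.
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```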
Map Join:
When:
- One table is small enough to fit in memory
- To save the shuffle & reduce stages
- To do the join work only in the map stage
- It’s suitable for small tables to optimize the task
How:
- Reads the small table into an in-memory hash table
- Streams through parts of the big table
- Joins each record from the big table against the hash table
- It is similar to a common join, but all the work is performed by the mapper alone
Pointers:
- Very fast, but it’s limited
- Using the distributed cache solves the scaling limitations
- Most of the improvement comes from removing the JDBM component
- No need to use a persistent hash table for a map join
- Can be requested with the MAPJOIN query hint
- RIGHT OUTER JOIN / FULL OUTER JOIN are not possible
- It has no reduce task and can handle only one key at a time
- set hive.auto.convert.join; if true, Hive automatically converts joins to map joins at run time
- set hive.auto.convert.join.noconditionaltask; if true, there is no longer a need to provide the map-join hint
- set hive.auto.convert.join.noconditionaltask.size; controls how large a table may be to fit in memory
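The settings and the hint above can be sketched as follows, again on hypothetical `orders` and `customers` tables. The explicit `MAPJOIN` hint is the legacy form; with auto-conversion enabled, Hive picks the map join on its own when the small table fits under the configured size:

```sql
-- Let Hive convert qualifying joins to map joins automatically.
SET hive.auto.convert.join = true;
SET hive.auto.convert.join.noconditionaltask = true;
SET hive.auto.convert.join.noconditionaltask.size = 10000000;  -- bytes

-- Legacy explicit hint: force `customers` into an in-memory
-- hash table on every mapper; no reduce stage is needed.
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```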
Left Semi Join:
When:
- To get the functionality of IN/EXISTS subquery semantics
How:
- It is an optimized form of the inner join for existence checks
- Once a match is found, it stops scanning the remaining records of the right table
Pointers:
- The right-hand-side table may only be referenced in the ON clause, not in the WHERE or SELECT clauses
- Right semi-joins are not supported in Hive
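A short sketch of the IN-subquery rewrite, using the same hypothetical tables — note that only columns from the left table appear in the SELECT:

```sql
-- Equivalent to:
--   WHERE o.customer_id IN (SELECT customer_id FROM customers)
-- The right table may be referenced only in the ON clause.
SELECT o.order_id, o.amount
FROM orders o
LEFT SEMI JOIN customers c
  ON o.customer_id = c.customer_id;
```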
Bucket Map Join:
When:
- Total table/partition size is big, so a plain map join is not a good fit
- Tables need not be sorted the same way
- Tables are bucketed the same way
- Joining on the bucketing column(s), possibly multiple columns
How:
- Works together with the map join; all join tables are bucketized
- The big table’s bucket count must be a multiple of each small table’s bucket count
- Bucket columns == join columns
- Only the matching buckets of the small tables are replicated onto each mapper
Pointers:
- set hive.optimize.bucketmapjoin; if true, the bucket map join is activated
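A sketch of the required table layout and setting, with hypothetical bucket counts (32 and 8, so the big table’s count is a multiple of the small table’s):

```sql
-- Both tables bucketed on the join key, with compatible bucket counts.
CREATE TABLE orders    (order_id INT, customer_id INT, amount DOUBLE)
  CLUSTERED BY (customer_id) INTO 32 BUCKETS;
CREATE TABLE customers (customer_id INT, customer_name STRING)
  CLUSTERED BY (customer_id) INTO 8 BUCKETS;

SET hive.optimize.bucketmapjoin = true;

-- Each mapper loads only the matching bucket(s) of the small table.
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```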
Sort Merge Bucket (SMB) Join:
When:
- Sorted the same
- Bucketed the same
- Joining on the same columns used for sorting/bucketing
How:
- Reads a bucket from each table
- Process the row with the lowest value
Pointers:
- Very efficient if applicable
- Both map & reduce tasks are used
- set hive.input.format; if it is org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat, the SMB join is activated
- set hive.auto.convert.sortmerge.join=true
- set hive.optimize.bucketmapjoin = true
- set hive.optimize.bucketmapjoin.sortedmerge = true
- set hive.auto.convert.sortmerge.join.noconditionaltask=true
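The flags listed above can be grouped into one session-level sketch:

```sql
-- Enable SMB join conversion; the tables must be bucketed AND
-- sorted on the join key with compatible bucket counts.
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
SET hive.auto.convert.sortmerge.join = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.auto.convert.sortmerge.join.noconditionaltask = true;
```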
Sort Merge Bucket Map(SMB Map) Join:
When:
- Sorted the same
- Bucketed the same
- Joining on the same columns used for sorting/bucketing
- No limit on file/partition/table size
How:
- A partitioned table might slow things down, since each single key in the map needs a small chunk
- Works together with the bucket map join
- Bucket columns == join columns == sort columns
- Small tables are read on demand
- Does NOT hold entire small tables in memory
- Can perform outer join
Pointers:
- set hive.auto.convert.sortmerge.join=true
- set hive.optimize.bucketmapjoin = true
- set hive.optimize.bucketmapjoin.sortedmerge = true
- set hive.auto.convert.sortmerge.join.noconditionaltask=true
- set hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ
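The table layout this join requires — bucketed and sorted identically on the join key — can be sketched with hypothetical DDL:

```sql
-- Both tables clustered and sorted on the join key with the same
-- bucket count, so matching buckets can be merge-joined map-side.
CREATE TABLE orders    (order_id INT, customer_id INT, amount DOUBLE)
  CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 16 BUCKETS;
CREATE TABLE customers (customer_id INT, customer_name STRING)
  CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 16 BUCKETS;
```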
Skew Join:
When:
- Need to join two very large data tables
- The join is bottlenecked on the reducer that gets the skewed key
How:
- If a small number of skewed keys make up a significant percentage of the data, they will not become bottlenecks
Pointers:
- Because of the partial results, the results also have to be read and written twice
- The user needs to be aware of the skew in the data and perform the above process manually
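Hive can also handle the splitting at run time via its skew-join settings; as a sketch (the threshold value is an illustrative choice):

```sql
-- Process skewed keys in a separate map-join pass instead of
-- sending them all to one overloaded reducer.
SET hive.optimize.skewjoin = true;
-- Keys with more rows than this are treated as skewed.
SET hive.skewjoin.key = 100000;
```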
Cartesian Product Join:
When:
- To generate every combination of records from all the joined tables
How:
- Not optimized; it runs as plain MapReduce
- Computes the full Cartesian product before the WHERE clause is applied
Pointers:
- set hive.mapred.mode=strict; helps prevent submitting an unintended Cartesian product query
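A sketch of the guard and of an explicit Cartesian product, on the same hypothetical tables:

```sql
-- In strict mode, a join with no ON condition is rejected,
-- so an accidental Cartesian product cannot be submitted.
SET hive.mapred.mode = strict;

-- Deliberate Cartesian product: every order paired with every customer.
SELECT o.order_id, c.customer_name
FROM orders o
CROSS JOIN customers c;
```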
Happy and looking forward to your comments and suggestions.
Happy in sharing the learnings !
A lot of information here could be updated. Just to name a few: Map joins don't require hints these days (and haven't since Hive 0.11). The single key restriction on map joins is not true today (if it was ever true). Increasing your container size lets you do larger map-side joins. noconditionaltask.size is ignored / set automatically with Hive 1.2+. With the introduction of the Grace Hash Join (HIVE-9277) hash tables can go beyond the size of memory without queries failing. Hive on Tez should be used in all cases. It is much faster than Hive on MapReduce. Use the CBO and stats, it will re-order joins and can convert shuffle joins into map joins and run much faster.