Quick Card On - Apache Hive Joins !
Kumar Chinnakali
Reimagining contact center as a hands-on architect bridging users, clients, developers, and business executives in their context.
Thank you for your valuable time; it is much appreciated. This time I would like to share a blog called “Quick Card On - Apache Hive Joins !” – a handy Apache Hive joins reference card, or cheat sheet.
An SQL JOIN clause combines rows from two or more tables based on a common field between them. Along the same lines, Hive Query Language (HQL or HiveQL) has joins, and they are a key factor in the optimization and performance of Hive queries. Choosing the right join for the data and the business need is a key principle for improving Hive query performance.
Let’s try to understand how a join works during Hive execution. In general, a join operation is compiled to a MapReduce task that involves a map stage and a reduce stage. A mapper reads from the join tables and emits join key/join value pairs into an intermediate file. Hadoop sorts and merges these pairs in what’s called the shuffle stage. The reducer takes the sorted results as input and does the actual join work. The shuffle stage is very expensive, so avoiding it improves task performance. In short, a join is a clause that combines the records of two tables (or data sets).
Shuffle Join / Common Join:
When:
- It’s a default choice and it always works
How:
- Reads from part of one of the tables
- Buckets and sorts on Join key
- Sends one bucket to each reducer
- Join is done on the Reduce side
Pointers:
- Works every time, regardless of data size
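As a minimal sketch, a plain equi-join with no hints and auto-conversion disabled runs as a shuffle (common) join: mappers emit (join key, row) pairs, the shuffle sorts by key, and reducers do the actual joining. The `orders` and `customers` tables here are hypothetical examples, not from the original post:

```sql
-- Plain equi-join; by default this compiles to a shuffle join.
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```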
Map Join:
When:
- One table is small enough to fit in memory
- To save the shuffle & reduce stages
- To do the join work only in the map stage
- It’s suitable for small tables to optimize the task
How:
- Reads the small table into an in-memory hash table
- Streams through parts of the big table
- Joins each record from the big table against the hash table
- It is similar to a common join, but all the work is performed by the mapper alone
Pointers:
- Very fast, but it’s limited
- Using the distributed cache solves the scaling limitations
- Most of the improvement comes from removing the JDBM component
- No need to use a persistent hash table for a map join
- Can be requested with the MAPJOIN query hint
- RIGHT OUTER JOIN / FULL OUTER JOIN are not possible
- It has no reduce task and can handle only one key at a time
- set hive.auto.convert.join; if true, Hive automatically converts joins to map joins at run time
- set hive.auto.convert.join.noconditionaltask; if true, there is no longer a need to provide the map-join hint
- set hive.auto.convert.join.noconditionaltask.size; controls how large a table may be to fit in memory
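The settings and the hint above can be sketched as follows, again on hypothetical `orders` and `customers` tables. The explicit `MAPJOIN` hint is the legacy form; with auto-conversion enabled, Hive picks the map join on its own when the small table fits under the configured size:

```sql
-- Let Hive convert qualifying joins to map joins automatically.
SET hive.auto.convert.join = true;
SET hive.auto.convert.join.noconditionaltask = true;
SET hive.auto.convert.join.noconditionaltask.size = 10000000;  -- bytes

-- Legacy explicit hint: force `customers` into an in-memory
-- hash table on every mapper; no reduce stage is needed.
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```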
Left Semi Join:
When:
- To get the functionality of IN/EXISTS subquery semantics
How:
- It is an optimized form of the inner join for existence checks
- Once a match is found, it stops scanning the remaining records of the right table
Pointers:
- The right-hand-side table may only be referenced in the ON clause, not in the WHERE or SELECT clauses
- Right semi-joins are not supported in Hive
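A short sketch of the IN-subquery rewrite, using the same hypothetical tables — note that only columns from the left table appear in the SELECT:

```sql
-- Equivalent to:
--   WHERE o.customer_id IN (SELECT customer_id FROM customers)
-- The right table may be referenced only in the ON clause.
SELECT o.order_id, o.amount
FROM orders o
LEFT SEMI JOIN customers c
  ON o.customer_id = c.customer_id;
```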
Bucket Map Join:
When:
- Total table/partition size is big, so a plain map join is not a good fit
- Tables need not be sorted the same way
- Tables are bucketed the same way
- Joining on the bucketing column(s), possibly multiple columns
How:
- Works together with the map join; all join tables are bucketized
- The big table’s bucket count must be a multiple of each small table’s bucket count
- Bucket columns == join columns
- Only the matching buckets of the small tables are replicated onto each mapper
Pointers:
- set hive.optimize.bucketmapjoin; if true, the bucket map join is activated
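A sketch of the required table layout and setting, with hypothetical bucket counts (32 and 8, so the big table’s count is a multiple of the small table’s):

```sql
-- Both tables bucketed on the join key, with compatible bucket counts.
CREATE TABLE orders    (order_id INT, customer_id INT, amount DOUBLE)
  CLUSTERED BY (customer_id) INTO 32 BUCKETS;
CREATE TABLE customers (customer_id INT, customer_name STRING)
  CLUSTERED BY (customer_id) INTO 8 BUCKETS;

SET hive.optimize.bucketmapjoin = true;

-- Each mapper loads only the matching bucket(s) of the small table.
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```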
Sort Merge Bucket (SMB) Join:
When:
- Sorted the same
- Bucketed the same
- Joining on the same columns used for sorting/bucketing
How:
- Reads a bucket from each table
- Process the row with the lowest value
Pointers:
- Very efficient if applicable
- Both map & reduce tasks are used
- set hive.input.format; if it is org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat, the SMB join is activated
- set hive.auto.convert.sortmerge.join=true
- set hive.optimize.bucketmapjoin = true
- set hive.optimize.bucketmapjoin.sortedmerge = true
- set hive.auto.convert.sortmerge.join.noconditionaltask=true
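The flags listed above can be grouped into one session-level sketch:

```sql
-- Enable SMB join conversion; the tables must be bucketed AND
-- sorted on the join key with compatible bucket counts.
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
SET hive.auto.convert.sortmerge.join = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.auto.convert.sortmerge.join.noconditionaltask = true;
```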
Sort Merge Bucket Map(SMB Map) Join:
When:
- Sorted the same
- Bucketed the same
- Joining on the same columns used for sorting/bucketing
- No limit on file/partition/table size
How:
- A partitioned table might slow things down, since each single key in the map needs a small chunk
- Works together with the bucket map join
- Bucket columns == join columns == sort columns
- Small tables are read on demand
- Does NOT hold entire small tables in memory
- Can perform outer join
Pointers:
- set hive.auto.convert.sortmerge.join=true
- set hive.optimize.bucketmapjoin = true
- set hive.optimize.bucketmapjoin.sortedmerge = true
- set hive.auto.convert.sortmerge.join.noconditionaltask=true
- set hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ
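The table layout this join requires — bucketed and sorted identically on the join key — can be sketched with hypothetical DDL:

```sql
-- Both tables clustered and sorted on the join key with the same
-- bucket count, so matching buckets can be merge-joined map-side.
CREATE TABLE orders    (order_id INT, customer_id INT, amount DOUBLE)
  CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 16 BUCKETS;
CREATE TABLE customers (customer_id INT, customer_name STRING)
  CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 16 BUCKETS;
```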
Skew Join:
When:
- Need to join two very large data tables
- The join is bottlenecked on the reducer that gets the skewed key
How:
- If a small number of skewed keys make up a significant percentage of the data, they will not become bottlenecks
Pointers:
- Because of the partial results, the results also have to be read and written twice
- The user needs to be aware of the skew in the data and perform the above process manually
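Hive can also handle the splitting at run time via its skew-join settings; as a sketch (the threshold value is an illustrative choice):

```sql
-- Process skewed keys in a separate map-join pass instead of
-- sending them all to one overloaded reducer.
SET hive.optimize.skewjoin = true;
-- Keys with more rows than this are treated as skewed.
SET hive.skewjoin.key = 100000;
```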
Cartesian Product Join:
When:
- To generate every combination of records from all the joined tables
How:
- Not optimized; it runs as plain MapReduce
- Computes the full Cartesian product before the WHERE clause is applied
Pointers:
- set hive.mapred.mode=strict; helps prevent submitting an unintended Cartesian product query
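A sketch of the guard and of an explicit Cartesian product, on the same hypothetical tables:

```sql
-- In strict mode, a join with no ON condition is rejected,
-- so an accidental Cartesian product cannot be submitted.
SET hive.mapred.mode = strict;

-- Deliberate Cartesian product: every order paired with every customer.
SELECT o.order_id, c.customer_name
FROM orders o
CROSS JOIN customers c;
```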
Happy and looking forward to your comments and suggestions.
Happy in sharing the learnings !
A lot of information here could be updated. Just to name a few: Map joins don't require hints these days (and haven't since Hive 0.11). The single key restriction on map joins is not true today (if it was ever true). Increasing your container size lets you do larger map-side joins. noconditionaltask.size is ignored / set automatically with Hive 1.2+. With the introduction of the Grace Hash Join (HIVE-9277) hash tables can go beyond the size of memory without queries failing. Hive on Tez should be used in all cases. It is much faster than Hive on MapReduce. Use the CBO and stats, it will re-order joins and can convert shuffle joins into map joins and run much faster.