Apache Hive Performance Tuning Best Practices

While working on Apache Hive recently, I learned that performance tuning can be a real nightmare, so I am sharing a few tips:

Appropriate Schema Design

·       Design the schema only after you understand the data and how it will be retrieved. For example, partition the table on a column that is frequently used in WHERE conditions, but avoid partitioning on more than two columns. Another example is lookup tables and the tables that reference them: lookup tables are usually small while the referencing tables are much larger. Older versions of Hive resolved this by loading the smaller table's data into memory; newer versions of Hive do it for you automatically. Also consider complex data types, for example a map column (such as country and its respective code). That way the lookup data stays with the large table, avoiding the join altogether (see the sketch after this list).

·       In many cases, HDFS storage is shared across multiple departments and teams, and HDFS usually acts as a data hub for the entire organization. Creating a structured, organized repository of your data in HDFS provides many benefits, such as:

·       A standard directory layout makes it easier to share data across teams in the organization.

·       The structure helps guard against accidental deletion or corruption.

·       Use a staging area before moving data to the target directory; this helps ensure that partially loaded data is not processed.
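
To illustrate the map-type idea from the first point, here is a minimal HiveQL sketch; the orders table, its columns, and the country/code values are hypothetical:

-- Hypothetical table that keeps lookup data (country -> code) inline as a map,
-- instead of joining against a separate lookup table
CREATE TABLE orders (
  order_id BIGINT,
  amount DECIMAL(10,2),
  country_codes MAP<STRING,STRING>
)
PARTITIONED BY (order_date STRING)  -- partition on a column frequently used in WHERE
STORED AS ORC;

-- The code is read directly from the map column, avoiding the join altogether
SELECT order_id, country_codes['India'] AS country_code
FROM orders
WHERE order_date = '2020-01-01';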


Performance Considerations for Join Operations on Hive Tables

·       Use Tez as the execution engine instead of MapReduce (a very well-known tip).

·       Run a pilot: load data into test tables, measure the cost of the queries you need to execute, and apply cost-based optimization.

·       Use VECTORIZATION, which immensely helps when scanning data (it is very effective in Apache Spark as well); learn more at https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution. The settings for these three tips are sketched after this list.
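
A minimal session-level sketch of the three points above; the property names are standard Hive settings, and the orders table is the hypothetical one from the earlier sketch:

-- Use Tez as the execution engine instead of MapReduce
SET hive.execution.engine=tez;

-- Enable the cost-based optimizer and gather the statistics it relies on
SET hive.cbo.enable=true;
ANALYZE TABLE orders PARTITION (order_date) COMPUTE STATISTICS;
ANALYZE TABLE orders PARTITION (order_date) COMPUTE STATISTICS FOR COLUMNS;

-- Enable vectorized query execution (processes rows in batches instead of one at a time)
SET hive.vectorized.execution.enabled = true;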


Partitioning of Hive Tables

Partitioning physically divides a Hive table's data based on the distinct values of frequently queried columns. This technique allows queries to skip reading a large percentage of the data in the table, reducing I/O operations and speeding up overall performance.
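
For example, with the hypothetical orders table from the earlier sketch, each partition is stored as its own HDFS directory, and a filter on the partition column lets Hive skip every other directory:

-- On HDFS: .../orders/order_date=2020-01-01/, .../orders/order_date=2020-01-02/, ...
-- Only the order_date=2020-01-01 directory is scanned; all other partitions are pruned
SELECT COUNT(*)
FROM orders
WHERE order_date = '2020-01-01';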


Choose Appropriate File Format for the Data

·       Choosing the right file format will improve Hive performance. Hive supports the ORC file format, a columnar table storage format that delivers significant speed improvements through techniques like predicate push-down and compression. ORC files improve response times for your Hive queries.

·       Avoid loading too many small files (you can achieve this with Delta Lake in Apache Spark; I wish Hive could do it automatically).

·       Hadoop works best with large files, and the same applies to Hive. You should avoid ingestion processes that produce a large number of small files. When producing data outside Hive, use the text file format; once the data is in Hive tables, you can convert it to ORC or Parquet (see the sketch below).
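
A minimal sketch of that text-to-ORC conversion, plus one way to compact small files; orders_raw and orders_orc are hypothetical table names:

-- Staging table over text files produced outside Hive
CREATE TABLE orders_raw (
  order_id BIGINT,
  amount DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Convert to ORC with a CREATE TABLE ... AS SELECT
CREATE TABLE orders_orc STORED AS ORC
AS SELECT * FROM orders_raw;

-- For an existing ORC table, merge small files into larger ones
ALTER TABLE orders_orc CONCATENATE;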

Also read my Kindle book on Apache Hive Interview Questions on Amazon.
