Apache Hive Performance Tuning Best Practices
While working with Apache Hive recently, I learned that performance can be a real nightmare, so I am sharing a few tips:
Appropriate Schema Design
· Design the schema only after you understand the data and how it will be retrieved. For example, partition a table on the columns frequently used in WHERE conditions, but do not partition on more than two columns.
· Pay attention to lookup tables and the tables that reference them. Lookup tables are usually small while the referencing tables are much larger; in older versions of Hive this was handled by explicitly loading the data of the smaller table into memory, whereas newer versions of Hive do it for you.
· Consider complex data types. For example, a map data type can hold a country and its respective code directly in the large table, so the lookup data stays with the data it describes and the join is avoided altogether (see the sketch after this list).
· In many cases, HDFS storage is shared across multiple departments and teams, and HDFS usually acts as a data hub for the entire organization. Creating a structured, organized repository of your data in HDFS provides many benefits, such as:
· A standard directory layout makes it easier to share data across teams in the organization.
· The structure helps guard against accidental deletion or corruption.
· Using a staging area before moving data to the target directory helps ensure that partial data is not processed.
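Here is a minimal sketch of the map-type idea above; the table, columns, and values are hypothetical, not from any specific schema:

-- A map column keeps the country lookup data inside the large table,
-- so no join against a separate lookup table is needed.
CREATE TABLE sales (
  sale_id  BIGINT,
  amount   DOUBLE,
  country  MAP<STRING, STRING>  -- e.g. map('name', 'India', 'code', 'IN')
)
STORED AS ORC;

-- Read the lookup value directly from the map, with no join:
SELECT sale_id, country['code'] AS country_code
FROM sales
WHERE country['name'] = 'India';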
Performance Considerations for Join Operations on Hive Tables
· Use Tez as the execution engine instead of MapReduce (a well-known tip).
· Run a pilot: load data into test tables, measure the cost of the queries you need to execute, and use cost-based optimization.
· Use vectorization; it immensely helps when scanning data (and it is very effective in Apache Spark as well). The sketch after this list shows the relevant settings.
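As a quick reference, here is a minimal sketch of the session-level settings behind these three tips (the table name in the ANALYZE statements is hypothetical):

-- Run on Tez instead of MapReduce:
SET hive.execution.engine = tez;
-- Enable cost-based optimization and let the optimizer use statistics:
SET hive.cbo.enable = true;
SET hive.compute.query.using.stats = true;
-- Process rows in batches instead of one at a time:
SET hive.vectorized.execution.enabled = true;

-- CBO relies on statistics, so gather them for the tables you query:
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;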
Partitioning of Hive Tables
Partitioning physically divides a Hive table's data based on the distinct values of frequently queried columns. This allows queries to skip reading a large percentage of the data in the table, reducing I/O and speeding up overall performance.
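A minimal sketch with hypothetical table and column names: queries that filter on the partition column read only the matching partition directories.

-- Partition by a column that appears frequently in WHERE clauses:
CREATE TABLE web_logs (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (log_date STRING)
STORED AS ORC;

-- Partition pruning: only the log_date = '2020-01-15' directory is read;
-- the rest of the table is skipped entirely.
SELECT COUNT(*) FROM web_logs WHERE log_date = '2020-01-15';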
Choose Appropriate File Format for the Data
· Choosing the right file format will improve Hive performance. Hive supports ORC, a table storage format that delivers significant speed improvements through techniques like predicate push-down and compression. ORC files improve response times for your Hive queries.
· Avoid loading too many small files. (You can handle this automatically using Delta Lake in Apache Spark; I wish Hive could do it automatically too.)
· Hadoop works best with large files, and the same applies to Hive. You should avoid ingestion processes that produce a large number of small files. When producing data outside Hive, use the text file format; once the data is in Hive tables, you can convert it to ORC or Parquet (see the sketch below).
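A minimal sketch of that conversion, with a hypothetical staging table and HDFS path; it also illustrates the staging-area pattern from the schema section:

-- Land the raw data as plain text in a staging location:
CREATE EXTERNAL TABLE staging_events (
  event_id BIGINT,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/staging/events';

-- Convert it to ORC once it is in Hive:
CREATE TABLE events STORED AS ORC
AS SELECT * FROM staging_events;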
Also read my Kindle book, Apache Hive Interview Questions, on Amazon.
Learn more about vectorized queries: https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution (in Hive: set hive.vectorized.execution.enabled = true;)