Impala Performance Tuning – Best Practices
This article by Abdur Rashid Aleem explains the best practices for Impala Performance Tuning.
The following sections explain the factors affecting the performance of Impala features, and procedures for tuning, monitoring, and benchmarking Impala queries and other SQL operations. This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance: it means that performance remains high as the system workload increases. For example, reducing the disk I/O performed by a query can speed up an individual query, and at the same time improve scalability by making it practical to run more queries simultaneously.
Recommendations for these configurations include:
--use_local_tz_for_unix_timestamp_conversions startup flag
--convert_legacy_hive_parquet_utc_timestamps startup flag
Set the --fe_service_threads startup option for the Impala daemon (impalad) to 256.
Improving the metadata loading performance
Increase the --num_metadata_loading_threads startup option to 64.
Detecting & Avoiding block skews –
The most convenient way to detect a block skew or a “slow-host” issue is to compare the Avg Time to the Max Time in the execution summary section of the query plan.
For each phase of the query, there is an Avg Time and a Max Time value, along with #Hosts, which indicates how many hosts are involved in that phase of the query. For all query phases involving more than one host, look for cases where the maximum time is substantially greater than the average time.
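The Avg/Max comparison above can be automated. Below is a minimal sketch that flags suspicious phases once the execution summary has been parsed into rows; the `flag_skewed_phases` helper, the tuple layout, and the 2x threshold are our own illustrative assumptions, not anything Impala itself provides.

```python
# Sketch: flag query phases where Max Time far exceeds Avg Time, which
# suggests block skew or a slow host. The tuple layout
# (phase, num_hosts, avg_ms, max_ms) and the 2x threshold are assumptions.

def flag_skewed_phases(summary_rows, ratio_threshold=2.0):
    """Return phases (involving more than one host) whose max time is at
    least ratio_threshold times the average time."""
    skewed = []
    for phase, num_hosts, avg_ms, max_ms in summary_rows:
        if num_hosts > 1 and avg_ms > 0 and max_ms / avg_ms >= ratio_threshold:
            skewed.append(phase)
    return skewed

# Example rows modeled on an execution summary: (phase, #Hosts, Avg, Max) in ms.
rows = [
    ("00:SCAN HDFS", 4, 1200.0, 1350.0),   # healthy: max close to avg
    ("03:AGGREGATE", 4, 800.0, 4100.0),    # skewed: one host much slower
    ("05:EXCHANGE", 1, 50.0, 50.0),        # single host: not comparable
]
print(flag_skewed_phases(rows))  # -> ['03:AGGREGATE']
```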
Slow hosts can occur for a variety of reasons; the techniques in the following sections help avoid them.
Avoiding small files –
To reduce the amount of memory used by the Catalog for metadata, avoid creating many small files in HDFS. Small files in HDFS can be caused by either having partitions that are too granular or by performing data ingestion too frequently. Cloudera recommends that you regularly compact small files. In Hive, you can compact small files with the following SQL commands:
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 134217728;
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
INSERT OVERWRITE TABLE db_name.table_name SELECT * FROM db_name.table_name;
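To see what the merge settings above imply, here is a back-of-the-envelope sketch. The constants mirror the SET statements; the helper names are ours, and exact merge behavior varies by Hive version, so treat this as an approximation rather than Hive's actual algorithm.

```python
# Rough arithmetic behind the Hive small-file merge settings shown above.
# Constants mirror the SET statements; function names are illustrative.

MERGE_SIZE_PER_TASK = 256_000_000    # hive.merge.size.per.task (bytes)
SMALLFILES_AVG_SIZE = 134_217_728    # hive.merge.smallfiles.avgsize (128 MiB)

def needs_compaction(total_bytes, num_files):
    """Hive triggers a merge when the average output file size falls
    below hive.merge.smallfiles.avgsize."""
    return (total_bytes / num_files) < SMALLFILES_AVG_SIZE

def files_after_merge(total_bytes):
    """Approximate file count after merging at ~hive.merge.size.per.task
    bytes per output file (ceiling division)."""
    return max(1, -(-total_bytes // MERGE_SIZE_PER_TASK))

# 10 GB spread over 5,000 tiny files: average ~2 MB, far below 128 MiB.
total = 10 * 1024**3
print(needs_compaction(total, 5000))   # -> True
print(files_after_merge(total))        # -> 42
```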
Run a REFRESH statement on the table in Impala after the Hive job finishes.
Appropriate file formats –
For BI queries, the Parquet file format performs best because of its combination of columnar storage layout, compression, and encoding. The default setting for COMPRESSION_CODEC is snappy compression, but GZip compression is also supported.
Impala also supports reading the ORC file format from version 2.12 onward; however, expect query performance with ORC tables to be slower than with Parquet tables.
Text format can be used when all columns of a table are retrieved. However, because text compresses less effectively, HDFS I/O can take longer than with the Parquet file format.
Partitioning granularity recommendations –
Join query performance tuning –
Join queries are tuned by controlling the join order and choosing the appropriate join type.
Join order: with up-to-date table statistics in place, Impala's planner chooses an efficient join order automatically; the STRAIGHT_JOIN hint preserves the order written in the query instead.
Join types: Impala executes a join either by broadcasting the smaller table to all nodes or by shuffling (partitioning) both tables on the join key. The default mode can be set with either of:
SET DEFAULT_JOIN_DISTRIBUTION_MODE=shuffle;
SET DEFAULT_JOIN_DISTRIBUTION_MODE=broadcast;
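The intuition behind the broadcast/shuffle choice can be sketched as a network-cost comparison: broadcasting sends the build side to every node, while shuffling sends each row of both sides across the network once. The `choose_join_mode` function and these cost formulas are an illustrative simplification, not Impala's actual cost model.

```python
# Illustrative heuristic for choosing a join distribution strategy.
# Broadcasting replicates the build (smaller) side to every node;
# shuffling hash-partitions both sides so each row crosses the network once.
# This is a simplification of what a planner like Impala's estimates.

def choose_join_mode(build_bytes, probe_bytes, num_nodes):
    """Pick the mode with the lower estimated network cost."""
    broadcast_cost = build_bytes * num_nodes
    shuffle_cost = build_bytes + probe_bytes
    return "broadcast" if broadcast_cost <= shuffle_cost else "shuffle"

# Small dimension table joined to a large fact table on a 10-node cluster:
print(choose_join_mode(50 * 1024**2, 200 * 1024**3, 10))   # -> 'broadcast'
# Two large tables: shuffling is cheaper than replicating either side:
print(choose_join_mode(80 * 1024**3, 200 * 1024**3, 10))   # -> 'shuffle'
```

This is also why stale or missing table statistics hurt join performance: without size estimates, the planner cannot make this comparison reliably, which is when overriding DEFAULT_JOIN_DISTRIBUTION_MODE becomes useful.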