Impala Performance Tuning – Best Practices
This article by Abdur Rashid Aleem explains the best practices for Impala Performance Tuning.
The following sections explain the factors affecting the performance of Impala features, and procedures for tuning, monitoring, and benchmarking Impala queries and other SQL operations. This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance: it means that performance remains high as the system workload increases. For example, reducing the disk I/O performed by a query can speed up an individual query, and at the same time improve scalability by making it practical to run more queries simultaneously.
Recommendations for these configurations include:
--use_local_tz_for_unix_timestamp_conversions startup flag
--convert_legacy_hive_parquet_utc_timestamps startup flag
Set the --fe_service_threads startup option for the Impala daemon (impalad) to 256.
Improving the metadata loading performance
Increase the --num_metadata_loading_threads startup option to 64.
Detecting & Avoiding block skews –
The most convenient way to detect a block skew or a “slow-host” issue is to compare the Avg Time to the Max Time in the execution summary section of the query plan.
For each phase of the query, there is an Avg Time and a Max Time value, along with #Hosts, which indicates how many hosts are involved in that phase of the query. For all query phases involving more than one host, look for cases where the maximum time is substantially greater than the average time.
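The Avg/Max comparison above can be automated. Below is a minimal sketch that flags suspicious phases once the execution summary has been parsed into rows; the `flag_skewed_phases` helper, the tuple layout, and the 2x threshold are our own illustrative assumptions, not anything Impala itself provides.

```python
# Sketch: flag query phases where Max Time far exceeds Avg Time, which
# suggests block skew or a slow host. The tuple layout
# (phase, num_hosts, avg_ms, max_ms) and the 2x threshold are assumptions.

def flag_skewed_phases(summary_rows, ratio_threshold=2.0):
    """Return phases (involving more than one host) whose max time is at
    least ratio_threshold times the average time."""
    skewed = []
    for phase, num_hosts, avg_ms, max_ms in summary_rows:
        if num_hosts > 1 and avg_ms > 0 and max_ms / avg_ms >= ratio_threshold:
            skewed.append(phase)
    return skewed

# Example rows modeled on an execution summary: (phase, #Hosts, Avg, Max) in ms.
rows = [
    ("00:SCAN HDFS", 4, 1200.0, 1350.0),   # healthy: max close to avg
    ("03:AGGREGATE", 4, 800.0, 4100.0),    # skewed: one host much slower
    ("05:EXCHANGE", 1, 50.0, 50.0),        # single host: not comparable
]
print(flag_skewed_phases(rows))  # -> ['03:AGGREGATE']
```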
Slow hosts can occur for a variety of reasons; the techniques in the following sections help avoid them.
Avoiding small files –
To reduce the amount of memory used by the Catalog for metadata, avoid creating many small files in HDFS. Small files in HDFS can be caused by either having partitions that are too granular or by performing data ingestion too frequently. Cloudera recommends that you regularly compact small files. In Hive, you can compact small files with the following SQL commands:
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 134217728;
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
INSERT OVERWRITE TABLE db_name.table_name SELECT * FROM db_name.table_name;
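To see what the merge settings above imply, here is a back-of-the-envelope sketch. The constants mirror the SET statements; the helper names are ours, and exact merge behavior varies by Hive version, so treat this as an approximation rather than Hive's actual algorithm.

```python
# Rough arithmetic behind the Hive small-file merge settings shown above.
# Constants mirror the SET statements; function names are illustrative.

MERGE_SIZE_PER_TASK = 256_000_000    # hive.merge.size.per.task (bytes)
SMALLFILES_AVG_SIZE = 134_217_728    # hive.merge.smallfiles.avgsize (128 MiB)

def needs_compaction(total_bytes, num_files):
    """Hive triggers a merge when the average output file size falls
    below hive.merge.smallfiles.avgsize."""
    return (total_bytes / num_files) < SMALLFILES_AVG_SIZE

def files_after_merge(total_bytes):
    """Approximate file count after merging at ~hive.merge.size.per.task
    bytes per output file (ceiling division)."""
    return max(1, -(-total_bytes // MERGE_SIZE_PER_TASK))

# 10 GB spread over 5,000 tiny files: average ~2 MB, far below 128 MiB.
total = 10 * 1024**3
print(needs_compaction(total, 5000))   # -> True
print(files_after_merge(total))        # -> 42
```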
Run a REFRESH statement on the table in Impala after the Hive job finishes.
Appropriate file formats –
For BI queries, the Parquet file format performs best because of its combination of columnar storage layout, compression, and encoding. The default setting for COMPRESSION_CODEC is snappy compression, but GZip compression is also supported.
Impala also supports reading the ORC file format from version 2.12 onward; however, expect query performance with ORC tables to be slower than with Parquet tables.
Text format can be used when all columns of a table are retrieved. However, because text compresses less effectively, HDFS I/O can take longer than with the Parquet file format.
Partitioning granularity recommendations –
Join query performance tuning –
Join queries are tuned by controlling the join order and choosing the appropriate join type.
Join order: with up-to-date table statistics in place, Impala's planner chooses an efficient join order automatically; the STRAIGHT_JOIN hint preserves the order written in the query instead.
Join types: Impala executes a join either by broadcasting the smaller table to all nodes or by shuffling (partitioning) both tables on the join key. The default mode can be set with either of:
SET DEFAULT_JOIN_DISTRIBUTION_MODE=shuffle;
SET DEFAULT_JOIN_DISTRIBUTION_MODE=broadcast;
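The intuition behind the broadcast/shuffle choice can be sketched as a network-cost comparison: broadcasting sends the build side to every node, while shuffling sends each row of both sides across the network once. The `choose_join_mode` function and these cost formulas are an illustrative simplification, not Impala's actual cost model.

```python
# Illustrative heuristic for choosing a join distribution strategy.
# Broadcasting replicates the build (smaller) side to every node;
# shuffling hash-partitions both sides so each row crosses the network once.
# This is a simplification of what a planner like Impala's estimates.

def choose_join_mode(build_bytes, probe_bytes, num_nodes):
    """Pick the mode with the lower estimated network cost."""
    broadcast_cost = build_bytes * num_nodes
    shuffle_cost = build_bytes + probe_bytes
    return "broadcast" if broadcast_cost <= shuffle_cost else "shuffle"

# Small dimension table joined to a large fact table on a 10-node cluster:
print(choose_join_mode(50 * 1024**2, 200 * 1024**3, 10))   # -> 'broadcast'
# Two large tables: shuffling is cheaper than replicating either side:
print(choose_join_mode(80 * 1024**3, 200 * 1024**3, 10))   # -> 'shuffle'
```

This is also why stale or missing table statistics hurt join performance: without size estimates, the planner cannot make this comparison reliably, which is when overriding DEFAULT_JOIN_DISTRIBUTION_MODE becomes useful.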