How We Got 50X Faster Queries on Our Data Lake with Amazon Athena and Saved Thousands of Dollars | Case Study

Author:

Soumil Nitin Shah (Data collection and Processing Team lead)

I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as the Data Collection and Processing Team Lead at JobTarget, where I spend most of my time developing our ingestion framework and building microservices and scalable architecture on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also built streaming applications that ingest real-time streams via Kinesis and Firehose into Elasticsearch.

Hari Om Dubey (Consultant Software Engineer, Python developer)

I completed a Master's in Computer Applications and have five years of experience developing software applications using Python and the Django framework. I love to code in Python, and solving a problem through code excites me. I have been working at JobTarget for the past two months as a Software Engineer on the Data Team.

Himadri Chobisa (Data Engineer)

I recently graduated from UConn with a master's in Business Analytics and Project Management with a Business Data Science concentration. I joined JobTarget as an intern in April last year and have been with the data team since then. I am a data enthusiast with experience in SQL, Python, Power BI, Tableau, and machine learning. In my free time, I enjoy dancing and cooking.

Birendra Singh (Python Developer | Search Engineer | Data Specialist)

Hi, I am Birendra Singh. I completed a Bachelor's in Electronic Engineering, and I love working with Elasticsearch. I have six years of professional experience across the software development lifecycle, spanning multiple sectors including telecommunications, geospatial, and oil and gas, with a focus on quality and on-time delivery.

April Love Ituhat (Software Engineer - Junior, Python)

I have a bachelor's degree in Computer Engineering and have spent the last three years on research and development tasks across diverse domains such as AWS, machine learning, robot simulations, and IoT. I've been part of the JobTarget data team since November 2021, and I usually work with Python and AWS. It's exciting for me to see the applications come to fruition.

Thanks to our mentor, Paul Allan.

Project

We ingest data from various sources such as MongoDB, DynamoDB, and SQL Server. We have built an internal framework that can easily handle 1,000+ jobs and scale the compute environment up and down. Read more: https://www.dhirubhai.net/pulse/batch-frameworkan-internal-data-ingestion-framework-process-shah/


Explanation

We run jobs on our internal framework; the jobs pull data from the various sources and push it to an SQS queue for Lambdas to process. The Lambdas flatten and clean the data before dumping it into the data lake. The batch framework takes care of CDC and only processes new data, in an incremental fashion. Once the data is in the data lake (AWS S3), we run Glue jobs to compact the data and convert it into Parquet with Snappy compression. Once those jobs complete, we run a Glue crawler to populate the metadata catalog, and we query the data via Athena. Dashboards and reports are built on top of Athena using AWS QuickSight.
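To make the flattening step concrete, here is a minimal sketch of such a Lambda, assuming an SQS trigger, JSON message bodies, and a hypothetical bucket name (our production Lambdas do more cleaning than this):

```python
import gzip
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name, used for illustration only.
DATA_LAKE_BUCKET = "my-data-lake-raw"


def flatten(record, parent_key="", sep="_"):
    """Recursively flatten a nested dict into a single level of columns."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep=sep))
        else:
            flat[new_key] = value
    return flat


def handler(event, context):
    """Triggered by SQS: flatten each message body and land it in S3 as gzipped JSON lines."""
    rows = [flatten(json.loads(msg["body"])) for msg in event["Records"]]
    body = "\n".join(json.dumps(row, default=str) for row in rows)
    key = f"raw/ingest/{context.aws_request_id}.json.gz"
    s3.put_object(Bucket=DATA_LAKE_BUCKET, Key=key, Body=gzip.compress(body.encode("utf-8")))
    return {"written": len(rows), "key": key}
```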

AWS Lake formation

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake lets you break down data silos and combine different types of analytics to gain insights and guide better business decisions. Setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. This work includes loading data from diverse sources, monitoring those data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their operation, reorganizing data into a columnar format, deduplicating redundant data, and matching linked records. Once data has been loaded into the data lake, you need to grant fine-grained access to datasets, and audit access over time across a wide range of analytics and machine learning (ML) tools and services.

Creating a data lake with Lake Formation is as simple as defining data sources and what access and security policies you want to apply. Lake Formation then helps you collect and catalog data from databases and object storage, move the data into your new Amazon Simple Storage Service (S3) data lake, clean and classify your data using ML algorithms, and secure access to your sensitive data using granular controls at the column, row, and cell levels. Your users can access a centralized data catalog that describes available datasets and their appropriate usage. They then use these datasets with their choices of analytics and ML services, such as Amazon Redshift, Amazon Athena, Amazon EMR for Apache Spark, and Amazon QuickSight. Lake Formation builds on the capabilities available in AWS Glue.
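As an illustration of those column-level controls, the snippet below grants SELECT on a few columns of one table to an analyst role through boto3; the database, table, column, and role names are hypothetical stand-ins rather than our real setup:

```python
import boto3

lf = boto3.client("lakeformation")

# Hypothetical database, table, columns, and IAM role, for illustration only.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "jobs_datalake",
            "Name": "applications",
            "ColumnNames": ["job_id", "posted_date", "status"],  # column-level grant
        }
    },
    Permissions=["SELECT"],
)
```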

We read many articles and watched countless videos to gather the best practices; the most useful suggestions are summarized below.

Suggestion 1:

Hocanin [2] says one should always partition data to reduce the amount of data scanned, and mentions that one of the best choices is a date-based partition. He also gives guidance on the factors that help decide which columns to partition on: columns that are used as filters are good candidates for partitioning. Partitioning has a cost, though. As the number of partitions in your table increases, the overhead of retrieving and processing the partition metadata grows, and your files get smaller. Partitioning too finely can wipe out the initial benefit. Likewise, if your data is heavily skewed toward one partition value, and most queries use that value, then the overhead may wipe out the initial benefit.
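For example, a date-partitioned table can be declared with a DDL statement like the one below, submitted to Athena through boto3. The database, table, bucket, and column names here are hypothetical stand-ins, not our actual schema:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database, table, and bucket names, for illustration only.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS jobs_datalake.applications (
    job_id      string,
    employer_id string,
    status      string
)
PARTITIONED BY (ingest_date string)   -- date-based partition column
STORED AS PARQUET
LOCATION 's3://my-data-lake-curated/applications/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "jobs_datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
```

With this layout, a query that filters on ingest_date (for example, WHERE ingest_date = '2022-04-11') only scans the objects under that partition instead of the whole table. New partitions still have to be registered in the catalog, which is what our Glue crawler run takes care of.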


We did a lot of reading on which file format to use, as there are various formats such as Parquet and Avro. Apache Parquet and Apache ORC are popular columnar data stores. They store data efficiently by employing column-wise compression, different encodings, compression based on data type, and predicate pushdown. They are also splittable. Generally, better compression ratios or skipping blocks of data means reading fewer bytes from Amazon S3, leading to better query performance.
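Since our Glue jobs write Parquet with Snappy compression (as described above), the conversion itself is only a few lines if you sketch it with pandas; the file paths below are hypothetical:

```python
import pandas as pd

# Hypothetical input/output paths; in our pipeline the equivalent step runs inside a Glue job.
df = pd.read_csv("applications_export.csv")

# Columnar Parquet with Snappy compression: smaller objects and fewer bytes scanned by Athena.
df.to_parquet(
    "applications_export.snappy.parquet",
    engine="pyarrow",
    compression="snappy",
    index=False,
)
```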


This is one of the 40+ tables we have in Athena. After compacting the files into Parquet with Snappy compression, we were able to get much faster results.


We were able to reduce the data size from 13 GB to 3 GB, which is a big reduction.

The author further says that another parameter that can be tuned is the block size or stripe size. The block size in Parquet or stripe size in ORC represents the maximum number of rows that can fit into one block, in terms of size in bytes. The larger the block/stripe size, the more rows can be stored in each block. By default, the Parquet block size is 128 MB and the ORC stripe size is 64 MB. He recommends a larger block size if you have tables with many columns, to ensure that each column block remains at a size that allows for efficient sequential I/O.

Another parameter that can be tuned is the compression algorithm on data blocks. The Parquet default is Snappy, but it also supports no compression, GZIP, and LZO-based compression. ORC defaults to ZLIB, but it also supports no compression and Snappy. The author recommends starting with the default compression algorithm and testing other compression algorithms if you have more than 10 GB of data.
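If you write Parquet from Python, both knobs are exposed at write time, as in the toy sketch below (made-up data; note that pyarrow sizes row groups by row count rather than by bytes, unlike the Hadoop-style byte-based block size mentioned above):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical toy table; in practice this would come from the Glue conversion step.
table = pa.table({"job_id": ["a1", "b2", "c3"], "clicks": [10, 25, 7]})

pq.write_table(
    table,
    "applications.gzip.parquet",
    row_group_size=1_000_000,  # larger row groups -> more sequential I/O per column
    compression="gzip",        # default is snappy; gzip trades CPU time for smaller files
)
```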


Logvinskiy [3] mentions that Parquet is the best file format to use for Athena. He further advises avoiding the bigint (int64) Parquet data type for timestamps if you need to work with milliseconds (IoT use cases), and replacing it with a date-time-with-milliseconds value in string format for the best combination of performance and price; he adds that string is the most cost-efficient data type for AWS Athena.
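Following that advice, one way to land millisecond timestamps as strings instead of bigint is sketched below; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical events with millisecond epoch timestamps (IoT-style payloads).
df = pd.DataFrame({"event_id": [1, 2], "event_ts_ms": [1649673600123, 1649673605456]})

# Convert the int64 epoch column to a date-time string with millisecond precision.
df["event_ts"] = (
    pd.to_datetime(df["event_ts_ms"], unit="ms")
    .dt.strftime("%Y-%m-%d %H:%M:%S.%f")
    .str[:-3]  # strftime emits microseconds; trim to milliseconds
)

df.drop(columns=["event_ts_ms"]).to_parquet("events.parquet", compression="snappy", index=False)
```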


Radečić, Dario. “CSV Files for Storage? No Thanks. There’s a Better Option.” Towards Data Science. Accessed April 11, 2022. https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d. The author shows the savings that can be achieved just by switching file formats.


Thanks to these authors for providing amazing insights about file formats in their articles.


As you can see, at JobTarget we work with massive amounts of data, and small optimizations like this can help us save thousands of dollars annually.

As we have learned, Parquet is an open-source file format built to handle flat columnar storage. Parquet works well with complex data in large volumes. It is known both for its performant data compression and for its ability to handle a wide variety of encoding types. Parquet uses Google's record-shredding and assembly algorithm, which can address complex data structures within data storage. Some Parquet benefits include fast queries that can fetch specific column values without reading full row data, highly efficient column-wise compression, and high compatibility with OLAP.
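That first benefit, fetching specific columns without reading full rows, is easy to demonstrate from Python; in the sketch below (hypothetical file and column names), read_parquet loads only the projected columns:

```python
import pandas as pd

# Hypothetical table, written once so the example is self-contained.
pd.DataFrame(
    {"job_id": ["a1", "b2"], "employer_id": ["e1", "e2"], "clicks": [10, 25]}
).to_parquet("applications.snappy.parquet", compression="snappy", index=False)

# Column projection: only the requested columns are read from the columnar file.
subset = pd.read_parquet("applications.snappy.parquet", columns=["job_id", "clicks"])
print(subset)
```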

While CSV is simple and the most widely used data format (Excel, Google Sheets), there are several distinct advantages to Parquet, including:

Parquet is column-oriented, and CSV is row-oriented. Row-oriented formats are optimized for OLTP workloads while column-oriented formats are better suited for analytical workloads.

Column-oriented databases such as AWS Redshift Spectrum bill by the amount of data scanned per query [4].

Therefore, converting CSV to Parquet with partitioning and compression lowers overall costs and improves performance [4].

Parquet has helped its users reduce storage requirements by at least one-third on large datasets; in addition, it greatly improves scan and deserialization time, and hence overall costs [4]. The sketch after this list compares the two formats on the same dataframe.
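This is a toy comparison with made-up data (real savings depend on your schemas, cardinality, and compression settings), but it shows the direction of the effect:

```python
import os

import numpy as np
import pandas as pd

# Hypothetical synthetic dataset for a rough size comparison.
n = 1_000_000
df = pd.DataFrame(
    {
        "job_id": np.arange(n),
        "status": np.random.choice(["open", "closed", "filled"], size=n),
        "clicks": np.random.randint(0, 500, size=n),
    }
)

df.to_csv("sample.csv", index=False)
df.to_parquet("sample.snappy.parquet", compression="snappy", index=False)

print("CSV bytes:    ", os.path.getsize("sample.csv"))
print("Parquet bytes:", os.path.getsize("sample.snappy.parquet"))
```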

I have included all articles and links in the references section; if I missed any, please let me know and I will gladly add them.




References

[1] Amazon. “Amazon Athena.” Accessed April 11, 2022. https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc.

[2] Hocanin, Mert. “Top 10 Performance Tuning Tips for Amazon Athena.” Amazon. Accessed April 11, 2022. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/.

[3] Logvinskiy, Valentin. “Which Columnar File Format to Select for Athena / BigQuery / Synapse Analytics.” LinkedIn. Accessed April 11, 2022. https://www.dhirubhai.net/pulse/which-columnar-file-format-select-athena-bigquery-valentin-logvinskiy/.

[4] Snowflake. “What Is Parquet?” Accessed April 11, 2022. https://www.snowflake.com/guides/what-parquet#:~:text=Parquet%20is%20an%20open%20source,wide%20variety%20of%20encoding%20types.

