How We Got 50X Faster Queries on Our Data Lake with Amazon Athena and Saved Thousands of Dollars | Case Study

Author:

Soumil Nitin Shah (Data collection and Processing Team lead)

I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as the Data Collection and Processing Team Lead at JobTarget, where I spend most of my time developing our ingestion framework and building microservices and scalable architecture on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also built streaming applications that ingest real-time streams via Kinesis and Firehose into Elasticsearch.

Hari Om Dubey (Consultant Software Engineer, Python developer)

I completed a Master's in Computer Applications and have five years of experience developing software applications using Python and the Django framework. I love to code in Python, and solving a problem through code excites me. I have been working at JobTarget for the past two months as a Software Engineer on the Data Team.

Himadri Chobisa (Data Engineer)

I recently graduated from UConn with a master's in Business Analytics and Project Management with a Business Data Science concentration. I joined JobTarget as an intern in April last year and have been with the data team since then. I am a data enthusiast with experience in SQL, Python, Power BI, Tableau, and machine learning. In my free time, I enjoy dancing and cooking.

Birendra Singh (Python Developer | Search Engineer | Data Specialist)

Hi, I am Birendra Singh. I completed a Bachelor's in Electronic Engineering, and I love working with Elasticsearch. I have six years of professional experience across the software development lifecycle, spanning multiple sectors including telecommunications, geospatial, and oil and gas, with a focus on quality and on-time delivery.

April Love Ituhat (Software Engineer - Junior, Python)

I have a bachelor's degree in Computer Engineering and have spent the last three years on research and development tasks across diverse domains such as AWS, machine learning, robot simulations, and IoT. I've been part of the JobTarget data team since November 2021, and I usually work with Python and AWS. It's exciting for me to see the applications come to fruition.

Thanks to our mentor, Paul Allan.

Project

We ingest data from various sources such as MongoDB, DynamoDB, and SQL Server. We have built an internal framework that can easily handle 1,000+ jobs and scale the compute environment up and down. Read more: https://www.dhirubhai.net/pulse/batch-frameworkan-internal-data-ingestion-framework-process-shah/


Explanation

We run jobs on our internal framework; the jobs pull data from the various sources and push it to an SQS queue for Lambdas to process. The Lambdas flatten and clean the data before dumping it into the data lake. The batch framework takes care of CDC and only processes new data, in an incremental fashion. Once the data is in the data lake (AWS S3), we run Glue jobs to compact the data and convert it into Parquet with Snappy compression. Once those jobs complete, we run a Glue crawler to populate the metadata catalog, and we query the data via Athena. Dashboards and reports are built on top of Athena using AWS QuickSight.
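To make the flattening step concrete, here is a minimal sketch of such a Lambda, assuming an SQS trigger, JSON message bodies, and a hypothetical bucket name (our production Lambdas do more cleaning than this):

```python
import gzip
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name, used for illustration only.
DATA_LAKE_BUCKET = "my-data-lake-raw"


def flatten(record, parent_key="", sep="_"):
    """Recursively flatten a nested dict into a single level of columns."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep=sep))
        else:
            flat[new_key] = value
    return flat


def handler(event, context):
    """Triggered by SQS: flatten each message body and land it in S3 as gzipped JSON lines."""
    rows = [flatten(json.loads(msg["body"])) for msg in event["Records"]]
    body = "\n".join(json.dumps(row, default=str) for row in rows)
    key = f"raw/ingest/{context.aws_request_id}.json.gz"
    s3.put_object(Bucket=DATA_LAKE_BUCKET, Key=key, Body=gzip.compress(body.encode("utf-8")))
    return {"written": len(rows), "key": key}
```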

AWS Lake formation

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake lets you break down data silos and combine different types of analytics to gain insights and guide better business decisions. Setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. This work includes loading data from diverse sources, monitoring those data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their operation, reorganizing data into a columnar format, deduplicating redundant data, and matching linked records. Once data has been loaded into the data lake, you need to grant fine-grained access to datasets, and audit access over time across a wide range of analytics and machine learning (ML) tools and services.

Creating a data lake with Lake Formation is as simple as defining data sources and what access and security policies you want to apply. Lake Formation then helps you collect and catalog data from databases and object storage, move the data into your new Amazon Simple Storage Service (S3) data lake, clean and classify your data using ML algorithms, and secure access to your sensitive data using granular controls at the column, row, and cell levels. Your users can access a centralized data catalog that describes available datasets and their appropriate usage. They then use these datasets with their choices of analytics and ML services, such as Amazon Redshift, Amazon Athena, Amazon EMR for Apache Spark, and Amazon QuickSight. Lake Formation builds on the capabilities available in AWS Glue.
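As an illustration of those column-level controls, the snippet below grants SELECT on a few columns of one table to an analyst role through boto3; the database, table, column, and role names are hypothetical stand-ins rather than our real setup:

```python
import boto3

lf = boto3.client("lakeformation")

# Hypothetical database, table, columns, and IAM role, for illustration only.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "jobs_datalake",
            "Name": "applications",
            "ColumnNames": ["job_id", "posted_date", "status"],  # column-level grant
        }
    },
    Permissions=["SELECT"],
)
```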

We read many articles and watched countless videos to gather the best practices; the most useful suggestions are summarized below.

Suggestion 1:

Hocanin [2] says one should always partition data to reduce the amount of data scanned, and mentions that one of the best choices is a date-based partition. He also gives guidance on the factors that help decide which columns to partition on: columns that are used as filters are good candidates for partitioning. Partitioning has a cost, though. As the number of partitions in your table increases, the overhead of retrieving and processing the partition metadata grows, and your files get smaller. Partitioning too finely can wipe out the initial benefit. Likewise, if your data is heavily skewed toward one partition value, and most queries use that value, then the overhead may wipe out the initial benefit.
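For example, a date-partitioned table can be declared with a DDL statement like the one below, submitted to Athena through boto3. The database, table, bucket, and column names here are hypothetical stand-ins, not our actual schema:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database, table, and bucket names, for illustration only.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS jobs_datalake.applications (
    job_id      string,
    employer_id string,
    status      string
)
PARTITIONED BY (ingest_date string)   -- date-based partition column
STORED AS PARQUET
LOCATION 's3://my-data-lake-curated/applications/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "jobs_datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
```

With this layout, a query that filters on ingest_date (for example, WHERE ingest_date = '2022-04-11') only scans the objects under that partition instead of the whole table. New partitions still have to be registered in the catalog, which is what our Glue crawler run takes care of.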


We did a lot of reading on which file format to use, as there are various formats such as Parquet and Avro. Apache Parquet and Apache ORC are popular columnar data stores. They store data efficiently by employing column-wise compression, different encodings, compression based on data type, and predicate pushdown. They are also splittable. Generally, better compression ratios or skipping blocks of data means reading fewer bytes from Amazon S3, leading to better query performance.
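Since our Glue jobs write Parquet with Snappy compression (as described above), the conversion itself is only a few lines if you sketch it with pandas; the file paths below are hypothetical:

```python
import pandas as pd

# Hypothetical input/output paths; in our pipeline the equivalent step runs inside a Glue job.
df = pd.read_csv("applications_export.csv")

# Columnar Parquet with Snappy compression: smaller objects and fewer bytes scanned by Athena.
df.to_parquet(
    "applications_export.snappy.parquet",
    engine="pyarrow",
    compression="snappy",
    index=False,
)
```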


This is one of the 40+ tables we have in Athena. After compacting the files into Parquet with Snappy compression, we were able to get much faster results.


We were able to reduce the data size from 13 GB to 3 GB, which is a big reduction.

The author further says that another parameter that can be tuned is the block size or stripe size. The block size in Parquet or stripe size in ORC represents the maximum number of rows that can fit into one block, in terms of size in bytes. The larger the block/stripe size, the more rows can be stored in each block. By default, the Parquet block size is 128 MB and the ORC stripe size is 64 MB. He recommends a larger block size if you have tables with many columns, to ensure that each column block remains at a size that allows for efficient sequential I/O.

Another parameter that can be tuned is the compression algorithm on data blocks. The Parquet default is Snappy, but it also supports no compression, GZIP, and LZO-based compression. ORC defaults to ZLIB, but it also supports no compression and Snappy. The author recommends starting with the default compression algorithm and testing other compression algorithms if you have more than 10 GB of data.
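If you write Parquet from Python, both knobs are exposed at write time, as in the toy sketch below (made-up data; note that pyarrow sizes row groups by row count rather than by bytes, unlike the Hadoop-style byte-based block size mentioned above):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical toy table; in practice this would come from the Glue conversion step.
table = pa.table({"job_id": ["a1", "b2", "c3"], "clicks": [10, 25, 7]})

pq.write_table(
    table,
    "applications.gzip.parquet",
    row_group_size=1_000_000,  # larger row groups -> more sequential I/O per column
    compression="gzip",        # default is snappy; gzip trades CPU time for smaller files
)
```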


Logvinskiy [3] mentions that Parquet is the best file format to use for Athena. He further advises avoiding the bigint (int64) Parquet data type for timestamps if you need to work with milliseconds (IoT use cases), and replacing it with a date-time-with-milliseconds value in string format for the best combination of performance and price; he adds that string is the most cost-efficient data type for AWS Athena.
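Following that advice, one way to land millisecond timestamps as strings instead of bigint is sketched below; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical events with millisecond epoch timestamps (IoT-style payloads).
df = pd.DataFrame({"event_id": [1, 2], "event_ts_ms": [1649673600123, 1649673605456]})

# Convert the int64 epoch column to a date-time string with millisecond precision.
df["event_ts"] = (
    pd.to_datetime(df["event_ts_ms"], unit="ms")
    .dt.strftime("%Y-%m-%d %H:%M:%S.%f")
    .str[:-3]  # strftime emits microseconds; trim to milliseconds
)

df.drop(columns=["event_ts_ms"]).to_parquet("events.parquet", compression="snappy", index=False)
```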


Radečić, Dario. “CSV Files for Storage? No Thanks. There’s a Better Option.” Towards Data Science. Accessed April 11, 2022. https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d. The author shows the savings that can be achieved just by switching file formats.


Thanks to these authors for providing amazing insights about file formats in their articles.


As you can see, at JobTarget we work with massive amounts of data, and small optimizations like this can help us save thousands of dollars annually.

As we have learned, Parquet is an open-source file format built to handle flat columnar storage. Parquet works well with complex data in large volumes. It is known both for its performant data compression and for its ability to handle a wide variety of encoding types. Parquet uses Google's record-shredding and assembly algorithm, which can address complex data structures within data storage. Some Parquet benefits include fast queries that can fetch specific column values without reading full row data, highly efficient column-wise compression, and high compatibility with OLAP.
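That first benefit, fetching specific columns without reading full rows, is easy to demonstrate from Python; in the sketch below (hypothetical file and column names), read_parquet loads only the projected columns:

```python
import pandas as pd

# Hypothetical table, written once so the example is self-contained.
pd.DataFrame(
    {"job_id": ["a1", "b2"], "employer_id": ["e1", "e2"], "clicks": [10, 25]}
).to_parquet("applications.snappy.parquet", compression="snappy", index=False)

# Column projection: only the requested columns are read from the columnar file.
subset = pd.read_parquet("applications.snappy.parquet", columns=["job_id", "clicks"])
print(subset)
```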

While CSV is simple and the most widely used data format (Excel, Google Sheets), there are several distinct advantages to Parquet, including:

Parquet is column-oriented, and CSV is row-oriented. Row-oriented formats are optimized for OLTP workloads while column-oriented formats are better suited for analytical workloads.

Column-oriented databases such as AWS Redshift Spectrum bill by the amount of data scanned per query [4].

Therefore, converting CSV to Parquet with partitioning and compression lowers overall costs and improves performance [4].

Parquet has helped its users reduce storage requirements by at least one-third on large datasets; in addition, it greatly improves scan and deserialization time, and hence overall costs [4]. The sketch after this list compares the two formats on the same dataframe.
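This is a toy comparison with made-up data (real savings depend on your schemas, cardinality, and compression settings), but it shows the direction of the effect:

```python
import os

import numpy as np
import pandas as pd

# Hypothetical synthetic dataset for a rough size comparison.
n = 1_000_000
df = pd.DataFrame(
    {
        "job_id": np.arange(n),
        "status": np.random.choice(["open", "closed", "filled"], size=n),
        "clicks": np.random.randint(0, 500, size=n),
    }
)

df.to_csv("sample.csv", index=False)
df.to_parquet("sample.snappy.parquet", compression="snappy", index=False)

print("CSV bytes:    ", os.path.getsize("sample.csv"))
print("Parquet bytes:", os.path.getsize("sample.snappy.parquet"))
```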

I have included all articles and links in the references section; if I missed any, please let me know and I will gladly add them.




References

[1] Amazon. “Amazon Athena.” Accessed April 11, 2022. https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc.

[2] Hocanin, Mert. “Top 10 Performance Tuning Tips for Amazon Athena.” Amazon. Accessed April 11, 2022. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/.

[3] Logvinskiy, Valentin. “Which Columnar File Format to Select for Athena / BigQuery / Synapse Analytics.” LinkedIn. Accessed April 11, 2022. https://www.dhirubhai.net/pulse/which-columnar-file-format-select-athena-bigquery-valentin-logvinskiy/.

[4] Snowflake. “What Is Parquet?” Accessed April 11, 2022. https://www.snowflake.com/guides/what-parquet#:~:text=Parquet%20is%20an%20open%20source,wide%20variety%20of%20encoding%20types.

