Managing petabyte-scale data in Hive at low cost without EMR or HDInsight: a technical overview
What is the goal?
1. Separate real-time application data from older, less frequently used data.
2. Whether an application queries real-time data (typically the last 1-2 years) or older data (3-4 years back), it shouldn't need to be aware of whether the older data is stored in S3 or Blob storage.
3. Queries should be able to join data located on EBS/virtual disks with data in S3/Blob storage.
The answer is simple if you are using Amazon Elastic MapReduce (EMR) on AWS or Azure HDInsight: both vendors provide the ability to separate storage and compute resources. But if you are running the open source Apache Hadoop ecosystem rather than EMR or HDInsight, how can you separate compute from storage?
Brief History: When Hadoop was developed in 2005-06, the focus was on bringing the compute to the data. The network infrastructure in the data center was not capable of moving large amounts of data between servers, so the data had to be co-located with the compute. Today's network infrastructure can move large amounts of data with ease, so the network is no longer a bottleneck for big data computing. Public cloud vendors such as AWS, Azure, and Google are ahead in this thinking: EMR supports data access from S3 and, similarly, HDInsight supports data access from Blob storage. This is why you see the shift in Hadoop 3.1.2 and onwards, where the core design supports decoupling compute from storage; Hadoop 3.1.2 has built-in capability to integrate with AWS S3 (via the S3A connector) and Azure Blob Storage.
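As a small illustration of that built-in integration, S3A credentials can be passed per command through Hadoop's generic -D options (on an EC2 instance an attached IAM role is usually preferred, in which case no keys are needed; the bucket name and key values below are placeholders):

hadoop fs -D fs.s3a.access.key=YOUR_ACCESS_KEY -D fs.s3a.secret.key=YOUR_SECRET_KEY -ls s3a://bucket-name/

In practice these properties would normally live in core-site.xml or a Hadoop credential provider rather than on the command line.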
Coming to the point: how to scale data in Hive at low cost without using EMR or HDInsight. When dealing with large data, the best practice is to partition it based on its size and how it is queried (by day, event type, etc.). Hive manages data in directories: each table has one directory, each partition has a subdirectory under it, and each bucket has a further subdirectory, as shown in the layout below.
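For illustration, a stocks table partitioned by year, month, and day might be laid out as follows (the paths are hypothetical and follow the simplified YYYY/MM/DD style used in this article's examples; Hive-managed tables typically name partition directories in key=value form such as year=2017/month=1/day=19):

/warehouse/stocks/                       <- table directory
/warehouse/stocks/2017/01/19/            <- one partition (year/month/day)
/warehouse/stocks/2017/01/19/000000_0    <- bucket/data file
/warehouse/stocks/2017/01/19/000001_0    <- bucket/data file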
Now the goal is to keep the newer data in HDFS and move the older or less frequently used data to AWS S3.
Step 1: Copy the data for the partition being moved to S3. Use the hadoop distcp command as follows:
hadoop distcp /warehouse/stocks/2017/01/19 s3a://bucket-name/stocks/2017/01/19
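Optionally, verify that the copy landed where the ALTER statement in the next step will point (the bucket name and paths here are the article's example values):

hadoop fs -ls s3a://bucket-name/stocks/2017/01/19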
Step 2: Alter the table to point the partition at the S3A location used in the first step. The ALTER statement tells Hive to look for the partition's data in the new location.
USE <hive db name>;
ALTER TABLE stocks PARTITION(year = 2017, month = 01, day = 19)
SET LOCATION 's3a://bucket-name/stocks/2017/01/19';
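You can confirm that the metastore now points at S3 by describing the partition; the Location field in the output should show the s3a:// path:

DESCRIBE FORMATTED stocks PARTITION(year = 2017, month = 01, day = 19);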
Step 3: Delete the HDFS copy of the partition using the hadoop fs -rm -r command:
hadoop fs -rm -r /warehouse/stocks/2017/01/19
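From the application's point of view nothing has changed: a query that spans both tiers reads the 2017 partition from S3 and the newer partitions from HDFS, and the client issuing the HQL never sees the difference. A hypothetical example (with 2019 standing in for a partition still on HDFS):

SELECT year, COUNT(*) AS row_count
FROM stocks
WHERE year IN (2017, 2019)
GROUP BY year;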
The above three steps can be automated as an overnight job that moves aging partitions from EBS-backed HDFS to S3, for example with a bash script scheduled on an EC2 instance, as sketched below.
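A minimal sketch of such a job, assuming the stocks table from the examples above, a hypothetical Hive database named mydb, the example bucket-name bucket, and a node with the Hadoop client and Hive CLI installed; paths, credentials, and scheduling (e.g. cron) would need to be adapted to your environment:

#!/usr/bin/env bash
# Archive one day's partition of the stocks table from HDFS to S3.
# Usage: ./archive_partition.sh 2017 01 19
set -euo pipefail

YEAR=$1
MONTH=$2
DAY=$3

HDFS_DIR="/warehouse/stocks/${YEAR}/${MONTH}/${DAY}"
S3_DIR="s3a://bucket-name/stocks/${YEAR}/${MONTH}/${DAY}"

# Step 1: copy the partition's data to S3.
hadoop distcp "${HDFS_DIR}" "${S3_DIR}"

# Step 2: repoint the Hive partition at the S3 copy.
hive -e "USE mydb;
ALTER TABLE stocks PARTITION(year=${YEAR}, month=${MONTH}, day=${DAY})
SET LOCATION '${S3_DIR}';"

# Step 3: remove the HDFS copy only after the metastore update succeeds.
hadoop fs -rm -r "${HDFS_DIR}"

Deleting the HDFS copy last (and stopping on the first error via set -e) avoids a window in which the partition's registered location no longer holds any data.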
Conclusions: Without provisioning new compute machines, older data can be moved to inexpensive storage such as Amazon S3 or Azure Blob Storage while newer data stays in HDFS, which saves costs without requiring you to be an Amazon Elastic MapReduce or HDInsight user.
To clarify the purpose: the first aim is that customers using Hortonworks or open source Hadoop can still take advantage of object storage (S3 or Blob) and scale storage without spinning up new hardware. The second aim is to show how non-EMR/HDInsight customers can also separate compute and storage and take advantage of cloud storage. The third aim is to show that a Hive client accessing the data (which resides on EBS volumes and S3) via the Hive Query Language (HQL) does not have to worry about where the data resides.