Managing petabyte-scale data in Hive at low cost without EMR or HDInsight: a technical overview
What is the goal?
1. Separate real-time application data from older, less frequently used data.
2. Whether an application queries real-time data (typically the last 1-2 years) or older data (3-4 years back), it shouldn't need to be aware of whether the older data is stored in S3 or Blob storage.
3. Queries should be able to join data located on EBS/virtual disks with data in S3/Blob storage.
The answer is simple if you are using Amazon Elastic MapReduce (EMR) on AWS or Azure HDInsight: both vendors provide the ability to separate storage and compute resources. But if you are running the open source Apache Hadoop ecosystem rather than EMR or HDInsight, how can you separate compute from storage?
Brief History: When Hadoop was developed in 2005-06, the focus was on bringing the compute to the data. The network infrastructure in the data center was not capable of moving large amounts of data between servers, so the data had to be co-located with the compute. Today's network infrastructure can move large amounts of data with ease, so the network is no longer a bottleneck for big data computing. Public cloud vendors such as AWS, Azure, and Google are ahead in this thinking: EMR supports data access from S3 and, similarly, HDInsight supports data access from Blob storage. This is why you see the shift in Hadoop 3.1.2 and onwards, where the core design supports decoupling compute from storage; Hadoop 3.1.2 has built-in capability to integrate with AWS S3 (via the S3A connector) and Azure Blob Storage.
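As a small illustration of that built-in integration, S3A credentials can be passed per command through Hadoop's generic -D options (on an EC2 instance an attached IAM role is usually preferred, in which case no keys are needed; the bucket name and key values below are placeholders):

hadoop fs -D fs.s3a.access.key=YOUR_ACCESS_KEY -D fs.s3a.secret.key=YOUR_SECRET_KEY -ls s3a://bucket-name/

In practice these properties would normally live in core-site.xml or a Hadoop credential provider rather than on the command line.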
Coming to the point: how to scale data in Hive at low cost without using EMR or HDInsight. When dealing with large data, the best practice is to partition it based on its size and how it is queried (by day, event type, etc.). Hive manages data in directories: each table has one directory, each partition has a subdirectory under it, and each bucket has a further subdirectory, as shown in the layout below.
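For illustration, a stocks table partitioned by year, month, and day might be laid out as follows (the paths are hypothetical and follow the simplified YYYY/MM/DD style used in this article's examples; Hive-managed tables typically name partition directories in key=value form such as year=2017/month=1/day=19):

/warehouse/stocks/                       <- table directory
/warehouse/stocks/2017/01/19/            <- one partition (year/month/day)
/warehouse/stocks/2017/01/19/000000_0    <- bucket/data file
/warehouse/stocks/2017/01/19/000001_0    <- bucket/data file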
Now the goal is to keep the newer data in HDFS and move the older or less frequently used data to AWS S3.
Step 1: Copy the data for the partition being moved to S3. Use the hadoop distcp command as follows:
hadoop distcp /warehouse/stocks/2017/01/19 s3a://bucket-name/stocks/2017/01/19
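Optionally, verify that the copy landed where the ALTER statement in the next step will point (the bucket name and paths here are the article's example values):

hadoop fs -ls s3a://bucket-name/stocks/2017/01/19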
Step 2: Alter the table to point the partition at the S3A location used in the first step. The ALTER statement tells Hive to look for the partition's data in the new location.
USE <hive db name>;
ALTER TABLE stocks PARTITION(year = 2017, month = 01, day = 19)
SET LOCATION 's3a://bucket-name/stocks/2017/01/19';
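You can confirm that the metastore now points at S3 by describing the partition; the Location field in the output should show the s3a:// path:

DESCRIBE FORMATTED stocks PARTITION(year = 2017, month = 01, day = 19);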
Step 3: Delete the HDFS copy of the partition using the hadoop fs -rm -r command:
hadoop fs -rm -r /warehouse/stocks/2017/01/19
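From the application's point of view nothing has changed: a query that spans both tiers reads the 2017 partition from S3 and the newer partitions from HDFS, and the client issuing the HQL never sees the difference. A hypothetical example (with 2019 standing in for a partition still on HDFS):

SELECT year, COUNT(*) AS row_count
FROM stocks
WHERE year IN (2017, 2019)
GROUP BY year;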
The above three steps can be automated as an overnight job that moves aging partitions from EBS-backed HDFS to S3, for example with a bash script scheduled on an EC2 instance, as sketched below.
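A minimal sketch of such a job, assuming the stocks table from the examples above, a hypothetical Hive database named mydb, the example bucket-name bucket, and a node with the Hadoop client and Hive CLI installed; paths, credentials, and scheduling (e.g. cron) would need to be adapted to your environment:

#!/usr/bin/env bash
# Archive one day's partition of the stocks table from HDFS to S3.
# Usage: ./archive_partition.sh 2017 01 19
set -euo pipefail

YEAR=$1
MONTH=$2
DAY=$3

HDFS_DIR="/warehouse/stocks/${YEAR}/${MONTH}/${DAY}"
S3_DIR="s3a://bucket-name/stocks/${YEAR}/${MONTH}/${DAY}"

# Step 1: copy the partition's data to S3.
hadoop distcp "${HDFS_DIR}" "${S3_DIR}"

# Step 2: repoint the Hive partition at the S3 copy.
hive -e "USE mydb;
ALTER TABLE stocks PARTITION(year=${YEAR}, month=${MONTH}, day=${DAY})
SET LOCATION '${S3_DIR}';"

# Step 3: remove the HDFS copy only after the metastore update succeeds.
hadoop fs -rm -r "${HDFS_DIR}"

Deleting the HDFS copy last (and stopping on the first error via set -e) avoids a window in which the partition's registered location no longer holds any data.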
Conclusions: Without provisioning new compute machines, older data can be moved to inexpensive storage such as Amazon S3 or Azure Blob Storage while newer data stays in HDFS, which saves costs without requiring you to be an Amazon Elastic MapReduce or HDInsight user.
To clarify the purpose: the first aim is that customers using Hortonworks or open source Hadoop can still take advantage of object storage (S3 or Blob) and scale storage without spinning up new hardware. The second aim is to show how non-EMR/HDInsight customers can also separate compute and storage and take advantage of cloud storage. The third aim is to show that a Hive client accessing the data (which resides on EBS volumes and S3) via the Hive Query Language (HQL) does not have to worry about where the data resides.