Same but Different, Different but Same
Have you ever been confused by terms like DataProc, EMR, Hadoop and HDInsight? If that makes you feel bad, don’t. I got confused a lot too, especially in the beginning when I first heard these terms. Only recently did I realize that confusion is a necessary step towards understanding.
Recently I got exactly this question about DataProc and EMR. They are the same but different, different but the same. So in this article I try to clarify things, using DataProc, EMR and the basic cloud services as examples. I hope it helps, a bit.
A brief history of Hadoop and different distributions
When a new solution comes to market, whether open source or commercial, it gets adopted by different players as soon as it becomes reasonably successful. Let’s take Hadoop as an example.
Hadoop was first released on April 1, 2006. For the first time, it made big data processing possible on commodity hardware. It soon became popular, and the big 3 cloud providers started offering Hadoop services one after another.
AWS started offering EMR (Elastic MapReduce) in 2009, which is Amazon’s version of Hadoop.
Azure started offering HDInsight in 2013, which is Microsoft’s version of Hadoop.
Google Cloud Platform started offering its version, named DataProc, in 2015.
Hadoop then reached the peak of its hype, and quite a few companies deployed their own Hadoop solutions.
In recent years, Hadoop has been falling sharply in the market, mainly due to the wide adoption of cloud computing. Leave a comment below if you are interested in why this is happening.
Different but Same
AWS EMR, Azure HDInsight and Google DataProc are all based on the same underlying technology, Hadoop. They share many of the same components, and each adds a few of its own extras.
So conceptually they are the same.
Hadoop is falling sharply. The same applies to AWS EMR, Azure HDInsight and Google DataProc. They are all falling.
But if you expect things to work exactly the same way on each of them, that expectation is too high. They are different.
Same but Different
When each company packages Hadoop, it adds a bit of its own flavor. Just as with the same recipe, different chefs will cook a bit differently. For better or worse.
In the technology space, different means you cannot just click a button and move from HDInsight to DataProc. There are quite a few detailed differences that need to be handled.
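One small example of such a difference is the storage path a job reads from. The sketch below is only an illustration, not a migration tool: it assumes jobs on EMR typically use `s3://` paths, jobs on DataProc use `gs://`, and jobs on HDInsight use `abfs://`. The function name, bucket names and file paths are hypothetical.

```python
from urllib.parse import urlsplit

# Storage URI schemes typically used on each platform (assumed for illustration).
STORAGE_SCHEME = {"emr": "s3", "dataproc": "gs", "hdinsight": "abfs"}


def retarget_path(uri, platform, authority):
    """Rewrite a storage URI for another platform.

    `authority` is the bucket (or container@account host) on the target
    platform. Only the scheme and authority change; the object path is kept.
    """
    parts = urlsplit(uri)
    scheme = STORAGE_SCHEME[platform]
    return f"{scheme}://{authority}{parts.path}"


# Hypothetical bucket and file names.
print(retarget_path("s3://my-bucket/data/events.parquet", "dataproc", "other-bucket"))
```

Paths are only one of the details. Cluster configuration, permissions and bundled component versions differ as well, and a rewrite like this does not cover any of that.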
Relationship of EMR, HDInsight, DataProc
Here is a simplified illustration of their relationship.
For simplicity, the extra layer of distributions such as Hortonworks is omitted.
Basic cloud service in the big 3 cloud providers
Why list only so few services? Because powerful systems can be built with just the following basic services.
| AWS | Azure | GCP |
| --- | --- | --- |
| S3: Simple Storage Service | ADL: Azure Data Lake | GCS: Google Cloud Storage |
| EC2: Elastic Compute Cloud | Virtual Machine | Compute Engine |
| RDS: Relational Database Service | Postgres | Cloud SQL |
| Route 53 | Azure DNS | Cloud DNS |
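The service list above can also be written down as a small lookup, which is handy when translating an architecture from one provider to another. This is a minimal sketch: the category labels and the `counterpart` function are my own naming for illustration, and the service names are taken from this article.

```python
# Rough equivalents across the big 3; category labels are illustrative.
EQUIVALENTS = {
    "object storage": {"AWS": "S3", "Azure": "ADL", "GCP": "GCS"},
    "virtual machines": {"AWS": "EC2", "Azure": "Virtual Machine", "GCP": "Compute Engine"},
    "relational database": {"AWS": "RDS", "Azure": "Postgres", "GCP": "Cloud SQL"},
    "DNS": {"AWS": "Route 53", "Azure": "Azure DNS", "GCP": "Cloud DNS"},
    "managed Hadoop": {"AWS": "EMR", "Azure": "HDInsight", "GCP": "DataProc"},
}


def counterpart(service, target):
    """Return the rough equivalent of `service` on the `target` provider."""
    for names in EQUIVALENTS.values():
        if service in names.values():
            return names[target]
    raise KeyError(service)


print(counterpart("EMR", "GCP"))
print(counterpart("Cloud SQL", "AWS"))
```

The lookup only says which services play the same role. As with the Hadoop offerings, equivalent does not mean identical.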
Summary
I hope you enjoyed reading.
* Same but different
* Different but the same
* For better or worse
* For the growth of ecosystems
Just a recommendation: don’t use DataProc, EMR or HDInsight if you are starting a brand new project. Leave a comment below if you want to hear why, and why Hadoop is fading out.
Thanks to Ruben Laguna for proofreading.
This article was originally published at https://knockdata.com/blog/same-but-different-but-same.html