Is Hadoop dying or re-inventing…
https://hadoop.apache.org/

The terms Hadoop and Big Data have often been used interchangeably, with "big data" frequently meaning the use of Hadoop to process large volumes of data efficiently. With the recent turn of events and the advent of newer technologies, particularly Kubernetes, there have been innumerable posts proclaiming the demise of Hadoop and describing how a technology faded as quickly as it rose to fame. I cannot stop wondering why Hadoop should get so much negative publicity just because some business models running on top of it couldn't make the cut. Hadoop as a technology is far more than what meets the eye! If you think of Hadoop as only MapReduce and YARN (Yet Another Resource Negotiator) on HDFS (Hadoop Distributed File System), then sure, it might have a limited shelf life. But if you think of Hadoop as an ecosystem allowing multiple technologies to talk to one another in a distributed architecture, then the viewpoint changes to a design philosophy and not just one implementation of it.

It was the first approach that let us leverage fleets of commodity hardware to process TBs/PBs of data, with multiple open-source solutions like Hive, Oozie, and Spark working on top of it in perfect coordination, catering to every need. Each of these projects did something different: Hive gave a SQL-like query interface, Oozie simplified your ETL workloads, and Spark boosted your MapReduce workloads. All these services integrated neatly with HDFS and YARN. You can loosely think of them as individual containers from the Kubernetes world; they were the building blocks of an entire ecosystem. A whole host of projects got launched on its back, like Storm, Zeppelin, Flume, and Drill, all participating in the larger Hadoop ecosystem. Companies like Hortonworks and Cloudera did some amazing work making a variety of different technologies run together in a common ecosystem, taking the Big Data journey to a whole new level.

When we started our Big Data journey, I found MapReduce extremely cumbersome and difficult to write. A simple word count problem (the "Hello World" of Big Data) took a decent amount of code to get working. We never really bothered to learn MapReduce programming, but we surely understood the design thinking behind it. Spark, on the other hand, was gaining ground with its simple APIs; I could now write word count in a couple of lines of code. We adopted Spark straight away as our primary way to write MapReduce-style programs. For me, Spark did not mark the beginning of the end of Hadoop; it marked an evolution from file-based MapReduce to in-memory Spark processing. It was an order of magnitude faster. It worked seamlessly with Hive to fetch my data from HDFS, and it could run my containers on the machines where my data was stored (data locality) using YARN, significantly reducing data transfer time.
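To make the contrast concrete, here is a minimal PySpark word count, roughly the "couple of lines" referred to above; the input path is a placeholder, not a real dataset:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read lines, split into words, and count occurrences of each word.
counts = (
    spark.sparkContext.textFile("hdfs:///data/input.txt")  # placeholder path
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, n in counts.collect():
    print(word, n)
```

The equivalent hand-written MapReduce job would need a mapper class, a reducer class, and driver boilerplate; here the whole pipeline fits in one expression.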

The ability to store TBs worth of data, structured or unstructured, using HDFS gave a tremendous boost to our data efforts on premise. The biggest competition here was the availability of cloud storage at very low cost. However, the ability to compute where the data is stored (data locality) is a significant advantage that one must give up with the cloud model; it is a concept underplayed by cloud providers, who separate compute from storage. In addition, coming from the banking world, there is always some data you inherently prefer to keep on your on-premise machines. There is no better on-premise way to store such volumes, with redundancy and recovery working natively alongside a variety of open-source technologies, than HDFS, which you can interact with much like a regular Linux filesystem. Sure, HDFS had its share of cons, with redundancy increasing the amount of space required: by default, HDFS stores three copies of the data to tolerate two machine failures, a 200% storage overhead. A cloud provider will give you ever-increasing 9's of reliability without charging you explicitly for that extra redundancy. However, with Hadoop 3 we now have erasure coding, which cuts the overhead from 200% to around 50% with the default RS-6-3 policy, letting the same disks hold roughly twice as much user data. Even after these improvements, it might seem like HDFS will lose out to cheap cloud storage; however, from the Hadoop ecosystem's perspective it is just a matter of plugging out HDFS and plugging in S3/Azure with everything else remaining intact – the beauty of its design.
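As a back-of-the-envelope check on those overhead numbers, here is a small sketch; the 1 PB figure is just an illustrative input:

```python
# Raw storage needed for a given amount of user data under HDFS
# 3x replication versus Reed-Solomon RS(6,3) erasure coding.
def raw_storage_pb(user_data_pb: float, data_blocks: int, parity_blocks: int) -> float:
    return user_data_pb * (data_blocks + parity_blocks) / data_blocks

user_data = 1.0  # PB of user data (illustrative)

replication = user_data * 3                    # 3 full copies -> 3.0 PB raw (200% overhead)
erasure = raw_storage_pb(user_data, 6, 3)      # RS(6,3)       -> 1.5 PB raw (50% overhead)

print(f"replication: {replication} PB, erasure coding: {erasure} PB")
# 3x replication tolerates two lost copies; RS(6,3) tolerates the
# loss of any three of a stripe's nine blocks.
```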

People often complain that setting up Hadoop is difficult, and that was one reason it lost ground to cloud providers who offer it as a managed service. As an engineer, I felt it was a one-time effort to understand how the Hadoop architecture worked, and using a Cloudera or Hortonworks distribution made it all the easier. It sure was challenging to begin with, but so was learning C++/Java and writing my first-ever trading system. It gave me a good idea of how a distributed system, with a variety of people contributing, can work as a single unit. With managed services there is always a chance of losing out on some of the finer tuning details due to the abstractions, and they will always take some time to catch up to the latest versions – we moved to Hadoop 3 a year back!

With Kubernetes providing the ability to orchestrate containers across machines, it has addressed the biggest drawback of Hadoop: the management and scheduling part! If the namenode on Hadoop dies, you lose your whole cluster, so Hadoop came up with an HA mode with two namenodes. As soon as the first namenode dies, the second one takes over. But if the second one goes down too, you are again in a soup. You don't face such issues with Kubernetes, because it will automatically relaunch the service on another machine as soon as it detects a failure. So Kubernetes can become a natural choice for launching namenodes, where it acts as an enhancer to Hadoop namenode availability. Kubernetes scheduling could potentially take over YARN scheduling too, but it still has to improve its scheduling mechanisms and gain some data-locality intelligence to outperform YARN. YARN has been highly optimized for Hadoop MapReduce workloads, but there is nothing stopping us from running YARN on Kubernetes, or from using a different scheduler altogether.
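As a rough illustration of that relaunch behaviour, here is a minimal sketch using the official Kubernetes Python client to run a namenode as a single-replica Deployment. The image name is hypothetical, and a production namenode would really want a StatefulSet with persistent volumes; this only shows how Kubernetes keeps the pod alive for you:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
apps = client.AppsV1Api()

# Single-replica Deployment: if the pod or its node dies, Kubernetes
# reschedules the namenode container on another machine automatically.
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="hdfs-namenode"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "namenode"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "namenode"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="namenode",
                        image="example/hadoop-namenode:3.3",  # hypothetical image
                        ports=[client.V1ContainerPort(container_port=9870)],
                    )
                ]
            ),
        ),
    ),
)

apps.create_namespaced_deployment(namespace="default", body=deployment)
```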

I don't see newer technologies as nails in the coffin for Hadoop, but as services that address Hadoop's drawbacks and evolve it into a more reliable and sustainable platform. Kubernetes cannot do what Hive, Spark, HDFS, or the entire Hadoop ecosystem does. But HDFS, Hive, and Spark sure can run as containers on Kubernetes to reinvent the Hadoop ecosystem for a new age. AWS EMR and Azure HDInsight are perfect examples of how Hadoop can work in the cloud for those who don't want to deal with on-premise costs. Cloudera has already launched a bunch of products that take Hadoop from on premise to the cloud. The Hadoop design already had modular, independent components talking to each other through well-defined APIs, allowing you to plug and play, be it in the cloud or as containers. The Big Data ecosystem will continue to evolve, and Hadoop's core components will be present in one form or another, changing their avatar with the evolving environment.
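One small example of that plug-and-play quality: the same Spark code can point at HDFS or at cloud object storage just by changing the filesystem scheme, assuming the hadoop-aws connector and credentials are configured; the paths below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plug-and-play").getOrCreate()

# On premise: read from HDFS...
df_onprem = spark.read.parquet("hdfs:///warehouse/trades")        # placeholder path

# ...in the cloud: swap the scheme to S3; the rest of the job is unchanged.
df_cloud = spark.read.parquet("s3a://my-bucket/warehouse/trades")  # placeholder path

df_cloud.show()
```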

Amar Mehta

Associate Director | Project Management | Financial Market | Product Management

5y

Very interesting and informative

Alok Gupta

Data Scientist at Black & Veatch

5y

Quite true that with the passage of time, Hadoop, especially the on-prem solutions of Cloudera and Hortonworks, is losing its charm. With the merger of Cloudera and Hortonworks, Cloudera's market share went up, but due to cloud providers like AWS and Azure, on-prem big data solutions need to regain momentum. The evolution of big data, starting with the research paper from Sanjay Ghemawat at Google, led Doug Cutting to invent HDFS.

Srikanth SESHADRI

SmartConnect | UNFYD.COMPASS | CCaaS, CRM, Digital Transformation, GenAI Technologies

5y

Interesting analysis, educative ....

Anshuman Das

Technology Leader

5y

Very well written Gaurav
