Is the end of Hadoop near?
Sharing the article I wrote earlier this summer on the Saagie blog.
A significant stock-price correction took place in April for the companies behind two of the three major Hadoop distributions (Cloudera and Hortonworks – the third, MapR, is not publicly traded). Since both companies took a hit almost simultaneously, my take is that the underlying Hadoop technology explains the market correction to a large extent (see here and here for other articles making the same point).
Getting Hadoop to work for business is incredibly difficult and time-consuming (as a reference, see this excellent article explaining the effort it took Uber to master big data). Corporates that invested massively in Hadoop-powered datalakes are facing serious challenges in deriving business value (to put it simply) or in transforming themselves into data-centric organisations (as many consultants would phrase it).
The main reason for this lack of success is that the Hadoop distributed file system is only one building block in a fast-moving and complex eco-system of literally dozens of different technologies, spanning storage, scheduling, real-time processing, querying, deep-learning compute and analytics. These are cutting-edge technologies that tend to be open-source driven and change extremely fast.
Datalakes versus Hadoop?
But does that mean the era of datalakes is over? Definitely not. Few would deny that bringing together heterogeneous data sources in a single location and performing powerful distributed analytics on top of them are key requirements for the modern enterprise.
However, the role of the underlying Hadoop (HDFS) technology is likely to shrink: on the one hand we’ll see the commoditisation of Hadoop, and on the other its replacement by technologies that are simpler to use, such as Amazon’s S3.
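To illustrate why object storage lowers the barrier: in many processing frameworks the storage layer is just a URI, so swapping HDFS for S3 is largely a configuration change. The sketch below uses PySpark with made-up paths and bucket names, and assumes the s3a connector and credentials are already configured – an illustration under those assumptions, not a migration recipe.

```python
# Hypothetical sketch: the same Spark job can read from HDFS or from S3
# by changing only the storage URI; paths and bucket names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-agnostic-job").getOrCreate()

# Reading from a Hadoop cluster (HDFS)...
events_hdfs = spark.read.parquet("hdfs://namenode:8020/datalake/events")

# ...versus reading from object storage (S3, via the s3a connector)
events_s3 = spark.read.parquet("s3a://my-datalake-bucket/events")

# The downstream analytics are identical either way
events_s3.groupBy("country").count().show()
```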
And that is the real reason why the cathedrals (or silos, if you prefer) built around a Hadoop distribution stand on shaky foundations that do not justify their high stock-market valuations.
So what needs to be done?
The objective is to move up the value chain: getting away from storage, compute and infrastructure, and concentrating on creating and industrializing intelligent, AI-driven applications that genuinely deliver business value.
A new generation of companies is taking up this challenge: mastering the complex and open data eco-system and bridging the gap between IT (operationalization of analytics while taking compliance and security constraints into account) and business (self-service and collaboration).
Three key concepts will drive this revolution:
- Embrace the eco-system – Innovation takes place at breakneck speed, and closed systems won’t be able to keep up. The market needs an open, agnostic and up-to-date technology toolbox with automated data pipelines to bring data experiments to production. At the same time, such a platform should be extensible (by using Docker, for instance) while being fully integrated with existing IT tools for scheduling, monitoring and DevOps. Reducing the complexity of making different data and analytics technologies work together is of vital importance; resources can then be focused on the business use cases and the corresponding skill-sets that will drive short- and long-term ROI.
- Build a data community – Technology is an enabler; it’s people who ultimately drive organisational change. Different types of users (analysts, data engineers, data owners, DevOps, data scientists, IT ops…) need access to a shared and documented data portal that provides self-service tools and the ability to work collaboratively on data projects. These projects lead to smart applications, with the community ensuring a virtuous feedback loop.
- Deploy flexibly, govern globally – We live in a complex world, with lots of legacy systems and a mix of cloud-based and on-premise requirements. The glue that links the various infrastructure deployment options together is containers, and in particular the emerging Kubernetes market standard (a minimal sketch of launching a containerized data job on Kubernetes follows below). Solutions will be increasingly hybrid, with different parts of an organisation using different ways to store and process data. Global governance of data projects is not only mandatory for compliance reasons but also ensures cross-fertilization between different business units and functional teams.
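To make the container-as-glue idea concrete, here is a minimal, hypothetical sketch that submits a containerized data-processing step as a Kubernetes Job using the official Kubernetes Python client. The image name, job name and namespace are assumptions for illustration; in practice a platform’s scheduler or orchestrator would generate such objects rather than a hand-written script.

```python
# Hypothetical sketch: submitting a containerized data-processing step
# as a Kubernetes Job with the official Python client.
# The image, names and namespace below are made up for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="ingest-step",
    image="registry.example.com/pipelines/ingest:1.0",  # any Docker image
    command=["python", "ingest.py", "--source", "s3a://my-datalake-bucket/events"],
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="daily-ingest"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,  # retry the step twice before giving up
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="data-pipelines", body=job)
```

The same container image runs unchanged on a laptop, an on-premise cluster or a managed cloud service, which is exactly what makes the hybrid deployments described above manageable.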