Is the end of Hadoop near?
Sharing the article I wrote earlier this summer on the Saagie blog.
A significant stock-price correction took place in April for the companies behind two of the three major Hadoop distributions (Cloudera and Hortonworks – the third, MapR, is not publicly traded). Since both companies took a hit almost simultaneously, my take is that the underlying Hadoop technology explains the market correction to a large extent (see here and here for other articles making the same point).
Getting Hadoop to work for business is incredibly difficult and time-consuming (as a reference, see this excellent article explaining the effort it took Uber to master big data). Corporates that invested massively in Hadoop-powered datalakes are facing serious challenges in deriving business value (to put it simply) or in transforming themselves into data-centric organisations (as many consultants would phrase it).
The main reason for this lack of success is that the Hadoop distributed file system is only one building block in a fast-moving and complex eco-system of literally dozens of different technologies, spanning storage, scheduling, real-time processing, querying, deep-learning compute and analytics. These are cutting-edge technologies that tend to be open-source driven and change extremely fast.
Datalakes versus Hadoop?
But does that mean the era of datalakes is over? Definitely not. Few would deny that bringing together heterogeneous data sources in a single location and performing powerful distributed analytics on top of them are key requirements for the modern enterprise.
However, the role of the underlying Hadoop (HDFS) technology is likely to shrink: on the one hand we’ll see the commoditisation of Hadoop, and on the other its replacement by technologies that are simpler to use, such as Amazon’s S3.
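To illustrate why object storage lowers the barrier: in many processing frameworks the storage layer is just a URI, so swapping HDFS for S3 is largely a configuration change. The sketch below uses PySpark with made-up paths and bucket names, and assumes the s3a connector and credentials are already configured – an illustration under those assumptions, not a migration recipe.

```python
# Hypothetical sketch: the same Spark job can read from HDFS or from S3
# by changing only the storage URI; paths and bucket names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-agnostic-job").getOrCreate()

# Reading from a Hadoop cluster (HDFS)...
events_hdfs = spark.read.parquet("hdfs://namenode:8020/datalake/events")

# ...versus reading from object storage (S3, via the s3a connector)
events_s3 = spark.read.parquet("s3a://my-datalake-bucket/events")

# The downstream analytics are identical either way
events_s3.groupBy("country").count().show()
```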
And that is the real reason why the cathedrals (or silos, if you prefer) built around a Hadoop distribution stand on shaky foundations that do not justify their high stock-market valuations.
So what needs to be done?
The objective is to move up the value chain: getting away from storage, compute and infrastructure, and concentrating on creating and industrializing intelligent, AI-driven applications that genuinely deliver business value.
A new generation of companies is taking up this challenge: mastering the complex and open data eco-system and bridging the gap between IT (operationalization of analytics while taking compliance and security constraints into account) and business (self-service and collaboration).
Three key concepts will drive this revolution:
- Embrace the eco-system – Innovation takes place at breakneck speed, and closed systems won’t be able to keep up. The market needs an open, agnostic and up-to-date technology toolbox with automated data pipelines to bring data experiments to production. At the same time, such a platform should be extensible (by using Docker, for instance) while being fully integrated with existing IT tools for scheduling, monitoring and DevOps. Reducing the complexity of making different data and analytics technologies work together is of vital importance; resources can then be focused on the business use cases and the corresponding skill-sets that will drive short- and long-term ROI.
- Build a data community – Technology is an enabler; it’s people who ultimately drive organisational change. Different types of users (analysts, data engineers, data owners, DevOps, data scientists, IT ops…) need access to a shared and documented data portal that provides self-service tools and the ability to work collaboratively on data projects. These projects lead to smart applications, with the community ensuring a virtuous feedback loop.
- Deploy flexibly, govern globally – We live in a complex world, with lots of legacy systems and a mix of cloud-based and on-premise requirements. The glue that links the various infrastructure deployment options together is containers, and in particular the emerging Kubernetes market standard (a minimal sketch of launching a containerized data job on Kubernetes follows below). Solutions will be increasingly hybrid, with different parts of an organisation using different ways to store and process data. Global governance of data projects is not only mandatory for compliance reasons but also ensures cross-fertilization between different business units and functional teams.
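To make the container-as-glue idea concrete, here is a minimal, hypothetical sketch that submits a containerized data-processing step as a Kubernetes Job using the official Kubernetes Python client. The image name, job name and namespace are assumptions for illustration; in practice a platform’s scheduler or orchestrator would generate such objects rather than a hand-written script.

```python
# Hypothetical sketch: submitting a containerized data-processing step
# as a Kubernetes Job with the official Python client.
# The image, names and namespace below are made up for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="ingest-step",
    image="registry.example.com/pipelines/ingest:1.0",  # any Docker image
    command=["python", "ingest.py", "--source", "s3a://my-datalake-bucket/events"],
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="daily-ingest"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,  # retry the step twice before giving up
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="data-pipelines", body=job)
```

The same container image runs unchanged on a laptop, an on-premise cluster or a managed cloud service, which is exactly what makes the hybrid deployments described above manageable.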