Hadoop Project Proliferation Challenges Selection and Support
Summary
Hadoop's pace of innovation and functional breadth continue to grow rapidly. Data and analytics leaders must manage the increasing uncertainties and risks in order to realize Hadoop's promise of cost-effective, distributed and scalable processing of large datasets.
Overview
Impacts
- Mapping the continuing rapid expansion of Hadoop capabilities to the information needs of the organization as projects proliferate has become a complex process for data and analytics leaders.
- Rapid Hadoop change and adoption challenge distribution vendors' immature support capabilities, creating confusion for data and analytics leaders about exactly what vendors do support.
- Hadoop's innovation model challenges its users' ability to influence new projects that lack commercial support.
Recommendations
Data and analytics leaders should:
- Track those Hadoop projects and functions that satisfy the organization's information requirements and use a tiered classification to rate their supportability.
- Require commercial vendors to state clearly, in writing, which projects in their Hadoop distribution are supported, at which release level, and what "support" means.
- Assess the deployment risk of early-stage Hadoop open-source projects that, however desirable, may lack sufficient maturity and commercial support for the organization's requirements.
Contents
- Analysis
- Impacts and Recommendations
- Mapping the continuing rapid expansion of Hadoop capabilities to the information needs of the organization as projects proliferate has become a complex process for data and analytics leaders
- Rapid Hadoop change and adoption challenge distribution vendors' immature support capabilities, creating confusion for data and analytics leaders about exactly what vendors do support
- Hadoop's innovation model challenges its users' ability to influence new projects that lack commercial support
- Gartner Recommended Reading
Figures
Analysis
The open-source software framework known as Apache Hadoop has gained sizable acceptance in organizations, spurred on by the growing digital business appetite for big data. However, the pace and extent of changes to the stack of projects in the Hadoop framework — both by the open-source community that incubates it and the commercial vendors who package and support it (see "Market Guide for Hadoop Distributions") — create confusion and uncertainty for data and analytics leaders (see Figure 1 for our summary of the impacts and top recommendations for these leaders).
Hadoop is constantly in flux. 1 Information leaders must choose among distributions with varied components — typically, Apache Software Foundation (ASF) projects that are updated independently, plus newly added projects. It is challenging for organizations to map such a rapidly evolving software collection to their equally fast-changing needs. Desirable new additions or improvements in one Hadoop distribution may not be included in others. Your chosen distribution may lag in supporting a project with a feature you want, or its vendor may suggest you wait while it pursues a different project with similar objectives that is not yet ready for support.
Vendor or even community "support" is highly variable. Hadoop support can be as uncertain as the software itself — usually poorly documented, ill-defined and immature. Some projects, such as Spark, subsume several components (Spark Streaming, Spark SQL, Spark ML and others), and it may not be clear which pieces are supported and which are not.
There is a lack of broadly supported Hadoop standards beyond the emerging Linux Foundation Open Data Platform Initiative (ODPi), which after a year of effort includes only runtime specifications for the Hadoop Distributed File System (HDFS), MapReduce, and Yet Another Resource Negotiator (YARN) resource management — the same components listed by the Apache Software Foundation on its Hadoop page. The ASF accepts multiple projects for other common framework elements, and distributors decide for themselves which to include in their offerings. As a result, many promising new capabilities go commercially unsupported: some projects are supported by only one vendor, others by none at all, and the ODPi itself is not supported by three of the four leading distributors; only Hortonworks participates. Devising a predictable technology adoption plan to address the information needs of an organization is daunting, because things can change in the space of just a few months. A clear action plan is required that identifies business-value-driven needs and maps them to the suppliers most likely to meet them.
Figure 1. Impacts and Top Recommendations for Data and Analytics Leaders
Source: Gartner (May 2016)
Impacts and Recommendations
Mapping the continuing rapid expansion of Hadoop capabilities to the information needs of the organization as projects proliferate has become a complex process for data and analytics leaders
Hadoop has been rapidly adopted by data and analytics leaders and by commercial vendors, with good reason; the Hadoop framework:
- Stores all types of data, both structured and unstructured
- Leverages a distributed architecture to use less costly network, storage and processing resources and software
- Quickly processes very large datasets
- Satisfies the digital business demand for rapidly proliferating big data use cases
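The "distributed processing" behind these benefits generally means the MapReduce model (or successors such as Spark). As a purely illustrative sketch, with plain Python standing in for code the framework would run across many nodes, the two phases of a MapReduce word count look like this:

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs; in Hadoop, each mapper sees one input split."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum counts per key; in Hadoop, the framework shuffles pairs to reducers."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Tiny illustrative "dataset"; a real job would read splits from HDFS.
docs = ["big data big insight", "big cluster"]
counts = reduce_phase(map_phase(docs))
```

In a real deployment, the input would be split across HDFS blocks, mappers would run on the nodes holding those blocks, and the framework would shuffle the intermediate pairs to reducers — which is what lets the same simple logic scale to very large datasets.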
Yet, the very quality that attracts the interest of data and analytics leaders — the rapid pace and breadth of Hadoop innovation — is now causing them a new problem. Mapping Hadoop projects and functions to the data needs of the organization is becoming increasingly complex, because the stack is changing so quickly.
Hadoop components, and the specific features within them, result from a confusing blend of open-source projects and distributor additions. The two sets are closely intertwined, with vendors participating in the community and working to enlist its support for favored Hadoop functions. For example, Hortonworks' distribution tends to have the least "nonstandard Apache" content, but its current Cloudbreak initiative for managing cloud-based Hadoop has not yet been accepted as an Apache project. Cloudera has its own management console that predates Apache Ambari. MapR supports Drill and Impala for SQL access, whereas the other distributors focus on their own solutions — though all support Spark SQL. All the players are also pursuing differing approaches to security and governance. To add to the confusion, distributors sometimes propose their formerly separate pieces to the ASF — as Cloudera recently did with Impala, and Pivotal with Geode (the former GemFire in-memory data grid).
Data leaders are thus faced with a bewildering array of projects and features at various stages of development. Schedules are difficult to determine. New or updated components may not be included by the commercial Hadoop vendor you rely on; sometimes they are not incorporated by anyone but their promoter. For example, data scientists may be interested in leveraging Apache Zeppelin's notebook capabilities, but Amazon and Hortonworks are the only distributors to include Apache Zeppelin today. Will the innovations integrate smoothly with existing Hadoop deployments, or disrupt them? Who will maintain "aging" legacy MapReduce code if nobody uses MapReduce anymore? Open-source contributors and app developers don't want to work on legacy code or products. Some enterprises use multiple distributions to get around this issue; some have yet to select their first.
Plan on making a proactive effort to keep abreast of Hadoop changes using a three-tiered classification system. The tiers offer a pragmatic classification for innovations or updates that are:
- Supported in most leading commercial Hadoop distributions
- Supported by only one or two distributors
- Considered early-stage development: not yet supported by any distributor, but possibly needed anyway
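To make the tiers actionable, the classification can be kept as a simple support matrix. The sketch below uses a hypothetical snapshot of distributor support (the entries are illustrative; real support status changes quickly and should be confirmed with each vendor in writing):

```python
# Hypothetical snapshot of which distributors support which projects;
# verify actual support status with each vendor before relying on it.
SUPPORT_MATRIX = {
    "Spark":    {"Cloudera", "Hortonworks", "MapR", "Amazon"},
    "Zeppelin": {"Hortonworks", "Amazon"},
    "Flink":    set(),  # early-stage: no distributor support yet
}

def classify(project):
    """Rate a project against the three-tier supportability scheme."""
    vendors = SUPPORT_MATRIX.get(project, set())
    if len(vendors) >= 3:  # threshold for "most leading" is an assumption
        return "Tier 1: supported in most leading commercial distributions"
    if vendors:
        return "Tier 2: supported by only one or two distributors"
    return "Tier 3: early-stage, no commercial support yet"
```

Reviewing the matrix quarterly, and recording which tier each project of interest falls into, turns the classification into a lightweight governance artifact rather than a one-time exercise.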
An example of such a scheme can be found in the author's blog posts. 2
Recommendations:
- Track developments in the Hadoop stack and assess their business value.
- Identify new requirements not being met by your existing Hadoop providers and communicate them to those providers.
- Since portability across commercial distributions is not trivial, do not migrate to another distribution for a single feature unless that feature is an immediate critical requirement for your project.
Rapid Hadoop change and adoption challenge distribution vendors' immature support capabilities, creating confusion for data and analytics leaders about exactly what vendors do support
One of the most persistent questions that Gartner clients raise in Hadoop conversations is, "Is the project I want, or have begun using, supported yet?" Mainstream data and analytics users are accustomed to products whose support is not fragmented across multiple "pieces" — if it's in there, it's supported. In the open-source world, however, bits are sometimes included with no clear commitment to support being made. Early adopters are accustomed to looking for support from other equally aggressive users on mailing lists and message boards — but commercial software users expect commitments and SLAs.
There are several challenges:
- First, project support by commercial Hadoop vendors varies widely and is, typically, very poorly documented online.
- Second, even in commercial products there are various meanings for the word "support," and it comes at different levels for different prices.
- Finally, vendors who offer a package that includes and extends someone else's offering — as, for example, Microsoft and Pivotal do with Hortonworks' Hadoop distribution — may "pass through" some support requests, complicating the tracking process and potentially introducing delays.
Widespread chatter on open-source forums and social media about the status of various Hadoop vendors, and hints or promises of vendor support, are no alternative to contractual commitments. Data and analytics leaders should ask commercial vendors three questions:
- What is your commitment to stay current and integrate new versions?
- Do you support the specific Hadoop projects I care about? If not, will you, and when?
- What do you mean by "support," and what specific SLAs do you offer?
Online information for some distributors tends to be vague about project details, and it can be difficult to get answers unless you press for them. When a Hadoop component pops up an error message, a call to the vendor may reveal that it's "a feature that we're not supporting yet."
Your selection of a Hadoop distribution and vendor should be based on what is currently in the vendor's software; ask which projects are currently supported, and get the answer in writing. A distributor's roadmap can be helpful in seeing what its priorities are, or are said to be; however, a roadmap is not production software. Ask for a commitment to update frequency — ideally, tied to the release of new Apache projects. Because the stack is composed of so many individual pieces, each with its own schedule published on the Apache website, waiting for a monolithic release from your distribution vendor can leave you behind.
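Tracking update frequency can be as simple as diffing version snapshots. The sketch below compares hypothetical bundled versions against the latest Apache releases (all version numbers and project names are illustrative, not a statement of any distribution's actual contents):

```python
# Hypothetical version snapshots; real data would come from the Apache
# project pages and your distributor's release notes.
APACHE_LATEST = {"Spark": "2.0.0", "Kafka": "0.10.0", "HBase": "1.2.1"}
IN_DISTRIBUTION = {"Spark": "1.6.1", "Kafka": "0.10.0", "HBase": "1.2.1"}

def lagging_projects():
    """Return projects whose bundled release differs from the Apache release."""
    return sorted(
        project
        for project, bundled in IN_DISTRIBUTION.items()
        if APACHE_LATEST.get(project, bundled) != bundled
    )
```

A mismatch here is a prompt for the vendor conversation described above: ask when the newer release will be supported, and what "support" will mean for it.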
Be sure you review what the vendor means by "support." As the distributors have matured, their offerings have become more sophisticated, but they are delivered unevenly — skills challenges affect vendors too. Geography matters; clients report that it can be difficult to get local support, and even if the vendor offers "follow the sun, 24/7 availability," it can be tricky to diagnose and remediate issues that are not obvious. Identify what budget and staffing resources the vendor has dedicated to its customer support. Talk to customer references about support quality — local, if possible.
Finally, there are opportunities to take advantage of distributor expertise and of features designed to optimize results. All distributors offer setup and operational support at varying levels and prices. For example, for ongoing optimization, Cloudera offers Proactive Support and Hortonworks promotes its SmartSense — instrumentation that "phones home" to provide proactive monitoring and actionable recommendations. Claims of uniqueness can be ignored; focus on what is delivered and what value it might offer.
Recommendations:
- Assess the usefulness, thoroughness and completeness of product documentation and the learning and training resources available to IT staff about the vendor's software stack.
- Outline support scenarios necessary for your Hadoop deployments, and walk through them with your distribution vendor.
- Ask your distributor which projects are supported and at what version level. Ask for a roadmap and a commitment to the frequency of distribution updates.
- Establish internal escalation and change-management procedures for use when support is delayed or absent.
- Assess local support delivery, and insist on SLAs that are comparable to those for other commercial software you depend on.
Hadoop's innovation model challenges its users' ability to influence new projects that lack commercial support
The ideally direct line from customer demand to new features is strained by the open-source model. The politics of open-source software can be confusing to those outside the interrelated communities of coders and vendors, yet customers still need to influence the direction of open-source projects. Hadoop distributors have added outreach efforts to include customers in project planning, but these efforts are limited and their success so far is unclear. Getting distributors to adopt projects can also be a challenge without a clear way to influence such decisions.
Many commercially unsupported projects have potential value: Apache Beam, Flink, Kite, Low Latency Application Master (LLAMA), Kylin, Myriad and Tajo are just a few promising initiatives among dozens. Some are official Apache projects or are incubating (a preliminary status during which a project may not yet have attracted contributors beyond its promoter). Others are supported entirely by one promoter until they reach a state of readiness that draws in other players.
This lack of customer influence raises the risks involved in adopting early-stage innovation. Even if they don't prove to be dead ends, single-vendor Hadoop projects ironically create one of the chief evils that open source hopes to overcome — locking users into the development plans and financial realities of a single vendor for a critical infrastructure element. Moreover, when projects are supported by a growing number of players, the specifications can change drastically, requiring, at times, a rewrite of existing code a customer has implemented.
Thankfully, customers don't need to master the intricacies of these relationships, nor do they have to be on the bleeding edge of all proliferating Hadoop projects. They do, however, need to understand Hadoop development priorities to track changes in the Hadoop stack and evaluate important emerging features in the Apache community, based on their likely business value. Being able to track these emerging features and projects, and their rate of vendor adoption, enables enterprises to talk with vendors about their development priorities and delivery roadmaps.
Recommendations:
- Assess the deployment risk of early-stage Hadoop open-source projects that, however desirable, may lack sufficient commercial support for your data requirements.
- Avoid production deployment of unsupported software unless you are willing to accept the risks, both of readiness and of future changes to the software that might "break" the stack you are using it in.
Gartner Recommended Reading
Some documents may not be available as part of your current Gartner subscription.
"Market Guide for Hadoop Distributions"
"Toolkit: RFP for Hadoop Distributions"
Evidence
In Gartner's "Market Share: All Software Markets, Worldwide, 2014," the three independent Hadoop distributors (Cloudera, Hortonworks and MapR) totaled $323.2 million in revenue in constant currency (which would be enough to make them eighth on the DBMS revenue list). IBM and Amazon also generate substantial though unspecified Hadoop revenue.
1 An Apache project is a documented piece of software with a clear purpose and defined interfaces, approved and maintained by an Apache Project Management Committee (PMC) with a minimum number of volunteers working from multiple organizations. Each committee has a leader and defined responsibilities.