Why Big Data Hub Projects Fail

I have, over the years, been involved with a number of "Big Data" projects, especially those that seem to focus on data hubs. Most of them have involved hundreds or even thousands of people, with the avowed goal of making all of the data of an organization available not only to people within the organization but also to people outside of it. Many of them have failed spectacularly, although a few recovered after some truly heroic intervention.

There are enough common threads among both the failures and the successes that it's worth exploring just what went wrong, and what went right. For a number of reasons, I will not mention names here, but all of these accounts either came from direct experience or were related to me by people working on the projects in question. If you are an architect or programmer, your experiences may differ, and I'd love to hear from you in either case.

One final note - I've written this from the perspective of data hubs, but much of it is just as applicable to other software projects. Far more software projects fail because of poor management than because of poor technology.

What's a Data Hub?

Hadoop proponents have a favorite term - data lake - which I've long had a problem with, because it assumes that all data is the same and that a data hub is just a big database that anyone can dip into to get their glass of data. I usually find that the best way of defining a data hub is that it is a project that provides the following services (a rough illustrative sketch follows the list):

  • Data comes from disparate sources. Much of the time, these will be relational databases, but sometimes they will be XML or JSON datafeeds, CSV files, Excel documents, or PDFs.
  • Data is stored in a common repository. This repository may be federated, but typically the hub contains the relevant data.
  • Data is available through a variety of API services. Information is queried from the hub, not the sources. 
  • Data may feed back to the sources. This isn't always a requirement, though is becoming more and more important.
  • Data is mapped to a master ontology. This means that there is an enterprise data model in place, and as such information is mapped to that model as it is ingested.
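
To make the list above concrete, here is a minimal sketch, in Python, of what such a hub's surface might look like. Everything in it - class names, field names, the in-memory store - is hypothetical and purely illustrative; a real hub would sit on a proper repository and a real ontology.

    from typing import Any, Callable, Dict, List

    class DataHub:
        """Toy facade: ingest from disparate sources, map to one model, query the hub."""
        def __init__(self) -> None:
            self._mappers: Dict[str, Callable[[dict], dict]] = {}  # per-source mapping to the canonical model
            self._store: List[dict] = []                            # the common repository (in-memory stand-in)

        def register_source(self, source: str, mapper: Callable[[dict], dict]) -> None:
            self._mappers[source] = mapper

        def ingest(self, source: str, records: List[dict]) -> None:
            for raw in records:
                canonical = self._mappers[source](raw)              # map to the master ontology on the way in
                canonical["_source"] = source                       # keep provenance for feedback to sources
                self._store.append(canonical)

        def query(self, **criteria: Any) -> List[dict]:
            # Clients query the hub, never the original sources.
            return [r for r in self._store
                    if all(r.get(k) == v for k, v in criteria.items())]

    # Two sources with different field names feed one canonical shape.
    hub = DataHub()
    hub.register_source("crm", lambda r: {"name": r["FullName"], "email": r["EmailAddr"]})
    hub.register_source("hr", lambda r: {"name": r["employee_name"], "email": r["work_email"]})
    hub.ingest("crm", [{"FullName": "Ada Lovelace", "EmailAddr": "ada@example.com"}])
    hub.ingest("hr", [{"employee_name": "Ada Lovelace", "work_email": "ada@example.com"}])
    print(hub.query(name="Ada Lovelace"))                           # both records, regardless of source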

In this regard, data hubs are essentially facades that abstract the data from their core data sources. Long term, a data hub should in theory replace the source databases, but that's usually not immediately feasible.

These requirements appear on the surface to be straightforward, but in practice these kinds of projects face a number of hurdles.

System Architecture Trumps Data Architecture

The system architect will take a look at this problem and will see it as a problem involving pipes. Move data from data store A to data store B. Do a little ETL. Write some services in front of store B. Rinse and repeat.

The data architect will look at this problem and shudder. Each source represents a different ontology - different field names, different relationships, different implementations. Beyond getting information into a common format (by itself always a point of contention, as every system architect feels that their database of choice is the best one to use, regardless of whether it is or not), there is almost invariably at least a one-stage, and in many cases a two-stage, transformation process - the latter stage mapping to a potentially common model - and THESE processes are perhaps the most expensive ones that a development team needs to perform.
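
A rough sketch of that two-stage process, assuming entirely hypothetical source fields: stage one normalizes a raw record into a common format, stage two maps the normalized record onto the shared canonical model. The expense the data architect dreads lives almost entirely in writing and maintaining functions like these for every source.

    from datetime import datetime

    def stage_one_normalize(raw: dict) -> dict:
        """Source-specific cleanup: rename cryptic fields, fix types, parse local date formats."""
        return {
            "customer_name": raw["CUST_NM"].strip().title(),
            "signup_date": datetime.strptime(raw["SGNUP_DT"], "%m/%d/%Y").date(),
        }

    def stage_two_to_canonical(normalized: dict) -> dict:
        """Map the normalized record onto the enterprise (canonical) model."""
        return {
            "Person": {
                "name": normalized["customer_name"],
                "relationship": {"type": "Customer",
                                 "since": normalized["signup_date"].isoformat()},
            }
        }

    raw_record = {"CUST_NM": "  ada LOVELACE ", "SGNUP_DT": "07/21/2015"}
    print(stage_two_to_canonical(stage_one_normalize(raw_record)))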

In most organizations, system architects - the guys that control the hardware - will tend to have more political clout than data architects. This often means that data architecture gets short shrift almost from the beginning, with an attitude that almost invariably runs "let's get it into the system first, then we'll muck with making the data useful". For some reason, the latter half of the process doesn't get done anywhere near as often as it needs to, because by then, the project has burned through its funds.

We Don't Need No Stinkin' Data Models

There's a corollary to system architecture taking the driver's seat. There are a number of purposes for building data models for data hubs. One of the most important is making sure that the data is in a form that is most useful for developers to use. The reality with most data sources is that the information within a given database was not designed for broad enterprise consumption; it was designed to serve one specific application.

There are a number of implications from this. Field names may be almost meaningless, or may have specialized meanings that are not obvious from the name itself. Date formats will be all across the board. Data may be missing, erroneous, or incomplete. Tables may have been created and then abandoned, may be flat and sparse, or may have been written by someone who probably shouldn't be building databases. Reference data codes may (indeed, almost certainly will) be opaque or incomprehensible.
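
As a small illustration of the cleanup this implies (the date formats and the status-code table below are hypothetical stand-ins for what actually turns up in source systems):

    from datetime import datetime
    from typing import Optional

    DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]                  # dates "all across the board"
    STATUS_CODES = {"01": "Active", "02": "Suspended", "99": "Unknown"}  # opaque reference data, decoded

    def parse_date(value: str) -> Optional[str]:
        """Try each known format; return ISO 8601, or None so that bad data stays visible."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        return None

    def decode_status(code: str) -> str:
        """Translate a reference code, flagging anything that isn't in the lookup table."""
        return STATUS_CODES.get(code, f"UNMAPPED({code})")

    print(parse_date("21-Jul-2015"), parse_date("garbage"), decode_status("02"), decode_status("47"))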

Beyond this, most relational ER diagrams do not utilize inheritance in any meaningful fashion, which means that a database may end up calling the same thing by multiple field names. Labels are wildly inconsistent. Finally, once you have two or more disparate data systems, the same object will almost certainly have different identifiers, so a comprehensive master data management solution becomes both necessary and difficult to get right once you run into the potential for misidentification.
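
A toy sketch of the identifier side of that problem (the system names and IDs are hypothetical): a cross-reference registry records which local identifiers co-identify the same entity under a single master key. Note that the hard part - deciding that two records really are the same thing - happens before the same_as assertion is made, and a wrong decision there is exactly the misidentification risk mentioned above.

    from typing import Dict, Optional, Tuple

    class IdentifierRegistry:
        """Records (system, local_id) -> master_id links; a stand-in for a real MDM cross-reference."""
        def __init__(self) -> None:
            self._xref: Dict[Tuple[str, str], str] = {}
            self._count = 0

        def register(self, system: str, local_id: str,
                     same_as: Optional[Tuple[str, str]] = None) -> str:
            if same_as is not None:
                master = self._xref[same_as]          # asserted match to an already-known record
            else:
                self._count += 1
                master = f"MASTER-{self._count:06d}"  # mint a new master identifier
            self._xref[(system, local_id)] = master
            return master

        def resolve(self, system: str, local_id: str) -> str:
            return self._xref[(system, local_id)]

    reg = IdentifierRegistry()
    reg.register("crm", "C-1001")
    reg.register("billing", "7734", same_as=("crm", "C-1001"))             # same customer, different local ID
    print(reg.resolve("billing", "7734") == reg.resolve("crm", "C-1001"))  # True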

One role of a canonical model is to provide both a target that makes transformations feasible and a way of determining what is (and what isn't) in scope for a data hub. Yet in my experience canonical models are rare in most data modeling projects. One reason for this is that once you identify and establish a model you are also by default establishing boundaries around what information is canonical and what is not, and this in turn has political overtones as different groups worry either about their budget being tapped or about being left out of the modeling process overall.
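
One way to see why the model draws those boundaries: the moment it is written down, it states what the hub covers. The fragment below is purely hypothetical, but writing even this much forces a decision about which entities are (and are not) canonical.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Organization:
        org_id: str
        legal_name: str

    @dataclass
    class Person:
        person_id: str
        full_name: str
        employers: List[Organization] = field(default_factory=list)

    IN_SCOPE = {Person, Organization}   # anything else has to argue its way into the canonical model

    def assert_in_scope(entity: object) -> None:
        if type(entity) not in IN_SCOPE:
            raise ValueError(f"{type(entity).__name__} is not part of the canonical model")

    assert_in_scope(Person(person_id="P-1", full_name="Ada Lovelace"))   # passes
    # assert_in_scope(MarketingCampaign(...)) would fail until the model is deliberately extended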

Consequently, developing a canonical model usually does not become a priority until after coding has begun, often with the consequence that different development teams end up either working on different versions of a model or working on a model of their own choosing in the absence of anything else. Once code is written, changing the underlying models - logical or physical - will result in rework, and no one wants to rewrite working code.

Coders Gotta Code

I call this one the general contractor curse. A data hub sounds like it's a big project - lots of data, lots of prestige, lots of chances to screw up. You're the general contractor on the project, and you don't want to mess up. So you call out to your friendly neighborhood placement service and say, "I have this big data project, lots of visibility, give me your best people."

The placement service goes into high gear, starts pulling in anyone who's even looked at data funny in the last five years, polishes up their resumes, and by the time of the kick-off you have a whole bunch of developers, a whole bunch of architects (those are developers with halos), and maybe a data modeler or ontologist if you're really lucky. Everyone swears to the god of agile, and the next thing you know your development teams are off and running, busily running around like bees ... and getting absolutely squat done.

They're writing code, but it's likely the architects have at best only a vague notion of what they're trying to create at that point. There's no data, because there's no data model, and so your coders end up using what's available or just making things up on the fly. Your developers are writing APIs, but no one has actually established a set of standards that they are writing those APIs to.

The thing about software development is that it both has a natural cycle and exists in order to create a product or service that people will use. Break that cycle, and things go bad quickly.

Coders take baby steps towards implementing things, because that's the Agile way, but Agile in general works best when there's already an implicit understanding of what you are trying to build. Most "agile" development efforts are in fact waterfall methods with a morning scrum meeting. As these baby steps begin to show just how much work there seemingly is to go, the project manager begins to get worried and adds more staff, most of whom are subsequently set to work attempting to get caught up, and ten-minute scrum calls end up taking an hour.

The talent providers are of course delighted with this turn of events, and turn up the heat even more in order to get more pants in seats. The project goes through an inflationary stage where it balloons outward, and communication channels become more and more attenuated. People begin to forget what they're building, and become more fixated upon what the scrum master rather than the project leader tells them. The scrum master, meanwhile, is trying to make sure that hard objectives get done in two-week blocks (one sprint), because too many undone items carried over reflect badly upon them. This is also usually the point where you see new managers brought in, usually with too little information, and the original goals of the project slip farther and farther away.

At this point, costs are skyrocketing, but very little seems to be getting done that moves people towards the desired goals. The data designers and business analysts do finally manage to get their modeling finished, but by then there's six months of development effort already underway, and nobody is following the design work anymore.

A New Black Box

This is also the time that the original system architects (or program managers) are usually displaced, and someone new comes in, typically with a better black box that they used on their last effort. Right now, that black box is typically a "Big Data" solution like Hadoop, but that's largely just a factor of time - the key is that it represents a major change in coding. This may also end up creating two alternative paths, with two different teams trying to come up with better ways of doing things using competing solutions.

This will typically have some measure of success, if only because after six to nine months a lot of the design work that hadn't been there before now exists, but it is very seldom the new black box that's the saviour here - it's the fact that you now have data, some kind of plan, and a lot of prototype work that can be thrown away taking only the lessons learned (which is, in the end, what a prototype should do).

Of course, if the first technology was the better solution, its use has now been thoroughly discredited - not because it didn't work, but because it had been used badly - and a lot of the developer knowledge that had been picked up gets thrown out the window.

Of course with the new technology, there's a new ramp up, and pretty soon, the same problems that plagued the first group plague the second, with the additional caveat that now a significant proportion of the budget has been blown and you've lost half a year or more.

The result is that many features get thrown out, including ones that were actually achievable with the first implementation. The number of data systems involved gets cut back dramatically, and what was designed increasingly bears no resemblance to what is delivered. Meanwhile, upper management is nervous, now at a year in, because what they promised their stakeholders isn't happening. Developers are worked harder and harder, going into a forced death march, and quality is slipping as a result.

This is true of a lot of software, not just data hubs, but with respect to hubs this usually has a fairly clear result. Only a few systems are integrated, with significant holes in the model. The APIs are complex and fragmented, the system is only just queryable for fairly simple queries on a few primary objects, and anything of interest is outside the scope of those queries. You'd almost have been better off just writing a wrapper around the original database.

Too Many Cooks

Perfectly working code does you no good if it doesn't solve your business requirements. Most programmers tend to focus on their part of the problem, without the context of the larger picture. If they understand the requirements and design, this compartmentalization is useful, but especially when you have large teams of developers brought in late in the cycle, such compartmentalization can mean chaos when integration happens. More than one program has been scuttled because two teams failed to properly coordinate.

I'd place the blame for this on the architecture team. On a large-scale project like a data hub, you will often get a number of architects. They bill higher than programmers do, yet in many cases they provide absolutely no value to the team whatsoever.

The role of mid-level architects sounds simple: they are there to ensure that the software being produced will do what it needs to do. The reality is that this is not an easy thing to accomplish. Architects need to spend time moving from development team to development team, from business analysts and CEOs to coders and back again.

They provide the communication glue that keeps programmers from becoming so compartmentalized that they don't understand what they are putting together; they make sure that when good ideas or problems emerge, other teams are brought into the loop; and they not only establish standards but also check for compliance. In many respects they correspond to editors in a television production.

At the same time, they are not managers. When used properly, architects float. An architect is not in either the management or the programming chain of command. They are there to provide a sanity check, to catch problems before they can derail the whole project, and to constantly assess the priorities for what needs to be done on the project. They also need to communicate with one another, to pass this information on.

For data hubs, the following are generally useful:

  • Systems Architect. Responsible for establishing and coordinating development environments, services, and out-of-the-box software packages. They are responsible for the pipes and stores.
  • Data Architect. Coordinates the interchange of ontologies, the establishment and maintenance of canonical (conceptual), logical and physical data models, messaging formats, and API requirements. They are responsible for what goes through those pipes and into and out of the stores.
  • UX Architect. Responsible for the user experience - the look and feel - of the application.
  • Application Architect. In those situations where the data hub is fronted by a rich middle layer, the application architect may end up managing the overall integration. This is quite frequently a chief architect role.

I've occasionally seen product-specific architects - a MarkLogic architect or a Hadoop architect - employed, but to me this misses the whole point of architects. They are not meant to be compartmentalized. A data architect needs to be aware of what goes into the Hadoop systems and the MarkLogic systems. An architect is not a super-programmer, but a meta-programmer - someone who can stand outside the concerns of the programmer and remain objective about the needs of the project, while still possessing programmer-like knowledge.

When the number of architects on a project grows beyond a handful, you have too many architects. At that point it simply becomes a debating society for under-utilized programmers.

Summary

Critically, most of the reasons why data hub projects fail stem less from technical problems and more from poor management - failure to allow enough time for data design and discovery, too much emphasis on the semantics of individual business objects rather than on their semantics as whole ontologies, and too much emphasis on getting data into data systems and not enough on getting it into a form that is usable by a large number of clients.

Additionally, there is a temptation to treat deploying such hubs as a programming problem, when in reality it is a modeling problem, one where fluid, dynamic and multi-dimensional "soft" schemas are better than the fixed, hard schemas of traditional relational design. Once you start mixing ontologies from multiple sources, you also have to manage identifiers (both between systems and, by extension, when attempting to determine the likelihood that two distinct data sets describe the same entities). While Master Data Management (MDM) tool-sets can be used there, semantics-based solutions are often just as capable and typically can be better integrated into a generalized semantic solution.
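
As a sketch of that semantics-based approach, assuming the rdflib library and entirely hypothetical URIs, identifiers from two systems can be linked with owl:sameAs rather than collapsed into a single hard-coded master key:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()

    crm_customer = EX["crm/customer/C-1001"]     # identifier as known to the CRM
    billing_acct = EX["billing/account/7734"]    # identifier as known to billing

    g.add((crm_customer, RDFS.label, Literal("Ada Lovelace")))
    g.add((crm_customer, OWL.sameAs, billing_acct))   # assert that the two records co-identify

    # Queries can then follow owl:sameAs links instead of depending on a merged master record.
    print(g.serialize(format="turtle"))

Because the sameAs links are just more data, they can be added, retracted, or qualified as evidence changes, which is harder to do once identifiers have been physically merged.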

Finally, shifting the operational bias away from business objects (or business organization) and towards a more functional approach (data, systems, UX, application) can go a long way towards developing modular libraries, common and consistent code, and significantly reduced development timelines. Moreover, by deploying your architects as communication glue between such teams, you can reduce the risk of unexpected surprises, especially if the focus is shifted from building ad-hoc modular demos to building iterative (and increasingly rich) proofs of concept.

In my next post, I'll be looking at factors that can make data hub deployments more successful.

Want to build a Better Data Hub? Check out https://www.dhirubhai.net/pulse/building-better-data-hub-kurt-cagle, the next article in this series.

Kurt Cagle is the founder of Semantical, LLC.

Comments

I'd add that it's actually quite simple. Realise that it's a vision/business problem first; that leads on to the data architecture. The technology and tooling is just an implementation OF the solution and NOT the solution. DB2/Oracle/SQL Server/Hadoop - who cares, it's irrelevant. SSIS/PLSQL/C++/Informatica - who cares, it's irrelevant.

Kurt Cagle:

Linked In - I was rather taken with it myself. Hmmm ....

Olivier D. (Director of Information Systems):

Thanks a lot, Kurt Cagle, for bringing all these real issues together in one article! All I can add is that semantic ontologies seem to me the next step to get these big data projects on the rails. They will definitely bring the meaning and shared understanding of "what is that data about" that is currently missing. As has been said previously, all the data in the world has no value if you cannot understand it. Value in data is as if it were encrypted... no one is able to use it unless you can model it in a universal way for the organisation, which is actually the REAL challenge: getting people to agree on a shared model... it's an organisational/people problem, not a technical problem. I strongly believe ontologies help to unify people's models and build a common understanding of what data exists.

Kurt Cagle:

I'll definitely look into it.
