Building a Better Data Hub

In a recent post, Why Big Data Hub Projects Fail, I spent some time looking at the (many) factors that can make data hubs fail, and promised that a future post would look at what can be done to make such hubs succeed. This post is long, but I hope to cover a number of key points in it.

Figure Out What You're Trying to Build

This may seem like simple advice, but it is actually one of the biggest factors that trips up many data hubs. The role of a data hub generally shouldn't be to replicate all of the databases that you currently have, unless your primary purpose is to create an archival store.

Instead, your objective should be to get the information that the database contains into a central hub that lets it relate to all of the information that your enterprise has. A good illustration would be to consider two systems in a health insurance context - one which contains individual demographic information about its members, the second of which contains membership information and associated plans. If you're lucky, the two may end up with the same keys for individuals (a membership plan, for example, would apply to one or more people as enrollees and another person as the subscriber).

If you're really lucky, then the keys in each database are the same for the same entities. More than likely, however, they aren't - requiring a master data management engine that will resolve from one to the other. You could store the keys separately and do the matching at run time (which is an example of effectively duplicating records) or you could make an effort to process the information so that there is a clear, known relationship between the individuals and the memberships. This means that you've created a new model, rather than simply duplicating the old.

By building up a common model that works well within the data hub, you can radically simplify the kinds of queries you need to write, and by extension can make many of these queries more automated. However, this necessitates that you go beyond a one-to-one mapping of data source and data hub, and instead see the data hub as a way of getting at the informational content of the source database, not just its internal representation.

Good Governance Is Essential

Governance has gone from being mundane (and frankly, boring) to becoming one of the most important aspects of good data architecture design. Part of the reason for this shift is that data systems (especially NoSQL systems) are increasingly able to store metadata about the data, and as such governance is moving away from a largely management-focused activity to instead becoming an integral part of automated data design and development.

Typically, in integrated systems, a flow emerges - you cannot have two systems of record that cover the same data without some question of both governance and priority: which system takes precedence when a conflict occurs? Governance also determines whether a system can migrate content back into a source database. In a purely relational world, the answer depends almost solely upon whether a system is designated as the system of record or not, but in less traditional media, it becomes possible to work with data at a more granular level, making managing governance a considerably more complex process.

Indeed, there is an innate flow of information in most large scale enterprise systems, and unless part of the role of the data hub is to also be a governable data source, it is usually far more difficult (and often rather pointless) to work against this flow.

Governance plays a big part in this. In the case of a single database, governance is largely an academic issue. Once you get into multiple database questions, however, governance begins to impact architecture in a big way. In traditional integration projects involving multiple data stores, you can only have one data store of record - the authoritative store. If it is an authoritative store, the database can only act as a source. If it is a non-authoritative store, it can only act as a sink. 

However, this gets more complex when you start talking about virtual data hubs, such as a semantic store. Certain information within the hub may be considered non-authoritative: it merely reflects data that comes from an external source. In other cases, some information may actually be resident within the hub, and other systems then reflect the state of that hub (reference data is a very typical example of this).

This means that a semantic data hub in particular does not obey the normal rules of governance, because it may end up incorporating both governed (authoritative) and ungoverned (non-authoritative) content within the same basic data structures. 

One approach that can help here is to make use of graphs. In a purely semantic database, a graph can be thought of as a named collection of statements, but in many NoSQL databases the same idea can be extended to collections of documents. The key idea is that everything within the graph satisfies a particular governance rule - the data is either authoritative (and consequently malleable) or is non-authoritative and consequently read only (or write only) to external processes. 
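
To make this concrete, here is a minimal sketch in SPARQL Update of what graph-per-governance-boundary storage might look like. The graph names, prefix and predicates are hypothetical, not a prescribed vocabulary; the point is simply that each named graph carries exactly one governance rule.

    PREFIX hub: <http://example.org/hub#>

    INSERT DATA {
      # Snapshot ingested from source system A: non-authoritative in the hub,
      # so external processes treat this graph as read-only.
      GRAPH <http://example.org/graphs/sourceA> {
        <http://example.org/sourceA/member/1001> hub:familyName "Smith" ;
                                                 hub:givenName  "Jane" .
      }
      # Reference data managed inside the hub: authoritative, and writable here.
      GRAPH <http://example.org/graphs/referenceData> {
        <http://example.org/ref/planType/PPO> hub:prefLabel "Preferred Provider Organization" .
      }
    }

Because the rule attaches to the graph rather than to individual statements, governance can be enforced with a single check on the graph URI rather than statement by statement.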

As a general rule of thumb, when a hub is part of a multi-store system, it's more than likely that entities - things - will make up the bulk of the data that comes from the outside. For instance, in a health insurance setting, patients, providers, companies, insurance plans and so forth would all be considered "things" in your system, and more than likely are already tracked by one or more existing systems. These may, of course, track such entities in different ways with different identifiers, but it's likely that any representation of these entities within the hub will be reflective of some external system.

Reference data (or categorical information), on the other hand, generally will be managed within a hub. Put more accurately, the category concepts themselves will be stored in the hub, but may link to external source content.

In both cases, the hub performs a conceptual master data management (MDM) role. As an example, suppose that each source system identified a given person (using the redoubtable - and adorkable - Zooey Deschanel as our prototype here). Typically, each system will have its own identifier. A hub system assigns a universal identifier (typically a GUID or uniform resource identifier (URI)) to that particular person within the hub, then creates associations to the representation of that individual within each distinct "graph" of information.
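
A minimal sketch of those associations as triples follows; the URIs and the hub:hasMasterConcept predicate are illustrative assumptions, not a fixed vocabulary. The link runs from each source record to the hub's universal identifier, matching the "has master concept of" reading described below.

    PREFIX hub: <http://example.org/hub#>

    INSERT DATA {
      # The hub mints one universal identifier (a URI) for the person -
      # Zooey Deschanel in the running example.
      <http://example.org/hub/person/8d3f2c1a> a hub:CoreSubject .

      # Each source system's own record points at that core subject.
      <http://example.org/sourceA/member/1001>     hub:hasMasterConcept <http://example.org/hub/person/8d3f2c1a> .
      <http://example.org/sourceB/enrollee/ZD-447> hub:hasMasterConcept <http://example.org/hub/person/8d3f2c1a> .
    }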

The arrows here may seem a little backwards - you may think it would make more sense to have the arrows pointing from the hub to each of the individual records. However, in this case the arrows indicate which data set - the hub or the source - contains the "tail" of the link. You can interpret the arrow as the operation "has master concept of", with the reverse direction being "is source concept of".

By having a central hub concept, you can minimize the complexity of dealing with master data - instead of needing one link between each entity and each other entity (here requiring six links total) you can get away with four, with two hops from any instance to any other instance. The alternative might look more like the following:

This concept hub approach has a number of advantages, beyond reducing complexity. For instance, system A and system B will most likely have different information about the same individual; one may have demographic information (names, address and contact information as an example) while a second may have activity information (when were they on the application, what did they do there, what did they purchase).

By using a conceptual hub (a core subject, as one client described it) for each entity, you can also make it possible to determine the governance rules for data at a granular level.  If the core is unadorned (the data resides in the graphs of the individual sources, rather than attached to the "master" record) then you can clearly mark some of those graphs as "read-only", some as "read-write" and a few as "write-only".

If on the other hand, the core contains additional metadata, this metadata should be treated as fully derived from the source subjects (even if one of those sources is a read-write database hosted within the hub itself). Put another way, hub metadata is determined by a transformation on the associated source data, and is never altered directly.

This serves two purposes. First, it provides a way of setting governance priorities - if you have two sources for some piece of data (such as a birth date), then the transformation lets you determine which of the two sources is considered the most authoritative, which is the second and so forth. This can be especially useful with dirty or periodically incomplete records - if (the more authoritative) source A doesn't have this information, then source B would provide its own. If at some future point source A is updated with this information, then it would become the datum of record.
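
As a hedged sketch of such a precedence transformation (reusing the hypothetical graph names and hub:hasMasterConcept links from the earlier examples, plus an assumed hub:birthDate property), a SPARQL COALESCE expresses "prefer source A, fall back to source B" directly:

    PREFIX hub: <http://example.org/hub#>

    # Derive each core subject's birth date, preferring source A over source B.
    SELECT ?person (COALESCE(?dobA, ?dobB) AS ?birthDate)
    WHERE {
      ?person a hub:CoreSubject .
      OPTIONAL {
        GRAPH <http://example.org/graphs/sourceA> { ?recA hub:birthDate ?dobA }
        ?recA hub:hasMasterConcept ?person .
      }
      OPTIONAL {
        GRAPH <http://example.org/graphs/sourceB> { ?recB hub:birthDate ?dobB }
        ?recB hub:hasMasterConcept ?person .
      }
    }

If source A later acquires the missing value, re-running the same transformation promotes it to the datum of record with no change to the query.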

The second purpose is that it makes it possible to create a read-write source within the hub itself while not disrupting the core subject's supremacy. If you have a service that communicates with this secondary read-write graph (say a user interface updating records), then when the document gets updated, this will trigger a transformation that will update the core record, but only with what's deemed the most authoritative content (it also provides for a workflow that allows for approvals before such a change is committed).

You Don't Know What You Don't Know

In building data hubs, one of the first assumptions made is that inbound data will cleanly map to outbound data. The reality, usually encountered when you open up the second database to be mapped, is that data is messy. Critical data may be missing, may be duplicated, may have unexplained placeholders. Controlled vocabularies are (incomplete) sequences of terms. Keys may be absent or may require heavy processing to make useful. Fields may be created then never used. Data blocks may be repeated multiple times, each with slightly or radically different information theoretically describing the same thing. Data fields may have names like "FN", "MN", "GN" and other seemingly meaningless labels.

This means that as you bring data streams in, you have to assume that all data is unknown until proven otherwise. Indeed, the mechanics of data integration - getting data from one system to the next - are often the easiest part of a data integration project. The hard part is dealing with incomplete, dirty, poorly modeled sludge and then extracting viable meaning from that. It is one reason why few ETL tools are really all that useful - the "T" part of that acronym, Transformation, is usually expensive, time consuming and risky. 

Another reason that I like semantic hubs is that they provide a means for moving data from external systems into a single consistent format, and then a means by which internal (source) models can be mapped in real time to a canonical target without having to commit to this process at ingest time. I don't need to know what the source formats are, beyond some structural basics, to bring the data in. Once the data IS in, however, it can be researched, clarified and refined, and these actions can be done at the model level rather than within individual fields.

Between Normal and Abby Normal

I have been working with XML pretty much from its inception in the late 1990s. At one time, I would have argued vociferously that XML was THE form in which to store information. I don't think that any longer, at least not for certain types of data.

There's a tendency in the database world to talk about structured and unstructured data. Structured data is usually a synonym for relational data - information that is normalized into tables and columns and connected by key references. Unstructured data is everything else. However, this view tends to hide the fact that very little information in the data world has no structure. A CSV file is structured, albeit very simply. A web page has a clear structure, but with a lot of variation within that structure, as does a DITA file or a Microsoft Word document or a spreadsheet.

All of these are, mathematically, forms of network graphs, which underlie most data structures in computer programming. RDF is one way of describing such graphs, and it's significant in that you can in theory subsume all other database forms (except for certain other types of graph databases) using RDF, because RDF can be infinitely normalized. This means that, among other things, you can represent any (and every) relational database in RDF without losing information.

Now, if that is the case, then why can't you represent a book (or even a web page) in RDF? The simple answer is that you can, but it would be unwieldy, because the one thing that distinguishes a book from a database record is sequentiality. A book is a narrative structure. One paragraph follows the next, one chapter follows the next. RDF (and, for that matter, most relational databases) has no implicit concept of ordering.

It is possible to create weighted ordering in RDF (adding a number, date, or text collation key to indicate order). It is also possible to create a metalayer of ordering using RDF reification, where each triple has its own identifier and a structure of the form "statement one is followed by statement two". However, this adds considerably to complexity and hence to computational overhead.
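
As a small sketch of the first option (the predicates and URIs are hypothetical), each paragraph carries an explicit sequence number, and a query reassembles the narrative with an ORDER BY:

    PREFIX hub: <http://example.org/hub#>

    # Reassemble a chapter's paragraphs in narrative order using an explicit
    # sequence-number property - the "weighted ordering" approach.
    SELECT ?para ?text
    WHERE {
      <http://example.org/book/chapter/1> hub:hasParagraph ?para .
      ?para hub:sequenceNumber ?n ;
            hub:text ?text .
    }
    ORDER BY ?n

The ordering works, but every narrative relationship has to be materialized and then sorted at query time, which is exactly the overhead described above.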

These "unstructured" documents consequently work great in XML databases, since XML does have a concept of ordering. Note that in the most heavily used XML databases, internally, the XML structures are stored in almost exactly the way that I described above - they have already been sharded and normalized, but exposing this meta-layer at a higher level would be problematic at best. 

However, once you break the requirement for implicit ordering, it turns out that the XML structure has some major limitations of its own. It has a preferred root node. If I wanted to search across all addresses in a set of XML documents (say of individuals), I would first end up having to find the path to the Address element, retrieve the document that holds this address, then apply the path filter a second time on the document. This makes things like outer joins very complex to build.

Additionally, while XML is very good for container/contained relationships, it's not really good at all when dealing with linkages between resources - especially when the same resource is referenced from more than one structure (as happens quite frequently with relational data sources). XML structures, however, do very well when you want to create a "local" snapshot of a data model, if you aren't necessarily worried about maintaining external duplication of objects.

As it turns out, you can serialize out sets of RDF triples as a rather ungainly looking XML structure, one that has the added benefit of being fully normalized. You can also (more to the point) take this normal form and use it to reconstruct an XML document that denormalizes it - without needing to know more than very rudimentary semantics about the data model of the document.

Now, this may not seem like a big deal, but there are actually huge benefits to working with XML in this manner, as something not so much stored as generated. In essence, the idea is that for any relational data model, there is in fact one denormalized XML scheme that requires no expensive transformations, no one writing XSLT or XQuery code beyond a single "zipper" function that zips the XML together. This may not necessarily be what a group of data modelers would put together, but it does represent a ground state that will usually very closely match an entity relationship diagram from a relational database. Transforming from this to a desired external form is likely to be far easier and less time consuming than trying to build these manually.

The only other thing that would make this even more universal would be a filter that would present a semantic triple store as a relational database for both read and write operations. There are some triple stores that do that already (the Virtuoso database is a pretty decent example of this). This isn't necessarily the most optimal use of a triple store, but until such tools as Tableau start consuming SPARQL natively, this would be a critical next step for any hybrid data store (looking especially at MarkLogic on this one).

Solving the Operational Gap

How long does it take one skilled programmer to write a semantic data hub? The answer is about twelve weeks, from my own experiences. How long does it take for fifty programmers to create a semantic data hub? The average seems to be between three and four years, if it gets written at all.

This is a huge discrepancy. Part of this is that when you have fifty programmers, chances are good to near certain that you are dealing with large numbers of Java programmers trying to do very specialized tasks on a Hadoop system. Hadoop has no native semantics layer, and few Java programmers regularly work with semantics, even though there are a number of perfectly serviceable Java triple stores (including the Apache Jena/Fuseki stack, which, while having limitations for scalability, is a perfect tool for prototyping).

However, there is also a discrepancy that exists between a functional and an operational way of dealing with data that can turn a relatively straightforward programming exercise into a death march. The paradigm I'm going to use as a counter to Hadoop here is (not surprisingly) MarkLogic, but it can be adapted with something like node.js, eXist-db and Jena/Fuseki (and maybe Saxon).

Most semantic hubs perform four primary functions:

  • Ingest content from two or more databases (relational or otherwise) and convert these into a one-to-one semantic representation within the hub itself. Each of these becomes a source graph.
  • Identify, upon ingest, inbound equivalents of primary objects - individuals, products, organizations and so forth - and then perform a master data management function to either locate a likely related core subject or to create a new core subject if one doesn't exist. These are entity nodes.
  • Map fields from each subject in the source graphs to the hub graphs. This will usually be accomplished through some kind of inferencing rule-set, and much of it comes down to identifying mappings between controlled vocabularies (see the sketch after this list).
  • Generate and expose queries for navigating and finding detailed information for various objects, perhaps with customization for more complex entities.
  • There may be some kind of related GUI for navigating content, a graphic visualization layer, and a means for reading/writing to an internal graph.
  • There may also be a semantic enrichment layer for processing documents, images and so forth, which would add some complexity, but not a huge amount.
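
As a rough sketch of the vocabulary mapping mentioned in the third item above (the hub URIs are hypothetical; skos:exactMatch is the standard SKOS mapping property), a handful of assertions linking source codes to the hub's reference concepts is often most of the "business logic" the hub needs:

    PREFIX hub:  <http://example.org/hub#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    INSERT DATA {
      # Map each source system's plan codes onto the hub's reference concept.
      GRAPH <http://example.org/graphs/referenceData> {
        <http://example.org/ref/planType/PPO>
            skos:exactMatch <http://example.org/sourceA/planCode/P01> ,
                            <http://example.org/sourceB/PLAN_PPO> .
      }
    }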

That's it. Beyond the core division between entities, categories and properties, there are no business objects or logic that is specific to business objects within the application layer.

Now, once this process is done, most of the work consists of refinements to and mapping of the semantic core model (the canonical model), writing edge-case queries and establishing richer inferences. In practice this refinement can go on for years, but it's evolutionary - in general, once you establish the core objects, most of the work comes down to filling in the details and determining what has priority in terms of governance.

So where does this go wrong? Usually, the biggest reason tends to be that people fixate on the business objects and business model way too early, and they want to do too much of this work outside the data hub. There is usually a huge amount of contextual information that is in the existing data hub - controlled vocabularies, logical model relationships, bindings between entities and so forth.

For some reason, people want to do master data management outside such hubs, when in point of fact the largest collection of information about a given entity will likely already be within the hub itself. People want to do transformations using Java code, which is property-centric, rather than XSLT or XQuery, which are template-pattern centric, and even when they do use XSLT they want to use the default, fourteen-year-old Xalan XSLT 1.0 package, which is an order of magnitude more primitive than the Saxon XSLT 2.0 or, especially, XSLT 3.0 transforms.

To make this plain - the success or failure of a data hub will be based largely upon the ability of that hub to transform content. The internal format of storage needs to be flexible enough to drive that transformation, but the reality of the world in 2016 is that your customers expect your data in the format which most closely matches their needs, not what happens to be most convenient for you.

 

The Best Hubs are Grown, Not Built

All too often, people want data hubs to be super-databases - ones that will be responsible for updating their other data systems. That's generally not a cost effective strategy.

The best hubs usually start out with reference data management (RDM), working with controlled vocabularies, then building out from there to pull in different entities. RDM also lets you lay the foundation for master data management (MDM) - both managing entity terms from different ontologies and providing comparison analysis (the likelihood that two individuals with a similar set of characteristics are the same person). With both RDM and MDM in place, you then have the ability to grow systems with new data.

One other consequence of choosing a growth strategy is that over time your system becomes more intelligent. Different controlled vocabularies provide more entry points to core subjects, interconnections become more robust and informative, and you develop a stronger underlying meta-model as sub properties and classes become evident.

Your system gets to the point where you can reduce the overall number of properties because (as an example) a manufacturer is also a corporation, which is also an organization. This ability to do inferencing provides both fine-grained control over specific properties and broader queries common to a base class (all organizations have an address, as an example, but only manufacturers have product lines).
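
A brief sketch of what that looks like in practice (class and property URIs are assumed for illustration): declare the subclass chain once, and a SPARQL property path lets a single query cover every kind of organization without enumerating the subclasses.

    PREFIX hub:  <http://example.org/hub#>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Assumes the model declares hub:Manufacturer rdfs:subClassOf hub:Corporation
    # and hub:Corporation rdfs:subClassOf hub:Organization.
    # Finds every organization (manufacturers included) that has an address.
    SELECT ?org ?address
    WHERE {
      ?org rdf:type/rdfs:subClassOf* hub:Organization ;
           hub:address ?address .
    }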

Keep Expectations Focused and Manageable

Too many data hubs die before they're born because they start out as enterprise-wide initiatives, and the data hubs that do survive quite often are the ones that start out as skunkworks projects.

Prove out the ingestion pipeline on different data systems before announcing it to all comers. Check performance, load and scalability on test data before going to live data, and never assume that you'll get the live data you want when you're ready for it. Build out test applications that use the data feeds so that people can see what can be done with the project, rather than trying to sell people on how great it will be. 

One reason that I like working with semantic hubs is that, since most of the business logic and even the metalogic resides within the data, you can start out small on a system like Jena, then move up to a more powerful system in a way that scales gracefully. Specialized query functions may need to be rewritten, but in general SPARQL queries are surprisingly portable in a way that's not true of most other data languages. (Indeed, that mix of data and model portability has long been a feature of RDF.) SPARQL Update is even more portable, in that you can not only load in data but you can also systematically apply data transformations and constraints in a generally uniform and consistent way from one system to the next.
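
A tiny example of the kind of portable transformation this enables (the property names are hypothetical): a SPARQL 1.1 DELETE/INSERT that renames a property across the whole store, which should run unchanged against Jena/Fuseki, MarkLogic or any other SPARQL 1.1 Update endpoint.

    PREFIX hub: <http://example.org/hub#>

    # Normalize an older property name onto the canonical one, store-wide.
    DELETE { ?s hub:birthDate   ?value }
    INSERT { ?s hub:dateOfBirth ?value }
    WHERE  { ?s hub:birthDate   ?value }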

A final point here - many larger organizations tend to have development teams and efforts that are resistant to change; a new data platform can be seen as a threat both to developers of a different platform and those managers who traditionally have controlled the flow of data. Hubs by their very nature do both. This is one of the reasons why you should start a data hub within a fairly controlled environment: by giving the hub a chance to mature and find applications naturally, you can often avoid dealing with larger scale political issues.

Wrap Up

There are a lot of fairly deep technical issues that go into making a data hub (especially a semantic one) work, and this is not the venue to go into those in depth. However, in general, the issues that plague data hubs and data lakes are very similar to those that plagued data warehouses twenty years ago: insufficient data architecture planning, overestimation of (and consequent resource deployment for) the systems side over the data side, the assumption that building a central data system should be a large, manpower-intensive operation (when in fact the opposite is usually the case), and attempting to build a large-scale data system to do everything when a smaller adaptive system that can scale is usually both technically and politically better.

Kurt Cagle is the founder of Semantical, LLC.

Tim Shear

Founder, CTO at Dataparency, LLC


I would propose an architecture where all data is stored in a hierarchical database, structured by root-level entities containing all entity attributes within a hierarchy grouped by semantic relevance/ownership. A meta-model of the entity could describe the data structure and governance issues such as privacy/security. This hierarchical model of the entity would be freely extensible to meet future data needs. A graph query such as GraphQL could be used to marshal data in and out of the database. Alternatively, REST or RDF-type queries could be implemented. Entities would have unique identifiers and be 'sharded'/assigned across multiple data servers allowing for hundreds of millions of entities and performant access. All existing data could be marshalled in or out of the database through existing ETL techniques. Additionally, a model called a 'data custodian' could be implemented on top of the database to manage privacy/security/governance per the rules declared by the data owner. This would allow data sharing among businesses for exploitation of co-operative endeavours, for example, a manufacturer sharing production schedules with a logistics supplier, allowing the logistics supplier to better allocate resources and offer a reduced price. This is the technology we're creating at Dataparency.

Kurt Cagle

Editor In Chief @ The Cagle Report | Ontologist | Author | Iconoclast


Paul, a data factory model is actually quite appropriate here. It's one of the reasons as well why I think that at least partial normalization of data makes sense - it is generally easier to work with decomposed data content and denormalize it - XMLify or JSONify it, if you will - than it is to go the other way around.

Dennis Hamilton

Golden Geek: Independent Computer-Science Scholar


Intense. I must reflect on this a great deal. I'll speak without having done that. First observation: the amount of "semantic" knowledge required to work this involves considerable craft and especially subject-matter knowledge in getting things right and curating the ongoing system. The "what is it about, for, and on behalf of" will take some brain-wrenching analysis beyond mere data wrangling (if there is anything mere about it). I suspect that the principles you raise about non-destructive handling of sources (if I have that right) will be quite important, especially when first tries have to be refactored a few times. Second observation: I wonder how this works when a source corpus contains versions in flight, so that not all records are of the same structure or perhaps not even notionally equivalent in all respects. I'm thinking about when access is infrequent and/or there are other reasons why recoding is not done on older materials, perhaps because they are indelible. I think this might add a different dimension to all this. Thinking of such a source as a mutating composite might be too heavy-handed? A friend of mine is teaching an I-School master's-level course involving data and I wonder what he'll make of this.

Athanassios Hatzis

IT Solutions/Systems Expert - Researcher


Nice try Kurt, but you are missing one of the most important solutions, based on an associative kind of data modelling. The principles of this data modelling, i.e. single instance, omnipotent view from any data point to any other data point, fully normalized, fully connected (bidirectional), exist in QlikView, AtomicDB (now X10SYS.com), and SentencesDB. This data model is covered fully in my conceptual data modeling framework of R3DM/S3DM. Stay tuned for my post on this.

