Is Hadoop EDW's Rowdy Child?
Charlie Greenberg
Strategic Product Marketing Leader with expertise in go-to-market strategies, content creation, and sales enablement for enterprise software solutions.
The “Father of Data Warehousing” has often said that a centralized, corporate data model is essential; in my opinion, this means that an open-source framework like Hadoop would be considered the rowdy child of enterprise data warehousing (EDW).
Over the last 20-plus years, Bill Inmon has been recognized by the information processing industry and academia as the “Father of Data Warehousing.” He literally wrote the book - or at least one of them: “Building the Data Warehouse,” his first on the subject, published in 1992.
While fathering the EDW, Mr. Inmon has long been at odds over EDW architecture with fellow data warehousing authority Ralph Kimball. The “Kimballites” (as discussed in “Data Warehouse Design - Inmon Vs. Kimball”) advocate a federated data mart architecture for integration across heterogeneous systems and departments. This challenges Bill Inmon’s more centralized EDW approach, which is built on a single, corporate data model.
Federation, Dr. Kimball says, provides the more agile solution for incrementally adding data sources, thereby creating a scale-as-you-go methodology for handling very large data volumes – or Big Data.
He also views his federated approach as a transferable best practice for successfully supporting Hadoop.
Mr. Inmon does not agree, believing his centralized (though much more complex) uber-model is the only way to produce an enterprise-wide, single version of truth. While recognizing that even today’s EDW must include all manner of metadata and unstructured data, Bill Inmon still insists on a centralized, corporate data model to fuse it all together. Without that centralized approach, Hadoop is viewed as EDW’s rowdy child.
Intrinsic Data Quality?
But when it comes to managing data quality, both EDW and Hadoop face new and difficult challenges – not only due to greater volumes of structured operational data – but also because of the large expansion of unstructured data generated through the Internet of Things (IoT).
Regarding traditional data, many companies consolidating enterprise databases within data warehouses still depend on data quality (DQ) engines and ETL tools to address data quality. But because those databases remain siloed within the EDW, these data quality functions cannot perform holistically, or create a true and governed, single version of truth.
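To make that limitation concrete, here is a minimal sketch of the kind of rule-based check a DQ engine applies while an ETL job loads one source extract into warehouse staging. The column names, rules and reject handling are illustrative assumptions, not the behavior of any particular DQ product - and because the pass runs against a single source, it shows exactly why per-silo checks alone cannot produce a cross-system single version of truth.

```python
# Illustrative only: a rule-based data quality pass applied during ETL staging.
# Column names and validation rules are hypothetical, not a specific product's API.
import csv
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_row(row):
    """Return a list of rule violations for a single customer record."""
    errors = []
    if not row.get("customer_id", "").strip():
        errors.append("missing customer_id")
    if not EMAIL_RE.match(row.get("email", "")):
        errors.append("malformed email")
    if row.get("country", "") not in {"US", "CA", "GB", "DE"}:
        errors.append("unknown country code")
    return errors

def run_dq_pass(source_csv, clean_csv, reject_csv):
    """Split one source extract into clean rows (loaded onward) and rejected rows."""
    with open(source_csv, newline="") as src, \
         open(clean_csv, "w", newline="") as good, \
         open(reject_csv, "w", newline="") as bad:
        reader = csv.DictReader(src)
        reject_fields = reader.fieldnames + ["dq_errors"]
        good_writer = csv.DictWriter(good, fieldnames=reader.fieldnames)
        bad_writer = csv.DictWriter(bad, fieldnames=reject_fields)
        good_writer.writeheader()
        bad_writer.writeheader()
        for row in reader:
            errors = validate_row(row)
            if errors:
                row["dq_errors"] = "; ".join(errors)
                bad_writer.writerow(row)   # quarantined for remediation
            else:
                good_writer.writerow(row)  # passed on to the warehouse load
```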
MDM (Master Data Management) has long partnered with EDW to consolidate, merge and govern disparate databases, supporting EDW’s mission to generate accurate and trusted reporting and analytics.
But can MDM support Hadoop? Or, as some IT enterprises are beginning to wonder, are Big Data, Hadoop and the goal of achieving good and consistent data incompatible?
In fact, MDM is now evolving toward the development of Big Data and Hadoop use case strategies, including:
- Providing out-of-the-box Hive connectivity to import, manage, organize and synchronize not only traditional structured operational data, but also unstructured IoT data, including technical metadata, sensor data, and audio and video structures (a minimal connectivity sketch follows this list).
- Categorizing Hadoop’s Big Data sprawl in order to move toward a single version of truth.
- Creating logical/physical technical and business hierarchies.
- Providing bi-directional synchronization between Hadoop, business systems and the MDM hub.
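As a rough illustration of the first point, the sketch below pulls candidate records out of a Hive table so they can be standardized, matched and merged against golden records in an MDM hub. It uses the open-source PyHive client; the host, database, table and column names are assumptions made for the example, not the interface of any specific MDM product.

```python
# Illustrative sketch: reading candidate records from Hive for an MDM matching step.
# Uses the open-source PyHive client; host, table and columns are assumed examples.
from pyhive import hive

def fetch_hive_candidates(host="hive.example.com", port=10000,
                          database="iot_landing", limit=1000):
    """Read device/customer records from a Hive table for downstream MDM matching."""
    conn = hive.Connection(host=host, port=port, database=database)
    try:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT device_id, customer_id, last_reading_ts "
            "FROM sensor_readings LIMIT %d" % limit
        )
        columns = [desc[0] for desc in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        conn.close()

# The returned dictionaries would then be cleansed, matched and merged in the hub,
# with surviving golden records synchronized back to Hadoop and the business
# systems - the bi-directional flow named in the list's last point.
if __name__ == "__main__":
    for record in fetch_hive_candidates(limit=10):
        print(record)
```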
Arguably, IoT has helped bring about Big Data and, in turn, Big Data has given us Hadoop. Together (and to borrow from Star Trek), they now insist that Master Data Management go where no MDM has gone before.