Data Lakes become Swamps when you don’t think about control
At Strata Hadoop Singapore next week I’m presenting “Stopping your Data Lake becoming a Swamp”, and the point of the talk is that Data Lakes aren’t magic. Just “putting all your data into Hadoop” isn’t going to solve your information access problems; it simply gives you a digital landfill, one place where you can’t find anything useful.
Control of ingestion is at the heart of a strong Data Lake: governance has to start with verifying and cataloging information as it is added. Once data is ingested it’s far too late to add governance; you’ve already accepted that the information will be ad hoc, and retrofitting structure is an incredibly painful task that often proves impossible.
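As a rough illustration of what “verify and catalog as it is added” could look like, here is a minimal Python sketch. Everything in it is a hypothetical stand-in rather than part of any Hadoop or ODPi tooling: the `IngestionGate` and `CatalogEntry` names, the required fields (owner, source system, declared columns) and the in-memory catalog are all assumptions chosen to keep the example small.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata recorded for every dataset accepted into the lake (hypothetical)."""
    name: str
    owner: str
    source_system: str
    columns: list
    checksum: str
    ingested_at: str


@dataclass
class IngestionGate:
    """Refuses data that cannot be verified and cataloged on the way in."""
    catalog: dict = field(default_factory=dict)

    def ingest(self, name, owner, source_system, columns, rows):
        # Verify: ownership and provenance must be declared up front.
        if not owner or not source_system:
            raise ValueError("owner and source_system are required")
        if not columns:
            raise ValueError("a declared column list is required")
        # Verify: every row (a dict) must have exactly the declared columns.
        expected = set(columns)
        for i, row in enumerate(rows):
            if set(row) != expected:
                raise ValueError(f"row {i} does not match declared columns")
        # Catalog: record what was accepted, from where, and by whom.
        digest = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()
        entry = CatalogEntry(
            name=name,
            owner=owner,
            source_system=source_system,
            columns=list(columns),
            checksum=digest,
            ingested_at=datetime.now(timezone.utc).isoformat(),
        )
        self.catalog[name] = entry
        return entry
```

The design choice worth noting is that rejection happens before anything is stored: a dataset with no owner, no source, or rows that do not match the declared columns never reaches the catalog, so there is nothing to retrofit later. A real lake would write to a metadata service rather than a dictionary.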
That brings us to the second important part of governance: standardization. It’s all too easy in the arena of open source technologies to take a ‘whatever is shiny today’ strategy and download the latest and greatest tool that was blogged about, but that misses the whole point: we live in a world where Excel is the #1 BI tool by a considerable margin. That world doesn’t need you to keep throwing new technology at it; it needs you to industrialize so that you deliver the same level of responsiveness the business gets from Excel. It doesn’t need you to waste hours, days or even weeks trying to get two technologies to work together; it needs you to focus on getting the job done, not on getting the technology working.
This latter point is where ODPi comes in, getting companies together to agree on what is actually required and providing a firm foundation for product and Big Data developers to build from. By reducing the risk of the moving parts, it becomes possible to shift the effort towards the outcome and away from the technology.
That is how you stop your lake becoming a swamp: govern on ingestion and standardize the technology. They aren’t the only things, but they are the two most important.
8 years ago: All the best, Steve Jones! When I visualize governing the ingestion process, I see the equivalent of controlling the flow of water from a melting glacier. It is not easy. Channeling the flows, categorizing them by use case and applicability, and applying policies and control measures is a gigantic task which, as you mentioned, is being missed these days, turning the lake into a swamp. One of the visible barriers to governing ingestion is the urgency from the business to offload its legacy database environment onto a big data platform just to save on data hosting costs, which cascades into a multitude of problems downstream. And getting off Excel is a battle every person in our work stream has been fighting for decades. Operationalizing BI on a new platform with a mandate to match the flexibility of a tool they have lived with all their lives is a tough sell but a must-have investment. Open source has a plethora of features that one cannot miss out on, and it should be leveraged to gain a competitive advantage.
8 years ago: Agree with you, Steve; governance is an important factor.
8 years ago: Looking forward to hearing your speech, to understand what the other two vital things to consider are to stop our lake becoming a swamp.