Data Lakes become Swamps when you don’t think about control

At Strata Hadoop Singapore next week I’m presenting on “Stopping your Data Lake becoming a Swamp”, and the point of the talk is that Data Lakes aren’t magic. Just “putting all your data into Hadoop” isn’t going to solve your information access problems – it simply gives you a digital landfill: one place where you can’t find anything useful.

Control of ingestion is at the heart of a strong Data Lake; the need to verify and catalog information as it is added is where governance has to start. Once data is ingested it’s far too late to add governance; you’ve already accepted that information will be ad hoc, and retrofitting structure is an incredibly painful task that often proves impossible.
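As a rough illustration of governing on ingestion, the sketch below shows a gate that refuses data unless it has a named owner and matches a declared schema, and that writes a catalog entry for everything it admits. The names (`IngestionGate`, `CatalogEntry`, the `owner` and `schema` fields) are assumptions made for this example, not part of Hadoop, ODPi or any specific product; a real lake would back this with a proper metadata catalog.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata recorded for each admitted dataset (hypothetical fields)."""
    name: str
    owner: str
    schema: dict      # column name -> expected Python type
    row_count: int
    ingested_at: str  # ISO 8601 timestamp, UTC


@dataclass
class IngestionGate:
    """Admits a dataset only if it can be verified and cataloged."""
    catalog: dict = field(default_factory=dict)

    def ingest(self, name, owner, schema, rows):
        # Verify: data with no accountable owner never enters the lake.
        if not owner:
            raise ValueError(f"{name}: refusing to ingest data with no owner")
        # Verify: every row must carry every declared column with the right type.
        for i, row in enumerate(rows):
            missing = set(schema) - set(row)
            if missing:
                raise ValueError(f"{name}: row {i} is missing {sorted(missing)}")
            for column, expected in schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(
                        f"{name}: row {i} column {column!r} is not {expected.__name__}"
                    )
        # Catalog: only reached when every check above has passed.
        entry = CatalogEntry(
            name, owner, schema, len(rows),
            datetime.now(timezone.utc).isoformat(),
        )
        self.catalog[name] = entry
        return entry
```

Calling `IngestionGate().ingest("orders", "sales-ops", {"order_id": int, "amount": float}, rows)` either returns a catalog entry or raises before anything is registered, so nothing lands in the lake without a description of what it is and who owns it.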

That brings us to the second important part of governance – standardization. It’s all too easy in the world of open source technologies to take a ‘whatever is shiny today’ strategy and download the latest and greatest tool that was blogged about, but that misses the whole point: we live in a world where Excel is the #1 BI tool by a considerable margin. That world doesn’t need you to keep throwing new technology at it; it needs you to industrialize so that you deliver the same level of responsiveness the business gets from Excel. It doesn’t need you to waste hours, days or even weeks trying to get two technologies to work together; it needs you to focus on getting the job done, not on getting the technology working.

This latter point is where ODPi comes in, getting companies together to agree on what is actually required and providing a firm foundation for product and Big Data developers to build from. By reducing the risk of the moving parts, it becomes possible to shift the effort towards the outcome and away from the technology.

That is how you stop your lake becoming a swamp: govern on ingestion and standardize the technology. They aren’t the only things, but they are the two most important.


Ramachandran Venkat

Driving GenAI in Healthcare

8 years ago

All the best Steve Jones! When I visualize governing the ingestion process, I see the equivalent of controlling the flow of water from a melting glacier. It is not easy. Channeling the flows, categorizing them by use case and applicability, and applying policies and control measures is a gigantic task which, as you mentioned, is being missed these days, turning the lake into a swamp. One of the visible barriers to governing ingestion is the urgency from the business to offload its legacy database environment onto a big data platform just to save on hosting costs, which cascades into a multitude of problems downstream. And getting off Excel is a battle every person in our work stream has been fighting for decades. Operationalizing BI on a new platform with a mandate to match the flexibility of a tool they have lived with all their lives is a tough sell, but a must-have investment. Open source has a plethora of features that one cannot miss out on and that should be leveraged to gain competitive advantage.

Anand Agrawal, Architect, TOGAF Certified

Data Warehousing | Big Data | Analytics Specialist

8 years ago

Agree with you Steve, governance is an important factor.

Kumar Chinnakali

Reimagining contact center as a hands-on architect bridging users, clients, developers, and business executives in their context.

8 years ago

Looking forward to hearing your speech. What are the other two vital things to consider to stop our lake becoming a swamp?
