Are the robots really coming? Pt. 2 (Data Strategy)

Audience: Non-technical | Reading time: 6 mins

In Pt. 1 we explored at a high level what Machine Learning (ML) actually is, its accessibility through the Cloud Service Providers (CSPs) and some basic Data Science terminology, along with analytics outcomes. To deliver the more advanced outcomes, Data Scientists are completely dependent on reliable access to big data of consistent and sufficient quality, as well as appropriate tooling. This data lifecycle fits more into the domain of data engineering than data science.

Even steering clear of the evolution of big data and the plethora of technologies, tools and methodologies involved, it’s an understatement to say we are dealing with an expansive and ever-changing landscape here! To understand the data lifecycle, forgive me this simplified modern data pipeline, which all three CSPs broadly follow in terms of architecture and workflow.

  1. Ingest - Bring data in from either streaming or batch sources
  2. Store - Keep the raw original data as well as its history
  3. Process - Transform raw data into actionable data, aka ETL (extract, transform & load)
  4. Consume - Use the data for insights, visualisation and advanced analytics
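
For readers who like to see ideas in code, the four stages can be sketched in a few lines of Python. This is a toy illustration only, not how any CSP implements its services: the sample events, the function names and the in-memory "raw zone" list are all assumptions made up for this example.

```python
import json
from collections import defaultdict

# Hypothetical raw events; in practice these would arrive from a stream or a batch file.
RAW_EVENTS = [
    '{"customer": "a", "amount": "12.50"}',
    '{"customer": "b", "amount": "7.00"}',
    '{"customer": "a", "amount": "3.25"}',
    'not valid json',
]


def ingest(lines):
    """Ingest: hand over records one at a time, as a batch or streaming source would."""
    yield from lines


def store(records, raw_zone):
    """Store: keep the raw originals untouched so their history can be replayed later."""
    for record in records:
        raw_zone.append(record)
    return raw_zone


def process(raw_zone):
    """Process (ETL): parse and validate raw text, converting it into typed rows."""
    clean, rejected = [], []
    for line in raw_zone:
        try:
            row = json.loads(line)
            clean.append({"customer": row["customer"], "amount": float(row["amount"])})
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Bad records are set aside rather than silently dropped.
            rejected.append(line)
    return clean, rejected


def consume(clean_rows):
    """Consume: produce a simple insight, here total spend per customer."""
    totals = defaultdict(float)
    for row in clean_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


raw_zone = store(ingest(RAW_EVENTS), [])
clean_rows, rejected_rows = process(raw_zone)
spend_by_customer = consume(clean_rows)
print(spend_by_customer)
print(f"{len(rejected_rows)} record(s) rejected")
```

The point to notice is that the raw data is stored before it is processed, so a transformation can be corrected and re-run later without going back to the source.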

Whilst the four stages above correspond closely with the technology, it’s the people and process which converge with the tech in the form of a holistic, two-part Data Strategy that enables an organisation to master its most useful asset. The first, more theoretical element is a Data Governance framework, which requires clear business ownership to ensure that the accessibility, security, integrity and quality of the data are optimal and future-proofed. The second element, Data Management, is the practice, otherwise known as Data Operations. It ensures data quality is maintained, a searchable catalogue provides discoverability across the enterprise, and the lifecycle is effectively orchestrated through automation and policy. Such measures ensure that the ultimate archiving or deletion of data follows governance policies and rules, and therefore supports compliance.
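
To make "orchestrated through automation and policy" a little more concrete, here is a minimal Python sketch of a retention rule being applied automatically. Everything in it is hypothetical: the classifications, the retention periods, the dataset names and the owners are invented for illustration, and real governance policies would be set by the business owners described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionPolicy:
    archive_after_days: int
    delete_after_days: int


@dataclass(frozen=True)
class Dataset:
    name: str
    owner: str           # the accountable business owner (data steward)
    classification: str  # which governance policy applies
    last_updated: date


# Hypothetical policies agreed by the business, not by IT.
POLICIES = {
    "operational": RetentionPolicy(archive_after_days=90, delete_after_days=365),
    "personal": RetentionPolicy(archive_after_days=30, delete_after_days=180),
}


def lifecycle_action(dataset: Dataset, today: date) -> str:
    """Return 'retain', 'archive' or 'delete' according to the dataset's policy."""
    policy = POLICIES[dataset.classification]
    age_days = (today - dataset.last_updated).days
    if age_days >= policy.delete_after_days:
        return "delete"
    if age_days >= policy.archive_after_days:
        return "archive"
    return "retain"


today = date(2024, 1, 1)
catalogue = [
    Dataset("web_clicks", "Marketing", "operational", date(2023, 12, 15)),
    Dataset("order_history", "Sales", "operational", date(2023, 9, 1)),
    Dataset("newsletter_signups", "Marketing", "personal", date(2023, 6, 1)),
]

actions = {d.name: lifecycle_action(d, today) for d in catalogue}
print(actions)
```

Because the rules live in one place and every dataset carries a named owner, the same check can run on a schedule across the whole catalogue instead of relying on someone remembering to tidy up.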

Enterprises tend to encourage structured silos, which naturally carries through to data. Siloed data means a lack of accessibility for other business units (BUs), likely duplication across the business and therefore no single source of truth. Rather than a more traditional data warehouse (structured data), CSPs favour the use of a centralised Data Lake to cope with all data types, including unstructured data (e.g. blog content). Data governance is not the responsibility of IT but of designated business stakeholders, and without it a data lake is likely to become a data swamp which delivers little or no business value despite its operating cost. More importantly, any decisions made around data, its governance and its lifecycle without defined business owners (data stewards) create data debt. This accrues interest in the form of the effort needed to unpick dependencies later in order to make good.

Having hopefully made the case for a joined-up data strategy with robust data governance, what does Data Operations culture look like in Digital Natives (e.g. Airbnb and Netflix), as a model for aspiring enterprise businesses? Data hoarding is a sackable offence. Data is gathered wherever possible into a well-orchestrated data platform to manage its entire lifecycle. Automation is used with militancy - for instance, the Airbnb Operations Team consists of just 5 people (see AWS case study). For the business, secure self-service and near real-time information exchanges provide well-catalogued, searchable datasets to underpin data-driven decision making and enable advanced Data Science analytics outcomes (i.e. prescriptive and predictive). They embrace fail-fast and fail at scale, a cultural tenet which is fundamental to learning and underpins their ability to innovate so effectively. Such businesses tend to have a Chief Data Officer (CDO) instead of a CIO, in order to spread a data-driven culture from the top down.

Many enterprises, however, don’t have the luxury of starting from scratch, with critical business functions and revenue-generating lights-on services to keep running. Their budgetary cycles have been at least one contributory factor in their struggle to adopt agile working effectively, with many instead falling back on waterfall programmes of work to transform their business. Operating models appear to be more of an inconvenient afterthought, and contractor flux doesn’t seem to help with designating responsibility or with considered decision-making. More common still is for Data Scientists to spend far too much of their valuable time repeatedly preparing and wrangling data in order to develop ML models.

Without a clear top-down, data-driven mandate from a leader who can share a vision people believe in and buy into, providing Data Science teams in the enterprise with useful data represents a huge challenge. Some enterprises have already realised the importance of a data strategy and have experimented with data engineering Proofs of Concept (PoCs) to learn enough about their current problems to effect lasting change in data lifecycle management. Progressive CDOs who have been leading their organisations towards self-service and discoverable data have also invested in centralised Data Science teams with the tooling, skills and training to deliver advanced analytics outcomes. Other agile enterprises have even favoured distributed Data Science teams at BU level to master the local data domain and collaborate for success.

So we’ve had a look at data engineering and understood the importance of a data strategy and governance in empowering enterprise Data Science teams with useful data to develop ML models that promise to deliver real business value. In Pt. 3, I’m going to look at the challenge of deploying models, or ML Operations, in order to deliver ongoing business value beyond a one-off proof of concept.

