Modern Data Stack: Looking into the Crystal Ball

Co-authored by Apoorva Pandhi and Chad Sanderson

In September 2021, Databricks announced that it had raised $1.6B in fresh capital, bringing it to an eye-watering $38B valuation. According to a company press release, the “Series H funding will be used to accelerate innovation and adoption of the lake-house as the data architecture’s popularity across data-driven organizations continues to grow at a rapid pace.” In addition to Databricks emerging as one of the leading unified data lake platforms, cloud data warehouses like Snowflake are gaining ubiquity as they make it possible to store and process massive amounts of data with a focus on less technical users (Snowflake had a y-o-y revenue growth of 102% and a net retention of 178% in FY2022). All this indicates that the data/ML infrastructure market is growing at a rapid pace.

While the market growth is promising, the data/ML landscape has become more complex than ever. This is because of two reasons. First, the choice between a unified platform and bespoke tools presents a dichotomy: On one hand, a unified platform is considered a ‘one-stop shop’, but on the other, there are trade-offs such as vendor lock-in, high cost of ownership and platform complexity. And second, as more enterprises adopt the modern data stack, they realize that they need additional layers that unified platforms alone cannot address.

As a result, now is an exciting time for founders to not only discover newer horizons that are currently unaddressed, but also, help enterprises navigate the complexity of the existing stack. Before we go into market trends and opportunities for founders, let’s briefly discuss the data stack in its current avatar.

Current Avatar: The Modern Data Stack Ecosystem

While the modern data architecture is evolving continuously, we see the following foundational elements of the stack today (see visual below):

1. Data Connectors & Workflow Orchestration: With businesses growing their digital footprint and investing in multiple SaaS applications, there is a pressing need to collect, transform and normalize data across sources, while managing these workflows programmatically.

  • Connectors: Data connectors collect data from a multitude of “sources” (DBs, Cloud Storage, SaaS apps, webapps, mobile apps, APIs, internal operational data etc.) and load it to multiple “sinks” (Cloud storage, Data warehouses, SaaS apps). Companies utilize ETL/ELT (e.g., Fivetran), CDP (e.g., Segment), Streaming (e.g., Kafka), CDC (e.g., Debezium) etc.
  • Workflow orchestrators: Orchestrators help to schedule, monitor and run concurrent workflows (Airflow, Argo etc.).
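To make the orchestration idea concrete, here is a minimal sketch of what "running workflows programmatically" means at its core: tasks are declared with their upstream dependencies and executed in dependency order. This is not the API of Airflow, Argo or any other orchestrator; the function `run_pipeline` and the three task names are hypothetical, and a real orchestrator adds scheduling, retries, monitoring and concurrency on top.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run every task after its upstream tasks and return the execution order."""
    order = []
    # static_order() yields each task only after all of its dependencies.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


# Hypothetical three-step pipeline: extract -> transform -> load.
log = []
tasks = {
    "extract": lambda: log.append("extracted rows from source"),
    "transform": lambda: log.append("normalized rows"),
    "load": lambda: log.append("loaded rows into sink"),
}
# Maps each task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = run_pipeline(tasks, dependencies)
print(order)
```

Declaring dependencies as data rather than hard-coding call order is what lets an orchestrator reason about the workflow, for example to run independent branches concurrently.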

2. Metadata Management: As the data warehouse gains ubiquity, a new metadata stack is emerging on top of it. This stack spans multiple layers, including data transformation, cataloguing, quality, governance and metrics, to make the data useful for downstream analytics or ML use-cases.

3. Data Storage: The collected data is either stored, structured and processed in the warehouse or resides in a data lake (S3, Azure, Hadoop etc.) in its native format. A data lake is a low-cost storage repository that holds the vast majority of data, whereas a data warehouse becomes expensive for massive volumes of data.

4. Data Querying, Processing and AI/ML: Once data is ready downstream of the metadata stack, it can be utilized for as-is/historical analytics or predictive/ML use-cases.

  • Analyze and process: Downstream querying, joining and processing of data can be accomplished with distributed processing engines (e.g., Spark), SQL querying engines (e.g., Trino) or low latency OLAP engines (e.g., Druid, ClickHouse).
  • Predict: Data can be utilized within ML models either directly from the warehouse/ data-lake or downstream of the processing and querying layer.

5. Data Consumption Layer: End users (whether data-science, analyst, business or developer) can utilize this processed data for analytics or ML use-cases through different tools and consumption interfaces. Examples include BI tools (e.g., Mode), Operational intelligence (e.g., Outlier), Embedded Analytics (e.g., Cube) and Custom Apps (e.g., Retool).

[Image: overview of the modern data stack ecosystem]

Future Vision: Key Areas Ripe for Disruption

1. Data-scientists and analysts will become increasingly self-sufficient thereby reducing dependency on data engineering teams – With more data than ever and growing urgency to take advantage of it, the role of the data scientist and analyst is changing. While these personas don’t have software engineering backgrounds, they are expected to own a bigger piece of the data stack to accomplish their use-cases. This trend has implications for which parts of the stack need to be abstracted away and which parts continue to be owned by data engineers. We are excited about the potential of frameworks/tooling that not only make data scientists and analysts more autonomous, but also unleash tremendous productivity for data engineers. This can manifest in different ways:

  • Empower data-scientists to experiment with new ML ideas and deploy promising ML prototypes without worrying about the underlying ML infrastructure (i.e., you don't need to become a data engineer and a (Spark)SQL expert to produce value), thereby allowing effective model training and iteration at scale.
  • Enable data analysts to explore, analyze or predict data scenarios by abstracting the complexity of the data pipeline.
  • Activate data analysts and scientists to A/B test hypotheses before sharing insights with business teams.

2. “Data contracts” will become foundational to data pipelines for minimizing reverse engineering efforts and maximizing availability of ready-to-use data for ML and analytics use-cases – While ELT promises availability and scalability over ETL, the data lands in the warehouse in a total mess. Consumers (ML practitioners, analysts or business users) then use tools to reverse engineer business metrics and insights while wrestling with a multitude of challenges around query complexity, data quality, lineage visibility, governance controls, varying model definitions or broken semantic relationships. Making the data ready for downstream analytics or ML requires tremendous engineering effort, and even that doesn’t guarantee “ready-to-use” data. We envision a future where there will be a “data contracts” layer upstream of the warehouse to enable stronger collaboration between data producers (e.g., data engineers) and data consumers so that business needs become foundational to the data pipeline. In essence, we emphasize minimizing downstream effort and maximizing “ready-to-use” data by extracting business metrics, schema and semantic relationships from data sources directly.
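As a rough illustration of what a contract enforced upstream of the warehouse could look like, the sketch below declares the expected fields of a record and reports violations before the record is loaded. Everything here is an assumption made for illustration: the `ContractField` class, the `ORDERS_CONTRACT` definition and the field names are hypothetical, and a real data contract would typically also cover semantics, ownership and change management rather than types alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractField:
    name: str
    type: type
    required: bool = True


# Hypothetical contract agreed between the producer and consumers of "orders" data.
ORDERS_CONTRACT = [
    ContractField("order_id", str),
    ContractField("amount_usd", float),
    ContractField("coupon_code", str, required=False),
]


def violations(record, contract):
    """Return human-readable contract violations for a single record."""
    problems = []
    for field in contract:
        value = record.get(field.name)
        if value is None:
            # Absent or null: only a problem when the field is required.
            if field.required:
                problems.append(f"missing required field: {field.name}")
            continue
        if not isinstance(value, field.type):
            problems.append(
                f"{field.name}: expected {field.type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


good = {"order_id": "A-1", "amount_usd": 19.99}
bad = {"order_id": 42, "coupon_code": "SPRING"}

print(violations(good, ORDERS_CONTRACT))
print(violations(bad, ORDERS_CONTRACT))
```

Rejecting or quarantining the second record at the source is the point of the exercise: consumers no longer have to discover the broken `order_id` by reverse engineering it in the warehouse.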

3. Greater emphasis on decoupling of the underlying data architecture so that data can move in and out freely, thereby allowing various systems to interoperate seamlessly – As the size and scope of data grow, data teams are becoming increasingly strapped for time. Data engineers, especially, are finding themselves at a crossroads between traditional support models and architecting the data stack of the future. Data teams are thus prioritizing ownership, flexibility, and scale with data, while supporting low storage cost and simplicity of adoption irrespective of the architectural framework (Data Warehouse or Data Lake). As a result, the distinction between the data warehouse and data lake is becoming increasingly blurred. Examples of potential opportunities in this domain:

  • Empower data teams to adopt a unified & consistent view of the data across the organization by reimagining metadata management (including governance) across the warehouse and the lake.
  • Enable data engineers to save time and focus on things that matter, for instance by eliminating time wasted on tuning parquet files, sorting data or re-ordering joins.
  • Activate data analysts and data scientists to work with the data lake and get the flexibility to use any query engine based on use-case.

4. Use-cases for streaming data will become increasingly mainstream and online ML use-cases will follow – The need for streaming data use-cases is continuously growing. These use-cases range from user experience and customer activation to fraud detection and network security, all of which can be augmented substantially through streaming data infrastructure. Companies will emerge to address this opportunity in different ways:

  • Empower data and ML engineers to deploy data ingestion, storage, transformation, and metadata management layers (e.g., focus on embeddings or large matrices, super sparse data operations etc.) without having to build this stack from the ground up.
  • Enable data scientists to iterate with model logic and features without having to worry about underlying infra.
  • Activate businesses to adopt fully integrated verticalized solutions (e.g., recommendations) instead of horizontal components.
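To illustrate the kind of computation a streaming use-case such as fraud detection relies on, here is a minimal sketch of a sliding-window counter that flags a key once it exceeds a number of events within a time window. It is a toy, in-memory stand-in and not how Kafka or any particular streaming system works; the class `SlidingWindowCounter`, the threshold and the sample events are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events per key within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = {}  # key -> deque of event timestamps, oldest first

    def add(self, key, timestamp):
        """Record one event and return the key's count inside the window."""
        window = self.events.setdefault(key, deque())
        window.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window)


# Hypothetical rule: more than 3 transactions per card within 60 seconds is suspicious.
counter = SlidingWindowCounter(window_seconds=60)
THRESHOLD = 3

# (card id, event time in seconds), assumed to arrive in order.
events = [
    ("card-1", 0),
    ("card-1", 10),
    ("card-2", 15),
    ("card-1", 20),
    ("card-1", 30),
    ("card-1", 200),
]

flagged = [(key, ts) for key, ts in events if counter.add(key, ts) > THRESHOLD]
print(flagged)
```

The same windowed count could equally be served as an online feature to an ML model, which is where streaming infrastructure and online ML meet.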

5. Simple, user-centric standards that follow software engineering practices will emerge to establish a blueprint of the data stack for different organizations, thus boosting productivity of data engineers – As companies embrace the modern data stack, we need simple, user-centric standards that work with existing tools to build a best-of-breed stack with the organization’s needs and potential extensibility as core pillars. These best practices and standards will govern how we design, orchestrate, and manage our data stack going forward and will draw from fundamentals of the modern software delivery lifecycle.

These opportunities are just the tip of the iceberg. By investing adequate time and resources in building the foundation of their data capabilities, companies can become truly enduring businesses. The value of a next generation data/ML innovation is in its modularity, for two core reasons: first, teams need different levels of maturity at different stages of the company, and second, everyone's data needs are different. It is critical for each component to become a vector for innovation and, as a result, the need for flexibility rises substantially. The role of the data platform team in that scenario is to serve as the connective tissue between products (internal, external, and open source) and to swap those products in and out like Lego blocks as the business scales and new needs arise for customers. This changing role means there is a new 'primary buyer' for data infrastructure products, which will necessitate bigger data budgets. This is an exciting market shift that has created opportunities for early-stage founders to build category defining products. If you’re a founder or a practitioner thinking deeply about pain-points in the data and ML space, we’d love to brainstorm!

Thanks to the brain trust, who provided feedback and contributed ideas, including Ville Tuulos, Savin Goyal, Nikhil Garg, Abhay Bothra, Timothy Chen, Sid Trivedi, Jasleen Singh and Pete Soderling.

Cameron Price

Founder | Senior Data Executive | 30 Years of Leadership in Data Strategy & Innovation | Executive Director | Sales Executive | Mentor | Strategy | Analytics | AI | Gen AI | Transformation | ESG

3 months ago

Great insights, Apoorva! With the ongoing evolution in the data and ML space, which areas do you believe present the most opportunity for growth or innovation? Looking forward to more discussions on this.

I think ML needs to be separated from Modern Data Stack. MDS is now commoditised enough that little change can be expected. ML on the other hand is an unsolved problem. Current "architecture-based" MLOps is a dead end as all examples are purpose-built by organisations with plenty of engineering talent and there is little built-in generality in these solutions. Recently established MLOps SaaS components reimplement parts of these BigTech architectures. Usually by the original creators now funded by VC money. I welcome the call for the introduction of software engineering principles in Point 5. Building a large stack upfront is the antithesis of agile software development. No surprise that practitioners are squeamish about adopting any of the MLOps stacks. Lacking agility in early adoption can lead to fragile infrastructure and reduced adoption in the enterprise. Currently, there is no widely accepted paradigm for agile ML product development. Partly because data scientists are simply not trained well enough in programming to think in this context. Our company Hypergolic is changing this.

Neeraj S.

Ask Me About AI Apps & Agents | Governance, Security & Compliance | Co-Founder at PAIG.ai

2 years ago

Prophecy goes into Transformation too

Dave Kellogg

EIR at Balderton Capital, Independent Consultant, Author of Kellblog, and Co-host of SaaS Talk.

2 years ago

One of the best articles I've read on the topic of MDS

Walter Rowland

$1-$50m+ CRO SMB to Enterprise w/ international exp. SVP Revenue @ Yodeck | Sales, Sales Dev, Partners, Success, RevOps

2 years ago

Love this - there is a new breed of "powered by" SaaS "connected applications" that I would assume fall under your "Custom Apps" category - where the data layer and the application layer are separated - and the data layer is something like a Snowflake and it is provided by / managed by the brand themselves. Omer Singer writes a good blog article describing this - https://www.snowflake.com/blog/powered-by-snowflake-building-a-connected-application-for-growth-and-scale/ And my firm - MessageGears - happens to be an example of a "connected application" for B2C MarTech - specifically in the areas of customer data segmentation and activation
