Forrester changed the way they think about data catalogs. Here’s what you need to know.

As we predicted at the beginning of this year, metadata is hot in 2022 — and it’s only getting hotter. But this isn’t the old-school idea of metadata we all know and hate.

The data industry is in the middle of a fundamental shift in how we think about metadata. Now, in the latest sign of this shift, Forrester scrapped its Wave report on “Machine Learning Data Catalogs” to make way for one on “Enterprise Data Catalogs for DataOps”.

Here’s what you need to know about where this change came from, why it happened, and what it means for modern metadata.

Spotlight: What are Enterprise Data Catalogs for DataOps, and why should you care?

One of the biggest challenges with Data Catalog 2.0 was adoption — no matter how it was set up, companies found that people rarely used their expensive data catalog. For a while, the data world thought that machine learning was the solution. That’s why, until recently, Forrester’s reports focused on evaluating machine learning data catalogs.

However, in early 2022, Forrester dropped machine learning in its Now Tech report. It explained that even as ML-based systems became ubiquitous, the problems they were meant to solve persisted. Although machine learning allowed data architects to get a clearer picture of the data within their organization, it didn’t fully address modern challenges around data management and provisioning.

The key change — “Data engineers need a data catalog that does more than generate a wiki about data and metadata”. Instead, data teams need a catalog built to enable DataOps. This requires in-depth information about and control over their data to “build data-driven applications and address data flow and performance”.

So what actually is an enterprise data catalog for DataOps (EDC)? According to Forrester, “[enterprise] data catalogs create data transparency and enable data engineers to implement DataOps activities that develop, coordinate, and orchestrate the provisioning of data policies and controls and manage the data and analytics product portfolio.”

There are three key ideas that distinguish EDCs from the earlier Machine Learning Data Catalogs.

Handles the diversity and granularity of modern data and metadata

Today a company’s data isn’t just simple tables and charts. It’s a wide range of data products and associated assets, such as databases, pipelines, services, policies, code, and models — each with its own metadata. EDCs are built for this complex portfolio of data and metadata.

Rather than just storing a “wiki” of this data, EDCs act as a “system of record” to automatically capture and manage all of a company’s data through the data product lifecycle. This includes syncing context and enabling delivery across data engineers, data scientists, and application developers.

Provides deep transparency into data flow and delivery

A key idea in DataOps is CI/CD, a software engineering practice that improves collaboration, productivity, and speed through continuous integration and delivery. For data, implementing CI/CD relies on understanding exactly how data is moved and transformed across the company.

EDCs provide granular data visibility and governance with features like column-level lineage, impact analysis, root cause analysis, and data policy compliance. These should be programmatic, rather than manual, with automated flags, alerts, and/or suggestions to help users keep on top of complex, fast-moving data flows.
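To make “programmatic, rather than manual” concrete, here is a minimal sketch of automated impact analysis over a column-level lineage graph. Everything in it is hypothetical (the lineage graph, column names, and alert output); a real EDC would assemble the graph automatically from query logs and pipeline code rather than by hand.

```python
from collections import deque

# Hypothetical column-level lineage: each key is a column, each value is the
# set of downstream columns derived from it.
LINEAGE = {
    "raw.orders.amount": {"staging.orders_clean.amount"},
    "staging.orders_clean.amount": {"analytics.revenue.daily_total"},
    "analytics.revenue.daily_total": {"bi.dashboard.revenue_kpi"},
}

def downstream_impact(changed_column: str) -> set:
    """Breadth-first walk of the lineage graph, collecting every column that
    directly or indirectly depends on the changed column."""
    impacted, queue = set(), deque([changed_column])
    while queue:
        for child in LINEAGE.get(queue.popleft(), set()):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

if __name__ == "__main__":
    # A type change on a raw column automatically flags every downstream
    # table and dashboard, instead of relying on a manual review.
    for column in sorted(downstream_impact("raw.orders.amount")):
        print(f"ALERT: {column} may break if raw.orders.amount changes")
```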

Designed around modern DataOps and engineering best practices

With data growing beyond the IT team, data engineering tools can no longer just focus on the data warehouse and lake. DataOps merges the best practices and learnings from the data and developer worlds to help diverse data people work together better.

EDCs are a critical way to connect the “data and developer environments”. Features like bidirectional communication, collaboration, and two-way workflows lead to simpler, faster data delivery across teams and functions.

Read more about enterprise data catalogs for DataOps in the blog here.

The future of metadata is active

All of these ideas — from Forrester championing data catalogs for DataOps to Gartner scrapping its Magic Quadrant for Metadata Solutions — point to the importance of active metadata. We first wrote about this idea in January 2021, and we’ve seen it explode since then.

From DataOps to the data mesh, modern data concepts are fundamentally based on being able to collect, store, and analyze metadata. However, data catalogs lagged behind for years, acting as static, siloed systems in a world of fast-moving, interconnected data. Now that metadata is approaching “big data” scale and is critical for a range of modern use cases, the standard way of storing metadata is no longer enough. As Forrester said, we need more than a wiki for our data.

The solution is “active metadata”, which is a key component of modern data catalogs. Instead of just collecting metadata from the rest of the data stack and bringing it back into a passive data catalog, active metadata makes a two-way movement of metadata possible. It sends enriched metadata and unified context back into every tool in the data stack, and enables powerful programmatic use cases through automation.
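As a rough sketch of what that two-way movement can look like in code (not any particular vendor’s API): assume the catalog exports enriched descriptions for key columns, and the warehouse supports standard COMMENT ON COLUMN statements. The column name and connection details are illustrative assumptions.

```python
# Hypothetical enriched context exported from the data catalog, keyed by
# fully qualified column name.
ENRICHED_METADATA = {
    "analytics.revenue.daily_total": {
        "description": "Net revenue per day, excluding refunds. Owned by the finance team.",
    },
}

def push_context_to_warehouse(connection, enriched_metadata=ENRICHED_METADATA):
    """Write catalog context back into the warehouse as column comments, so the
    description travels with the column wherever it is queried. `connection` is
    any DB-API connection to a warehouse that supports COMMENT ON COLUMN."""
    cursor = connection.cursor()
    for fq_column, context in enriched_metadata.items():
        # Escape single quotes; a production version should use the driver's
        # own quoting utilities instead of string formatting.
        description = context["description"].replace("'", "''")
        cursor.execute(f"COMMENT ON COLUMN {fq_column} IS '{description}'")
    connection.commit()
```

The same loop could just as easily push descriptions into a BI tool or an orchestrator; the point is that metadata flows out of the catalog, not only into it.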

Here are a few examples of what active metadata looks like in action:

  • Purge stale or unused assets: Use active metadata to periodically calculate when each data asset was last used and how many people used it, and then flag or purge neglected assets (see the sketch after this list).
  • Allocate compute resources dynamically: Imagine that 90% of users log in to a BI tool during the last week of a financial quarter — automatically scale up compute resources just before that week and scale them down again afterward.
  • Enrich user experience in BI tools: Instead of making business users switch between a BI tool and data catalog, push important metadata (like business terms, descriptions, owners, and lineage) directly into the BI tool.
  • Notify downstream consumers: Check data pipelines for issues when a data store changes and notify downstream data users about potential breaking changes (e.g. the addition or removal of a column).
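The first idea above, flagging stale assets, is a good example of how little machinery this needs: usage metadata plus a scheduled job. Here is a minimal sketch, assuming the catalog can export a last-queried timestamp and a recent query count per asset (the asset names and thresholds are made up).

```python
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(days=90)
MIN_MONTHLY_QUERIES = 5

# Hypothetical usage metadata exported from the catalog or query logs.
USAGE_METADATA = [
    {"asset": "analytics.legacy_kpis", "last_queried": "2022-01-03T10:00:00+00:00", "queries_last_30d": 0},
    {"asset": "analytics.revenue", "last_queried": "2022-06-20T08:15:00+00:00", "queries_last_30d": 412},
]

def find_stale_assets(records, now=None):
    """Return assets that haven't been queried in a long time and see almost
    no traffic, as candidates to archive or purge."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for record in records:
        last_used = datetime.fromisoformat(record["last_queried"])
        unused_too_long = now - last_used > STALENESS_THRESHOLD
        barely_queried = record["queries_last_30d"] < MIN_MONTHLY_QUERIES
        if unused_too_long and barely_queried:
            stale.append(record["asset"])
    return stale

if __name__ == "__main__":
    for asset in find_stale_assets(USAGE_METADATA):
        print(f"Candidate for archival: {asset}")
```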

Learn more about active metadata here.

More from my reading list

I’ve also added some more resources to my data stack reading list. If you haven’t checked out the list yet, you can find and bookmark it here.

See you next week!

P.S. Liked reading this edition of the newsletter? We'd love it if you could take a moment and share it with your friends on social.
