Data catalogs

Metadata is the foundation for Data Governance

Metadata, often defined as “data about data”, refers to a set of data that describes and gives information about other data. It’s all about context for the data. Having a metadata management layer in your data ecosystem helps your organization to discover, understand, and trust the data assets you own. What metadata should be collected and managed can be best described by the ABC model* of metadata.

So, what is this ABC model for metadata?

Three broad types of metadata fit in this ABC model.

Application Context – information needed by humans or applications to operate. This includes metadata about the existence of data, along with the descriptions, semantics, and tags associated with it.

  • Where is the data?
  • What are the semantics of the data?

Behavior – information about how the data is created and used over time. This includes information about ownership, creation, common usage patterns, the people or processes that frequently use the data, provenance, and lineage.

  • Who is using the data?
  • Who created the data?

Change – information about how the data is changing over time. This captures information about the evolution of data (for example, schema evolution for a table) and the processes that create it (for example, the related ETL code for a table).

  • How is data evolving?
  • How is the code that generates the data evolving?
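
As a rough illustration, the three categories can be modeled as one metadata record per data set. The sketch below is a minimal example in Python, not the format of any particular catalog product; every class name, field name, and the sample table `sales.orders` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ApplicationContext:
    """A: where the data is and what it means."""
    location: str
    description: str
    tags: List[str] = field(default_factory=list)


@dataclass
class Behavior:
    """B: who created the data and who uses it."""
    owner: str
    frequent_users: List[str] = field(default_factory=list)
    upstream_sources: List[str] = field(default_factory=list)  # lineage


@dataclass
class Change:
    """C: how the schema evolves, stored as column -> type per version."""
    schema_versions: List[Dict[str, str]] = field(default_factory=list)

    def added_columns(self) -> List[str]:
        """Columns present in the latest schema but not in the previous one."""
        if len(self.schema_versions) < 2:
            return []
        previous, latest = self.schema_versions[-2], self.schema_versions[-1]
        return sorted(set(latest) - set(previous))


@dataclass
class MetadataRecord:
    name: str
    context: ApplicationContext
    behavior: Behavior
    change: Change


# Hypothetical example record for a single table.
orders = MetadataRecord(
    name="sales.orders",
    context=ApplicationContext(
        location="warehouse://sales/orders",
        description="One row per customer order",
        tags=["sales"],
    ),
    behavior=Behavior(
        owner="data-eng",
        frequent_users=["finance-analytics"],
        upstream_sources=["crm.orders_raw"],
    ),
    change=Change(
        schema_versions=[
            {"order_id": "int", "amount": "float"},
            {"order_id": "int", "amount": "float", "currency": "str"},
        ]
    ),
)

print(orders.change.added_columns())
```

Keeping the three parts as separate classes mirrors the questions above: the context answers "where" and "what", the behavior answers "who", and the change history answers "how is it evolving".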

Capturing metadata based on the ABC model and using metadata to drive Data Governance applications such as Data Catalog and Data Quality is a key strategy that is being adopted by many fast-growing companies like Uber, Airbnb, LinkedIn, Lyft and others.

Enable self-service analytics using Data Catalog

Thinking of a Data Catalog as merely a library management system for your data is a very narrow view.

What is a data catalog?

A Data Catalog is an inventory of available data and its metadata, often combined with a search tool. It helps data users easily discover data and evaluate its fitness for the intended use.

A data catalog focuses first on data sets and connects those data sets with rich metadata information. Data sets are the files and tables that data teams need to find and access. They may reside in a data lake, warehouse, master data repository, or any other shared data resource.
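
To make the "inventory plus search" idea concrete, here is a minimal in-memory sketch. It is only an illustration under simple assumptions: the `CatalogEntry` and `DataCatalog` names, the fields, and the sample data sets are hypothetical, and the search is a naive substring match rather than what a real catalog would use.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    name: str
    location: str
    description: str
    tags: List[str] = field(default_factory=list)


class DataCatalog:
    """In-memory inventory of data sets with a naive keyword search."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        """Add a data set to the inventory, or overwrite an existing entry."""
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Return names of entries whose name, description, or tags contain the keyword."""
        needle = keyword.lower()
        hits = []
        for entry in self._entries.values():
            haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
            if needle in haystack:
                hits.append(entry.name)
        return sorted(hits)


# Hypothetical data sets from a warehouse and a data lake.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    "sales.orders", "warehouse://sales/orders",
    "One row per customer order", ["sales", "finance"],
))
catalog.register(CatalogEntry(
    "web.clickstream", "lake://raw/clickstream/",
    "Raw page-view events", ["marketing"],
))

print(catalog.search("order"))
print(catalog.search("marketing"))
```

The point of the sketch is that the search runs over metadata (names, descriptions, tags), not over the data itself, which is why the richness of the metadata determines how discoverable a data set is.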

Why do you need a data catalog?

The greatest value of the data catalog is that it improves the productivity of data teams and enables collaboration. Because data and technology exist in silos in most organizations, data teams are often working blind, without visibility into the data sets that exist. They spend too much time finding and understanding data, often recreating data sets that already exist. Here is a diagram that shows the data analysis workflow with and without a Data Catalog.

Data teams in organizations without a data catalog often rely on tribal knowledge or documentation. This reliance on what individual people know often leads to a lot of back-and-forth conversation between team members, which hurts productivity.

Because data workflows are complex, enabling transparency and collaboration among data users using a Data Catalog is an optimal solution to reduce tribal knowledge dependency and enable self-service analytics.

However, a Data Catalog and its associated metadata should be continuously curated and enriched so that the information stays up to date.

Not all Data Catalogs are created equal

As demand for Data Catalogs is growing among enterprises, a lot of Data and Analytics solution providers and Cloud infrastructure providers such as Azure, AWS, and Google are offering Data Catalogs embedded into their solution offerings. These Data Catalogs are called Embedded Data Catalogs.

There are also Stand-alone Data Catalogs: generalist, independent, business-oriented Data Catalogs for broader use in data management, analytics, and Data Governance. The major differences between Embedded and Stand-alone Data Catalogs are:

| Stand-Alone Data Catalogs | Embedded Data Catalogs |
|---|---|
| Automated discovery, profiling and tagging. | Use case specific or tool specific like AWS Glue Catalog that is specific and works well with AWS services like AWS Glue and Redshift. |
| Can be used for any use case and with any data and analytics tools. | Leads to metadata silos that are difficult to integrate and use beyond specific use cases. |
| Horizontally scalable for many use cases and tools. | Vertically scalable within specific tool or cloud providers infrastructure. |
| Open APIs and Interoperable. | Vendor lock-in. |

I advise you to investigate and adopt a stand-alone Data Catalog when your data cataloging requirements transcend a single tool, environment, or use case, and when your teams would like to catalog data assets across several use cases.

Modern Data Catalogs are ML Augmented

Machine Learning is significantly impacting data catalogs. Modern machine learning augmented Data Catalogs automate metadata discovery and profiling. ML Augmented Data Catalogs provide AI-driven search and discovery of data assets, including recommendations. Modern Data Catalogs establish semantic relationships between data using knowledge graphs. They also provide anomaly detection that identifies sensitive data such as PII and flags risky data assets and outliers.
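
The automated tagging of sensitive data can be illustrated with a much simpler stand-in. The sketch below is not machine learning: it swaps the trained classifiers a real ML-augmented catalog would use for two regular expressions, purely to show the idea of tagging columns from sampled values. The function name `tag_pii_columns`, the patterns, the 0.8 threshold, and the sample data are all assumptions made for this example.

```python
import re
from typing import Dict, List

# Simplified, illustrative patterns; real detectors are far more thorough.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def tag_pii_columns(samples: Dict[str, List[str]], threshold: float = 0.8) -> Dict[str, str]:
    """Tag a column with a PII type when at least `threshold` of its sampled values match."""
    tags: Dict[str, str] = {}
    for column, values in samples.items():
        if not values:
            continue  # nothing sampled, so nothing to infer
        for pii_type, pattern in PII_PATTERNS.items():
            matches = sum(1 for value in values if pattern.match(value))
            if matches / len(values) >= threshold:
                tags[column] = pii_type
                break  # first matching PII type wins
    return tags


# Hypothetical sampled values, keyed by column name.
sample_values = {
    "contact": ["ana@example.com", "bob@example.org", "n/a", "li@example.net", "kim@example.com"],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "amount": ["19.99", "5.00"],
}

print(tag_pii_columns(sample_values))
```

Working from a sample with a match threshold, rather than requiring every value to match, tolerates placeholder values such as "n/a" while still flagging the column for review.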

ML augmented Data Catalogs enable the pervasive use of metadata not just for Data Governance but also to automate data integration, data preparation, data quality, and many other data management activities. This next-generation Data Catalog can, therefore, accelerate time to insights by helping data teams automate most of the data discovery, tagging, and collaboration.

When implementing a Data Catalog solution, it’s important for the organization to deploy a modern ML Augmented Data Catalog solution that will be critical to the success of Data Governance and data & analytics initiatives.

Recommendations to implement a Data Catalog

  • Arrange organization-wide discovery and educational sessions with teams that rely heavily on internal and external distributed data assets, covering the need for and benefits of an ML-augmented Data Catalog to curate the inventory of data.
  • Analyze use-case requirements to determine which would benefit from tactical Data Catalog deployments (cataloging of data in a data lake scenario, for example) versus the ones that need a more strategic implementation (cataloging of data across a hybrid cloud or multi-cloud ecosystem, for example).
  • Understand that tool-specific embedded Data Catalogs (for example, Data Catalogs delivered as part of cloud provider tools such as AWS Glue Data Catalog) will improve data usability and trust only in the context of that tool.
  • Avoid Data Catalogs that cannot scale beyond narrow (or tactical) use-case requirements, and those that do not have AI/ML augmentation on the roadmap to automate various parts of Data Cataloging.
  • Identify, source, and deploy an ML-augmented Data Catalog to curate the inventory of data assets.

What are the business benefits of having a Data Catalog?

The three major business benefits of implementing a Data Catalog are:

  1. Improved productivity of your data teams. The Data Catalog helps anyone in the organization to find the right data for their use case quickly. It is often said that data scientists and data analysts spend only 20% of their time doing data analysis work, with 80% consumed by data issues. With a Data Catalog, the ratio of time spent on data issues to time spent on analysis can improve, potentially reversing the numbers to 20% data time and 80% analysis time.
  2. Data trust and compliance. The Data Catalog helps data teams trust data that comes from a reliable source, such as a reliable data owner or the most frequently used data sets. It also helps data teams spot data compliance issues, for example by identifying PII data.
  3. Accelerated time to insight. Modern Data Catalogs accelerate time to insights for Data & Analytics use cases.

Discovering and identifying data that delivers value, and governing the quality and security of that data, are some of the biggest challenges organizations will face in the years to come. ML augmented Data Catalogs will be a must-have solution in the data ecosystem for organizations that want to keep making data-driven decisions with trust and confidence.

Quick recap

  • Data Catalogs and Data Quality are important aspects of Data Governance to consider.
  • Implementing a Data Catalog can help organizations improve the productivity of data teams and accelerate time to insights.
  • Data Catalogs play a critical role in improving Data Quality, Security and Compliance.
  • When choosing a Data Catalog, implement a standalone modern ML augmented Data Catalog.

Data Quality is another important aspect of Data Governance. Kindly read my next blog post in this series on how to implement a data quality solution in your data ecosystem and empower your Data Governance practice to improve data quality and bring trust and confidence to your data-driven decisions.
