Data catalogs
Darshika Srivastava
Associate Project Manager @ HuQuo | MBA, Amity Business School
Metadata is the foundation for Data Governance
Metadata, often defined as “data about data”, refers to a set of data that describes and gives information about other data. It’s all about context for the data. Having a metadata management layer in your data ecosystem helps your organization to discover, understand, and trust the data assets you own. What metadata should be collected and managed can be best described by the ABC model* of metadata.
So, what is this ABC model for metadata?
Three broad types of metadata fit in this ABC model.
Application Context: information needed by humans or applications to operate. This includes metadata about the existence of data, along with the description, semantics, and tags associated with the data.
Behavior: information about how the data is created and used over time. This includes information about ownership, creation, common usage patterns, the people or processes that are frequent users of the data, provenance, and lineage.
Change: information about how the data is changing over time. This captures information about the evolution of data (for example, schema evolution for a table) and the processes that create it (for example, the related ETL code for a table).
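To make the three categories concrete, here is a minimal Python sketch of how one metadata record could be organized along the ABC model. The class and field names (MetadataRecord, record_schema, and so on) are illustrative assumptions, not part of the ABC model itself or of any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class ApplicationContext:
    """A: what the data is (existence, description, semantics, tags)."""
    name: str
    description: str = ""
    tags: List[str] = field(default_factory=list)


@dataclass
class Behavior:
    """B: how the data is created and used (ownership, usage, lineage)."""
    owner: str
    frequent_users: List[str] = field(default_factory=list)
    upstream_sources: List[str] = field(default_factory=list)  # lineage


@dataclass
class SchemaVersion:
    """C: one snapshot of a table's schema, kept to track change over time."""
    columns: Dict[str, str]   # column name -> type
    recorded_at: datetime
    produced_by: str = ""     # hypothetical reference to the related ETL code


@dataclass
class MetadataRecord:
    application_context: ApplicationContext
    behavior: Behavior
    change: List[SchemaVersion] = field(default_factory=list)

    def record_schema(self, columns, produced_by=""):
        """Append a schema version only if it differs from the latest one."""
        if self.change and self.change[-1].columns == columns:
            return False
        now = datetime.now(timezone.utc)
        self.change.append(SchemaVersion(dict(columns), now, produced_by))
        return True


# Example with made-up table, team, and file names.
orders = MetadataRecord(
    ApplicationContext("sales.orders", "One row per customer order", ["sales"]),
    Behavior(owner="data-eng", frequent_users=["bi-team"],
             upstream_sources=["raw.orders"]),
)
orders.record_schema({"order_id": "int", "amount": "float"}, "etl/orders.py")
orders.record_schema({"order_id": "int", "amount": "float"})  # unchanged, skipped
orders.record_schema({"order_id": "int", "amount": "float", "currency": "str"})
print(len(orders.change))  # number of distinct schema versions recorded
```

Keeping the Change history as an append-only list is one simple design choice: it lets a catalog show how a table's schema evolved without overwriting earlier versions.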
Capturing metadata based on the ABC model and using metadata to drive Data Governance applications such as Data Catalog and Data Quality is a key strategy that is being adopted by many fast-growing companies like Uber, Airbnb, LinkedIn, Lyft and others.
Enable self-service analytics using Data Catalog
Thinking of a Data Catalog as merely a library management system for your data is a very narrow view.
What is a data catalog?
A Data Catalog is an inventory of available data and its metadata, often combined with a search tool. It helps data users easily discover data and evaluate its fitness for the intended use.
A data catalog focuses first on data sets and connects those data sets with rich metadata information. Data sets are the files and tables that data teams need to find and access. They may reside in a data lake, warehouse, master data repository, or any other shared data resource.
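As a rough illustration of the "inventory plus metadata plus search" idea, the sketch below keeps catalog entries in memory and supports a simple keyword search. It is a toy under stated assumptions: the CatalogEntry and DataCatalog names, the fields, and the data set names are all hypothetical, and real catalogs use persistent storage and far richer search.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    name: str                 # e.g. "warehouse.sales.orders"
    location: str             # data lake, warehouse, master data repository, ...
    description: str = ""
    tags: List[str] = field(default_factory=list)
    owner: str = ""


class DataCatalog:
    """In-memory inventory of data sets with a naive keyword search."""

    def __init__(self):
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, query):
        """Return entries where every query term appears in the
        name, description, or tags (case-insensitive substring match)."""
        terms = query.lower().split()
        hits = []
        for entry in self._entries.values():
            haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
            if all(term in haystack for term in terms):
                hits.append(entry)
        return sorted(hits, key=lambda e: e.name)


catalog = DataCatalog()
catalog.register(CatalogEntry("warehouse.sales.orders", "warehouse",
                              "One row per customer order", ["sales"], "data-eng"))
catalog.register(CatalogEntry("lake.web.clickstream", "data lake",
                              "Raw page view events", ["web", "raw"], "platform"))
catalog.register(CatalogEntry("warehouse.sales.refunds", "warehouse",
                              "Refunds issued per order", ["sales", "finance"], "data-eng"))

results = [entry.name for entry in catalog.search("sales order")]
print(results)
```

Even this small version shows why the metadata matters: the search only finds a data set if someone has described and tagged it.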
Why do you need a data catalog?
The greatest value of the data catalog is that it improves the productivity of data teams and enables collaboration. Because in most organizations data and technology exist in silos, data teams are often working blind, without visibility into the data sets that exist. They spend too much time finding and understanding data, often recreating data sets that already exist. Here is a diagram that shows data analysis workflow with and without a Data Catalog.
Data Teams in organizations without a data catalog often rely on Tribal Knowledge or Documentation. This reliance on what individual people know often leads to a lot of back-and-forth conversations between team members, which hurts productivity.
Because data workflows are complex, enabling transparency and collaboration among data users using a Data Catalog is an optimal solution to reduce tribal knowledge dependency and enable self-service analytics.
However, a Data Catalog and its associated metadata should be continuously curated and enriched to ensure the information stays up to date.
Not all Data Catalogs are created equal
As demand for Data Catalogs is growing among enterprises, a lot of Data and Analytics solution providers and Cloud infrastructure providers such as Azure, AWS, and Google are offering Data Catalogs embedded into their solution offerings. These Data Catalogs are called Embedded Data Catalogs.
There are also Stand-alone Data Catalogs: generalist, independent, business-oriented Data Catalogs for broader use in data management, analytics, and Data Governance. The major differences between Embedded and Stand-alone Data Catalogs are:
| Stand-Alone Data Catalogs | Embedded Data Catalogs |
| --- | --- |
| Automated discovery, profiling and tagging. | Use case specific or tool specific like AWS Glue Catalog that is specific and works well with AWS services like AWS Glue and Redshift. |
| Can be used for any use case and with any data and analytics tools. | Leads to metadata silos that are difficult to integrate and use beyond specific use cases. |
| Horizontally scalable for many use cases and tools. | Vertically scalable within specific tool or cloud providers infrastructure. |
| Open APIs and Interoperable. | Vendor lock-in. |
I advise you to investigate and adopt a stand-alone Data Catalog when your requirements for data cataloging transcend a single tool, environment, or use case and when your teams would like to catalog data assets across several use cases.
Modern Data Catalogs are ML Augmented
Machine Learning is significantly impacting data catalogs. Modern machine learning augmented Data Catalogs automate metadata discovery and profiling. ML Augmented Data Catalogs provide AI-driven search and discovery of data assets, including recommendations. Modern Data Catalogs establish semantic relationships between data using knowledge graphs. They also provide anomaly detection and identify sensitive PII, flagging risky data assets and outliers.
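To give a feel for what automated profiling and tagging does, here is a deliberately simple sketch that tags columns likely to hold PII by sampling their values. Note that it swaps in plain regular-expression rules rather than machine learning; an ML augmented catalog would typically rely on trained classifiers instead. The patterns, the tag_pii_columns function, the threshold, and the sample rows are all assumptions made for illustration.

```python
import re

# Hypothetical, intentionally loose patterns for two PII types.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{7,}\d$"),
}


def tag_pii_columns(sample_rows, threshold=0.8):
    """Tag a column with a PII type when at least `threshold` of its
    non-empty sampled values match that type's pattern."""
    tags = {}
    columns = list(sample_rows[0].keys()) if sample_rows else []
    for column in columns:
        values = [str(row[column]) for row in sample_rows
                  if row.get(column) not in (None, "")]
        if not values:
            continue
        for pii_type, pattern in PII_PATTERNS.items():
            matches = sum(1 for value in values if pattern.match(value))
            if matches / len(values) >= threshold:
                tags[column] = pii_type
                break  # first matching type wins
    return tags


# Made-up sample rows standing in for a profiled table.
sample = [
    {"customer_id": 101, "contact": "ana@example.com", "mobile": "+1 415 555 0100"},
    {"customer_id": 102, "contact": "raj@example.org", "mobile": "+44 20 7946 0958"},
    {"customer_id": 103, "contact": "li@example.net", "mobile": ""},
]

pii_tags = tag_pii_columns(sample)
print(pii_tags)
```

Rules like these produce false positives (a long numeric ID can look like a phone number), which is one reason catalogs pair automated tagging with human curation.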
ML augmented Data Catalogs enable the pervasive use of metadata not just for Data Governance but also to automate data integration, data preparation, data quality, and many other data management activities. This next-generation Data Catalog can, therefore, accelerate time to insights by helping data teams automate most of the data discovery, tagging, and collaboration.
When implementing a Data Catalog, it’s important for the organization to deploy a modern ML Augmented solution, since this will be critical to the success of Data Governance and data & analytics initiatives.
Recommendations to implement a Data Catalog
What are the business benefits of having a Data Catalog?
The two major business benefits of implementing a Data Catalog are:
Discovering and identifying data that delivers value, and governing the quality and security of data, are some of the biggest challenges that organizations will face in the years to come. ML augmented Data Catalogs will be a must-have solution in the data ecosystem for organizations to keep making data-driven decisions with trust and confidence.
Quick recap
Data Quality is another important aspect of Data Governance. Kindly read my next blog post in this series on how to implement a data quality solution in your data ecosystem and empower your Data Governance practice to improve data quality and bring trust and confidence to your data-driven decisions.