Data Engineering System Design 2: Data Catalogue and Metadata Management

What are we solving in this newsletter?

As a company grows, so does its data. Data can come from a variety of SQL and NoSQL databases, microservices, and other sources. It is a tedious job for data analysts, scientists, and engineers to identify the right dataset, keep its schema up to date, and build data pipelines. Tracking data lineage and metadata requires close collaboration between upstream and downstream teams.

In this newsletter, I cover best practices and the right tools for metadata management.

Ever wondered what metadata is?

  1. Metadata is data about data: any information about the data that we want to store for later use.
  2. It is not the actual data (a file or a video, for example) but detailed information describing that data.
  3. It can be divided into two broad types (see the sketch below):
     - Technical metadata: schemas, tables, columns, etc.
     - Business metadata: comments, descriptions, ratings, classifications, etc.
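
To make this concrete, here is a minimal sketch in Python of what a metadata record for a single table could look like. The field names and values are purely illustrative and do not follow any particular catalog's schema.

    # Illustrative metadata record for one table: technical plus business metadata.
    orders_metadata = {
        "technical": {
            "database": "sales_db",
            "table": "orders",
            "format": "parquet",
            "columns": [
                {"name": "order_id", "type": "bigint"},
                {"name": "customer_email", "type": "string"},
                {"name": "order_total", "type": "decimal(10,2)"},
            ],
        },
        "business": {
            "description": "One row per customer order, refreshed hourly.",
            "owner": "sales-data-team",
            "rating": 4.5,
            "classifications": ["PII"],  # customer_email is personally identifiable
        },
    }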

What is metadata management?

It is the process of capturing metadata changes (for example, a table schema change), storing them, and making them searchable.

What is a data catalog?

  1. An inventory of the data assets within an organization.
  2. A data catalog uses the metadata to help businesses manage the data they produce and to support data discovery and data governance.

What is data classification?

  1. Data can be classified into various categories.
  2. Example classifications are PII, EXPIRES_ON, DATA_QUALITY, and SENSITIVE (see the sketch below).
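
As a rough sketch of how such a classification could be attached in practice, the snippet below calls the Atlas v2 REST endpoint for adding classifications to an entity. The Atlas URL, credentials, and entity GUID are placeholders assumed for illustration.

    import requests

    ATLAS_URL = "http://localhost:21000/api/atlas/v2"   # assumed local Atlas instance
    AUTH = ("admin", "admin")                            # assumed default credentials

    entity_guid = "00000000-0000-0000-0000-000000000000"  # placeholder GUID of a table entity

    # Attach a PII classification so the catalog flags this entity as sensitive.
    resp = requests.post(
        f"{ATLAS_URL}/entity/guid/{entity_guid}/classifications",
        json=[{"typeName": "PII"}],
        auth=AUTH,
    )
    resp.raise_for_status()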

In the paragraphs above, I talked about some of the buzzwords (metadata, data catalog, classification) in the data world. But is there an out-of-the-box solution that can solve these challenges?

Yes, Apache Atlas!

Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets, and provide collaboration capabilities around them for data scientists, analysts, and the data governance team.

Push and Pull model

  1. Apache Atlas provides out-of-the-box Kafka-based hooks for Hive, Sqoop, Spark, and Storm that push metadata and lineage information into Atlas (the push model).
  2. Atlas also exposes REST APIs to perform CRUD operations on metadata, which downstream tools can call on demand (the pull model; see the sketch below).
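
Here is a hedged sketch of the pull model: a downstream service searching the catalog for a Hive table and then reading its metadata through the v2 REST API. The Atlas URL and credentials are the same assumptions as above, and the response handling is simplified.

    import requests

    ATLAS_URL = "http://localhost:21000/api/atlas/v2"   # assumed local Atlas instance
    AUTH = ("admin", "admin")                            # assumed default credentials

    # Basic search: find Hive tables matching "orders".
    search = requests.get(
        f"{ATLAS_URL}/search/basic",
        params={"typeName": "hive_table", "query": "orders"},
        auth=AUTH,
    )
    search.raise_for_status()
    entities = search.json().get("entities", [])

    # Read the full metadata (attributes, classifications, relationships) of the first hit.
    if entities:
        guid = entities[0]["guid"]
        detail = requests.get(f"{ATLAS_URL}/entity/guid/{guid}", auth=AUTH)
        detail.raise_for_status()
        print(detail.json()["entity"]["attributes"].get("name"))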

How do I get notified if a schema changes on the source side?

Atlas captures every change and publishes it to a Kafka topic called “ATLAS_ENTITIES”, so downstream consumers can listen to the changes and act on them quickly.
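
A minimal sketch of such a downstream listener, assuming the kafka-python client and a broker at localhost:9092; the exact JSON layout of Atlas notification messages varies by version, so the handling below simply prints whatever arrives.

    import json
    from kafka import KafkaConsumer  # assumes the kafka-python package is installed

    # Subscribe to the topic where Atlas publishes entity change notifications.
    consumer = KafkaConsumer(
        "ATLAS_ENTITIES",
        bootstrap_servers=["localhost:9092"],   # assumed broker address
        group_id="schema-change-watcher",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        notification = message.value
        # React to the change: alert the owning team, refresh a pipeline, etc.
        print("Change notification received:", notification)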

Architecture of metadata and lineage management with Atlas (a high-level view based on my current understanding; it may change)

[Architecture diagram: Atlas-based metadata and lineage management]

Please note that there are other alternatives, such as Google Data Catalog, Amundsen, and DataHub.

#data #datamanagement #dataengineering #design
