Data Engineering System Design 2: Data Catalogue and Metadata Management

What are we solving in this newsletter?

As a company grows, so does its data. Data can come from a variety of SQL and NoSQL databases, microservices, and other sources. It is a tedious job for data analysts, scientists, and engineers to identify the right dataset, keep its schema up to date, and build data pipelines. Tracking data lineage and metadata requires close collaboration between upstream and downstream teams.

In this newsletter, I cover best practices and the right tools for metadata management.

Ever wondered what metadata is?

  1. Metadata is data about data: any information about the data that we want to store for later use.
  2. It is not the actual data (a file or a video, for example) but detailed information describing that data.
  3. It can be divided into two broad types (see the sketch below):
     - Technical metadata: schemas, tables, columns, etc.
     - Business metadata: comments, descriptions, ratings, classifications, etc.
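
To make this concrete, here is a minimal sketch in Python of what a metadata record for a single table could look like. The field names and values are purely illustrative and do not follow any particular catalog's schema.

    # Illustrative metadata record for one table: technical plus business metadata.
    orders_metadata = {
        "technical": {
            "database": "sales_db",
            "table": "orders",
            "format": "parquet",
            "columns": [
                {"name": "order_id", "type": "bigint"},
                {"name": "customer_email", "type": "string"},
                {"name": "order_total", "type": "decimal(10,2)"},
            ],
        },
        "business": {
            "description": "One row per customer order, refreshed hourly.",
            "owner": "sales-data-team",
            "rating": 4.5,
            "classifications": ["PII"],  # customer_email is personally identifiable
        },
    }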

What is metadata management?

It is the process of capturing metadata changes (for example, a table schema change), storing them, and making them searchable.

What is a data catalog?

  1. An inventory of the data assets within an organization.
  2. A data catalog uses the metadata to help businesses manage the data they produce and to support data discovery and data governance.

What is data classification?

  1. Data can be classified into various categories.
  2. Example classifications are PII, EXPIRES_ON, DATA_QUALITY, and SENSITIVE (see the sketch below).
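
As a rough sketch of how such a classification could be attached in practice, the snippet below calls the Atlas v2 REST endpoint for adding classifications to an entity. The Atlas URL, credentials, and entity GUID are placeholders assumed for illustration.

    import requests

    ATLAS_URL = "http://localhost:21000/api/atlas/v2"   # assumed local Atlas instance
    AUTH = ("admin", "admin")                            # assumed default credentials

    entity_guid = "00000000-0000-0000-0000-000000000000"  # placeholder GUID of a table entity

    # Attach a PII classification so the catalog flags this entity as sensitive.
    resp = requests.post(
        f"{ATLAS_URL}/entity/guid/{entity_guid}/classifications",
        json=[{"typeName": "PII"}],
        auth=AUTH,
    )
    resp.raise_for_status()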

In the paragraphs above, I talked about some of the buzzwords (metadata, data catalog, classification) in the data world. But is there an out-of-the-box solution that can solve these challenges?

Yes, Apache Atlas!

Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets, and provide collaboration capabilities around them for data scientists, analysts, and the data governance team.

Push and Pull model

  1. Apache Atlas provides out-of-the-box Kafka-based hooks for Hive, Sqoop, Spark, and Storm that push metadata and lineage information into Atlas (the push model).
  2. Atlas also exposes REST APIs to perform CRUD operations on metadata, which downstream tools can call on demand (the pull model; see the sketch below).
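
Here is a hedged sketch of the pull model: a downstream service searching the catalog for a Hive table and then reading its metadata through the v2 REST API. The Atlas URL and credentials are the same assumptions as above, and the response handling is simplified.

    import requests

    ATLAS_URL = "http://localhost:21000/api/atlas/v2"   # assumed local Atlas instance
    AUTH = ("admin", "admin")                            # assumed default credentials

    # Basic search: find Hive tables matching "orders".
    search = requests.get(
        f"{ATLAS_URL}/search/basic",
        params={"typeName": "hive_table", "query": "orders"},
        auth=AUTH,
    )
    search.raise_for_status()
    entities = search.json().get("entities", [])

    # Read the full metadata (attributes, classifications, relationships) of the first hit.
    if entities:
        guid = entities[0]["guid"]
        detail = requests.get(f"{ATLAS_URL}/entity/guid/{guid}", auth=AUTH)
        detail.raise_for_status()
        print(detail.json()["entity"]["attributes"].get("name"))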

How do I get notified if a schema changes on the source side?

Atlas captures every change and publishes it to a Kafka topic called “ATLAS_ENTITIES”, so downstream consumers can listen to the changes and act on them quickly.
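
A minimal sketch of such a downstream listener, assuming the kafka-python client and a broker at localhost:9092; the exact JSON layout of Atlas notification messages varies by version, so the handling below simply prints whatever arrives.

    import json
    from kafka import KafkaConsumer  # assumes the kafka-python package is installed

    # Subscribe to the topic where Atlas publishes entity change notifications.
    consumer = KafkaConsumer(
        "ATLAS_ENTITIES",
        bootstrap_servers=["localhost:9092"],   # assumed broker address
        group_id="schema-change-watcher",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        notification = message.value
        # React to the change: alert the owning team, refresh a pipeline, etc.
        print("Change notification received:", notification)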

Architecture of metadata and lineage management with Atlas (a high-level view based on my current understanding; it may change)

[Architecture diagram: Atlas-based metadata and lineage management]

Please note that there are other alternatives, such as Google Data Catalog, Amundsen, and DataHub.

#data #datamanagement #dataengineering #design
