Data Catalogue and Metadata Management
Data Engineering System Design 2: Data Catalogue and Metadata Management
What are we solving in this newsletter?
As a company grows, so does its data. Data can come from a variety of SQL and NoSQL databases, microservices, and other sources. It is tedious for data analysts, scientists, and engineers to identify the right dataset, keep its schema up to date, and build data pipelines. Tracking data lineage and metadata requires close collaboration between upstream and downstream teams.
In this newsletter, I cover the best practices and the right tools for metadata management.
Ever wondered what metadata is?
What is metadata management?
It is the process of capturing metadata changes (for example, a change to a table schema), storing them, and making them searchable.
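To make the capture, store, and search steps concrete, here is a minimal sketch in Python. It is not how any particular catalog tool is implemented: `MetadataStore`, `TableMetadata`, and the sample `orders` table are hypothetical names, and everything lives in memory, whereas a real system would persist and index it.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Current schema plus earlier schema versions for one table."""
    name: str
    columns: dict[str, str]  # column name -> column type
    history: list[dict[str, str]] = field(default_factory=list)


class MetadataStore:
    """Hypothetical in-memory store, for illustration only."""

    def __init__(self) -> None:
        self._tables: dict[str, TableMetadata] = {}

    def capture(self, table: str, columns: dict[str, str]) -> dict[str, list[str]]:
        """Record the latest schema and return what changed since the last capture."""
        previous = self._tables.get(table)
        old = previous.columns if previous else {}
        diff = {
            "added": sorted(c for c in columns if c not in old),
            "removed": sorted(c for c in old if c not in columns),
            "retyped": sorted(c for c in columns if c in old and old[c] != columns[c]),
        }
        if previous is None:
            self._tables[table] = TableMetadata(table, dict(columns))
        elif any(diff.values()):
            # Keep the old version so schema history stays queryable.
            previous.history.append(previous.columns)
            previous.columns = dict(columns)
        return diff

    def search(self, keyword: str) -> list[str]:
        """Return tables whose name or any column name contains the keyword."""
        kw = keyword.lower()
        return sorted(
            t.name for t in self._tables.values()
            if kw in t.name.lower() or any(kw in c.lower() for c in t.columns)
        )


store = MetadataStore()
store.capture("orders", {"id": "bigint", "amount": "int"})
changes = store.capture("orders", {"id": "bigint", "amount": "decimal", "email": "string"})
print(changes)
print(store.search("email"))
```

The second `capture` call reports `email` as added and `amount` as retyped, and the search for `email` then finds the `orders` table.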
What is a data catalog?
What is data classification?
In the paragraphs above I used several buzzwords from the data world (metadata, data catalog, classification). But is there an out-of-the-box solution that addresses these challenges?
Yes, Apache Atlas!!!
Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets and provide collaboration capabilities around these data assets for data scientists, analysts and the data governance team.
Push and Pull model
How do I get notified if a schema changes on the source side?
Atlas captures every change and publishes it to a Kafka topic called “ATLAS_ENTITIES”, so downstream consumers can listen for changes and act on them quickly.
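A downstream listener for this push model could look roughly like the sketch below. It assumes the `kafka-python` client, a broker at `localhost:9092`, and a hypothetical consumer group name. The payload fields read here (`message`, `operationType`, `entity`, `typeName`, `attributes.qualifiedName`) are an assumption about the notification format, so check them against the messages your Atlas version actually emits.

```python
import json


def summarize_event(raw: bytes) -> dict:
    """Pull out the fields a downstream job usually needs from one notification.

    Assumed payload shape: {"message": {"operationType": ..., "entity": {...}}}.
    """
    envelope = json.loads(raw)
    message = envelope.get("message", {})
    entity = message.get("entity", {})
    return {
        "operation": message.get("operationType"),
        "type": entity.get("typeName"),
        "qualified_name": entity.get("attributes", {}).get("qualifiedName"),
    }


def listen(bootstrap_servers: str = "localhost:9092") -> None:
    """Print a summary of every change published to the topic (blocks forever)."""
    from kafka import KafkaConsumer  # kafka-python, imported only when listening

    consumer = KafkaConsumer(
        "ATLAS_ENTITIES",
        bootstrap_servers=bootstrap_servers,
        group_id="schema-change-listener",  # hypothetical consumer group
        auto_offset_reset="latest",
    )
    for record in consumer:
        if record.value is None:  # skip empty records
            continue
        print(summarize_event(record.value))


# Hand-written sample message, not captured from a real Atlas instance.
sample = json.dumps({
    "message": {
        "operationType": "ENTITY_UPDATE",
        "entity": {
            "typeName": "hive_table",
            "attributes": {"qualifiedName": "sales.orders@prod"},
        },
    }
}).encode("utf-8")

event = summarize_event(sample)
print(event)
```

Keeping the parsing in `summarize_event` separate from the consumer loop means the reaction logic can be tested without a running Kafka cluster. Calling `listen()` would then start the actual subscription.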
Architecture of metadata and lineage management with Atlas (this is only a high-level view based on my current understanding, and it may change)
Please note that there are various alternatives, such as Google Data Catalog, Amundsen, and DataHub.