Demystifying AWS DataZone

Akhil Makol

Senior Vice President, Principal Engineer @ NatWest Group | 40under40 Data Science & Analytics Leader | SAFe® Agilist | Data Engineering | DevOps | Data Marketplace | Responsible AI | Fintech

Published: May 19, 2024

Amazon DataZone is a data management service for quickly cataloging, discovering, sharing, and governing data across AWS. It lets administrators and data stewards regulate data access with fine-grained controls, so that each user gets the appropriate level of access and context. This makes it simpler for a wide range of user personas, including engineers, data scientists, product managers, analysts, and business users, to access and collaborate on organizational data for decision-making and analytical reporting.

AWS Services Used in a Data Lake Architecture

Amazon Web Services (AWS) offers a comprehensive ecosystem for building and managing data lakes, built on services such as Lake Formation, Glue, and Athena, together with centralized, domain-owned Data Zones. This article walks through the best practices for launching your AWS data lake, focusing on configuring Lake Formation and establishing a Data Zone.

Data Lake

A Data Lake is a centralized repository that stores all types of data at scale. It lets you collect data securely from various sources and analyze it with different tools for flexible, large-scale data processing.

AWS Lake Formation

AWS Lake Formation simplifies creating a secure data lake in AWS. It automates integration with other AWS services, such as Amazon S3 and the AWS Glue Data Catalog, for easier data management and access control.

AWS Glue

AWS Glue is a managed service that simplifies data discovery, preparation, and cataloging for analytics and machine learning use cases.

Amazon Athena

Amazon Athena is a serverless query service for analyzing data in Amazon S3 using SQL. It is well suited to quick, ad-hoc analysis and can query a variety of data formats directly, which supports efficient analysis and reporting.

Amazon DataZone

The concept of a Data Zone within a data lake supports efficient data management and governance. It is essentially a segmented area in your data lake designed to categorize data based on its readiness for centralized consumption in a publish-subscribe model.

The key steps for enabling the Data Zone through the AWS console, and for managing access for producers and subscribers, are outlined below.

Create S3 Bucket for DataZone

Navigate to the S3 console and set up a bucket named “datazone-bucket-12345”. We’ll use this bucket for our DataZone area. Don’t forget to enable versioning.
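If you prefer to script this step instead of using the console, the following is a minimal sketch using the boto3 S3 client. The helper name `create_versioned_bucket` and the `region` parameter are assumptions made for illustration; the client is passed in so that you can choose your own credentials and region.

```python
def create_versioned_bucket(s3_client, bucket_name, region="us-east-1"):
    """Create an S3 bucket and enable versioning on it.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    params = {"Bucket": bucket_name}
    # Outside us-east-1, S3 requires an explicit LocationConstraint.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3_client.create_bucket(**params)

    # Versioning is off by default, so it has to be switched on explicitly.
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )
    return bucket_name
```

A call would look like `create_versioned_bucket(boto3.client("s3"), "datazone-bucket-12345")`. Bucket names are globally unique, so this exact name may already be taken in practice.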

Create Domain for DataZone

Navigate to Amazon DataZone and create a domain for further consumption.

After that, you can see a dashboard with the domain settings. Go to “Blueprints”, select the DefaultDataLake option, and enable it.

Use the S3 location we created for the data lake: “s3://datazone-bucket-12345”

Create a project for Data Publisher

Next, navigate to the DataZone dashboard and select “Open data portal”. You will then be taken to the portal’s main page. Now it’s time to set up the first project for the Publisher. To do this, click on “Create project”.

Enter a name in the input field, for example, “Publisher”.

Next step: create an environment

First, we have to create a profile for the Publisher project. Choose “Create Environment Profile” from the “Environments” tab.
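The same project can also be created programmatically. Below is a minimal sketch that assumes a boto3 DataZone client and an existing domain; the helper name and the description text are illustrative, and `domain_id` stands for the identifier of the domain you created earlier.

```python
def create_publisher_project(datazone_client, domain_id, name="Publisher"):
    """Create a DataZone project in the given domain and return its id.

    `datazone_client` is expected to be boto3.client("datazone").
    """
    response = datazone_client.create_project(
        domainIdentifier=domain_id,
        name=name,
        # Illustrative description, not taken from the walkthrough.
        description="Project that publishes assets to the DataZone catalog",
    )
    return response["id"]
```

The returned project id is what later calls, such as creating an environment, take as `projectIdentifier`.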

Next, it’s time to create an environment. For this, go to the “Environments” tab and click on “create environment”. Fill in the Name and select the profile you created earlier. Leave the rest of the form blank. This way, DataZone will apply the default naming convention.

After you initiate a new environment, DataZone will start creating resources for it. Behind the scenes, CloudFormation will deploy the stack. Your new environment will be set up shortly!
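For readers who automate this step, here is a hedged sketch of creating the environment through the boto3 DataZone client. It assumes the environment profile already exists; the helper name and all identifier values are placeholders rather than values from this walkthrough.

```python
def create_environment_for_project(datazone_client, domain_id, project_id,
                                   profile_id, name):
    """Create a DataZone environment from an existing environment profile.

    Returns the new environment id and its initial status. Only the required
    fields are passed, so DataZone applies its default naming convention.
    """
    response = datazone_client.create_environment(
        domainIdentifier=domain_id,
        projectIdentifier=project_id,
        environmentProfileIdentifier=profile_id,
        name=name,
    )
    # Provisioning continues asynchronously after this call returns.
    return response["id"], response.get("status")
```

Because the resources are deployed in the background, the status returned here is only the starting state; the environment becomes usable once provisioning finishes.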

Once it’s ready, you’ll be able to view a dashboard for the environment. Now repeat these steps to create a Consumer project with its own environment profile and environment. You should then have both projects available in the portal.

Now you can create a table with some data. Make sure that your publishing environment and the database publisherdata_pub_db are both selected in the query editor.

You can use this example to ingest data into a new table: inventory_table
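As one possible shape for such a statement, the sketch below builds an Athena CREATE TABLE AS SELECT (CTAS) query from a few literal rows. The column names (`item_id`, `item_name`, `quantity`) and the sample rows are hypothetical and chosen only for illustration; only the database and table names come from this walkthrough.

```python
# Hypothetical sample rows: (item_id, item_name, quantity).
SAMPLE_ROWS = [
    (1, "keyboard", 40),
    (2, "monitor", 15),
    (3, "webcam", 22),
]


def build_inventory_ctas(database, table, rows):
    """Return a CTAS statement that creates `table` from literal rows.

    Values are interpolated directly, which is acceptable for a fixed demo
    data set but not for untrusted input.
    """
    values = ",\n        ".join(
        f"({item_id}, '{name}', {quantity})" for item_id, name, quantity in rows
    )
    return (
        f"CREATE TABLE {database}.{table} AS\n"
        f"SELECT * FROM (\n"
        f"    VALUES\n"
        f"        {values}\n"
        f") AS t (item_id, item_name, quantity)"
    )


if __name__ == "__main__":
    print(build_inventory_ctas("publisherdata_pub_db", "inventory_table", SAMPLE_ROWS))
```

The printed statement can be pasted into the Athena query editor with the publishing environment selected.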

After that, you can see your table in Amazon Athena.

Generate metadata

It’s time to return to DataZone and generate metadata from the table you created in the previous step. As the Publisher, navigate to the DATA tab and select Data sources from the menu. Here, you’ll see a list of the sources from which the system can generate metadata. Click on the first one, “PublisherData-default-datasource”, which is created by default. From the Action dropdown menu, choose Run, and then hit the refresh button. Once the data source run is complete, the assets will be added to the Amazon DataZone inventory.
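The Run action can also be triggered from code. The following is a minimal sketch, assuming a boto3 DataZone client and that you already know the domain and data source identifiers; the helper name, polling interval, and retry limit are illustrative choices.

```python
import time

# Statuses after which a data source run will not change any further.
TERMINAL_STATUSES = {"SUCCESS", "PARTIALLY_SUCCEEDED", "FAILED"}


def run_data_source(datazone_client, domain_id, data_source_id,
                    poll_seconds=10, max_polls=60):
    """Start a data source run and poll until it reaches a terminal status."""
    run = datazone_client.start_data_source_run(
        domainIdentifier=domain_id,
        dataSourceIdentifier=data_source_id,
    )
    status = run["status"]
    for _ in range(max_polls):
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
        status = datazone_client.get_data_source_run(
            domainIdentifier=domain_id,
            identifier=run["id"],
        )["status"]
    # May still be a non-terminal status if max_polls was exhausted.
    return status
```

Polling plays the role of the refresh button in the console: the run is asynchronous, so the assets only appear in the inventory after the status reaches a successful state.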

Subscribe data from the data catalog

From the “Consumer” project, search for the inventory_table asset and then send a request to subscribe to the data.

As the “Publisher”, go to the DATA tab, choose “Incoming requests”, and approve the request.
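The request and approval flow can be sketched with the boto3 DataZone client as two small helpers, one for each side. The helper names, the reason text, and the decision comment are assumptions for illustration; `listing_id` stands for the identifier of the published inventory_table listing in the catalog.

```python
def request_subscription(datazone_client, domain_id, listing_id,
                         consumer_project_id, reason):
    """Consumer side: ask to subscribe a project to a published listing."""
    response = datazone_client.create_subscription_request(
        domainIdentifier=domain_id,
        requestReason=reason,
        subscribedListings=[{"identifier": listing_id}],
        subscribedPrincipals=[{"project": {"identifier": consumer_project_id}}],
    )
    return response["id"]


def approve_subscription(datazone_client, domain_id, request_id):
    """Publisher side: accept an incoming subscription request."""
    response = datazone_client.accept_subscription_request(
        domainIdentifier=domain_id,
        identifier=request_id,
        # Illustrative comment text.
        decisionComment="Approved for analytics use",
    )
    return response["status"]
```

In practice the two calls would run under different identities, since the consumer submits the request and a member of the publishing project approves it.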

Time to use your data

Now that you have published an asset to the DataZone catalog and subscribed to it, return to DataZone, select the Consumer project, and open Athena. Choose the consumerdata_sub_db database and preview the inventory_table.
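The same preview can be run outside the console. This is a minimal sketch using the boto3 Athena client; the helper name is illustrative, and `workgroup` is a placeholder for the Athena workgroup that belongs to your Consumer environment, which this walkthrough does not name.

```python
import time


def preview_table(athena_client, database, table, workgroup,
                  limit=10, poll_seconds=1, max_polls=30):
    """Run SELECT * ... LIMIT on a table and return the rows as lists of strings."""
    query = f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'
    execution_id = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,  # assumed to define the query result location
    )["QueryExecutionId"]

    state = "QUEUED"
    for _ in range(max_polls):
        execution = athena_client.get_query_execution(QueryExecutionId=execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {execution_id} ended in state {state}")

    rows = athena_client.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    # For SELECT queries, the first row holds the column names.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

A call such as `preview_table(boto3.client("athena"), "consumerdata_sub_db", "inventory_table", workgroup)` returns the header row followed by the data rows, which confirms that the subscription grant has taken effect.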

Appendix

Data Lake Blueprint

This blueprint outlines how to start and set up AWS Glue, AWS Lake Formation, and Amazon Athena in the Amazon DataZone catalog.

References

AWS re:Invent 2023 - What’s new in Amazon DataZone

