Demystifying AWS DataZone

Amazon DataZone is a streamlined data management service that enables quick cataloging, discovery, sharing, and governance of data across AWS. It allows administrators and data stewards to regulate data access with precise controls, ensuring appropriate access levels and context. This makes it simpler for a wide range of user personas, including engineers, data scientists, product managers, analysts, and business users, to access and collaborate on organizational data for insightful decision-making and analytical reporting.


AWS Services Leveraged in a Data Lake Architecture

Amazon Web Services (AWS) offers a comprehensive ecosystem for building and managing data lakes, harnessing services such as AWS Lake Formation, AWS Glue, Amazon Athena, and centralized, domain-owned Amazon DataZone domains. This article guides you through best practices for launching your AWS data lake, focusing on configuring Lake Formation and establishing a DataZone domain.

Data Lake

A data lake is a centralized repository that stores all types of data at scale, enabling secure collection of data from various sources and analysis with a variety of tools for flexible, large-scale processing.

AWS Lake Formation

AWS Lake Formation simplifies building a secure data lake on AWS, automating integration with other AWS services such as Amazon S3 and the AWS Glue Data Catalog for straightforward data management and access control.

AWS Glue

AWS Glue is a managed service that simplifies data discovery, preparation, and cataloging for analytics and machine learning use cases.

Amazon Athena

Amazon Athena is a serverless query service for analyzing data in Amazon S3 using standard SQL. Ideal for quick, ad-hoc analysis, it supports direct queries on various data formats, facilitating efficient analysis and reporting.


Amazon DataZone

Within a data lake, Amazon DataZone facilitates efficient data management and governance. It is essentially a segmented area of your data lake designed to categorize data based on its readiness for centralized consumption in a publish-subscribe model.

The key steps for enabling Amazon DataZone from the AWS console and managing access for producers and subscribers are outlined below.

Create S3 Bucket for DataZone

Navigate to the S3 console and create a bucket named “datazone-bucket-12345”. We’ll use this bucket for our DataZone area. Don’t forget to enable versioning.
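If you prefer to script this step, here is a minimal sketch using boto3 (assuming AWS credentials and a region are already configured; the bucket name matches the one used above):

```python
import boto3

# Minimal sketch: create the DataZone bucket and enable versioning.
s3 = boto3.client("s3", region_name="us-east-1")

bucket_name = "datazone-bucket-12345"

# Note: outside us-east-1, create_bucket also needs a
# CreateBucketConfiguration with a LocationConstraint.
s3.create_bucket(Bucket=bucket_name)

# Enable versioning, as recommended above.
s3.put_bucket_versioning(
    Bucket=bucket_name,
    VersioningConfiguration={"Status": "Enabled"},
)
```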

Create Domain for DataZone

Navigate to the Amazon DataZone console and create a domain.

After that, you will see a dashboard with the domain settings. Go to “Blueprints”, select the DefaultDataLake option, and enable it.

Use the S3 location we created earlier for the data lake: “s3://datazone-bucket-12345”.
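If you would rather script the domain creation, here is a rough sketch with the boto3 DataZone client; the execution role ARN is a placeholder, and the DefaultDataLake blueprint and S3 location are still configured from the console as described above:

```python
import boto3

datazone = boto3.client("datazone")

# Placeholder role ARN: a DataZone execution role must already exist.
domain = datazone.create_domain(
    name="my-datazone-domain",  # illustrative name
    description="Domain for the data lake walkthrough",
    domainExecutionRole="arn:aws:iam::123456789012:role/DataZoneExecutionRole",
)

# Keep the domain ID; the later steps reference it.
print(domain["id"])
```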

Create a project for Data Publisher

Next, navigate to the DataZone dashboard and select “Open data portal”. You will then be taken to the portal’s main page. Now it’s time to set up the first project for the Publisher. To do this, click on “Create project”.

Enter a name in the input field, for example, “Publisher”.
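The same project can be created programmatically; a minimal sketch with the boto3 DataZone client is shown below (the domain ID is a placeholder standing in for the value returned when the domain was created):

```python
import boto3

datazone = boto3.client("datazone")

domain_id = "dzd_exampledomainid"  # placeholder: ID from create_domain

# Create the Publisher project inside the domain.
project = datazone.create_project(
    domainIdentifier=domain_id,
    name="Publisher",
    description="Project that publishes assets to the catalog",
)
publisher_project_id = project["id"]
print(publisher_project_id)
```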

The next step is to create an environment.

First, we have to create an environment profile for the Publisher project. Choose “Create environment profile” from the “Environments” tab.

Next, it’s time to create an environment. For this, go to the “Environments” tab and click on “create environment”. Fill in the Name and select the profile you created earlier. Leave the rest of the form blank. This way, DataZone will apply the default naming convention.
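For readers who want to automate this part, the following is a hedged sketch of the same two steps with the boto3 DataZone client. The domain ID, project ID, account ID, and blueprint ID are placeholders, and names such as “PublisherProfile” are illustrative only:

```python
import boto3

datazone = boto3.client("datazone")

domain_id = "dzd_exampledomainid"     # placeholder domain ID
publisher_project_id = "prj_example"  # placeholder Publisher project ID
blueprint_id = "blueprint_example"    # ID of the enabled DefaultDataLake blueprint

# Create an environment profile from the DefaultDataLake blueprint.
profile = datazone.create_environment_profile(
    domainIdentifier=domain_id,
    projectIdentifier=publisher_project_id,
    environmentBlueprintIdentifier=blueprint_id,
    awsAccountId="123456789012",
    awsAccountRegion="us-east-1",
    name="PublisherProfile",
)

# Create the environment from that profile; DataZone provisions the
# underlying resources through CloudFormation, as described below.
environment = datazone.create_environment(
    domainIdentifier=domain_id,
    projectIdentifier=publisher_project_id,
    environmentProfileIdentifier=profile["id"],
    name="PublisherEnvironment",
)
print(environment["id"])
```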

After you initiate a new environment, DataZone will start creating resources for it. Behind the scenes, CloudFormation will deploy the stack. Your new environment will be set up shortly!

Once it’s ready, you’ll be able to view a dashboard for the environment. Now repeat the same steps to create a Consumer project with its own environment profile and environment. You should then see both the Publisher and Consumer projects listed in the portal.

Now you can create a table with some data. In the query editor, make sure your publishing environment is selected and that the publisherdata_pub_db database is chosen.

You can use an example like the one below to create and ingest data into a new table, inventory_table.
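Since the original ingestion example was shown as a screenshot, here is a hypothetical equivalent that creates and populates inventory_table through the Athena API with a CTAS statement; the column names, sample rows, and results location are made up for illustration:

```python
import boto3

athena = boto3.client("athena")

DATABASE = "publisherdata_pub_db"                      # publisher environment database
OUTPUT = "s3://datazone-bucket-12345/athena-results/"  # placeholder results location

# Hypothetical schema and rows; CTAS both creates and populates the table.
ctas = """
CREATE TABLE inventory_table AS
SELECT 1 AS item_id, 'keyboard' AS item_name, 25 AS quantity
UNION ALL SELECT 2, 'monitor', 10
UNION ALL SELECT 3, 'mouse', 40
"""

execution = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT},
)
print(execution["QueryExecutionId"])  # poll get_query_execution to confirm SUCCEEDED
```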

After that, you can see your table in Amazon Athena.

Generate metadata

It’s time to return to your DataZone domain and generate metadata from the table you created in the previous step. As a Publisher, navigate to the DATA tab and select Data Sources from the menu. Here, you’ll see a list of sources from which the system can generate metadata. Click on the first one, “PublisherData-default-datasource”, which is created by default. From the Action dropdown menu, choose Run, and then hit the refresh button. Once the data source run is complete, the assets will be added to the Amazon DataZone inventory.
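The same run can be triggered from code; the sketch below assumes the boto3 DataZone client and placeholder IDs for the domain and the “PublisherData-default-datasource” data source:

```python
import boto3

datazone = boto3.client("datazone")

domain_id = "dzd_exampledomainid"  # placeholder domain ID
data_source_id = "ds_example"      # placeholder ID of the default data source

# Trigger the same run that the console's Action > Run performs; once it
# completes, the table's metadata appears in the DataZone inventory.
run = datazone.start_data_source_run(
    domainIdentifier=domain_id,
    dataSourceIdentifier=data_source_id,
)
print(run.get("status"))
```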

Subscribe to data from the data catalog

From the “Consumer” project, search for the inventory_table asset and then send a request to subscribe to the data.

As the “Publisher”, go to the DATA tab, choose “Incoming requests”, and approve the request.
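Both sides of this exchange can also be scripted. The following is a loose sketch with the boto3 DataZone client; the domain, listing, and project IDs are placeholders, and the request and approval payloads reflect my reading of the DataZone API rather than a verified recipe:

```python
import boto3

datazone = boto3.client("datazone")

domain_id = "dzd_exampledomainid"     # placeholder domain ID
listing_id = "listing_example"        # placeholder listing ID of inventory_table
consumer_project_id = "prj_consumer"  # placeholder Consumer project ID

# Consumer side: request a subscription to the published listing.
request = datazone.create_subscription_request(
    domainIdentifier=domain_id,
    requestReason="Need inventory data for reporting",
    subscribedListings=[{"identifier": listing_id}],
    subscribedPrincipals=[{"project": {"identifier": consumer_project_id}}],
)

# Publisher side: approve the incoming request.
datazone.accept_subscription_request(
    domainIdentifier=domain_id,
    identifier=request["id"],
)
```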

Time to use your data

Now that you have successfully published an asset to the DataZone catalog and subscribed to it, return to DataZone, select the Consumer project, and open Athena. Choose the consumerdata_sub_db database and preview the inventory_table.
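To preview the subscribed table outside the portal, a simple Athena query works as well. The sketch below uses the plain Athena client with a placeholder results location; in practice you would typically query through the Consumer environment, which already holds the grants fulfilled by the subscription:

```python
import time
import boto3

athena = boto3.client("athena")

DATABASE = "consumerdata_sub_db"                       # consumer environment database
OUTPUT = "s3://datazone-bucket-12345/athena-results/"  # placeholder results location

query = athena.start_query_execution(
    QueryString="SELECT * FROM inventory_table LIMIT 10",
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT},
)
qid = query["QueryExecutionId"]

# Poll until the query finishes, then print the rows (first row is the header).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```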

Appendix

Data Lake Blueprint

The Data Lake blueprint defines how Amazon DataZone deploys and configures AWS Glue, AWS Lake Formation, and Amazon Athena when an environment is created for publishing and consuming assets in the catalog.



