Demystifying AWS DataZone
Akhil Makol
SVP, Principal Engineer @ NatWest Group | 40under40 Data Science & Analytics Leader | Data Strategy | Data Engineering | Data Architecture | Driving DevOps, Responsible AI exploration and adoption in Fintech
Amazon DataZone is a data management service for cataloging, discovering, sharing, and governing data across AWS. It lets administrators and data stewards regulate data access with fine-grained controls, so that each user gets the appropriate level of access and context. This makes it simpler for a wide range of user personas, including engineers, data scientists, product managers, analysts, and business users, to access and collaborate on organizational data for decision-making and analytical reporting.
AWS services used in a Data Lake Architecture
Amazon Web Services (AWS) offers a comprehensive ecosystem for building and managing data lakes, drawing on services such as Lake Formation, Glue, and Athena, together with centralized, domain-owned Data Zones. This article aims to guide you through launching your AWS Data Lake, focusing on configuring Lake Formation and establishing a Data Zone.
Data Lake
A Data Lake is a centralized repository allowing storage of all data types at scale, enabling secure data collection from various sources and analysis using different tools for flexible, large-scale data processing.
AWS Lake Formation
AWS Lake Formation simplifies creating a secure data lake in AWS. It integrates with other AWS services, such as Amazon S3 and the AWS Glue Data Catalog, for data management and access control.
AWS Glue
AWS Glue is a managed service for data discovery, preparation, and cataloging that supports analytics and machine learning use cases.
Amazon Athena
Amazon Athena is a serverless query service for analyzing data in Amazon S3 using SQL. It is well suited to quick, ad-hoc analysis and can query a variety of data formats directly.
AWS Data Zone
The concept of a Data Zone within a data lake facilitates efficient data management and governance. It is essentially a segmented area in your data lake designed to categorize data based on its readiness for centralized consumption in a publish-subscribe model.
The key steps for enabling DataZone through the AWS console, and for managing access for producers and subscribers, are outlined below.
Create S3 Bucket for DataZone
Navigate to the S3 console and create a bucket named “datazone-bucket-12345”. We’ll use this bucket for our DataZone area. Don’t forget to enable versioning.
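For readers who prefer scripting over the console, the same step could look roughly like the sketch below. It assumes a boto3 S3 client is passed in; the function name and the region argument are illustrative and not part of the original walkthrough. Bucket names are globally unique, so the example name may need to be changed.

```python
def create_versioned_bucket(s3, bucket_name, region):
    """Create an S3 bucket and enable versioning on it.

    `s3` is assumed to be a boto3 S3 client, for example:
        import boto3
        s3 = boto3.client("s3", region_name=region)
    """
    params = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3.create_bucket(**params)

    # Versioning is off by default, so switch it on explicitly.
    s3.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )
    return params
```

Calling `create_versioned_bucket(s3, "datazone-bucket-12345", "eu-west-2")` would create the bucket used in the rest of this article; the region shown is only an example.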
Create Domain for DataZone
Navigate to Amazon DataZone and create a domain for further consumption.
After that, you will see a dashboard with the domain settings. Go to “Blueprints”, select the DefaultDataLake option, and enable it.
Use the S3 location we created for the Data Lake: “s3://datazone-bucket-12345”
Create a project for Data Publisher
Next, navigate to the DataZone dashboard and select “Open data portal”. You will then be taken to the portal’s main page. Now it’s time to set up the first project for the Publisher. To do this, click on “Create project”.
Enter a name in the input field, for example, “Publisher”.
The next step is to create an environment.
First, we have to create an environment profile for the Publisher project. Choose “Create Environment Profile” from the “Environments” tab.
Next, it’s time to create an environment. For this, go to the “Environments” tab and click on “create environment”. Fill in the Name and select the profile you created earlier. Leave the rest of the form blank. This way, DataZone will apply the default naming convention.
After you initiate a new environment, DataZone will start creating resources for it. Behind the scenes, CloudFormation will deploy the stack. Your new environment will be set up shortly!
Once it’s ready, you’ll be able to view a dashboard for the environment. Now repeat the steps to create a Consumer project with its own environment profile and environment. You should then see both projects in the portal.
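The project, environment profile, and environment steps can also be sketched with the boto3 DataZone client. This is a minimal sketch under assumptions: the function name and the profile name are hypothetical, the environment name follows the “PublisherData” naming that the later steps imply, and the domain ID, DefaultDataLake blueprint ID, account ID, and region must be supplied by you.

```python
def create_project_with_environment(dz, domain_id, project_name,
                                    blueprint_id, account_id, region):
    """Create a project, then an environment profile and environment in it.

    `dz` is assumed to be a boto3 DataZone client, for example:
        import boto3
        dz = boto3.client("datazone")
    """
    project = dz.create_project(domainIdentifier=domain_id, name=project_name)

    profile = dz.create_environment_profile(
        domainIdentifier=domain_id,
        projectIdentifier=project["id"],
        environmentBlueprintIdentifier=blueprint_id,
        awsAccountId=account_id,
        awsAccountRegion=region,
        name=f"{project_name}Profile",  # hypothetical profile name
    )

    # Creating the environment triggers the CloudFormation deployment.
    environment = dz.create_environment(
        domainIdentifier=domain_id,
        projectIdentifier=project["id"],
        environmentProfileIdentifier=profile["id"],
        name=f"{project_name}Data",  # e.g. "PublisherData"
    )
    return {
        "project_id": project["id"],
        "profile_id": profile["id"],
        "environment_id": environment["id"],
    }
```

Calling the function once with “Publisher” and once with “Consumer” mirrors the two passes through the portal described above. Environment creation is asynchronous, so the resources may take a few minutes to appear.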
Now you can create a table with some data. Make sure that your publishing environment is selected and that the database publisherdata_pub_db is selected in the query editor.
You can use this example to ingest data into a new table: inventory_table
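A minimal sketch of such an ingest, assuming a CREATE TABLE AS SELECT statement with a few made-up rows. The column names and values are purely illustrative and are not the author's data. The SQL can be pasted directly into the Athena query editor, or submitted through a boto3 Athena client as shown; the workgroup name is an assumption you would replace with the one created for your environment.

```python
# Hypothetical sample data: three inventory rows created with CTAS.
SAMPLE_TABLE_SQL = """
CREATE TABLE inventory_table AS
SELECT 1 AS item_id, 'widget' AS item_name, 25 AS quantity
UNION ALL
SELECT 2, 'gadget', 40
UNION ALL
SELECT 3, 'gizmo', 12
"""


def run_in_athena(athena, sql, database, workgroup):
    """Submit a query to Athena and return its execution ID.

    `athena` is assumed to be a boto3 Athena client, for example:
        import boto3
        athena = boto3.client("athena")
    The workgroup is expected to have a query result location configured.
    """
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )
    return response["QueryExecutionId"]
```

For example, `run_in_athena(athena, SAMPLE_TABLE_SQL, "publisherdata_pub_db", workgroup)` would create the table in the publisher database. The call returns immediately; the query itself runs asynchronously.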
After that, you can see your table in Amazon Athena.
Generate metadata
It’s time to return to your DataZone and generate metadata from the table you created in the previous step. As a Publisher, navigate to the DATA tab and select DataSources from the menu. Here, you’ll see a list of your sources from which the system can generate metadata. Click on the first one, “PublisherData-default-datasource” which is set by default. Next to the Action dropdown menu, choose Run, and then hit the refresh button. Once the data source run is complete, the assets will be added to the Amazon DataZone inventory.
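The data source run can also be started programmatically. The sketch below assumes a boto3 DataZone client and looks up the data source by the default name mentioned above; the function name is hypothetical, and pagination of the listing is ignored for brevity.

```python
def run_default_data_source(dz, domain_id, project_id,
                            name="PublisherData-default-datasource"):
    """Find a project's data source by name and start a run for it.

    `dz` is assumed to be a boto3 DataZone client, for example:
        import boto3
        dz = boto3.client("datazone")
    Only the first page of results is inspected in this sketch.
    """
    listing = dz.list_data_sources(
        domainIdentifier=domain_id,
        projectIdentifier=project_id,
    )
    for source in listing.get("items", []):
        if source["name"] == name:
            run = dz.start_data_source_run(
                domainIdentifier=domain_id,
                dataSourceIdentifier=source["dataSourceId"],
            )
            return run["id"]
    raise LookupError(f"No data source named {name!r} in project {project_id}")
```

The returned run ID can be used to check on the run later; as in the console, the assets only appear in the inventory once the run has completed.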
Subscribe data from the data catalog
As the “Consumer” project, search for the inventory_table asset and send a subscription request for the data.
As the “Publisher”, go to the DATA tab, choose “Incoming requests”, and approve the request.
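The request and approval flow can be sketched with two small helpers around the boto3 DataZone client. The helper names are hypothetical, and the listing ID of the published inventory_table asset is assumed to be known already (for instance, copied from the portal). In practice the request and the approval would normally be made by different principals, one from each project.

```python
def request_subscription(dz, domain_id, listing_id, consumer_project_id, reason):
    """Ask for access to a catalog listing on behalf of the consumer project.

    `dz` is assumed to be a boto3 DataZone client (boto3.client("datazone")).
    """
    response = dz.create_subscription_request(
        domainIdentifier=domain_id,
        requestReason=reason,
        subscribedListings=[{"identifier": listing_id}],
        subscribedPrincipals=[{"project": {"identifier": consumer_project_id}}],
    )
    return response["id"]


def approve_subscription(dz, domain_id, request_id):
    """Approve a pending subscription request as the publishing project."""
    response = dz.accept_subscription_request(
        domainIdentifier=domain_id,
        identifier=request_id,
    )
    return response["status"]
```

A typical sequence would be `request_id = request_subscription(...)` from the Consumer side, followed by `approve_subscription(dz, domain_id, request_id)` from the Publisher side.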
Time to use your data
Now that you have successfully published an asset to the DataZone catalog and subscribed, it’s time to return to DataZone, select the Consumer project, and log into Athena. Please choose the consumerdata_sub_db database and preview the inventory_table.
Appendix
Data Lake Blueprint
This blueprint outlines how to start and set up AWS Glue, AWS Lake Formation, and Amazon Athena in the Amazon DataZone catalog.
References
AWS re:Invent 2023 - What’s new in Amazon DataZone