Can Amazon DataZone solve modern data sprawl?
As businesses collect more data and workloads continue to shift to the cloud, data management has become a critical part of a successful data and analytics strategy. AWS has long provided cloud capabilities to store, process, query, and visualize data. Now, with the release of DataZone, AWS has added a capability for end users to curate, publish, and consume governed data.
Even well-managed organizations find it difficult to curate and publish disparate cloud-based datasets to the enterprise. With DataZone, data can be presented to data stewards, data consumers, and data producers in a unified view, providing more structure and organization to AWS data assets stored in S3, Redshift, and Glue. Each user persona interacts differently with DataZone:
Data Producers use DataZone to create publishing agreements that define terms and conditions for data publication and subscription. This makes it easy to share data assets without the need for complex IAM administration processes.
Data Stewards use DataZone to manage data assets (e.g., business and technical metadata) from all AWS accounts in a single unified portal. Using the DataZone portal, stewards can review and respond to data consumer access requests. This review and approval process abides by a data publishing agreement and controls access to and usage of data without the need to implement complex IAM roles.
Data Consumers can discover and request access to data assets via the DataZone portal. The portal provides data consumers with a process to submit an access request for review and approval by a data steward. Data consumers can also easily become data producers by publishing their insights back to the DataZone portal for other users to discover and consume.
All user activities are done via the DataZone web portal, a stand-alone application that authenticates users with IAM credentials or through AWS IAM Identity Center. The portal provides users with the core functionality listed below:
Publish Data – Data producers who want to expose data to other users can create a publishing agreement and allow data sets to be accessed and consumed by other users.
Subscribe to Data – Users can search metadata and request access to (i.e., subscribe to) data sets that have been published.
Create Projects – Users can create a project container that allows for the publication and consumption of data. Projects shield users from networking and IAM complexities and establish DataZone user access to Athena, the Glue catalog, S3, and Redshift.
Query Data – With one click from the DataZone portal, users are directed to Athena, where they can work with project data. Once a data subscription is approved, users automatically have access to Athena and the associated data assets.
Using DataZone
Setup and Configuration
As with most AWS services, setting up and configuring DataZone is relatively straightforward. Using the AWS console or CLI, a DataZone domain is created for the associated account. It is important to note that only one domain can be created for a given AWS account, which forces a central location for data governance activities. The domain can use IAM Identity Center to give SSO users access to the end-user DataZone portal, or users can authenticate using IAM credentials. This choice of access controls is established when the domain is created and can be modified afterward. If other AWS accounts will participate in the DataZone domain, those accounts must be associated via the AWS console or CLI.
Once the domain is established, all other administrative access controls for data assets are managed within the DataZone portal.
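For teams that prefer scripting over the console, the domain creation step described above can be sketched in Python. This is a minimal sketch, not a definitive setup: the helper name `build_create_domain_request`, the domain name, and the role ARN are all hypothetical, and the parameter names are assumed to match the `create_domain` operation of the boto3 `datazone` client. The helper only assembles the request so it runs without AWS credentials.

```python
from typing import Any, Dict, Optional


def build_create_domain_request(
    name: str,
    execution_role_arn: str,
    use_sso: bool = True,
    description: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a DataZone CreateDomain call."""
    request: Dict[str, Any] = {
        "name": name,
        "domainExecutionRole": execution_role_arn,
        # IAM_IDC turns on IAM Identity Center sign-in for the portal;
        # DISABLED leaves IAM credentials as the only option.
        "singleSignOn": {"type": "IAM_IDC" if use_sso else "DISABLED"},
    }
    if description:
        request["description"] = description
    return request


# Hypothetical values for illustration only.
params = build_create_domain_request(
    name="analytics-domain",
    execution_role_arn="arn:aws:iam::111122223333:role/ExampleDataZoneExecutionRole",
)
print(params)
```

With boto3 installed and credentials configured, the resulting dictionary could be passed along as `boto3.client("datazone").create_domain(**params)`.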
Data Discovery and Usage
Many AWS customers have multiple accounts to support their business needs. DataZone makes it easy to centralize and manage metadata from all accounts in a single location. With all business and technical metadata centralized, end users can go to a single web application to search, discover, and query data. Data sets are discovered through standard search and filtering tools that let a user quickly find an asset. When a data asset is found, the user can subscribe to it, which kicks off an access request workflow to the assigned data steward. A search screen of a sample “baseball” domain is shown below.
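Alongside the portal search, the same catalog lookup can in principle be scripted. The sketch below assumes the boto3 `datazone` client's `search_listings` operation and its `items` / `assetListing` response shape; the helper `find_listings`, the stand-in client, and the sample listing are hypothetical and exist only so the example runs offline.

```python
from typing import Any, Dict, List


def find_listings(client: Any, domain_id: str, text: str, limit: int = 10) -> List[Dict[str, str]]:
    """Search published listings and return their IDs and names."""
    response = client.search_listings(
        domainIdentifier=domain_id,
        searchText=text,
        maxResults=limit,
    )
    results = []
    for item in response.get("items", []):
        listing = item.get("assetListing")
        if listing:  # skip result types other than asset listings
            results.append({"id": listing["listingId"], "name": listing["name"]})
    return results


class FakeDataZoneClient:
    """Stand-in for boto3.client("datazone") so the sketch runs offline."""

    def search_listings(self, **kwargs: Any) -> Dict[str, Any]:
        # Hypothetical response containing a single made-up listing.
        return {"items": [{"assetListing": {"listingId": "lst-001", "name": "batting_stats"}}]}


matches = find_listings(FakeDataZoneClient(), "dzd_example", "baseball")
print(matches)
```

In a real account, the stand-in would be replaced with `boto3.client("datazone")` and the domain identifier with the one shown in the console.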
Data Publication and Subscription
Data owners can publish data sets directly from the DataZone portal and provide data stewards with the capability to curate the data and manage access requests. Publishing data creates a relationship between the underlying data set, the Glue catalog, the AWS account, the DataZone project, and the DataZone curated metadata. This linkage allows for easy discovery through search and provides data stewards with the ability to control access.
The image below shows a project that has published data sets that are available for data discovery and subscription.
Once a user discovers a dataset of interest, they can request access through a subscription process. This request notifies the responsible data steward, who can then grant or reject access to the data, as shown in the image below.
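Separately from the portal screens, the request-and-approve flow described above can be sketched as two API requests: one from the consumer's project and one from the steward. The parameter names are assumed to match the boto3 `datazone` operations `create_subscription_request`, `accept_subscription_request`, and `reject_subscription_request`; the helper names and every identifier below are hypothetical.

```python
from typing import Any, Dict, Tuple


def build_subscription_request(
    domain_id: str, listing_id: str, project_id: str, reason: str
) -> Dict[str, Any]:
    """Arguments a consumer project would send to request access to a listing."""
    return {
        "domainIdentifier": domain_id,
        "requestReason": reason,
        "subscribedListings": [{"identifier": listing_id}],
        "subscribedPrincipals": [{"project": {"identifier": project_id}}],
    }


def build_steward_decision(
    domain_id: str, request_id: str, approve: bool, comment: str
) -> Tuple[str, Dict[str, Any]]:
    """Pick the client method for the steward's decision and build its arguments."""
    method = "accept_subscription_request" if approve else "reject_subscription_request"
    arguments = {
        "domainIdentifier": domain_id,
        "identifier": request_id,
        "decisionComment": comment,
    }
    return method, arguments


# Hypothetical identifiers for illustration only.
request = build_subscription_request(
    "dzd_example", "lst-001", "prj-analytics", "Need batting data for a season report"
)
method, arguments = build_steward_decision(
    "dzd_example", "req-123", approve=True, comment="Approved for reporting use"
)
print(method, arguments)
```

Given a real client, the steward's decision could then be dispatched with `getattr(client, method)(**arguments)`.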
Data Consumption
When a subscription to data is approved by the data steward for project use, a DataZone user with project access can immediately query the data using Amazon Athena. The data is exposed in a project-specific Glue catalog, and user access is federated from DataZone to Athena. No AWS console access is needed, as users are passed directly from DataZone to Athena. Access is controlled by the DataZone project, and only the project-specific Glue catalog and data assets are available, as shown in the example image below.
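The portal hands users off to the Athena query editor, but the kind of query issued there can also be expressed as a request to Athena's `start_query_execution` operation. This is only a sketch under stated assumptions: the database, table, and workgroup names are hypothetical, and it assumes the project's workgroup already defines a query result location.

```python
from typing import Any, Dict


def build_athena_query(sql: str, database: str, workgroup: str) -> Dict[str, Any]:
    """Arguments for Athena's StartQueryExecution, scoped to one project database."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Assumes the workgroup already defines a query result location.
        "WorkGroup": workgroup,
    }


# Hypothetical database, table, and workgroup names for illustration only.
query = build_athena_query(
    sql="SELECT player, home_runs FROM batting_stats LIMIT 10",
    database="baseball_project_sub_db",
    workgroup="baseball_project_workgroup",
)
print(query)
```

With boto3 and suitable credentials, this could be submitted as `boto3.client("athena").start_query_execution(**query)`. Because the query context names only the project database, the sketch mirrors the project-scoped access described above.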
Conclusion
DataZone can help businesses that use AWS improve data governance processes while also making data discovery and usage easier. The most exciting feature is the ability to grant access to query datasets via Athena directly from DataZone. This takes an access control task away from AWS administrators and moves the responsibility to data stewards, who understand who should access data and how that data should be used. Businesses that are looking to empower their data analysis and data science teams and make data more easily available while improving governance controls will surely find value in DataZone. Once a strong data foundation is in place, organizations will be free to explore new opportunities such as data productization.