Lightweight Implementation of Self-Service Data Sharing Platform on Azure

In a vast number of companies, obtaining access to a dataset requires dozens of emails, meetings, and a great deal of ineffective communication. Bureaucracy, complex systems, and inefficient processes hinder our ability to work effectively. The main factor limiting our ability to meet the requirements of our businesses is the complexity and inefficiency of legacy data governance processes.

To address this gap, the concept of Self-Service Data platforms (aka Data Marketplaces) emerged a few years ago, and some early adopters rushed into implementation. Nowadays this has partially converged into the more modern and fashionable Data Mesh concept, but the problem is still there.

In this article we would like to summarize our experience from the past three years of designing Self-Service Data solutions and cover the technical aspects of implementing such a solution using Azure-native services. We don’t intend to touch on cultural, political, and other non-technical topics.

N.B. Before we begin, please remember that this reflects personal experience and does not necessarily represent the position of my current employer.

Let’s get started.

In principle, Data Sharing is usually performed based on a mutually agreed contract between two parties – a Data Producer and a Data Consumer – which allows the Data Consumer to access a Dataset. We call this contract a Data Sharing Agreement.

Both Data Producers and Data Consumers can be organizations, individuals, applications, or processes within Enterprise boundaries.


Dataset

A Dataset in this context is pretty much anything that can be produced by a Data Producer and consumed by a Data Consumer. It can be a file, a database, a table, a collection in a NoSQL database, etc.

Three major characteristics of the dataset are:

  1. It is controlled and maintained by Data Producer.
  2. It has its own lifecycle (even if it is a very static lifecycle).
  3. It has some rules of access and consumption.

Data Producer

As described, a Data Producer can be an organization, individual, application, or process. The main characteristic of a Data Producer is that it owns the Dataset as well as the lifecycle of the Dataset. Thus, the Data Producer is responsible for maintaining Datasets, including Data Quality, Data Lifecycle, and Data Access Rules.

N.B. An important and commonly complex question that comes to mind is: why would Data Producers share their datasets? While we do have some practical answers, we don’t intend to provide any in this article.

An implementation of a Data Sharing system includes a Data Producer Application that is used for:

  • Registering Datasets within the system.
  • Provisioning Data Classification infrastructure required for Dataset sharing.
  • Providing Data Producer with the required access rights and privileges.
  • Performing other tasks required by policies (like populating appropriate tags, creating budgets, etc.)

Data Producer Application

One of the most important tasks connected with registering data within the Data Sharing System is ensuring that the data is compliant and that only certain Data Consumers will have access to it. Access is controlled by a set of policies, which also define the characteristics of both Datasets and Data Consumers.

To identify the characteristics of the Dataset, we must perform data classification and define its classification in adherence to policies. To avoid misconfiguration, we usually perform such classifications automatically. For that we utilize a Data Classification Infrastructure that:

  • Performs initial data classification and populates Data Access Policies.
  • Deregisters Data Sets and cleans up resources / access when required.

How does it work?

[1] Producer registers the Dataset and provides all the necessary details about it.

Typically, at this step we are talking about the Data Model, Data Model Components, delivery methods and SLAs, frequency of updates, policies and classifications applied, etc.

[2] After the Dataset is registered, the Registration UI calls the Infrastructure Provisioning Scripts.

These scripts are responsible for creating infrastructure for the Producer and granting the Producer the necessary access. They should also be controlled by special policies defined for every Producer. The aim of these policies is to avoid overprovisioning and to control where and which services are available for the Producer. Provisioning scripts can be created using Azure Blueprints, Azure SDKs, Terraform, or any other appropriate tooling.

At this step we typically perform all the additional tasks associated with the Dataset and its infrastructure, such as tagging, creating additional roles and access (for audit, for instance), and any other tasks required by policies.
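For illustration, a minimal provisioning sketch using the Azure SDK for Python might create the Producer's storage account and apply policy-required tags. All names, tags, and the region below are hypothetical placeholders; a real implementation would drive these values from the registration form and the Producer's policies.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Hypothetical identifiers -- replace with values from your environment.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-datasharing-producers"
ACCOUNT_NAME = "stproducerdataset001"

credential = DefaultAzureCredential()
storage_client = StorageManagementClient(credential, SUBSCRIPTION_ID)

# Provision an ADLS Gen2-capable storage account for the Producer's Dataset.
poller = storage_client.storage_accounts.begin_create(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    {
        "location": "westeurope",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,  # hierarchical namespace for ADLS Gen2
        "tags": {                # policy-required tags (illustrative values)
            "dataset": "customer-orders",
            "producer": "sales-ou",
            "classification": "internal",
        },
    },
)
account = poller.result()
print(f"Provisioned storage account: {account.name}")
```

Terraform or Azure Blueprints would express the same resources declaratively; the choice mostly depends on the tooling already standardized in the Enterprise.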


In the most typical scenarios, an Azure Storage Account is created and used for the Data Producer Application. The Storage Account type should be selected based on the use case. You can find more on storage account selection here: https://albero.cloud/service#g=Technology-Focused%20Decision%20Trees&s=Data%20Lake%20on%20Azure

Depending on the sharing contract, this Storage Account may contain some of the following structures:

  • Main Folder – where the one-off or initial Golden Copy resides. It may also be updated from time to time with newer versions produced as part of a full-load process.
  • CDC Folder – used for incremental deliveries based on the SLA defined by the Producer.
  • Sample Dataset Folder – used to provide potential Consumers with test data that helps them identify whether the Dataset might be valuable for them. Such data is either intentionally created or obfuscated / generated, so we can give wide internal access to it.
  • Logging Folder – used for logging and auditing associated with the Dataset.
  • DLQ Folder – used to store and later process parts of the Dataset which were not properly processed / delivered. Basically, it is used for error handling.

In-folder structure heavily depends on the data model and consumption patterns for the particular Dataset.
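As a sketch, the folder layout above could be created in an ADLS Gen2 account right after provisioning; the account URL and container name below are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account and container names.
ACCOUNT_URL = "https://stproducerdataset001.dfs.core.windows.net"
FILESYSTEM = "customer-orders"

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())

# Create the Dataset's container (file system) and the standard folder structure.
fs = service.create_file_system(file_system=FILESYSTEM)
for folder in ["main", "cdc", "sample", "logging", "dlq"]:
    fs.create_directory(folder)
```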

Once the Dataset infrastructure is provisioned, we kick off the creation of the Data Classification Infrastructure for the same Dataset.

[3] Performing Data Classification Tasks

Most of our systems have some notion of data privacy and data classification. Usually, we must scan our Datasets and put controls in place which allow us to automate data compliance checks and update access policies in accordance with the results.

To make compliance checks robust, we create (provision resources for and configure) a separate automated pipeline which works as follows:

a) When data is ingested into ADLS or other services integrated with Event Grid, we trigger either an Azure Function or an Azure Data Factory (Synapse) pipeline using Event Grid native integrations. We prefer Functions when the execution context is simple (a few API calls); when it is more complex (data validation, data quality checks, etc.), we prefer Data Factory.

b) The Azure Function or Data Factory pipeline performs the necessary computations (if any) as well as the required API calls (for instance, to the Atlas API). This compute also acts as the context owner and orchestration engine. Calls can be made against Azure-native tools (like Purview) or 3rd-party tools.

c) Once the processing is finished, the Data Sharing Agreement is updated. In particular, we update Access Policies as well as some of the KPIs for the Dataset (like Data Quality) if required. As a result of the scan, we can either confirm the Access Policy identified by the Data Producer or tighten it if additional classifications are detected. We will discuss the latter later in this article.

Updates and ingestions will trigger the compliance process automatically. They will also trigger consumption processes if any.
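To illustrate steps (a) and (b), here is a minimal sketch of an Azure Function (Python v2 programming model) reacting to an Event Grid blob-created event. The classification helper and the agreement update are hypothetical placeholders for real calls to Purview / Atlas APIs and to the sharing system's own API.

```python
import json
import logging

import azure.functions as func

app = func.FunctionApp()


def classify_blob(blob_url: str) -> list:
    """Hypothetical classification helper: in practice this would call Purview / Atlas
    (or a 3rd-party scanner) and return the classifications detected in the blob."""
    return ["internal"]


@app.event_grid_trigger(arg_name="event")
def on_blob_created(event: func.EventGridEvent):
    # Triggered by Event Grid when a blob lands in the Dataset's storage account.
    payload = event.get_json()
    blob_url = payload.get("url", "")
    logging.info("Classifying newly ingested blob: %s", blob_url)

    classifications = classify_blob(blob_url)

    # Update the Data Sharing Agreement record (hypothetical: a call to the sharing
    # system's API) with the confirmed or tightened classification.
    agreement_update = {"datasetBlob": blob_url, "classifications": classifications}
    logging.info("Agreement update: %s", json.dumps(agreement_update))
```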

[4] Producing Data

Once compliance checks are established, Data Producers can begin ingesting and updating data using the Dataset infrastructure.

An important aspect is that in such a model the Data Producer takes responsibility for maintaining the Dataset lifecycle with certain guarantees, such as frequency of deliveries, formats and quality, SLAs, and others.

Data Consumer

Like the Data Producer, a Data Consumer can be an organization, individual, application, or process which requires access to the Dataset to achieve its business goals. The Data Consumer interacts with the Data Sharing System using a special component which we call the Data Consumer Application. The Data Consumer Application is used for:

  • Exploring Datasets within the system
  • Requesting and provisioning access for Data Consumer in accordance with policies and data access profiles
  • Provisioning and controlling infrastructure for data processing (if required)
  • Performing amendments for Data Sharing contracts
  • Logging and monitoring activities of Data Consumers

Data Consumer Application. How does it work?

[1] Consumer explores datasets using Data Discovery UI.

a. The Consumer explores existing Datasets and Data Sharing Agreements using the Data Discovery UI.

b. Ideally, a sample dataset should be available for Consumers so they can verify whether it makes sense to obtain access to the full Dataset.

By “sample” here we mean that this part of the dataset can be shared within the Enterprise and that its quality is sufficient to make an informed decision about whether it fits the Consumer's requirements. Ideally, such a dataset won’t be generated automatically but intentionally built by experts on the Data Producer side.


[2] Consumer requests access to the dataset.

Organizational unit administrators and application or process owners request access on behalf of the Data Consumer application via the Discovery UI (in this example). This should also be reflected in the internal systems, so technical users or group access should also be bound to the owner.

While applying for access to the dataset, the Data Consumer should also specify which delivery method is required (in case the Dataset is provided via multiple delivery methods). Depending on the delivery method, access can be granted or rejected based on the policies.

[3] Access Checker verifies if access can be granted

a. The Access Checker verifies whether the Consumer can obtain access.

As a Consumer can be virtually anything – an organization, individual, application, or process – we need to clearly define roles, responsibilities, and access management rights. In the case of an individual, it is quite clear – we just adhere to the privileges and taxonomies for that individual. However, in the case of organizations, applications, or processes it may be much more complex.

The main reason is that in such a case we must either delegate part of the access management to Data Consumers or assume that all users on the Data Consumer side will have the same level of access to the dataset. This, in turn, requires us to review our grant policies so we can verify that potential users of these applications and processes, or members of these organizations, have the proper entitlements.

One way to do this is to add an organizational policy which makes Data Consumers (organizational units, application owners, process owners) responsible for checking their users against the access policy. In addition, we can consider implementing monitoring solutions which control the end users accessing our datasets, as well as AI-backed tools responsible for identifying anomalies and deviant usage of the datasets. Such tools should be equipped with means for blocking or revoking access and become part of the information security toolkit.

b. In case verification fails or is unsuccessful, the Security team is notified so the access rules can be modified manually.

Obviously, a user interface is required to perform such actions. The important part here is that access modification should only be possible within the boundaries of Enterprise Policies. There should be no adjustments outside of these boundaries.
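To make this concrete, here is a small sketch of an access check in Python. The classification ladder, field names, and delivery-method labels are illustrative assumptions, not the actual policy model.

```python
from dataclasses import dataclass

# Illustrative classification ladder -- a real system would pull this from the
# enterprise Access Policy store rather than hard-coding it.
CLASSIFICATION_RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass
class AccessRequest:
    consumer_id: str
    consumer_clearance: str      # highest classification the consumer is entitled to
    requested_delivery: str      # e.g. "direct", "copy", "event-stream"


@dataclass
class DatasetPolicy:
    classification: str
    allowed_delivery_methods: list


def check_access(request: AccessRequest, policy: DatasetPolicy):
    """Return (granted, reason); a failed check is escalated to the Security team."""
    if CLASSIFICATION_RANK[request.consumer_clearance] < CLASSIFICATION_RANK[policy.classification]:
        return False, "consumer clearance below dataset classification"
    if request.requested_delivery not in policy.allowed_delivery_methods:
        return False, f"delivery method '{request.requested_delivery}' not permitted by policy"
    return True, "granted"


# Example: a consumer with 'internal' clearance requesting a copy of a 'confidential' dataset.
granted, reason = check_access(
    AccessRequest("app-marketing-01", "internal", "copy"),
    DatasetPolicy("confidential", ["direct", "copy"]),
)
print(granted, reason)  # False, "consumer clearance below dataset classification"
```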

[4] Granting access to the Dataset

a. Once access is approved, the Access Checker calls the Access Provider to kick off the process of granting privileges.

Depending on where the dataset resides, this process may be quite complex, as it may include providing RBAC access via Azure Active Directory, creating or managing ACLs, providing access to services using their own authorization mechanisms, etc. (see the sketch after this list).

b. There should also be proper orchestration for the process, as well as monitoring, logging, etc. In case of a process failure, a notification mechanism should be in place so that unprocessed grants can be handled. The Access Provider then registers the Data Sharing Agreement, where delivery methods, SLAs, and other information are listed.

The Data Sharing Agreement is registered on a per-Consumer basis. Some elements of the Data Sharing Agreement are described later in this document.

c. In case the delivery method requires data copy, movement, or transformation, additional infrastructure may be provisioned. We will cover this in more detail in the next section.
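A minimal sketch of the RBAC part of this step, assuming the Azure SDK for Python (azure-mgmt-authorization), is shown below. It assigns the built-in Storage Blob Data Reader role to the Consumer's principal at the Dataset's storage account scope; all identifiers are placeholders, and the parameter shape may differ slightly between SDK versions.

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

# Hypothetical identifiers -- replace with real values from the Data Sharing Agreement.
SUBSCRIPTION_ID = "<subscription-id>"
STORAGE_SCOPE = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-datasharing-producers"
    "/providers/Microsoft.Storage/storageAccounts/stproducerdataset001"
)
# Built-in "Storage Blob Data Reader" role definition ID.
ROLE_DEFINITION_ID = (
    f"{STORAGE_SCOPE}/providers/Microsoft.Authorization/roleDefinitions/"
    "2a2b9908-6ea1-4ae2-8e65-a410df84e7d1"
)
CONSUMER_PRINCIPAL_ID = "<object-id-of-consumer-group-or-service-principal>"

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Grant read access to the Dataset's storage account at the agreed scope.
assignment = auth_client.role_assignments.create(
    scope=STORAGE_SCOPE,
    role_assignment_name=str(uuid.uuid4()),
    parameters={
        "role_definition_id": ROLE_DEFINITION_ID,
        "principal_id": CONSUMER_PRINCIPAL_ID,
    },
)
print(f"Created role assignment: {assignment.name}")
```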

Data Sharing Agreement

Data sharing is usually based on a mutual agreement or contract. Such a contract is implemented in the form of a model which we call the Data Sharing Agreement. It states who the producers and consumers are, the delivery methods, SLAs, access patterns, classifications, and other important information.

When the Data Producer and Data Consumer are applications, the Data Sharing Agreement typically resides within the database which serves data between them. Both Data Producer and Data Consumer can be part of the same application.

All possible delivery methods, including whether the Dataset can be copied to the Data Consumer's infrastructure, are described in the Data Sharing Agreement. The delivery methods actually requested by the Data Consumer are also described in the same document. The Data Sharing Agreement defines which methods and which SLAs are being used.

These delivery methods are defined at the Dataset level and can be anything from the list of tools supported by the Enterprise. Delivery methods can include sharing a folder, using Azure Data Share for data copy or in-place sharing, using Azure Data Factory pipelines to perform transformations and additional operations while copying data, sending messages via Azure Event Hubs or Azure Service Bus, and much more.
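For illustration only, a Data Sharing Agreement record could look like the following (sketched here as a Python dictionary; the field names are hypothetical, not a prescribed schema).

```python
# Hypothetical Data Sharing Agreement record; in practice this would live in the
# sharing system's database and be validated against Access and Infrastructure Policies.
data_sharing_agreement = {
    "agreement_id": "dsa-2023-00042",
    "dataset": "customer-orders",
    "producer": "sales-ou",
    "consumer": "app-marketing-01",
    "classification": "confidential",
    "delivery_methods": [
        {
            "type": "adls-folder-share",  # direct, in-place access
            "endpoint": "https://stproducerdataset001.dfs.core.windows.net/customer-orders/main",
            "access": "Storage Blob Data Reader",
        },
        {
            "type": "adf-copy-pipeline",  # copy into consumer infrastructure
            "sharing_processor": "adf-datasharing-prod",
            "trigger": "daily-02:00-utc",
        },
    ],
    "guarantees": {"delivery_frequency": "daily", "format": "parquet", "sla_hours": 4},
}
```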


Each delivery method is a combination of:

  • Dataset Endpoint – such an endpoint may be an ADLS container, an API, a JDBC connection, etc. The Data Producer is responsible for maintaining the dataset lifecycle per endpoint with certain guarantees.
  • Access to the Dataset Endpoint (some datasets may be delivered in a variety of ways).
  • Sharing Processor – a process, an application, etc. which performs copy and processing activities in accordance with the Data Sharing Agreement. Some Azure Data Services can act as the Sharing Processor in this situation – like Azure Data Factory, Azure Data Share, etc.

When Data Consumers are allowed to copy data into their own infrastructure, the proper delivery method should be provisioned and launched. This includes launching the Sharing Processor (as per the Data Sharing Agreement), providing the Sharing Processor with access to the Dataset Endpoint, and configuring the delivery method parameters (such as the pipeline trigger) in accordance with the SLAs, delivery frequency, etc. defined by the Data Producer.
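As a sketch, with Azure Data Factory acting as the Sharing Processor, the delivery could be started (or scheduled) through the Azure SDK for Python; the factory, pipeline, and parameter names below are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Hypothetical identifiers -- the factory and pipeline would be created as part of
# the delivery-method provisioning, per the Data Sharing Agreement.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-datasharing-consumers"
FACTORY_NAME = "adf-datasharing-prod"
PIPELINE_NAME = "copy-customer-orders-to-consumer"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Kick off an on-demand run of the Sharing Processor pipeline; scheduled triggers
# would normally be configured instead, matching the SLA in the agreement.
run = adf_client.pipelines.create_run(
    RESOURCE_GROUP,
    FACTORY_NAME,
    PIPELINE_NAME,
    parameters={"agreementId": "dsa-2023-00042"},
)
print(f"Started pipeline run: {run.run_id}")
```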

Access to the dataset is granted to users / applications in accordance with the Data Sharing Agreement and Access Policy.

This can be achieved either via direct access or via one or more Delivery Methods as defined and described in the Data Sharing Agreement.

Once access is granted, the Consumer can start working with the dataset.


Revoking and Adjusting Access

Of course, we should take into consideration the possibility that at a certain moment in time the Data Consumer's access to the Dataset will need to be revoked.

First, the revoke procedure should be triggered. This can be a manual operation, an automated operation based on a 3rd-party process (e.g., an employee leaving), or an automated operation based on a change in the Access Policy. All three require a process and an interface (UI or API) on the side of the Data Sharing system, as well as means for logging, alerting, and exception management.

Revoking Direct Access

In the case of direct access this should be a relatively simple procedure. However, we should consider that an individual Data Consumer with direct access could be registered not only within AAD but may also have user records within the services themselves (especially in the case of databases). Since we have the Data Sharing Agreement on record, we just need special scripts for revoking direct access on a per-service basis. In most cases, revoking access via AAD should be enough.
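A minimal sketch of revoking AAD-based direct access, assuming azure-mgmt-authorization, removes the Consumer principal's role assignments at the Dataset scope. The identifiers are placeholders; per-service revocation scripts would still be needed where services keep their own user records.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

# Hypothetical identifiers; the scope and consumer principal come from the Data Sharing Agreement.
SUBSCRIPTION_ID = "<subscription-id>"
STORAGE_SCOPE = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-datasharing-producers"
    "/providers/Microsoft.Storage/storageAccounts/stproducerdataset001"
)
CONSUMER_PRINCIPAL_ID = "<object-id-of-consumer-group-or-service-principal>"

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Remove every role assignment the consumer principal holds on the Dataset's storage account.
for assignment in auth_client.role_assignments.list_for_scope(STORAGE_SCOPE):
    if assignment.principal_id == CONSUMER_PRINCIPAL_ID:
        auth_client.role_assignments.delete(STORAGE_SCOPE, assignment.name)
        print(f"Revoked role assignment: {assignment.name}")
```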

Revoking Access to Copied Dataset

Revoking access to copied dataset is a bit more challenging.

The first step is to revoke the Data Consumer's access to the Dataset Endpoints. This can be done as in the case of direct access. The second step is to revoke the Data Consumer's access to the part of the Data Consumer infrastructure where this dataset is located. Once this is done and confirmed with the Data Consumer, the whole part of the infrastructure where the copy of the dataset resides can be deleted.

An important implication is that, in such a system, the Data Consumer infrastructure holding the copy of the dataset should be administered by the owners of the Data Sharing system itself. This may affect the layout of subscriptions and resource groups.

Partial Revoke

This is the most complex situation. It may occur when access to a column, table, or another object within the same dataset is revoked. In practice, because of the complexity of such a procedure, the simplest way to achieve this behaviour is to revoke access to the dataset and afterwards provision a new Data Sharing Agreement with new Delivery Methods for the reduced dataset.

Data Sharing Entities Model

So far, we have described the main algorithms for the Data Producer and Data Consumer as well as the data structures through which both communicate with each other in an automated fashion. Now, in this last part of the article, we would like to propose a data model for such a system.


As described above, we have a few main components which allow us to build a data sharing system. Two roles:

1. Data Producer

2. Data Consumer

One type of artifact with its delivery mechanisms and Guarantees:

3. Dataset

4. Dataset Guarantees

5. Delivery Method

And a way to describe the usage of the Dataset:

6. Data Sharing Agreement

In addition, both the Data Sharing Agreement and the Dataset should be checked against the Infrastructure Policy (which defines which services can be used for producing and consuming the Dataset). Also, the Data Sharing Agreement should always be verified against the Access Policy.

Many aspects will depend on the granularity and acceptable complexity of the implementation. For instance, for Data Objects we can go as far as individual fields or data elements within a document (if required), so that we can assess access rights based on this more granular model.
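As a sketch, these entities could be modelled with Python dataclasses; the attribute names are illustrative, not a finalized schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetGuarantees:
    delivery_frequency: str            # e.g. "daily"
    data_format: str                   # e.g. "parquet"
    sla_hours: int                     # maximum delivery delay


@dataclass
class DeliveryMethod:
    endpoint: str                      # ADLS container, API, JDBC connection, ...
    access_mode: str                   # e.g. "direct", "copy"
    sharing_processor: Optional[str]   # e.g. "azure-data-factory", "azure-data-share"


@dataclass
class Dataset:
    name: str
    producer: str                      # Data Producer (organization, individual, app, process)
    classification: str
    guarantees: DatasetGuarantees
    delivery_methods: List[DeliveryMethod] = field(default_factory=list)


@dataclass
class DataSharingAgreement:
    agreement_id: str
    dataset: Dataset
    consumer: str                      # Data Consumer
    chosen_delivery: DeliveryMethod
    access_policy_ref: str             # Access Policy the agreement was verified against
    infrastructure_policy_ref: str     # Infrastructure Policy the dataset / agreement were checked against
```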

Hopefully this helps to clarify how a self-service data engine can be implemented on Azure. Please note that the same steps and considerations can also be applied to 3rd-party implementations and products.

What is next

We are currently working on an implementation of the algorithms and models described in this article and will publish the results in a public repository once we have them. Stay tuned!

