Lightweight Implementation of Self-Service Data Sharing Platform on Azure

In a vast number of companies, obtaining access to a dataset requires dozens of emails, meetings, and a great deal of ineffective communication. Bureaucracy, complex systems, and inefficient processes hinder our ability to work effectively. The main factor limiting our ability to meet the requirements of our businesses is the complexity and inefficiency of legacy data governance processes.

To address this gap, the concept of Self-Service Data platforms (aka Data Marketplaces) emerged a few years ago, and some early adopters rushed into implementation. Nowadays this has partially converged into the more modern and fashionable Data Mesh concept, but the problem is still there.

In this article we would like to summarize our experience from the past three years of designing Self-Service Data solutions and cover the technical aspects of implementing such a solution using Azure-native services. We don’t intend to touch on cultural, political, and other non-technical topics.

N.B. Before we begin, please remember that this reflects personal experience and does not necessarily represent the position of my current employer.

Let’s get started.

In principle, Data Sharing is usually performed based on a mutually agreed contract between two parties – a Data Producer and a Data Consumer – which allows the Data Consumer to access a Dataset. We call this contract a Data Sharing Agreement.

Both Data Producers and Data Consumers can be organizations, individuals, applications, or processes within Enterprise boundaries.


Dataset

A Dataset in this context is pretty much anything that can be produced by a Data Producer and consumed by a Data Consumer. It can be a file, a database, a table, a collection in a NoSQL database, etc.

Three major characteristics of the dataset are:

  1. It is controlled and maintained by Data Producer.
  2. It has its own lifecycle (even if it is a very static lifecycle).
  3. It has some rules of access and consumption.

Data Producer

As described, a Data Producer can be an organization, individual, application, or process. The main characteristic of a Data Producer is that it owns the Dataset as well as the lifecycle of the Dataset. Thus, the Data Producer is responsible for maintaining Datasets, including Data Quality, Data Lifecycle, and Data Access Rules.

N.B. An important and commonly complex question that comes to mind is: why would Data Producers share their datasets? While we do have some practical answers, we don’t intend to provide any in this article.

An implementation of a Data Sharing system includes a Data Producer Application that is used for:

  • Registering Datasets within the system.
  • Provisioning Data Classification infrastructure required for Dataset sharing.
  • Providing Data Producer with the required access rights and privileges.
  • Performing other tasks required by policies (like populating appropriate tags, creating budgets, etc.)

Data Producer Application

One of the most important tasks connected with registering data within the Data Sharing System is ensuring that the data is compliant and that only certain Data Consumers will have access to it. Access is controlled by a set of policies, which also define the characteristics of both Datasets and Data Consumers.

To identify the characteristics of the Dataset, we must perform data classification and define its classification in adherence to policies. To avoid misconfiguration, we usually perform such classifications automatically. For that we utilize a Data Classification Infrastructure that:

  • Performs initial data classification and populates Data Access Policies.
  • Deregisters Data Sets and cleans up resources / access when required.

How does it work?

[1] Producer registers the Dataset and provides all the necessary details about it.

Typically, at this step we are talking about the Data Model, Data Model Components, delivery methods and SLAs, frequency of updates, policies and classifications applied, etc.

[2] After the Dataset is registered, the Registration UI calls the Infrastructure Provisioning Scripts.

These scripts are responsible for creating infrastructure for the Producer and granting the Producer the necessary access. They should also be controlled by special policies defined for every Producer. The aim of these policies is to avoid overprovisioning and to control where and which services are available for the Producer. Provisioning scripts can be created using Azure Blueprints, Azure SDKs, Terraform, or any other appropriate tooling.

At this step we typically perform all the additional tasks associated with the Dataset and its infrastructure, such as tagging, creating additional roles and access (for audit, for instance), and any other tasks required by policies.
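For illustration, a minimal provisioning sketch using the Azure SDK for Python might create the Producer's storage account and apply policy-required tags. All names, tags, and the region below are hypothetical placeholders; a real implementation would drive these values from the registration form and the Producer's policies.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Hypothetical identifiers -- replace with values from your environment.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-datasharing-producers"
ACCOUNT_NAME = "stproducerdataset001"

credential = DefaultAzureCredential()
storage_client = StorageManagementClient(credential, SUBSCRIPTION_ID)

# Provision an ADLS Gen2-capable storage account for the Producer's Dataset.
poller = storage_client.storage_accounts.begin_create(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    {
        "location": "westeurope",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,  # hierarchical namespace for ADLS Gen2
        "tags": {                # policy-required tags (illustrative values)
            "dataset": "customer-orders",
            "producer": "sales-ou",
            "classification": "internal",
        },
    },
)
account = poller.result()
print(f"Provisioned storage account: {account.name}")
```

Terraform or Azure Blueprints would express the same resources declaratively; the choice mostly depends on the tooling already standardized in the Enterprise.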


In the most typical scenarios, an Azure Storage Account is created and used for the Data Producer Application. The Storage Account type should be selected based on the use case. You can find more on storage account selection here: https://albero.cloud/service#g=Technology-Focused%20Decision%20Trees&s=Data%20Lake%20on%20Azure

Depending on the sharing contract, this Storage Account may contain some of the following structures:

  • Main Folder – where the one-off or initial Golden Copy resides. It may also be updated from time to time with newer versions produced as part of a full-load process.
  • CDC Folder – used for incremental deliveries based on the SLA defined by the Producer.
  • Sample Dataset Folder – used to provide potential Consumers with test data that helps them identify whether the Dataset might be valuable for them. Such data is either intentionally created or obfuscated / generated, so we can give wide internal access to it.
  • Logging Folder – used for logging and auditing associated with the Dataset.
  • DLQ Folder – used to store and later process parts of the Dataset which were not properly processed / delivered. Basically, it is used for error handling.

In-folder structure heavily depends on the data model and consumption patterns for the particular Dataset.
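As a sketch, the folder layout above could be created in an ADLS Gen2 account right after provisioning; the account URL and container name below are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account and container names.
ACCOUNT_URL = "https://stproducerdataset001.dfs.core.windows.net"
FILESYSTEM = "customer-orders"

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())

# Create the Dataset's container (file system) and the standard folder structure.
fs = service.create_file_system(file_system=FILESYSTEM)
for folder in ["main", "cdc", "sample", "logging", "dlq"]:
    fs.create_directory(folder)
```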

Once the Dataset infrastructure is provisioned, we kick off the creation of the Data Classification Infrastructure for the same Dataset.

[3] Performing Data Classification Tasks

Most of our systems have some notion of data privacy and data classification. Usually, we must scan our Datasets and put controls in place which allow us to automate data compliance checks and update access policies in accordance with the results.

To make compliance checks robust, we create (provision resources for and configure) a separate automated pipeline which works as follows:

a) When data is ingested into ADLS or other services integrated with Event Grid, we trigger either an Azure Function or an Azure Data Factory (Synapse) pipeline using Event Grid native integrations. We prefer Functions when the execution context is simple (a few API calls); when it is more complex (data validation, data quality checks, etc.), we prefer Data Factory.

b) The Azure Function or Data Factory pipeline performs the necessary computations (if any) as well as the required API calls (for instance, to the Atlas API). This compute also acts as the context owner and orchestration engine. Calls can be made against Azure-native tools (like Purview) or 3rd-party tools.

c) Once the processing is finished, the Data Sharing Agreement is updated. In particular, we update Access Policies as well as some of the KPIs for the Dataset (like Data Quality) if required. As a result of the scan, we can either confirm the Access Policy identified by the Data Producer or tighten it if additional classifications are detected. We will discuss the latter later in this article.

Updates and ingestions will trigger the compliance process automatically. They will also trigger consumption processes if any.
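To illustrate steps (a) and (b), here is a minimal sketch of an Azure Function (Python v2 programming model) reacting to an Event Grid blob-created event. The classification helper and the agreement update are hypothetical placeholders for real calls to Purview / Atlas APIs and to the sharing system's own API.

```python
import json
import logging

import azure.functions as func

app = func.FunctionApp()


def classify_blob(blob_url: str) -> list:
    """Hypothetical classification helper: in practice this would call Purview / Atlas
    (or a 3rd-party scanner) and return the classifications detected in the blob."""
    return ["internal"]


@app.event_grid_trigger(arg_name="event")
def on_blob_created(event: func.EventGridEvent):
    # Triggered by Event Grid when a blob lands in the Dataset's storage account.
    payload = event.get_json()
    blob_url = payload.get("url", "")
    logging.info("Classifying newly ingested blob: %s", blob_url)

    classifications = classify_blob(blob_url)

    # Update the Data Sharing Agreement record (hypothetical: a call to the sharing
    # system's API) with the confirmed or tightened classification.
    agreement_update = {"datasetBlob": blob_url, "classifications": classifications}
    logging.info("Agreement update: %s", json.dumps(agreement_update))
```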

[4] Producing Data

Once compliance checks are established, Data Producers can begin ingesting and updating data using the Dataset infrastructure.

An important aspect is that in such a model the Data Producer takes responsibility for maintaining the Dataset lifecycle with certain guarantees, such as frequency of deliveries, formats and quality, SLAs, and others.

Data Consumer

Like the Data Producer, a Data Consumer can be an organization, individual, application, or process which requires access to the Dataset to achieve its business goals. The Data Consumer interacts with the Data Sharing System using a special component which we call the Data Consumer Application. The Data Consumer Application is used for:

  • Exploring Datasets within the system
  • Requesting and provisioning access for Data Consumer in accordance with policies and data access profiles
  • Provisioning and controlling infrastructure for data processing (if required)
  • Performing amendments for Data Sharing contracts
  • Logging and monitoring activities of Data Consumers

Data Consumer Application. How does it work?

[1] Consumer explores datasets using Data Discovery UI.

a. The Consumer explores existing Datasets and Data Sharing Agreements using the Data Discovery UI.

b. Ideally, a sample dataset should be available for Consumers so they can verify whether it makes sense to obtain access to the full Dataset.

By “sample” here we mean that this part of the dataset can be shared within the Enterprise and that its quality is sufficient to make an informed decision about whether it fits the Consumer's requirements. Ideally, such a dataset won’t be generated automatically but intentionally built by experts on the Data Producer side.


[2] Consumer requests access to the dataset.

Organizational unit administrators and application or process owners request access on behalf of the Data Consumer application via the Discovery UI (in this example). This should also be reflected in the internal systems, so technical users or group access should also be bound to the owner.

While applying for access to the dataset, the Data Consumer should also specify which delivery method is required (in case the Dataset is provided via multiple delivery methods). Depending on the delivery method, access can be granted or rejected based on the policies.

[3] Access Checker verifies if access can be granted

a. The Access Checker verifies whether the Consumer can obtain access.

As a Consumer can be virtually anything – an organization, individual, application, or process – we need to clearly define roles, responsibilities, and access management rights. In the case of an individual, it is quite clear – we just adhere to the privileges and taxonomies for that individual. However, in the case of organizations, applications, or processes it may be much more complex.

The main reason is that in such a case we must either delegate part of the access management to Data Consumers or assume that all users on the Data Consumer side will have the same level of access to the dataset. This, in turn, requires us to review our grant policies so we can verify that potential users of these applications and processes, or members of these organizations, have the proper entitlements.

One way to do this is to add an organizational policy which makes Data Consumers (organizational units, application owners, process owners) responsible for checking their users against the access policy. In addition, we can consider implementing monitoring solutions which control the end users accessing our datasets, as well as AI-backed tools responsible for identifying anomalies and deviant usage of the datasets. Such tools should be equipped with means for blocking or revoking access and become part of the information security toolkit.

b. In case verification fails or is unsuccessful, the Security team is notified so the access rules can be modified manually.

Obviously, a user interface is required to perform such actions. The important part here is that access modification should only be possible within the boundaries of Enterprise Policies. There should be no adjustments outside of these boundaries.
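To make this concrete, here is a small sketch of an access check in Python. The classification ladder, field names, and delivery-method labels are illustrative assumptions, not the actual policy model.

```python
from dataclasses import dataclass

# Illustrative classification ladder -- a real system would pull this from the
# enterprise Access Policy store rather than hard-coding it.
CLASSIFICATION_RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass
class AccessRequest:
    consumer_id: str
    consumer_clearance: str      # highest classification the consumer is entitled to
    requested_delivery: str      # e.g. "direct", "copy", "event-stream"


@dataclass
class DatasetPolicy:
    classification: str
    allowed_delivery_methods: list


def check_access(request: AccessRequest, policy: DatasetPolicy):
    """Return (granted, reason); a failed check is escalated to the Security team."""
    if CLASSIFICATION_RANK[request.consumer_clearance] < CLASSIFICATION_RANK[policy.classification]:
        return False, "consumer clearance below dataset classification"
    if request.requested_delivery not in policy.allowed_delivery_methods:
        return False, f"delivery method '{request.requested_delivery}' not permitted by policy"
    return True, "granted"


# Example: a consumer with 'internal' clearance requesting a copy of a 'confidential' dataset.
granted, reason = check_access(
    AccessRequest("app-marketing-01", "internal", "copy"),
    DatasetPolicy("confidential", ["direct", "copy"]),
)
print(granted, reason)  # False, "consumer clearance below dataset classification"
```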

[4] Granting access to the Dataset

a. Once access is approved, the Access Checker calls the Access Provider to kick off the process of granting privileges.

Depending on where the dataset resides, this process may be quite complex, as it may include providing RBAC access via Azure Active Directory, creating or managing ACLs, providing access to services using their own authorization mechanisms, etc. (see the sketch after this list).

b. There should also be proper orchestration for the process, as well as monitoring, logging, etc. In case of a process failure, a notification mechanism should be in place so that unprocessed grants can be handled. The Access Provider then registers the Data Sharing Agreement, where delivery methods, SLAs, and other information are listed.

The Data Sharing Agreement is registered on a per-Consumer basis. Some elements of the Data Sharing Agreement are described later in this document.

c. In case the delivery method requires data copy, movement, or transformation, additional infrastructure may be provisioned. We will cover this in more detail in the next section.
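A minimal sketch of the RBAC part of this step, assuming the Azure SDK for Python (azure-mgmt-authorization), is shown below. It assigns the built-in Storage Blob Data Reader role to the Consumer's principal at the Dataset's storage account scope; all identifiers are placeholders, and the parameter shape may differ slightly between SDK versions.

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

# Hypothetical identifiers -- replace with real values from the Data Sharing Agreement.
SUBSCRIPTION_ID = "<subscription-id>"
STORAGE_SCOPE = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-datasharing-producers"
    "/providers/Microsoft.Storage/storageAccounts/stproducerdataset001"
)
# Built-in "Storage Blob Data Reader" role definition ID.
ROLE_DEFINITION_ID = (
    f"{STORAGE_SCOPE}/providers/Microsoft.Authorization/roleDefinitions/"
    "2a2b9908-6ea1-4ae2-8e65-a410df84e7d1"
)
CONSUMER_PRINCIPAL_ID = "<object-id-of-consumer-group-or-service-principal>"

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Grant read access to the Dataset's storage account at the agreed scope.
assignment = auth_client.role_assignments.create(
    scope=STORAGE_SCOPE,
    role_assignment_name=str(uuid.uuid4()),
    parameters={
        "role_definition_id": ROLE_DEFINITION_ID,
        "principal_id": CONSUMER_PRINCIPAL_ID,
    },
)
print(f"Created role assignment: {assignment.name}")
```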

Data Sharing Agreement

Data sharing is usually based on a mutual agreement or contract. Such a contract is implemented in the form of a model which we call the Data Sharing Agreement. It states who the producers and consumers are, the delivery methods, SLAs, access patterns, classifications, and other important information.

When the Data Producer and Data Consumer are applications, the Data Sharing Agreement typically resides within the database which serves data between them. Both Data Producer and Data Consumer can be part of the same application.

All possible delivery methods, including whether the Dataset can be copied to the Data Consumer's infrastructure, are described in the Data Sharing Agreement. The delivery methods actually requested by the Data Consumer are also described in the same document. The Data Sharing Agreement defines which methods and which SLAs are being used.

These delivery methods are defined at the Dataset level and can be anything from the list of tools supported by the Enterprise. Delivery methods can include sharing a folder, using Azure Data Share for data copy or in-place sharing, using Azure Data Factory pipelines to perform transformations and additional operations while copying data, sending messages via Azure Event Hubs or Azure Service Bus, and much more.
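For illustration only, a Data Sharing Agreement record could look like the following (sketched here as a Python dictionary; the field names are hypothetical, not a prescribed schema).

```python
# Hypothetical Data Sharing Agreement record; in practice this would live in the
# sharing system's database and be validated against Access and Infrastructure Policies.
data_sharing_agreement = {
    "agreement_id": "dsa-2023-00042",
    "dataset": "customer-orders",
    "producer": "sales-ou",
    "consumer": "app-marketing-01",
    "classification": "confidential",
    "delivery_methods": [
        {
            "type": "adls-folder-share",  # direct, in-place access
            "endpoint": "https://stproducerdataset001.dfs.core.windows.net/customer-orders/main",
            "access": "Storage Blob Data Reader",
        },
        {
            "type": "adf-copy-pipeline",  # copy into consumer infrastructure
            "sharing_processor": "adf-datasharing-prod",
            "trigger": "daily-02:00-utc",
        },
    ],
    "guarantees": {"delivery_frequency": "daily", "format": "parquet", "sla_hours": 4},
}
```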


Each delivery method is a combination of:

  • Dataset Endpoint – such an endpoint may be an ADLS container, an API, a JDBC connection, etc. The Data Producer is responsible for maintaining the dataset lifecycle per endpoint with certain guarantees.
  • Access to the Dataset Endpoint (some datasets may be delivered in a variety of ways).
  • Sharing Processor – a process, an application, etc. which performs copy and processing activities in accordance with the Data Sharing Agreement. Some Azure Data Services can act as the Sharing Processor in this situation – like Azure Data Factory, Azure Data Share, etc.

When Data Consumers are allowed to copy data into their own infrastructure, the proper delivery method should be provisioned and launched. This includes launching the Sharing Processor (as per the Data Sharing Agreement), providing the Sharing Processor with access to the Dataset Endpoint, and configuring the delivery method parameters (such as the pipeline trigger) in accordance with the SLAs, delivery frequency, etc. defined by the Data Producer.
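As a sketch, with Azure Data Factory acting as the Sharing Processor, the delivery could be started (or scheduled) through the Azure SDK for Python; the factory, pipeline, and parameter names below are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Hypothetical identifiers -- the factory and pipeline would be created as part of
# the delivery-method provisioning, per the Data Sharing Agreement.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-datasharing-consumers"
FACTORY_NAME = "adf-datasharing-prod"
PIPELINE_NAME = "copy-customer-orders-to-consumer"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Kick off an on-demand run of the Sharing Processor pipeline; scheduled triggers
# would normally be configured instead, matching the SLA in the agreement.
run = adf_client.pipelines.create_run(
    RESOURCE_GROUP,
    FACTORY_NAME,
    PIPELINE_NAME,
    parameters={"agreementId": "dsa-2023-00042"},
)
print(f"Started pipeline run: {run.run_id}")
```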

Access to the dataset is granted to users / applications in accordance with the Data Sharing Agreement and Access Policy.

This can be achieved either via direct access or via one or more Delivery Methods as defined and described in the Data Sharing Agreement.

Once access is granted, the Consumer can start working with the dataset.


Revoking and Adjusting Access

Of course, we should take into consideration the possibility that at a certain moment in time the Data Consumer's access to the Dataset will need to be revoked.

First, the revoke procedure should be triggered. This can be a manual operation, an automated operation based on a 3rd-party process (e.g., an employee leaving), or an automated operation based on a change in the Access Policy. All three require a process and an interface (UI or API) on the side of the Data Sharing system, as well as means for logging, alerting, and exception management.

Revoking Direct Access

In the case of direct access this should be a relatively simple procedure. However, we should consider that an individual Data Consumer with direct access could be registered not only within AAD but may also have user records within the services themselves (especially in the case of databases). Since we have the Data Sharing Agreement on record, we just need special scripts for revoking direct access on a per-service basis. In most cases, revoking access via AAD should be enough.
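A minimal sketch of revoking AAD-based direct access, assuming azure-mgmt-authorization, removes the Consumer principal's role assignments at the Dataset scope. The identifiers are placeholders; per-service revocation scripts would still be needed where services keep their own user records.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

# Hypothetical identifiers; the scope and consumer principal come from the Data Sharing Agreement.
SUBSCRIPTION_ID = "<subscription-id>"
STORAGE_SCOPE = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-datasharing-producers"
    "/providers/Microsoft.Storage/storageAccounts/stproducerdataset001"
)
CONSUMER_PRINCIPAL_ID = "<object-id-of-consumer-group-or-service-principal>"

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Remove every role assignment the consumer principal holds on the Dataset's storage account.
for assignment in auth_client.role_assignments.list_for_scope(STORAGE_SCOPE):
    if assignment.principal_id == CONSUMER_PRINCIPAL_ID:
        auth_client.role_assignments.delete(STORAGE_SCOPE, assignment.name)
        print(f"Revoked role assignment: {assignment.name}")
```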

Revoking Access to Copied Dataset

Revoking access to copied dataset is a bit more challenging.

The first step is to revoke the Data Consumer's access to the Dataset Endpoints. This can be done as in the case of direct access. The second step is to revoke the Data Consumer's access to the part of the Data Consumer infrastructure where this dataset is located. Once this is done and confirmed with the Data Consumer, the whole part of the infrastructure where the copy of the dataset resides can be deleted.

An important implication is that, in such a system, the Data Consumer infrastructure holding the copy of the dataset should be administered by the owners of the Data Sharing system itself. This may affect the layout of subscriptions and resource groups.

Partial Revoke

This is the most complex situation. It may occur when access to a column, table, or another object within the same dataset is revoked. In practice, because of the complexity of such a procedure, the simplest way to achieve this behaviour is to revoke access to the dataset and afterwards provision a new Data Sharing Agreement with new Delivery Methods for the reduced dataset.

Data Sharing Entities Model

So far, we have described the main algorithms for the Data Producer and Data Consumer as well as the data structures through which both communicate with each other in an automated fashion. Now, in this last part of the article, we would like to propose a data model for such a system.


As described above, we have a few main components which allow us to build a data sharing system. Two roles:

1. Data Producer

2. Data Consumer

One type of artifact with its delivery mechanisms and Guarantees:

3. Dataset

4. Dataset Guarantees

5. Delivery Method

And a way to describe the usage of the Dataset:

6. Data Sharing Agreement

In addition, both the Data Sharing Agreement and the Dataset should be checked against the Infrastructure Policy (which defines which services can be used for producing and consuming the Dataset). Also, the Data Sharing Agreement should always be verified against the Access Policy.

Many aspects will depend on the granularity and acceptable complexity of the implementation. For instance, for Data Objects we can go as far as individual fields or data elements within a document (if required), so that we can assess access rights based on this more granular model.
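As a sketch, these entities could be modelled with Python dataclasses; the attribute names are illustrative, not a finalized schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetGuarantees:
    delivery_frequency: str            # e.g. "daily"
    data_format: str                   # e.g. "parquet"
    sla_hours: int                     # maximum delivery delay


@dataclass
class DeliveryMethod:
    endpoint: str                      # ADLS container, API, JDBC connection, ...
    access_mode: str                   # e.g. "direct", "copy"
    sharing_processor: Optional[str]   # e.g. "azure-data-factory", "azure-data-share"


@dataclass
class Dataset:
    name: str
    producer: str                      # Data Producer (organization, individual, app, process)
    classification: str
    guarantees: DatasetGuarantees
    delivery_methods: List[DeliveryMethod] = field(default_factory=list)


@dataclass
class DataSharingAgreement:
    agreement_id: str
    dataset: Dataset
    consumer: str                      # Data Consumer
    chosen_delivery: DeliveryMethod
    access_policy_ref: str             # Access Policy the agreement was verified against
    infrastructure_policy_ref: str     # Infrastructure Policy the dataset / agreement were checked against
```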

Hopefully this helps to clarify how a self-service data engine can be implemented on Azure. Please note that the same steps and considerations can also be applied to 3rd-party implementations and products.

What is next

We are currently working on an implementation of the algorithms and models described in this article and will publish the results in a public repository once we have them. Stay tuned!

