
Part 2 - Cloud Modernization Journey of an MNC with Data Residency Requirements - Implementation on Microsoft Azure

Suma Manohar

Senior Global Black Belt - Data & AI, Microsoft

Published: December 23, 2021


Introduction

"The Car Will Become A Computer On Wheels" - Ford Motors

"We want to be 'D' in Gandalf" - DBS Bank

With statements like these coming from industry leaders in non-technology sectors, it is evident that this is the era of "digitize or die" for businesses in every sector. Organizations can no longer afford to ignore the dramatic changes in how customers look at products and services; they must adapt to them urgently. Chief Data Officers (CDOs) are being pushed to think digital-first, are starting to embrace technologies such as cloud computing and AI/ML, and are accelerating their transition to a data-driven culture.

Because of this rapid data modernization, governments are finding it difficult to regulate and protect the use of data, especially national and citizen data (also known as personally identifiable information, or PII). They are imposing stricter data residency requirements as prerequisites for businesses that wish to modernize their platforms on the cloud. This is now a known challenge that every business needs to address.

In this blog…

This is the second part of a two-part blog about the cloud data modernization journey for MNCs with data residency restrictions. In my previous blog, we discussed the three main problem statements of a customer wanting to move to the cloud, along with the appropriate solution patterns. Given the different combinations of tools and technologies, storage and compute needs, skills of the customer teams, and demands of the use case, there are numerous ways to implement a solution on the cloud. In this blog, drawing on my experience working with various customers, I will discuss one such cloud implementation.

Implementation on Azure cloud

Considering all the regulations around data residency, we can envision the design pattern and solution implementation on Azure data services. The design pattern can be looked at in two ways:

1. Where the required data has cleared DRR and is approved by the data product owner (DPO) to be persisted into a Data Consolidation storage area.

2. Where the required data has cleared DRR and can be accessed via Data Consolidation Compute, but cannot be persisted in Consolidated Data Storage.

Note that the customer’s Regulatory and Compliance Team (RCT), along with the DPO team, is accountable for deciding whether the required data can be persisted outside its country of origin. This decision is a key prerequisite to the solution design and the overall project plan.

Common Data Services

Azure Synapse Analytics

Quoting Microsoft’s Synapse product page:

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated resources at scale.

Synapse Analytics has two consumption models:

Serverless: pay-per-query, ideal for ad hoc data lake exploration and transformation.
Dedicated: clusters optimized for mission-critical data warehouse workloads.

We can use either option depending on the required design pattern, as discussed in more detail in the pattern #1 and pattern #2 sections below.

Azure Data Lake Storage Gen2 for big-data storage

Referring to the above diagram, Azure Data Lake Storage Gen2 (ADLS Gen2) serves as the storage area for countries whose data can reside on the cloud. For example, the storage for Group A, Group B and Consolidated Storage can all be on ADLS Gen2. The accounts can belong to the same subscription or to different ones, depending on how the customer wants to isolate workloads and manage cost and billing. The following reference articles are a good starting point for ADLS Gen2.

Overview of Azure Data Lake Storage for the data management and analytics scenario - Cloud Adoption Framework | Microsoft Docs

Key considerations for Azure Data Lake Storage - Cloud Adoption Framework | Microsoft Docs

adlsguidancedoc/Hitchhikers_Guide_to_the_Datalake.md at master · rukmani-msft/adlsguidancedoc · GitHub

Azure Data Factory/ADF for ETL and Orchestration

ADF is Azure's cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. ADF can be used to move data from Group B and Group C to the consolidated storage and compute area, as shown in the diagram above. Azure Purview, a data governance tool, captures the lineage data generated by Data Factory.

Introduction to Azure Data Factory - Azure Data Factory | Microsoft Docs

Azure Data Factory: Frequently asked questions - Azure Data Factory | Microsoft Docs

Azure Purview for data governance

Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multi-cloud, and software-as-a-service (SaaS) data. Create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage. Enable data curators to manage and secure your data estate. Empower data consumers to find valuable, trustworthy data.

Introduction to Azure Purview - Azure Purview | Microsoft Docs

Push Data Factory lineage data to Azure Purview - Azure Data Factory | Microsoft Docs

Power BI

Power BI is a collection of software services, apps, and connectors that work together to turn your unrelated sources of data into coherent, visually immersive, and interactive insights. Your data may be an Excel spreadsheet or a collection of cloud-based and on-premises hybrid data warehouses. Power BI lets you easily connect to your data sources, visualize and discover what's important, and share that with anyone or everyone you want.

You will be able to integrate the existing Power BI workspace with Azure Synapse Analytics so that you can quickly access datasets, edit reports directly in the Synapse Studio, and automatically see updates to the report in the Power BI workspace.

There are two modes in which you can access data in Power BI, DirectQuery and Import, which we will discuss in relation to patterns #1 and #2 below.

Dataset modes in the Power BI service - Power BI | Microsoft Docs

Design pattern #1 - for SLA driven standard reports

In this design pattern, the required data has cleared DRR and is approved to be persisted into a Data Consolidation storage area.

The following are the prerequisites for a scenario to qualify for this design pattern.

Group Reporting Team (GRT) needs to deliver an SLA-driven report for the business users.
Data requested by GRT is not lightweight data.
Data requested by GRT is approved by DPO & RCT.
Data is approved to be physically shipped to the Data Consolidation Storage area because it does not breach any regulatory rules. Reasons for that could be:

a. Data requested has no PII data

b. Data requested is in aggregated form and doesn’t divulge any detail/ transaction-level data

Synapse Dedicated SQL Pool for Relational data processing and data serving

Synapse Dedicated SQL Pool serves as the unification layer where the data from all the other groups is brought together for transformation to drive the group reporting requirements, which may be governed by strict SLAs.

In short, a dedicated SQL pool (formerly SQL DW) stores data in relational tables with columnar storage. It uses an MPP architecture to support the high-concurrency, high-volume, performant querying that is typical of SLA-driven reporting. It can serve typical use cases such as a data warehouse, data mart or data lakehouse. The following links will get you started with Azure Synapse Analytics and Azure Synapse Dedicated SQL Pool:

What is Azure Synapse Analytics? - Azure Synapse Analytics | Microsoft Docs

Synapse SQL architecture - Azure Synapse Analytics | Microsoft Docs

Best practices for dedicated SQL pools - Azure Synapse Analytics | Microsoft Docs

Power BI users can access this data by connecting to it and creating reports in Power BI. For pattern #1, Import mode is recommended because it copies the data into the Power BI in-memory cache. This proves highly performant when reports are SLA driven and many users access the same report.

Design pattern #2 - for lightweight and non-persistent data access

In this pattern, the required data has cleared DRR and can be accessed via Data Consolidation Compute, but cannot be persisted in Consolidated Data Storage.

The following are the prerequisites for a scenario to qualify for this design pattern.

Data requested by GRT is lightweight data.
Data requested by GRT is approved by DPO & RCT.
Data is NOT approved to be physically shipped to the data consolidation area, but it can be cached or accessed in a non-persistent mode. It is essential to define what caching the regulatory requirement allows. The customer needs to document the requirement and then choose the solution approach carefully.
Depending on the definition of caching/non-persistent data access, the GRT can access data in several ways.
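
As a rough illustration, the prerequisites listed for the two patterns can be written as a small decision function. This is only a sketch: the `DataRequest` class, its field names and the returned labels are hypothetical and are not part of any Azure service. In practice the decision rests with the DPO and RCT teams.

```python
from dataclasses import dataclass


@dataclass
class DataRequest:
    """A data request from the Group Reporting Team (illustrative fields only)."""
    approved_by_dpo_and_rct: bool  # DPO and RCT have approved the request
    persistence_approved: bool     # data may be stored in Consolidated Data Storage
    lightweight: bool              # data is small enough for on-demand access
    sla_driven: bool               # the report must meet an SLA


def choose_pattern(request: DataRequest) -> str:
    """Return the design pattern a request qualifies for, if any."""
    if not request.approved_by_dpo_and_rct:
        return "not approved"
    if request.persistence_approved and request.sla_driven and not request.lightweight:
        return "pattern 1"  # persist the data and serve it from the consolidated area
    if not request.persistence_approved and request.lightweight:
        return "pattern 2"  # non-persistent access only
    return "needs review"   # combination not covered by either prerequisite list


heavy_report = DataRequest(approved_by_dpo_and_rct=True, persistence_approved=True,
                           lightweight=False, sla_driven=True)
light_lookup = DataRequest(approved_by_dpo_and_rct=True, persistence_approved=False,
                           lightweight=True, sla_driven=False)
print(choose_pattern(heavy_report))
print(choose_pattern(light_lookup))
```

Combinations that match neither prerequisite list, for example heavy data that cannot be persisted, are deliberately returned as "needs review" rather than forced into a pattern.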

Referring to the above diagram, ADLS Gen2 and Synapse Dedicated SQL Pool are retained as in pattern #1. Azure Data Factory (ADF) serves as the data transfer tool from Group A to the group reporting or data consolidation area.

Synapse External Tables for data virtualization/non persistent data access

To bring data from Group B into the data consolidation area, an external table can be defined on top of the required ADLS Gen2 data from Group B or Group A. An external table points to data located in Hadoop, Azure Blob Storage, or Azure Data Lake Storage. External tables are used to read data from files or write data to files in Azure Storage. With Synapse Analytics, you can use external tables to read external data using a dedicated SQL pool or a serverless SQL pool.

Use external tables with Synapse SQL - Azure Synapse Analytics | Microsoft Docs
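
To make the idea concrete, here is a minimal sketch that generates a `CREATE EXTERNAL TABLE` statement as text. The helper `external_table_ddl`, the table and column names, the folder path, and the `group_b_lake` and `parquet_format` object names are all hypothetical. The sketch assumes the external data source and file format have already been created with `CREATE EXTERNAL DATA SOURCE` and `CREATE EXTERNAL FILE FORMAT`.

```python
def external_table_ddl(table: str, columns: dict, location: str,
                       data_source: str, file_format: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Synapse SQL as a string."""
    # One "[column] TYPE" entry per column, in insertion order.
    column_list = ",\n    ".join(
        f"[{name}] {sql_type}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"    {column_list}\n"
        f")\n"
        f"WITH (\n"
        f"    LOCATION = '{location}',\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    FILE_FORMAT = {file_format}\n"
        f");"
    )


# Hypothetical aggregated dataset owned by Group B.
ddl = external_table_ddl(
    table="dbo.group_b_sales_summary",
    columns={
        "country_code": "CHAR(2)",
        "report_month": "DATE",
        "total_sales": "DECIMAL(18, 2)",
    },
    location="sales_summary/",
    data_source="group_b_lake",
    file_format="parquet_format",
)
print(ddl)
```

The table stores only metadata in the consolidation area. The files themselves stay in the Group B storage account and are read at query time, which is the property pattern #2 relies on.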

Synapse Serverless SQL Pool with OPENROWSET

To bring data from Group B into the consolidated area, we can also use the OPENROWSET option of Synapse serverless SQL pool. When there is a need for in-place querying of data residing in files in the data lake, serverless SQL pool provides an extended version of the existing OPENROWSET function.

Tutorial: Connect serverless SQL pool to Power BI Desktop & create report - Azure Synapse Analytics | Microsoft Docs

Serverless SQL pool - Azure Synapse Analytics | Microsoft Docs
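
As an illustration of in-place querying, the sketch below builds an ad hoc `OPENROWSET` query over Parquet files. The helper `openrowset_query` and the storage URL are hypothetical, and the sketch assumes the caller already has permission to read the files. The resulting text would be submitted through any SQL client connected to the serverless SQL pool.

```python
def openrowset_query(file_url: str, top: int = 100) -> str:
    """Build a serverless SQL pool query that reads Parquet files in place."""
    # Double any single quotes so the URL stays inside the SQL string literal.
    safe_url = file_url.replace("'", "''")
    return (
        f"SELECT TOP {int(top)} *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{safe_url}',\n"
        f"    FORMAT = 'PARQUET'\n"
        f") AS [result];"
    )


# Hypothetical ADLS Gen2 location holding Group B data.
query = openrowset_query(
    "https://examplelake.dfs.core.windows.net/group-b/sales_summary/*.parquet",
    top=10,
)
print(query)
```

Unlike an external table, this approach needs no table definition up front, which suits one-off exploration of lightweight data.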

To bring data from Group C into the group reporting or data consolidation area, since the on-premises system can be anything (Oracle, Netezza, Teradata and so on), it is recommended to use a data virtualization tool such as Presto or Starburst that supports bringing data from these on-premises data platforms to Azure.

For pattern #2, Power BI users can create reports using DirectQuery mode, since it connects to the data dynamically and does not persist the data in the Power BI cache.

Using DirectQuery in Power BI - Power BI | Microsoft Docs

Conclusion

Data residency requirements are changing rapidly, and cloud-scale analytics is advancing with newer and better technologies day by day. Businesses can benefit a lot by embracing technology and becoming data-driven at the core. Engaging a reliable cloud service partner that is an industry leader in provisioning cloud infrastructure, is experienced in handling such regulatory complexities and has a wealth of knowledge is key to the customer’s successful journey to the cloud.

In this blog, I have tried to consolidate and document one of my personal experiences of a cloud implementation journey with a customer. It by no means covers the full capabilities of all the services mentioned; it is simply an effort to introduce the services so that you can start exploring the richness of these products. If you found the article helpful, do like the post. If you have any questions, comments or observations regarding the article, please drop a line and I will try my best to answer.
