Part 2 - Cloud Modernization Journey of an MNC with Data Residency Requirements - Implementation on Microsoft Azure
Introduction
"The Car Will Become A Computer On Wheels" - Ford Motors
"We want to be 'D' in Gandalf" - DBS Bank
With statements like these from industry leaders in non-technology sectors, it is evident that this is the era of "digitize or die" for businesses in every sector. Organizations can no longer afford to ignore dramatic changes in the way customers look at products and services; they must adapt to them urgently. Chief Data Officers (CDOs) are being pushed to think digital-first, are starting to embrace the latest technologies such as cloud computing and AI/ML, and are accelerating their transition to a data-driven culture.
Because of this rapid data modernization, regulating and protecting the usage of data, especially national and citizen data (also known as personally identifiable information, or PII), is becoming difficult for governments. Governments are imposing stricter data residency requirements as prerequisites for businesses that wish to modernize their platforms on the cloud. This is now a known challenge that every business needs to address.
In this blog…
This is the second part of a two-part blog about the cloud data modernization journey of MNCs with data residency restrictions. In my previous blog, we discussed the three main problem statements of a customer wanting to move to the cloud, along with the appropriate solution patterns. Depending on the combination of tools and technologies, the storage and compute needs, the skills of the customer's teams, and the demands of the use case, a solution can be implemented on the cloud in numerous ways. In this blog, drawing from my experience working with various customers, I will discuss one such cloud implementation.
Implementation on Azure cloud
Considering all the regulations around data residency, we can envision the design pattern and solution implementation on Azure data services. The design pattern can be looked at in two ways:
1. Where the required data has cleared DRR and is approved to be persisted into a Data Consolidation storage area by the data product owner (DPO).
2. Where the required data has cleared DRR and can be accessed via Data Consolidation Compute, but cannot be persisted in Consolidated Data Storage.
Note that the customer’s Regulatory and Compliance Team (RCT), along with the DPO team, is accountable for deciding whether the required data can be persisted outside its country of origin. This decision is a key prerequisite for the solution design and the overall project plan.
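The choice between the two patterns can be sketched as a small decision function. This is only an illustration of the logic described above: the names `ResidencyDecision`, `Pattern` and `choose_pattern` are hypothetical and not part of any Azure SDK, and the two boolean flags stand in for whatever the RCT and DPO teams actually record.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(Enum):
    PERSIST = 1      # pattern 1: data may be stored in Consolidated Data Storage
    ACCESS_ONLY = 2  # pattern 2: data may only be read via Data Consolidation Compute


@dataclass(frozen=True)
class ResidencyDecision:
    """Hypothetical record of the RCT/DPO decision for one data request."""
    cleared_drr: bool
    persistence_approved: bool


def choose_pattern(decision: ResidencyDecision) -> Pattern:
    """Map the RCT/DPO decision to one of the two design patterns."""
    if not decision.cleared_drr:
        # Neither pattern applies until the data has cleared DRR.
        raise ValueError("data has not cleared DRR; neither pattern applies")
    if decision.persistence_approved:
        return Pattern.PERSIST
    return Pattern.ACCESS_ONLY
```

A request that has cleared DRR without persistence approval, for example, would map to `Pattern.ACCESS_ONLY`.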
Common Data Services
Azure Synapse Analytics
Quoting Microsoft’s Synapse product page:
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated resources at scale.
Synapse Analytics has two consumption models:
Serverless: pay-per-query, ideal for ad hoc data lake exploration and transformation.
Dedicated: provisioned clusters optimized for mission-critical data warehouse workloads.
We can use either option, depending on the required design pattern; patterns 1 and 2 are discussed in more detail in the sections below.
Azure Data Lake Storage Gen2 for big-data storage
Referring to the above diagram, Azure Data Lake Storage Gen2 (ADLS Gen2) serves as the storage area for countries whose data can reside on the cloud. For example, the storage for Group A, Group B and Consolidated Storage can all be on ADLS Gen2. The accounts can belong to the same subscription or to different ones, depending on how the customer wants to isolate the workloads and manage their cost and billing models. Microsoft’s ADLS Gen2 documentation is a good starting point to learn more.
Azure Data Factory/ADF for ETL and Orchestration
ADF is Azure's cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. ADF can be used to move data from Group B and Group C to the consolidated storage and compute area, as shown in the diagram above. Azure Purview, a data governance tool, captures the lineage data generated by Data Factory.
Azure Purview for data governance
Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multi-cloud, and software-as-a-service (SaaS) data. Create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage. Enable data curators to manage and secure your data estate. Empower data consumers to find valuable, trustworthy data.
Power BI
Power BI is a collection of software services, apps, and connectors that work together to turn your unrelated sources of data into coherent, visually immersive, and interactive insights. Your data may be an Excel spreadsheet or a collection of cloud-based and on-premises hybrid data warehouses. Power BI lets you easily connect to your data sources, visualize and discover what's important, and share that with anyone or everyone you want.
You will be able to integrate the existing Power BI workspace with Azure Synapse Analytics so that you can quickly access datasets, edit reports directly in the Synapse Studio, and automatically see updates to the report in the Power BI workspace.
There are two modes in which you can access data in Power BI, DirectQuery and Import, which we will discuss in relation to patterns 1 and 2 below.
Design pattern #1 - for SLA driven standard reports
In this design pattern, the required data has cleared DRR and is approved to be persisted into a Data Consolidation storage area.
Following are the pre-requisites for the scenario to qualify for this design pattern.
a. The requested data contains no PII.
b. The requested data is in aggregated form and doesn’t divulge any detail or transaction-level data.
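As a rough illustration, the two prerequisites can be expressed as a check over column metadata. The `Column` type and its `is_pii` flag are assumptions made for this sketch; in practice the classification might come from a data catalog or from the DPO team, and the final call still rests with the RCT and DPO.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Column:
    """Hypothetical column metadata, e.g. exported from a data catalog."""
    name: str
    is_pii: bool


def qualifies_for_pattern_1(columns: Sequence[Column], is_aggregated: bool) -> bool:
    """Return True only if prerequisites (a) and (b) both hold."""
    has_pii = any(col.is_pii for col in columns)  # prerequisite (a)
    return is_aggregated and not has_pii          # prerequisite (b)
```

A request made up of aggregated totals per country would qualify, while the same request with a customer name column flagged as PII would not.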
Synapse Dedicated SQL Pool for Relational data processing and data serving
Synapse Dedicated SQL Pool serves as the unification layer where the data from all the other groups is brought together for transformation to drive the group reporting requirements which could be driven by strict SLAs.
In short, a dedicated SQL pool (formerly SQL DW) stores data in relational tables with columnar storage. It follows an MPP (massively parallel processing) architecture to support the high-concurrency, high-volume, performant querying that is typical of SLA-driven reporting. It can serve typical use cases such as a data warehouse, data mart or data lakehouse. Microsoft’s documentation for Azure Synapse Analytics and the Synapse dedicated SQL pool is a good place to get started.
Power BI users can access this data by connecting to it and creating reports in Power BI. For pattern #1, Import mode is recommended, since it copies the data into the Power BI in-memory cache. This proves highly performant when the reports are SLA-driven and multiple users access the same report.
Design pattern #2 - for lightweight and non-persistent data access
In this pattern, the required data has cleared DRR and can be accessed via Data Consolidation Compute, but cannot be persisted in Consolidated Data Storage.
Following are the pre-requisites for the scenario to qualify for this second approach.
Referring to the above diagram, ADLS Gen2 and Synapse Dedicated SQL Pool are retained as in approach 1. Azure Data Factory (ADF) serves as the data transfer tool from Group A to the Group Reporting or Data Consolidation area.
Synapse External Tables for data virtualization/non persistent data access
To bring data from Group B into the Data Consolidation area, an external table can be defined on top of the required ADLS Gen2 data from Group B or Group A. An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from files or write data to files in Azure Storage. With Synapse Analytics, you can use external tables to read external data using a dedicated SQL pool or serverless SQL pool.
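To make this concrete, here is a minimal sketch that builds a `CREATE EXTERNAL TABLE` statement as a string. It assumes that an external data source and an external file format have already been created in the SQL pool; the names `group_b_lake`, `parquet_format`, `group_b_sales`, the folder path and the helper `external_table_ddl` itself are all hypothetical. The column types are passed through as trusted input, so this is not meant for untrusted callers.

```python
import re

# Simple allow-list for SQL identifiers used in this sketch.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name):
    """Reject anything that is not a plain SQL identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def external_table_ddl(schema, table, columns, location, data_source, file_format):
    """Build a CREATE EXTERNAL TABLE statement over files in the data lake.

    columns is a list of (column_name, sql_type) pairs; sql_type is trusted.
    """
    cols = ",\n".join(f"    {_check(col)} {sql_type}" for col, sql_type in columns)
    safe_location = location.replace("'", "''")  # escape quotes in the literal
    return (
        f"CREATE EXTERNAL TABLE {_check(schema)}.{_check(table)} (\n"
        f"{cols}\n"
        ") WITH (\n"
        f"    LOCATION = '{safe_location}',\n"
        f"    DATA_SOURCE = {_check(data_source)},\n"
        f"    FILE_FORMAT = {_check(file_format)}\n"
        ");"
    )


# Hypothetical Group B table, data source and file format names.
ddl = external_table_ddl(
    "dbo",
    "group_b_sales",
    [("country_code", "VARCHAR(2)"), ("total_sales", "DECIMAL(18, 2)")],
    "/group_b/sales/",
    "group_b_lake",
    "parquet_format",
)
print(ddl)
```

Because the table only points at the files, the Group B data stays in its own storage account and is read at query time, which is the behaviour pattern #2 calls for.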
Synapse Serverless SQL Pool with OPENROWSET
To bring data from Group B into the Consolidated area, we can also use the OPENROWSET option of the Synapse serverless SQL pool. When there is a need for in-place querying of data residing in files in the data lake, serverless SQL pool extends the existing OPENROWSET function.
To bring data from Group C into the group reporting or data consolidation area, since the on-prem system can be anything (Oracle, Netezza, Teradata, etc.), it is recommended to use a data virtualization tool such as Presto or Starburst that supports bringing data from these on-prem data platforms to Azure.
For pattern #2, Power BI users can create reports using DirectQuery mode, since it connects to the data dynamically and does not persist the data in the Power BI cache.
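A minimal sketch of such an in-place query, built as a string in Python, could look like the following. The helper `openrowset_query`, the storage account `examplelake` and the folder path are hypothetical, and the sketch assumes the files are in Parquet format and that the caller already has access to the storage account.

```python
def openrowset_query(file_url, top=10):
    """Build a serverless SQL pool query that reads Parquet files in place."""
    if top <= 0:
        raise ValueError("top must be a positive integer")
    safe_url = file_url.replace("'", "''")  # escape quotes in the literal
    return (
        f"SELECT TOP {int(top)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{safe_url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result];"
    )


# Hypothetical storage account, container and folder for Group B.
query = openrowset_query(
    "https://examplelake.dfs.core.windows.net/group-b/sales/*.parquet"
)
print(query)
```

The query reads the files where they are, so nothing is copied into Consolidated Data Storage.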
Conclusion
Data residency requirements are changing rapidly, and cloud-scale analytics is advancing just as quickly with newer and better technologies. Businesses can benefit a lot by embracing technology and becoming data-driven at the core. Engaging a reliable cloud service partner that is an industry leader in provisioning cloud infrastructure, is experienced in handling such regulatory complexities and has a wealth of knowledge is key to the customer’s successful journey to the cloud.
In this blog, I have tried to consolidate and document one of my personal experiences of cloud implementation journey with a customer. In no way does it cover the full capabilities of all the services mentioned, but this is just an effort to introduce the services to you so that you can start exploring the richness of these products. If you think the article was helpful, do like the post. In case you have any questions, comments or observations regarding the article, please drop a line and I will try my best to answer it.