The Microsoft Digital Data Transformation Journey

Happy New Year Everyone! Hope you all had a wonderful holiday filled with memorable moments and are feeling recharged for a grand 2022! The pandemic period that we have navigated over the past two years, and continue to navigate, has challenged, shaped, and evolved us in ways not imagined. It has taught us to lead and live every dimension of our lives with purpose and grit. Here's to a fantastic year ahead filled with positivity and progress!

Sharing learnings and fostering conversations for new learnings is a great way to start the new year, and there is no better way for our team in Microsoft Digital to do so than with this article sharing a balcony view of our Enterprise Data Transformation journey. Microsoft Digital is Microsoft's IT organization. We build and operate the systems that power Microsoft's global business and operations, with an emphasis on digitally transforming our customer experiences, our employee experiences, and our internal operations. Our team is the Enterprise Data team in Microsoft Digital. We build and operate the Enterprise Data Foundations to responsibly democratize data for enterprise-wide data applications, and we are charting and navigating a journey to transform the Enterprise Data Estate that powers Microsoft's global business and operations. We are 2.5 years into what has been an incredible journey of applied learnings, with many more ahead. The most amazing aspect of our journey has been the conversations we have had with many of our global customers, marquee brands investing in their own data transformation journeys, to share our applied learnings and to learn from them in shaping our evolution. The intent of this article is to capture and share a balcony view of our data transformation journey. The investments and progress described here reflect the leadership of an amazing team, with material influence and inspiration from industry-wide data leaders.

Enterprise Data Transformation is a very dense topic. Thought-leading architectural approaches such as Data Mesh, Data Fabric, and Data Hub are widely discussed and debated in the data-verse, each with sound credibility, and all of which have influenced our data transformation journey in Microsoft Digital. The goal of this writing is to share a balcony view of our data transformation considerations, macro architectural choices, and applied learnings. With a focus on presenting a wide balcony view, this article will not distill the depth beneath the range of topics introduced here; we will publish further and deeper articles on those topics. Also to note is that this writing does not advocate a singular architectural approach or specific implementation technologies. Our motivation is to share applied practitioner learnings that are vendor and technology agnostic, and to foster dialog with anyone interested in learning more and sharing their learnings, as we navigate our journeys of continuous and adaptive learning in powering Data Driven Innovation for our purposes and organizations. With these framing contexts set, let's dive right in!

The content to follow is structured in two sections. The first section will introduce the quality challenges that existed in our prior data estate and the opportunities that we identified for our data transformation journey. The second section will provide a balcony view of our data transformation blueprint, our investments, and the outcomes that we have progressed to date.

The Enterprise Data Opportunity

Data being an invaluable enterprise asset is broadly stated and generally well understood. Enterprises generate large volumes of data, want to generate and acquire tons more, and want their people and their systems to be able to access and use the data needed for transformative impact. The Enterprise Data Opportunity, stated simply, is to Scale Responsible and Transformative Data Applications. Transformative Data Applications integrate insights and intelligence from connected enterprise-wide data to create delightful customer experiences, enhance employee productivity, and generate operational efficiencies. Enabling teams across an enterprise to do so responsibly and at scale is a quality opportunity and challenge.

Data is an invaluable and sensitive asset. It is generally not directly usable in the raw and atomic forms in which it is produced. The following are top-of-mind considerations that impact the use of data:

  1. The quality of the data, encompassing its validity, accuracy, completeness, and freshness (a minimal automated check is sketched after this list).
  2. The usability of the data in the form it is produced, without further conformance and enrichment.
  3. The joinability of the data to related data that it needs to be connected with to be used in meaningful ways. Transformative enterprise data applications generally entail connecting cross-domain data for connected enterprise insights and intelligence.
  4. The use of data in compliance with user preferences, regulatory requirements, mandatory enterprise policies, and domain specific data policies.
  5. The data literacy of the data users, spanning all the above and the contextual use of data in their domains.
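
To make the first of these considerations concrete, below is a minimal sketch of what automated data quality checks can look like. The dataset shape, fields, and thresholds are illustrative assumptions, not our platform's implementation.

```python
# A minimal sketch of automated quality checks; fields and thresholds are
# illustrative assumptions, not our platform's implementation.
from datetime import datetime, timedelta, timezone

def quality_report(rows: list[dict]) -> dict[str, bool]:
    now = datetime.now(timezone.utc)
    return {
        # Validity: amounts must be non-negative numbers.
        "validity": all(
            isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
            for r in rows
        ),
        # Completeness: every row carries a customer_id.
        "completeness": all(r.get("customer_id") for r in rows),
        # Freshness: the newest record is less than 24 hours old.
        "freshness": max(r["event_time"] for r in rows) > now - timedelta(hours=24),
    }

rows = [{"customer_id": "c-1", "amount": 42.0,
         "event_time": datetime.now(timezone.utc)}]
print(quality_report(rows))  # {'validity': True, 'completeness': True, 'freshness': True}
```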

Being lax on any of the above considerations can make the difference between the outcomes of data use being beneficial or detrimental. Scaling controls to address these considerations at the volume, variety, and growth velocity of an enterprise's data estate is fundamental to responsibly democratizing data for enterprise-wide applications. This, as we well know, is easier said than done :-)

Like most enterprises, we have a globally distributed and organically evolved / evolving data estate in Microsoft Digital. The illustration below is a macro zoom-out view of our enterprise data estate when we started our data transformation journey.

[Figure: a macro zoom-out view of our enterprise data estate at the start of our journey]

Readers in data teams, or those with an awareness of their enterprise data estates, should be able to see the concealed complexity in this simplistic view: the volume and variety of each of the components and their constituent elements. We had hundreds of data sources, data infrastructures, and data consuming teams generating, processing, and using several petabytes of data on a daily basis.

To scale responsible and transformative data applications, we had to unpack and determine how we could responsibly scale each of the fundamentals: data access and use, data storage and management, and data compute.

1. Scaling Responsible Data Access and Use

The following were the mechanisms for data access and use in our pre-transformation state:

  • Data being used only by the team generating the data.
  • Data being shared between teams via point-to-point integrations, with integration specific contracts and entailing data movement / copies and duplicative data infrastructure. Such data movement and copies would commonly transcend multiple tiers of data sharing, resulting in loss of true data ownership and causing governance challenges.
  • Data requirements being serviced by shared data operations teams trained in data management, situated within an organization or situated centrally in IT serving multiple organizations.
  • Canned Dashboards and Reports with limited self-serve interactivity for commonly used data and insights.
  • Tightly scoped and contained Data APIs with point-to-point integrations and contracts.

Individually and collectively, these mechanisms are non-scalable anti-patterns for an enterprise aspiring to responsibly scale transformative enterprise data applications.

The following are the opportunities that we identified to scale responsible data access and use:

  1. Data as an Enterprise Asset. Most enterprise data will have transformative application opportunities in current known and future to-be-discovered enterprise contexts. Such application opportunities will exist and surface in enterprise-wide teams and must be enabled responsibly at scale.
  2. Accountable Data Ownership. Instating ownership for all data by the best-suited subject area experts trained in data management and governance, with accountability for the overall health and responsible use of the data they own.
  3. Compliant by Design. Ensuring that all data access and use is always compliant with the contextual data owner-defined, regulatory, and enterprise policies, with fine-grained auditing to prove compliance (a minimal sketch follows this list).
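
Below is a minimal sketch of what a compliant-by-design access decision can look like, with every decision audited. The AccessPolicy and DataAsset shapes, and the audit record, are hypothetical illustrations, not our actual implementation.

```python
# A minimal sketch of a "compliant by design" access decision with auditing.
# The AccessPolicy/DataAsset shapes and audit record are hypothetical
# illustrations, not the actual Microsoft Digital implementation.
from dataclasses import dataclass
from datetime import datetime, timezone

AUDIT_LOG: list[dict] = []  # every decision is recorded to prove compliance

@dataclass
class AccessPolicy:
    allowed_purposes: set           # purposes the data owner permits
    requires_pii_clearance: bool = False

@dataclass
class DataAsset:
    name: str
    owner: str
    policy: AccessPolicy
    contains_pii: bool = False

def request_access(asset: DataAsset, user: str, purpose: str,
                   user_has_pii_clearance: bool = False) -> bool:
    granted = (
        purpose in asset.policy.allowed_purposes
        and (not (asset.contains_pii and asset.policy.requires_pii_clearance)
             or user_has_pii_clearance)
    )
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "asset": asset.name, "user": user,
        "purpose": purpose, "granted": granted,
    })
    return granted

orders = DataAsset("sales/orders", "sales-data-team",
                   AccessPolicy({"revenue-reporting"}, requires_pii_clearance=True),
                   contains_pii=True)
print(request_access(orders, "alice", "revenue-reporting"))  # False: no PII clearance
```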

2. Scaling Responsible Data Storage and Management

Data in an enterprise is distributed by virtue of the systems and operations that generate, process, and serve data being distributed. The core challenge in scaling responsible data storage and management is anchored in the organic evolution of enterprise systems and operations that process and serve data. Proliferated data copies and data pipelines are prevalent in most enterprises, posing challenges in not just maintaining the integrity of data ownership, but also in responsible data management.

The following were the top line data storage and management challenges in our pre-transformation state:

  • Proliferated data copies from point-to-point integrations for data sharing, growing with fast evolving needs for connected data across enterprise domains.
  • Distributed data management investments (practices, systems, operations) for common standards and requirements, resulting in duplicative and inconsistent implementations. Data Management is a broad term that encompasses Metadata Management, Data Quality Management, Data Conformance, Data Unification, Data Compliance, Security and Access Management, Data Lifecycle Management, Source-to-Application Data Lineage Tracing, and Data Infrastructure Optimization.
  • Dilution of data ownership, data exposure risks, and material operational expenditure resulting from the above.

The following are the opportunities that we identified to scale responsible data storage and management:

  1. Containing and governing essential data copies (sketched after this list). Data copies are a widely discussed and debated topic. While avoiding data movement and copies wherever doable is a generally recommended practice, there are cases where controlled copies are essential. Examples include: the movement of data from a store optimized for transaction processing to a store optimized for big data compute; the movement of data to a store where it can be conformed and connected with broader enterprise data, as many applications require; the movement of data to an application's edge data cache for serving mission-critical latencies and offline scenarios; and storing the output of compute for a data product, including the data used, in a store that is not the data source. Such data movement and copies could be to both physical and in-memory stores that are not the stores where the data is stored on creation, with the choice determined by contextual factors such as the volume of data, data retention requirements, and infrastructure cost. Technologies aimed at addressing the "data copies for big data compute" challenge, such as Hybrid Transactional and Analytical Processing Systems (HTAP), In-Memory Compute, and Data Virtualization, do not solve for all cases and also entail moving and transforming data to "under the hood" constructs, whether in-memory and/or backed by physical storage. Essential data copies are not bad as long as they are contained, well managed, and governed.
  2. Enabling teams to own their data applications, without proliferating data infrastructure and data copies. The top reasons for the proliferation of data infrastructure and data copies in an enterprise are a) data source systems not being optimized for the compute and serving requirements of modern data applications, b) requirements to conform and connect data from across enterprise domains and systems, and c) teams wanting full control and ownership of their data applications. With managed, self-serve enabled, and governed infrastructure services that are optimized for modern data workloads, data copies can be contained to the essential minimum while enabling teams across an enterprise to build and own responsible data applications without infrastructure proliferation.
  3. Standardizing and Scaling enterprise data management. Standardized and Scalable Data Management enables uncompromised data ownership, impactful data applications, proactive avoidance of data exposure risks, and optimized data operations. Related and essential investments include defining enterprise data management standards, scaling them with intelligent automation, and operationalizing an enterprise-wide data governance program to maintain compliance.
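
Below is a hedged sketch of one way to govern essential copies: registering each copy with lineage to its source, so that scanner-observed copies with no registration surface for consolidation. All names and structures are illustrative assumptions.

```python
# A hedged sketch of governing essential copies: each copy is registered with
# lineage to its source, so scanner-observed copies that nobody registered
# surface for consolidation. All names and structures are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class CopyRecord:
    source: str        # authoritative asset, e.g. "sales/orders"
    destination: str   # where the copy lives, e.g. "edgecache/orders"
    purpose: str       # why this copy is essential
    owner: str

REGISTERED: set = set()

def register_copy(record: CopyRecord) -> None:
    REGISTERED.add(record)

def ungoverned_copies(observed_edges: set) -> set:
    """Lineage edges observed in the estate map with no registration."""
    registered_edges = {(c.source, c.destination) for c in REGISTERED}
    return observed_edges - registered_edges

register_copy(CopyRecord("sales/orders", "edgecache/orders",
                         "mission-critical latency", "sales-app-team"))
observed = {("sales/orders", "edgecache/orders"),
            ("sales/orders", "team-x/orders_copy")}  # second edge is ungoverned
print(ungoverned_copies(observed))  # {('sales/orders', 'team-x/orders_copy')}
```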

3. Scaling Responsible Compute

Data is generated, processed, and used by compute. Compute is owned by teams across an enterprise, making it the most critical and the most challenging to scale responsibly.

The following were the top line compute challenges in our pre-transformation state:

  1. Data movement to compute infrastructure owned by teams across the enterprise, resulting in proliferated data copies.
  2. Compute outputs persisted in and served from proliferated stores with variant data management mechanisms.
  3. Duplicative investments by multiple teams in compute for common data management capabilities.
  4. Difficulty correlating changes across data and compute when tracing breaking data changes to the causal compute.
  5. Lack of rigor in scalable engineering excellence for data compute.

The following are the opportunities that we identified to scale responsible compute:

  1. Enabling responsible and federated compute. Scaling federated compute for data applications across enterprise-wide teams, by abstracting the complexities of responsibly managing data used and produced by compute.
  2. Bringing compute to data where it is stored and optimized. Pivoting from moving data to compute infrastructures, to bringing compute to data where it is stored and optimized for the compute context (a sketch follows this list).
  3. Enabling preferred compute. Enabling the flexibility for teams to bring and use their preferred compute modalities for their data applications without requiring data movement or compromising responsible data management.
  4. Standardizing compute for enterprise data management. Abstracting the complexities of enterprise data management with standardized compute for common data management requirements, enabling teams to focus their compute investments on differentiated data applications.
  5. Engineering Excellence for data as practiced for our software products. Applying engineering systems and practices to harden the engineering and operations rigors for data compute and data in fostering reuse, scaling responsible deployments, production observability, incident management, and infrastructure optimization.
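
Below is a sketch of what bringing compute to data can look like with PySpark (Spark being one of our compute modalities); the lake paths are hypothetical. The query runs against governed data in place, and the resulting data product is persisted back to the governed platform rather than to team-owned infrastructure.

```python
# A sketch of bringing compute to data with PySpark; the lake paths are
# hypothetical. The query runs against governed data in place, and the data
# product is persisted back to the governed platform, not a team-owned store.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compute-to-data").getOrCreate()

# Read the governed, in-place dataset; no extract-and-copy step.
orders = spark.read.parquet(
    "abfss://lake@enterprisedata.dfs.core.windows.net/sales/orders")
orders.createOrReplaceTempView("orders")

# Domain compute expressed where the data lives and is optimized.
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

# Persist the data product back to the governed platform, keeping copies contained.
daily_revenue.write.mode("overwrite").parquet(
    "abfss://lake@enterprisedata.dfs.core.windows.net/products/daily_revenue")
```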

The challenges and opportunities introduced in this section constitute the purpose of our data transformation journey and investments introduced in the next section.

Transforming the Microsoft Digital Enterprise Data Estate

This section presents a balcony view of our data transformation blueprint to address the challenges and opportunities in scaling responsible and transformative data applications.

Our data transformation journey has been, and will continue to be, one of continuous applied and adaptive learnings. Transforming data in an enterprise is a multi-dimensional adaptive leadership opportunity and challenge. It is not just a technology transformation journey; it is at its core a people and practices evolution with technology as an enabler. While there are incredible technology opportunities and challenges in transforming an enterprise data estate, even the best technical solutions will fall short without organizational alignment, people advocacy, and the evolution of data practices. Synergizing these multi-faceted dimensions and making progress is an incremental journey of adaptive leadership and applied learnings. We are 2.5 years into our journey here in Microsoft Digital, a period during which we have evolved 80% of our data estate to the blueprint introduced in this section. We are on a path to 100% and expect to be there by the end of this calendar year.

Each of the topics introduced in this balcony view could fill a focused book. We will look to publish further and deeper articles on these topics, the transformative outcomes that we have progressed, and the softer dimension of navigating our organizational change management in evolving to the blueprint.

Our data transformation blueprint is based on the overarching principle of scaling federated and transformative enterprise data applications with responsible enterprise data foundations. There are vibrant discussions and debates in the data-verse on federated versus centralized approaches to data transformation. Our point of view is to strike the essential balance, and our data transformation blueprint is based on the following related principles:

  • Teams across the enterprise must be fully empowered and enabled to build and operate transformative data applications for their domains.
  • Teams across the enterprise should not have to invest in duplicating what can be shared data foundations.
  • Shared data foundations should enable self-serve capabilities and extensibility for domain specific requirements.
  • Any team and anyone from across the enterprise can contribute to building shared data foundations, without needing to duplicate investments.

Presented below and based on these principles, is a balcony view of our data transformation blueprint, the components of which are introduced in the following sections in the sequence as numbered in the view:

[Figure: a balcony view of our data transformation blueprint, with components numbered as introduced below]

1. The Enterprise Data Governance Platform

"Do we know where all of our data exists, who is using which data, and for what purpose?" - a real question posed a few years ago by a senior leader at Microsoft, for which there was no all-encompassing answer at that point in time. With our Enterprise Data Governance Platform investment, we have been able to make good progress towards addressing the question, with an all-encompassing lens.

Forming a comprehensive view of an enterprise data estate is the first step in identifying the opportunities to responsibly scale data management and use. Our Enterprise Data Governance Platform is our shared data foundation for managing our Enterprise Data Estate. It is the single destination for enterprise-wide data owners and data consumers. Data owners use the platform to configure policies and manage the health of their data assets. Data consumers use the platform to discover, subscribe to, and access data in compliance with data owner, regulatory, and enterprise policies.

The capabilities of our Enterprise Data Governance Platform include:

  • Data Scanners and Metadata Publishing APIs to construct and maintain a map of our enterprise data estate with rich metadata and data lineage visibility. The data estate map serves as the foundation to scale data management with intelligent automation and essential human controls.
  • Data Discoverability using an Enterprise Data Catalog with built-in capabilities for defining and managing a data glossary, data taxonomies, and data classification rules.
  • Data Quality Management to configure and maintain shared enterprise and domain-specific data quality standards.
  • Data Access Management for data owners to configure and manage data access policies, automated workflows for data consumer access request processing in compliance with data owner-defined access policies, and data access auditing and reporting.
  • Data Compliance for regulatory and corporate data compliance standards such as Privacy, GDPR, and SOX.
  • Data Lifecycle Management for data retention in compliance with configurable data retention policies for regulatory and data use purposes.
  • Data Health Scorecards to automate data health measurements and insights for all the above.
  • Extensibility to add domain specific data management extensions.

Central to all capabilities is the notion of generating, capturing, and using metadata to its fullest in scaling data management with intelligent automation and essential human data steward controls.

[Figure: the Enterprise Data Governance Platform and its metadata-driven capabilities]
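
As a concrete illustration of metadata capture, below is a hypothetical sketch of publishing an asset's metadata, classifications, and lineage to a catalog. The endpoint, payload shape, and field names are assumptions for illustration, not our platform's actual API.

```python
# A hypothetical sketch of publishing an asset's metadata to the catalog; the
# endpoint, payload shape, and field names are assumptions for illustration,
# not the platform's actual API.
import json
import urllib.request

asset_metadata = {
    "qualifiedName": "sales/orders",
    "owner": "sales-data-team",
    "schema": [{"name": "order_id", "type": "string"},
               {"name": "amount", "type": "decimal"}],
    "classifications": ["Confidential"],
    "upstreamLineage": ["crm/accounts"],  # feeds the estate-wide lineage graph
}

request = urllib.request.Request(
    "https://governance.example.com/api/assets",  # hypothetical endpoint
    data=json.dumps(asset_metadata).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(request) as response:
    print(response.status)
```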

Some examples of metadata-driven intelligent automation to scale data management include:

  • Classifying assets for regulatory compliance and automating related requirements.
  • Classifying assets for contextual and automated data quality checks.
  • Constructing data lineage graphs to automate the detection of data copies and recommend actions to reduce them.
  • Detecting assets with PII data to automate PII de-identification for applications that do not require PII attributions (sketched below).
  • Inferring undefined data relationships to automate connecting and unifying related data.
  • Instating and automating contextual data lifecycle management policies.
  • Enabling semantic and natural language discovery experiences in the data catalog.
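
As one concrete illustration, below is a simplified sketch of the PII detection and de-identification automation referenced above. The classification rules, tags, and masking approach are illustrative, not our actual taxonomy or implementation.

```python
# A simplified sketch of metadata-driven PII classification and
# de-identification; the rules, tags, and masking are illustrative,
# not our actual taxonomy or implementation.
import hashlib
import re

CLASSIFICATION_RULES = {
    "PII.Email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "PII.Phone": re.compile(r"\+?\d[\d\-\s]{7,}\d"),
}

def classify_column(sample_values) -> set:
    """Tag a column when most sampled values match a classification rule."""
    tags = set()
    if not sample_values:
        return tags
    for tag, pattern in CLASSIFICATION_RULES.items():
        matches = sum(bool(pattern.fullmatch(str(v))) for v in sample_values)
        if matches >= 0.8 * len(sample_values):
            tags.add(tag)
    return tags

def deidentify(value: str) -> str:
    """Downstream automation for PII-tagged columns feeding non-PII applications."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

print(classify_column(["alice@contoso.com", "bob@contoso.com"]))  # {'PII.Email'}
print(deidentify("alice@contoso.com"))  # stable pseudonym, no raw PII
```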

Investing in an enterprise data governance platform and program is the top data transformation investment. If there is one investment that you are looking to get started with, consider this the first.

2. The Enterprise Connected Data Platform

Transformative applications of data commonly entail connecting data at big data scale from across enterprise domains to generate connected insights and intelligence as data products. Connected datasets, metrics, conformed dimensions, reports, data marts, analytics models, and ML/AI models are all examples of such data products. These data products are applied in cross-domain applications to enable transformative experiences and efficiencies.

The following are common norms for such data products and their applications:

  1. The data required is stored in multiple cross-domain enterprise data sources of variant types.
  2. The data is not conformed in structure and quality across the data sources to enable connecting in direct-from-source forms.
  3. The data sources are not the apt destination to persist the data products.
  4. The data sources are not optimized to serve the scale of compute required by the data products. The exception would be HTAP systems, and most enterprise data sources are not HTAP capable.
  5. The data sources are not optimized to serve the scale of data serving required by the use of the data products.

Teams addressed these challenges by investing in infrastructure optimized for the big data compute and serving of their data products. Such infrastructure can be physical-storage based, in-memory based, or hybrid; all modalities need data to be queried from the data sources, and the physical-storage and hybrid modalities additionally need data to be moved and stored.

The following are the top line challenges and inefficiencies with distributed investments by teams in such infrastructure:

  1. Duplicative investments in services for data ingestion, data standardization (format optimization, data merging, conformance), and data unification.
  2. Duplicative and variant integrations with the Enterprise Data Governance platform.
  3. Data movement to and copies in multiple such infrastructures.
  4. Multiple point-to-point integrations between the infrastructures for data sharing.
  5. Lost opportunities for infrastructure optimization with variant implementations.

Our approach and solution to addressing the requirements and the challenges in scaling enterprise-wide data products, is our Enterprise Connected Data Platform with the following top line capabilities:

  1. Provisioning and managing big data compute and serving optimized storage.
  2. Multi-Tenancy and containerization to host teams across the enterprise, with self-serve capabilities in enabling teams to bring their own data, use their preferred compute, and fully own / operate their data products on the platform.
  3. Integrations with enterprise data publishers, with built-in services to ingest and manage shared enterprise data, by applying standard patterns and contextual optimizations.
  4. Built-in services to standardize and unify data (a sketch follows this list). Data standardization includes capabilities for format optimization, data merging, and conformance to common data models. Data unification automates connecting shared and related enterprise data, without requiring teams to invest in doing so.
  5. Built-in Integration with the Enterprise Data Governance Platform with consistently applied data management and governance.
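
Below is a hedged sketch of what the standardize-and-unify services can look like in Spark. The paths, column mappings, and shared key are illustrative assumptions; the point is that conformance and unification happen once, on the platform, for all consumers.

```python
# A hedged sketch of standardize-and-unify in Spark; paths, column mappings,
# and the shared key are illustrative assumptions. Conformance and unification
# happen once, on the platform, for all consumers.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("standardize-unify").getOrCreate()

crm = spark.read.parquet(
    "abfss://lake@enterprisedata.dfs.core.windows.net/crm/customers")
commerce = spark.read.parquet(
    "abfss://lake@enterprisedata.dfs.core.windows.net/commerce/buyers")

# Standardization: conform variant source columns to the common data model.
crm_conformed = crm.select(
    F.col("cust_id").alias("customer_id"),
    F.lower(F.col("email_addr")).alias("email"),
)
commerce_conformed = commerce.select(
    F.col("buyer_id").alias("customer_id"),
    F.lower(F.col("email")).alias("email"),
)

# Unification: connect the related enterprise data once, on the shared key,
# so consuming teams do not each re-implement this step.
unified = (crm_conformed.unionByName(commerce_conformed)
           .dropDuplicates(["customer_id"]))
unified.write.mode("overwrite").parquet(
    "abfss://lake@enterprisedata.dfs.core.windows.net/shared/customers_unified")
```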

We have migrated about 80% of our prior point big data compute and serving infrastructures (distributed data lakes, data warehouses, operational data stores) to our connected data platform since commencing our data transformation journey, generating material infrastructure efficiencies, reducing data copies, and enabling engineering capacity to be pivoted from duplicative data infrastructure investments to differentiated data applications. We are on a path to having 100% of our big data compute and serving on the platform by the end of this calendar year.

Our enterprise data governance platform and enterprise connected data platform are our core shared enterprise data foundations to scale responsible and transformative data applications.

3. Enterprise Data Compute and Serving

Our goal with compute and serving is to enable teams to be able to use their preferred modalities, while being compliant with enterprise data management and governance requirements.

Our compute modalities are Spark and SQL based, with manifestations in development environments, analytics tools, and ML and AI services. The common practice is for teams to own and manage their preferred compute. All compute is registered, visible, and audited in our enterprise data governance platform.

For data product serving, we are incrementally moving in the direction of serving with, and directly from, our connected data platform wherever possible. There will continue to be good cases to move and serve data products from edge systems for requirements such as mission-critical latencies and offline scenarios. In such cases, our data management standard is to register the edge systems in our enterprise data governance platform for auditing and lineage tracing.

4. Enterprise Engineering Platforms

We have mature systems for engineering excellence that we have honed and applied over several years in developing and operating our software products. Applying these systems to incorporate parity rigors for our data products is an investment that we are currently focused on with our core data foundations in place.

Our top line engineering excellence priorities for our data products include:

  1. Operationalizing Continuous Integration and Deployment pipelines to scale responsible deployments and guard rail against breaking changes (a minimal sketch follows this list).
  2. Change capture and auditing to correlate changes in data such as schema evolutions and data drifts to causal compute.
  3. Operationalizing intelligent observability for our business and data product OKRs (objectives and key results) and service level agreements, to monitor trending and trigger timely corrective actions.
  4. Operationalizing incident management to optimize our mean time to detect, mitigate, root cause, and resolve incidents.
  5. Observability and intelligent automation for responsible infrastructure optimization.
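
As a minimal illustration of the first priority above, below is a sketch of a CI gate that blocks a data product deployment when a schema change would break downstream consumers. The contract format and rules are assumptions for illustration.

```python
# A minimal sketch of a CI guard rail that blocks a data product deployment
# when a schema change would break downstream consumers. The contract format
# and rules are assumptions for illustration.

PUBLISHED_CONTRACT = {"order_id": "string", "amount": "decimal", "order_date": "date"}

def breaking_changes(new_schema: dict) -> list:
    """Removing or retyping a contracted column is breaking; additive columns are allowed."""
    problems = []
    for column, dtype in PUBLISHED_CONTRACT.items():
        if column not in new_schema:
            problems.append(f"removed column: {column}")
        elif new_schema[column] != dtype:
            problems.append(f"retyped column: {column} ({dtype} -> {new_schema[column]})")
    return problems

# In the deployment pipeline: fail the build on any breaking change.
proposed = {"order_id": "string", "amount": "float", "region": "string"}
if problems := breaking_changes(proposed):
    raise SystemExit("Deployment blocked:\n" + "\n".join(problems))
```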

5. Enterprise Data Programs

As mentioned at the outset, the best technical solutions for data transformation will fall short without organizational alignment, people advocacy, and the evolution of applied data practices. In addition to technology, we are also investing in the following enterprise data programs to scale company-wide data awareness and best practices in the responsible use of data:

  1. Enterprise Data Governance Program: to form and train a community of data owners / stewards from across the company in our responsible data management and governance practices using our enterprise data governance platform.
  2. Enterprise Data Education Program: to train our teams in data concepts and practices in fostering a data driven culture.
  3. Responsible AI Program: to train our teams in the principles and practices of Responsible AI applications.


Outcomes Progressed and The Journey Forward

The following are the top line outcomes that we have been able to progress to date in our data transformation journey:

  1. Onboarding 80% of our data estate to our data transformation blueprint to make good progress in standardizing our data management, generate approximately $20M per annum in infrastructure efficiencies, and pivot material engineering capacity from building duplicative data infrastructure to building transformative data applications.
  2. Enabling new and transformative connected data applications in enterprise-wide domains such as customer data management, enterprise measurement systems for our business OKRs, marketing automation, sales intelligence, finance operations, real estate and facilities management, and environmental sustainability.

As for our journey ahead, we have a lot more to get done in the period ahead. Our top line focus areas for 2022 are:

  • To complete onboarding 100% of the Microsoft Digital data estate to our data transformation blueprint.
  • Continuing to harden our engineering and operations excellence for our data products.
  • Enabling teams across the organization to grow and scale their investments in transformative data applications.

Sharing, Learning, and Evolving

We hope that you found this a useful read. We would super appreciate and value hearing your thoughts on our journey. We would also love to connect and share deeper learnings, as well as learn from your journeys and experiences. Drop us a line in the comments and we will get in touch with you. Let us also know if there are any topics introduced here that you would like a deeper dive on, to help us prioritize our next writings.

Data transformation is a vast topic and there is so much that we can learn from each other. Here's to a fantastic 2022 and to sharing, learning, and evolving our data transformation journeys for our causes and organizations!
