Data Privacy & Sovereignty Versus Exposure to AI (Public) Cloud Services
Jérôme Vetillard
Healthcare Innovation Leader | Business Transformation Expert | Leveraging Data & AI for Impactful Change
I often have this long-running conversation with my customers about data privacy, i.e. they want their data to be kept on-premises, versus exposure to AI cloud services that are ready to use, secure and more affordable than on-premises infrastructure for running AI services.
The question is often: “Can my data stay ‘on premises’ while I expose it to your AI cloud services to get new business insights?”
There is a misunderstanding about what “my data stays on premises” actually means.
If we are talking about long-term storage, the data repository, the corporate data lake… yes, it can stay “on-premises”, delivering a data storage function.
If we are talking about increasing the business value of your data estate through AI cloud services, let's say it loud and clear: your data must be transported to our cloud servers so the AI cloud services can compute on it. This means that although the data storage function is still delivered “on-premises”, the data value extraction/creation happens in the cloud.
So, from a technical perspective, we do one of two things:
- We do it “in-memory”, with very temporary storage/caching of your data in our infrastructure (e.g. Azure Cache for Redis), transporting it over the internet back and forth between your “on-premises” environment and the Azure infrastructure (or any other hyperscaler). This internet data transport constrains the performance of the AI cloud services, which have to wait for the data to be copied over the internet before they can compute on it.
- Or we extract from your “data lake” or “data warehouse” a relevant portion of the data, a “data pond” or a “data mart”, to serve as a more lasting data cache (at least for the duration of the whole computation) in Azure datacenters. This design improves the performance of the computation (a minimal sketch of this staging pattern follows below).
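To make the second option concrete, here is a minimal sketch in Python of what such a “data pond” staging step could look like. The connection string, SQL query, column names and container name are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch (not production code) of the "data pond" staging pattern:
# extract a relevant slice of the on-premises data estate and copy it,
# encrypted in transit (HTTPS), to an Azure Blob Storage container that
# serves as a temporary cache for the AI cloud services.
import pandas as pd
import sqlalchemy
from azure.storage.blob import BlobServiceClient

# 1. Extract only the relevant portion ("data pond") from the on-premises warehouse
#    (hypothetical connection string, table and columns).
engine = sqlalchemy.create_engine("postgresql://user:password@onprem-dwh:5432/warehouse")
data_pond = pd.read_sql(
    "SELECT patient_id, lab_result, visit_date FROM labs WHERE visit_date >= '2023-01-01'",
    engine,
)

# 2. Copy it to a blob container in the customer's Azure tenant (encrypted at rest by default).
blob_service = BlobServiceClient.from_connection_string("<azure-storage-connection-string>")
container = blob_service.get_container_client("ai-staging-cache")
container.upload_blob("data_pond.parquet", data_pond.to_parquet(), overwrite=True)

# 3. AI cloud services then read from this cache; delete the blob once the computation ends.
# container.delete_blob("data_pond.parquet")
```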
You can consider this alternative from two perspectives:
- Data Privacy is paramount, and possibly national regulations about data sovereignty prevent you from hosting your data in the cloud. So we need to find "data exposure patterns" that are compliant with these specific regulations.
- There are no hard legal constraints/blockers on data privacy/sovereignty itself, and the need to keep data on premises stems more from "ageing" ways of thinking about data security. While ensuring that you, as a data controller, put the right governance in place to control the work of the data processor, you still have business-case work to do to choose and implement the best option depending on your Data + AI maturity and use case.
Data privacy/sovereignty is paramount
A mandatory legal check is needed on the regulations that govern your data processing capabilities. Can a very temporary cache be considered "data hosting"? Does this apply to all your data, or is there a data segmentation? Beyond regulation, are your data governance and architecture able to "enforce" data segmentation based on existing metadata, and to make sure that you, as a "Data Controller", are able to control (trace, audit...) the work of a "Data Processor"?
To compute the data, we need to read it...
Back to our use case: even if your data needs to stay "on-premises", as stated above, in order to compute/process it we need to read it. Even with best practices ensuring that your data is always encrypted at rest (storage) and in transit (over the internet), at some point it will be exposed in clear to our CPU/GPU/memory when "in use". There are two principal ways to address this issue: federated learning and confidential computing.
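To illustrate why this "in use" exposure is unavoidable with conventional encryption alone, here is a small, purely illustrative Python sketch (using the `cryptography` library, with a made-up payload): the data can rest and travel encrypted, but it has to be decrypted in memory before the processor can do anything useful with it.

```python
# Illustrative only: conventional encryption protects data at rest and in transit,
# but the plaintext must exist in memory the moment we want to compute on it.
from cryptography.fernet import Fernet

key = Fernet.generate_key()                      # key held by the data controller
cipher = Fernet(key)

record = b"patient_id=123;lab_result=4.7"        # hypothetical payload
encrypted = cipher.encrypt(record)               # safe at rest / in transit

# ... the ciphertext is shipped to the compute infrastructure ...

plaintext = cipher.decrypt(encrypted)            # <-- exposure point: clear data in RAM/CPU
result = plaintext.decode().split("lab_result=")[1]  # any computation needs the clear value

# Federated learning avoids shipping the data at all; confidential computing keeps
# this decryption step inside a hardware-protected enclave (TEE).
```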
Federated learning to provide better control over the "Data Processor"
Thinking out of the box: if your data cannot go to the computers, then the computers will go to your data. The first popular scenario is Scenario 1: data stays “on premises” and the computation (AI foundational model) goes on premises too. This is a well-known pattern, “Federated Learning”, often used in healthcare because many stakeholders (data owners, data controllers...) think it provides better control over the data processor, as the data is processed “on your premises”. This means a compute capability must be deployed “on premises” (infrastructure + AI model) and connected to your corporate network to access your data. It is certainly more expensive, complex and lengthy. The projected compute power is also likely to be smaller, and neither as elastic nor as secure as AI cloud services.
Federated learning is primarily used in scenarios where data privacy and security are paramount, such as healthcare, financial services, and edge computing. It allows multiple parties or organizations to collaboratively train a model without sharing their raw data. However, the primary focus is on improving the model, not on transferring data between organizations. Here's how it works (a minimal code sketch follows the list below):
- Decentralized Data: In federated learning, data remains on the devices or servers where it is generated, such as smartphones, IoT devices, or local data centers. Data does not get transferred to a central location.
- Model Training: Instead of sending data to a central server for training, a global machine learning model (foundational model) is sent to the local devices or servers. These devices then perform model training on their local data.
- Model Updates: After training on local data, the local models send only the model updates (e.g., gradients) back to the central server, without sharing the actual data. These updates are aggregated at the central server to improve the global model.
- Privacy Preservation: Federated learning is designed to protect the privacy of individual data sources. Since raw data never leaves the local device, sensitive information remains secure.
- Iterative Process: The process of model updates, aggregation, and re-distribution continues iteratively until the global model converges to a desired level of accuracy.
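As a rough illustration of the loop described above, here is a minimal federated averaging (FedAvg-style) sketch in Python/NumPy. The model, the local datasets and the number of rounds are made-up placeholders, not a reference implementation.

```python
# Minimal FedAvg-style sketch: local training on each site, only model updates
# travel to the central server, raw data never leaves the sites.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical local datasets (e.g. three hospitals); these never leave the sites.
local_datasets = [(rng.normal(size=(100, 5)), rng.normal(size=100)) for _ in range(3)]

global_weights = np.zeros(5)          # simple linear model, shared with every site

def local_update(weights, X, y, lr=0.01, epochs=5):
    """Train locally by gradient descent and return only the updated weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

for round_ in range(10):                              # iterative process
    # 1. The global model is sent to each site; 2. each site trains on its own data.
    local_weights = [local_update(global_weights, X, y) for X, y in local_datasets]
    # 3. Only the updates come back and are aggregated (here: simple averaging).
    global_weights = np.mean(local_weights, axis=0)

print("Aggregated global model:", global_weights)
```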
Azure Stack HCI can provide the required infrastructure to run AI services "on premises", with a "software-defined", ready-to-deploy approach that speeds up the configuration of the projected infrastructure (Azure Stack HCI).
I have been working with some start-ups that use the Federated Learning pattern, and the procurement, deployment, configuration and integration of this projected infrastructure is always complex and time-consuming. A software-defined projected infrastructure speeds up its configuration and integration (edge computing) while providing far better standardization of the process. It can be coupled with a central cloud infrastructure that manages federated learning at scale, implementing all MLOps best practices for the potential certification of the AI algorithms by the FDA or EMA (for healthcare).
Use of Confidential Computing for Confidential AI?
A more advanced scenario is to leverage “Confidential Computing” (Azure Confidential Computing) to protect your data even “in use”. In this Scenario 2, your data stays “on premises”, is copied (encrypted in transit) to a computing enclave in Azure, and is computed there. Confidential Computing ensures that at no point along the process does the “data processor entity” have access to your data (it remains protected “at rest”, “in transit”, and “in use”). You can find an example of how this can be implemented on Azure for healthcare purposes here.
Confidential AI is a set of hardware-based technologies that provide cryptographically verifiable protection of data and models throughout the AI lifecycle, including when data and models are in use. Confidential AI technologies include accelerators such as general purpose CPUs and GPUs that support the creation of Trusted Execution Environments (TEEs), and services that enable data collection, pre-processing, training and deployment of AI models. Confidential AI also provides tools to increase trust, transparency, and accountability in AI deployments.
Azure provides built-in confidential AI capabilities such as "Confidential VMs" and "Confidential containers" on ACI (Azure Container Instances). Confidential VMs based on SEV-SNP (AMD technology) are generally available, while those based on TDX (Intel technology) are still in limited preview. Microsoft and NVIDIA are working closely together to provide confidential GPU VMs based on AMD SEV-SNP and NVIDIA A100 GPUs; those confidential GPU VMs are still in limited preview. To learn more about Confidential AI, you can start here.
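As a purely conceptual sketch of the confidential computing flow (the helper functions below are hypothetical placeholders, not Azure SDK calls), the key idea is that the decryption key is released to the workload only after the enclave has proven, via remote attestation, that it is a genuine, unmodified Trusted Execution Environment:

```python
# Conceptual sketch only: the functions below are hypothetical placeholders,
# not real Azure SDK calls. They illustrate the attestation / secure key release
# flow that confidential computing relies on.

def get_attestation_report(enclave):
    """Hypothetical: ask the TEE hardware for a signed report of its identity/measurements."""
    ...

def verify_report_with_attestation_service(report):
    """Hypothetical: an attestation service validates the hardware report."""
    ...

def release_key_to_enclave(key_vault, attestation_token):
    """Hypothetical: the key vault releases the data key only against a valid attestation token."""
    ...

# 1. The workload starts inside a hardware-protected enclave (TEE).
report = get_attestation_report(enclave="confidential-vm-or-container")

# 2. The attestation service checks the report (correct hardware, correct code, no tampering).
token = verify_report_with_attestation_service(report)

# 3. Only then is the customer's key released into the enclave, where the data is
#    decrypted and processed; outside the enclave it is never in clear.
data_key = release_key_to_enclave(key_vault="customer-managed-key-vault", attestation_token=token)
```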
Regulation is not really an issue... but you know, (bad?) old habits...
If there are no hard legal constraints/blockers on data privacy/sovereignty itself, and the need to keep data on premises stems more from "ageing" ways of thinking about data security, then, while ensuring that you, as a data controller, put the right governance in place to control the work of the data processor, you have a business case to build, comparing:
- Scenario 1: federated learning (see above)
- Scenario 2: confidential computing (see above)
- Scenario 3: A small and relevant portion of your data estate, a “data pond” or a “data mart” is extracted and copied to Azure while being encrypted at rest and in transport. This local data cache is then exposed to AI cloud services for computation.
Although Scenario 3 might be perceived by a few specific customers as "risky", since they move their data to the Azure public cloud, they should not forget that their data is kept encrypted at rest (possibly with a "Bring Your Own Key" approach, meaning the encryption is done on the client/on-premises side with a customer's private key) and encrypted in transit, and that we can still add the "confidential computing" option in this scenario to make sure the data is never exposed to Microsoft in clear, neither "at rest", nor "in transit", nor "in use". Furthermore, Azure complies with many international and national privacy and security regulations, among the most demanding in the world (Azure Compliance). Whenever you store your data in your Azure tenant and process it with Azure AI components that also run in your tenant, you inherit by design all these security and privacy features provided by Azure. So you can develop enterprise-grade AI services, trained on your own datasets, with top-notch security and privacy features.
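To make the "Bring Your Own Key" idea more tangible, here is a minimal, hedged sketch in Python of client-side encryption before upload. The key handling is deliberately simplified (in practice the key would live in an HSM or Azure Key Vault), and the container and blob names are illustrative.

```python
# Minimal BYOK-style sketch: the data is encrypted on-premises with a key the
# customer keeps, so the cloud only ever stores ciphertext. Simplified on purpose;
# real deployments would manage the key in an HSM or Azure Key Vault.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from azure.storage.blob import BlobServiceClient

customer_key = AESGCM.generate_key(bit_length=256)   # stays on-premises, never uploaded
aesgcm = AESGCM(customer_key)

plaintext = b"patient_id=123;lab_result=4.7"          # hypothetical payload
nonce = os.urandom(12)
ciphertext = nonce + aesgcm.encrypt(nonce, plaintext, None)

# Upload only the ciphertext to the Azure tenant (illustrative names).
blob_service = BlobServiceClient.from_connection_string("<azure-storage-connection-string>")
blob_service.get_container_client("ai-staging-cache").upload_blob(
    "record.enc", ciphertext, overwrite=True
)

# Decryption (and hence clear-text exposure) can be confined to a confidential
# computing enclave that receives the key only after attestation (Scenario 2).
```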
Data exposure to AI cloud services is only one part of the game. Ideally, during your Data + AI journey, you will need to put in place robust data governance and a secure data architecture able to expose your data estate securely (through a data catalogue) to your subscribed AI cloud services while remaining compliant with regulation. Beyond the sole "data-related capabilities", you will also need MLOps capabilities to ensure quality training of your algorithms for ethical AI (knowing the datasets used for training, KPIs about AI performance, traceability, auditability, explainability, fairness...). This is the reason why, while Scenarios 1 and 2 remain very valid options for proofs of concept and small-scale adoption of AI, Scenario 3 is the only one able to unleash the power of AI at scale by leveraging all related services such as Azure Data Factory, Azure MLOps, and the Azure AI ecosystem as a whole, including Azure Cognitive Services and, among these, Azure OpenAI Service.
Takeaways
- Whatever your data privacy and sovereignty constraints are, there is most of the time a solution to expose your data to Azure AI Services while staying compliant with these specific regulations. So let's talk about it! You can engage through your Microsoft representative or by commenting on this article.
- Not all of the solutions presented above are compatible with using AI at scale. Eventually, when you are mature enough to implement AI at scale, there will be nothing like the scalability, agility, cost efficiency, security and privacy of Azure AI Services running in your Azure tenant… along with everything needed for MLOps and other complementary cloud services. However, along your AI journey, these solutions can enable experimentation, validate use cases, teach your organization about AI, foster the improvement of your data governance and data architecture, and build a robust Data + AI culture.
- If you have to leverage “Confidential Computing”, not all AI services are “Confidential Computing ready”, delivering “Confidential AI”. So, depending on your use cases, you might need to deploy your own AI algorithms on Confidential VMs or in confidential containers (for example on Azure ACI, or in a confidential-computing-enabled Kubernetes cluster). Azure can provide confidential VMs with CPU, GPU or both, and confidential containers on Azure ACI.
- Eventually, "Federated learning"" is an interesting pattern which poses the fundamental question of the business value of data itself Versus AI Algorithm. Data Controllers often (over)value the business value of their data still thinking about it with the "oil industry analogy": Data is the 21st century oil... while maybe it is another analogy that should be used: Your data is like Citrus, once computed (pressed), the value of it, its juice is transferred to the AI algorithm. For sure you’ll keep your Citrus “on-premises”, but do choose a business model in which you’ll have a fee/reward for the portion of “knowledge” your data has generated. As an example, a company who trains its AI services on your data leveraging “federated learning” could provide these once-trained AI services at a discounted price to your organization (this is one business model among others).