Data Privacy & Sovereignty Versus Exposure to AI (Public) Cloud Services
Jérôme Vetillard
Healthcare Innovation Leader | Business Transformation Expert | Leveraging Data & AI for Impactful Change
I very often have this long-running conversation with my customers about data privacy, i.e. they want their data to be kept on-premises, versus exposure to AI cloud services, which are ready to use, secure, and more affordable than on-premises infrastructure for running AI services.
Often the question is: “Can my data stay ‘on-premises’ when I expose it to your AI cloud services to get new business insights?”
There is a common misunderstanding about what “my data stays on-premises” actually means.
If we are talking about the long-term storage, the data repository, the corporate data lake… yes it can stay “on-premises” delivering a data storage function.
If we are talking about increasing the business value of your data estate through AI cloud services, we need to say it loud and clear: we need to transport your data to our cloud servers in order to process it with those AI cloud services. This means that although the data storage function is still delivered on-premises, the data value extraction/creation is done in the cloud.
From a technical perspective, you can consider this alternative from two angles:
Data Privacy/sovereignty is paramount
There is a mandatory legal check on the regulations that govern your data processing capabilities. Can a very temporary cache be considered “data hosting”? Does this apply to all your data, or is there a data segmentation? Beyond regulation, are your data governance and architecture able to enforce your data segmentation based on existing metadata, and to make sure that you, as a “Data Controller”, are able to control (trace, audit...) the work of a “Data Processor”?
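To make this concrete, here is a minimal, purely illustrative sketch of metadata-driven segmentation enforcement with an audit trail; the sensitivity tags, policy table, and processing purposes are hypothetical, not taken from any specific product.

```python
# Illustrative sketch only: enforcing data-segmentation rules from metadata
# before any record is handed to a "Data Processor". All names (tags,
# policy, purposes) are hypothetical.
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data-controller-audit")

@dataclass
class Record:
    record_id: str
    sensitivity: str   # e.g. "public", "internal", "phi"
    payload: dict

# Policy: which sensitivity tags a given processing purpose may see
POLICY = {
    "cloud-ai-inference": {"public", "internal"},
    "on-prem-analytics": {"public", "internal", "phi"},
}

def release_for_processing(records, purpose):
    """Yield only the records the policy allows, auditing every decision."""
    allowed = POLICY.get(purpose, set())
    for rec in records:
        if rec.sensitivity in allowed:
            audit_log.info("RELEASED %s for %s", rec.record_id, purpose)
            yield rec
        else:
            audit_log.info("BLOCKED %s (%s) for %s",
                           rec.record_id, rec.sensitivity, purpose)

records = [Record("r1", "public", {"x": 1}), Record("r2", "phi", {"x": 2})]
cleared = list(release_for_processing(records, "cloud-ai-inference"))
```

The audit log is what lets you, as the data controller, demonstrate after the fact exactly which records a given processor was allowed to see.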
To compute the data, we need to read it...
Back to our use case: if your data needs to stay “on-premises”, then, as stated above, in order to compute/process it we need to read it. Even with best practices ensuring that your data is always encrypted at rest (storage) and in transit (over the internet), at some point your data will be exposed in clear to our CPU/GPU/memory while “in use”. There are two principal ways to address this issue: federated learning and confidential computing.
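A small sketch may help illustrate the gap: even when data is encrypted at rest and in transit, computing on it requires decrypting it into ordinary memory first. The example below uses the third-party Python "cryptography" package and invented sample data.

```python
# Minimal sketch of why encryption at rest/in transit does not cover
# data "in use": to compute on the data, the process must first decrypt
# it into ordinary memory. Requires: pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, a customer-managed key
f = Fernet(key)

ciphertext = f.encrypt(b"patient_id=42;glucose=5.4")  # "at rest": encrypted
# ... the ciphertext can also travel encrypted "in transit" (e.g. TLS) ...

plaintext = f.decrypt(ciphertext)    # "in use": now exposed in RAM/CPU
result = plaintext.decode().split(";")  # any computation sees clear data
print(result)
```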
Federated learning to provide better control over the "Data Processor"
Thinking outside the box: if your data cannot go to the computers, then the computers will go to your data. The first popular scenario is Scenario 1: data stays “on-premises” and the computation (AI foundation model) goes on-premises too. This is a well-known pattern called “Federated Learning”, which is often used in healthcare because many people (data owners, data controllers...) think it provides better control over the data processor, as it processes the data “on your premises”. This means that a compute capability must be deployed “on-premises” (infrastructure + AI model) and connected to your corporate network to access your data. It can certainly be more expensive, complex, and lengthy. The projected compute power is also likely to be smaller, and neither as elastic nor as secure as AI cloud services.
Federated learning is primarily used in scenarios where data privacy and security are paramount, such as healthcare, financial services, and edge computing. It allows multiple parties or organizations to collaboratively train a model without sharing their raw data. The primary focus is on improving the model, not on transferring data between organizations. Here is, in essence, how it works:
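As a rough illustration, here is a toy federated-averaging (FedAvg) round in Python/NumPy, with synthetic data standing in for three hospitals' private datasets; each site trains locally and only model parameters, never raw records, are sent to the coordinator.

```python
# Toy federated averaging: sites train locally, the coordinator only
# ever sees model weights, never the underlying data.
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, epochs=5):
    """One site's local training: gradient descent on linear regression."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Three "hospitals", each with private data that never leaves the site
true_w = np.array([1.0, -2.0, 0.5])
sites = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    sites.append((X, y))

w_global = np.zeros(3)
for _ in range(10):
    # Each site receives the global model and returns only updated weights
    local_weights = [local_update(w_global, X, y) for X, y in sites]
    # Coordinator averages the weights (weight by site size if unequal)
    w_global = np.mean(local_weights, axis=0)

print("recovered weights:", np.round(w_global, 2))
```

In each round, only the weight vectors travel between sites and coordinator; the raw records stay where they were produced.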
Azure Stack HCI can provide the required infrastructure to run AI services "on-premises", while offering a "software-defined", ready-to-deploy approach that speeds up the configuration of the projected infrastructure (Azure Stack HCI).
I have been working with some start-ups that use the Federated Learning pattern, and it appears that the procurement, deployment, configuration, and integration of this projected infrastructure is always complex and takes (a lot of) time. A software-defined projected infrastructure speeds up its configuration and integration (edge computing) while providing far better standardization of the process. It can be coupled with a central cloud infrastructure that manages federated learning at scale, implementing all MLOps best practices for the potential certification of the AI algorithms by the FDA or EMA (for healthcare).
Use of Confidential Computing for Confidential AI
A more advanced scenario is to leverage "Confidential Computing" (Azure Confidential Computing) to protect your data even "in use". In this Scenario 2, your data stays "on-premises", is copied (encrypted in transit) to a computing enclave in Azure, and is computed there. Confidential computing ensures that at no time along the process does the "data processor entity" have access to your data (the data is and remains protected "at rest", "in transit", and "in use"). You can find an example of how this can be implemented on Azure for healthcare purposes here.
Confidential AI is a set of hardware-based technologies that provide cryptographically verifiable protection of data and models throughout the AI lifecycle, including when data and models are in use. Confidential AI technologies include accelerators such as general purpose CPUs and GPUs that support the creation of Trusted Execution Environments (TEEs), and services that enable data collection, pre-processing, training and deployment of AI models. Confidential AI also provides tools to increase trust, transparency, and accountability in AI deployments.
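To give a feel for the TEE mechanics, here is a simplified, purely illustrative flow in which a decryption key is released only after an attestation "report" is verified. The HMAC-based signing below is a stand-in for real hardware attestation; none of these names correspond to the actual Azure Attestation API.

```python
# Illustrative attestation-gated key release: the data owner hands over
# the decryption key only after verifying a signed measurement proving
# the code runs inside an approved enclave. HMAC stands in for the
# hardware vendor's signing scheme; this is NOT a real TEE API.
import hashlib, hmac, os

EXPECTED_MEASUREMENT = hashlib.sha256(b"approved-enclave-code-v1").hexdigest()
ATTESTATION_SIGNING_KEY = os.urandom(32)  # stand-in for the vendor's key

def enclave_attestation_report(code_identity: bytes):
    """What the TEE hardware would produce: a signed code measurement."""
    measurement = hashlib.sha256(code_identity).hexdigest()
    signature = hmac.new(ATTESTATION_SIGNING_KEY,
                         measurement.encode(), hashlib.sha256).digest()
    return measurement, signature

def release_key_if_attested(measurement: str, signature: bytes, data_key: bytes):
    """Data owner's side: verify the report before releasing the key."""
    expected = hmac.new(ATTESTATION_SIGNING_KEY,
                        measurement.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise PermissionError("attestation signature invalid")
    if measurement != EXPECTED_MEASUREMENT:
        raise PermissionError("unexpected enclave measurement")
    return data_key  # only now can the enclave decrypt and use the data

measurement, sig = enclave_attestation_report(b"approved-enclave-code-v1")
key = release_key_if_attested(measurement, sig, os.urandom(32))
print("key released to attested enclave")
```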
Azure provides some built-in confidential AI capabilities, such as "Confidential VMs" and "Confidential containers" on ACI (Azure Container Instances). Confidential VMs based on SEV-SNP (AMD technology) are generally available, while those based on TDX (Intel technology) are still in limited preview. Microsoft and NVIDIA are working closely together to provide confidential GPU VMs based on AMD SEV-SNP and the NVIDIA A100 GPU; those confidential GPU VMs are still in limited preview. To learn more about Confidential AI you can start here.
Regulation is not really an issue... but you know, (bad?) old habits...
If there are no hard legal constraints/blockers on data privacy/sovereignty itself, and the need to keep data on-premises stems more from "ageing data security" ways of thinking, then, while ensuring that you as a data controller put the right governance in place to control the work of the data processor, you have a business case to build, comparing the scenarios above with a third one, Scenario 3: moving your data directly to the Azure public cloud.
Although Scenario 3 might be felt by a few specific customers as "risky", since they move their data to the Azure public cloud, they should not forget that their data is kept encrypted at rest (possibly with a "Bring Your Own Key" scenario, meaning the encryption is done on the client/on-premises side with the customer's private key) and is transported encrypted... and we can still add the "confidential computing" option in this scenario to make sure the data is never exposed to Microsoft in the clear, neither "at rest", nor "in transit", nor "in use". Furthermore, Azure complies with many international and national privacy and security regulations, among the most demanding in the world (Azure Compliance). Whenever you store your data in your Azure tenant and process it with Azure AI components that also run in your tenant, you inherit by design all the security and privacy features provided by Azure. So you can develop enterprise-grade AI services, trained on your own datasets, with top-notch security and privacy features.
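As a rough sketch of the "Bring Your Own Key" idea, the customer encrypts on-premises with a key they hold before anything is uploaded, so the cloud only ever stores ciphertext. The upload function below is a hypothetical stand-in, not a real SDK call; the example uses the third-party "cryptography" package.

```python
# BYOK-style client-side encryption sketch: data is encrypted on-premises
# with a customer-held key before upload, so the cloud stores only
# ciphertext. Requires: pip install cryptography
from cryptography.fernet import Fernet

customer_key = Fernet.generate_key()  # generated and kept on-premises / in an HSM
cipher = Fernet(customer_key)

def upload_to_cloud(blob_name: str, data: bytes) -> None:
    # Hypothetical stand-in for an SDK blob-upload call; the cloud side
    # receives ciphertext only.
    print(f"uploaded {blob_name}: {len(data)} encrypted bytes")

record = b"patient_id=42;diagnosis=..."
upload_to_cloud("records/42.bin", cipher.encrypt(record))
# Decryption requires customer_key, which never leaves the customer's control.
```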
Data exposure to AI cloud services is only one part of the game. Ideally, during your Data + AI journey, you will need to put in place robust data governance and a secure data architecture able to expose your data estate securely (through a data catalogue) to your subscribed AI cloud services while remaining compliant with regulation. Beyond the purely data-related capabilities, you will also need to put in place the MLOps capabilities that ensure quality training of your algorithms for ethical AI (knowing the datasets used for training, KPIs on AI performance, traceability, auditability, explainability, fairness...). This is the reason why, while Scenarios 1 and 2 are still very valid options for proofs of concept and small-scale adoption of AI, Scenario 3 is the only one able to unleash the power of AI at scale by leveraging all related services, such as Azure Data Factory, Azure MLOps, and the Azure AI ecosystem as a whole, including Azure Cognitive Services and, among these, Azure OpenAI Service.
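To illustrate the traceability/auditability point, here is a minimal, hypothetical sketch of recording model lineage per training run; the file format and field names are illustrative only, not a specific Azure ML feature.

```python
# Hypothetical model-lineage log: record, for every training run, which
# datasets were used (by content hash) and the resulting metrics, so the
# model's provenance can later be audited. Names and values are illustrative.
import hashlib, json, time

def dataset_fingerprint(path: str, content: bytes) -> dict:
    return {"path": path, "sha256": hashlib.sha256(content).hexdigest()}

def log_training_run(model_name, datasets, metrics,
                     registry_file="model_lineage.jsonl"):
    entry = {
        "model": model_name,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "datasets": datasets,   # exactly which data trained this model
        "metrics": metrics,     # KPIs needed for audit / certification
    }
    with open(registry_file, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_training_run(
    "example-model-v1",
    [dataset_fingerprint("cohort_2023.csv", b"...raw file bytes...")],
    {"auroc": 0.91, "calibration_error": 0.03},  # illustrative values
)
```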
Takeaways