Automating data compliance in the Cloud

In the first instalment, we laid down the building blocks of a "dream" data mesh hosted as a service in the Public Cloud.

The main driver behind this proposal was for Cloud customers to minimize the burden of protecting regulated data by leveraging PaaS to maximize automation.

This second and last part will be more "hands-on": we'll introduce a few more concepts and put the different pieces together.

The failure of Open Policy Agents and Rego policies

Let's start by asking: why make a "new" design proposal for a Cloud data mesh?

Because efficient data governance is hardly achievable at the scale of whole information systems: it presupposes a large workforce that is accountable, trained and incentivized to proactively identify regulated data in every corner of the corporate data stores, and that also defines, runs and manages the compliance policies the data should meet.

If that were the case, it would suffice to disseminate a fleet of Open Policy Agents throughout the information system and let a horde of compliance officers configure, deploy and monitor Rego policies. End of story!

In the field, alas, data compliance suffers from the same pitfalls as data quality: because data governance is everybody's business, and because the task of identifying and classifying data is so daunting, everybody deals with it on a best-effort basis (meaning pretty much anything ranging from "do nothing" to "implement some arbitrary, undocumented, non-interoperable Rego policy").

Let's face it: at the end of the day, data compliance in large corporations turns out to be nobody's business.

Agentless compliance

In the model I'm trying to put forward, forget about agents, policies and manual operations. Welcome automation and scalability.

To achieve scalable automation, the 'human factor' is strictly limited to a very small number of knowledgeable people:

  • golden source (or: Master Data) owners, who are the knowledgeable people for data privacy requirements
  • devSecOps security champions, who are the knowledgeable people about what mitigations are installed in their application

Because we don't want to bug these key people with low-value, time-consuming, poorly scalable, unrewarding tasks, we capture regulatory requirements from the former and operational mitigations from the latter in the form of contracts as code (as explained in part 1), and we let automated reasoning do "the bulk" of compliance appraisal.

Appraisal is done with propositional logic: starting from each golden source in the graph of PaaS instances making up the data mesh, all possible transformations are automatically inferred by the Cloud provider and converted into a big vector of boolean constraints.

The business question of being compliant becomes a mathematical matter of satisfiability.

A Satisfiability Modulo Theories (SMT) solver is an unsupervised tool that does just that: it attempts to solve very, very big problems of this kind, in a very, very efficient way.
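
To make the idea of "compliance as satisfiability" concrete, here is a minimal sketch in Python. It is not an SMT solver: it simply enumerates every truth assignment, which only works for a handful of variables, whereas a real solver is built for very large problems. The helper names (`implies`, `is_satisfiable`) and the toy variables are assumptions made for illustration.

```python
from itertools import product

def implies(a, b):
    """Propositional implication: a -> b."""
    return (not a) or b

def is_satisfiable(variables, constraints):
    """Return a satisfying assignment (dict) or None.

    Brute force over 2**len(variables) assignments: fine for a toy,
    hopeless at scale, which is why real solvers exist.
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(check(env) for check in constraints):
            return env
    return None

# Toy question: a table holds PII, and PII must be encrypted.
variables = ["pii", "encrypted"]
constraints = [
    lambda e: e["pii"],                           # the table contains PII
    lambda e: implies(e["pii"], e["encrypted"]),  # PII implies encryption
]
compliant = is_satisfiable(variables, constraints)

# Same question for a service that cannot encrypt at all.
broken = is_satisfiable(variables, constraints + [lambda e: not e["encrypted"]])
```

Here `compliant` holds an assignment in which the table is encrypted, while `broken` is `None`: no assignment can satisfy all the constraints at once, which is what an anomaly looks like.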

Data classification: who said it should be a chore?!

How do we get around the problem of data classification? SMT solvers require a huge quantity of accurate, manually fed metadata; there's no way around it, is there?

The thing is, solvers don't need the usual metadata as long as they're aware of table-level signatures... As explained in part 1, a signature is a high-level representation of the data transformation being performed on a table at some point in the application flow. Such a transformation always takes one (or more) upstream table as an input, and produces one (or more) downstream table as an output.

Okay, but how can the provider produce the much needed signatures?

Here is the magic: being in charge of all storage PaaS instances (databases, caches, object storages, key/vault repositories) of the data mesh, the provider knows exactly how they are queried and by which compute PaaS instance (functions, servers, containers).

Let's look at this in more detail.

Signatures generation

Suppose an online shoe seller has two golden sources: the first is an AWS RDS shoe stock database, the second is a DynamoDB clients repository.

A fully PaaS web site with two golden sources: a shoe stock and a clients repository

A fleet of ECS tasks handles online orders: when a customer wants to check out, the stocks are checked and, based on the customer's references, an order is published to an SNS topic.

Two services subscribe to this topic: an S3 bucket persists the transactions, and a Lambda (or, more likely, a Step Functions workflow) triggers the eventually consistent housekeeping (billing, updating the stocks and the cart, ...).

The upstream tables are queried directly from ECS, so figuring out the signature of the operations is quite straightforward.

Here is what it might look like:

  • a select query is issued by an ECS container to the RDS instance (probably along with many clauses like shoe size, etc.);
  • the customer delivery address and payment options are retrieved by key from DynamoDB by the same container.

The signature of such ECS activity can be expressed as a concatenation of both operations. Since the operations come from different PaaS services, we are faced with what I call a 'Hydra': a dual-headed (or multiple-headed) upstream data source. To reflect that point, I like to use a special operator &() rather than the usual select operator s():

&(s(stock),p(s(client)))

Of course it's only cosmetic: feel free to stick to s() if you're more comfortable.

The important takeaway is this: the signature of ECS activity doesn't care which clauses are passed to the select statement s(stock), nor does it care about the client id or its billing address being projected in p(s(client)). What the signature cares about is that data coming from golden sources 'stock' and 'client' are being manipulated and persisted by ECS tasks.
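
One possible way to represent such signatures in code is as small immutable trees. The sketch below is an assumption about how this could be modelled, not the representation used in part 1: the class `Sig` and the helpers `s`, `p`, `hydra` and `golden_sources` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sig:
    """A signature node: an operator applied to sub-signatures or table names."""
    op: str      # "s" (select), "p" (project), "&" (hydra)
    args: tuple  # Sig instances or golden source names (str)

    def __str__(self):
        return f"{self.op}({','.join(str(a) for a in self.args)})"

def s(x):
    return Sig("s", (x,))

def p(x):
    return Sig("p", (x,))

def hydra(*heads):
    return Sig("&", heads)

def golden_sources(sig):
    """Collect the golden source names found at the leaves of a signature."""
    if isinstance(sig, str):
        return {sig}
    found = set()
    for arg in sig.args:
        found |= golden_sources(arg)
    return found

orders = hydra(s("stock"), p(s("client")))
print(orders)  # &(s(stock),p(s(client)))
sources = golden_sources(orders)
```

Note that nothing in the tree records query clauses or projected columns: only the operators and the golden sources at the leaves survive, which is the whole point of a signature.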

Rethinking stateful computing

Signatures require a little thinking effort, because the usual way to deal with stateful computing is by reasoning about the output of the compute task, not its input(s): we are used to saying that ECS tasks produce inserts into the orders table.

But here, we are not interested in the orders table, because it is not a golden source. From a compliance perspective, it makes much more sense to see it as a downstream table derived from the Hydra's heads. We do not treat orders as first-class citizens: we track orders as &(s(stock),p(s(client)))

So, as far as stateful computing is concerned:

  • we shall not think in terms of mutable operations like create table, insert into, update, replace, merge...
  • we should focus on immutable operations (select and hydra) to keep track of all original data sources.
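
The two rules above can be sketched as follows. This is a deliberately naive illustration: it assumes that every upstream read performed by a compute instance ends up persisted downstream, and the function name `derive_signature` and the shape of the activity log are hypothetical.

```python
def derive_signature(observed_ops):
    """Build a downstream signature string from a compute instance's activity.

    observed_ops: list of (verb, table, read_signature) tuples, e.g.
    ("select", "stock", "s(stock)"). Mutable verbs (insert, update...) are
    ignored: only reads from upstream tables contribute to the signature.
    """
    heads = []
    for verb, _table, read_sig in observed_ops:
        if verb == "select" and read_sig not in heads:
            heads.append(read_sig)
    if not heads:
        return None                      # nothing upstream, nothing to track
    if len(heads) == 1:
        return heads[0]                  # single source: no hydra needed
    return "&(" + ",".join(heads) + ")"  # several sources: a hydra

ecs_activity = [
    ("select", "stock", "s(stock)"),
    ("select", "client", "p(s(client))"),
    ("insert", "orders", None),          # mutable operation, deliberately ignored
]
orders_signature = derive_signature(ecs_activity)
```

The insert into the orders table never appears in the result: the downstream table is known only through the reads that fed it.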


In practice, however, unsupervised calculation of signatures is more complex and may require feeding more insights to the provider, because:

  1. not all queries from upstream tables lead to data persistence in downstream tables;
  2. the ordering of inserts or updates is not always the same as the ordering of queries (transactions are interleaved).

The more a data mesh is abstracted away from its infrastructure, the better: for applications hosted in VMs, the Cloud provider will need the most guidance, whereas for pure functions, signatures generation will be straightforward. Containers kind of sit in-between.

Introducing SATmesh

During my tinkering with automated compliance, I staged a proof of concept based on a very simple SMT solver able to manipulate Boolean constraints: its codename is SATmesh.

Let me walk you through the process of finding compliance gaps with SATmesh. There are two main cases:

  1. unsatisfiable requirements between two golden sources;
  2. unsatisfiable requirements between data owners and security champions.

Unsatisfiable requirements between two golden sources

We carry on with our shoe seller example: in the picture below, we depict the 'stocks' golden source in green and the 'clients' golden source in light blue. The output, the 'orders' table, is depicted in orange. Remember that it is not known as a self-contained table, but as: &(s(stock),p(s(client)))

The tables have 3 geozoning constraints: DE (data must stay in Germany), NL (data must stay in the Netherlands), and EU (data must stay in the European Union).

Black arrows show how tables "flow" downwards from PaaS to PaaS. PaaS instances are depicted in gray.


The clients table is hosted in Spain, in DynamoDB, and is subsequently processed by ECS in France. Since both Spain and France belong to the EU, and since the data owner stated that client data must reside in the EU, no conflict is raised.

The stocks table is hosted in an RDS instance located in Germany. The stocks are processed by ECS in France, and this is a real issue, because the internal SATmesh representation of geozones contains the following constraints:

  • Implies(stocks,DE) as per stocks data owner requirement
  • Implies(stocks,FR) as per ECS security champion requirement
  • DE!=FR as per SMT theory of equality

This set of constraints is NOT satisfiable. SATmesh will raise an anomaly.
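
The conflict can be reproduced with a few lines of Python. This is a brute-force sketch, not SATmesh itself: the helper names are hypothetical, the zones are encoded as plain booleans, DE != FR is modelled as "exactly one of the two holds", and the assertion that stocks data is actually present is an assumption added so that the implications have something to bite on.

```python
from itertools import product

def implies(a, b):
    return (not a) or b

def satisfiable(variables, constraints):
    """True if at least one truth assignment meets every constraint."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(c(env) for c in constraints):
            return True
    return False

variables = ["stocks", "DE", "FR"]
constraints = [
    lambda e: e["stocks"],                    # assumption: stocks data is present
    lambda e: implies(e["stocks"], e["DE"]),  # data owner: stocks stay in Germany
    lambda e: implies(e["stocks"], e["FR"]),  # ECS champion: processing in France
    lambda e: e["DE"] != e["FR"],             # boolean stand-in for DE != FR
]
anomaly = not satisfiable(variables, constraints)

# Dropping the ECS constraint (for instance if the tasks no longer ran in
# France) makes the remaining constraints satisfiable again.
fixed = satisfiable(variables, constraints[:2] + constraints[3:])
```

With all four constraints, `anomaly` is `True`; without the ECS constraint, `fixed` is `True`.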

Unsatisfiable requirements between data owners and security champions

Sometimes, PaaS instances which are on the critical path of some application flow just cannot meet the requirements set by data owners.

Imagine an R&D golden source table (in blue, below) has two compliance constraints: 'IP' meaning that it contains Intellectual Property, and 'PII' meaning it also contains Personally Identifiable Information.


Blue table data are processed sequentially by two PaaS instances (shown in gray), from top to bottom:

  • a first PaaS instance located in France 'FR', enforces client-side encryption at rest 'CS'
  • a second one located in Japan 'JP', enforces server-side encryption at rest 'SS'.

For SATmesh, service-level encryption constraints are:

  • Implies(PII,Or(SS,CS)) meaning that PII must always be encrypted at rest
  • Implies(IP,CS) meaning that IP must always be client-side-encrypted at rest
  • CS is True (for the top PaaS only)
  • SS is True (for the bottom PaaS only)
  • SS != CS

The bottom PaaS is only able to handle SS which makes the above constraints unsatisfiable. Once again, SATmesh will raise an anomaly.
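
Checking each PaaS instance in turn can be sketched like this. Again, this is a brute-force illustration under assumptions, not SATmesh: the function `paas_constraints` is hypothetical, and the constraint stating that the table carries both labels is added so that the implications apply.

```python
from itertools import product

def implies(a, b):
    return (not a) or b

def satisfiable(variables, constraints):
    """True if at least one truth assignment meets every constraint."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(c(env) for c in constraints):
            return True
    return False

def paas_constraints(enforced):
    """Constraints seen by one PaaS instance enforcing 'CS' or 'SS'."""
    return [
        lambda e: e["PII"] and e["IP"],                   # the table carries both labels
        lambda e: implies(e["PII"], e["SS"] or e["CS"]),  # PII encrypted at rest
        lambda e: implies(e["IP"], e["CS"]),              # IP client-side encrypted
        lambda e: e[enforced],                            # what this instance provides
        lambda e: e["SS"] != e["CS"],                     # one scheme per instance
    ]

variables = ["PII", "IP", "SS", "CS"]
top_ok = satisfiable(variables, paas_constraints("CS"))     # France, client-side
bottom_ok = satisfiable(variables, paas_constraints("SS"))  # Japan, server-side
```

The top instance passes (`top_ok` is `True`), while the bottom one does not (`bottom_ok` is `False`): server-side encryption alone cannot honor the IP requirement.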

Remember from part 1 that we always reason at table level, our unit of measure. So, when we say PII is encrypted, what we actually mean is that the whole PII table is encrypted, not only the cells, rows or columns which contain actual PII.

Impact analysis

Resorting to automated reasoning for compliance brings a huge opportunity: one can easily perform impact analysis before implementing design changes over the whole data mesh:

  • upgrade a contract
  • add an application
  • add or remove a security feature in an application
  • connect an existing application to a new data source
  • add a new region to an application
  • add a new regulatory requirement to a region
  • add a new business requirement to a golden source

By careful analysis of what-if scenarios, which are always comprehensive, impact analysis allows for incremental broadening of a compliance scope.
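
A what-if evaluation can be sketched as "copy the design, change one thing, re-run the appraisal". In the sketch below a plain region membership check stands in for the solver, and every name (`ZONES`, `anomalies`, the service labels) is an assumption made for the example; the EU entry only lists the countries used in this article.

```python
# Zone name -> regions it allows (partial EU list, for the example only).
ZONES = {"DE": {"DE"}, "NL": {"NL"}, "EU": {"DE", "NL", "FR", "ES"}}

def anomalies(requirements, flows, placement):
    """List (source, service, region) triples that violate a geozoning requirement.

    requirements: golden source -> zone name
    flows:        golden source -> services the data traverses
    placement:    service -> region where it runs
    """
    found = []
    for source, zone in requirements.items():
        for service in flows[source]:
            region = placement[service]
            if region not in ZONES[zone]:
                found.append((source, service, region))
    return found

requirements = {"stocks": "DE", "clients": "EU"}
flows = {"stocks": ["rds", "ecs"], "clients": ["dynamodb", "ecs"]}
current = {"rds": "DE", "dynamodb": "ES", "ecs": "FR"}

baseline = anomalies(requirements, flows, current)

# What if the ECS tasks were moved to Germany? Evaluate before touching anything.
what_if = dict(current, ecs="DE")
after_change = anomalies(requirements, flows, what_if)
```

The baseline reports the stocks/ECS conflict, and the what-if design reports none, all without deploying anything.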

Privacy Enhancing Technologies (PETs)

Automated reasoning brings a second very useful opportunity: keeping sensitive data exposure under control.

'Data exposure' means that the information flow reaches the confines of the data mesh: some information is about to leave the compliance perimeter.

Depending on the compliance requirements of such exposed data, extra privacy-preserving processing is likely to be required. Typically this kind of processing comes in three flavors, commonly referred to as 'PETs':

  • homomorphic encryption: when an untrusted third party must perform searches or calculations on our data
  • in-memory encryption (aka confidential computing): when we must operate off-premises, at some untrusted third party's
  • anonymization: when we must share data or collaborate with untrusted third parties

We may use tools like SATmesh to deal with the problem of knowing whether the right PET treatment is systematically enforced at exposure endpoints (where sensitive private data cross the Information system boundary).
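
As a rough illustration of such a check, the sketch below flags exposure endpoints whose lineage includes a sensitive golden source but where no PET is applied. It uses plain set operations rather than a solver, and the endpoint names, the `SENSITIVE` set and the function `unprotected_exposures` are all hypothetical.

```python
SENSITIVE = {"client"}  # golden sources with privacy requirements (assumed)
PETS = {"homomorphic_encryption", "confidential_computing", "anonymization"}

def unprotected_exposures(endpoints):
    """Return names of exposure endpoints leaking sensitive lineage without a PET.

    endpoints: name -> (set of golden sources in the table's lineage,
                        set of treatments applied before data leaves)
    """
    leaks = []
    for name, (lineage, treatments) in endpoints.items():
        carries_sensitive = bool(lineage & SENSITIVE)
        has_pet = bool(treatments & PETS)
        if carries_sensitive and not has_pet:
            leaks.append(name)
    return sorted(leaks)

endpoints = {
    "partner_api":   ({"stock", "client"}, {"anonymization"}),
    "analytics_csv": ({"stock", "client"}, set()),
    "stock_feed":    ({"stock"}, set()),
}
leaks = unprotected_exposures(endpoints)
```

Only `analytics_csv` is reported: it carries client lineage and applies no PET, whereas `stock_feed` carries nothing sensitive and `partner_api` is anonymized.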

Let me attempt to articulate a principle:

The further away the data are from their golden source, the higher the risk of a data breach.

Automated reasoning keeps track of data lineage and data privacy state, from source to distribution endpoints.

Relaxing constraints

Maintaining a very large number of satisfiable constraints is an extremely powerful way of enforcing compliance by using "what it takes" to make sure no weak spot is missed, as expected from sound risk management practices. But sometimes, the bar is raised too high: what we gain through automation, we lose by imposing costly or business-impairing constraints that don't always make sense.

This is especially true when we reason at low resolution as we do with tables.

Look at these two queries:

  1. count the number of orders in the orders database;
  2. calculate the number of sales per customer in the orders database.

If, as we argued, the orders database has signature &(s(stock),p(s(client))), both queries will result in the same downstream signature:

a(s(&(s(stock),p(s(client)))))

Obviously, the first query is completely harmless, but the second one may cause privacy issues.

A stylish workaround is to perform an ad hoc, pinpoint assessment of both operations: if the first one is confirmed to be innocuous, we can use an automated reasoning technique called E-matching to perform a very simple string substitution of all instances of a(s(&(s(stock),p(s(client))))) when (and only when) they relate to the first query.

If we replace a(s(&(s(stock),p(s(client))))) by a new golden source 'g':

  • a(s(&(s(stock),p(s(client))))), being now g, is excluded from all paths related to golden sources 'stock' and 'client';
  • g becomes subject to a new, dedicated data ownership governance because it is a new golden source. Since g is innocuous, the governance, although grand-sounding, is likely to be some kind of one-shot record in a book.
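
The effect of the substitution on lineage can be sketched with signatures encoded as nested tuples. This is plain structural replacement, a much simpler stand-in for E-matching, and it leaves to the caller the decision of which occurrences relate to the innocuous query; the names `substitute`, `golden_sources` and `report` are hypothetical.

```python
def substitute(sig, pattern, replacement):
    """Replace every occurrence of `pattern` inside `sig` by `replacement`.

    Signatures are nested tuples (operator, arg, ...); leaves are table names.
    """
    if sig == pattern:
        return replacement
    if isinstance(sig, str):
        return sig
    op, *args = sig
    return (op, *(substitute(a, pattern, replacement) for a in args))

def golden_sources(sig):
    """Collect the golden source names found at the leaves of a signature."""
    if isinstance(sig, str):
        return {sig}
    return set().union(*(golden_sources(a) for a in sig[1:]))

orders = ("&", ("s", "stock"), ("p", ("s", "client")))
count_orders = ("a", ("s", orders))  # a(s(&(s(stock),p(s(client)))))
report = ("p", ("s", count_orders))  # hypothetical downstream table

relaxed = substitute(report, count_orders, "g")
before = golden_sources(report)
after = golden_sources(relaxed)
```

Before the substitution, the report traces back to 'stock' and 'client'; afterwards it traces back to 'g' only, so the constraints attached to the two original golden sources no longer propagate along that path.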

This explains why it is important to keep track of all the cumbersome signatures: who knows whether, at some point in the future, we will have to substitute part of a signature with a new innocuous golden source.

In practice, signatures represent the data lineage of any data table at any point in the datamesh.

The hidden structure of the PaaS datamesh

Finally, what could this dream of a managed datamesh look like?

Last summer I shared a bird's-eye view of its most notable features. It was based on a would-be implementation in Azure:

  • The data and control planes of compute and database PaaS like CosmosDB or AKS would need to be significantly refactored to accommodate seamless and scalable integration into a global, overarching mesh;
  • The mesh control plane could be centrally managed from AAD. This would require extra capabilities to tackle things like contracts and compliance vectors management and authorizations. Remember that the unit of measure is the table, and that tables are not native Azure RBAC entities;
  • A single pane of glass would be needed to visualize data flows, compliance appraisals, impact analysis, drawing arrows, showing signatures…
  • Azure Resource Graph or Microsoft Graph would need powerful extensions to query all facets of the data mesh. Kusto style.

A would-be managed datamesh in Azure

Wrapping up

Is this Cloud-managed datamesh ever going to be implemented? This is far from sure, but I believe it is the duty of Cloud customers to voice their needs. We need to rebalance the relationship between Cloud customers and Cloud providers.

Providers don't have a monopoly on strategic thinking: what works best is synergy and common understanding.

Even if all this remains at the stage of a dream, I hope to have shared some re-usable design patterns along the way. I won't repeat the ones covered in part 1; let me recap only the ones we've seen today:

  1. In large, data-hungry corporations, agentless compliance ought to scale better than OPA due to the challenge of establishing actionable holistic data governance;
  2. Data compliance automation requires focusing on a few key knowledgeable people: golden source owners (for privacy requirements) and security champions (for local, service-level mitigations);
  3. Data compliance automation relies on three technical pillars: graph exploration, data lineage and SMT constraint solving;
  4. The most important benefit of data compliance automation is that it doesn't require extensive data classification;
  5. Data lineage is generated thanks to signatures, an out-of-the-box way of thinking: signatures work by tracking immutable operations (select) performed on golden sources along an optimal subgraph rather than accounting for all mutable operations (create, insert...) performed on arbitrary sources over the whole graph;
  6. (Opportunity 1/3) Data lineage unlocks automated impact analysis, a highly desirable feature in complex information systems;
  7. (Opportunity 2/3) Starting with the most stringent levels of assurance, data lineage unlocks cautious constraints relaxation, in line with risk management principles;
  8. (Opportunity 3/3) Data lineage lets us keep control over sensitive data exposure by checking that proper PETs are implemented at the right places before a privacy breach occurs.


