Data mesh, compliance and the Cloud: a hopeless combo?
Datamesh as a service

I had a dream of a data mesh: a data mesh as a service, compliant with my many #privacy regulations, where the control plane was managed by a single actor (my Cloud provider), and where my only contribution to building the data plane was limited to "drawing arrows" between my PaaS producers and my PaaS consumers.

In that dream, I specified which high-level transformers I wanted applied at the end of each arrow: arrows defined the edges of a directed graph, and PaaS instances defined the vertices where transformers were applied.
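As a toy sketch (the service names and structure below are my own assumptions, not an actual provider API), the whole setup boils down to a directed graph with transformers attached to each arrow:

```python
# Hypothetical sketch: the mesh as a directed graph. Keys are arrows
# (producer -> consumer); values are the high-level transformers applied
# at the consumer's end of the arrow. All names are made up.
mesh = {
    ("crm_paas", "reporting_paas"): ["select", "aggregate"],
    ("iot_paas", "reporting_paas"): ["project"],
}

for (producer, consumer), transformers in mesh.items():
    print(f"{producer} -> {consumer}: {' then '.join(transformers)}")
```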

Believe it or not, that's pretty much all our #Cloud provider would need to get us up and running with our data mesh!

From graph flows...

The unit of measure for all information traversing this graph is the data table - it could have been a dataset, a row or even a single cell, but I feel for now that the added complexity of a smaller grain is not worth the pain. You will shortly see why.

Using tables doesn't mean that the underlying data ought to be structured: they can take any shape. Likewise, the exchange protocol between table producers and table consumers is not restricted to SQL: it can be an event-driven message queue, a microservice REST call, whatever...

Based on this 'table' unit of measure, the high-level transformers I was referring to in my dream are the usual table operators: s (select), a (aggregate), p (project), e (extract), j (join), u (union)... What's important to understand is that the PaaS instances which execute these operations act as table filters: unary operators (like 'select') filter one table among the many circulating in the mesh, while binary operators (like 'join') filter two.

As a result, a PaaS instance sitting at a vertex of our graph may be seen as a stop valve which opens only when a specific table (or pair of tables, in the case of binary operators) is matched by the local vertex filter(s).
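As a toy illustration (class and variable names are assumptions for this sketch, not an actual provider feature), a vertex filter could look like this:

```python
# Hypothetical sketch of a vertex 'stop valve'. Names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class VertexFilter:
    accepted: frozenset            # names of the table(s) this vertex consumes

    def opens_for(self, *tables: str) -> bool:
        """The stop valve opens only when every incoming table matches."""
        return all(t in self.accepted for t in tables)

select_filter = VertexFilter(frozenset({"A"}))       # unary: filters one table
join_filter = VertexFilter(frozenset({"A", "B"}))    # binary: filters two

print(select_filter.opens_for("A"))      # True: the valve opens
print(join_filter.opens_for("A", "C"))   # False: 'C' is not matched
```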

...to real-life application flows

Now that the graph is set up as pipework with a bunch of valves, the datamesh service provider (not us, mind you!) is able to figure out and maintain an accurate representation of all our application flows.

Since we are dealing with a flow of tables (so to speak...), it makes sense to separate upstream tables from downstream tables, depending on whether a table is being consumed or produced by a given PaaS instance.

The interesting thing about table flows is that they are a tiny subset of all possible paths: in a fully connected graph, the number of paths grows exponentially with the number of vertices, and if we account for loops, it can even be infinite. But if we reason at our unit of measure, with data motion strictly controlled by filters, we don't face the combinatorial explosion that occurs in general directed graphs.

That's because, as we said, the flows are a faithful representation of the actual application flows taking place in our information system, and applications are never exponentially meshed (thank God!).

Contracts

In a real #PaaS service, filters usually mix and match, of course: if you select 's()' and aggregate 'a()' data from upstream table 'A', the resulting downstream table will be the composition of s and a, which we write a(s(A)). We don't care about the detailed operations performed by 'a()' or 's()' because we only reason at the level of tables.
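In code terms (a deliberately trivial sketch, with operator symbols of my own choosing), a signature is just a composition of symbols:

```python
# Hypothetical sketch: signatures as compositions of operator symbols.
# The detailed predicates inside s() or a() are deliberately ignored:
# we only reason at the level of tables.
def s(table: str) -> str: return f"s({table})"   # select
def a(table: str) -> str: return f"a({table})"   # aggregate

print(a(s("A")))   # "a(s(A))": the signature of the downstream table
```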

We reach a crucial point: the signature a(s(A)) is enough to capture the very useful notion of a contract - an agreement between a PaaS producer and a PaaS consumer.

The consumer commits on two axes:

  • scope: she restricts herself to act only upon upstream table 'A' (her table filter);
  • ops: she restricts herself to execute only a(s(.)).

These are duties, but a contract also entails rights. In our datamesh, rights are compliance criteria the service owner imposes upon its service: typically, this would be geozoning, but it could be much, much more sophisticated.


Contract automation (contract as code)

A great property of such contracts is that they are easy for customers to set up through configuration. No paperwork, of course. And no coding:

  1. for the "rights" part of the contract: actual signatures are automatically inferred (and updated) by the managed service by looking up at the actual queries created (or modified) by PaaS consumers;
  2. for the "duties" part, compliance criteria would be represented as boolean vectors.

Ideally, the provider exposes an API to the different parties involved in the contract to let them move through the contract workflow using infra-as-code.
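As a sketch of what such a contract-as-code payload could look like (field names and the constraint dictionary are assumptions, not a real provider API):

```python
# Hypothetical contract-as-code payload. Everything here is illustrative.
from dataclasses import dataclass, field

# The provider's dictionary of supported compliance constraints: each one
# owns a fixed coordinate in the Boolean compliance vector.
CONSTRAINT_INDEX = {"geozoning_eu": 0, "encryption_at_rest": 1}

@dataclass
class Contract:
    version: int
    scope: tuple        # upstream table(s), e.g. ("A",)
    signature: str      # inferred ops, e.g. "a(s(.))"
    compliance: list = field(default_factory=lambda: [False] * len(CONSTRAINT_INDEX))

    def require(self, constraint: str) -> None:
        """Flip the coordinate of a constraint the owner imposes (a 'right')."""
        self.compliance[CONSTRAINT_INDEX[constraint]] = True
        self.version += 1          # any change bumps the contract version

contract = Contract(version=1, scope=("A",), signature="a(s(.))")
contract.require("geozoning_eu")
print(contract.version, contract.compliance)   # 2 [True, False]
```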

Example

Wake up ChatGPT, we need you please!

(Figure: a Kusto query corresponding to the high-level transformer a(s(StormData)).)

  • Our upstream table 'A' is 'StormData';
  • the select is 'StormData | where Season == "Summer" or Season == "Winter"';
  • the aggregate is '| summarize count() by Region = bin(Latitude, 10)'.
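Assembling the fragments above (a reconstruction of the query shown in the figure):

```
StormData
| where Season == "Summer" or Season == "Winter"
| summarize count() by Region = bin(Latitude, 10)
```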

So the inferred filter signature of the detailed consumer operations is: a(s(StormData)). This is the first part of the contract.

Now for the second part. A list of compliance criteria is even easier to implement as code: the provider maintains a dictionary of supported constraints as Boolean vectors, as we said above. Customers pick the ones they're interested in. We'll get back to them momentarily.

Contract lifecycle

If either party calls for a change, a new version of the contract must be agreed upon:

  • (first part of the contract): should the Latitude bins change to span 15 degrees instead of 10, the contract version would change but the signature would remain strictly the same;
  • (second part of the contract): should a new compliance criterion be added (e.g. encryption at rest), the contract version would change as well.

Once again, a Cloud provider API would enable easy handling of contract versioning.

Solving the data governance nightmare

There is, however, an issue: the data producer is generally not the legitimate owner of the data stored in table 'A', because chances are, table 'A' itself comes from an upstream series of transformations. In fact, if 'A' is the result of many binary transformations, it is likely to have many, many owners. Shouldn't they have their say in contract terms?

These 'ultimate owners' play a special role in all organizations: they only produce downstream tables - they are not beholden to any other producers. Their tables, seemingly grown out of thin air, are called golden sources.


(Figure: Datamesh and the Cloud.)

Golden source #datagovernance is a headache when everybody expects data owners to have data accountability over the whole datamesh. But rejoice! Contracts are an elegant way to ease the life of golden source owners. Remember when we said that contracts entail not only duties, but rights as well? As long as owners are able to take the immutable constraints coming from their Business Line or from their Regulatory bodies and translate them into a list of compliance constraints, all they have to do is attach them to their own local contracts and mark them as inheritable: this way, data accountability will stop at their doorstep.

Well... How so?

Unsupervised compliance... across the whole mesh!

This is where the really interesting part comes in... To make sure that the golden source compliance criteria are properly inherited throughout the application flow, we leverage the immense benefit brought by the Cloud service provider acting as a central, single point of service management.

Let's look behind the scenes:

  • a 'compliance vector', made of Boolean coordinates, is attached to every golden source. These vectors will follow the tables throughout their transforming journey across the datamesh;
  • the compliance vectors start out zeroed;
  • they are OR'ed with the inheritable Boolean Business and Regulatory constraints specified in golden source contracts (OR will flip Boolean coordinates to 'one' where appropriate);
  • all possible flows golden sources may take in the graph are traversed. These are not all possible paths, only the ones where the stop valves are opened for the golden source(s) under inspection;
  • at each step, we check whether the compliance constraints defined in the local contract and the ones carried by the compliance vector of the table are satisfiable. A Satisfiability Modulo Theories (SMT) solver may be used to automate this task;
  • if the local filter involves more than one table (i.e. if binary operators are invoked), we merge all compliance vectors into a fresh one for the downstream table. This merger is carried out by OR-ing the upstream tables' vectors (see the sketch after this list).
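Here is a minimal sketch of these steps, assuming Z3 as the SMT solver and made-up constraint names (nothing below is an actual provider feature):

```python
# Hypothetical sketch of compliance-vector propagation. The constraint names
# and contracts are made up; Z3 stands in for "an SMT solver".
# pip install z3-solver
from z3 import Bool, Not, Solver, sat

CONSTRAINTS = ["geozoning_eu", "encryption_at_rest"]

def merge(*vectors):
    """Binary operators: OR the upstream vectors, coordinate by coordinate."""
    return [any(bits) for bits in zip(*vectors)]

def satisfiable(vector, local_clauses):
    """Check the constraints carried by a table against the local contract."""
    b = {name: Bool(name) for name in CONSTRAINTS}
    solver = Solver()
    for name, inherited in zip(CONSTRAINTS, vector):
        if inherited:              # constraints inherited by the table...
            solver.add(b[name])
    for clause in local_clauses:   # ...must coexist with the local clauses
        solver.add(clause(b))
    return solver.check() == sat

# golden source 'A': zeroed vector, OR'ed with its inheritable constraints
vec_A = merge([False, False], [True, False])   # 'A' requires geozoning_eu

print(satisfiable(vec_A, [lambda b: b["geozoning_eu"]]))        # True
print(satisfiable(vec_A, [lambda b: Not(b["geozoning_eu"])]))   # False: anomaly
```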


Cloud architecture at 'Smithy, Inc'

Let's take the example of 'Smithy, Inc':

This company has three golden sources: Aluminum 'A', Brass 'B' and Copper 'C'. The data mesh is made of 5 PaaS services, labelled I to V.

In the diagram below, the filters are superimposed on arrows going from producer to consumer services.

(Figure: Smithy Inc data mesh.)


Say we want to investigate the whereabouts of Aluminum 'A'.

The PaaS filters operating on 'A' would simplify the graph and keep only 3 transitions:

  • I -> II
  • I -> III
  • II -> V

One transition not involving ‘A’ would be excluded:

  • IV -> II

We would then attach a compliance vector to ‘A’ and explore all paths in the subgraph made of the 3 permitted transitions. Checking satisfiability at each step of this process can easily be automated.
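For instance, a naive traversal (a toy sketch with the permitted transitions hard-coded from the list above) yields the two flows of ‘A’:

```python
# Toy sketch: enumerate the flows of 'A' over the permitted transitions only.
edges = [("I", "II"), ("I", "III"), ("II", "V")]   # valves open for 'A'

def flows(node, path=()):
    path = path + (node,)
    nexts = [dst for (src, dst) in edges if src == node]
    if not nexts:              # dead end: a complete flow
        yield path
    for dst in nexts:
        yield from flows(dst, path)

print(list(flows("I")))        # [('I', 'II', 'V'), ('I', 'III')]
```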

The discovery of an unsatisfiable condition would raise a compliance anomaly to the owners of ‘A’.

Observe that many tables involve more than just ‘A’: for example, in service V, all three golden sources are involved. In this case, the downstream table of V would carry a compliance vector made of A OR B OR C.
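Concretely (reusing the toy constraint coordinates from the sketch above):

```python
# Toy sketch: at service V, the downstream vector is the coordinate-wise OR
# of the vectors carried by 'A', 'B' and 'C'.
vec_A = [True, False]    # e.g. 'A' requires geozoning_eu
vec_B = [False, True]    # e.g. 'B' requires encryption_at_rest
vec_C = [False, False]   # 'C' carries no inheritable constraint

vec_V = [any(bits) for bits in zip(vec_A, vec_B, vec_C)]
print(vec_V)             # [True, True]: V inherits both constraints
```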


Constraints propagation and automated compliance

The compliance vector behaves a lot like entropy: as we keep OR-ing in more satisfiable constraints, the norm of the vector can only grow.
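Taking the norm to be the number of 'one' coordinates (my assumption for this sketch), the monotonicity is easy to check:

```python
# Toy check: OR-ing vectors never decreases the number of set coordinates.
def norm(v):
    return sum(v)          # count of 'one' coordinates

v1, v2 = [True, False, False], [False, True, False]
merged = [x or y for x, y in zip(v1, v2)]
assert norm(merged) >= max(norm(v1), norm(v2))
print(norm(v1), norm(v2), norm(merged))   # 1 1 2
```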

Entropy is the key feature that propagates the rights of upstream data owners: as we will see in part 2, an ever-increasing norm guarantees that the accountability of golden source owners actually stops at their doorstep.

We will also see how the various building blocks fit together to make a solid base for automated compliance.

We will discuss a bunch of important features:

  • automated #dataclassification
  • automated compliance appraisal (introducing SATmesh)
  • data lineage
  • impact analysis and what-if scenarios
  • data exposure
  • orchestrated constraints relaxation
  • #privacyengineering and Privacy Enhancing Technologies


Wrapping it up

For now, here is a recap of the architecture and design patterns in my dream of a data mesh:

  1. Control plane operating model: a data mesh is a PaaS like any other; it is managed as-a-service by a Cloud provider;
  2. Data plane operating model (1/2): customers choose the services they need from their provider's PaaS catalog, then "draw arrows" between them. The provider natively and seamlessly integrates all these PaaS instances to make up the data plane of the mesh;
  3. Data plane operating model (2/2): customers define all the PaaS transformations required by their business logic in the form of SQL statements, NoSQL queries, etc. The provider automatically infers signatures from such transformations;
  4. Applications flow: the unit of measure flowing through the mesh is the data table. PaaS instances involved in the mesh consume upstream tables and produce downstream ones according to the above mentioned transformations;
  5. Data governance (1/2): inheritable compliance constraints are set by golden source owners and written as code in contracts managed by the Cloud provider. Constraints are then serialized into a vector of Booleans which gets propagated and enriched for each data table as it flows through the mesh;
  6. Data governance (2/2): contracts also contain the signatures of the transformations that each local PaaS instance abides by;
  7. Automated compliance and anomaly detection are performed by unsupervised satisfiability solvers.

