Data and artificial intelligence (4th part)

Data is at the heart of the proper functioning of machine learning models, deep learning systems, LLMs, RAG pipelines, and so on. No model, none, can understand our world without first going through a training phase.

Some models can learn on their own from the data they are given; others need humans to label the data beforehand. Either way, it's invariable: an AI model is nothing without the data it learns from.

So it's easy to see that the quality of what the model learns depends on the quality of the data it learns from. In one sentence, we've said it all! And that's where the problem appears.

If I feed my model poor-quality data, it will predict or generate poor-quality results! It's not that difficult to understand.

So how do you go about it? In fact, everything is already in place; nothing is new. We just need to apply the best practices of data governance. Yes, indeed: deploying AI tools in production without data governance is as dangerous as driving without ever having passed your driving test!

Data governance has three facets: knowledge (i.e. the data catalog), the quality of the data used, and finally its compliance.

So whether the data feeds an AI or a dashboard, the stakes are the same.

First and foremost, knowledge. If you don't know which data feeds your AI models, you've got it all wrong. Or, to be more precise, you run the risk of using unsuitable data. So the first step is to map and catalog the data used by your models. Graph modeling is often used to connect the data, the algorithms that use it, and the people in charge.
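
As a minimal sketch of such a catalog graph, here is what it might look like with the networkx library; the dataset, model, and owner names are purely illustrative assumptions.

```python
import networkx as nx

# Minimal data-catalog graph: datasets, the models that consume them,
# and the people accountable for each node (all names hypothetical).
catalog = nx.DiGraph()

# Datasets and their owners
catalog.add_node("customer_transactions", kind="dataset", owner="Alice")
catalog.add_node("support_tickets", kind="dataset", owner="Bob")

# Models and their stewards
catalog.add_node("churn_model", kind="model", owner="Carol")

# Edges capture lineage: which dataset feeds which model
catalog.add_edge("customer_transactions", "churn_model", relation="feeds")
catalog.add_edge("support_tickets", "churn_model", relation="feeds")

# Impact analysis: if a dataset's quality degrades, which models are affected?
affected = list(catalog.successors("customer_transactions"))
print(affected)  # ['churn_model']
```

The point of the graph structure is exactly this kind of traversal: from any dataset, you can immediately see the models it feeds and the people to notify.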

The second step is quality. Second indeed, because how can you measure the quality of data you haven't first referenced? So measure, evaluate, quantify the non-quality. Just because we're used to hearing at the coffee machine that a given dataset is wrong doesn't mean it really is. And if it is, in what proportion? Is it still usable? You can't improve what you haven't measured. Once it's measured, look for the root causes of the non-quality: there's no point in correcting the data stock if you haven't plugged the leak first! At this stage, we assess whether the data can be used to feed algorithms, and we inform users of the actual state of its quality.
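
To make "quantify the non-quality" concrete, here is a small pandas sketch; the table, column names, and validity rules are assumptions chosen for illustration, not a prescribed standard.

```python
import pandas as pd

# Illustrative customer table with deliberate defects.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, None],
    "email": ["a@x.com", None, "not-an-email", "d@x.com", "e@x.com"],
    "age": [34, -5, 41, 27, 230],
})

# Measure non-quality instead of guessing at it over coffee.
completeness = df.notna().mean()              # share of non-missing values per column
valid_email = df["email"].str.contains("@", na=False).mean()
valid_age = df["age"].between(0, 120).mean()  # plausible age range

print(completeness)
print(f"valid emails: {valid_email:.0%}, plausible ages: {valid_age:.0%}")
```

With numbers like these in hand, you can answer "in what proportion?" and decide whether the dataset is still usable for a given model.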

And third, compliance. Doesn't it shock you to feed an algorithm with data you have no right to use? Whether for GDPR compliance, for ethical reasons, or for AI Act compliance, the data used by AI must be compliant, with no loopholes.
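
One way to make "no loopholes" concrete is a gate that blocks any dataset lacking a documented legal basis from reaching a production model. This is a hypothetical sketch: the DatasetRecord fields and the fit_for_training rules are assumptions meant to illustrate the idea, not a legal checklist.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical compliance record attached to each catalogued dataset.
@dataclass
class DatasetRecord:
    name: str
    legal_basis: Optional[str]  # e.g. "consent", "contract", or None if undocumented
    contains_pii: bool
    anonymized: bool

def fit_for_training(ds: DatasetRecord) -> bool:
    """Gate a dataset before it may feed a production model."""
    if ds.legal_basis is None:
        return False  # no documented right to use the data
    if ds.contains_pii and not ds.anonymized:
        return False  # personal data must be anonymized first
    return True

datasets = [
    DatasetRecord("customer_transactions", "contract", contains_pii=True, anonymized=True),
    DatasetRecord("scraped_profiles", None, contains_pii=True, anonymized=False),
]
for ds in datasets:
    print(ds.name, "->", "OK" if fit_for_training(ds) else "blocked")
```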

In short, prior to any production launch, the data used by artificial intelligence must be catalogued, its quality measured, and its compliance validated.

I did say "prior to any production launch" deliberately. It is acceptable for data scientists to run some tests in sandbox mode on anonymized data, just "to see". But be warned: before going into production, the data governance behind the AI must be rigorously scrutinized.

From the point of view of corporate responsibilities, there is considerable overlap between the role of the person in charge of data governance and that of the person in charge of Artificial Intelligence governance. It is logical, then, that in some organizations the same person takes on both responsibilities.

#data #ai #aigovernance #dataquality #datagovernance
