Does quantity matter? A few words on (the importance of) data quality for machine learning models
ceforai - ai ethics & data governance
#datagovernance #ai #workshops #itconsulting #datamanagement #aistrategy #aigovernance
Much has been said and written about the data used to train and apply machine learning and deep learning models. Nevertheless, it is often believed that the more data used for training, validation and testing, the better. This is partly true: the more good-quality data, the better for the model being trained. However, the amount of data alone will not determine the effectiveness or accuracy of a given model, and a large amount of poor-quality data will certainly not improve those indicators. More important, especially in areas that "touch" various human spheres, is the quality of the data, which is difficult to define unambiguously, because QUALITY depends on the purpose for which we want to apply machine learning or deep learning mechanisms.
The influence of data quality on how models perform is unquestionable, and the popular maxim "Garbage In, Garbage Out" applies not only in the context of the ethics of "artificial intelligence", such as algorithmic bias, but also to less "sophisticated" solutions. Data quality is crucial for systems that use a process of "learning" to acquire new "knowledge" from externally supplied data, which is then translated into concrete results, such as the predictions or recommendations that the model makes available to the user.
V. N. Gudivada et al. indicate in their paper that outliers in the training dataset can cause instability or a lack of convergence in the learning process. Incomplete, inconsistent and missing data can lead to a drastic deterioration in prediction scores and, going a little further, cause predictions to be inaccurate and recommendations to be misleading. In other words, the model will be good for nothing, and it certainly will not fulfil the purpose for which it was created.
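To make the outlier problem concrete, here is a minimal sketch of one common screening step before training: flagging values that sit far from the rest of the distribution. The z-score threshold and the sample data are illustrative assumptions, not taken from the paper cited above; real pipelines use more robust methods.

```python
import statistics

def flag_outliers(values, z_threshold=3.0):
    """Flag values more than z_threshold standard deviations from the mean.

    A simple z-score screen: one of many ways to surface the kind of
    outliers that can destabilize training, as described above.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

# Five plausible readings and one corrupted sensor value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 250.0]
print(flag_outliers(readings, z_threshold=2.0))  # flags only 250.0
```

Whether a flagged value is removed, corrected or kept is a separate decision; the point is that it is surfaced for review rather than silently fed into training.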
[It is important to remember that even the most advanced and sophisticated algorithm will be helpless if you feed it poor-quality data.]
A problem often ignored in the context of so-called "artificial intelligence systems" is the lack of a proper understanding of the needs, of the problems the model is supposed to solve, or of the data necessary to create an effective tool. Organisations relatively often assume that, having huge amounts of data and potential for automation, they can "instantly" create solutions that will support every process and solve the most difficult problems. Hiring data scientists often seems to be a remedy for all ills, and the results of their work are expected to transfer almost immediately to the department responsible for creating a machine learning model or a related solution.
In practice, however, identifying WHY we want to develop a model and WHAT PROBLEM it is supposed to solve is crucial before moving on to the next steps, which include, among other things, identifying the data that can be used to train it. If we skip these steps, we also fail to assess whether the data "fits" the application, and it becomes impossible to reliably assess the learning process and the effectiveness of the model. These steps must precede data collection, because only then will we know what we need.
Fundamentally, however, once we know what problem we are solving, we can think about what data we will need to achieve that goal. For a model to be 'well' [however broad that term is] trained, we need data with the right characteristics, such as relevance, representativeness, completeness and accuracy, and the data should be as error-free as possible [whether the data is erroneous should itself be subject to additional evaluation]. In an ideal world, we would also like data free of so-called algorithmic [human?] bias; however, the data we use is most often created by humans [less often by other algorithms and models, but even then indirectly by humans], and these biases are usually transferred to the model being trained.
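Some of the characteristics listed above can be checked mechanically before any training starts. The sketch below measures one of them, completeness, over a list of records; the record layout and field names are invented for illustration, and a field counts as missing when it is absent or None.

```python
def completeness_report(records, required_fields):
    """Report the share of records missing each required field.

    'records' is a list of dicts. A high missing share for a field is a
    signal to investigate the source before using the data for training.
    """
    total = len(records)
    report = {}
    for field in required_fields:
        missing = sum(1 for r in records if r.get(field) is None)
        report[field] = missing / total
    return report

# Illustrative records: one missing age, one missing income.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
]
print(completeness_report(rows, ["age", "income"]))
```

Similar one-line checks can be written for range validity or duplicate detection; the value lies in running them systematically, not in their sophistication.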
The quality of the data is therefore of paramount importance, but ensuring it requires a great deal of commitment from the many units responsible for the solutions created, especially if we are talking about ethical [responsible] artificial intelligence that is supposed to be free of bias. We won't always succeed, but it is always worth trying.
So if we have identified the specific data [and its characteristics], it only remains to acquire it. Sometimes the data we already have in our organization turns out to be perfectly suitable; sometimes we need to reach out to external sources [intermediaries] to obtain it. Each of these approaches, however, carries certain limitations and risks that we should consider when developing a new solution based on 'artificial intelligence'.
In the first case, just because we have the data does not mean we are free to use it for our new idea: much will depend on what consent users have given and whether it is possible to "harness" them [the data, not the users] for this particular case. The data may also contain various errors, in particular being incomplete or having been modified by the organization itself. In the second case, we need to be sure that the data comes from a reliable source and reflects the reality we are trying to capture, and that it has not been altered or enriched in a way that makes it useless. It is also important to clarify the conditions for obtaining and storing the data, as well as for its use after a certain event, such as the termination of a contract. Will we still be able to use it in such a situation? Finally, remember that a good model is constantly learning, so access to data should, in principle, be uninterrupted.
If we have THIS data, we still need to sit down and examine it: check that it is indeed relevant and accurate, add appropriate annotations or labels, and check how 'safe' it is. Skipping this step is a straight road to failure, whether that means the model simply won't work or that it works incorrectly, e.g. by discriminating against certain groups or failing to recognize what we are looking for (image recognition). At this stage, interaction is needed between data scientists, those responsible for machine learning itself, and others whose influence on the data may be beneficial or necessary (including business units, compliance or data protection). In another post, I will try to describe how an organizational structure can be arranged to effectively manage the implementation of machine learning and deep learning-based solutions.
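One cheap check during the labeling stage is the label distribution itself: if a class is barely represented, the trained model is unlikely to handle it well, which is one route to the discrimination and recognition failures mentioned above. A minimal sketch, with an invented warning threshold and toy labels:

```python
from collections import Counter

def class_balance(labels, warn_below=0.1):
    """Summarize the label distribution and flag under-represented classes.

    Returns (shares, flagged): the per-class share of the dataset, and
    the classes whose share falls below the warning threshold.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    flagged = [label for label, share in shares.items() if share < warn_below]
    return shares, flagged

# Toy labeled dataset: 'dog' is severely under-represented.
labels = ["cat"] * 95 + ["dog"] * 5
shares, flagged = class_balance(labels)
print(shares, flagged)
```

A flagged class is a prompt for a human decision: collect more examples, rebalance, or document the limitation, rather than train and hope.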
Such a process can take quite a long time, but completing it POSITIVELY can be one component of success, although the algorithm itself will also matter. If the data is "good", or at least that is what we think, we can use it for training. After validation and testing, however, we may find that our assumptions have "crashed" into reality and adjustments need to be made. This is normal. It is worth sitting down again and thinking about what went wrong: maybe the labels were wrong, or maybe we did not match the data properly?
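The training/validation/testing loop above relies on the data being split before training. A minimal sketch of such a split, with a fixed random seed so that the "sit down again and think about what went wrong" step can re-examine exactly the same split; the fractions are illustrative defaults, not a recommendation.

```python
import random

def split_dataset(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle and split data into train/validation/test subsets.

    A deterministic seed makes the split reproducible, so a failed
    evaluation can be traced back to the exact examples involved.
    """
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Keeping the test set untouched until the end is what makes the "crash with reality" an honest one rather than an artifact of tuning on the same examples.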
Ensuring data quality should be an ongoing process, especially if we use algorithms and models that are constantly learning. For this reason, human supervision and the detection of errors that the model may generate will also be important. How we do this, of course, depends on the organization in question.
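One simple form such ongoing supervision can take is a drift check: comparing incoming data against the data the model was trained on and alerting a human when they diverge. The sketch below is deliberately crude, comparing only means with an invented tolerance; real monitoring compares full distributions, but even this catches gross shifts that warrant review.

```python
import statistics

def mean_shift_alert(baseline, current, tolerance=0.25):
    """Alert when incoming data drifts from the training-time baseline.

    Returns True when the mean of 'current' deviates from the mean of
    'baseline' by more than the given relative tolerance.
    """
    base_mean = statistics.mean(baseline)
    curr_mean = statistics.mean(current)
    if base_mean == 0:
        return abs(curr_mean) > tolerance
    return abs(curr_mean - base_mean) / abs(base_mean) > tolerance

# Illustrative feature values: the population the model now sees has shifted.
training_ages = [30, 35, 40, 45, 50]
incoming_ages = [60, 65, 70, 75, 80]
print(mean_shift_alert(training_ages, incoming_ages))
```

An alert here does not decide anything by itself; it routes the case to the human supervision the paragraph above calls for.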
That's it in a nutshell. I have not touched on many of the more detailed themes here, because this is not the time or place, although issues such as data-driven culture, risk management systems and organizational structure are also important in this context. But more about that on another occasion. For today we conclude our thoughts and encourage you to discuss and to contact us if you are interested in this topic.