How to apply Machine Learning in case of limited data set?

Because I get this question from several clients, I decided to investigate and write this post to share the information with those who are interested, hoping to start a discussion and maybe gain some new insights into this challenging topic.

So, let’s start answering the question: "How do you make analyses with data from a limited number of machines via Artificial Intelligence?"

In order to give a good answer to this question, I have to explain a number of things first. There are three aspects that an AI team needs to consider to make an ML model work properly:

  1. Step 1: Complexity of the problem
  2. Step 2: Model complexity (model selection)
  3. Step 3: Dataset size needed for model performance

In an AI project, these three aspects must be thoroughly investigated. The team (which includes the AI expert and the customer's domain expert) is responsible for the execution of each of these steps. Together they define a clear problem, identify the model and decide whether more data is needed.

Step 1: Complexity of the problem

In most cases, the problem description is not clear enough and the team itself must define the exact problem. The team should define the problem as simply as possible and describe the business challenge. This is the first step, and a good start always makes a good ending!

Step 2: Model complexity

The complexity of the model can be derived from the number of weights (An) to be set during the training phase. If the model contains a large number of weights, it is more flexible and can learn more. In that case, more data is needed to train the model (simplified view: Model = A0 + A1·X1 + A2·X2 + A3·X3 + ... + An·Xn).

Let's look at some examples:

  • A deep learning model with 3 inputs and 1 hidden layer of ten neurons has about 50 weights.
  • A decision tree with a depth of 4 has 15 weights (the thresholds are considered weights).
  • A linear regression model for 3 inputs has 4 weights.
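As a rough sanity check, these counts can be reproduced with a few lines of plain Python. This is only a minimal sketch: it assumes a fully connected network with bias terms and a full binary decision tree, which matches the three examples above.

```python
# Rough weight/parameter counts for the three example models above.

def mlp_weights(n_inputs: int, n_hidden: int, n_outputs: int = 1) -> int:
    """Fully connected network with one hidden layer, including bias terms."""
    return (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)

def tree_thresholds(depth: int) -> int:
    """Internal (split) nodes of a full binary tree of the given depth."""
    return 2 ** depth - 1

def linear_regression_weights(n_inputs: int) -> int:
    """One coefficient per input plus the intercept."""
    return n_inputs + 1

print(mlp_weights(3, 10))             # 51 -> "about 50 weights"
print(tree_thresholds(4))             # 15 thresholds
print(linear_regression_weights(3))   # 4 weights
```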

An AI expert should have some knowledge of how the different algorithms they use actually work. If you know how an algorithm works, you have some feeling for the number of weights and the complexity of the algorithm. That way, no time is wasted feeding a small dataset into an overly complex model.

Choice of algorithm also plays an important role

The choice of algorithm can be based on the type of problem and the required performance, but also on the available training time and the available data. There are algorithms that need a lot of data, but also some that show good performance with little data (see picture).
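Cross-validation is one way to make such a comparison concrete. The sketch below is only an illustration, assuming scikit-learn and a small synthetic dataset in place of real machine data; the chosen models are placeholders.

```python
# Compare a few algorithms on a deliberately small dataset with repeated cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for a small dataset from a limited number of machines.
X, y = make_classification(n_samples=100, n_features=8, n_informative=4, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```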

Step 3: Dataset size (sample size)

This is an important step in the design of an ML model. From experience, there are cases where the sample size is too small, but the team still wants to develop a high-performing model. In some cases, several hundred samples are sufficient for model training, while in others the model cannot even learn from thousands of samples.

How much data is needed?

It is important to determine how much data is enough for the selected ML model and the problem. Finding the right sample size is a challenging problem. No one can say in advance how many samples it takes to train a model without thoroughly analyzing the data; this is based on experience. However, it is very important to have a feeling for the right range. There are a number of tips and rules of thumb for estimating the correct number of samples.

The number of weights in the model, which represents the model complexity, should not exceed the number of samples. Think of a simple linear regression problem with 10 weights (9 features) while you have only 8 samples. From a mathematical point of view, there are then infinitely many linear models that pass through those points. Experience has shown that the number of samples should be about ten times greater than the number of features. This means that for training a regression model with 10 features we need at least 100 samples. It is good to remember that the number ten is based on experience; there is no proof for it, since it depends on the characteristics of the data. For example, if the data have a linear characteristic, we can train the model with fewer samples. Conversely, if all the samples are very similar, we need more of them to obtain a robust model. There are also other parameters that contribute to the correct number of samples. However, taking the number of weights in the model into account already gives a hint about the approximate number of samples needed.
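The "ten samples per feature" rule of thumb is easy to turn into a quick check. This is only a sketch; the factor of ten is heuristic (as noted above) and the numbers used here are made up for illustration.

```python
# Heuristic sample-size check: roughly 10 samples per feature (rule of thumb only).
def minimum_samples(n_features: int, factor: int = 10) -> int:
    return factor * n_features

n_features = 10    # e.g. the regression example above
n_available = 80   # hypothetical number of collected samples

needed = minimum_samples(n_features)
print(f"Need roughly {needed} samples, have {n_available}:",
      "OK" if n_available >= needed else "collect more data or simplify the model")
```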

Moreover, this also has to do with the stochastic nature of ML. The behavior and performance of many machine learning algorithms are described as stochastic. Stochastic refers to a variable process whose outcome involves some randomness and uncertainty. It is a mathematical term that is closely related to 'random' and 'probabilistic' and can be contrasted with the idea of 'deterministic'.

The stochastic nature of machine learning algorithms is an important, fundamental concept and needs to be understood in order to interpret the behavior of many predictive models effectively.
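A quick way to see this stochastic behavior, sketched here with a synthetic dataset, is to train the same model several times with different random seeds and compare the resulting test accuracies; on small datasets the spread is typically noticeable.

```python
# Same model, same data, different random seeds -> slightly different results.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for seed in range(5):
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"seed={seed}: accuracy={acc:.3f}")
```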

Different scenarios in case of a small dataset

Based on the status of each parameter, we have 4 different scenarios in the case of a small dataset. If the number of samples is very small, it's better to think about a rule-based solution rather than training an ML model. In general, we do not use an ML model for fewer than 50 samples. The advice is to collect more data (e.g. by implementing an IoT project).
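That 50-sample threshold is a rule of thumb, but it can be encoded as a simple guard in a pipeline. This is a hypothetical sketch; the threshold and the rule-based fallback are placeholders.

```python
# Guard: fall back to a hand-written rule when the dataset is too small to train on.
MIN_SAMPLES_FOR_ML = 50  # rule of thumb from the text, not a hard law

def choose_approach(n_samples: int) -> str:
    if n_samples < MIN_SAMPLES_FOR_ML:
        return "rule-based solution (and start collecting more data, e.g. via an IoT project)"
    return "train an ML model"

print(choose_approach(30))   # rule-based solution ...
print(choose_approach(500))  # train an ML model
```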

These are the 4 possibilities in case of a small dataset:

  1. Model complexity = low, problem complexity = low: This is the only case where the team can get a meaningful result even though it does not have sufficient data. If the problem is not complex, the advice is to choose a simple model and train it with the available dataset.
  2. Model complexity = high, problem complexity = low: The team should reduce the model complexity by choosing a simpler model. Even if the team gets higher performance from a very complex model, a simple model works better on unseen data. The reason is that the evaluation is based only on a small amount of data, which is not reliable, and by tuning the parameters of the complex model the team is likely to overfit it. The simple model will be more robust in practice (see the sketch after this list).
  3. Model complexity = low, problem complexity = high: The team should split the problem into sub-problems and address a simpler version of the problem. That way, the team can still achieve a meaningful result.
  4. Model complexity = high, problem complexity = high: Even if the team gets high accuracy, the model will fail in a real situation when it is confronted with unseen data. In this case, the advice is to collect more data (e.g. data augmentation or implementation of an IoT project).
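The overfitting risk in scenario 2 can be illustrated with a small experiment. This is only a sketch with synthetic data: an unconstrained decision tree fits the tiny training set perfectly, but its cross-validated accuracy is typically no better, and often worse, than that of a much shallower tree.

```python
# Complex vs simple model on a small dataset: training accuracy vs cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Small, noisy synthetic dataset standing in for limited machine data.
X, y = make_classification(n_samples=60, n_features=10, n_informative=3,
                           flip_y=0.1, random_state=0)

candidates = [
    ("deep tree (complex)", DecisionTreeClassifier(random_state=0)),
    ("depth-2 tree (simple)", DecisionTreeClassifier(max_depth=2, random_state=0)),
]

for name, model in candidates:
    train_acc = model.fit(X, y).score(X, y)            # accuracy on the data it was trained on
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # estimate of performance on unseen data
    print(f"{name}: train={train_acc:.2f}, cross-val={cv_acc:.2f}")
```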

In case of a large sample size (a luxury problem): Having enough samples is key to creating a good ML model. It should be noted that the samples should not be biased; otherwise, there is not much difference between a small and a large sample size.

  1. Model complexity = Low, Problem complexity = Low: Model selection is not a big issue with a simple problem and a large sample size. Most ML models return a good result.
  2. Model complexity = High, Problem complexity = Low: A complex model may return a slightly better result than a simpler model. Considering the explainability and maintenance of the model, you may still choose the simpler model.
  3. Model complexity = Low, Problem complexity = High: If you love working with state-of-the-art methods, this is a suitable case for you. Since there is enough data, pick a more complex model (e.g. deep learning) and you will get a desirable result.
  4. Model complexity = High, Problem complexity = High: You can see the power and beauty of Machine Learning in this case. Magic happens here.

Summary: in case of a limited dataset, start a research process based on the currently available data and try a number of high-performing algorithms, such as random forest. Investigate, in collaboration with an AI expert, the model performance on the available dataset from the limited number of machines. The result can be plotted as a line plot with the sample size of the training dataset on the x-axis and the model performance (accuracy) on the y-axis. This gives an idea of how the sample size of the dataset affects the performance of the model on a specific problem. This graph is called a learning curve (see picture). From this graph, we may be able to project the amount of data needed to develop a competent model. I strongly recommend this approach for developing robust AI/ML models.
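Scikit-learn can generate exactly this kind of learning curve. The sketch below uses a synthetic dataset as a stand-in for the machine data and a random forest as the example algorithm, and plots mean cross-validated accuracy against training-set size.

```python
# Learning curve: model performance as a function of training-set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real dataset from the limited number of machines.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 8))

plt.plot(train_sizes, test_scores.mean(axis=1), marker="o", label="cross-validated accuracy")
plt.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="training accuracy")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.title("Learning curve")
plt.legend()
plt.show()
```

If the cross-validated curve is still rising at the largest training size, that is a sign that collecting more data is likely to pay off; if it has flattened, more data alone will probably not help much.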

If you want to know more or would like to discuss, please leave a message.
