How to apply Machine Learning with a limited data set?
Because I get this question from several clients, I decided to investigate and write this post to share what I found with those who are interested, hoping to start a discussion and perhaps gain some new insights into this challenging topic.
So, let’s start answering the question: "How do you run analyses with data from a limited number of machines using Artificial Intelligence?"
In order to give a good answer to this question, I first have to explain a few things. There are three aspects that an AI team needs to consider to make an ML model work properly: the complexity of the problem, the complexity of the model, and the size of the dataset.
In an AI project, these three aspects must be thoroughly investigated. The team (which includes an AI expert and the customer's domain expert) is responsible for the definition and execution of each of these steps. Together they define a clear problem, identify a suitable model, and decide whether more data is needed.
Step 1: Complexity of the problem
In most cases, the problem description is not clear enough, and the team itself must define the exact problem. The team should state the problem as simply as possible and describe the business challenge. This is the first step, and a good start makes for a good ending!
Step 2: Model complexity
The complexity of the model can be estimated from the number of weights (A0 … An) to be fitted during the training phase. A model with a large number of weights is more flexible and can learn more, but it also needs more data to train (simplified view: Model = A0 + A1·X1 + A2·X2 + A3·X3 + ... + An·Xn).
Let me illustrate with an example:
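As a rough sketch (the feature count and the network shape are my own illustrative assumptions, not from any specific project), here is how quickly the weight count grows from a linear model to a small neural network:

```python
# Minimal sketch: comparing the number of trainable weights (parameters)
# of a simple linear model versus a small neural network.
# Assumption: 10 input features, chosen purely for illustration.

n_features = 10

# Linear model: one weight per feature plus the intercept A0
linear_weights = n_features + 1  # 11

# Small fully connected network: 10 -> 32 -> 16 -> 1
# Each layer has (inputs * outputs) weights plus one bias per output.
layer_sizes = [n_features, 32, 16, 1]
nn_weights = sum(
    inp * out + out  # weights + biases
    for inp, out in zip(layer_sizes[:-1], layer_sizes[1:])
)

print(f"Linear model:  {linear_weights} weights")  # 11
print(f"Small network: {nn_weights} weights")      # 897
```

Even this tiny network has roughly 80 times more weights than the linear model, so it needs correspondingly more data to train.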
An AI expert should have some knowledge of the different algorithms they use. If you know how an algorithm works, you have a feeling for its number of weights and its complexity, and you will not waste time feeding a small dataset into a complex model.
Choice of algorithm also plays an important role
The choice of algorithm can be based on the type of problem and the required performance, but also on the available training time and the available data. Some algorithms need a lot of data, while others show good performance with little data.
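As a hedged illustration of this point, the snippet below compares a very simple algorithm with a more flexible one on a small synthetic dataset; the dataset and model choices are assumptions for demonstration only:

```python
# Sketch: comparing algorithms on a small dataset (60 samples).
# The synthetic data and the two models are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=60, n_features=8, random_state=42)

for name, model in [
    ("Naive Bayes (few parameters)", GaussianNB()),
    ("Random forest (more flexible)", RandomForestClassifier(random_state=42)),
]:
    # 5-fold cross-validation gives a fairer picture on so few samples
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.2f} (+/- {scores.std():.2f})")
```

On a dataset this small, cross-validated scores matter more than a single train/test split, because any one split can be misleading.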
Step 3: Dataset size (sample size)
This is an important step in the design of an ML model. From experience, there are cases where the sample size is too small but the team still wants to develop a high-performing model. In some cases, several hundred samples are sufficient to train a model, while in others the model cannot learn even from thousands of samples.
How much data is needed?
It is important to consider how much data is enough for the selected ML model and the problem at hand. Choosing the right sample size is a challenging problem: no one can say in advance how many samples it takes to train a model without thoroughly analyzing the data, and much of this comes down to experience. Still, it is very important to have a feeling for the right range, and there are a number of tips and rules of thumb for estimating the required number of samples.
The number of weights in the model, which represents the model's complexity, should not exceed the number of samples. Think of a simple linear regression problem with 10 weights (9 features) for which you have only 8 samples: from a mathematical point of view, there are infinitely many linear models that pass through those points. Experience has shown that the number of samples should be about ten times the number of features. This means that to train a regression model with 10 features, we need at least 100 samples. Keep in mind that the factor of ten is based on experience, not on proof, since it depends on the characteristics of the data. For example, if the data has a linear character, we can train the model with fewer samples; conversely, if all the samples are very similar, we need more of them to obtain a robust model. Other parameters also contribute to the correct number of samples, but taking the number of weights in the model into account already gives a hint of roughly how many samples are needed.
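The underdetermined case described above can be made concrete with a small sketch; the sizes below (8 samples, 10 features) mirror the example, and everything else is illustrative:

```python
# Sketch of the underdetermined case: fewer samples than weights means
# infinitely many exact solutions. Sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 8, 10  # 8 samples, 10 features -> 11 weights incl. intercept
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Add an intercept column so the model matches A0 + A1·X1 + ... + An·Xn
X_aug = np.hstack([np.ones((n_samples, 1)), X])

# np.linalg.lstsq returns one of the infinitely many weight vectors
# that fit the 8 samples exactly (the minimum-norm solution).
w, _, rank, _ = np.linalg.lstsq(X_aug, y, rcond=None)

print("rank of design matrix:", rank)                      # 8, fewer than 11 weights
print("max training residual:", np.abs(X_aug @ w - y).max())  # ~0: a perfect fit

# Rule of thumb from the text: ~10 samples per feature
print("suggested minimum samples:", 10 * n_features)       # 100
```

The perfect fit on the training samples is exactly the danger: it tells us nothing about how the model behaves on new data.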
Moreover, this also has to do with the stochastic nature of ML. The behavior and performance of many machine learning algorithms are described as stochastic. Stochastic refers to a process whose outcome involves some randomness and therefore carries some uncertainty. It is a mathematical term closely related to 'random' and 'probabilistic', and it can be contrasted with 'deterministic'.
The stochastic nature of machine learning algorithms is a fundamental concept in machine learning, and it needs to be understood in order to interpret the behavior of many predictive models correctly.
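A minimal sketch of this stochasticity, assuming a synthetic dataset and a random forest purely for illustration: the same algorithm on the same data yields slightly different scores across random seeds.

```python
# Sketch: the same algorithm on the same data gives slightly different
# accuracies depending on random seeds (splits and model randomness).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print("accuracies across seeds:", [f"{s:.2f}" for s in scores])
```

The smaller the dataset, the wider this spread tends to be, which is one more reason to be careful with conclusions drawn from few samples.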
Different scenarios in case of a small dataset
Based on the status of each parameter, there are four different scenarios in the case of a small dataset. If the number of samples is very small, it is better to think about a rules-based solution rather than training an ML model; in general, we do not use an ML model with fewer than 50 samples. The advice then is to collect more data (e.g. by implementing an IoT project).
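As a minimal sketch of this decision rule (the sensor names and thresholds are hypothetical), a rules-based fallback might look like this:

```python
# Hedged sketch: below ~50 samples, prefer a hand-written rule over a
# trained model. The sensor thresholds below are hypothetical examples
# of what a domain expert might encode instead of an ML model.
def predict_overheating(temperature_c: float, vibration_mm_s: float) -> bool:
    """Rule a domain expert might write instead of training a model."""
    return temperature_c > 80.0 or vibration_mm_s > 7.1

def choose_approach(n_samples: int) -> str:
    return "rules-based solution" if n_samples < 50 else "consider training an ML model"

print(choose_approach(30))   # rules-based solution
print(choose_approach(500))  # consider training an ML model
```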
These are the four possibilities in the case of a small dataset:
In the case of a large sample size, which is a luxury problem: having enough samples is key to creating a good ML model. Note, however, that the samples must not be biased; otherwise, there is not much difference between a small and a large sample size.
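As a small illustration of such a bias check (the column name and counts are hypothetical), one could inspect the target distribution before trusting a large sample:

```python
# Sketch: a quick bias check on the target distribution before trusting
# a large sample. The "label" column and counts are hypothetical.
import pandas as pd

df = pd.DataFrame({"label": ["ok"] * 950 + ["failure"] * 50})  # illustrative data

print(df["label"].value_counts(normalize=True))
# ok         0.95
# failure    0.05  -> heavily imbalanced: a large but biased sample
```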
Summary: in the case of a limited dataset, start a research process based on the currently available data and try a number of high-performing algorithms, such as random forest. Investigate, in collaboration with an AI expert, the model performance on the available dataset from the limited number of machines. The result can be plotted as a line plot with the sample size of the training dataset on the x-axis and the model performance (accuracy) on the y-axis. This gives an idea of how the sample size of the dataset affects the performance of the model on the specific problem. This graph is called a learning curve (a sketch follows below). From this graph, we may be able to project the amount of data needed to develop a competent model. I strongly recommend this approach for developing robust AI/ML models.
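A minimal sketch of this learning-curve approach, assuming a synthetic dataset and scikit-learn's learning_curve; in practice you would substitute your own machine data:

```python
# Sketch: plotting a learning curve to see how sample size affects
# model performance. Dataset and model are illustrative assumptions.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

train_sizes, _, test_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8),
)

plt.plot(train_sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("Training set size")
plt.ylabel("Cross-validated accuracy")
plt.title("Learning curve: does more data still help?")
plt.show()
```

If the curve is still rising on the right-hand side, more data is likely to help; if it has flattened, collecting more samples is probably not the bottleneck.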
If you want to know more or to discuss this, please leave a message.