How to apply Machine Learning in case of limited data set?

Because I get this question from several clients, I decided to investigate and write this post to share the information with those who are interested, hoping to start a discussion and maybe gain some new insights into this challenging topic.

So, let’s start answering the question: "How do you make analyses with data from a limited number of machines via Artificial Intelligence?"

In order to give a good answer to this question, I have to explain a number of things first. There are three aspects that an AI team needs to consider to make an ML model work properly:

  1. Step 1: Complexity of the problem
  2. Step 2: Model complexity (model selection)
  3. Step 3: Dataset size needed for model performance

In an AI project, these three aspects must be thoroughly investigated. The team (which includes the AI expert and the customer's domain expert) is responsible for the execution of each of these steps. Together they define a clear problem, identify the model and decide whether more data is needed.

Step 1: Complexity of the problem

In most cases, the problem description is not clear enough and the team itself must define the exact problem. The team should define the problem as simply as possible and describe the business challenge. This is the first step, and a good start always makes a good ending!

Step 2: Model complexity

The complexity of the model can be derived from the number of weights (An) to be set during the training phase. If the model contains a large number of weights, it is more flexible and can learn more. In that case, more data is needed to train the model (simplified view: Model = A0 + A1·X1 + A2·X2 + A3·X3 + ... + An·Xn).

Let's look at some examples:

  • A deep learning model with 3 inputs and 1 hidden layer of ten neurons has about 50 weights.
  • A decision tree with a depth of 4 has 15 weights (the thresholds are considered weights).
  • A linear regression model for 3 inputs has 4 weights.
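As a rough sanity check, these counts can be reproduced with a few lines of plain Python. This is only a minimal sketch: it assumes a fully connected network with bias terms and a full binary decision tree, which matches the three examples above.

```python
# Rough weight/parameter counts for the three example models above.

def mlp_weights(n_inputs: int, n_hidden: int, n_outputs: int = 1) -> int:
    """Fully connected network with one hidden layer, including bias terms."""
    return (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)

def tree_thresholds(depth: int) -> int:
    """Internal (split) nodes of a full binary tree of the given depth."""
    return 2 ** depth - 1

def linear_regression_weights(n_inputs: int) -> int:
    """One coefficient per input plus the intercept."""
    return n_inputs + 1

print(mlp_weights(3, 10))             # 51 -> "about 50 weights"
print(tree_thresholds(4))             # 15 thresholds
print(linear_regression_weights(3))   # 4 weights
```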

An AI expert should have some knowledge of how the different algorithms they use actually work. If you know how an algorithm works, you have some feeling for the number of weights and the complexity of the algorithm. That way, no time is wasted feeding a small dataset into an overly complex model.

Choice of algorithm also plays an important role

The choice of algorithm can be based on the type of problem and the required performance, but also on the available training time and the available data. There are algorithms that need a lot of data, but also some that show good performance with little data (see picture).
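Cross-validation is one way to make such a comparison concrete. The sketch below is only an illustration, assuming scikit-learn and a small synthetic dataset in place of real machine data; the chosen models are placeholders.

```python
# Compare a few algorithms on a deliberately small dataset with repeated cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for a small dataset from a limited number of machines.
X, y = make_classification(n_samples=100, n_features=8, n_informative=4, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```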

Step 3: Dataset size (sample size)

This is an important step in the design of an ML model. From experience, there are cases where the sample size is too small, but the team still wants to develop a high-performing model. In some cases, several hundred samples are sufficient for model training, while in others the model cannot even learn from thousands of samples.

How much data is needed?

It is important to determine how much data is enough for the selected ML model and the problem. Finding the right sample size is a challenging problem. No one can say in advance how many samples it takes to train a model without thoroughly analyzing the data; this is based on experience. However, it is very important to have a feeling for the right range. There are a number of tips and rules of thumb for estimating the correct number of samples.

The number of weights in the model, which represents the model complexity, should not exceed the number of samples. Think of a simple linear regression problem with 10 weights (9 features) while you have only 8 samples. From a mathematical point of view, there are then infinitely many linear models that pass through those points. Experience has shown that the number of samples should be about ten times greater than the number of features. This means that for training a regression model with 10 features we need at least 100 samples. It is good to remember that the number ten is based on experience; there is no proof for it, since it depends on the characteristics of the data. For example, if the data have a linear characteristic, we can train the model with fewer samples. Conversely, if all the samples are very similar, we need more of them to obtain a robust model. There are also other parameters that contribute to the correct number of samples. However, taking the number of weights in the model into account already gives a hint about the approximate number of samples needed.
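The "ten samples per feature" rule of thumb is easy to turn into a quick check. This is only a sketch; the factor of ten is heuristic (as noted above) and the numbers used here are made up for illustration.

```python
# Heuristic sample-size check: roughly 10 samples per feature (rule of thumb only).
def minimum_samples(n_features: int, factor: int = 10) -> int:
    return factor * n_features

n_features = 10    # e.g. the regression example above
n_available = 80   # hypothetical number of collected samples

needed = minimum_samples(n_features)
print(f"Need roughly {needed} samples, have {n_available}:",
      "OK" if n_available >= needed else "collect more data or simplify the model")
```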

Moreover, this also has to do with the stochastic nature of ML. The behavior and performance of many machine learning algorithms are described as stochastic. Stochastic refers to a variable process whose outcome involves some randomness and uncertainty. It is a mathematical term that is closely related to 'random' and 'probabilistic' and can be contrasted with the idea of 'deterministic'.

The stochastic nature of machine learning algorithms is an important, fundamental concept and needs to be understood in order to interpret the behavior of many predictive models effectively.
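A quick way to see this stochastic behavior, sketched here with a synthetic dataset, is to train the same model several times with different random seeds and compare the resulting test accuracies; on small datasets the spread is typically noticeable.

```python
# Same model, same data, different random seeds -> slightly different results.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for seed in range(5):
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"seed={seed}: accuracy={acc:.3f}")
```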

Different scenarios in case of a small dataset

Based on the status of each parameter, we have 4 different scenarios in the case of a small dataset. If the number of samples is very small, it's better to think about a rule-based solution rather than training an ML model. In general, we do not use an ML model for fewer than 50 samples. The advice is to collect more data (e.g. by implementing an IoT project).
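That 50-sample threshold is a rule of thumb, but it can be encoded as a simple guard in a pipeline. This is a hypothetical sketch; the threshold and the rule-based fallback are placeholders.

```python
# Guard: fall back to a hand-written rule when the dataset is too small to train on.
MIN_SAMPLES_FOR_ML = 50  # rule of thumb from the text, not a hard law

def choose_approach(n_samples: int) -> str:
    if n_samples < MIN_SAMPLES_FOR_ML:
        return "rule-based solution (and start collecting more data, e.g. via an IoT project)"
    return "train an ML model"

print(choose_approach(30))   # rule-based solution ...
print(choose_approach(500))  # train an ML model
```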

These are the 4 possibilities in case of a small dataset:

  1. Model complexity = low, problem complexity = low: This is the only case where the team can get a meaningful result even though it does not have sufficient data. If the problem is not complex, the advice is to choose a simple model and train it with the available dataset.
  2. Model complexity = high, problem complexity = low: The team should reduce the model complexity by choosing a simpler model. Even if the team gets higher performance from a very complex model, a simple model works better on unseen data. The reason is that the evaluation is based only on a small amount of data, which is not reliable, and by tuning the parameters of the complex model the team is likely to overfit it. The simple model will be more robust in practice (see the sketch after this list).
  3. Model complexity = low, problem complexity = high: The team should split the problem into sub-problems and address a simpler version of the problem. That way, the team can still achieve a meaningful result.
  4. Model complexity = high, problem complexity = high: Even if the team gets high accuracy, the model will fail in a real situation when it is confronted with unseen data. In this case, the advice is to collect more data (e.g. data augmentation or implementation of an IoT project).
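The overfitting risk in scenario 2 can be illustrated with a small experiment. This is only a sketch with synthetic data: an unconstrained decision tree fits the tiny training set perfectly, but its cross-validated accuracy is typically no better, and often worse, than that of a much shallower tree.

```python
# Complex vs simple model on a small dataset: training accuracy vs cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Small, noisy synthetic dataset standing in for limited machine data.
X, y = make_classification(n_samples=60, n_features=10, n_informative=3,
                           flip_y=0.1, random_state=0)

candidates = [
    ("deep tree (complex)", DecisionTreeClassifier(random_state=0)),
    ("depth-2 tree (simple)", DecisionTreeClassifier(max_depth=2, random_state=0)),
]

for name, model in candidates:
    train_acc = model.fit(X, y).score(X, y)            # accuracy on the data it was trained on
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # estimate of performance on unseen data
    print(f"{name}: train={train_acc:.2f}, cross-val={cv_acc:.2f}")
```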

In case of a large sample size (a luxury problem): Having enough samples is key to creating a good ML model. It should be noted that the samples should not be biased; otherwise, there is not much difference between a small and a large sample size.

  1. Model complexity = Low, Problem complexity = Low: Model selection is not a big issue with a simple problem and a large sample size. Most ML models return a good result.
  2. Model complexity = High, Problem complexity = Low: A complex model may return a slightly better result than a simpler model. Considering the explainability and maintenance of the model, you may still choose the simpler model.
  3. Model complexity = Low, Problem complexity = High: If you love working with state-of-the-art methods, this is a suitable case for you. Since there is enough data, pick a more complex model (e.g. deep learning) and you will get a desirable result.
  4. Model complexity = High, Problem complexity = High: You can see the power and beauty of Machine Learning in this case. Magic happens here.

Summary: in case of a limited dataset, start a research process based on the currently available data and try a number of high-performing algorithms, such as random forest. Investigate, in collaboration with an AI expert, the model performance on the available dataset from the limited number of machines. The result can be plotted as a line plot with the sample size of the training dataset on the x-axis and the model performance (accuracy) on the y-axis. This gives an idea of how the sample size of the dataset affects the performance of the model on a specific problem. This graph is called a learning curve (see picture). From this graph, we may be able to project the amount of data needed to develop a competent model. I strongly recommend this approach for developing robust AI/ML models.
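Scikit-learn can generate exactly this kind of learning curve. The sketch below uses a synthetic dataset as a stand-in for the machine data and a random forest as the example algorithm, and plots mean cross-validated accuracy against training-set size.

```python
# Learning curve: model performance as a function of training-set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real dataset from the limited number of machines.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 8))

plt.plot(train_sizes, test_scores.mean(axis=1), marker="o", label="cross-validated accuracy")
plt.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="training accuracy")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.title("Learning curve")
plt.legend()
plt.show()
```

If the cross-validated curve is still rising at the largest training size, that is a sign that collecting more data is likely to pay off; if it has flattened, more data alone will probably not help much.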

If you want to know more or would like to discuss, please leave a message.
