Getting started with AI – how much data do you need?
In 2017, The Economist claimed that oil was no longer the world’s most valuable resource; instead, it was data (link). Several years later, after spectacular technological advances in extracting insights from data, organizations recognize the inherent value their data contains and how emerging technologies like Artificial Intelligence (AI) can drive their competitive edge. However, when organizations explore new AI opportunities, we are often asked: “how much data do we need?”
When working with AI, there is no perfect amount of data, and companies often have to focus not only on the quantity but also on the quality of their data. Usually, the datasets under scrutiny fall short in at least one of these categories: the data is limited in quantity, incomplete (e.g., missing records or labels), or contains errors.
However, some of these shortcomings are more critical than others, and some are harder to fix. Missing records can, in some cases, be backfilled or inferred, and mistakes may be corrected based on rules or logic. But if the data is limited in quantity, collecting more may be expensive or slow.
This write-up intends to give the reader a high-level view of state-of-the-art techniques for dealing with limited or incomplete data and a broad understanding of how these challenges can be addressed.
Dealing with limited or incomplete data
1. How much data is enough?
The minimal size of a dataset can depend on many factors, such as the complexity of the model you’re building, the performance you’re aiming for, or the time frame at your disposal. Usually, machine learning practitioners will try to achieve the best results with the minimum amount of resources (data or computation) while building their AI model; this means first trying simple models with few data points before trying more advanced methods, which potentially require larger amounts of data.
Imagine working out a linear model between your target variable (i.e., what you are trying to predict) and your features (i.e., your explanatory variables). As you may remember from high school math, a linear model has only two parameters (y = a*x + b), and two data points are generally enough to fit a straight line.
If you consider a quadratic model with three parameters (y = a*x^2 + b*x + c), you’ll need at least three data points. Usually, even if there is no one-to-one relationship, the more complex your model becomes, the more data you will need to determine its parameters. For instance, one of the latest image-classification models, Google’s Inception V3, contains a bit less than 24 million parameters and requires about 1.2 million data points (in that case, labeled images) to be trained.
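To make this concrete, here is a minimal sketch (in Python, with made-up points) showing that a polynomial fit is fully determined once you have as many data points as parameters: two for a line, three for a quadratic.

    # With exactly as many points as parameters, the fit is fully determined.
    import numpy as np

    x_line = np.array([0.0, 1.0])
    y_line = np.array([1.0, 3.0])
    a, b = np.polyfit(x_line, y_line, deg=1)        # y = a*x + b
    print(f"line: y = {a:.1f}*x + {b:.1f}")         # -> y = 2.0*x + 1.0

    x_quad = np.array([0.0, 1.0, 2.0])
    y_quad = np.array([1.0, 2.0, 5.0])
    coeffs = np.polyfit(x_quad, y_quad, deg=2)      # y = a*x^2 + b*x + c
    print("quadratic coefficients:", np.round(coeffs, 1))  # -> [1. 0. 1.]

With fewer points than parameters, the system is underdetermined and infinitely many curves fit equally well; that is the toy version of why bigger models demand more data.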
The amount of data needed may also depend on the particular problem you’re solving. Suppose you are trying to forecast a time series with a simple structure but exceptionally long seasonality or cyclical patterns (e.g., 30 years). In that case, the bottleneck may not be the number of parameters in the model but your ability to collect data points spanning those 30 years: at a monthly frequency, that “only” represents 360 data points, but they may be impossible to collect.
Finally, there are actual mathematical ways to figure out whether you have enough data. Let’s say that your team of data scientists has worked on a model and has reached the best possible performance with the data at hand, but it’s just not enough. What should you do now? Collect different data? Collect more of the same data? Or both, to optimize your time and efforts? This question can be answered by diagnosing the model and data with a learning curve, which shows how the model’s performance increases as you add more data points, as depicted in Fig. 1 (from ResearchGate):
The idea is to see how much the model’s performance benefits from adding more data and whether or not the model has already saturated, in which case adding more data will not help.
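As a sketch of what such a diagnostic could look like in practice, the snippet below uses scikit-learn’s learning_curve helper on a synthetic dataset; the estimator, split sizes, and cross-validation settings are illustrative choices, not a prescription.

    # Evaluate the same model on growing subsets of the training data.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=2000, random_state=0)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    )
    for n, score in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n:>5} training samples -> validation accuracy {score:.3f}")

If the validation score is still climbing at the largest training size, more data is likely to help; if it has flattened, your efforts are better spent elsewhere (features, model class, labels).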
2. What to do if you are running short on data?
If you find yourself in a situation where you need more data, there are different strategies to consider depending on the problem at hand and your situation:
If collecting more data is not possible: If you cannot collect more of the same data, you can resort to either data augmentation or data synthesis, i.e., creating artificial data based on the data you already have, as in the sketch below.
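For instance, on an image task, augmentation can be as simple as randomly flipping, rotating, and recoloring the training images. A minimal sketch, assuming an image pipeline and that torchvision is available (the specific transforms are illustrative):

    # A data-augmentation sketch for images; each transform is an
    # illustrative choice, not a fixed recipe.
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),   # mirror left-right half the time
        transforms.RandomRotation(degrees=10),    # small random tilts
        transforms.ColorJitter(brightness=0.2),   # vary lighting conditions
        transforms.ToTensor(),                    # convert to a model-ready tensor
    ])

Applied on the fly as a dataset’s transform, every training epoch sees a slightly different variant of each original image, which effectively stretches a small dataset.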
However, data augmentation and synthesis will most likely have marginal effects if your data is poorly distributed or too small for the above-mentioned methods to work. In that case, you will have no choice but to go out and collect new data points.
If collecting more data is an option:
If collecting more data is the way to go, either because it is affordable to collect more of the same data you already have, or because you possess or have access to large amounts of partially complete data (such as unlabeled data), you basically have two options: collect and label brand-new data points the traditional way, or make the partially complete data you already have usable, for instance by labeling it, as discussed in the next section.
3. How to label your unlabeled data?
If you’ve already recorded a significant amount of data but missed some parts of the information, such as the labels, you could, of course, try to retrieve this information manually (data collection with traditional supervision), but this can turn out to be a very slow and painful process. So how do you get more labeled training data? There are a few different approaches to address the lack of labeled data (the figure is taken from Stanford).
There are three major routes you can take to get more usable data out of unused and unlabeled data, summarized above. Let’s examine them one by one:
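To make one of these ideas concrete, here is a minimal semi-supervised sketch using scikit-learn’s self-training wrapper, where unlabeled examples are marked with -1 and the model iteratively pseudo-labels the ones it is most confident about; the synthetic dataset, base estimator, and confidence threshold are all illustrative assumptions.

    # Self-training: start from the few labeled points, then repeatedly
    # adopt the model's most confident predictions as pseudo-labels.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    y_partial = y.copy()
    rng = np.random.default_rng(0)
    y_partial[rng.random(len(y)) < 0.9] = -1   # pretend 90% of labels are missing

    model = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
    model.fit(X, y_partial)
    print(f"accuracy against the true labels: {model.score(X, y):.3f}")

The appeal of this family of methods is that the expensive asset (human labels) is needed for only a fraction of the data, while the cheap asset (unlabeled records) still contributes to training.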
Conclusion
The brave reader will by now have understood that a data shortage is not a dead end and that many solutions already exist to address this commonly faced challenge. However, it can be difficult to identify which approach is best suited to your case. In particular, most of the recently developed approaches are designed for unstructured data (images, videos, text, audio, speech, etc.) and do not always translate directly to more traditional tabular data. We can help you assess what next steps to take – just click the button below to connect.