Getting started with AI – how much data do you need?

In 2017, the Economist claimed that oil was no longer the world’s most valuable resource; instead, it was data (link). Several years later, with all the spectacular technological advances made in data insights, organizations realize the inherent value their data contains and how emerging technologies like Artificial Intelligence (AI) have the potential to drive their competitive edge. However, when organizations explore new AI opportunities, one question comes up again and again: “How much data do we need?”

When working with AI, there is no single “right” amount of data, and companies should focus not only on the quantity but also on the quality of their data. Usually, the datasets under scrutiny fall short in at least one of the following categories:

Categories of dataset challenges: completeness, accessibility, quality, connectivity, quantity, and validity.

However, some of these aspects are more critical than others, and some are harder to fix. Missing records can, in some cases, be backfilled or inferred, and errors may be corrected with rules or logic. But if data is simply scarce, collecting more may be prohibitively slow or expensive.

This write-up provides the reader with a high-level overview of state-of-the-art techniques for dealing with limited or incomplete data, to offer a broad understanding of the methods used to address these challenges.


Dealing with limited or incomplete data

1. How much data is enough?

The minimal size of a dataset can depend on many factors, such as the complexity of the model you’re building, the performance you’re aiming for, or the time frame at your disposal. Usually, machine learning practitioners will try to achieve the best results with the minimum amount of resources (data or computation) while building their AI model; this means first trying simple models with few data points before trying more advanced methods, which potentially require larger amounts of data.

Imagine working out a linear model between your target variable (i.e., what you are trying to predict) and your features (i.e., your explanatory variables). As you may remember from high school math, a linear model has two parameters only (y = a*x+b). You may also remember that two data points are generally enough to fit a straight line.

If you consider a quadratic model with three parameters (y = a*x^2 + b*x + c), you’ll need at least three data points. Usually, even if there is no one-to-one relationship, the more complex your model becomes, the more data you will need to determine its parameters. For instance, one of the latest models for classifying images, like Inception V3 from Google, contains a bit less than 24 million parameters and requires about 1.2 million data points (in that case, labeled images) to be trained.
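The parameter-counting intuition above can be checked directly with a small sketch (illustrative numbers, not from the article): two points pin down a line exactly, and three points pin down a quadratic.

```python
import numpy as np

# A linear model y = a*x + b has two parameters, so two data points
# determine it exactly.
x = np.array([0.0, 1.0])
y = np.array([1.0, 3.0])
a, b = np.polyfit(x, y, deg=1)   # fits y = a*x + b
# a == 2.0, b == 1.0

# A quadratic y = a*x^2 + b*x + c has three parameters and needs
# at least three points.
x3 = np.array([0.0, 1.0, 2.0])
y3 = x3**2 + 1.0                 # points lying on y = x^2 + 1
coeffs = np.polyfit(x3, y3, deg=2)
# coeffs recovers [1.0, 0.0, 1.0]
```

With fewer points than parameters, the fit is underdetermined: infinitely many curves pass through the data, which is the simplest version of “not enough data for the model’s complexity.”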

The amount of data needed may also have to do with the particular problem you're solving. Suppose you are trying to forecast a time series with a simple structure but exceptionally long seasonality or cyclical patterns (e.g., 30 years). In that case, the bottleneck may not reside in the number of parameters in the model but in your ability to collect data points spanning the past 30 years. For example, monthly data over 30 years amounts to “only” 360 data points, yet those points may be impossible to collect.

Finally, there are actual mathematical ways to figure out whether you have enough data. Let’s say that your team of data scientists has worked on a model and has reached the best possible performance with the data at hand, but it’s just not enough. What should you do now? Collect different data? Collect more of the same data? Or both, to optimize your time and efforts? This question can be answered by diagnosing the model and data with a learning curve, which shows how the model’s performance increases as more data points are added, as depicted in Fig. 1 (from Researchgate):

Fig 1. shows a model’s performance as a function of the training dataset size. The figure is taken from Researchgate: https://www.researchgate.net/figure/Learning-Curve-of-machine-learning-model-with-the-size-of-dataset-used-for-testing-and_fig7_320592670

The idea is to see how much the model’s performance benefits from adding more data and whether or not the model has already saturated, in which case, adding more data will not help.
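A learning-curve diagnostic like the one in Fig. 1 can be produced in a few lines with scikit-learn. The dataset and model below are illustrative stand-ins, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for your dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Cross-validated performance at increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

# If this curve has flattened at the largest sizes, the model has
# saturated and collecting more of the same data is unlikely to help.
mean_val = val_scores.mean(axis=1)
```

Plotting `mean_val` against `train_sizes` reproduces the shape of Fig. 1: a steep early rise while data is scarce, then a plateau once the model has seen enough.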


2. What to do if you are running short on data?

If you find yourself in a situation where you need more data, there are different strategies to consider depending on the problem at hand and your situation:

If collecting more data is not possible: If you cannot collect more of the same data, you can try your luck by resorting to either data augmentation or data synthesis, i.e., creating artificial data based on the data you already have.

  • Data Augmentation – consists of generating new data points based on the ones you already have. For an image dataset, you could create new images with lower or higher resolutions, cropped, rotated, linearly transformed, or with added noise. This would help your algorithm become more robust to these types of perturbations. For further reading, have a look at unsupervised data augmentation.
  • Data synthesis – is sometimes used to remedy classification problems where the classes are imbalanced. New data points can be created using sampling techniques such as SMOTE. More recent and advanced methods leverage the power of deep learning and aim at learning the distribution (or, more generally, a representation) of the data to artificially generate new data that mimics the real data. Among such methods, one can mention variational autoencoders and generative adversarial networks.
  • Discriminative methods – when data is limited, you want to make sure that you focus on the right part. A common technique is regularization, which penalizes model complexity (for example, large weights) so that the model concentrates on the most informative patterns rather than fitting noise. More recently, in deep learning, a method called multi-task learning is used to exploit the limited amount of data at hand and alleviate overfitting in single-task model training. In essence, you train several related tasks jointly instead of one so the model generalizes better to new, unseen data.
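The simplest of these ideas, augmentation by noise injection, can be sketched in a few lines. The dataset, noise scale, and number of copies below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))          # 100 original samples, 5 features

def augment(X, n_copies=3, noise_std=0.05, rng=rng):
    """Return X stacked with n_copies slightly perturbed variants of itself."""
    copies = [X + rng.normal(scale=noise_std, size=X.shape)
              for _ in range(n_copies)]
    return np.vstack([X] + copies)

X_aug = augment(X)
# 100 originals + 3 * 100 noisy copies = 400 samples
```

For images, the same principle applies with crops, rotations, and flips instead of raw Gaussian noise; the key design choice is that the perturbations must preserve the label.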

However, data augmentation and synthesis will most likely have marginal effects if your data is not well distributed or too small in size to use the above-mentioned methods. In that case, you will have no other choice but to go out and collect new data points.

If collecting more data is an option

If collecting more data is the way to go, either because it is affordable to collect more of the same data you already have or because you possess or have access to large amounts of data, even partially complete data such as unlabeled data, you basically have two options:

  • Data Collection – is always the first option to consider. If your resources are limited and you have access to domain experts (aka SMEs, Subject Matter Experts) who can help you qualify (label) your data, you may want to try active learning. With active learning, the process of learning is iterative: the algorithm is trained on a limited number of labeled data points; the model then identifies difficult unlabeled points and interactively asks an SME to label them; and the newly labeled points are added to the training set.
  • Data Labeling – is about using the data points that you already own, but which are not part of your training or testing data (i.e., data used for modeling) because they are incomplete (e.g., missing label data). In that case, it might be interesting to see how you can leverage the latest advances in AI in order to make use of this untapped data potential.
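The active learning loop described above can be sketched as uncertainty sampling, where the "SME" is simulated by the hidden ground truth. The dataset, model, and query budget are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=200, random_state=0)

# Seed the labeled pool with one example of each class
labeled = [int(np.flatnonzero(y_true == c)[0]) for c in (0, 1)]
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                            # 20 SME query rounds
    model.fit(X[labeled], y_true[labeled])
    proba = model.predict_proba(X[unlabeled])
    # Query the point the model is least confident about
    query = unlabeled[int(np.argmin(proba.max(axis=1)))]
    labeled.append(query)                      # "SME" supplies the label
    unlabeled.remove(query)

model.fit(X[labeled], y_true[labeled])
final_acc = model.score(X, y_true)
```

The point of the loop is label efficiency: by spending the SME's time only on the examples the model finds hardest, you typically reach a given accuracy with far fewer labels than random labeling would need.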


3. How to label your unlabeled data?

If you’ve already recorded a significant amount of data but missed some parts of the information, such as the label, you could, of course, try to retrieve this information manually (data collection with traditional supervision), but that can turn out to be a very slow and painful process. So how do you get more labeled training data? There are a few different approaches to address the lack of data, summarized in Fig. 2 (from Stanford).

Fig 2. Different approaches to address the lack of data. The figure is taken from Stanford: https://ai.stanford.edu/blog/weak-supervision/

There are three major routes that you can try in order to get more usable data from unused and unlabeled data, summarized above. Let’s examine them one by one:

  • Semi-supervised learning – is particularly interesting if you find yourself with a small amount of labeled data and a large amount of unlabeled data. The idea is to use both the labeled and unlabeled data to achieve higher modeling performance, either by inferring labels for the unlabeled data or by using the unlabeled data to better capture the structure of the inputs. Semi-supervised learning makes specific assumptions about the topology of the data, i.e., points that are close to each other are assumed to belong to the same class. The interested reader can look at the MixMatch algorithm, developed by Google Research.
  • Transfer learning – is about “recycling” models that have been trained for a similar task and rewiring them to perform another one. For instance, you can easily apply transfer learning to the above-mentioned Inception V3 model from Google, which has been trained to distinguish between 1000 different categories, and “fine-tune” it for your own application. This way, the model will become effective at differentiating between a couple of new categories of interest. This approach, however, requires that you have access to an already pre-trained model and that you can use transfer learning with it, which is not always the case. Transfer learning can, in some ways, be considered a weak supervision method.
  • Weak supervision – The rationale behind weak supervision is to use noisy data (low-quality data) to find missing information in your existing data. Weak supervision is relevant when a vast amount of labeled data is needed and a certain level of domain expertise is available to be leveraged; it is not applicable to every limited-data case. A remarkable example of a new weak supervision tool comes from Stanford research and goes by the name of Snorkel. It removes humans from the labeling process and instead uses “labeling functions,” while still incorporating human knowledge. Snorkel asks humans to write a set of labeling functions (heuristics) and to label some data points. Of course, these labels will be imprecise and noisy. Still, Snorkel will automatically build a generative model over these labeling functions to create a probabilistic label reflecting the confidence in the label.
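The labeling-function idea can be illustrated with a toy spam/ham example. This is a deliberately simplified sketch in the spirit of Snorkel, not Snorkel's actual API: the heuristics are invented, and a plain majority vote stands in for Snorkel's generative model.

```python
ABSTAIN = -1  # a labeling function may decline to vote

def lf_contains_cheap(text):   # heuristic 1: spam-ish keyword
    return 1 if "cheap" in text else ABSTAIN

def lf_contains_hello(text):   # heuristic 2: greeting suggests ham
    return 0 if "hello" in text else ABSTAIN

def lf_all_caps(text):         # heuristic 3: shouting suggests spam
    return 1 if text.isupper() else ABSTAIN

LFS = [lf_contains_cheap, lf_contains_hello, lf_all_caps]

def weak_label(text):
    """Combine noisy labeling-function votes into one weak label."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN                       # no heuristic fired
    return max(set(votes), key=votes.count)  # simple majority vote
```

Individually each heuristic is noisy and incomplete, but aggregated over many examples the combined weak labels are good enough to train a discriminative model, which is the core bet of weak supervision.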


Conclusion

The brave reader will by now have understood that a data shortage is not fatal and that many solutions already exist to address this commonly faced challenge. However, it can be somewhat difficult to identify which approach is best suited for you. In particular, most of the recently developed approaches are designed for unstructured data (images, videos, text, audio, speech, etc.) and do not always translate directly to more traditional tabular data types. We can help you assess what next steps to take – get in touch to connect.

