How Much Minimum Data Do We Need?
Akshay Ugalmogale
Digital Marketer | Web Designer | AI ML | Data Engineer | Mechanical Engineer
Data is the new oil: a commodity every company needs more of, and needs now. "How much training data do you need to build your AI models?" is one of the most common questions in the field, and the answer varies depending on what you are trying to do. AI models usually require vast amounts of data for training, but some datasets are so limited in size that it can be hard to know where to start or what to do next.
How Do You Determine the Size of a Data Set Needed to Train AI?
How much data will you need? The answer depends on the type of task your algorithm is supposed to perform, the method you use to achieve it, and the expected performance. In general, traditional machine learning algorithms need less data than deep learning models. Around 1,000 samples per category is considered a minimum for the simplest machine learning algorithms, but in most cases that won't be enough to solve the problem.
The more complex the problem, the more training data you should have. The number of data samples should be proportional to the number of model parameters. According to the so-called rule of 10, often used for dataset size estimation, you should have around 10 times more data samples than parameters. Of course, this rule is only a suggestion and may not apply to every project (some deep learning algorithms perform well at a 1:1 ratio), but it is useful when you are trying to estimate the minimum size of your dataset. Note, however, that some variables, such as the signal-to-noise ratio, may radically change this demand.
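As a minimal sketch (not from the article itself), the rule of 10 can be turned into a quick back-of-the-envelope calculation. The function name and the ratio argument below are illustrative assumptions:

```python
def estimate_min_samples(num_parameters: int, ratio: int = 10) -> int:
    """Rule-of-10 estimate: roughly `ratio` samples per model parameter.

    `ratio` is the samples-per-parameter multiplier; 10 is the common
    rule of thumb, but some deep learning setups get away with 1:1.
    """
    return num_parameters * ratio

# Example: a logistic regression with 50 features has ~51 parameters
# (50 weights + 1 bias), so the rule of 10 suggests ~510 samples.
print(estimate_min_samples(51))           # 510
print(estimate_min_samples(51, ratio=1))  # 51, the optimistic 1:1 case
```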
Commonly Used Rules of Thumb for Data Collection:
1. Estimate using the rule of 10: For an initial estimate of the amount of data required, you can apply the rule of 10, which recommends that the amount of training data you need be 10 times the number of parameters (or degrees of freedom) in the model. This recommendation came about as a way of accounting for the variety of outputs that arises from combining the defined parameters.
2. Supervised deep learning rule of thumb: In their book Deep Learning, Goodfellow, Bengio, and Courville suggest that around 5,000 labeled examples per category is enough for a supervised deep learning algorithm to achieve acceptable performance. To match or exceed human performance, they recommend a dataset of at least 10 million labeled examples.
3. Computer vision rule of thumb: When using deep learning for image classification, a good baseline to start from is 1,000 images per class. Pete Warden analyzed entries in the ImageNet classification challenge, whose dataset had 1,000 categories with a bit under 1,000 images in each. That was enough to train early generations of image classifiers such as AlexNet, so he concluded that roughly 1,000 images per class is a reasonable baseline for computer vision tasks.
4. Reserve about 20% of the data for validation: Another recommendation from the Deep Learning book is to use about 80% of the data for training and 20% for validation. The validation set is the subset of data used to guide the selection of hyperparameters. This split works well for small datasets, but with very large datasets the validation and test fractions can be reduced to 0.1% or even less (a minimal split sketch in Python follows this list).
5. Plot learning curves: To check whether more data would help your machine learning algorithm, plot a learning curve of sample size against success rate. If the algorithm is trained adequately, the curve will look roughly logarithmic. If the last two points for your current sample size still show a positive slope, collecting more data is likely to improve the success rate; as the slope approaches zero, adding data is unlikely to help (see the learning-curve sketch after this list).
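As a minimal sketch of the 80/20 split from point 4, here is scikit-learn's train_test_split on a small toy dataset; the dataset and variable names are illustrative assumptions, not from the article:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # small toy image dataset, ~1,800 samples

# 80% for training, 20% held out to guide hyperparameter selection.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(len(X_train), len(X_val))  # roughly 1437 and 360
```

With millions of samples, the same call would simply use a much smaller test_size (for example 0.001), as noted in point 4.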
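And a sketch of the learning-curve check from point 5, using scikit-learn's learning_curve helper with a logistic regression. The model and dataset are placeholders; the idea is just to watch whether the validation score is still rising at the largest training size:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Validation score at increasing training-set sizes (5-fold cross-validation).
train_sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
)
mean_scores = val_scores.mean(axis=1)

plt.plot(train_sizes, mean_scores, marker="o")
plt.xlabel("Training samples")
plt.ylabel("Validation accuracy")
plt.show()

# If the slope between the last two points is still clearly positive,
# collecting more data should keep improving the score.
print("slope of last segment:",
      (mean_scores[-1] - mean_scores[-2]) / (train_sizes[-1] - train_sizes[-2]))
```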
There is a technique for scaling up small datasets known as data augmentation. The performance of most ML models, and deep learning models in particular, depends on the quality, quantity, and relevance of the training data. However, insufficient data is one of the most common challenges in implementing machine learning in the enterprise, because collecting data is often costly and time-consuming.
Data augmentation is a set of techniques to artificially increase the amount of data by generating new data points from existing data. This includes making small changes to existing data or using deep learning models to generate new data points.
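As a brief sketch of the idea (a fuller treatment is left for the next article), here is a simple image augmentation pipeline using torchvision transforms; the specific transforms and parameters are illustrative choices, not a recommendation from this article:

```python
from torchvision import transforms

# Each pass through this pipeline yields a slightly different variant of the
# same image, effectively multiplying the size of a small training set.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Usage with a PIL image `img` (hypothetical):
# augmented_tensor = augment(img)
```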
We will take a deeper look at data augmentation in the next article.
Thank You!