How Much Minimum Data Do We Need?
Akshay Ugalmogale
Digital Marketer | Web Designer | AI ML | Data Engineer | Mechanical Engineer
Data is the new oil: a commodity every company needs more of, and needs now. "How much training data do you need to build your AI models?" is one of the most common questions in the field, and the answer varies depending on what you are trying to do. AI models usually require vast amounts of data for training, but some datasets are so limited in size that it can be hard to know where to start or what to do next.
How Do You Determine the Size of a Data Set Needed to Train AI?
How much data will you need? The answer depends on the type of task your algorithm is supposed to perform, the method you use to achieve it, and the expected performance. In general, traditional machine learning algorithms need less data than deep learning models. Around 1,000 samples per category is considered a minimum for the simplest machine learning algorithms, but in most cases that won't be enough to solve the problem.
The more complex the problem, the more training data you should have. The number of data samples should be proportional to the number of model parameters. According to the so-called rule of 10, often used for dataset size estimation, you should have around 10 times more data samples than parameters. Of course, this rule is only a suggestion and may not apply to every project (some deep learning algorithms perform well at a 1:1 ratio), but it is useful when you are trying to estimate the minimum size of your dataset. Note, however, that some variables, such as the signal-to-noise ratio, may radically change this demand.
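As a minimal sketch (not from the article itself), the rule of 10 can be turned into a quick back-of-the-envelope calculation. The function name and the ratio argument below are illustrative assumptions:

```python
def estimate_min_samples(num_parameters: int, ratio: int = 10) -> int:
    """Rule-of-10 estimate: roughly `ratio` samples per model parameter.

    `ratio` is the samples-per-parameter multiplier; 10 is the common
    rule of thumb, but some deep learning setups get away with 1:1.
    """
    return num_parameters * ratio

# Example: a logistic regression with 50 features has ~51 parameters
# (50 weights + 1 bias), so the rule of 10 suggests ~510 samples.
print(estimate_min_samples(51))           # 510
print(estimate_min_samples(51, ratio=1))  # 51, the optimistic 1:1 case
```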
Commonly Used Rules of Thumb for Data Collection:
1. Estimate using the rule of 10: For an initial estimate of the amount of data required, you can apply the rule of 10, which recommends that the amount of training data you need be 10 times the number of parameters (or degrees of freedom) in the model. This recommendation came about as a way of accounting for the variety of outputs that arises from combining the defined parameters.
2. Supervised deep learning rule of thumb: In their book Deep Learning, Goodfellow, Bengio, and Courville suggest that around 5,000 labeled examples per category is enough for a supervised deep learning algorithm to achieve acceptable performance. To match or exceed human performance, they recommend a dataset of at least 10 million labeled examples.
3. Computer vision rule of thumb: When using deep learning for image classification, a good baseline to start from is 1,000 images per class. Pete Warden analyzed entries in the ImageNet classification challenge, whose dataset had 1,000 categories with a bit under 1,000 images in each. That was enough to train early generations of image classifiers such as AlexNet, so he concluded that roughly 1,000 images per class is a reasonable baseline for computer vision tasks.
4. Reserve about 20% of the data for validation: Another recommendation from the Deep Learning book is to use about 80% of the data for training and 20% for validation. The validation set is the subset of data used to guide the selection of hyperparameters. This split works well for small datasets, but with very large datasets the validation and test fractions can be reduced to 0.1% or even less (a minimal split sketch in Python follows this list).
5. Plot learning curves: To check whether more data would help your machine learning algorithm, plot a learning curve of sample size against success rate. If the algorithm is trained adequately, the curve will look roughly logarithmic. If the last two points for your current sample size still show a positive slope, collecting more data is likely to improve the success rate; as the slope approaches zero, adding data is unlikely to help (see the learning-curve sketch after this list).
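As a minimal sketch of the 80/20 split from point 4, here is scikit-learn's train_test_split on a small toy dataset; the dataset and variable names are illustrative assumptions, not from the article:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # small toy image dataset, ~1,800 samples

# 80% for training, 20% held out to guide hyperparameter selection.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(len(X_train), len(X_val))  # roughly 1437 and 360
```

With millions of samples, the same call would simply use a much smaller test_size (for example 0.001), as noted in point 4.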
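And a sketch of the learning-curve check from point 5, using scikit-learn's learning_curve helper with a logistic regression. The model and dataset are placeholders; the idea is just to watch whether the validation score is still rising at the largest training size:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Validation score at increasing training-set sizes (5-fold cross-validation).
train_sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
)
mean_scores = val_scores.mean(axis=1)

plt.plot(train_sizes, mean_scores, marker="o")
plt.xlabel("Training samples")
plt.ylabel("Validation accuracy")
plt.show()

# If the slope between the last two points is still clearly positive,
# collecting more data should keep improving the score.
print("slope of last segment:",
      (mean_scores[-1] - mean_scores[-2]) / (train_sizes[-1] - train_sizes[-2]))
```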
There is a technique for scaling up small datasets known as data augmentation. The performance of most ML models, and deep learning models in particular, depends on the quality, quantity, and relevance of the training data. However, insufficient data is one of the most common challenges in implementing machine learning in the enterprise, because collecting data is often costly and time-consuming.
Data augmentation is a set of techniques to artificially increase the amount of data by generating new data points from existing data. This includes making small changes to existing data or using deep learning models to generate new data points.
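As a brief sketch of the idea (a fuller treatment is left for the next article), here is a simple image augmentation pipeline using torchvision transforms; the specific transforms and parameters are illustrative choices, not a recommendation from this article:

```python
from torchvision import transforms

# Each pass through this pipeline yields a slightly different variant of the
# same image, effectively multiplying the size of a small training set.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Usage with a PIL image `img` (hypothetical):
# augmented_tensor = augment(img)
```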
We will take a deeper look at data augmentation in the next article.
Thank You!