FEATURES AND TRAINING SET SELECTION FOR YOUR MACHINE LEARNING ALGORITHM

"If you don’t know where you are, a map won’t help" - Watts Humphrey (1927–2010)

A very common issue data scientists face is the lack of a reliable methodology for tracing the accuracy of the resulting classification model. After building your classification or regression system and testing it on a new set of samples, you may find that it makes unacceptably large prediction errors.

The question is: how do you trace the problem and improve the accuracy when so many factors and features are involved?

Suppose you have the annotated data and the classification model, and according to your tests the accuracy is 60%. Now you wonder: is it good? Is it bad? Can it be improved? There are several possible remedies: maybe you need more training data, more features, or perhaps fewer features.

To answer these questions, we need to perform detailed machine learning diagnostics that give us insight into what is wrong with the learning algorithm, and then choose the best way to improve its performance.

Usually, your data set is divided into training and testing sets: the system is trained on the training set (typically 80% of the data) and tested on the testing set (the remaining 20%). A much better way to perform diagnostics is to divide the data into three sets instead (a minimal code sketch follows the list):

  • Training set: 60%
  • Development/Validation set: 20%
  • Testing set: 20%
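
As a minimal sketch of such a split, assuming scikit-learn and feature/label arrays X and y (illustrative names, not from the article), two calls to train_test_split produce the 60/20/20 division:

    from sklearn.model_selection import train_test_split

    # First carve off the 20% testing set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, random_state=42)

    # Split the remaining 80% into 60% training / 20% development:
    # 0.25 of the remaining 80% equals 20% of the full data set.
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=42)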

Both the training set and the development set are used for the detailed diagnostics, and hence for adjusting the system parameters for the best performance.

Basically there are two steps here:

  • Diagnosing: How to select the right number of features? How to decide on a suitable training set size?
  • Fixing/treatment: solutions to under-fitting (high bias) and solutions to over-fitting (high variance)

Step 1 - Diagnosing

In this stage, we just want to know where we are, by running some simple but effective tests and plotting the corresponding curves to see exactly where the learning algorithm stands.

How to select the right number of features?

Depending on the number of features selected, you face one of three scenarios (a diagnostic sketch follows the list):

  • Under-fitting (high bias) situation: the accuracy on both the training set and the development set is low. In other words, the system cannot perform well even on the training data and naturally cannot generalize to the development data.
  • Over-fitting (high variance) situation: the accuracy on the training set is high, but the accuracy on the development set is low. This means the system memorized the training data very well but cannot generalize to the development data.
  • Best-fitting situation: the accuracies on the training set and the development set are both high, close to each other, and acceptable. The system learns well from the training set and generalizes to the development set; it is expected to perform well on the unseen testing set.
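
A minimal sketch of this diagnostic, assuming scikit-learn, the X_train/X_dev split from the earlier sketch, and a logistic regression classifier (the article does not prescribe a specific model): score the system on both sets for increasing feature counts and read off the scenario from the gap.

    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    for k in (5, 10, 20, 50):  # candidate feature counts (illustrative)
        sel = SelectKBest(f_classif, k=k).fit(X_train, y_train)
        Xtr, Xdv = sel.transform(X_train), sel.transform(X_dev)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
        # low/low -> under-fit, high/low -> over-fit, high/high -> best fit
        print(k, clf.score(Xtr, y_train), clf.score(Xdv, y_dev))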

How to decide on a suitable training set size?

Using the "Learning Curves", while the training set size increases, we plot the training and development sets errors:

  • High bias situation: high error on both sets, with a small gap between the training and development errors.
  • High variance situation: the development error decreases as the training set size increases, with a large gap between the training and development errors.
  • Desired performance: in between these two extreme cases.
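
As a sketch, scikit-learn's learning_curve utility computes both curves directly (it uses cross-validation folds as the development data, a slight variation on the fixed split above); clf, X, and y are the illustrative names from the earlier sketches:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import learning_curve

    sizes, train_scores, dev_scores = learning_curve(
        clf, X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5)
    # Plot error = 1 - accuracy for both curves.
    plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
    plt.plot(sizes, 1 - dev_scores.mean(axis=1), label="development error")
    plt.xlabel("training set size")
    plt.ylabel("error")
    plt.legend()
    plt.show()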


Step 2 - Fixing/Treatment

After the diagnostics, knowing the situation (over-fitting or under-fitting), it is time to fix the problems and improve the system. For the best performance, we are searching for the right fit, in between the over-fitting and under-fitting situations. Here are the possible solutions:

Solutions to under-fit (high bias):

  • Getting stronger features should help (the current features are not informative enough); see the sketch after this list.
  • Getting more training data will not help much (by itself).
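
One hedged way to obtain stronger features, not prescribed by the article, is to derive polynomial and interaction terms from the existing ones and check that both accuracies rise (again assuming scikit-learn and the illustrative split from above):

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegression

    # Degree-2 polynomial/interaction terms act as stronger derived features.
    poly = PolynomialFeatures(degree=2).fit(X_train)
    clf = LogisticRegression(max_iter=1000).fit(poly.transform(X_train), y_train)
    print("train:", clf.score(poly.transform(X_train), y_train),
          "dev:", clf.score(poly.transform(X_dev), y_dev))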

Solutions to over-fit (high variance):

  • Get more training data (we may need more data to generalize to unseen data).
  • Use fewer features (with too many features, the model memorizes the training data and fails to generalize to unseen data); see the sketch after this list.
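
As a sketch of the "fewer features" remedy, one technique the article does not name is L1 regularization, which drives uninformative feature weights to zero and so effectively shrinks the feature set (illustrative names as above):

    from sklearn.linear_model import LogisticRegression

    # The L1 penalty zeroes out weak feature weights; compare the dev
    # accuracy with the unregularized model to confirm variance dropped.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X_train, y_train)
    print("features kept:", (clf.coef_ != 0).sum())
    print("train:", clf.score(X_train, y_train),
          "dev:", clf.score(X_dev, y_dev))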

So, what's next?

After fixing, repeat the diagnostics and examine the improvements in a tangible and measurable way. Given annotated data and a machine learning toolbox, it is easy to train and test your system and report results that may look good. However, it is crucial to perform detailed diagnostics and draw the curves to know exactly where your machine learning system stands, whether there is room for improvement, and how to improve it, gaining better accuracy along with memory and time savings by fixing common problems such as over-fitting and under-fitting.


Regards
