FEATURES AND TRAINING SET SELECTION FOR YOUR MACHINE LEARNING ALGORITHM

"If you don’t know where you are, a map won’t help" - Watts Humphrey (1927–2010)

A very common issue data scientists face is the lack of a reliable methodology for tracing the accuracy of the resulting classification model. After building your classification or regression system and testing it on a new set of samples, you may find that it makes unacceptably large prediction errors.

The question is: how do you trace the problem and improve the accuracy when so many factors and features are involved?

Suppose you have the annotated data and the classification model, and according to your tests the accuracy is 60%. Now you wonder: is it good? Is it bad? Can it be improved? There are several possible remedies: maybe you need more training data, more features, or perhaps fewer features.

To answer these questions, we need to perform detailed machine learning diagnostics that give us insight into what is wrong with the learning algorithm, and then choose the best way to improve its performance.

Usually, your data set is divided into training and testing sets: the system is trained on the training set (typically 80% of the data) and tested on the testing set (the remaining 20%). A much better way to perform diagnostics is to divide the data into three sets instead (a minimal code sketch follows the list):

  • Training set: 60%
  • Development/Validation set: 20%
  • Testing set: 20%
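
As a minimal sketch of such a split, assuming scikit-learn and feature/label arrays X and y (illustrative names, not from the article), two calls to train_test_split produce the 60/20/20 division:

    from sklearn.model_selection import train_test_split

    # First carve off the 20% testing set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, random_state=42)

    # Split the remaining 80% into 60% training / 20% development:
    # 0.25 of the remaining 80% equals 20% of the full data set.
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=42)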

Both the training set and the development set are used for the detailed diagnostics, and hence for adjusting the system parameters for the best performance.

Basically there are two steps here:

  • Diagnosing: How to select the right number of features? How to decide on a suitable training set size?
  • Fixing/treatment: solutions to under-fitting (high bias) and solutions to over-fitting (high variance)

Step 1 - Diagnosing

In this stage, we just want to know where we are, by running some simple but effective tests and plotting the corresponding curves to see exactly where the learning algorithm stands.

How to select the right number of features?

Depending on the number of features selected, you face one of three scenarios (a diagnostic sketch follows the list):

  • Under-fitting (high bias) situation: the accuracy on both the training set and the development set is low. In other words, the system cannot perform well even on the training data and naturally cannot generalize to the development data.
  • Over-fitting (high variance) situation: the accuracy on the training set is high, but the accuracy on the development set is low. This means the system memorized the training data very well but cannot generalize to the development data.
  • Best-fitting situation: the accuracies on the training set and the development set are both high, close to each other, and acceptable. The system learns well from the training set and generalizes to the development set; it is expected to perform well on the unseen testing set.
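
A minimal sketch of this diagnostic, assuming scikit-learn, the X_train/X_dev split from the earlier sketch, and a logistic regression classifier (the article does not prescribe a specific model): score the system on both sets for increasing feature counts and read off the scenario from the gap.

    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    for k in (5, 10, 20, 50):  # candidate feature counts (illustrative)
        sel = SelectKBest(f_classif, k=k).fit(X_train, y_train)
        Xtr, Xdv = sel.transform(X_train), sel.transform(X_dev)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
        # low/low -> under-fit, high/low -> over-fit, high/high -> best fit
        print(k, clf.score(Xtr, y_train), clf.score(Xdv, y_dev))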

How to decide on a suitable training set size?

Using the "Learning Curves", while the training set size increases, we plot the training and development sets errors:

  • High bias situation: high error on both sets, with a small gap between the training and development errors.
  • High variance situation: the development error decreases as the training set size increases, with a large gap between the training and development errors.
  • Desired performance: in between these two extreme cases.
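
As a sketch, scikit-learn's learning_curve utility computes both curves directly (it uses cross-validation folds as the development data, a slight variation on the fixed split above); clf, X, and y are the illustrative names from the earlier sketches:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import learning_curve

    sizes, train_scores, dev_scores = learning_curve(
        clf, X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5)
    # Plot error = 1 - accuracy for both curves.
    plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
    plt.plot(sizes, 1 - dev_scores.mean(axis=1), label="development error")
    plt.xlabel("training set size")
    plt.ylabel("error")
    plt.legend()
    plt.show()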


Step 2 - Fixing/Treatment

After the diagnostics, knowing the situation (over-fitting or under-fitting), it is time to fix the problems and improve the system. For the best performance, we are searching for the right fit, in between the over-fitting and under-fitting situations. Here are the possible solutions:

Solutions to under-fit (high bias):

  • Getting stronger features should help (the current features are not informative enough); see the sketch after this list.
  • Getting more training data will not help much (by itself).
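
One hedged way to obtain stronger features, not prescribed by the article, is to derive polynomial and interaction terms from the existing ones and check that both accuracies rise (again assuming scikit-learn and the illustrative split from above):

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegression

    # Degree-2 polynomial/interaction terms act as stronger derived features.
    poly = PolynomialFeatures(degree=2).fit(X_train)
    clf = LogisticRegression(max_iter=1000).fit(poly.transform(X_train), y_train)
    print("train:", clf.score(poly.transform(X_train), y_train),
          "dev:", clf.score(poly.transform(X_dev), y_dev))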

Solutions to over-fit (high variance):

  • Get more training data (we may need more data to generalize to unseen data).
  • Use fewer features (with too many features, the model memorizes the training data and fails to generalize to unseen data); see the sketch after this list.
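
As a sketch of the "fewer features" remedy, one technique the article does not name is L1 regularization, which drives uninformative feature weights to zero and so effectively shrinks the feature set (illustrative names as above):

    from sklearn.linear_model import LogisticRegression

    # The L1 penalty zeroes out weak feature weights; compare the dev
    # accuracy with the unregularized model to confirm variance dropped.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X_train, y_train)
    print("features kept:", (clf.coef_ != 0).sum())
    print("train:", clf.score(X_train, y_train),
          "dev:", clf.score(X_dev, y_dev))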

So, what's next?

After fixing, repeat the diagnostics and examine the improvements in a tangible and measurable way. Given annotated data and a machine learning toolbox, it is easy to train and test your system and report results that may look good. However, it is crucial to perform detailed diagnostics and draw the curves to know exactly where your machine learning system stands, whether there is room for improvement, and how to improve it, gaining better accuracy along with memory and time savings by fixing common problems such as over-fitting and under-fitting.


Regards
