Main Challenges to Machine Learning

Our main task in machine learning is to select a learning algorithm and train it on some data. The two things that can go wrong, therefore, are a "bad algorithm" and/or "bad data".

BAD DATA

Insufficient quantity of training data

Machine learning algorithms typically require thousands of examples even for fairly simple problems, and for complex problems such as image or speech recognition we may require millions of training examples.
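A quick way to see this effect is to train the same model on growing slices of a dataset and watch test accuracy improve. The sketch below uses scikit-learn's digits dataset and logistic regression (both my choices for illustration, not taken from the text):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train on progressively larger subsets of the training data
scores = {}
for n in (50, 200, 1000):
    model = LogisticRegression(max_iter=2000)
    model.fit(X_train[:n], y_train[:n])
    scores[n] = model.score(X_test, y_test)
    print(n, round(scores[n], 3))
```

Accuracy should climb noticeably as the training set grows, even though the algorithm itself never changes.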

Non-representative training data

It is important to use a training set that is representative of the cases we want to generalize to. If the sample is too small, the data may be non-representative purely by chance (called sampling noise), and even very large samples can be non-representative if the sampling method is flawed (called sampling bias).

It is also crucial to watch out for nonresponse bias (which occurs when the individuals willing to take part in a study differ systematically from those who are unwilling or unable to) during sampling.
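One practical guard against sampling bias is stratified sampling: split the data so that each subset preserves the class proportions of the whole. A minimal sketch with scikit-learn's `train_test_split` and made-up 90/10 class labels (illustrative numbers, not from the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 90% class 0, 10% class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 90/10 class ratio in both splits,
# instead of leaving it to the luck of a purely random draw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_test.mean())  # proportion of class 1 in the test split
```

With 20 test instances and a 10% minority class, the stratified split places exactly 2 minority examples in the test set.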

Poor quality data

If the training data is full of errors, missing values, outliers, and noise, it will be harder for the system to detect the underlying patterns, and it is less likely to perform well. It is often well worth the effort to spend time cleaning up the training data.
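In practice this cleanup often means filling missing values and discarding implausible outliers before training. A small sketch with pandas on hypothetical data (the column names and thresholds are my own, purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing age and one obvious outlier (400)
df = pd.DataFrame({"age": [25, 32, np.nan, 29, 400],
                   "income": [40000, 52000, 48000, 51000, 47000]})

# Fill the missing value with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Drop rows with an implausible age (simple rule-based outlier removal)
df = df[df["age"] < 120]

print(df)
```

Whether to fill, drop, or flag bad values depends on the problem; the point is to make these decisions deliberately rather than feeding raw, noisy rows to the algorithm.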

Irrelevant features

A machine learning system can only learn if the training data contains enough relevant features and not too many irrelevant ones. Coming up with a good set of features to train on is called feature engineering, and it involves:

  • Feature selection (choosing the most useful existing features)
  • Feature extraction (combining existing features to produce more useful ones)
  • Creating new features (e.g., by gathering new data)
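The first two steps above can be sketched in a few lines of scikit-learn, using the iris dataset as an example of my own choosing: `SelectKBest` keeps the features most associated with the target, while `PCA` combines correlated features into a smaller set of components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Feature selection: keep the 2 features most correlated with the target
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: combine the 4 features into 2 principal components
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)
```

Both produce a 2-column matrix, but by different routes: selection discards columns outright, while extraction builds new columns from combinations of the originals.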

BAD ALGORITHM

Overfitting the training data

Overfitting means that the model performs well on the training data but does not generalize well to new instances.

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data, so it ends up learning patterns in the noise itself.

Possible solutions:

  • Simplify the model (select a model with fewer parameters, reduce the number of attributes in the training data, or constrain the model).
  • Gather more training data.
  • Reduce the noise in the training data (fix errors, handle missing values, remove outliers, etc.).

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. We need to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it generalizes well.
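A concrete way to see regularization at work is to fit the same high-degree polynomial model twice: once with plain least squares and once with Ridge regression, whose `alpha` penalty shrinks the learned weights. Everything below (the synthetic quadratic data, degree 15, `alpha=10`) is my own illustrative setup, not from the text:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 30)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 1, 30)  # noisy quadratic data

# Same degree-15 polynomial features for both models; only the
# regularization differs
features = [PolynomialFeatures(15, include_bias=False), StandardScaler()]
overfit = make_pipeline(*features, LinearRegression()).fit(X, y)
regularized = make_pipeline(*features, Ridge(alpha=10.0)).fit(X, y)

w_overfit = overfit.named_steps["linearregression"].coef_
w_ridge = regularized.named_steps["ridge"].coef_
print(np.linalg.norm(w_overfit), np.linalg.norm(w_ridge))
```

The Ridge penalty keeps the weight vector small, which is exactly the "constraining the model" described above: the regularized model trades a slightly worse fit on the training points for a smoother, better-generalizing curve.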

Underfitting the training data

Underfitting is the opposite of overfitting. It means that our model is too simple to learn the underlying patterns in the data.

Possible solutions:

  • Select a more powerful model (more parameters).
  • Feed better features to the learning algorithm.
  • Reduce the constraints on the model.
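The first two remedies can be seen together in a tiny example: a straight line is too simple for quadratic data, but feeding the model a better feature (the squared term, via `PolynomialFeatures`) fixes the underfit. The synthetic data below is my own illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.1, 100)  # quadratic data, mild noise

# A straight line underfits symmetric quadratic data badly;
# adding the squared feature gives the model the capacity it needs
linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)

print(round(linear.score(X, y), 3), round(poly.score(X, y), 3))
```

The linear model's R² on this data is close to zero, while the polynomial model fits almost perfectly: the cure for underfitting here was more model capacity, not more data.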


Reference Book - https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/