Hands-on ML Series: Episode 2: Challenges of ML

Welcome to my "Hands-on ML Series".

In Episode 1, we discussed Types of Machine Learning Systems.

Today, we will cover the Main Challenges of Machine Learning.

Main Challenges of Machine Learning 

In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data”.

Let’s start with examples of “bad data”. 

1. Insufficient Quantity of Training Data

Most ML algorithms need a lot of data to train on, even for simple problems, before you can rely on their predictions.

Complex problems such as image or speech recognition typically need far more data than that.

As a result, there is a trend, when it is possible, to spend time and money on gathering more data rather than on developing better algorithms.

This is a trade-off: algorithm development vs. data development.

Remember, though: it is not always easy (or cheap) to get more data.
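
One way to check whether more data is likely to help is to plot a learning curve: measure validation performance at increasing training-set sizes. Below is a minimal sketch, assuming scikit-learn; the digits dataset and logistic regression model are just stand-ins for illustration.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Cross-validated accuracy at increasing training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:5d} training examples -> validation accuracy {score:.3f}")

If the validation score is still climbing at the largest training size, gathering more data is probably worth the cost.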

2. Non-representative Training Data 

It is crucial to use a training set that is representative of the cases you want to generalize to. 

Think of it like teaching: a fair teacher covers in class nearly all the kinds of cases the students will later be tested on.
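
A simple, concrete guard against one form of non-representativeness is stratified sampling, which preserves class proportions when splitting data. A minimal sketch, assuming scikit-learn (the iris dataset is just a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both splits,
# so the training set stays representative of the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)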

3. Poor-Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well.

 It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that! 
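
As a concrete illustration, here is a minimal cleaning sketch, assuming pandas; the DataFrame and its "income" column are hypothetical.

import pandas as pd

df = pd.DataFrame({"income": [35_000, 42_000, 38_000, -1, 9_900_000, None]})

df = df.dropna(subset=["income"])  # drop missing values
df = df[df["income"] >= 0]         # remove impossible (error) values

# Drop extreme outliers using the interquartile-range rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df)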

You may have heard of the famous talk by Andrew Ng: From Model-centric to Data-centric AI. In it, he stresses the importance of improving data quality instead of always going for model improvements. You can also get the gist of the talk in just 3 minutes from this short article: Model-centric vs Data-centric View in the age of AI.

4. Irrelevant Features

As the saying goes: garbage in, garbage out.

A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves the following steps (a short code sketch follows the list):

• Feature selection (selecting the most useful features to train on among existing features)

• Feature extraction (combining existing features to produce a more useful one; dimensionality reduction algorithms can help)

• Creating new features by gathering new data
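
Here is a minimal sketch of the first two steps, assuming scikit-learn; the wine dataset and the values k=5 and n_components=3 are purely illustrative.

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)  # 13 original features

# Feature selection: keep the 5 features most related to the target.
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Feature extraction: combine features into 3 principal components.
X_extracted = PCA(n_components=3).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)  # (178, 13) (178, 5) (178, 3)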

Now that we have looked at many examples of bad data, let’s look at a couple of examples of bad algorithms.

5. Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well. 

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Here are possible solutions:

• Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.

• Gather more training data.

• Reduce the noise in the training data (e.g., fix data errors and remove outliers).

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. The amount of regularization to apply is controlled by a hyperparameter.
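
For example, in scikit-learn's Ridge regression the hyperparameter alpha sets the regularization strength (larger alpha means a more constrained, simpler model). A minimal sketch; the diabetes dataset and alpha values are just for illustration.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Compare cross-validated R^2 at different regularization strengths.
for alpha in [0.01, 1.0, 100.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:>6}: mean CV R^2 = {score:.3f}")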

You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it will generalize well. 

6. Underfitting the Training Data

As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data.

Here are the main options for fixing this problem (a quick sketch follows the list):

• Select a more powerful model, with more parameters.

• Feed better features to the learning algorithm (feature engineering).

• Reduce the constraints on the model (e.g., reduce the regularization hyperparameter).
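
As a quick illustration of the first option: a straight line underfits quadratic data, while adding polynomial features gives the model enough capacity to capture the curve. A minimal sketch, assuming scikit-learn and synthetic data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(0, 0.5, size=200)

linear = LinearRegression().fit(X, y)  # too simple for this data
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))  # underfits
print("poly   R^2:", round(poly.score(X, y), 3))    # captures the curve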


This is the end of "Hands-on ML Series: Episode 2: Challenges of ML"

I hope you found it useful, and any feedback is appreciated.
