Hands-on ML Series: Episode 2: Challenges of ML

Welcome to my "Hands-on ML Series".

In Episode 1, we discussed Types of Machine Learning Systems.

Today, we will cover the Main Challenges of Machine Learning.

Main Challenges of Machine Learning 

In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data”.

Let’s start with examples of “bad data”. 

1. Insufficient Quantity of Training Data

Most ML algorithms need a lot of data to train on, even for simple problems, before you can rely on their predictions.

Complex problems such as image or speech recognition typically need far more data than that.

As a result, there is a trend, when it is possible, to spend time and money on gathering more data rather than on developing better algorithms.

This is a trade-off: algorithm development vs. data development.

Remember, though: it is not always easy (or cheap) to get more data.
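
One way to check whether more data is likely to help is to plot a learning curve: measure validation performance at increasing training-set sizes. Below is a minimal sketch, assuming scikit-learn; the digits dataset and logistic regression model are just stand-ins for illustration.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Cross-validated accuracy at increasing training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:5d} training examples -> validation accuracy {score:.3f}")

If the validation score is still climbing at the largest training size, gathering more data is probably worth the cost.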

2. Non-representative Training Data 

It is crucial to use a training set that is representative of the cases you want to generalize to. 

Think of it like teaching: a fair teacher covers in class nearly all the kinds of cases the students will later be tested on.
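
A simple, concrete guard against one form of non-representativeness is stratified sampling, which preserves class proportions when splitting data. A minimal sketch, assuming scikit-learn (the iris dataset is just a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both splits,
# so the training set stays representative of the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)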

3. Poor-Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well.

 It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that! 
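
As a concrete illustration, here is a minimal cleaning sketch, assuming pandas; the DataFrame and its "income" column are hypothetical.

import pandas as pd

df = pd.DataFrame({"income": [35_000, 42_000, 38_000, -1, 9_900_000, None]})

df = df.dropna(subset=["income"])  # drop missing values
df = df[df["income"] >= 0]         # remove impossible (error) values

# Drop extreme outliers using the interquartile-range rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df)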

You may have heard of the famous talk by Andrew Ng: From Model-centric to Data-centric AI. In it, he stresses the importance of improving data quality instead of always going for model improvements. You can also get the gist of the talk in just 3 minutes from this short article: Model-centric vs Data-centric View in the age of AI.

4. Irrelevant Features

As the saying goes: garbage in, garbage out.

A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves the following steps (a short code sketch follows the list):

• Feature selection (selecting the most useful features to train on among existing features)

• Feature extraction (combining existing features to produce a more useful one; dimensionality reduction algorithms can help)

• Creating new features by gathering new data
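
Here is a minimal sketch of the first two steps, assuming scikit-learn; the wine dataset and the values k=5 and n_components=3 are purely illustrative.

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)  # 13 original features

# Feature selection: keep the 5 features most related to the target.
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Feature extraction: combine features into 3 principal components.
X_extracted = PCA(n_components=3).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)  # (178, 13) (178, 5) (178, 3)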

Now that we have looked at many examples of bad data, let’s look at a couple of examples of bad algorithms.

5. Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well. 

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Here are possible solutions:

• Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.

• Gather more training data.

• Reduce the noise in the training data (e.g., fix data errors and remove outliers).

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. The amount of regularization to apply is controlled by a hyperparameter.
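
For example, in scikit-learn's Ridge regression the hyperparameter alpha sets the regularization strength (larger alpha means a more constrained, simpler model). A minimal sketch; the diabetes dataset and alpha values are just for illustration.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Compare cross-validated R^2 at different regularization strengths.
for alpha in [0.01, 1.0, 100.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:>6}: mean CV R^2 = {score:.3f}")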

You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it will generalize well. 

6. Underfitting the Training Data

As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data.

Here are the main options for fixing this problem (a quick sketch follows the list):

• Select a more powerful model, with more parameters.

• Feed better features to the learning algorithm (feature engineering).

• Reduce the constraints on the model (e.g., reduce the regularization hyperparameter).
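
As a quick illustration of the first option: a straight line underfits quadratic data, while adding polynomial features gives the model enough capacity to capture the curve. A minimal sketch, assuming scikit-learn and synthetic data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(0, 0.5, size=200)

linear = LinearRegression().fit(X, y)  # too simple for this data
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))  # underfits
print("poly   R^2:", round(poly.score(X, y), 3))    # captures the curve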


This is the end of "Hands-on ML Series: Episode 2: Challenges of ML"

I hope you found it useful, and any feedback is appreciated.
