WHY IS SPLITTING DATA IMPORTANT?

I want to talk about why splitting data is essential in Machine Learning.

Model evaluation and validation are critical in supervised learning.

OK, what is supervised learning? Supervised learning is a popular, widely applicable type of ML. Its name comes from the need for a supervisor: someone who can show the correct answers.

A supervised algorithm needs to learn by example. Essentially, it needs a teacher who uses training data to help it determine the patterns and relationships between the inputs and outputs. It is like a dad who teaches his kids: “Here in this picture is a car. Here is a car in another picture.” The model is trained on this labeled data so that it can accurately identify where a car is in a new picture it hasn’t seen before.

Let’s come back to evaluation and validation. To summarize: in Machine Learning, a model has to generalize well, and if we want to evaluate the predictive performance, the process must be unbiased.

In supervised learning, we create models that map the inputs (independent variables or predictors) to the given outputs (dependent variables or responses). How we measure the accuracy of our model depends on the problem we want to solve.

It is crucial to understand that we need an unbiased evaluation to assess and validate the model’s predictive performance. That means we can’t evaluate the model with the same data used to train it. We need something fresh, something never seen before, so we need to split our dataset before using it.
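To see why the training data gives a biased picture, here is a minimal sketch. It assumes a hypothetical toy dataset (one input, a noisy 0/1 label) and a 1-nearest-neighbour model that simply memorizes its training examples. All function names are made up for this illustration.

```python
import random


def make_noisy_dataset(n, seed):
    """Hypothetical toy data: label is 1 when x > 0.5, with about 20% of labels flipped."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:  # label noise
            label = 1 - label
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the training example closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train, samples):
    """Share of samples whose label the 1-NN model predicts correctly."""
    correct = sum(predict_1nn(train, x) == y for x, y in samples)
    return correct / len(samples)


train = make_noisy_dataset(200, seed=0)
unseen = make_noisy_dataset(200, seed=1)  # fresh data the model never saw

train_accuracy = accuracy(train, train)    # each example finds itself as its nearest neighbour
unseen_accuracy = accuracy(train, unseen)  # honest estimate on new data

print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on unseen data:   {unseen_accuracy:.2f}")
```

The model looks perfect on the data it memorized, noise included, and noticeably worse on fresh data. Only the second number tells us how it would behave in practice.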

We usually split the data into three subsets: training set, validation set, and test set.

TRAINING SET (80% of our dataset): we’ll use this set to train or fit the model. Here we find the optimal weights or coefficients of, for example, a neural network, a linear regression, or a logistic regression.

VALIDATION SET (10% of our dataset): we’ll use this set for an unbiased evaluation of the model while we tune its hyperparameters. For example, we’ll experiment with different values to find the optimal number of neurons in a neural network.

TEST SET (10% of our dataset): we’ll use this set for an unbiased evaluation of the final model, the one that is ready for the production environment.
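The roles of the three sets can be sketched as follows. This is only an illustration under stated assumptions: the data is a hypothetical toy dataset, the model is a simple k-nearest-neighbours classifier, and the hyperparameter we tune is k (instead of the number of neurons). The function names are made up for this example.

```python
import random


def make_dataset(n, seed):
    """Hypothetical toy data: label is 1 when x > 0.5, with about 20% of labels flipped."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data


def predict_knn(train, x, k):
    """Majority vote among the k training examples closest to x (k is odd, so no ties)."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return int(2 * votes > k)


def accuracy(train, samples, k):
    """Share of samples that the k-NN model predicts correctly."""
    correct = sum(predict_knn(train, x, k) == y for x, y in samples)
    return correct / len(samples)


data = make_dataset(500, seed=42)  # generated in random order, so slicing is a random split
train_set = data[:400]             # 80%: used to fit the model
validation_set = data[400:450]     # 10%: used to choose the hyperparameter
test_set = data[450:]              # 10%: used once, for the final evaluation

# Hyperparameter tuning: try several values of k and score each on the validation set.
candidate_ks = [1, 5, 15, 31]
validation_scores = {k: accuracy(train_set, validation_set, k) for k in candidate_ks}
best_k = max(validation_scores, key=validation_scores.get)

# Final, unbiased evaluation: the test set played no part in choosing best_k.
test_accuracy = accuracy(train_set, test_set, best_k)

print("validation accuracy per k:", validation_scores)
print("chosen k:", best_k)
print(f"test accuracy: {test_accuracy:.2f}")
```

The validation score of the chosen k is slightly optimistic, because we picked that k precisely for scoring well there. That is why the final number comes from the test set.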

[An 80%/10%/10% split is a common starting point, but other proportions, such as 70%/15%/15%, are also used. The right choice depends on how much data we have: with a very large dataset, even a small percentage gives enough validation and test examples.]
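A split with these proportions can be sketched as a small helper. This is a minimal version that uses only the Python standard library. The name `split_dataset` and its parameters are assumptions made for this example, not part of any library.

```python
import random


def split_dataset(samples, train_fraction=0.8, validation_fraction=0.1, seed=0):
    """Shuffle the samples and cut them into training, validation, and test sets.

    The test set receives whatever is left after the first two sets.
    """
    if train_fraction <= 0 or validation_fraction < 0:
        raise ValueError("fractions must be positive")
    if train_fraction + validation_fraction >= 1:
        raise ValueError("fractions must leave room for a test set")

    shuffled = list(samples)               # copy, so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible

    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


samples = list(range(1000))  # stand-in for 1000 labeled examples

# Default 80% / 10% / 10% split.
train, validation, test = split_dataset(samples)
print(len(train), len(validation), len(test))

# Alternative 70% / 15% / 15% split.
train_70, validation_15, test_15 = split_dataset(samples, 0.7, 0.15)
print(len(train_70), len(validation_15), len(test_15))
```

Shuffling before cutting matters when the data is stored in some order, for example sorted by date or by label. Without it, the three sets could contain systematically different examples.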

Author Fabrizio Angelelli
