登录查看更多内容

WHY SPLITTING DATA IS IMPORTANT?

Virtualmente

IT's easier with us

发布日期: 2022年3月16日

+ 关注

I want to talk about why splitting data is essential in Machine Learning.

Model evaluation and validation are critical in supervised learning.

Ok, what Supervised Learning is? Supervised learning is a popular type of ML; it’s widely applicable. Its name comes from the need for a supervisor – someone who can show the correct answers.

A Supervised algorithm needs to learn by example. Essentially, it needs a teacher who uses training data to help it determine the patterns and relationships between the inputs and outputs. Something like a dad that teaches his kids, “Here in this picture is a car. Here is a car in another picture.” The model is trained on this labeled data to accurately identify where a car is in a new picture it hasn’t seen before.

Let’s come back to evaluation and validation. To summarize: in Machine Learning, a model has to generalize well, and if we want to evaluate the predictive performance, the process must be unbiased.

In supervised learning, we create models that map the inputs (independent variables or predictors) to the given outputs (dependent variables or responses). How to measure the accuracy of our model depends on the problem we want to solve.

It is crucial to understand that we need an unbiased evaluation to assess the model's predictive performance and validate it so we can't evaluate the predictive performance with the same data used to train the model. We need something fresh, something never seen before, and we need to split our dataset before using it.

领英推荐

Machine Learning for Product Managers

DornerWorks 2 年前

Exploring Hyperparameter Tuning in Machine Learning:…

Codalien Technologies 10 个月前

Training Personal Language Models

Vault Security 1 年前

We use to split data into three subsets: Training set, Validation test, Test set.

TRAINING SET (80% of our dataset): we’ll use this set to train or fit the model. Here we can find the optimal weights for neural network, linear regression or logistic regression.

VALIDATION SET (10% of our dataset): we’ll use this set for unbiased the model evaluation when we perform hyperparameters tuning. For example we’ll experiment with different values to find the optimal number of neurons in a neural network.

TEST SET (10% of our dataset): we’ll perform on this set an unbiased evaluation of the final model (ready for production environment).

[80%, 10%, and 10% is a good methodology in data splitting but, we could use other "sizes" such as 70%, 15%, and 15% if our data set is huge (for example).]

WHY SPLITTING DATA IS IMPORTANT?

Virtualmente

IT's easier with us

领英推荐

Virtualmente的更多文章

社区洞察

其他会员也浏览了

Data Scientist interview questions and answers based on my experience ( Understand and Read for never getting rejected)

IRIS CLASSIFICATION

What is Information Retrieval (IR) in Machine Learning?

Handwritten Text Recognition using Deep Learning (CNN & RNN)

Understanding the variance of Variational Autoencoders

Artificial Intelligence basics for Non-techies - 50 Simplified terms

Chapter 3: Model Development and Training Phase of a Machine Learning Project

Teaching a Machine to Learn Command Voices

Deep learning can transform manufacturing

What is Machine Learning?

领英推荐

Virtualmente的更多文章

Collect your network data in a more granular way using Cisco Flexible Netflow.

Amazon Kinesis

社区洞察

其他会员也浏览了

Data Scientist interview questions and answers based on my experience ( Understand and Read for never getting rejected)

IRIS CLASSIFICATION

What is Information Retrieval (IR) in Machine Learning?

Handwritten Text Recognition using Deep Learning (CNN & RNN)

Understanding the variance of Variational Autoencoders

Artificial Intelligence basics for Non-techies - 50 Simplified terms

Chapter 3: Model Development and Training Phase of a Machine Learning Project

Teaching a Machine to Learn Command Voices

Deep learning can transform manufacturing

What is Machine Learning?