Model Training - K Fold Cross Validation
TLDR
Learn how to split your data for training and testing your machine learning models with K Fold Cross Validation.
Glossary
Definition
K-fold cross-validation is a data partitioning technique that splits an entire dataset into k groups. Then, we train and test k different models using different combinations of the groups we just partitioned, and use the results from these k models to check the model’s overall performance and ability to generalize.
In the context of machine learning, a fold is a set of rows in a dataset. We use k to describe the number of groups we decide to partition the data into, so in an example of 20 rows, we can split them into 2 folds with 10 rows each, 4 folds with 5 rows each, or 10 folds with 2 rows each.
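As a quick illustration (the 20 rows below are just made-up indices, not part of any real dataset), here is one way to see those fold sizes in code:

import numpy as np

# 20 made-up row indices standing in for a 20-row dataset
rows = np.arange(20)

# Split into 4 folds of 5 rows each; changing the second argument
# gives the other splits mentioned above (2 folds of 10, 10 folds of 2)
folds = np.array_split(rows, 4)
for i, fold in enumerate(folds, start=1):
    print(f"Fold {i}: {fold}")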
A simple explanation of how k-fold cross validation scores a model’s performance follows in the conceptual example below.
Conceptual example
To improve your understanding twice-fold, consider this analogy about k-fold cross validation with Twice, a K-pop girl group. Say we are trying to see how well a model can dance by inviting different subsets of Twice girls (called folds) as training and test samples.
Source: Twice Official Twitter
If the entire dataset has 9 girls, which are our data points, then we need to manually choose how many folds to split our data into. I’m going with 3 for our example, but there are strategies to pick the best k.
Since we need an equal amount of data in each fold, we randomly pick 3 girls from Twice for each of the three folds, with no overlaps:
With these 3 folds, we will train and evaluate 3 models (because we picked k=3), training each on 2 folds (k-1 folds) and using the remaining 1 as a test set. We pick a different combination of folds for each of the 3 models we’re evaluating (a quick code sketch of these combinations follows the list below):
Model 1: Trained on Fold 1 + Fold 2, Tested on Fold 3
Model 2: Trained on Fold 2 + Fold 3, Tested on Fold 1
Model 3: Trained on Fold 1 + Fold 3, Tested on Fold 2
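To make those train/test combinations concrete, here is a minimal sketch using scikit-learn’s KFold on 9 data points. The member names are just illustrative labels, and the exact fold assignments depend on the shuffle:

import numpy as np
from sklearn.model_selection import KFold

# 9 illustrative data points, one per Twice member
members = np.array([
    "Nayeon", "Jeongyeon", "Momo", "Sana", "Jihyo",
    "Mina", "Dahyun", "Chaeyoung", "Tzuyu",
])

# k=3 folds of 3 members each
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
for i, (train_idx, test_idx) in enumerate(kfold.split(members), start=1):
    print(f"Model {i}: trained on {members[train_idx]}, tested on {members[test_idx]}")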
The performance scores would get skewed if the same Twice girls who taught you how to dance were also your judges. So whichever six girls (data points) the model learns from, the remaining three girls judge and score it.
Now that we have 3 models and their scores, we can choose a model evaluation method (discussed in another lesson) to determine, generally, whether this model dances well. This also ensures that, in one metric, the opinions of all 9 judges/test samples are included.
The resulting evaluation metric would tell us whether we did a good job at dancing. So did we do a good job?
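For instance, one common way to roll the fold results into a single metric is to average them. The scores below are made-up numbers, purely to show the arithmetic:

import numpy as np

# Hypothetical accuracy scores from the 3 fold models (not real results)
fold_scores = [0.82, 0.76, 0.88]

# The cross-validation score is typically reported as the mean across folds
print("Mean cross-validation score:", np.mean(fold_scores))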
Nayeon is rooting for you!
How to code
Let’s try to evaluate how well a model learns to predict whether customers of a tourism company flake on their plans, using Tejashvi’s dataset. Maybe this model could tell us whether we’d follow through with our dreams of vacationing overseas this year, too?
import pandas as pd

# Load the customer travel dataset
df = pd.read_csv("Customertravel.csv")

# Preview the data frame
df
Since scikit-learn works with numpy arrays, we’d first use Pandas to convert our data frame into a numpy array.
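The splitting code below refers to a variable named np_array; a minimal sketch of that conversion, building on the df loaded above, would be:

# Convert the pandas DataFrame into a plain numpy array for KFold
np_array = df.to_numpy()
print(np_array.shape)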
Then, we can use the “KFold” class to configure our evaluation. Our next step is to choose the number of folds to split our rows of data into. Above, we can see that our dataset has 954 rows, which divides nicely into 9 folds with 106 rows of data each.
This means we’d build and evaluate 9 models in total, each trained on 8 folds and scored on the remaining 1.
from sklearn.model_selection import KFold

# shuffle + random_state: shuffle the data before splitting into folds
kfold = KFold(n_splits=9, shuffle=True, random_state=1)

model = 1
# displaying indices for the rows that will be used for training/testing
for train, test in kfold.split(np_array):
    print('Model #%d:' % model)
    print('train: %s, test: %s' % (train, test))
    model = model + 1
Now that we’re done splitting our data into 9 folds, we’re ready to continue on to the next lesson: evaluating the model!
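If you’d like a preview of how those folds feed into evaluation, here is a minimal sketch. It assumes the label column in Customertravel.csv is named "Target" and uses a simple decision tree; one-hot encoding via get_dummies is a stand-in for whatever preprocessing the dataset actually needs:

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumption: "Target" is the label column; one-hot encode the rest
X = pd.get_dummies(df.drop(columns=["Target"])).to_numpy()
y = df["Target"].to_numpy()

kfold = KFold(n_splits=9, shuffle=True, random_state=1)
scores = []
for train, test in kfold.split(X):
    clf = DecisionTreeClassifier(random_state=1)
    clf.fit(X[train], y[train])  # train on 8 folds
    scores.append(accuracy_score(y[test], clf.predict(X[test])))  # score on the held-out fold

print("Per-fold scores:", scores)
print("Mean score:", sum(scores) / len(scores))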
Magical no-code solution
To skip all those configuration steps for K-fold cross validation, Mage provides an easy, no-code experience for training and testing a dataset. Although we, as users, can’t customize how the data is split, Mage uses an algorithm to decide the split for us. For this dataset, Mage decided on approximately a 9:1 training-to-testing split.
You can find further details about the training/test split under “Review > Statistics” on our Mage web application.
Want to learn more about machine learning (ML)? Visit Mage Academy!