Model Training - K Fold Cross Validation
TLDR
Learn how to split your data for training and testing your machine learning models with K Fold Cross Validation.
Glossary
Definition
K-fold cross-validation is a data partitioning technique that splits an entire dataset into k groups. Then, we train and test k different models using different combinations of the groups we just partitioned, and use the results from these k models to check the model’s overall performance and ability to generalize.
In the context of machine learning, a fold is a set of rows in a dataset. We use k to describe the number of groups we decide to partition the data into, so in an example of 20 rows, we can split them into 2 folds with 10 rows each, 4 folds with 5 rows each, or 10 folds with 2 rows each.
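As a quick illustration (the 20 rows below are just made-up indices, not part of any real dataset), here is one way to see those fold sizes in code:

import numpy as np

# 20 made-up row indices standing in for a 20-row dataset
rows = np.arange(20)

# Split into 4 folds of 5 rows each; changing the second argument
# gives the other splits mentioned above (2 folds of 10, 10 folds of 2)
folds = np.array_split(rows, 4)
for i, fold in enumerate(folds, start=1):
    print(f"Fold {i}: {fold}")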
A simple explanation of how k-fold cross validation scores a model’s performance follows in the conceptual example below.
Conceptual example
To improve your understanding twice-fold, consider this analogy about k-fold cross validation with Twice, a K-pop girl group. Say we are trying to see how well a model can dance by inviting different subsets of Twice girls (called folds) as training and test samples.
Source: Twice Official Twitter
If the entire dataset has 9 girls, which are our data points, then we need to manually choose how many folds to split our data into. I’m going with 3 for our example, but there are strategies to pick the best k.
Since we need an equal amount of data in each fold, we randomly pick 3 girls from Twice for each of the three folds, with no overlaps:
With these 3 folds, we will train and evaluate 3 models (because we picked k=3), training each on 2 folds (k-1 folds) and using the remaining 1 as a test set. We pick a different combination of folds for each of the 3 models we’re evaluating (a quick code sketch of these combinations follows the list below):
Model 1: Trained on Fold 1 + Fold 2, Tested on Fold 3
Model 2: Trained on Fold 2 + Fold 3, Tested on Fold 1
Model 3: Trained on Fold 1 + Fold 3, Tested on Fold 2
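To make those train/test combinations concrete, here is a minimal sketch using scikit-learn’s KFold on 9 data points. The member names are just illustrative labels, and the exact fold assignments depend on the shuffle:

import numpy as np
from sklearn.model_selection import KFold

# 9 illustrative data points, one per Twice member
members = np.array([
    "Nayeon", "Jeongyeon", "Momo", "Sana", "Jihyo",
    "Mina", "Dahyun", "Chaeyoung", "Tzuyu",
])

# k=3 folds of 3 members each
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
for i, (train_idx, test_idx) in enumerate(kfold.split(members), start=1):
    print(f"Model {i}: trained on {members[train_idx]}, tested on {members[test_idx]}")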
The performance scores would get skewed if the same Twice girls who taught you how to dance were also your judges. So whichever six girls (data points) the model learns from, the remaining three girls judge and score it.
Now that we have 3 models and their scores, we can choose a model evaluation method (discussed in another lesson) to determine, generally, whether this model dances well. This also ensures that, in one metric, the opinions of all 9 judges/test samples are included.
The resulting evaluation metric would tell us whether we did a good job at dancing. So did we do a good job?
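For instance, one common way to roll the fold results into a single metric is to average them. The scores below are made-up numbers, purely to show the arithmetic:

import numpy as np

# Hypothetical accuracy scores from the 3 fold models (not real results)
fold_scores = [0.82, 0.76, 0.88]

# The cross-validation score is typically reported as the mean across folds
print("Mean cross-validation score:", np.mean(fold_scores))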
Nayeon is rooting for you!
How to code
Let’s try to evaluate how well a model learns to predict whether customers of a tourism company flake on their plans, using Tejashvi’s dataset. Maybe this model could tell us whether we’d follow through with our dreams of vacationing overseas this year, too?
import pandas as pd

# Load the customer travel dataset
df = pd.read_csv("Customertravel.csv")

# Preview the data frame
df
Since scikit-learn works with numpy arrays, we’d first use Pandas to convert our data frame into a numpy array.
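The splitting code below refers to a variable named np_array; a minimal sketch of that conversion, building on the df loaded above, would be:

# Convert the pandas DataFrame into a plain numpy array for KFold
np_array = df.to_numpy()
print(np_array.shape)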
Then, we can use the “KFold” class to configure our evaluation. Our next step is to choose the number of folds to split our rows of data into. Above, we can see that our dataset has 954 rows, which divides nicely into 9 folds with 106 rows of data each.
This means we’d build and evaluate 9 models in total, each trained on 8 folds and scored on the remaining 1.
from sklearn.model_selection import KFold

# shuffle + random_state: shuffle the data before splitting into folds
kfold = KFold(n_splits=9, shuffle=True, random_state=1)

model = 1
# displaying indices for the rows that will be used for training/testing
for train, test in kfold.split(np_array):
    print('Model #%d:' % model)
    print('train: %s, test: %s' % (train, test))
    model = model + 1
Now that we’re done splitting our data into 9 folds, we’re ready to continue on to the next lesson: evaluating the model!
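If you’d like a preview of how those folds feed into evaluation, here is a minimal sketch. It assumes the label column in Customertravel.csv is named "Target" and uses a simple decision tree; one-hot encoding via get_dummies is a stand-in for whatever preprocessing the dataset actually needs:

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumption: "Target" is the label column; one-hot encode the rest
X = pd.get_dummies(df.drop(columns=["Target"])).to_numpy()
y = df["Target"].to_numpy()

kfold = KFold(n_splits=9, shuffle=True, random_state=1)
scores = []
for train, test in kfold.split(X):
    clf = DecisionTreeClassifier(random_state=1)
    clf.fit(X[train], y[train])  # train on 8 folds
    scores.append(accuracy_score(y[test], clf.predict(X[test])))  # score on the held-out fold

print("Per-fold scores:", scores)
print("Mean score:", sum(scores) / len(scores))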
Magical no-code solution
To skip all those configuration steps for K-fold cross validation, Mage provides an easy, no-code experience for training and testing a dataset. Although we, as users, can’t customize how the data is split, Mage uses an algorithm to decide the split for us. For this dataset, Mage decided on approximately a 9:1 training-to-testing split.
You can find further details about the training/test split under “Review > Statistics” on our Mage web application.
Want to learn more about machine learning (ML)? Visit Mage Academy!