Types of cross-validation techniques

Cross-validation (CV) divides the training data into a few parts: we train the model on some of these parts and test it on the remaining ones. CV is one of the most important steps in building a machine learning model, since it helps ensure that the model fits the data well and generalizes to unseen data.

There are a few types of cross-validation techniques that are popular and widely used.

These include:

k-fold cross-validation: Divide the data into k different sets that are mutually exclusive.

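As a minimal sketch, this is how plain k-fold splitting looks with scikit-learn's KFold; the tiny synthetic arrays here are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each
y = np.arange(10)

# 5 mutually exclusive folds; shuffling first is usually a good idea.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
splits = list(kf.split(X))
for fold, (train_idx, val_idx) in enumerate(splits):
    # Each of the 5 folds holds out 2 of the 10 samples for validation.
    print(f"fold {fold}: train size={len(train_idx)}, val size={len(val_idx)}")
```

Each sample appears in exactly one validation fold, so every data point is used for both training and evaluation across the full run.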

Stratified k-fold cross-validation: This is typically recommended when we have a skewed dataset for binary classification. Consider the scenario where we have 95% positive samples and only 5% negative samples. Using simple k-fold cross-validation on a dataset like this can produce folds containing only positive samples. Stratified k-fold cross-validation keeps the ratio of labels in each fold constant, so in each fold you will have the same 95% positive and 5% negative samples.

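A minimal sketch of the 95/5 scenario above with scikit-learn's StratifiedKFold; the labels are synthetic, built to match the ratio in the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 95 + [0] * 5)   # 95% positive, 5% negative labels
X = np.zeros((100, 1))             # dummy features for illustration

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(skf.split(X, y))
for train_idx, val_idx in splits:
    # Every validation fold of 20 samples keeps 19 positives and 1 negative,
    # preserving the overall 95/5 label ratio.
    print(y[val_idx].sum(), len(val_idx))
```

With plain KFold on the same data, a fold could easily end up with zero negatives, which would make metrics like recall on the minority class meaningless.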

Hold-out based validation: This is recommended when we have a large amount of data. We randomly divide the data into three partitions: a training set used to fit the model, a validation set used to tune it, and a test set held back for the final evaluation.

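One common way to sketch this is two chained calls to scikit-learn's train_test_split; the 60/20/20 proportions below are an illustrative assumption, not a prescription from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(500, 2)  # 500 toy samples
y = np.arange(500)

# First carve out 20% as the final test partition...
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ...then split the remainder 75/25 so validation is 20% of the original.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```

Because each sample lands in exactly one partition, this is much cheaper than k-fold, which is why it suits large datasets where a single held-out set is already representative.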

Leave-one-out cross-validation: This procedure is appropriate when you have a small dataset, or when an accurate estimate of model performance matters more than the computational cost. It is k-fold cross-validation with k = N, where N is the number of samples in the dataset. This means that in every fold we train on all samples except one, so the number of folds equals the number of samples.

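A minimal sketch with scikit-learn's LeaveOneOut on a deliberately tiny synthetic dataset (the N = 8 here is an illustrative assumption):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(16).reshape(8, 2)  # a small dataset: N = 8 samples

loo = LeaveOneOut()
splits = list(loo.split(X))
# The number of folds equals N; each fold trains on N-1 samples
# and validates on the single remaining one.
print(len(splits))
```

The cost grows linearly with N (one model fit per sample), which is why this only pays off on small datasets.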

Group k-fold cross-validation: GroupKFold is a variation of k-fold that ensures the same group is not represented in both the training and validation sets. For example, suppose we want to build a model to detect lung cancer from patients' lung images: a binary classifier that takes an input image and predicts the probability of it being benign or malignant. In these kinds of datasets, you might have multiple images of the same patient in the training data. So, to build a good cross-validation system here, you must have stratified k-folds, but you must also make sure that patients in the training data do not appear in the validation data.

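A minimal sketch of the patient scenario with scikit-learn's GroupKFold; the patient_id array and the two-images-per-patient layout are made-up illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((12, 1))                      # 12 dummy "lung images"
y = np.array([0, 1] * 6)                   # benign / malignant labels
patient_id = np.array([1, 1, 2, 2, 3, 3,
                       4, 4, 5, 5, 6, 6])  # 2 images per patient

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, y, groups=patient_id))
for train_idx, val_idx in splits:
    # No patient's images straddle the train/validation boundary.
    print(sorted(set(patient_id[val_idx])))
```

Note that plain GroupKFold handles only the grouping constraint; scikit-learn also provides StratifiedGroupKFold for when you need the label ratio preserved and the groups kept apart at the same time, as the paragraph above suggests.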

One thing to note: all of the cross-validation techniques mentioned above can be used for regression problems as-is, except stratified k-fold. Mostly, simple k-fold cross-validation works for any regression problem. However, if you see that the distribution of targets is not consistent across folds, you can use stratified k-fold. To use stratified k-fold for a regression problem, we first have to divide the target into bins, and then we can use stratified k-fold in the same way as for classification problems. There are several choices for selecting the appropriate number of bins. If you have a lot of samples (> 10k, > 100k), you don't need to worry much about the number of bins; just divide the data into 10 or 20 bins. If you do not have a lot of samples, you can use a simple rule like Sturges' rule to calculate an appropriate number of bins.

Sturges' rule:

Number of bins = 1 + log₂(N), where N is the number of samples in your dataset.
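The binning idea above can be sketched as follows; this is one reasonable implementation, not the only one, and the synthetic target, the quantile-based bin edges, and the choice of 5 splits are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
y = rng.normal(size=200)            # continuous regression target
X = rng.normal(size=(200, 3))       # dummy features

# Sturges' rule: number of bins = 1 + log2(N), rounded down.
num_bins = int(np.floor(1 + np.log2(len(y))))

# Equal-frequency (quantile) bin edges keep every bin well populated,
# which stratified k-fold needs; digitize assigns each target a bin label.
edges = np.quantile(y, np.linspace(0, 1, num_bins + 1))
bins = np.digitize(y, edges[1:-1])

# Stratify on the bin labels exactly as in classification.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(skf.split(X, bins))
for train_idx, val_idx in splits:
    pass  # fit a regressor on train_idx, evaluate on val_idx
```

With N = 200, Sturges' rule gives 8 bins, and each validation fold then sees roughly the same spread of target values as the training folds.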

Cross-validation is the first and most essential step for any machine learning model. If you want to do feature engineering, split your data first. If you are going to build models, split your data first. With a good cross-validation scheme, in which the validation data is representative of the training and real-world data, you will be able to build a machine learning model that is highly generalizable.
