Types of cross-validation techniques

Cross-validation (CV) divides the training data into a few parts: we train the model on some of these parts and test it on the remaining ones. CV is one of the most important steps in building a machine learning model, since it helps ensure that the model fits the data well and generalizes to unseen data.

There are a few types of cross-validation techniques that are popular and widely used.

These include:

k-fold cross-validation: Divide the data into k different sets that are mutually exclusive.

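As a minimal sketch, this is how plain k-fold splitting looks with scikit-learn's KFold; the tiny synthetic arrays here are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each
y = np.arange(10)

# 5 mutually exclusive folds; shuffling first is usually a good idea.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
splits = list(kf.split(X))
for fold, (train_idx, val_idx) in enumerate(splits):
    # Each of the 5 folds holds out 2 of the 10 samples for validation.
    print(f"fold {fold}: train size={len(train_idx)}, val size={len(val_idx)}")
```

Each sample appears in exactly one validation fold, so every data point is used for both training and evaluation across the full run.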

Stratified k-fold cross-validation: This is typically recommended when we have a skewed dataset for binary classification. Consider the scenario where we have 95% positive samples and only 5% negative samples. Using simple k-fold cross-validation on a dataset like this can produce folds containing only positive samples. Stratified k-fold cross-validation keeps the ratio of labels in each fold constant, so in each fold you will have the same 95% positive and 5% negative samples.

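A minimal sketch of the 95/5 scenario above with scikit-learn's StratifiedKFold; the labels are synthetic, built to match the ratio in the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 95 + [0] * 5)   # 95% positive, 5% negative labels
X = np.zeros((100, 1))             # dummy features for illustration

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(skf.split(X, y))
for train_idx, val_idx in splits:
    # Every validation fold of 20 samples keeps 19 positives and 1 negative,
    # preserving the overall 95/5 label ratio.
    print(y[val_idx].sum(), len(val_idx))
```

With plain KFold on the same data, a fold could easily end up with zero negatives, which would make metrics like recall on the minority class meaningless.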

Hold-out based validation: This is recommended when we have a large amount of data. We randomly divide the data into three partitions: a training set used to fit the model, a validation set used to tune it, and a test set held back for the final evaluation.

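One common way to sketch this is two chained calls to scikit-learn's train_test_split; the 60/20/20 proportions below are an illustrative assumption, not a prescription from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(500, 2)  # 500 toy samples
y = np.arange(500)

# First carve out 20% as the final test partition...
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ...then split the remainder 75/25 so validation is 20% of the original.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```

Because each sample lands in exactly one partition, this is much cheaper than k-fold, which is why it suits large datasets where a single held-out set is already representative.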

Leave-one-out cross-validation: This procedure is appropriate when you have a small dataset, or when an accurate estimate of model performance matters more than the computational cost. It is k-fold cross-validation with k = N, where N is the number of samples in the dataset. This means that in every fold we train on all samples except one, so the number of folds equals the number of samples.

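A minimal sketch with scikit-learn's LeaveOneOut on a deliberately tiny synthetic dataset (the N = 8 here is an illustrative assumption):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(16).reshape(8, 2)  # a small dataset: N = 8 samples

loo = LeaveOneOut()
splits = list(loo.split(X))
# The number of folds equals N; each fold trains on N-1 samples
# and validates on the single remaining one.
print(len(splits))
```

The cost grows linearly with N (one model fit per sample), which is why this only pays off on small datasets.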

Group k-fold cross-validation: GroupKFold is a variation of k-fold that ensures the same group is not represented in both the training and validation sets. For example, suppose we want to build a model to detect lung cancer from patients' lung images: a binary classifier that takes an input image and predicts the probability of it being benign or malignant. In these kinds of datasets, you might have multiple images of the same patient in the training data. So, to build a good cross-validation system here, you must have stratified k-folds, but you must also make sure that patients in the training data do not appear in the validation data.

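A minimal sketch of the patient scenario with scikit-learn's GroupKFold; the patient_id array and the two-images-per-patient layout are made-up illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((12, 1))                      # 12 dummy "lung images"
y = np.array([0, 1] * 6)                   # benign / malignant labels
patient_id = np.array([1, 1, 2, 2, 3, 3,
                       4, 4, 5, 5, 6, 6])  # 2 images per patient

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, y, groups=patient_id))
for train_idx, val_idx in splits:
    # No patient's images straddle the train/validation boundary.
    print(sorted(set(patient_id[val_idx])))
```

Note that plain GroupKFold handles only the grouping constraint; scikit-learn also provides StratifiedGroupKFold for when you need the label ratio preserved and the groups kept apart at the same time, as the paragraph above suggests.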

One thing to note: all of the cross-validation techniques mentioned above can be used for regression problems as-is, except stratified k-fold. Mostly, simple k-fold cross-validation works for any regression problem. However, if you see that the distribution of targets is not consistent across folds, you can use stratified k-fold. To use stratified k-fold for a regression problem, we first have to divide the target into bins, and then we can use stratified k-fold in the same way as for classification problems. There are several choices for selecting the appropriate number of bins. If you have a lot of samples (> 10k, > 100k), you don't need to worry much about the number of bins; just divide the data into 10 or 20 bins. If you do not have a lot of samples, you can use a simple rule like Sturges' rule to calculate an appropriate number of bins.

Sturges' rule:

Number of bins = 1 + log₂(N), where N is the number of samples in your dataset.
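The binning idea above can be sketched as follows; this is one reasonable implementation, not the only one, and the synthetic target, the quantile-based bin edges, and the choice of 5 splits are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
y = rng.normal(size=200)            # continuous regression target
X = rng.normal(size=(200, 3))       # dummy features

# Sturges' rule: number of bins = 1 + log2(N), rounded down.
num_bins = int(np.floor(1 + np.log2(len(y))))

# Equal-frequency (quantile) bin edges keep every bin well populated,
# which stratified k-fold needs; digitize assigns each target a bin label.
edges = np.quantile(y, np.linspace(0, 1, num_bins + 1))
bins = np.digitize(y, edges[1:-1])

# Stratify on the bin labels exactly as in classification.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(skf.split(X, bins))
for train_idx, val_idx in splits:
    pass  # fit a regressor on train_idx, evaluate on val_idx
```

With N = 200, Sturges' rule gives 8 bins, and each validation fold then sees roughly the same spread of target values as the training folds.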

Cross-validation is the first and most essential step for any machine learning model. If you want to do feature engineering, split your data first. If you are going to build models, split your data first. With a good cross-validation scheme, in which the validation data is representative of the training and real-world data, you will be able to build a machine learning model that is highly generalizable.
