Understanding Cross-Validation: Different Approaches
Rizwana Malik
AI Intern | Generative AI | Machine Learning Engineer | NLP | Transforming Data into Intelligent Solutions
Cross-validation is a statistical technique used in machine learning and data analysis to evaluate how well a model is able to generalize to new data.
In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first-seen data) against which the model is tested (called the validation dataset or testing set).
The basic idea is to divide the available data into two parts: a training set used to fit the model and a validation set used to evaluate it.
Types of CV:
1. K-fold cross-validation:
In k-fold cross-validation, the available data is divided into k equal parts or "folds". The model is then trained on k-1 of the folds and validated on the remaining fold. This process is repeated k times, with each fold being used once as the validation set. The results from each fold are then averaged to obtain an overall estimate of the model's performance.
For example, if we set the value k=5, the dataset will be divided into five equal parts. Following the general cross-validation procedure, the process will run five times, each time with a different holdout set.
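To make the k=5 case concrete, here is a minimal sketch that prints which rows land in each holdout set. The ten-sample toy array and the variable names are assumptions made for illustration only; they are not part of the original example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, indexed 0..9 (illustrative data, not from the article)
X_toy = np.arange(10).reshape(-1, 1)

# shuffle=False is the default, so each fold is a contiguous block of rows
kf = KFold(n_splits=5)

fold_sizes = []
seen_in_validation = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
    fold_sizes.append(len(val_idx))
    seen_in_validation.extend(val_idx.tolist())
```

Every sample ends up in the validation set exactly once across the five runs. Note that when cross_val_score is given an integer cv and a classifier, scikit-learn uses a stratified variant of k-fold rather than plain KFold, so the exact fold membership can differ from this sketch.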
Figure: The advantages and disadvantages of k-fold cross-validation.
Example using code:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
# Load the Iris dataset
iris = load_iris()
# Define the Gaussian Naive Bayes model
nb = GaussianNB()
# Perform k-fold cross-validation with k=5
scores = cross_val_score(nb, iris.data, iris.target, cv=5, scoring='accuracy')
# Print the scores for each fold and the mean score
print("Scores for each fold:", scores)
print("Mean score:", scores.mean())
print("Standard deviation:", scores.std())
K-fold Cross Validation vs. train_test split
K-fold cross-validation and train-test split are two popular techniques used in machine learning to evaluate the performance of a model. Here are some key differences between the two:
In k-fold cross-validation, the data is split into k equal parts or "folds". The model is trained on k-1 of the folds and validated on the remaining fold. This process is repeated k times, with each fold being used once as the validation set.
In contrast, train-test split divides the data into two parts: a training set and a testing set, typically with a ratio of 70-30 or 80-20. The model is trained on the training set and evaluated on the testing set.
K-fold cross-validation is often used when the dataset is relatively small, as it allows for better use of the available data.
In contrast, train-test split is typically used when the dataset is larger, as it is faster to implement and may be sufficient for evaluating the model's performance.
K-fold cross-validation usually provides a more reliable estimate of the model's performance, as it averages the scores from k different validation folds rather than relying on a single one. This helps to reduce the variance of the performance estimate and makes overfitting easier to detect.
In contrast, train-test split provides a less reliable estimate of the model's performance, as it depends on the specific subset of the data that happens to be used for testing.
K-fold cross-validation can be computationally expensive, as it requires training and validating the model k times.
In contrast, train-test split is faster to implement and requires training and validating the model only once.
Overall, k-fold cross-validation is a more robust and accurate technique for evaluating the performance of a machine learning model, especially when the dataset is relatively small.
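The point about a single split depending on the chosen subset can be checked with a small experiment. The sketch below is only an illustration under assumptions: it reuses the Iris data and Gaussian Naive Bayes from the earlier example, and the five random seeds are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Hold-out accuracy for several arbitrary random seeds (80-20 split)
holdout_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = GaussianNB().fit(X_train, y_train)
    holdout_scores.append(model.score(X_test, y_test))

# 5-fold cross-validation on the same data
cv_scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring="accuracy")

print("Hold-out accuracy per seed:", holdout_scores)
print("5-fold CV mean accuracy:", cv_scores.mean())
```

If the hold-out accuracy moves noticeably from seed to seed, that spread is the variance a single train-test split hides, while the cross-validated mean averages over it.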
2. Train-test split:
Train-test split is a faster and simpler technique that can be used when the dataset is larger and a quick estimate of the model's performance is needed.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# Load the tips dataset
tips = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
# Define the features and target variable
X = tips[['total_bill', 'tip', 'size']]
y = tips['sex']
# Define the Gaussian Naive Bayes model
model = GaussianNB()
# Perform k-fold cross-validation with k=5
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
# Print the scores for each fold and the mean score
print("Scores for each fold:", scores)
print("Mean score:", scores.mean())
print("Standard deviation:", scores.std())
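The code above runs 5-fold cross-validation on the tips data, so for comparison here is a minimal sketch of an actual train-test split. It assumes the Iris dataset (so it runs without a download), a 70-30 ratio, and an arbitrary random_state; the tips features X and y could be passed to train_test_split in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; random_state is an arbitrary seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = GaussianNB()
model.fit(X_train, y_train)      # the model is trained only once
y_pred = model.predict(X_test)   # and evaluated only once
test_accuracy = accuracy_score(y_test, y_pred)

print("Training rows:", len(X_train), "Test rows:", len(X_test))
print("Test accuracy:", test_accuracy)
```

The stratify=y argument keeps the class proportions the same in both parts, which is a design choice made here rather than something the original example specifies.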
Plot of k-fold CV:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Define the decision tree classifier
clf = DecisionTreeClassifier()
# Perform k-fold cross-validation with k=8
scores = cross_val_score(clf, X, y, cv=8)
# Plot the accuracy obtained on each of the 8 folds
plt.plot(range(1,9), scores, color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=12)
plt.title('K-Fold Cross-Validation Results')
plt.xlabel('Fold Number')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1.0])
plt.show()
3. Leave One Out Cross Validation (LOOCV)
This variation on cross-validation leaves one data point out of the training data. For instance, if there are n data points in the original data sample, then n-1 points are used to train the model and the one remaining point is used as the validation set.
This cycle is repeated for every way in which the original sample can be separated like this. After this, the mean of the error over all trials is taken to give the overall effectiveness.
The number of possible combinations is therefore equal to the number of data points in the original sample, n.
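As a rough illustration of LOOCV, the sketch below uses scikit-learn's LeaveOneOut splitter. The choice of the Iris data and Gaussian Naive Bayes is an assumption carried over from the earlier k-fold example, not something this section prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)  # one split per data point

# Each score is the accuracy on a single held-out point, so it is 0 or 1
scores = cross_val_score(GaussianNB(), X, y, cv=loo, scoring="accuracy")

print("Number of splits:", n_splits)
print("Mean accuracy:", scores.mean())
```

Because the model is refit once per data point, this gets expensive quickly on larger datasets.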
5. Time Series Cross-Validation:
Time Series Cross-Validation (TSCV) extends traditional cross-validation techniques to handle the temporal structure inherent in time series data. Unlike traditional cross-validation, where random data splits are used, TSCV preserves the temporal order of observations. It ensures that the model is trained on past data and tested on future data, mimicking real-world scenarios.
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
import numpy as np
# Load time series data
data = pd.read_csv('your_time_series_data.csv', parse_dates=['date_column'], index_col='date_column')
# Define number of splits
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)
Model Building and Evaluation
# Initialize a list to store the evaluation metric for each split
mse_scores = []
# Iterate over train-test splits and train models
for train_index, test_index in tscv.split(data):
    train_data, test_data = data.iloc[train_index], data.iloc[test_index]
    # Fit ARIMA model (order=(5, 1, 0) is only an example order)
    model = ARIMA(train_data, order=(5, 1, 0))
    fitted_model = model.fit()
    # Make predictions for as many steps as the test window contains
    predictions = fitted_model.forecast(steps=len(test_data))
    # Calculate Mean Squared Error and record it for this split
    mse = mean_squared_error(test_data, predictions)
    mse_scores.append(mse)
    print(f'Mean Squared Error for current split: {mse}')
# Calculate average Mean Squared Error across all splits
average_mse = np.mean(mse_scores)
print(f'Average Mean Squared Error across all splits: {average_mse}')
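To see how TimeSeriesSplit keeps the temporal order, here is a small sketch that only prints the indices of each split. The twelve-point toy series and the choice of three splits are assumptions for illustration; no model is fitted.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve toy observations in time order (illustrative data only)
series = np.arange(12)

tscv_demo = TimeSeriesSplit(n_splits=3)

splits = []
for i, (train_idx, test_idx) in enumerate(tscv_demo.split(series), start=1):
    print(f"Split {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
    splits.append((train_idx, test_idx))
```

In every split the training indices come before the test indices, and the training window grows from one split to the next, so the model never sees observations from the future.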
Learning about cross-validation methods in machine learning through practical coding is an excellent way to deepen our understanding! It's one thing to grasp the theory, but applying it hands-on really solidifies the concepts. Kudos to Muhammad Irfan, Muhammad Haris Tariq, and Dr. Sheraz Naseer for breaking down this complex topic in a comprehensible manner. Excited to delve into the article!