Last updated on 2024年9月24日

How do you split your data into training, validation, and test sets for predictive modeling?

由人工智能和领英社区提供技术支持

If you want to build a predictive model that can generalize well to new data, you need to split your data into three sets: training, validation, and test. Each set has a different role and purpose in the modeling process. In this article, you will learn how to split your data into these sets and why it is important for predictive analytics.

本文章的要点总结

Stratified splitting:

Ensuring your data sets reflect the original distribution prevents skewing the model's learning process. Just like balancing flavors in a dish, it helps you maintain the integrity of your predictive analytics.
Use cross-validation:

K-fold cross-validation is a real game-changer for those with limited data. It's like trying different combinations of a recipe to see which one truly hits the spot before you serve it to your guests.

本摘要由 AI 和以下专家提供支持

Pranay Chandur

Data Science & Analytics Leader |…

1 Training Set

The training set is the largest subset of your data that you use to fit your model. It contains the input features and the target variable that you want to predict. The size of the training set depends on your data size, complexity, and availability, but a common rule of thumb is to use 60-80% of your data for training. The training set should be representative of the population and the problem you are trying to solve.

添加您的观点

Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
When splitting data for predictive modeling, allocate 60-80% for training, 10-20% for validation, and 10-20% for testing. The training set should reflect the data distribution for effective learning. Use the validation set for hyperparameter tuning and to detect overfitting, while keeping the test set untouched to assess generalization and avoid data leakage. Stratify splits for imbalanced data. Cross-validation is beneficial for limited or complex datasets to improve robustness and reduce variance. For high-dimensional or time-series data, use techniques like dimensionality reduction or temporal validation, and consider edge cases like multi-class or multi-label tasks.

已翻译

赞
Pankaj Rajan

?? ????-?????????????? ???? ???????????????? | ?? Transforming Knowledge Work with AI - Effortlessly ??
(已编辑)
举报内容
Use @markovml and it would take care of the rest that includes handling imbalanced data, identifying high quality discriminative points, doing k fold validation , model evaluation , and more. Go from data to fully deployed model with monitoring and restful endpoints set up in a few hours.

已翻译

赞
Jeffrey Day

Director Marketing Analytics
举报内容
Think of the Training Set as a 6+ lane highway, you are not going to capture precise factors, but the model needs a general direction. More is good, but ensuring the input data is clean and in the same unit of measure for each attribute is more important.

已翻译

赞
Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
When we developed our first predictive model, we carefully divided our data to ensure robust results. We allocated 70% of our data to the training set. This set was large enough to capture the complexity of our problem and diverse enough to represent the population. We used this data to train our model, allowing it to learn the patterns and relationships between the input features and the target variable. This approach ensured our model was well-trained and could generalize effectively. The training set's size and representativeness were crucial in developing a reliable predictive model, demonstrating the importance of careful data splitting. #predictiveanaytics #predictivemodel

已翻译

赞

2 Validation Set

The validation set is a smaller subset of your data that you use to evaluate and tune your model. It contains the same input features and target variable as the training set, but the data points are different and unseen by the model. The size of the validation set also depends on your data size, complexity, and availability, but a common rule of thumb is to use 10-20% of your data for validation. The validation set should be similar to the training set in terms of distribution and characteristics.

添加您的观点

Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
When creating a validation set, allocate 10-20% of your data, ensuring it reflects the training set distribution for accurate performance evaluation. Use the validation set for hyperparameter tuning, detecting overfitting, and selecting the best model. Avoid overfitting by not reusing the same validation set too often, as this can lead to overly optimistic results. Techniques like early stopping and validation curves can help mitigate this risk. For imbalanced datasets, stratify the data. In smaller datasets, cross-validation maximizes data usage and robustness. Keep the validation set separate from training until tuning is complete to avoid biased estimates.

已翻译

赞
Jeffrey Day

Director Marketing Analytics
举报内容
There are a few approaches to creating a Validation Set. One way is to have an app randomly select records from your Training Set. Though this will produce a more fine tuning of your model, it will be the same analysis that the training set went through and will only tighten your statistical variables. An approach I like to take is to factor in seasonality by ensuring you have at least 1 year of data (3 years of "normal" recent operations is best). Using 25% of your data will yield more precise coefficients.

已翻译

赞
Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
In one PoC, we faced challenges tuning our predictive model. We allocated 15% of our data to the validation set. This subset, unseen by the model during training, helped us evaluate its performance. By ensuring the validation set was similar in distribution and characteristics to the training set, we could accurately assess how well our model was likely to perform on new data. Iterating on this process, we fine-tuned our model’s parameters, using the validation set to check for improvements. This step was critical in enhancing the model's accuracy and preventing overfitting. Properly utilizing the validation set was a key factor in our model's success. #predictiveanalytics #predictivemodel

已翻译

赞

3 Test Set

The test set is the smallest subset of your data that you use to assess the final performance and accuracy of your model. It contains the same input features and target variable as the training and validation sets, but the data points are different and unseen by the model and the validation process. The size of the test set also depends on your data size, complexity, and availability, but a common rule of thumb is to use 10-20% of your data for testing. The test set should be similar to the training and validation sets in terms of distribution and characteristics, but it should also reflect the real-world conditions and scenarios that your model will face.

添加您的观点

Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
When creating a test set for predictive modeling, allocate 10-20% of your data, ensuring it mirrors the distribution of the training and validation sets for reliable evaluation. The test set must remain completely unseen during training and validation to provide an accurate measure of generalization. It should also reflect real-world conditions the model will face in production, especially in time-series or evolving datasets. Avoid any model adjustments based on test set performance to prevent overfitting. For imbalanced datasets, stratify the test set to preserve class distributions. Finally, use the test set only once to assess final model performance, as repeated use risks misleading results.

已翻译

赞
Pranay Chandur

Data Science & Analytics Leader | Applied Gen AI & LLMs | Walmart eCommerce
举报内容
Let’s say you are trying to predict dimensions of an item in retail using item listing content then feeding any new items to the site as part of the test group will examine the model effectiveness

已翻译

赞
Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
In a PoC project, we allocated 15% of our data to the test set. This subset, unseen by the model during training and validation, was essential for assessing its final performance. We ensured the test set mirrored the distribution and characteristics of the training and validation sets, while also reflecting real-world conditions. After tuning the model with the validation set, we used the test set to evaluate its accuracy and generalizability. This step was pivotal in verifying our model's robustness and reliability before deployment. Properly leveraging the test set confirmed our model's readiness for real-world application. #predictiveanalytics #predictivemodel

已翻译

赞

4 Data Splitting Methods

Various methods and techniques can be utilized to divide data into training, validation, and test sets. Random splitting randomly assigns data points to each set according to a predefined ratio or number, though it may not maintain the original structure of the data. Stratified splitting ensures each set has the same proportion of data points for each class or category of the target variable, which is ideal for imbalanced or skewed data. K-fold cross-validation splits the data into k equal folds or groups, using one fold as the validation set and the rest as the training set. It then averages the validation scores to estimate model performance, making it useful for small or limited data. Lastly, Time-based splitting divides the data based on temporal or chronological order, such as date or time, which is beneficial for time-series or sequential data to preserve temporal dependencies and patterns.

添加您的观点

Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
When splitting data for predictive modeling, choose a method based on your dataset's characteristics. Random splitting works for balanced datasets but may disrupt underlying patterns. For imbalanced datasets, stratified splitting preserves class proportions, improving generalization. K-fold cross-validation is ideal for small datasets, reducing overfitting risk by averaging performance over multiple splits, though it can be computationally expensive. For time-series data, time-based splitting is crucial to preserve temporal dependencies and avoid data leakage. Tailor your approach to the data and task to ensure a reliable, unbiased evaluation and avoid overfitting or improper generalization.

已翻译

赞
Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
During a project with imbalanced data, we faced challenges in achieving accurate predictions. We initially tried random splitting but quickly realized it didn't maintain the original data structure, leading to poor model performance. Switching to stratified splitting, we ensured each set had the same proportion of classes, significantly improving our results. In another project with limited data, we used k-fold cross-validation. This technique allowed us to maximize data usage and obtain reliable performance estimates by averaging validation scores across folds. These experiences highlighted the importance of choosing the right data-splitting method to enhance model accuracy and reliability. #predictiveanalytics #predictivemodel

已翻译

赞

加载更多内容

5 Data Splitting Best Practices

To split your data into training, validation, and test sets for predictive modeling, you should follow some best practices. Start by splitting your data before any preprocessing or feature engineering, to avoid data leakage or contamination between the sets. Additionally, it is important to use a consistent and reproducible random seed or state for random splitting, so that you can replicate and compare your results. Furthermore, use appropriate methods and techniques for your data type, size, and problem domain to ensure that your sets are representative, balanced, and realistic. Finally, keep your test set separate and untouched until the final evaluation to measure the true generalization and accuracy of your model.

添加您的观点

Pranay Chandur

Data Science & Analytics Leader | Applied Gen AI & LLMs | Walmart eCommerce
举报内容
It’s an interesting theory to attempt splitting data before feature engineering. Will try this best practice in my next project.

已翻译

赞
Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
When splitting data for predictive modeling, always split before any preprocessing, such as scaling or imputation, to avoid data leakage. Use a consistent random seed for reproducibility across experiments. Select the most appropriate splitting method—random, stratified, or time-based—based on your data characteristics to ensure balanced and representative sets. Keep the test set untouched until the final evaluation to get an unbiased measure of model performance on unseen data. For imbalanced or sparse datasets, use stratified or specialized methods to prevent skewed results and ensure reliable model evaluation.

已翻译

赞
Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
In PoC projects, we learned the importance of following best practices for data splitting. We began by splitting our data into training, validation, and test sets before any preprocessing or feature engineering to prevent data leakage. Using a consistent random seed ensured our splits were reproducible, allowing us to compare results accurately. We chose methods suited to our data type and problem domain to maintain representativeness and balance. Importantly, we kept the test set untouched until the final evaluation. This approach ensured our model's true generalization and accuracy, highlighting the value of disciplined data splitting practices. #predictiveanalytics #predicitvemodel

已翻译

赞

6 Here’s what else to consider

This is a space to share examples, stories, or insights that don’t fit into any of the previous sections. What else would you like to add?

添加您的观点

Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
When splitting data for predictive modeling in K-12 education, address specific challenges like imbalanced student performance data and missing values. Stratified splitting ensures representative class distributions, and proper imputation methods, such as mean imputation or model-based approaches, help manage missing data effectively. In longitudinal studies, time-based splitting is essential to preserve student progress patterns across years. Ensure compliance with FERPA to avoid legal and ethical issues. Additionally, focus on fairness, ensuring the model doesn't introduce bias against underserved student groups. By tackling these factors, you can build accurate, compliant, and fair predictive models tailored to K-12 education needs.

已翻译

赞
Abdulla Pathan

Next CIO Winner | AIML Icon | Driving competitive edge and operational excellence through AI/Cloud/Data analytics. I foster growth with agile, innovative solutions, align technology with business goals, and mentor teams
举报内容
When splitting data for predictive modeling, use a three-way split: training (60-70%), validation (15-20%), and test (15-20%). Ensure the splits are representative of the overall dataset to maintain data integrity. Employ stratified sampling if dealing with imbalanced classes. For time series, maintain temporal order to prevent data leakage. Regularly monitor model performance on the validation set and only use the test set for final evaluation to avoid overfitting. Consider cross-validation for robust validation results. Document all decisions for reproducibility and compliance with educational data privacy standards.

已翻译

赞

Predictive Analytics

+ 关注

给文章评分

我们借助人工智能创建了此文章。您认为这篇文章怎么样？

很棒不太好

举报此文章

查看全部

How do you split your data into training, validation, and test sets for predictive modeling?

1

2

3

4

5

6

1 Training Set

2 Validation Set

3 Test Set

4 Data Splitting Methods

5 Data Splitting Best Practices

6 Here’s what else to consider

Predictive Analytics

给文章评分

感谢您的反馈

更多Predictive Analytics相关文章

更多相关阅读内容