ç™»å½•æŸ¥çœ‹æ›´å¤šå†…å®¹

What are the best practices for splitting your dataset into training and testing sets?

ç”±äººå·¥æ™ºèƒ½å’Œé¢†è‹±ç¤¾åŒºæä¾›æŠ€æœ¯æ”¯æŒ

When venturing into the realm of data science, understanding how to properly split your dataset is crucial for model training and evaluation. This process is fundamental to avoid overfitting, where the model performs well on the training data but poorly on unseen data. By dividing your dataset into separate training and testing sets, you provide a more accurate assessment of your model's performance. The training set is used to teach the model to recognize patterns, while the testing set is reserved to validate the model's predictions. This separation ensures that the model's accuracy is tested on fresh, unlearned data, reflecting its potential performance in real-world scenarios.

æ¤æ–‡ç« ä¸çš„ä¸šç•Œè¾¾äºº

ç”±ç¤¾åŒºä»Ž 72 æ¡å†…å®¹ä¸ç²¾é€‰ã€‚äº†è§£æ›´å¤š

Bushra Amjad

I help Data Professionals sign clients through LinkedIn | From Data Professional myself â†’ to Lead Gen Expert for Dataâ€¦
Subodh Kumar

S&P Global
Kannan Singaravelu, CQF

Machine Learning & GenAI Consultant

1 Data Splitting

In data science, splitting your dataset effectively is a pivotal step towards building a robust model. Generally, you'll want to allocate a larger portion of your data for training, often around 70-80%, with the remaining 20-30% for testing. This allows your model to learn from a substantial amount of data while still retaining enough unique data points to test its predictions. However, the exact split can depend on the size and diversity of your dataset. For smaller datasets, you might need to use techniques like cross-validation to maximize the use of your data for training while still getting a reliable estimate of model performance.

æ·»åŠ æ‚¨çš„è§‚ç‚¹

Subodh Kumar

S&P Global
ä¸¾æŠ¥å†…å®¹
In our recent project, we divided our data into 80% for training and 20% for testing (129k samples) using the train_test_split method from Scikit-Learn. This initial split ensures that our model is trained on a significant portion of the data, allowing it to learn the underlying patterns effectively. During training, we further refined this approach by using 5% of the training data as a validation set. This validation set is crucial as it helps us monitor the model's performance during training, providing an early indication of overfitting. We incorporated two key callbacks: EarlyStopping and ModelCheckpoint. The EarlyStopping callback monitors the validation loss and halts training when the model stops improving, preventing overfitting

å·²ç¿»è¯‘

èµž
Kannan Singaravelu, CQF

Machine Learning & GenAI Consultant
ä¸¾æŠ¥å†…å®¹
There is no right percentage and split depends on the context. In general don't go below 70:30 while 80:20 is preferred. The split percentage should be higher for training set if you do a train/validation/test (as a part of it goes to validation).

å·²ç¿»è¯‘

èµž
Lokesh Metta

Actively Seeking Full-Time Opportunities in Automation | Software Development | Mechatronics | Robotics | Machine Learning | Computer Vision
ä¸¾æŠ¥å†…å®¹
When splitting a dataset into training and testing sets, a best practice is using k-fold cross-validation. This method divides the data into k parts, using each part once as the test set while training on the other k-1 parts. This way, every data point gets used for both training and testing, giving a more reliable measure of the model's performance and reducing the risk of overfitting.

å·²ç¿»è¯‘

èµž
Prem Vishnoi

Sr. Manager - Data & AI | Large-Scale Systems | Azure, AWS, Alibaba Cloud, Tencent Cloud | Ex-Lazada
ä¸¾æŠ¥å†…å®¹
Splitting your dataset into training and testing sets is crucial for developing reliable machine learning models. Firstly, understand your dataset's characteristics, such as size and distribution, to decide on the appropriate splitting strategy. Use stratified sampling to maintain the same class proportions in both sets. Common split ratios are 70/30, 80/20, or 90/10, depending on the dataset size. Implement k-fold cross-validation for better generalization. For time-series data, ensure temporal consistency by not shuffling. Avoid data leakage by isolating the training set from the testing set during preprocessing. Finally, set a random state for reproducibility.

å·²ç¿»è¯‘

èµž
Patrick Ndayizigamiye (PhD)

PhD I MCom I Executive MBA I CIPS (Canada) I IBM Champion (2022&2023)
ä¸¾æŠ¥å†…å®¹
Common splits are 70:30 or 80:20, meaning 70% or 80% of the data is used for training, and the rest for testing. For very large datasets, a smaller percentage (such as 90:10 or 95:5) can still provide enough data for testing. Ensure that the training and testing sets represent the overall dataset. This is particularly important for datasets with imbalanced classes. Stratified sampling keeps the proportion of classes consistent between the training and testing sets. Shuffle the data before splitting to ensure that the training and testing sets are random and representative of the overall dataset. This prevents any bias that might be introduced by the order of the data.

å·²ç¿»è¯‘

èµž

åŠ è½½æ›´å¤šå†…å®¹

2 Holdout Method

The Holdout Method is a straightforward approach where you randomly partition the dataset into two sets: one for training and one for testing. It's essential to randomize this split to avoid bias in the selection process. You can use functions like train_test_split from Python's scikit-learn library to perform this task efficiently. The key is to ensure that both sets are representative of the overall dataset, meaning they should have a similar distribution of classes in classification tasks or a range of values in regression tasks.

æ·»åŠ æ‚¨çš„è§‚ç‚¹

Varun Mishra

Assistant Manager at Fuld & Company || Data Analytics || Product Analytics || Data Consulting
ä¸¾æŠ¥å†…å®¹
The Holdout Method indeed offers simplicity in splitting data for training and testing. Its randomized partitioning helps in preventing bias, crucial for fair evaluation. Utilizing tools like train_test_split from scikit-learn enhances efficiency. However, it's vital to ensure both sets maintain representativeness, preserving class distributions in classification or value ranges in regression tasks. This method's straightforwardness is its strength, especially for quick model evaluation, yet its reliance on randomness can sometimes lead to variance in results. Thus, while efficient, careful consideration of dataset characteristics is still necessary for robust model assessment.

å·²ç¿»è¯‘

èµž
Ankush Hujare

Data Analytics | Data Science | ETL | PowerBI | SQL | Python | Excel | Tableau | Data Visualization
ä¸¾æŠ¥å†…å®¹
Think of the Holdout Method like splitting a bag of candies into two partsâ€”one for immediate enjoyment and one for later tasting. You shuffle the candies to make sure each part has a good mix. Similarly, in data science, you randomly divide your dataset into two partsâ€”one for training your model and one for testing its accuracy. It's crucial that both parts represent the entire dataset well, so your model learns and performs accurately.

å·²ç¿»è¯‘

èµž
Daniel Eduardo Diaz Almeida, PhD

AI Engineer | Artificial Intelligence Consultant | Python | LangChain | MongoDB | Neo4j | VertexAI | Data Scientists | Google Cloud Platform |
ä¸¾æŠ¥å†…å®¹
I agree, the Holdout Method is a straightforward approach where you randomly partition the dataset into two sets: one for training and one for testing. It's essential to randomize this split to avoid bias in the selection process. For example, if you have a dataset of customer transactions over time, a bad application of this method would be to use the first 70% of the transactions for training and the last 30% for testing. This could introduce temporal bias, as trends and behaviors might change over time, leading to a model that performs well on past data but poorly on future data. Randomizing the split ensures both sets are representative of the overall dataset, promoting better generalization.

å·²ç¿»è¯‘

èµž
Arun Ram Ilayarasan

Final Year student @ SRM TRP Engineering College | IETE Fellow | IoT & AI Specialist | Machine Learning & Data Analytics | Embedded Systems | Flutter Developer
ä¸¾æŠ¥å†…å®¹
This straightforward approach involves randomly splitting the dataset into a training set and a test set. It's essential to ensure that the split is random to avoid any bias in the data partitioning process.

å·²ç¿»è¯‘

èµž
Muhammad Munawar Baig

Senior Web Developer | Python | React Native,React Js | WordPress
ä¸¾æŠ¥å†…å®¹
The holdout method divides the dataset into separate training and testing subsets, often with proportions such as 70% for training and 30% for testing. This approach assesses the modelâ€™s ability to generalize to unseen data. However, a major limitation is that the model's performance might fluctuate greatly, depending on the specific data division, potentially leading to misleading results if the split isn't representative.

å·²ç¿»è¯‘

èµž

åŠ è½½æ›´å¤šå†…å®¹

3 Stratified Split

For datasets with imbalanced classes, where some outcomes are much rarer than others, a stratified split ensures that both training and testing sets have proportionate representations of each class. This prevents scenarios where the model is not adequately tested on rarer outcomes. In Python, you can use the StratifiedShuffleSplit function from scikit-learn to perform a stratified split, which maintains the percentage of samples for each class as in the full dataset.

æ·»åŠ æ‚¨çš„è§‚ç‚¹

Muhammad Munawar Baig

Senior Web Developer | Python | React Native,React Js | WordPress
ä¸¾æŠ¥å†…å®¹
Stratified splitting maintains class balance in training and testing sets, crucial for accurate model performance assessment, especially with imbalanced data. By preserving class distribution, it ensures robust training outcomes and enhances model generalizability across all classes. This method is particularly valuable for achieving reliable predictions in classification tasks where class representation is uneven.

å·²ç¿»è¯‘

èµž
Varun Mishra

Assistant Manager at Fuld & Company || Data Analytics || Product Analytics || Data Consulting
ä¸¾æŠ¥å†…å®¹
Using a stratified split is crucial when dealing with imbalanced datasets as it maintains the class distribution in both training and testing sets, ensuring fair evaluation and preventing the model from being biased towards the majority class. By employing StratifiedShuffleSplit from scikit-learn in Python, you prioritize representative sampling, enabling the model to learn from all classes effectively. This approach fosters robustness, particularly in scenarios where rare outcomes hold significant importance, such as in medical diagnoses or fraud detection. In essence, it's a proactive measure to tackle the challenges posed by class imbalance, enhancing the model's reliability and real-world applicability.

å·²ç¿»è¯‘

èµž
Bushra Amjad

I help Data Professionals sign clients through LinkedIn | From Data Professional myself â†’ to Lead Gen Expert for Data professionals | DM me "CLIENTS" to get started
ä¸¾æŠ¥å†…å®¹
For imbalanced datasets, use a stratified split to maintain the same class proportions in both training and testing sets. Tools like StratifiedShuffleSplit in scikit-learn can facilitate this process.

å·²ç¿»è¯‘

èµž
Vimukthi Sripa

Connecting Education and Industry through Data Science, AI, EdTech Development, and Career Coaching
ä¸¾æŠ¥å†…å®¹
This method combines random splitting with stratification. It randomly shuffles the data while maintaining the class proportions within each fold. This helps mitigate biases that might arise from a simple random split. For example, let's say you're analyzing customer churn data. A stratified shuffle split ensures the training set has a representative mix of customers who churned and those who remained loyal, leading to a more generalizable model for predicting future churn.

å·²ç¿»è¯‘

èµž
Ankush Hujare

Data Analytics | Data Science | ETL | PowerBI | SQL | Python | Excel | Tableau | Data Visualization
ä¸¾æŠ¥å†…å®¹
Imagine you're dividing a collection of toys between two kidsâ€”one loves cars, the other dolls. With a stratified split, you ensure each child gets a fair share of their favorite toys. Similarly, in data science, when dealing with imbalanced data where certain outcomes are rare, a stratified split ensures that both the training and testing sets have a balanced representation of each outcome. This ensures your model learns and is tested on all outcomes equally.

å·²ç¿»è¯‘

èµž

åŠ è½½æ›´å¤šå†…å®¹

4 Time Series Split

When dealing with time series data, the temporal order must be preserved during the split. Instead of random sampling, you should split the data based on time, ensuring that the training set contains only past data and the testing set contains future data relative to the training set. This mimics real-world scenarios where you predict future events based on past observations. Libraries such as scikit-learn provide specific functions like TimeSeriesSplit to handle this type of data appropriately.

æ·»åŠ æ‚¨çš„è§‚ç‚¹

Kannan Singaravelu, CQF

Machine Learning & GenAI Consultant
(å·²ç¼–è¾‘)
ä¸¾æŠ¥å†…å®¹
Never do a K-fold for timeseries as these are time ordered data that is to be preserved. Forward chaining or sliding window could be attempted. When preforming splits, make sure there are no data leakages between train and test samples.

å·²ç¿»è¯‘

èµž
Vimukthi Sripa

Connecting Education and Industry through Data Science, AI, EdTech Development, and Career Coaching
ä¸¾æŠ¥å†…å®¹
For time series data such as stock prices over time, chronological splitting is crucial. You ca split the data by time, ensuring the training set precedes the testing set, to prevent the model from using future information to predict past events, creating an unrealistic scenario. For example, if you're building a model to predict future sales based on historical sales data, time series split ensures the model is only trained on past sales data and doesn't "peek" at future sales figures during training, leading to a more reliable prediction for upcoming periods.

å·²ç¿»è¯‘

èµž
Bushra Amjad

I help Data Professionals sign clients through LinkedIn | From Data Professional myself â†’ to Lead Gen Expert for Data professionals | DM me "CLIENTS" to get started
ä¸¾æŠ¥å†…å®¹
In time series data, maintain the temporal order to prevent future data from leaking into training. Techniques like rolling-window cross-validation can help validate model performance over time.

å·²ç¿»è¯‘

èµž
Harpreet Singh Sachdev

Lead Data Scientist and Senior AI Researcher | Technical Leader | RET OPC Engineer
ä¸¾æŠ¥å†…å®¹
For time series data, the chronological order is sacrosanct to avoid the inadvertent leakage of future information into the training set. Techniques like rolling-window cross-validation are mostly used, as they allow for the evaluation of the model's performance over successive time frames, safeguarding the temporal integrity of the data. Other time series split techniques includes - Expanding Window Split - Walk-Forward Validation - Time Series Cross-Validation - Blocked Time Series Split Choosing a time series analysis technique requires evaluating the data's traits, business context, and goals, considering seasonality and historical trends. Each method brings unique advantages suited to these specific factors.

å·²ç¿»è¯‘

èµž
Daniel Eduardo Diaz Almeida, PhD

AI Engineer | Artificial Intelligence Consultant | Python | LangChain | MongoDB | Neo4j | VertexAI | Data Scientists | Google Cloud Platform |
ä¸¾æŠ¥å†…å®¹
The temporal order must be preserved during the split. Instead of random sampling, you should split the data based on time, ensuring that the training set contains only past data and the testing set contains future data relative to the training set. For example, a bad application of time series splitting would be to randomly select training and testing data points without considering their temporal order. This approach would mix past and future data, leading to a model that doesn't realistically simulate future predictions based on historical data. Instead, you should use the first 70% of the time series for training and the remaining 30% for testing, ensuring the model trains on historical data and predicts future values accurately.

å·²ç¿»è¯‘

èµž

åŠ è½½æ›´å¤šå†…å®¹

5 Feature Distribution

Before finalizing your split, it's wise to check the distribution of features in both training and testing sets. They should be similar to avoid introducing bias that could skew the model's performance. Discrepancies in feature distribution can lead to a model that performs well on the training set but fails to generalize to new data. Tools like histograms or density plots can help visualize and compare feature distributions across different splits.

æ·»åŠ æ‚¨çš„è§‚ç‚¹

Bushra Amjad

I help Data Professionals sign clients through LinkedIn | From Data Professional myself â†’ to Lead Gen Expert for Data professionals | DM me "CLIENTS" to get started
ä¸¾æŠ¥å†…å®¹
Ensure the feature distributions in training and testing sets are consistent to avoid bias. Use visualizations like histograms or box plots to compare distributions and confirm similarity.

å·²ç¿»è¯‘

èµž
Rinit Lulla

Data Scientist || IIT Jodhpur || Generative AI || LLMs || Deep Learning || NLP || ML || Computer Vision || Azure x6 || AWS x2 || GCP Ã—1
ä¸¾æŠ¥å†…å®¹
Feature distribution across both sets. This entails ensuring that the statistical characteristics, such as mean and variance, of the features remain comparable in both subsets. By adhering to this practice, we mitigate the risk of introducing bias into the model and enhance its performance when encountering unseen data. Techniques like stratified sampling prove particularly valuable in achieving this balance, especially in scenarios involving imbalanced datasets where certain classes or outcomes are underrepresented. This approach safeguards the proportional representation of each class in both training and testing sets, which is paramount for developing reliable predictive models.

å·²ç¿»è¯‘

èµž
Steven Lu

Financial Crimes Senior Associate Data Scientist
ä¸¾æŠ¥å†…å®¹
Although we've maximized the odds of avoiding bias in the split, there's still the off chance that the split failed to eliminate it entirely. This is especially likely when the dataset is smaller. For example, let's say we are collecting income as a feature and the train dataset has an average income of 70k, while the test dataset has an average income of 55k. This means the true average income in the overall dataset was around 65k, but you've trained your model on a train dataset with average income 70k, and are now predicting on a test dataset with 55k. In order to identify biases like this, exploratory data analysis (EDA) is needed. Visualizations like histograms, density plots, bar charts, etc. can help you identify biases.

å·²ç¿»è¯‘

èµž
Vimukthi Sripa

Connecting Education and Industry through Data Science, AI, EdTech Development, and Career Coaching
ä¸¾æŠ¥å†…å®¹
After splitting, you need to verify that the distribution of features such as let's say income levels, in the training and testing sets is similar to the original data, because a significant discrepancies can lead to biased models. For instance, in a loan application analysis, ensure the distribution of income levels in both training and testing sets reflects the real-world distribution of applicants' income. This helps the model make fair decisions on unseen loan applications.

å·²ç¿»è¯‘

èµž
Varun Mishra

Assistant Manager at Fuld & Company || Data Analytics || Product Analytics || Data Consulting
ä¸¾æŠ¥å†…å®¹
Ensuring similar feature distributions between training and testing sets is crucial for model robustness. A skewed distribution could mislead the model during training, resulting in poor generalization to unseen data. For instance, if the training set contains mostly one class while the testing set has a balanced distribution, the model may struggle to predict minority classes accurately. Using visualization tools like histograms aids in detecting such discrepancies, guiding adjustments to achieve balanced splits and improve model performance on diverse data.

å·²ç¿»è¯‘

èµž

åŠ è½½æ›´å¤šå†…å®¹

6 Data Leakage Prevention

Data leakage occurs when information from the testing set is inadvertently used during training. This can lead to overly optimistic performance estimates and a model that fails in real-world applications. To prevent data leakage, keep your testing set strictly separate from any preprocessing or feature engineering steps. Treat it as unseen data until the very end of your modeling process. This ensures that your model's performance metrics are based on its ability to generalize from the training data alone.

æ·»åŠ æ‚¨çš„è§‚ç‚¹

Varun Mishra

Assistant Manager at Fuld & Company || Data Analytics || Product Analytics || Data Consulting
ä¸¾æŠ¥å†…å®¹
The prevention of data leakage is paramount in ensuring the reliability of machine learning models. By maintaining a strict separation between the testing set and preprocessing/feature engineering, we safeguard against the inadvertent use of testing data during training. This practice fosters a more realistic evaluation of the model's generalization capabilities, mitigating the risk of overfitting and ensuring its effectiveness in real-world scenarios. Additionally, emphasizing the treatment of testing data as entirely unseen until the final stages of modeling reinforces the importance of unbiased evaluation, ultimately leading to more robust and dependable models.

å·²ç¿»è¯‘

èµž
Arun Ram Ilayarasan

Final Year student @ SRM TRP Engineering College | IETE Fellow | IoT & AI Specialist | Machine Learning & Data Analytics | Embedded Systems | Flutter Developer
ä¸¾æŠ¥å†…å®¹
It is crucial to prevent data leakage, where information from the test set inadvertently influences the training process. To avoid this, perform data preprocessing steps (like scaling or encoding) within a cross-validation loop or only after splitting the data.

å·²ç¿»è¯‘

èµž
Yan Machado ArÃªas

Data Analytics - Business Intelligence | Tableau | Dados | Power BI | Alteryx | Figma | Excel | ETL | SQL | Python | Dataviz | SISS | Pentaho | Power Platform
ä¸¾æŠ¥å†…å®¹
O vazamento de dados ocorre quando informa??es do conjunto de teste influenciam o modelo durante o treinamento, levando a avalia??es excessivamente otimistas. Para prevenir isso, Ã© essencial que o conjunto de teste permane?a completamente separado durante o processo de treinamento, e qualquer prÃ©-processamento deve ser aplicado separadamente a cada conjunto.

å·²ç¿»è¯‘

èµž
Vimukthi Sripa

Connecting Education and Industry through Data Science, AI, EdTech Development, and Career Coaching
ä¸¾æŠ¥å†…å®¹
To avoid false confidence, you can maintain a clear separation between training and testing data. You need to avoid using information from the testing set to influence model training, as this creates artificial accuracy and provides a misleading impression of the model's true performance. For example, imagine you're training a spam filter model. Don't use any emails from your spam folder to "fine-tune" the model during training. This leakage of information would give the model an unfair advantage and lead to overfitting â€“ performing well on the testing data but failing to generalize to unseen spam emails.

å·²ç¿»è¯‘

èµž
Uzair Shafique

Data Scientist & AI/ML Specialist | Pharm.D Candidate Bridging Medicine & Technology | Expert in Python, SQL, Data Analysis, Generative AI & ML/DL Model Training | Building Scalable AI Solutions
ä¸¾æŠ¥å†…å®¹
Data leakage is a sneaky culprit that can artificially inflate your model's performance. It occurs when information from the testing set unintentionally influences the training process. To prevent this, avoid using any testing data for tasks like feature engineering or model selection. Keep the training and testing sets entirely separate!

å·²ç¿»è¯‘

èµž

7 Hereâ€™s what else to consider

This is a space to share examples, stories, or insights that donâ€™t fit into any of the previous sections. What else would you like to add?

æ·»åŠ æ‚¨çš„è§‚ç‚¹

Hamidreza Moeini

Data Analyst
ä¸¾æŠ¥å†…å®¹
Randomization: Randomly shuffle the data before splitting. Proportion: Allocate a sufficient portion of the data to the training set to adequately train the model. Stratification: Maintain class proportions when splitting to ensure representative samples in both sets. Temporal splitting: If dealing with time-series, split based on time to avoid data leakage. Cross-validation: Utilize techniques like k-fold cross-validation to assess model performance more robustly. Separate data sources: If merging data from different sources, ensure that data from the same source is not split across training and testing sets. Validation set: Set aside a validation set for hyperparameter tuning to avoid overfitting the model to the test set.

å·²ç¿»è¯‘

èµž
Arun Ram Ilayarasan

Final Year student @ SRM TRP Engineering College | IETE Fellow | IoT & AI Specialist | Machine Learning & Data Analytics | Embedded Systems | Flutter Developer
ä¸¾æŠ¥å†…å®¹
Additionally, consider the use of cross-validation, particularly k-fold cross-validation, to maximize the use of your dataset and provide a more robust evaluation of your model's performance. This method involves splitting the data into k subsets and iteratively using each subset as a test set while training on the remaining data.

å·²ç¿»è¯‘

èµž

Data Science

+ å…³æ³¨

ç»™æ–‡ç« è¯„åˆ†

å¾ˆæ£’ ä¸å¤ªå¥½

ä¸¾æŠ¥æ¤æ–‡ç«

æŸ¥çœ‹å…¨éƒ¨

What are the best practices for splitting your dataset into training and testing sets?

1

2

3

4

5

6

7

1 Data Splitting

2 Holdout Method

3 Stratified Split

4 Time Series Split

5 Feature Distribution

6 Data Leakage Prevention

7 Hereâ€™s what else to consider

Data Science

ç»™æ–‡ç« è¯„åˆ†

æ„Ÿè°¢æ‚¨çš„åé¦ˆ

æ›´å¤šData Scienceç›¸å…³æ–‡ç«

æ›´å¤šç›¸å…³é˜…è¯»å†…å®¹

What are the best practices for splitting your dataset into training and testing sets?

1

2

3

4

5

6

7

1 Data Splitting

2 Holdout Method

3 Stratified Split

4 Time Series Split

5 Feature Distribution

6 Data Leakage Prevention

7 Hereâ€™s what else to consider

Data Science

ç»™æ–‡ç« è¯„åˆ†

æ„Ÿè°¢æ‚¨çš„åé¦ˆ

æŸ¥çœ‹å…¶ä»–æŠ€èƒ½

æ„Ÿè°¢æ‚¨çš„åé¦ˆ