Why & how do we split our data set before model building?
We split the data set because using the same data for both training and testing gives an over-optimistic measure of performance, which increases the chance of inaccurate predictions on unseen data. The train_test_split function allows us to split a dataset with ease while pursuing an ideal model. We also need to ensure that the model is neither overfitting nor underfitting.
How do we split the data set into subsets of training & testing in python?
from sklearn.model_selection import train_test_split
sklearn.model_selection.train_test_split(*arrays, **options)
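As a minimal sketch of a typical call (the toy arrays X and y below are made up for illustration), splitting features and labels in one go looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, plus 10 labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# With no sizes given, 25% of the samples go to the test split by default
X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```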
*arrays: sequence of indexables with same length / shape[0] -> allowed inputs are lists, NumPy arrays, SciPy sparse matrices or pandas DataFrames.
test_size: float or int, default=None -> If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train size is also None, it will be set to 0.25.
train_size: float or int, default=None -> If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
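To make the float/int distinction concrete, here is a small sketch (the toy arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
y = np.arange(50)

# float: a proportion -> 20% of 50 samples = 10 test samples
_, X_test, _, _ = train_test_split(X, y, test_size=0.2)
print(len(X_test))  # 10

# int: an absolute count -> exactly 5 test samples
_, X_test, _, _ = train_test_split(X, y, test_size=5)
print(len(X_test))  # 5
```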
random_state: int or RandomState instance, default=None -> Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
What is random_state??
NumPy's RandomState exposes a number of methods for generating random numbers drawn from a variety of probability distributions. In addition to the distribution-specific arguments, each method takes a keyword argument size that defaults to None. If size is None, then a single value is generated and returned.
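For instance, a quick sketch of the size keyword in action (the seed 42 here is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(42)  # a generator seeded with an int

print(rng.normal())                # size=None -> a single value
print(rng.normal(size=3))          # size=3 -> an array of three draws
print(rng.randint(0, 10, size=5))  # five ints from a different distribution
```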
Passing an int -> use a new random number generator seeded by the given integer. Using an int will produce the same results across different calls. However, it may be worthwhile checking that your results are stable across a number of distinct random seeds. Popular integer random seeds are 0 and 42.
Why "42"??
The number "42" was apparently chosen as a tribute to the "Hitch-hiker's Guide" books by Douglas Adams, as it was supposedly the answer to the great question of "Life, the universe, and everything" as calculated by a computer (named "Deep Thought") created specifically to solve it.
shuffle: bool, default=True -> Whether or not to shuffle the data before splitting. If shuffle=False, then stratify must be None.
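A small sketch of the effect of turning shuffling off (toy data made up for illustration):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# shuffle=False keeps the original order: the test split is simply the tail
train, test = train_test_split(data, test_size=0.3, shuffle=False)
print(train)  # [0, 1, 2, 3, 4, 5, 6]
print(test)   # [7, 8, 9]
```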
stratify: array-like, default=None -> If not None, data is split in a stratified fashion, using this as the class labels.
But why stratify??
Some classification problems do not have a balanced number of examples for each class label. As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset. This is the reason we need a stratified train-test split.
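A minimal sketch of a stratified split on an imbalanced toy label vector (the 80/20 class mix and the seed are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 80 samples of class 0, 20 samples of class 1
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Both splits keep the original 80/20 class ratio
print(np.bincount(y_train))  # [60 15]
print(np.bincount(y_test))   # [20  5]
```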