Why & how do we split our data set before model building?
We split the data set because using the same data for both training and testing gives an over-optimistic measure of performance, which increases the chance of inaccurate predictions on unseen data. The train_test_split function allows us to split a dataset with ease while pursuing an ideal model. We also need to ensure that the model is neither overfitting nor underfitting.
How do we split the data set into subsets of training & testing in python?
from sklearn.model_selection import train_test_split
sklearn.model_selection.train_test_split(*arrays, **options)
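As a minimal sketch of a typical call (the toy arrays X and y below are made up for illustration), splitting features and labels in one go looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, plus 10 labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# With no sizes given, 25% of the samples go to the test split by default
X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```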
*arrays: sequence of indexables with same length / shape[0] -> allowed inputs are lists, NumPy arrays, SciPy sparse matrices or pandas DataFrames.
test_size: float or int, default=None -> If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train size is also None, it will be set to 0.25.
train_size: float or int, default=None -> If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
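To make the float/int distinction concrete, here is a small sketch (the toy arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
y = np.arange(50)

# float: a proportion -> 20% of 50 samples = 10 test samples
_, X_test, _, _ = train_test_split(X, y, test_size=0.2)
print(len(X_test))  # 10

# int: an absolute count -> exactly 5 test samples
_, X_test, _, _ = train_test_split(X, y, test_size=5)
print(len(X_test))  # 5
```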
random_state: int or RandomState instance, default=None -> Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
What is random_state??
NumPy's RandomState exposes a number of methods for generating random numbers drawn from a variety of probability distributions. In addition to the distribution-specific arguments, each method takes a keyword argument size that defaults to None. If size is None, then a single value is generated and returned.
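For instance, a quick sketch of the size keyword in action (the seed 42 here is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(42)  # a generator seeded with an int

print(rng.normal())                # size=None -> a single value
print(rng.normal(size=3))          # size=3 -> an array of three draws
print(rng.randint(0, 10, size=5))  # five ints from a different distribution
```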
Passing an int -> use a new random number generator seeded by the given integer. Using an int will produce the same results across different calls. However, it may be worthwhile checking that your results are stable across a number of distinct random seeds. Popular integer random seeds are 0 and 42.
Why "42"??
The number "42" was apparently chosen as a tribute to the "Hitch-hiker's Guide" books by Douglas Adams, as it was supposedly the answer to the great question of "Life, the universe, and everything" as calculated by a computer (named "Deep Thought") created specifically to solve it.
shuffle: bool, default=True -> Whether or not to shuffle the data before splitting. If shuffle=False, then stratify must be None.
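A small sketch of the effect of turning shuffling off (toy data made up for illustration):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# shuffle=False keeps the original order: the test split is simply the tail
train, test = train_test_split(data, test_size=0.3, shuffle=False)
print(train)  # [0, 1, 2, 3, 4, 5, 6]
print(test)   # [7, 8, 9]
```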
stratify: array-like, default=None -> If not None, data is split in a stratified fashion, using this as the class labels.
But why stratify??
Some classification problems do not have a balanced number of examples for each class label. As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset. This is the reason we need a stratified train-test split.
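A minimal sketch of a stratified split on an imbalanced toy label vector (the 80/20 class mix and the seed are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 80 samples of class 0, 20 samples of class 1
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Both splits keep the original 80/20 class ratio
print(np.bincount(y_train))  # [60 15]
print(np.bincount(y_test))   # [20  5]
```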