Three Approaches to Prepare Datasets

Sravanthi kuruva

Software Engineer | WordPress Developer | Transitioning to React.js | Passionate about Frontend Development | Open to New Opportunities.

Published: October 11, 2021

Why do you use fit_transform on the train set, but just transform on the test set?

A continuation of the previous post.

1) Why should we not put 100% of the data into the train set?

Ans: If you train your model on a training set and test it on that same set, of course your model will do well! You want to evaluate the performance of your model on a set of data it has never seen before.

Example:

Professors often give sample exams to students to give them a chance to study better, but if the professor decided to use that same sample exam as the actual exam, you can bet that almost all the students would ace it. It wouldn't be an accurate measure of the students' actual knowledge.
The best way to test students is with an exam that follows the same structure but has problems the students have never seen before. Similarly, the best way to test a machine learning model is with data that the model has never seen before, but that you believe still has some underlying structure the model can learn through training.
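The exam analogy can be sketched in a few lines of Python. This is only a toy illustration: the `MemorizingModel` class, the odd/even task, and the 70/30 cut are assumptions made for the example, not something from the original post. The "model" memorizes the answers it saw during training and falls back to a fixed guess for anything else.

```python
class MemorizingModel:
    """Hypothetical toy model that only memorizes its training answers."""

    def __init__(self):
        self.memory = {}

    def fit(self, questions, answers):
        # Remember every question/answer pair seen during training.
        self.memory = dict(zip(questions, answers))

    def predict(self, question):
        # Return the memorized answer, or a fixed guess (0) if unseen.
        return self.memory.get(question, 0)


def accuracy(model, questions, answers):
    """Fraction of questions the model answers correctly."""
    correct = sum(model.predict(q) == a for q, a in zip(questions, answers))
    return correct / len(questions)


# Toy task (assumed): the answer is 1 for odd numbers and 0 for even numbers.
questions = list(range(100))
answers = [q % 2 for q in questions]

# First 70 questions are the "sample exam", the last 30 the "real exam".
train_q, train_a = questions[:70], answers[:70]
test_q, test_a = questions[70:], answers[70:]

model = MemorizingModel()
model.fit(train_q, train_a)

train_accuracy = accuracy(model, train_q, train_a)  # same exam it studied
test_accuracy = accuracy(model, test_q, test_a)     # exam it has never seen
print(train_accuracy, test_accuracy)
```

Scoring the model on the questions it trained on says nothing about what it actually learned. Only the score on the unseen questions reveals that it is guessing.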

2) i) Keep 70 to 80% of the data in the train set and the remaining 20 to 30% in the test set.

Ans:

Example: Suppose my dataset consists of 10,000 rows and has 30 patterns.

Train --> 70% --> 7,000 rows --> 25 patterns

Test --> 30% --> 3,000 rows --> 10 patterns. (Here 5 patterns are shared with the train set and 5 patterns are unknown to the model. We can observe how the model performs on those unknown patterns.)
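The arithmetic of this example can be checked with a short sketch. The `sequential_split` helper and the numeric pattern IDs are hypothetical names introduced only for illustration. The split keeps the rows in their existing order, which is what method i) describes.

```python
def sequential_split(rows, train_percent=70):
    """Split rows in their existing order: first part train, the rest test."""
    cut = len(rows) * train_percent // 100  # integer math avoids float rounding
    return rows[:cut], rows[cut:]


rows = list(range(10_000))  # stand-in for a 10,000-row dataset
train_rows, test_rows = sequential_split(rows, train_percent=70)
print(len(train_rows), len(test_rows))

# Pattern bookkeeping from the example, using made-up pattern IDs 1..30.
train_patterns = set(range(1, 26))   # 25 patterns appear in the train set
test_patterns = set(range(21, 31))   # 10 patterns appear in the test set

shared_patterns = train_patterns & test_patterns   # seen during training
unseen_patterns = test_patterns - train_patterns   # new to the model
print(len(shared_patterns), len(unseen_patterns))
```

The 25 train patterns plus the 5 unseen test patterns account for all 30 patterns in the dataset.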

ii) Shuffle the data, then keep 70 to 80% in the train set and the remaining 20 to 30% in the test set. Of the two methods, when should each be applied, and which should we choose?

Example: Suppose my dataset consists of 10,000 rows and has 30 patterns.

Shuffle data --> The reason to shuffle is that the dataset may contain consecutive runs of repeated values, such as 10,10,20,20,500,500,500... (numerical) or dog,dog,dog,cat,cat,d

Train --> 70% --> 7,000 rows --> 25 patterns

Test --> 30% --> 3,000 rows --> 10 patterns. (Here 5 patterns are shared with the train set and 5 patterns are unknown to the model. We can observe how the model performs on those unknown patterns.)
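A minimal sketch of why shuffling matters when the data arrives in long runs of the same value. The dog/cat labels, the 7,000/3,000 counts, the `split` helper, and the seed are assumptions chosen for illustration. Only Python's standard `random` module is used.

```python
import random


def split(rows, train_percent=70):
    """Cut rows into a train part and a test part, keeping their order."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


# Ordered data: every "dog" row comes before every "cat" row.
labels = ["dog"] * 7_000 + ["cat"] * 3_000

# Without shuffling, the train set holds only dogs and the test set only cats.
train_plain, test_plain = split(labels)
print(set(train_plain), set(test_plain))

# Shuffle a copy first; the seed (arbitrary) only makes the run repeatable.
shuffled = labels[:]
random.Random(42).shuffle(shuffled)
train_shuffled, test_shuffled = split(shuffled)
print(set(train_shuffled), set(test_shuffled))
```

With the unshuffled split, the model would be trained on one kind of row and tested on another. After shuffling, both sets draw from the whole dataset.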

Holding out a test set in this way helps detect overfitting, because we can measure the model's performance on patterns it has not seen.
