Three Approaches to Prepare Datasets

Why do you use fit_transform on the train set, but only transform on the test set?
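
As a rough illustration of the question above, here is a minimal sketch in plain Python. The helper names fit and transform and the sample numbers are assumptions made up for this example; they only mimic the idea behind a scaler's fit_transform and transform and are not a library API. The point is that the scaling statistics are learned from the train set only and then reused unchanged on the test set, so no information from the test set leaks into preparation.

```python
import statistics

def fit(values):
    """Learn the scaling statistics (mean, standard deviation) from the data."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Scale values using statistics that were learned earlier."""
    return [(v - mean) / std for v in values]

# Hypothetical feature values, invented for illustration
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

# "fit_transform" on the train set: learn the statistics, then apply them
mean, std = fit(train)
train_scaled = transform(train, mean, std)

# "transform" only on the test set: reuse the train statistics, never refit
test_scaled = transform(test, mean, std)

print(mean, std)
print(test_scaled)
```

If the statistics were refitted on the test set, the two sets would be scaled differently and the test data would no longer play the role of unseen data.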

This is a continuation of the previous post.

1. Why should we not put 100% of the data into the train set?

Ans:- If you train your model on a training set and test it on that same set, of course your model will do well! You want to evaluate the performance of your model on data it has never seen before.

Example:-

  • Professors often give sample exams to students to help them study, but if the professor used that same sample exam as the actual exam, you can bet that almost all the students would ace it. It wouldn’t be an accurate measure of the students’ actual knowledge.
  • The best way to test students is with an exam that follows the same structure but has problems the students have never seen before. Similarly, the best way to test a machine learning model is with data that the model has never seen before, but that you believe still has some underlying structure the model can learn through training.
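
The exam analogy can be sketched with a deliberately extreme toy model that only memorizes its training rows. Everything here (the even/odd dataset, the function names, the default answer for unseen inputs) is an assumption invented for illustration, not a real learning algorithm.

```python
def train_memorizer(rows):
    """'Train' by storing every (input, label) pair in a lookup table."""
    return {x: label for x, label in rows}

def predict(model, x, default="even"):
    """Return the memorized label, or a fixed guess for inputs never seen."""
    return model.get(x, default)

def accuracy(model, rows):
    correct = sum(1 for x, label in rows if predict(model, x) == label)
    return correct / len(rows)

# Hypothetical dataset: numbers 0..99 labeled as even or odd
data = [(x, "even" if x % 2 == 0 else "odd") for x in range(100)]
train_rows, test_rows = data[:70], data[70:]

model = train_memorizer(train_rows)

train_accuracy = accuracy(model, train_rows)  # same rows it memorized
test_accuracy = accuracy(model, test_rows)    # rows it has never seen

print(train_accuracy, test_accuracy)
```

The memorizer scores perfectly on the rows it was trained on and no better than guessing on the held-out rows, which is why the score on the training set says little about real performance.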

2) i) Keep 70 to 80% of the data in the train set and the remaining 20 to 30% in the test set.

Ans:-

Example:- Suppose my dataset consists of 10,000 rows and contains 30 patterns.

Train-->70%-->7000 rows-->25 patterns

Test-->30%-->3000 rows-->10 patterns. (Here 5 patterns are shared with the train set and 5 patterns are unknown to the model. We can observe the performance of our model on those unknown patterns.)
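
The arithmetic in this example can be checked with a small sketch. The dataset below is hypothetical and constructed on purpose so that a plain 70/30 split (no shuffling) reproduces the numbers above; real data will not line up this neatly.

```python
# Hypothetical dataset: each row is just a pattern id (0..29).
# The first 7000 rows cycle through patterns 0..24,
# the last 3000 rows cycle through patterns 20..29.
rows = [i % 25 for i in range(7000)] + [20 + i % 10 for i in range(3000)]

# Plain 70/30 split in the original row order (integer arithmetic)
cut = len(rows) * 70 // 100
train, test = rows[:cut], rows[cut:]

train_patterns = set(train)
test_patterns = set(test)
common = train_patterns & test_patterns   # seen in both sets
unseen = test_patterns - train_patterns   # only in the test set

print(len(train), len(train_patterns))    # 7000 25
print(len(test), len(test_patterns))      # 3000 10
print(len(common), len(unseen))           # 5 5
```

The counts agree with the example: 25 train patterns plus 10 test patterns minus the 5 shared ones gives the 30 patterns in the full dataset.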

ii) Shuffle the data, then keep 70 to 80% in the train set and the remaining 20 to 30% in the test set. Of the two methods, when should each be applied, and which one should we choose?

Example:- Suppose my dataset consists of 10,000 rows and contains 30 patterns.

Shuffle data--> The reason to shuffle is that the dataset may contain consecutive runs of repeated values, such as 10,10,20,20,500,500,500... for numerical data or dog,dog,dog,cat,cat,d... for categorical data. Without shuffling, a split by position could place whole runs in only one of the two sets.

Train-->70%-->7000 rows-->25 patterns

Test-->30%-->3000 rows-->10 patterns. (Here 5 patterns are shared with the train set and 5 patterns are unknown to the model. We can observe the performance of our model on those unknown patterns.)

This helps detect overfitting, because we can test model performance on patterns the model has not seen.
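
A minimal sketch of why shuffling matters, using only the standard library. The dog/cat dataset, the seed 42, and the helper split_70_30 are assumptions chosen for illustration.

```python
import random

# Hypothetical dataset stored in long runs of the same label
data = ["dog"] * 50 + ["cat"] * 50

def split_70_30(rows):
    """Split rows by position: first 70% for training, the rest for testing."""
    cut = len(rows) * 70 // 100
    return rows[:cut], rows[cut:]

# Method i): split in the original order
train_plain, test_plain = split_70_30(data)

# Method ii): shuffle a copy first (fixed seed so the run is repeatable)
shuffled = data[:]
random.Random(42).shuffle(shuffled)
train_shuf, test_shuf = split_70_30(shuffled)

print(set(test_plain))  # only one label: the test set is all "cat"
print(set(test_shuf))   # after shuffling, both labels are expected to appear
```

When the rows arrive grouped or sorted, the unshuffled split gives a test set that does not resemble the training set at all, so shuffling first is usually the safer choice. For data where order carries meaning, such as time series, the unshuffled split may be the one to keep.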

