Machine Learning Project on Imbalanced Data set in R
A lot of us get rejected for data science / machine learning roles. Do you know why? Because our resumes never get shortlisted for telephone interviews. Yes. Recruiters don't want to waste time evaluating candidates whose resumes don't show promising accomplishments.
If you are learning data science on your own, driven purely by your own dedication, this article is meant for you. While you are busy learning ML techniques, I want you to understand that showcasing your achievements is necessary. You might have worked on several data sets, but if you don't organize and present that work, you'll have a hard time getting shortlisted.
While you keep yourself busy with your work, I've created this ML project, which you can showcase on your resume. Yes, this project is different. But don't show it without understanding it. You can't deceive recruiters. Keep this one rule in your head:
If you are not confident about something, don't write it on your resume.
An honest resume is 1000 times better than a fabricated one.
In this project, I've used an imbalanced classification problem that is tricky, challenging, and based on a fairly large data set. If you use R and are passionate about data science, this project should interest you.
Table of Contents
1. Problem Statement & Hypothesis Generation
2. Data Exploration
3. Data Cleaning
   (a) Missing Value Imputation
4. Data Manipulation a.k.a. Feature Engineering
5. Machine Learning
   (a) Imbalanced Techniques
       > Oversampling
       > Undersampling
       > SMOTE
   (b) Naive Bayes
   (c) XGBoost
       > Homework – Top 20 Features
   (d) AUC Threshold
   (e) SVM
       > Homework – Class Weight
View Complete Project
The homework assignments come with sufficient hints. For SVM, I've also given the code; you just need to run and evaluate the model to see if it beats the xgboost model.
Now, open R and start working with me!