Winning Strategies for ML Competitions from Past Winners
Kunal Jain
Founder & CEO @ Analytics Vidhya | AI Evangelist | Author | Blogger | Keynote Speaker
Knocktober starts in 6 hours, so we thought we would share the winning strategies from our past competitions. Read on to learn how three top data scientists approach hackathons. They have also shared tips and tricks that can help you improve your leaderboard position.
Let's dig in and find out what can help you win Knocktober.
1. Sudalai Rajkumar (SRK), Senior Data Scientist, AV Rank 1 (Read detailed article here)
His approach in past competitions:
- Understanding the problem and dataset
- Pre-processing the data: Data cleansing, Outlier removal, Normalization / Standardization, Dummy variable creation
- Feature engineering: Feature selection, Feature transformation, Variable interaction and Feature creation
- Selecting the modeling algorithm
- Parameter tuning through cross validation
- Building the model
- Checking the results by making a submission
Once you’ve executed these 7 steps, you have a basic framework for further experimentation. From there, you can concentrate on:
- Feature engineering – This is where the bigger improvements come from most of the time
- Building varied kinds of models and ensembling them – This helps you go the extra mile towards the end
Last but not least, we must set up solid local validation. Otherwise, we might end up overfitting to the public leaderboard.
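To make the point about local validation concrete, here is a minimal k-fold cross-validation sketch in plain Python. It is only an illustration: the helper names (`k_fold_indices`, `cross_val_scores`) and the toy "predict the training mean" model are assumptions for this example, not SRK's actual code. In a real competition you would plug in your own model and the contest's evaluation metric.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k=5, seed=0):
    """Shuffle row indices once, then deal them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_scores(fit, predict, metric, X, y, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, and repeat k times."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(X)) if i not in held_out_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in held_out])
        scores.append(metric([y[i] for i in held_out], preds))
    return scores


# Toy stand-ins (hypothetical) for a real model and metric.
def fit_mean(X_train, y_train):
    return mean(y_train)


def predict_mean(model, X_rows):
    return [model] * len(X_rows)


def mean_absolute_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


X = [[i] for i in range(20)]
y = [2.0 * i for i in range(20)]
scores = cross_val_scores(fit_mean, predict_mean, mean_absolute_error, X, y, k=5)
print("per-fold MAE:", scores)
print("local CV estimate:", mean(scores))
```

Comparing this local estimate against the public leaderboard score after each submission is one way to notice when the two start to diverge.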
Tips from SRK:
1. Understanding the problem – It is really important to have a thorough understanding of the problem that we are trying to solve. Only after we’ve understood the problem clearly can we derive suitable insights from the data to tackle it and obtain good results.
2. Structured Thinking – It’s a distinct way of thinking through problems. As a data scientist, one needs to be structured in one’s thinking in order to obtain good results.
3. Effective communication of results – Effective communication of derived results is as important as performing the data analysis.
2. Rohan Rao, Lead Data Scientist, AV Rank 4 (Read detailed article here)
His approach in past competitions:
- Understand the problem / objective you are trying to solve.
- Understand and summarize what data you have / need.
- Carefully read about the evaluation metric.
- Explore and visualize the data, and build simple base models as benchmarks.
- Set up a robust / thorough validation framework consistent with the evaluation conditions.
- Work on feature engineering and optimizing algorithms.
- Try out as many different models / ideas as you can.
- Ensemble / Blend / Stack multiple models.
- Never hesitate to ask questions, seek help or even team up with others.
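The "Ensemble / Blend / Stack" step in the list above can be sketched in its simplest form as a weighted average of predictions from several models. The `blend` function and the toy prediction lists below are hypothetical names and values chosen for illustration; stacking, where a second model learns how to combine the first-level predictions, is not shown here.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model prediction lists (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row_preds)) / total
        for row_preds in zip(*predictions)
    ]


# Two hypothetical models that err in opposite directions on the same rows.
model_a = [1.0, 3.0]
model_b = [3.0, 5.0]

simple_blend = blend([model_a, model_b])
weighted_blend = blend([model_a, model_b], weights=[0.25, 0.75])
print(simple_blend)
print(weighted_blend)
```

Blend weights are themselves tunable, so it is safer to choose them on the local validation framework than on public leaderboard feedback.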
Tips from Rohan:
- Gauge the complexity of the problem: Explore the data as much as possible. Plot features, summarize columns, build benchmark models, and in the process get a sense of the problem, data, time, complexity, etc. Then slowly build a solid solution by working on one idea after another.
- Algorithm: I use XGBoost together with feature engineering to build ML solutions, and XGBoost has been part of my winning solution in most of the contests I’ve done well in, so a big thanks to the community that is actively developing and improving it each day. I also like Collaborative Filtering techniques, which I’ve implemented very often in my work.
- Feature selection: My rule of thumb for feature selection is based on cross-validation (CV) or validation scores. If adding a feature improves the CV score, I keep it; otherwise I discard it. For a large number of features, I usually build small, quick models, check variable importance or information gain, and select the top-x features.
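One possible reading of the CV-based rule of thumb above is greedy forward selection: try each remaining feature, keep the one that improves the validation score the most, and stop when nothing improves. The sketch below is an assumption-laden illustration, not Rohan's code. It uses a tiny made-up dataset, a 1-nearest-neighbour classifier, and leave-one-out accuracy as the "CV score" so that it runs with the standard library alone.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        nearest = min(
            others,
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in features),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def forward_select(rows, labels, candidates, score_fn=loo_accuracy):
    """Add features one at a time, keeping only those that raise the score."""
    selected = []
    best = score_fn(rows, labels, selected)  # score with no features at all
    remaining = list(candidates)
    while remaining:
        scored = [(score_fn(rows, labels, selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining feature improves the score, so discard them
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Made-up data: "signal" separates the classes, "noise" does not.
signal = [0, 1, 2, 3, 10, 11, 12, 13]
noise = [3, 9, 1, 7, 8, 2, 6, 0]
rows = [{"signal": s, "noise": n} for s, n in zip(signal, noise)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

selected, score = forward_select(rows, labels, ["noise", "signal"])
print(selected, score)  # → ['signal'] 1.0
```

With many features this loop becomes expensive, which is where the "small quick models plus variable importance" shortcut mentioned above comes in.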
3. Steve Donoho, Top Data Scientist (Read detailed article here)
His approach in past competitions:
- Well, I start by simply familiarizing myself with the data. I plot histograms and scatter plots of the various variables and see how they are correlated with the dependent variable. I sometimes run an algorithm like GBM or Random Forest on all the variables simply to get a ranking of variable importance.
- I usually start very simple and work my way toward more complex models if necessary. My first few submissions are usually just “baseline” submissions of extremely simple models – like “guess the average” or “guess the average segmented by variable X.” These are simply to establish what is possible with very simple models. You’d be surprised that you can sometimes come very close to the score of someone doing something very complex by just using a simple model.
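The two baselines Steve mentions, "guess the average" and "guess the average segmented by variable X", can be sketched as follows. The function names, the segment labels and the numbers are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def global_mean_baseline(train_targets):
    """'Guess the average': the same prediction for every row."""
    average = mean(train_targets)
    return lambda segment: average


def segmented_mean_baseline(train_segments, train_targets):
    """'Guess the average segmented by variable X': one average per segment."""
    overall = mean(train_targets)
    by_segment = defaultdict(list)
    for segment, target in zip(train_segments, train_targets):
        by_segment[segment].append(target)
    segment_means = {s: mean(values) for s, values in by_segment.items()}
    # Segments never seen in training fall back to the overall average.
    return lambda segment: segment_means.get(segment, overall)


def mean_absolute_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


train_segments = ["a", "a", "b", "b"]
train_targets = [10, 12, 30, 32]
test_segments = ["a", "b"]
test_targets = [11, 31]

guess_average = global_mean_baseline(train_targets)
guess_by_segment = segmented_mean_baseline(train_segments, train_targets)

global_mae = mean_absolute_error(test_targets, [guess_average(s) for s in test_segments])
segment_mae = mean_absolute_error(test_targets, [guess_by_segment(s) for s in test_segments])
print(global_mae, segment_mae)
```

On this toy data the segmented baseline is already exact while the global average is off by 10 on every row, which is the kind of gap a first baseline submission is meant to reveal.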
Tips from Steve:
- Making Predictions: Deciding what to predict is an important step that many miss – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable.
- Depending on the problem, I probably spend 50% of my time on data exploration and cleansing.
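Steve does not say here which derived dependent variable he has in mind, so the following is purely an assumed example: modelling the logarithm of a skewed target and transforming predictions back afterwards. The helper names and the numbers are hypothetical.

```python
import math
from statistics import mean


def to_derived(target):
    """Assumed derived target: log(1 + y), which compresses large values."""
    return math.log1p(target)


def from_derived(derived):
    """Invert the transform to get back to the original units."""
    return math.expm1(derived)


# Made-up, heavily skewed training targets.
train_targets = [1.0, 2.0, 3.0, 1000.0]

# "Guess the average" on the raw target is dominated by the single large value.
raw_prediction = mean(train_targets)

# The same baseline fitted on the derived target, then mapped back.
derived_prediction = from_derived(mean(to_derived(t) for t in train_targets))

print(raw_prediction)
print(derived_prediction)
```

Whether a transform like this helps depends on the evaluation metric, so it is worth checking on local validation before relying on it.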
Go on, use these tips from the winners and take away your first win in Knocktober.
If you still haven't registered, don't waste any more time.
Register Now
Winning prize amount:
- INR 50K (~$750) - 1st Place
- INR 25K (~$350) - 2nd Place
- INR 15K (~$225) - 3rd Place
All the best!