10 differences between a Kaggle competition and a real-life project
There are some very important differences between a Kaggle competition and a real-life project which beginner Data Scientists should know about. Kaggle creates a fantastic competitive spirit. Its leaderboard drives people to deliver better and better solutions, pushing accuracy to the limit. Kaggle's Notebooks and Discussions make it easy to share knowledge and learn. However, real-life projects are somewhat different. I hope this article will be helpful for people who consider moving into Data Science by starting with Kaggle competitions. I remember being a little overwhelmed when, on my first real-life project, all the models that typically worked well on Kaggle failed miserably. I wish I had been prepared for this.
1. Data makes the difference, not the number of models. In a Kaggle competition you are typically limited to the dataset provided by the organisers. In real life there are no such limitations. You can mine (i.e. collect and prepare) as much new data as you can imagine. Typically it is the data that makes the difference. There might also be a case when you do not have any dataset at all and have to define what data you would like to collect. Starting from scratch like that is a topic for another day.
2. Domain expertise does matter. When Jeremy Howard said it does not, he actually meant Kaggle competitions. On Kaggle you already have a dataset created for you. It is all about understanding the data distribution, crafting features, and building and stacking models. You do not need much knowledge about the data domain. However, you can benefit a lot from domain expertise whenever you have some control over the data collection process. It certainly helps you make better assumptions about the data, which might improve your modelling. I have found that it is much harder to collect proper data than to do the modelling.
3. Damn, there is no leaderboard. A Kaggle leaderboard quickly makes you notice when you have made an error. It is really difficult to benchmark your model's performance if you are the only Data Scientist working on a project. The leaderboard shows the upper limit of accuracy, so you always know how much more you can possibly do to improve your model. Unfortunately, that is not available in real-life scenarios.
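Without a leaderboard, a naive baseline plus cross-validation is the closest substitute. A minimal sketch, assuming scikit-learn; the toy dataset, model and metric are placeholders, not something from the article:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# With no leaderboard, a trivial baseline at least gives you a floor to compare against.
baseline = cross_val_score(DummyClassifier(strategy="prior"), X, y, cv=5, scoring="roc_auc")
model = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                        X, y, cv=5, scoring="roc_auc")

print(f"baseline AUC: {baseline.mean():.3f} +/- {baseline.std():.3f}")
print(f"model AUC:    {model.mean():.3f} +/- {model.std():.3f}")
```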
4. Data cleaning matters. There is no one to clean up the data for you. You are all alone with all the errors made during the data collection process, and there are typically plenty of them. A lot of improvement comes from data cleaning.
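A minimal sketch of the kind of cleaning involved, assuming pandas; the columns, values and thresholds are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [100.0, -5.0, 120.0, np.nan, 1e9],       # negative and absurd values from collection bugs
    "city":  ["Kyiv", "kyiv ", "KYIV", "Lviv", None],
})

df["city"] = df["city"].str.strip().str.title()                       # normalise inconsistent strings
df = df[df["price"].isna() | df["price"].between(0, 10_000)].copy()   # drop impossible values
df["price"] = df["price"].fillna(df["price"].median())                # impute the remaining gaps

print(df)
```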
5. Solution complexity matters. In a Kaggle competition you have to run your code only once. In real life you sometimes have to run it every 5 or 15 minutes. A newer breed of Kaggle competitions (the so-called Kernel competitions) closes that gap somewhat, but they are rather rare. Model response time is typically very important for predictions made in real time.
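A rough sketch of checking prediction latency before going to production, assuming a scikit-learn model; the data and the single-row scoring pattern are illustrative only:

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

latencies_ms = []
for row in X[:200]:
    start = time.perf_counter()
    model.predict(row.reshape(1, -1))           # real-time systems usually score one row at a time
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"p50 = {np.percentile(latencies_ms, 50):.1f} ms, p99 = {np.percentile(latencies_ms, 99):.1f} ms")
```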
6. Your test set is dynamic, and so is your train set. In a Kaggle competition you perform lots of experiments to pick the best features and models, and your environment is fixed most of the time: the train and test sets do not change. In real life you often won't have such a comfortable setup. What is done once in a Kaggle competition will have to be redone over and over again once your train set changes, and in some domains the train set changes on an hourly basis. Your whole solution, including feature and model selection, should be completely automated.
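A minimal sketch of what "completely automated" can mean in code, assuming scikit-learn; the scheduler that would call this function (cron, Airflow, etc.) is out of scope and the file name is hypothetical:

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def retrain(X, y, model_path="model.joblib"):
    """Rebuild the whole pipeline from the latest training data and persist it."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    cv_score = cross_val_score(pipeline, X, y, cv=5).mean()   # re-evaluate on every retrain
    pipeline.fit(X, y)
    joblib.dump(pipeline, model_path)
    return cv_score
```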
7. Testing and refactoring. Your code's lifecycle might be longer than two months, and you are typically going to do several deep refactoring rounds. A good test suite will help with this one. Take a look at pytest.
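A minimal pytest sketch; `add_price_per_sqm` is a hypothetical feature-engineering helper invented for illustration:

```python
import pandas as pd
import pytest


def add_price_per_sqm(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["price_per_sqm"] = out["price"] / out["area"]
    return out


def test_price_per_sqm_is_computed():
    df = pd.DataFrame({"price": [100_000.0], "area": [50.0]})
    result = add_price_per_sqm(df)
    assert result.loc[0, "price_per_sqm"] == pytest.approx(2000.0)


def test_original_frame_is_not_mutated():
    df = pd.DataFrame({"price": [100_000.0], "area": [50.0]})
    add_price_per_sqm(df)
    assert "price_per_sqm" not in df.columns
```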
8. Model deployment. Your trained model may be great, but on its own it is useless for the company. It has to be deployed into production in one way or another. Deployed typically means that the model is loaded into memory, has some interface to receive a request, and can process it and return a prediction. Web frameworks like Tornado or Flask might be helpful. MOOCs rarely mention that, and neither does Kaggle.
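A minimal Flask sketch of such an interface; the model file name and the request payload format are assumptions for illustration:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")          # loaded into memory once, at startup


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()             # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(payload["features"]).tolist()
    return jsonify({"prediction": prediction})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```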
9. Performance control. Once your model is deployed you need to control its performance. Typically you would like to retrain your model once your dataset has changed significantly or you have collected new data. Reporting (say, open-source Dash or R Shiny) will not only help you control the model's accuracy over time but also attract more business attention to data science projects in your company. McKinsey's "Say It with Charts" and Dona M. Wong's "The Wall Street Journal Guide to Information Graphics" are the bibles of the data presentation world.
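A rough sketch of tracking live accuracy over time, assuming that predictions and true outcomes are logged with timestamps; the data and the threshold are made up:

```python
import pandas as pd

log = pd.DataFrame({
    "date":   pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]),
    "y_true": [1, 0, 1, 1],
    "y_pred": [1, 0, 0, 0],
})

daily_accuracy = (
    log.assign(correct=lambda d: d["y_true"] == d["y_pred"])
       .groupby("date")["correct"]
       .mean()
)
print(daily_accuracy)

if daily_accuracy.iloc[-1] < 0.7:            # arbitrary threshold for this sketch
    print("Accuracy dropped below threshold - consider retraining the model.")
```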
10. Logging. On Kaggle, logging is typically used only to control the training flow. There is no need for different logging levels or for logging every point of your training/prediction pipeline. In a real-life project anything can happen, because the environment is much more complex: it is not limited to one notebook on a laptop or server. Good logging practices will help you spend less time on debugging.
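A minimal sketch of logging with different levels, using Python's standard logging module; the pipeline function is hypothetical:

```python
import logging

logging.basicConfig(
    level=logging.INFO,                      # switch to DEBUG when investigating an issue
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("prediction_pipeline")


def predict(features):
    logger.debug("Raw features received: %s", features)   # verbose detail, hidden at INFO level
    if not features:
        logger.warning("Empty feature payload, returning default prediction")
        return 0.0
    logger.info("Scoring %d rows", len(features))
    try:
        return sum(features) / len(features)               # stand-in for a real model call
    except Exception:
        logger.exception("Prediction failed")              # logs the full traceback
        raise
```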
Sergii Makarevych, 08.01.2020, https://www.dhirubhai.net/in/sergii-makarevych-78b62339/