Lessons learnt from a Kaggle competition
My original plan for the Kaggle used car price prediction competition was to copy-paste code from Matt Harrison's statistics course. The course covered linear regression with sklearn and then compared it with XGBoost. In practice I never got to the XGBoost bit.
Exploring the data revealed that some of the categorical columns actually contained numerical information. I used this to extract additional numerical columns for horsepower, L and Cylinders.
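As an illustration of the kind of parsing involved (a minimal sketch, not the exact notebook code), here is how this can be done with pandas. It assumes the raw engine description is a free-text string such as "240.0HP 2.0L 4 Cylinder Engine"; the real column name and format in the competition data may differ.

```python
import pandas as pd

# Hypothetical rows mimicking the kind of engine strings in the dataset.
df = pd.DataFrame({
    "engine": [
        "240.0HP 2.0L 4 Cylinder Engine Gasoline Fuel",
        "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",
        "Electric Motor",
    ]
})

# Pull numeric features out of the free-text column with regular expressions.
df["horsepower"] = df["engine"].str.extract(r"(\d+\.?\d*)HP", expand=False).astype(float)
df["L"] = df["engine"].str.extract(r"(\d+\.?\d*)L", expand=False).astype(float)
df["Cylinders"] = df["engine"].str.extract(r"(\d+) Cylinder", expand=False).astype(float)

print(df)  # rows without a match (e.g. the electric car) get NaN in the new columns
```

Note how the electric car ends up with NaN in all three new columns, which is exactly the kind of gap discussed next.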
The original data exploration had already revealed some data quality issues with empty fields, and the additional columns I created raised new questions: what do we use as the number of cylinders for, say, an electric car? It made me think again that linear regression is far from an ideal approach to this problem, but I came to this competition late in the month and, for other reasons, my time was even more limited. I wasn't expecting a great result; I wanted to use the competition to learn and try things. So, to fully expose the issues with linear regression, I used the Yellowbrick library for residual analysis.
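For context, this is roughly what that residual analysis looks like with Yellowbrick's ResidualsPlot around a sklearn LinearRegression; a minimal, self-contained sketch rather than the notebook code, with synthetic X and y standing in for the prepared feature matrix and price target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot

# Synthetic placeholder data; in the notebook, X holds the engineered features
# (horsepower, L, Cylinders, ...) and y holds the car prices.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 20_000 + X @ np.array([5_000.0, 3_000.0, 1_500.0]) + rng.normal(scale=4_000, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# ResidualsPlot wraps the estimator: fit() trains it on the training split,
# score() adds the test residuals and R^2, show() renders the scatterplot of
# residuals vs. predicted values plus a histogram of the residuals.
visualizer = ResidualsPlot(LinearRegression())
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()
```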
The residuals plot shows more of the issues, even after my removal of outliers. Linear regression will give impossible predictions (negative prices!), and while the histograms look good, the scatterplot has some suspicious diagonal features. The R^2 numbers are far from impressive too.
Overall, a good learning experience for me. People complain that Kaggle is not real life, but that doesn't mean you can't pick up useful skills by attempting the competitions.
Notebook at https://github.com/stelios-c/used_cars_regression_kaggle