Lessons learnt from Kaggle competition

My original plan for the Kaggle used-car price prediction competition was to copy and paste code from Matt Harrison's statistics course, which covers linear regression with sklearn and then compares it with XGBoost. In practice, I never got to the XGBoost part.

The exploration of the data revealed that some of the categorical columns actually contained numerical information:


engine types of gasoline fueled cars

I used that column to extract additional numerical columns for horsepower, engine displacement in litres (L), and number of cylinders.
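A minimal sketch of that kind of extraction, using only Python's standard `re` module. The sample strings, the regular expressions, and the `parse_engine` helper are my assumptions for illustration; the real column may use other spellings, and this is not necessarily the code used in the competition notebook.

```python
import re

# Assumed patterns for descriptions such as "172.0HP 1.6L 4 Cylinder Engine".
HP_RE = re.compile(r"(\d+(?:\.\d+)?)\s*HP", re.IGNORECASE)
LITRES_RE = re.compile(r"(\d+(?:\.\d+)?)\s*L\b", re.IGNORECASE)
CYL_RE = re.compile(r"(\d+)\s*Cylinder", re.IGNORECASE)


def parse_engine(text):
    """Return horsepower, litres and cylinders, or None where not stated."""
    hp = HP_RE.search(text)
    litres = LITRES_RE.search(text)
    cyl = CYL_RE.search(text)
    return {
        "horsepower": float(hp.group(1)) if hp else None,
        "litres": float(litres.group(1)) if litres else None,
        "cylinders": int(cyl.group(1)) if cyl else None,
    }


# Hypothetical example values, not rows copied from the dataset.
petrol = parse_engine("172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel")
electric = parse_engine("Electric Motor Electric Fuel System")
print(petrol)
print(electric)
```

Returning `None` when a value is absent keeps the missing-data question visible instead of hiding it behind a default such as zero.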


histogram of car horsepower

The initial data exploration had already revealed some data quality issues with empty fields, and the additional columns I created raised new questions: what should the number of cylinders be for an electric car, for example? That made me think again that linear regression is far from an ideal approach to this problem. However, I came to this competition late in the month, and for other reasons my time was even more limited. I wasn't expecting a great result, but I wanted to use the competition to learn and try things out. So, to fully expose the issues with linear regression, I used the Yellowbrick library for residual analysis.


residuals from yellowbrick library

The residuals plot shows more of the issues, even after I removed outliers. Linear regression can give impossible predictions (negative prices!), and while the histograms look good, the scatterplot has some suspicious diagonal features. The R^2 numbers are far from impressive too.
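To illustrate how a straight-line fit ends up predicting negative prices, here is a small sketch in plain Python. The mileage and price numbers are made up for illustration and do not come from the competition data, and the `fit_line` and `predict` helpers are hypothetical names, not part of sklearn or Yellowbrick.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: mileage in thousands of miles, price in dollars.
mileage = [10, 50, 100, 150]
price = [40000, 30000, 15000, 8000]

slope, intercept = fit_line(mileage, price)


def predict(x):
    """Predicted price from the fitted line."""
    return intercept + slope * x


# Residuals are the actual values minus the predicted values.
residuals = [y - predict(x) for x, y in zip(mileage, price)]

# R^2 = 1 - (residual sum of squares / total sum of squares).
mean_price = sum(price) / len(price)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_price) ** 2 for y in price)
r_squared = 1 - ss_res / ss_tot

# Extrapolating the line to a very high mileage gives a price below zero.
high_mileage_prediction = predict(250)
print(f"slope={slope:.1f}, R^2={r_squared:.3f}")
print(f"predicted price at 250k miles: {high_mileage_prediction:.0f}")
```

Nothing in the model constrains the line to stay above zero, so any car far enough along the downward slope gets an impossible price.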

Overall, it was a good learning experience for me. People complain that Kaggle is not real life, but that doesn't mean you can't pick up useful skills by attempting the competitions.
