Lessons learnt from a Kaggle competition
My original plan for the Kaggle used car price prediction competition was to copy-paste code from Matt Harrison's statistics course. The course covered linear regression with sklearn and then compared it with XGBoost. In practice I never got to the XGBoost bit.
Exploring the data revealed that some of the categorical columns actually contained numerical information. I used this to extract additional numerical columns for horsepower, L and Cylinders.
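As an illustration of the kind of parsing involved (a minimal sketch, not the exact notebook code), here is how this can be done with pandas. It assumes the raw engine description is a free-text string such as "240.0HP 2.0L 4 Cylinder Engine"; the real column name and format in the competition data may differ.

```python
import pandas as pd

# Hypothetical rows mimicking the kind of engine strings in the dataset.
df = pd.DataFrame({
    "engine": [
        "240.0HP 2.0L 4 Cylinder Engine Gasoline Fuel",
        "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",
        "Electric Motor",
    ]
})

# Pull numeric features out of the free-text column with regular expressions.
df["horsepower"] = df["engine"].str.extract(r"(\d+\.?\d*)HP", expand=False).astype(float)
df["L"] = df["engine"].str.extract(r"(\d+\.?\d*)L", expand=False).astype(float)
df["Cylinders"] = df["engine"].str.extract(r"(\d+) Cylinder", expand=False).astype(float)

print(df)  # rows without a match (e.g. the electric car) get NaN in the new columns
```

Note how the electric car ends up with NaN in all three new columns, which is exactly the kind of gap discussed next.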
The original data exploration had already revealed some data quality issues with empty fields, and the additional columns I created raised new questions: what do we use as the number of cylinders for, say, an electric car? It made me think again that linear regression is far from an ideal approach to this problem, but I came to this competition late in the month and, for other reasons, my time was even more limited. I wasn't expecting a great result; I wanted to use the competition to learn and try things. So, to fully expose the issues with linear regression, I used the Yellowbrick library for residual analysis.
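For context, this is roughly what that residual analysis looks like with Yellowbrick's ResidualsPlot around a sklearn LinearRegression; a minimal, self-contained sketch rather than the notebook code, with synthetic X and y standing in for the prepared feature matrix and price target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot

# Synthetic placeholder data; in the notebook, X holds the engineered features
# (horsepower, L, Cylinders, ...) and y holds the car prices.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 20_000 + X @ np.array([5_000.0, 3_000.0, 1_500.0]) + rng.normal(scale=4_000, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# ResidualsPlot wraps the estimator: fit() trains it on the training split,
# score() adds the test residuals and R^2, show() renders the scatterplot of
# residuals vs. predicted values plus a histogram of the residuals.
visualizer = ResidualsPlot(LinearRegression())
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()
```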
The residuals plot shows more of the issues, even after my removal of outliers. Linear regression will give impossible predictions (negative prices!), and while the histograms look good, the scatterplot has some suspicious diagonal features. The R^2 numbers are far from impressive too.
Overall, a good learning experience for me. People complain that Kaggle is not real life, but that doesn't mean you can't pick up useful skills by attempting the competitions.
Notebook at https://github.com/stelios-c/used_cars_regression_kaggle