Real estate broker working with Linear Regression on imbalanced data
I used housing price data for this analysis. Previous blog posts based on the same dataset are:
The objective of this linear regression is to predict housing prices (in millions) based on other features like area, number of bedrooms, parking facility, etc. Except for area, the other variables are either binary or discrete.
After performing a forward feature selection with necessary transformations based on residual analysis, this is the equation I came up with:
price ~ ln(area) + stories + bathrooms + airconditioning + prefarea + parking + basement + hotwaterheating + mainroad + D(furnishing_status)
Looks decent, but since I used a lot of features, I want to check for overfitting. So I do a train-test split. (Cross-validation would work as well, but the train-test split will help me show the trade-off between global fit and local fit.)
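Here is a minimal sketch of that train-test check. It assumes synthetic data in place of the housing dataset and uses only three of the features; the column names, coefficients, noise level, and the 80/20 split are all illustrative assumptions, not the fit from this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data (hypothetical columns and coefficients).
n = 500
area = rng.uniform(1500, 12000, size=n)      # square feet
stories = rng.integers(1, 5, size=n)         # 1 to 4
aircon = rng.integers(0, 2, size=n)          # binary flag
price = (-20 + 2.8 * np.log(area) + 0.5 * stories + 0.9 * aircon
         + rng.normal(0, 1.0, size=n))       # price in millions

# Design matrix with an intercept column; area enters as ln(area).
X = np.column_stack([np.ones(n), np.log(area), stories, aircon])

# 80/20 train-test split on shuffled row indices.
idx = rng.permutation(n)
train, test = idx[:400], idx[400:]

# OLS fit on the training rows only.
beta, *_ = np.linalg.lstsq(X[train], price[train], rcond=None)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse_train = rmse(price[train], X[train] @ beta)
rmse_test = rmse(price[test], X[test] @ beta)
print(f"train RMSE = {rmse_train:.2f}, test RMSE = {rmse_test:.2f}")
```

If the test RMSE is close to the train RMSE, the number of features is probably not causing serious overfitting.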
A perfect y_pred vs y_test plot will be a 45-degree straight line (the red dashed one). However, we have some residuals, so our plot should scatter around the 45-degree line (meaning no bias in prediction). If the width of the scatter (variance) around the 45-degree line is constant, it’ll help us quantify the uncertainty in the predictions we make.
For instance, since the residual standard deviation is 1.2 million, I should be able to say with roughly 95% confidence that the true price lies within +/- 2.4M of the prediction. (This is an approximate interval of about two standard deviations. For an exact one, calculate the prediction interval.)
Most of the points within the green circle satisfy these criteria. They are scattered well around the 45-degree line with constant variance. The model predicts these homes well.
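The arithmetic behind that band can be sketched as follows. The 1.2M standard deviation comes from the text; the simulated residuals (normal, constant variance) and the 5.0M example prediction are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

resid_sd = 1.2                 # residual standard deviation in millions (from the text)
half_width = 2 * resid_sd      # the +/- 2.4M band

# Hypothetical residuals: zero mean, constant variance, normally distributed.
residuals = rng.normal(0.0, resid_sd, size=10_000)
coverage = float(np.mean(np.abs(residuals) <= half_width))

y_hat = 5.0                    # hypothetical predicted price in millions
lower, upper = y_hat - half_width, y_hat + half_width

print(f"share of residuals inside the band: {coverage:.3f}")
print(f"approximate 95% range for a {y_hat:.1f}M prediction: {lower:.1f}M to {upper:.1f}M")
```

The band only means what it claims when the scatter really has constant width, which is why the plot matters.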
But the points inside the red circle are being systematically under-estimated. Basically, the model under-estimated the prices of costly homes. Why?
One definite reason is that the dataset has very few cases of costly homes as you can see in the plot itself. Most houses cost less than 8M.
OLS linear regression optimizes for a global fit. It simply minimizes the sum of squared residuals over all points (equivalently, the RMSE). As a result, it makes sense for the regression line to cut across densely populated points to reduce the residuals there. A few large residuals on rare points cost less in total than slightly larger residuals on hundreds of common points.
In more technical words, each point has equal weight in OLS. Costly houses, being fewer in number, have less collective influence on the regression model. Cheaper houses, being more numerous, influence the model more. And thus the resulting model under-estimates the price of costly houses.
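A small sketch of this effect, on made-up data rather than the housing dataset: 190 ordinary homes and 10 costly ones, where the costly homes carry a price premium that a single straight line cannot capture. The feature x, the premium, and every number here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: x is a made-up size score, y is price in millions.
x_cheap = rng.uniform(0.0, 1.0, size=190)
x_costly = rng.uniform(1.0, 1.5, size=10)
y_cheap = 3 + 2 * x_cheap + rng.normal(0, 0.3, size=190)
y_costly = 3 + 2 * x_costly + 3 + rng.normal(0, 0.3, size=10)   # +3M premium

x = np.concatenate([x_cheap, x_costly])
y = np.concatenate([y_cheap, y_costly])
is_costly = np.arange(x.size) >= 190

# OLS: every row contributes equally to the sum of squared residuals.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta   # positive residual = the model under-estimated the price

print("mean residual, ordinary homes:", round(float(resid[~is_costly].mean()), 2))
print("mean residual, costly homes:  ", round(float(resid[is_costly].mean()), 2))
```

The line stays close to the 190 ordinary homes, so the 10 costly ones end up with a clearly positive average residual, which is the under-estimation seen inside the red circle.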
If I am a real-estate broker, how should I use my model now?
The internet is full of advice on what to do when a dataset is imbalanced. Most of it has something to do with creating synthetic data (SMOTE, upsampling, etc.). However, I will stick with the raw data for now and discuss various ways of working with our model. By the end, we will be able to appreciate these other methods better as well.
1. Be an efficient middle-class (local) broker:
If a client asks for a massive, fully furnished, main-road-facing house with 3-4 bedrooms and so on (the features of those costly homes), I will refrain from predicting a price for them. Or I will give my prediction only as a rough lower bound, since the model under-estimates such homes. In technical terms, I'll infer only about the population my sample data best represents, and I'll do it well.
2. Be a less efficient global broker:
I would gain more commission on selling costly houses, not middle-class ones. So, I want to be able to predict prices of costly homes fairly well. How? I can do that by increasing the weight of these costly homes in the least-squares optimization. This is Weighted Least Squares (WLS) Regression.
The z-score of price is chosen as the weights here, so that outliers get more weight. There are other standard ways of choosing weights as well, depending on what you want to achieve.
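A minimal WLS sketch on made-up data (190 ordinary homes, 10 costly ones with a premium). Two things here are my assumptions and not taken from the analysis above: the synthetic data itself, and the use of the absolute z-score of price as the weight, since least-squares weights must be non-negative. The helper fit_wls is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: x is a made-up size score, y is price in millions.
x = np.concatenate([rng.uniform(0.0, 1.0, 190), rng.uniform(1.0, 1.5, 10)])
is_costly = np.arange(x.size) >= 190
y = 3 + 2 * x + np.where(is_costly, 3.0, 0.0) + rng.normal(0, 0.3, x.size)
X = np.column_stack([np.ones_like(x), x])

def fit_wls(X, y, w):
    """Minimize sum(w * (y - X @ beta)**2) by rescaling each row with sqrt(w)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Assumed weighting: absolute z-score of price, so far-out prices count more.
z = (y - y.mean()) / y.std()
w = np.abs(z)

beta_ols = fit_wls(X, y, np.ones_like(y))   # equal weights reduce to OLS
beta_wls = fit_wls(X, y, w)

# Average under-estimation of the costly homes under each fit.
under_ols = float((y - X @ beta_ols)[is_costly].mean())
under_wls = float((y - X @ beta_wls)[is_costly].mean())
print(f"costly homes, mean residual: OLS {under_ols:.2f}, WLS {under_wls:.2f}")
```

With the extra weight on extreme prices, the fitted line tilts toward the costly homes and their average residual shrinks.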
Now I can predict the prices of costly houses better than before, at least up to 10M. If I adjust the weights further, I may do even better. But can you see what I’ve lost?
I’ve lost my efficiency. Overall residuals have increased: among linear fits, OLS gives the smallest sum of squared residuals on the data it is fit to, so any other weighting can only increase it. My performance with middle-class homes has deteriorated.
Basically, I traded some efficiency on the majority of houses for better efficiency on costly houses.
It is not necessarily bad.
If my compromised efficiency is still practical enough in the real world, I can actually become a better broker with this weighted least squares regression.
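The loss of overall efficiency is not an accident of this dataset. It follows from what the two methods minimize. The notation below (rows x_i, prices y_i, weights w_i, coefficients beta) is my own shorthand, and the inequality holds on the data the models are fit to:

```latex
\begin{align*}
\hat{\beta}_{\mathrm{OLS}} &= \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^2, \\
\hat{\beta}_{\mathrm{WLS}} &= \arg\min_{\beta} \sum_{i=1}^{n} w_i \left(y_i - x_i^{\top}\beta\right)^2, \\
\sum_{i=1}^{n} \left(y_i - x_i^{\top}\hat{\beta}_{\mathrm{WLS}}\right)^2
  &\ge \sum_{i=1}^{n} \left(y_i - x_i^{\top}\hat{\beta}_{\mathrm{OLS}}\right)^2 .
\end{align*}
```

Since the OLS coefficients are by definition the minimizer of the unweighted sum, the WLS coefficients cannot beat them on that measure. They can only redistribute the error toward the points with small weights.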
3. Explore Generalised Linear models (GLM) or other ML models
WLS is one way to go beyond plain OLS (strictly speaking, it is a special case of generalized least squares rather than a GLM). We can also try other approaches, like fitting two different lines: one for middle-class homes and one for rich ones. Or we can explore other ML models like Decision Tree Regressors, SVMs, etc. However, they may not be as explainable as Linear Regression models.
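The two-line idea can be sketched like this, again on made-up data. I split on a feature (a hypothetical size score x with an assumed cut-off of 1.0) rather than on price, because price is not known at prediction time. The helper names and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: homes with x > 1.0 carry a 3M premium.
x = np.concatenate([rng.uniform(0.0, 1.0, 190), rng.uniform(1.0, 1.5, 10)])
y = 3 + 2 * x + np.where(x > 1.0, 3.0, 0.0) + rng.normal(0, 0.3, x.size)

def fit_line(x, y):
    """OLS fit of y = b0 + b1 * x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predict_line(beta, x):
    return beta[0] + beta[1] * x

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

THRESHOLD = 1.0                  # assumed cut-off on the feature, not on price
high = x > THRESHOLD

beta_all = fit_line(x, y)                  # one line for everyone
beta_low = fit_line(x[~high], y[~high])    # line for middle-class homes
beta_high = fit_line(x[high], y[high])     # line for rich homes

pred_single = predict_line(beta_all, x)
pred_two = np.where(high, predict_line(beta_high, x), predict_line(beta_low, x))

rmse_single = rmse(y, pred_single)
rmse_two = rmse(y, pred_two)
print(f"RMSE with one line: {rmse_single:.2f}, with two lines: {rmse_two:.2f}")
```

The catch is that the line for rich homes is fit on very few points, so its coefficients will be far less stable than the middle-class ones.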