Real estate broker working with Linear Regression on imbalanced data

I used housing price data for this analysis. Previous blogs based on the same dataset are:

  1. How’d you lose in real-estate if you don’t understand non-linearity?
  2. Misleading residual plots when dependent variable is log transformed


The objective of this linear regression is to predict housing prices (in millions) based on other features like area, number of bedrooms, parking facility, etc. Except for area, the other variables are either binary or discrete.

After performing forward feature selection, with the necessary transformations based on residual analysis, this is the equation I came up with:

price ~ ln(area) + stories + bathrooms + airconditioning + prefarea + parking + basement + hotwaterheating + mainroad + D(furnishing_status)

Looks decent, but since I used a lot of features, I want to check for overfitting. So I do a train-test split. (Cross-validation would work as well, but the train-test split will help me show the trade-off between global fit and local fit.)
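A minimal sketch of such a check, assuming NumPy and a synthetic stand-in for the data (the coefficients, the sample size, and the 80/20 split are illustrative assumptions, not values from the actual analysis). It fits OLS on the training rows only and compares train and test RMSE; a test RMSE far above the train RMSE would hint at overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data (NOT the real dataset).
n = 400
area = rng.uniform(1500, 12000, size=n)      # hypothetical area values
aircon = rng.integers(0, 2, size=n)          # hypothetical binary feature
price = -20 + 3.0 * np.log(area) + 0.8 * aircon + rng.normal(0, 1.0, size=n)

# Design matrix: intercept, ln(area), airconditioning.
X = np.column_stack([np.ones(n), np.log(area), aircon])

# Random 80/20 train-test split.
idx = rng.permutation(n)
train, test = idx[:320], idx[320:]

# OLS fit on the training rows only.
beta, *_ = np.linalg.lstsq(X[train], price[train], rcond=None)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse_train = rmse(price[train], X[train] @ beta)
rmse_test = rmse(price[test], X[test] @ beta)
print(f"train RMSE = {rmse_train:.2f}, test RMSE = {rmse_test:.2f}")
```

With the real data, the design matrix would carry all the features in the equation above, and libraries such as statsmodels or scikit-learn would do the same job with less code.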

A perfect y_pred vs y_test plot will be a 45-degree straight line (the red dashed one). However, we have some residuals, so our plot should scatter around the 45-degree line (meaning no bias in prediction). If the width of the scatter (variance) around the 45-degree line is constant, it’ll help us quantify the uncertainty in the predictions we make.

For instance, since the residual standard deviation is 1.2 million, I should be able to say with roughly 95% confidence that the true price will be within +/- 2.4M of the prediction. (This is an approximate interval of two standard deviations. For an exact one, calculate the prediction interval.)
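The arithmetic behind that rough range, as a small sketch. The 1.2M residual standard deviation comes from the text; the 5.0M point prediction is a hypothetical value, and the two-standard-deviation rule assumes roughly normal residuals with constant variance.

```python
residual_sd = 1.2       # residual standard deviation in millions (from the text)
predicted_price = 5.0   # hypothetical point prediction in millions

# Rule of thumb: about 95% of normally distributed residuals fall within 2 SD.
half_width = 2 * residual_sd
lower = predicted_price - half_width
upper = predicted_price + half_width
print(f"approx. 95% range: {lower:.1f}M to {upper:.1f}M")
```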

Most of the points within the green circle satisfy these criteria. They are scattered well around the 45-degree line with constant variance. These predictions can be made well.

But the points inside the red circle are being systematically under-estimated. Basically, the model under-estimated the prices of costly homes. Why?

One definite reason is that the dataset has very few cases of costly homes as you can see in the plot itself. Most houses cost less than 8M.

OLS linear regression optimizes for a global fit. It just tries to minimize the overall squared residuals (equivalently, the RMSE). As a result, it makes sense for the regression line to cut across the densely populated points to reduce the residuals there. Accepting a huge one-off residual costs less overall than inflating hundreds of small ones.

In more technical words, in OLS each point has equal weight. Costly houses, being fewer in number, have less influence on the regression model. Cheaper houses, being more numerous, influence the model more. And thus the resultant model under-estimates the price of costly houses.
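A toy illustration of this effect, assuming NumPy. The data are invented for the sketch (100 cheap homes on one trend, 3 costly homes on a steeper one) and are not the housing dataset. With equal weights, the line follows the majority, and every costly home ends up with a positive residual, i.e. it is under-estimated.

```python
import numpy as np

# Toy data (NOT the housing dataset): 100 cheap homes and only 3 costly ones.
x_cheap = np.linspace(1, 5, 100)
y_cheap = x_cheap.copy()                 # cheap homes: price rises 1:1 with x
x_costly = np.array([8.0, 9.0, 10.0])
y_costly = 2 * x_costly - 5              # costly homes: price rises faster

x = np.concatenate([x_cheap, x_costly])
y = np.concatenate([y_cheap, y_costly])
is_costly = x > 5

# OLS: every point carries the same weight.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta                 # positive means under-estimated

print("mean residual, cheap homes :", round(float(residuals[~is_costly].mean()), 2))
print("mean residual, costly homes:", round(float(residuals[is_costly].mean()), 2))
```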

If I am a real-estate broker, how should I use my model now?

The internet is full of advice on what to do when a dataset is imbalanced. Most of it has something to do with creating synthetic data (SMOTE, upsampling, etc.). However, I will stick with the raw data for now and discuss various ways of dealing with our model. By the end, we can better appreciate these other methods as well.

1. Be an efficient middle-class (local) broker:

If a client asks for a massive, fully furnished, main-road-facing house with 3-4 bedrooms, etc. (the features of those costly homes), I will refrain from predicting a price for them, or give my prediction as a liberal lower bound. In technical terms, I’ll infer only about the population my sample data best represents, and I’ll do it well.

2. Be a less efficient global broker:

I would gain more commission on selling costly houses, not middle-class ones. So, I want to be able to predict the prices of costly homes fairly well. How? I can do that by increasing the weight of these costly homes in the least-squares optimization. This is Weighted Least Squares (WLS) regression.

The z-score of price is chosen as the weight here, so that outliers get more weight. There are other standard ways of choosing weights as well, depending on what you want to achieve.
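A minimal WLS sketch on toy data, assuming NumPy. Two details are assumptions of this sketch rather than the original analysis: the weights are the absolute z-score of price (WLS weights cannot be negative, and raw z-scores are negative for below-average prices), and a small floor of 0.01 keeps any weight from being exactly zero. The helper `fit_line` is a hypothetical name.

```python
import numpy as np

def fit_line(x, y, w=None):
    """Least-squares line; w are optional non-negative per-point weights."""
    X = np.column_stack([np.ones_like(x), x])
    if w is None:
        w = np.ones_like(y)
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns WLS into an ordinary least-squares problem.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Toy data (NOT the housing dataset): 100 cheap homes, 3 costly ones.
x = np.concatenate([np.linspace(1, 5, 100), [8.0, 9.0, 10.0]])
y = np.where(x > 5, 2 * x - 5, x)
costly = x > 5

# Assumed weighting: absolute z-score of price plus a small floor.
z = (y - y.mean()) / y.std()
weights = np.abs(z) + 0.01

beta_ols = fit_line(x, y)
beta_wls = fit_line(x, y, weights)

X = np.column_stack([np.ones_like(x), x])
err_ols = float(np.abs(y - X @ beta_ols)[costly].mean())
err_wls = float(np.abs(y - X @ beta_wls)[costly].mean())
print(f"mean abs error on costly homes: OLS = {err_ols:.2f}, WLS = {err_wls:.2f}")
```

In practice, statsmodels offers a `WLS` model that takes a `weights` argument and also reports standard errors, which this sketch does not.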

Now I can predict the prices of costly houses better than before, at least up to 10M. If I adjust the weights further, I may do even better. But can you see what I’ve lost?

I’ve lost my efficiency. Overall residuals have increased (among linear fits, OLS gives the smallest sum of squared residuals, so any other weighting can only increase it). My performance with middle-class homes has deteriorated.

Basically, I traded some of my efficiency with the majority of houses for some efficiency with costly houses.
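The trade can be made visible with a small sketch, assuming NumPy and the same kind of toy data. Here the weighting is a simple hypothetical multiplier `k` on the costly homes (not the z-score scheme), so `k = 1` is plain OLS. As `k` grows, the RMSE on costly homes falls while the RMSE on cheap homes and the overall RMSE rise.

```python
import numpy as np

# Toy data (NOT the housing dataset): 100 cheap homes, 3 costly ones.
x = np.concatenate([np.linspace(1, 5, 100), [8.0, 9.0, 10.0]])
y = np.where(x > 5, 2 * x - 5, x)
costly = x > 5
X = np.column_stack([np.ones_like(x), x])

def rmse(r):
    """Root mean squared value of a residual vector."""
    return float(np.sqrt(np.mean(r ** 2)))

results = []
for k in (1, 5, 25):                     # k = 1 is plain OLS
    w = np.where(costly, float(k), 1.0)  # hypothetical weight multiplier
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    r = y - X @ beta
    results.append((k, rmse(r[~costly]), rmse(r[costly]), rmse(r)))
    print(f"k={k:>2}: cheap RMSE={results[-1][1]:.2f}, "
          f"costly RMSE={results[-1][2]:.2f}, overall RMSE={results[-1][3]:.2f}")
```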

It is not necessarily bad.

If my compromised efficiency is still practical enough in the real world, I can actually become a better broker with this weighted least squares regression.

3. Explore Generalised Linear models (GLM) or other ML models

WLS is one such extension of OLS (strictly speaking, it is a special case of generalised least squares rather than a GLM). But we can also try others, like fitting two different lines: one for middle-class homes and one for rich ones. Or we can explore other ML models like decision tree regressors, SVMs, etc. However, they may not be as explainable as linear regression models.
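A sketch of the two-lines idea, assuming NumPy and toy data. The cut-off of 5.0 on the feature `x` is a hypothetical threshold chosen for this example. The split is made on a feature rather than on price, because price is unknown at prediction time.

```python
import numpy as np

# Toy data (NOT the housing dataset): 100 cheap homes, 3 costly ones.
x = np.concatenate([np.linspace(1, 5, 100), [8.0, 9.0, 10.0]])
y = np.where(x > 5, 2 * x - 5, x)
threshold = 5.0                          # hypothetical cut-off on the feature

def fit_line(x, y):
    """OLS intercept and slope for a single predictor."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# One line per segment.
beta_low = fit_line(x[x <= threshold], y[x <= threshold])
beta_high = fit_line(x[x > threshold], y[x > threshold])

def predict(x_new):
    """Route each input to the line fitted for its segment."""
    x_new = np.asarray(x_new, dtype=float)
    low = beta_low[0] + beta_low[1] * x_new
    high = beta_high[0] + beta_high[1] * x_new
    return np.where(x_new > threshold, high, low)

max_abs_error = float(np.max(np.abs(y - predict(x))))
print("largest absolute error with two lines:", max_abs_error)
```

The toy data are noise-free, so both lines fit their segments almost exactly. With real data, the line for costly homes would rest on very few points and would be much less certain than the one for the majority.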
