Real estate broker working with Linear Regression on imbalanced data

Sai Krishna Dammalapati

Civic Technology | Statistics | Data | Science

Published: October 30, 2024

I used housing price data for this analysis, the same dataset used in my previous blogs.

The objective of this linear regression is to predict housing prices (in Millions) based on other features like area, number of bedrooms, parking facility, etc. Except for area, other variables are either binary or discrete variables.

After performing a forward feature selection with the necessary transformations based on residual analysis, this is the equation I came up with:

price ~ ln(area) + stories + bathrooms + airconditioning + prefarea + parking + basement + hotwaterheating + mainroad + D(furnishing_status)
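
To make the formula concrete, here is a minimal sketch of how one house record could be turned into the regressors it names. The three furnishing levels and the example house are assumptions made for illustration; the article does not list them.

```python
import math

# Assumed levels of furnishing_status; the article does not list them.
FURNISHING_LEVELS = ["furnished", "semi-furnished", "unfurnished"]

# Binary or discrete features that enter the formula unchanged.
PLAIN_FEATURES = ["stories", "bathrooms", "airconditioning", "prefarea",
                  "parking", "basement", "hotwaterheating", "mainroad"]


def design_row(house):
    """Build the regressor vector for one house, following the formula above."""
    row = [1.0, math.log(house["area"])]          # intercept, ln(area)
    row += [float(house[name]) for name in PLAIN_FEATURES]
    # D(furnishing_status): one 0/1 dummy per level, dropping the first as baseline
    row += [1.0 if house["furnishing_status"] == level else 0.0
            for level in FURNISHING_LEVELS[1:]]
    return row


# A made-up house, only to show the shape of the output.
example_house = {
    "area": 6000, "stories": 2, "bathrooms": 2, "airconditioning": 1,
    "prefarea": 0, "parking": 1, "basement": 0, "hotwaterheating": 0,
    "mainroad": 1, "furnishing_status": "semi-furnished",
}
row = design_row(example_house)
print(len(row), row)
```

Dropping one furnishing level as the baseline avoids perfect collinearity with the intercept.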

Looks decent, but since I used a lot of features I want to check for overfitting, so I do a train-test split. (Cross-validation would also work, but the train-test split will help me show the trade-off between global fit and local fit.)
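
A minimal sketch of such a split, using a single feature and synthetic data invented for illustration (not the housing dataset). The 80/20 ratio and the fixed seed are assumptions.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(points, intercept, slope):
    """Root mean squared error of the line on a list of (x, y) points."""
    return math.sqrt(sum((y - (intercept + slope * x)) ** 2 for x, y in points)
                     / len(points))


rng = random.Random(0)  # fixed seed so the split is reproducible

# Synthetic (ln_area, price) pairs, invented for illustration only.
data = []
for _ in range(200):
    ln_area = rng.uniform(7.0, 10.0)
    data.append((ln_area, 1.0 + 0.6 * ln_area + rng.gauss(0.0, 0.5)))

rng.shuffle(data)
cut = len(data) * 4 // 5            # 80% train, 20% test
train, test = data[:cut], data[cut:]

intercept, slope = fit_line(*zip(*train))
train_rmse = rmse(train, intercept, slope)
test_rmse = rmse(test, intercept, slope)
print(f"train RMSE {train_rmse:.2f}, test RMSE {test_rmse:.2f}")
```

A test RMSE far above the train RMSE would point to overfitting.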

A perfect y_pred vs y_test plot will be a 45-degree straight line (the red dashed one). However, we have some residuals, so our plot should scatter around the 45-degree line (meaning no bias in prediction). If the width of the scatter (variance) around the 45-degree line is constant, it’ll help us quantify the uncertainty in the predictions we make.

For instance, since the residual standard deviation is 1.2 Million, I should be able to tell with 95% confidence that the true price will be within roughly +/- 2.4 M (about two standard deviations) of the predicted price. (This is an approximate interval. For an exact one, calculate the prediction interval.)
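
The arithmetic behind that rough band, as a small sketch. The predicted price of 6.0M is a hypothetical value chosen for illustration, and the band is valid only if the residual variance is roughly constant.

```python
def approx_interval(predicted_price, residual_sd, z=2.0):
    """Rough 95% band: prediction +/- z * residual SD.

    This is not an exact prediction interval, which would also account
    for uncertainty in the fitted coefficients.
    """
    half_width = z * residual_sd
    return predicted_price - half_width, predicted_price + half_width


# Hypothetical prediction of 6.0M with the article's residual SD of 1.2M.
low, high = approx_interval(predicted_price=6.0, residual_sd=1.2)
print(f"{low:.1f}M to {high:.1f}M")
```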

Most of the points within the green circle satisfy these criteria. They are scattered well around the 45-degree line with constant variance. These predictions can be made well.

But the points inside the red circle are being systematically under-estimated. Basically, the model under-estimated the prices of costly homes. Why?

One definite reason is that the dataset has very few cases of costly homes as you can see in the plot itself. Most houses cost less than 8M.

OLS linear regression optimizes for a global fit. It just tries to minimize the overall squared residuals (equivalently, the RMSE). As a result, it makes sense for the regression line to cut across the densely populated points to reduce the residuals there. A single huge one-off residual costs less in total than hundreds of small residuals each getting a little worse.

In more technical words, in OLS each point has equal weightage. Costly houses, being fewer in number, have less influence on the regression model. Cheaper houses, being more numerous, influence the model more. And thus the resultant model under-estimates the price of costly houses.
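
This effect can be reproduced on toy data. The sketch below uses invented numbers (16 typical homes on a gentle trend and 2 costly homes sitting above it) and a single hypothetical "size" feature, not the actual housing dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented (size, price) pairs: many typical homes, very few costly ones.
typical = [(x, 2.0 + 0.5 * x + d) for x in range(1, 9) for d in (0.3, -0.3)]
costly = [(9, 10.0), (10, 13.0)]

xs, ys = zip(*(typical + costly))
intercept, slope = fit_line(xs, ys)

# Residual = actual - predicted; positive means the model under-estimates.
costly_residuals = [price - (intercept + slope * size) for size, price in costly]
print([round(r, 2) for r in costly_residuals])  # both residuals are positive
```

Because the 16 typical homes dominate the sum of squared residuals, the fitted line stays close to their trend and falls short of both costly homes.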

If I am a real-estate broker, how should I use my model now?

The internet is full of advice on what to do when a dataset is imbalanced. Most of it has something to do with creating synthetic data (SMOTE, upsampling, etc.). However, I will stick with the raw data for now and discuss various ways of dealing with our model. By the end, we can better appreciate these other methods as well.

1. Be an efficient middle-class (local) broker:

If a client asks for a massive, fully furnished, main-road-facing house with 3-4 bedrooms, etc. (the features of those costly homes), I will refrain from predicting a price for them, or give my prediction as a liberal lower bound. In technical terms, I’ll infer only about the population my sample data best represents, and I’ll do it well.

2. Be a less efficient global broker:

I would gain more commission on selling costly houses, not middle-class ones. So, I want to be able to predict prices of costly homes fairly well. How? I can do that by increasing the weightage of these costly homes during the RMSE optimization. This is Weighted Least Squares (WLS) Regression.

The z-score of price is chosen as the weights here, so that outliers get more weight. There are other standard ways of choosing weights as well, depending on what you want to achieve.

Now I can predict the prices of costly houses better than before, at least up to 10M. If I further adjust the weights, I may do even better. But can you see what I’ve lost?

I’ve lost my efficiency. Overall residuals have increased (among linear fits, OLS gives the smallest sum of squared residuals, so anything else will increase it). My performance with middle-class homes has deteriorated.

Basically, I traded some of my efficiency on the majority of houses for some efficiency on costly houses.

It is not necessarily bad.

If my compromised efficiency is still practical enough in the real world, I can actually become a better broker with this weighted least squares regression.
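
The weighting and the trade-off described in this section can be sketched on toy data. The numbers are invented and there is only one hypothetical "size" feature. One assumption needs flagging: raw z-scores are negative for below-average prices and cannot serve as weights directly, so this sketch uses 1 + max(z, 0). The article does not say how that was handled.

```python
import math


def fit_weighted_line(xs, ys, ws):
    """Weighted least squares for one feature: minimizes sum(w * residual**2)."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(points, intercept, slope):
    """Root mean squared error of the line on a list of (x, y) points."""
    return math.sqrt(sum((y - (intercept + slope * x)) ** 2 for x, y in points)
                     / len(points))


# Invented (size, price) pairs: many typical homes, very few costly ones.
typical = [(x, 2.0 + 0.5 * x + d) for x in range(1, 9) for d in (0.3, -0.3)]
costly = [(9, 10.0), (10, 13.0)]
xs, ys = zip(*(typical + costly))

mean_price = sum(ys) / len(ys)
sd_price = math.sqrt(sum((y - mean_price) ** 2 for y in ys) / len(ys))
# Assumption: clip negative z-scores at 0 and add 1 so every weight is positive.
weights = [1.0 + max((y - mean_price) / sd_price, 0.0) for y in ys]

ols = fit_weighted_line(xs, ys, [1.0] * len(xs))  # equal weights reduce to OLS
wls = fit_weighted_line(xs, ys, weights)

for name, (b0, b1) in (("OLS", ols), ("WLS", wls)):
    print(name,
          "costly RMSE:", round(rmse(costly, b0, b1), 2),
          "typical RMSE:", round(rmse(typical, b0, b1), 2))
```

On this toy data the WLS line has a lower error on the costly homes and a higher error on the typical ones, which is the trade described above.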

3. Explore Generalised Linear models (GLM) or other ML models

WLS is a step in this direction (strictly speaking, it is a special case of generalized least squares rather than a GLM). But we can also try other approaches, like fitting two different lines: one for middle-class homes and one for rich ones. Or we can explore other ML models like Decision Tree Regressors, SVMs, etc. However, they may not be as explainable as Linear Regression models.
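
The "two different lines" idea might look like the sketch below. Splitting on a size threshold of 8 is an assumption (the article does not say how the two groups would be separated), and the data is invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_two_lines(homes, threshold):
    """Fit one line to homes at or below the size threshold, another above it."""
    low = [(x, y) for x, y in homes if x <= threshold]
    high = [(x, y) for x, y in homes if x > threshold]
    return fit_line(*zip(*low)), fit_line(*zip(*high))


def predict_two_lines(model, threshold, x):
    """Pick the segment by size, then apply that segment's line."""
    intercept, slope = model[0] if x <= threshold else model[1]
    return intercept + slope * x


# Invented (size, price) pairs: 16 typical homes and 3 costly ones.
typical = [(x, 2.0 + 0.5 * x + d) for x in range(1, 9) for d in (0.3, -0.3)]
costly = [(9, 10.0), (10, 13.0), (11, 15.0)]
homes = typical + costly

THRESHOLD = 8  # hypothetical cut-off between the two groups
two_lines = fit_two_lines(homes, THRESHOLD)
single_b0, single_b1 = fit_line(*zip(*homes))

single_pred = single_b0 + single_b1 * 10
split_pred = predict_two_lines(two_lines, THRESHOLD, 10)
print(f"actual 13.0, single line {single_pred:.2f}, two lines {split_pred:.2f}")
```

With so few costly homes, the second line rests on very little data, so its own uncertainty is large.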
