Towards constructing an unbiased Machine Learning regression model — an example of Airbnb rental prices
Photo by Kevin Ku (https://unsplash.com/@ikukevk) on Unsplash

As a follow-up to one of my recent articles, I received a comment from Eugenia Anello about the use of label log-transformation in Machine Learning regression tasks. I use this method often, so here I try to shed more light on its potential caveats and discuss a viable alternative. The results of this study are also available in this public Kaggle notebook.


Step 1 — data preprocessing

Here I take the same dataset and use the same preprocessing steps as in the abovementioned article.


Step 2 — the impact of label log-transformation on model residuals

Here, I look at model residuals (predicted minus actual log-transformed Airbnb price in EUR for two people and two nights) for two models:

  • the model trained on the log-transformed price (x -> np.log10(x)), dubbed log-transform;
  • the model trained on the price without log-transformation, dubbed no log-transform.
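
To make the comparison concrete, here is a minimal sketch of how the two sets of residuals can be computed. Everything in it is an assumption for illustration: the data are synthetic, the single feature `size` is hypothetical, and an ordinary least-squares fit stands in for whatever regressor the notebook actually uses. Only the residual definition (predicted minus actual log10 price) comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): one feature and a right-skewed price in EUR.
n = 2000
size = rng.uniform(20, 200, n)
price = 10 ** (1.5 + 0.006 * size + rng.normal(0, 0.15, n))

# Design matrix with an intercept column for a plain least-squares fit.
X = np.column_stack([np.ones(n), size])

# "log-transform" model: fit on log10(price), so predictions are already in log10 units.
coef_log, *_ = np.linalg.lstsq(X, np.log10(price), rcond=None)
pred_log = X @ coef_log

# "no log-transform" model: fit on the raw price, then convert predictions to log10.
# Clipping at 1 EUR avoids taking the logarithm of a non-positive prediction.
coef_raw, *_ = np.linalg.lstsq(X, price, rcond=None)
pred_raw = np.log10(np.clip(X @ coef_raw, 1.0, None))

# Residuals in dex: predicted minus actual log10 price, for both models.
resid_log = pred_log - np.log10(price)
resid_raw = pred_raw - np.log10(price)
```

Measuring both residuals in the same log10 space is what makes the two models directly comparable, even though only one of them was trained on the transformed label.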

For ease of comparison, I also plot the median binned values obtained with the help of scipy.stats.binned_statistic.
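
As a rough illustration of the binning step, the sketch below computes the median residual in equal-width bins of the actual log10 price with scipy.stats.binned_statistic. The arrays `log_price` and `residual` are synthetic placeholders with a built-in trend, and the choice of 10 bins is an assumption, not the setting used in the notebook.

```python
import numpy as np
from scipy.stats import binned_statistic

rng = np.random.default_rng(1)

# Synthetic placeholders: actual log10 price and a residual that drifts with price.
log_price = rng.uniform(1.5, 4.0, 5000)
residual = -0.3 * (log_price - 2.5) + rng.normal(0, 0.15, log_price.size)

# Median residual per equal-width bin of log10 price (10 bins is an arbitrary choice).
median, edges, _ = binned_statistic(log_price, residual, statistic="median", bins=10)

# Bin centers are convenient x-coordinates for plotting the medians.
centers = 0.5 * (edges[:-1] + edges[1:])
```

Plotting `median` against `centers` on top of the scatter of individual residuals shows any systematic drift that a single RMSE number hides.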

Because of the log-transformation, the model residuals are measured in dex units (1 dex corresponds to a factor of 10).

Source: author, https://www.kaggle.com/code/dima806/log-transform-example-airbnb


The median binned values reveal a notable drawback of both models. While the root mean squared error (RMSE) of the log-transformed models is only about 0.145 dex (roughly a 40% variation), the systematic differences between the predicted and actual prices are much larger, especially for the highest-priced apartments, whose predicted listing price is up to 10**1.7 = 50 times smaller than the actual listing price.
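
The two conversions from dex to multiplicative factors quoted above can be checked directly. The helper name `dex_to_factor` is introduced here only for illustration.

```python
def dex_to_factor(dex: float) -> float:
    """Convert a difference in log10 units (dex) into a multiplicative factor."""
    return 10.0 ** dex


rmse_factor = dex_to_factor(0.145)   # about 1.40, i.e. roughly a 40% variation
worst_factor = dex_to_factor(1.7)    # about 50, the largest systematic offset quoted above
```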


Step 3 — reducing the model bias with minority oversampling

In practice, choosing a proper method to mitigate model bias usually depends on business requirements. Namely, what are the consequences if our model significantly overestimates or underestimates the actual rental price? What are the “accepted” limits? Are there any service-level agreements (SLAs) covering this?

As an example, assume that our SLA requires the median model price to deviate from the actual price by no more than 1.0 dex (a factor of 10). As the plot above shows, this is violated for high-priced apartments (> 10**4 = 10,000 EUR for two people and two nights). A simple correction is to oversample high-priced apartments so that the model pays more attention to such samples, for example by increasing the number of apartments priced at about 1,000 EUR by 30 times.
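
One simple way to implement this kind of oversampling is to repeat the selected rows before training. The sketch below is an assumption about how it could be done, not the notebook's actual code: the function name `oversample_high_priced` is hypothetical, and the selection rule (price of at least 1,000 EUR) is one reading of the text, which does not spell out the exact rule. The factor of 30 is taken from the paragraph above.

```python
import numpy as np


def oversample_high_priced(X, y, threshold, factor):
    """Repeat every row whose target is >= threshold so that it appears `factor` times."""
    X = np.asarray(X)
    y = np.asarray(y)
    # Rows at or above the threshold are repeated `factor` times, all others once.
    repeats = np.where(y >= threshold, factor, 1)
    return np.repeat(X, repeats, axis=0), np.repeat(y, repeats)


# Toy example: five listings with two placeholder features each, prices in EUR.
X_toy = np.arange(10).reshape(5, 2)
y_toy = np.array([50.0, 120.0, 900.0, 1500.0, 80.0])

X_over, y_over = oversample_high_priced(X_toy, y_toy, threshold=1000.0, factor=30)
```

Oversampling of this kind should be applied to the training split only, so that the evaluation data keep the original price distribution.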

As expected, the oversampled model performs much better (now within the SLA range) for high-priced listings and still performs well for both low-priced and medium-priced listings:
Source: author, https://www.kaggle.com/code/dima806/log-transform-example-airbnb

I hope these results are useful for you. If you have questions or comments, do not hesitate to write in the comments below or reach me directly through LinkedIn or Twitter.

Viktor Begun

Data Scientist with strong background in physics and math, modelling and statistics.

1 year ago

How are you going to deal with overfit? The problem is that you artificially added data, which may not exist. If the real data are too different from your oversampled case, then you introduced an arbitrary bias, isn't it?

Isaac Yakubu

Curious Software Engineer | Technical Writer | A.I Enthusiast | Content Creator

1 year ago

Love this

Eugenia Anello

Data scientist | Technical Writer at Towards Data Science | Geospatial data analysis

1 year ago

Happy that my comment inspired you with a follow-up. It has really helped me a lot in my problem.
