Towards constructing an unbiased Machine Learning regression model — an example of Airbnb rental prices

Dmytro Iakubovskyi

Senior Data Scientist at Velux

Published: June 24, 2023

As a follow-up to one of my recent articles, I received a comment from Eugenia Anello about the use of label log-transformation in Machine Learning regression tasks. I use this method often, so here I try to shed more light on its potential caveats and discuss a viable alternative. The results of this study are also available in this public Kaggle notebook.

Step 1 — data preprocessing

Here I take the same dataset and use the same preprocessing steps as in the abovementioned article.

Step 2 — the impact of label log-transformation on model residuals

Here, I look at model residuals (predicted minus actual log-transformed Airbnb price in EUR for two people and two nights) for two models:

the model with log-transformed (x -> np.log10(x)) price (dubbed log-transform);
the model without log-transformed price (dubbed no log-transform).
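To make the residual definition concrete, here is a minimal sketch of how both models can be compared on the same log scale. The helper name `residuals_dex` and all prices are hypothetical and chosen only for illustration; they are not taken from the notebook.

```python
import numpy as np

def residuals_dex(predicted_price, actual_price):
    """Residual on the log10 scale: log10(predicted) - log10(actual)."""
    predicted_price = np.asarray(predicted_price, dtype=float)
    actual_price = np.asarray(actual_price, dtype=float)
    return np.log10(predicted_price) - np.log10(actual_price)

# Hypothetical actual prices in EUR (illustration only).
actual = np.array([100.0, 1000.0, 10000.0])

# A model trained on log10(price) outputs log-prices; 10**x converts them back.
pred_log_model = 10 ** np.array([2.1, 2.9, 3.0])

# A model trained on raw prices outputs prices directly.
pred_raw_model = np.array([80.0, 1000.0, 5000.0])

res_log = residuals_dex(pred_log_model, actual)
res_raw = residuals_dex(pred_raw_model, actual)
```

Note that a model trained on raw prices can in principle predict non-positive values, for which the logarithm is undefined, so such predictions would need separate handling.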

For ease of comparison, I also plot the median binned values obtained with the help of scipy.stats.binned_statistic.
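A minimal sketch of the median binning is shown here on synthetic data. The residual trend, the noise level, and the number of bins are assumptions made for this example, not values from the notebook.

```python
import numpy as np
from scipy.stats import binned_statistic

rng = np.random.default_rng(0)

# Synthetic log10(actual price) values (assumption, illustration only).
log_price = rng.uniform(1.5, 4.5, size=2000)

# Synthetic residuals in dex with a built-in trend: overprediction at low
# prices and underprediction at high prices, plus Gaussian noise.
residual = -0.3 * (log_price - 3.0) + rng.normal(0.0, 0.15, size=2000)

# Median residual in each of 10 equal-width bins of log10(price).
median, bin_edges, _ = binned_statistic(
    log_price, residual, statistic="median", bins=10
)
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
```

Plotting `median` against `bin_centers` on top of the scatter of individual residuals makes a systematic trend visible even when the overall RMSE looks small.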

Because of the log-transformation, the model residuals are measured in dex units (1 dex corresponds to a factor of 10).

[Figure] Source: author, https://www.kaggle.com/code/dima806/log-transform-example-airbnb

The median binned values reveal a notable drawback of both models. While the root mean squared error (RMSE) of the log-transformed model is only about 0.145 dex (roughly corresponding to 40% variation), there are much larger systematic differences between the predicted and actual prices, especially for the highest-priced apartments, whose predicted listing price is up to 10**1.7 ≈ 50 times smaller than the actual listing price.
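The conversion from dex to a multiplicative factor used above can be checked with a short sketch. The helper name `dex_to_factor` is hypothetical.

```python
def dex_to_factor(dex):
    """Convert a difference in dex (log10 units) to a multiplicative factor."""
    return 10.0 ** dex

# RMSE of 0.145 dex: a factor of about 1.4, i.e. roughly 40% variation.
rmse_factor = dex_to_factor(0.145)

# Worst binned deviation of 1.7 dex: a factor of about 50.
worst_factor = dex_to_factor(1.7)
```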

Step 3 — reducing the model bias with minority oversampling

In practice, choosing a proper method to mitigate model bias usually depends on business requirements. Namely, what are the consequences if our model significantly overestimates or underestimates the actual rental price? What are the “accepted” limits? Are there any service-level agreements (SLAs) regarding that?

As an example, assume that our SLA requires the median model price to deviate from the actual price by no more than 1.0 dex (a factor of 10). As the figure above shows, this is violated for high-priced apartments (> 10**4 = 10,000 EUR for two people and two nights). A simple correction is to oversample high-priced apartments so that the model pays more attention to such samples, for example, by increasing the number of apartments priced at about 1,000 EUR by 30 times.
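One possible way to implement such oversampling is to repeat the rows whose label falls into a chosen price band. The sketch below is an assumption about the mechanics, not the notebook's code: the helper name `oversample_band`, the band limits, and the toy data are all hypothetical.

```python
import numpy as np

def oversample_band(X, y, low, high, factor):
    """Repeat each row whose label lies in [low, high) `factor` times.

    Rows outside the band are kept exactly once.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    in_band = (y >= low) & (y < high)
    repeats = np.where(in_band, factor, 1)
    return np.repeat(X, repeats, axis=0), np.repeat(y, repeats)

# Toy data: 4 listings with 2 features each, prices in EUR (illustration only).
X = np.arange(8).reshape(4, 2)
y = np.array([50.0, 120.0, 900.0, 1100.0])

# Hypothetical band around 1,000 EUR, repeated 30 times.
X_over, y_over = oversample_band(X, y, low=800.0, high=1250.0, factor=30)
```

A common precaution is to oversample only the training split, so that the residuals are still evaluated on data with the original price distribution.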

As expected, the oversampled model performs much better (now within the SLA limits) for high-priced listings and still performs well for both low-priced and medium-priced listings.
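Checking the example SLA can be sketched as a test on the median binned residuals. The helper name `within_sla` and the residual values are hypothetical; only the 1.0 dex limit comes from the example SLA above.

```python
import numpy as np

def within_sla(binned_median_dex, limit_dex=1.0):
    """Return True if every non-empty bin has |median residual| <= limit_dex."""
    medians = np.asarray(binned_median_dex, dtype=float)
    medians = medians[~np.isnan(medians)]  # empty bins show up as NaN
    return bool(np.all(np.abs(medians) <= limit_dex))

# Hypothetical median residuals per price bin, in dex (illustration only).
before = np.array([0.2, 0.05, -0.1, -0.6, -1.7])
after = np.array([0.25, 0.1, -0.05, -0.3, -0.8])

ok_before = within_sla(before)
ok_after = within_sla(after)
```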

I hope these results are useful for you. If you have questions or comments, do not hesitate to write in the comments below or reach me directly through LinkedIn or Twitter.

Viktor Begun

Data Scientist with strong background in physics and math, modelling and statistics.

1 year ago

How are you going to deal with overfit? The problem is that you artificially added data, which may not exist. If the real data are too different from your oversampled case, then you introduced an arbitrary bias, isn't it?

Isaac Yakubu

Curious Software Engineer | Technical Writer | A.I Enthusiast | Content Creator

1 year ago

Love this

Eugenia Anello

Data scientist | Technical Writer at Towards Data Science | Geospatial data analysis

1 year ago

Happy that my comment inspired you with a follow-up. It has really helped me a lot in my problem.
