Towards constructing an unbiased Machine Learning regression model — an example of Airbnb rental prices
As a follow-up to one of my recent articles, I received a comment from Eugenia Anello about the use of label log-transformation in Machine Learning regression tasks. Indeed, I use this method often, so I will try to shed more light on its potential caveats and also discuss a viable alternative. The results of this study are also available in this public Kaggle notebook.
Step 1 — data preprocessing
Here I take the same dataset and use the same preprocessing steps as in the abovementioned article.
Step 2 — the impact of label log-transformation on model residuals
Here, I look at model residuals (predicted minus actual log-transformed Airbnb price in EUR for two people and two nights) for 2 models:
For ease of comparison, I also plot the median binned values obtained with the help of scipy.stats.binned_statistic.
Because of the log-transformation, the model residuals are measured in dex units: a residual of 1 dex corresponds to a factor of 10 between the predicted and actual price.
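The binned medians described above can be computed roughly as follows. This is a minimal sketch on synthetic data, not the code from the notebook: the function name, the number of bins, and the toy "model" that shrinks predictions toward the mean are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import binned_statistic


def binned_median_residuals(log_actual, log_predicted, n_bins=10):
    """Median residual (predicted minus actual, in dex) per bin of actual log10 price."""
    log_actual = np.asarray(log_actual, dtype=float)
    residuals = np.asarray(log_predicted, dtype=float) - log_actual
    medians, edges, _ = binned_statistic(
        log_actual, residuals, statistic="median", bins=n_bins
    )
    centers = 0.5 * (edges[:-1] + edges[1:])  # bin midpoints, handy for plotting
    return centers, medians


# Synthetic illustration (hypothetical data, not the Airbnb dataset)
rng = np.random.default_rng(0)
log_actual = rng.uniform(1.5, 4.0, size=2000)
# A toy model that pulls every prediction toward the mean log-price
log_predicted = 2.75 + 0.6 * (log_actual - 2.75) + rng.normal(0.0, 0.1, size=2000)

centers, medians = binned_median_residuals(log_actual, log_predicted, n_bins=5)
for center, median in zip(centers, medians):
    print(f"log10(price) ~ {center:.2f}: median residual = {median:+.2f} dex")
```

In this toy setup the medians are positive in the cheapest bin and negative in the most expensive one, which is the kind of systematic pattern that a single aggregate metric can hide.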
The median binned values reveal a notable drawback of both models. While the root mean squared error (RMSE) of the log-transformed models is only about 0.145 dex (roughly corresponding to a 40% variation), there are much larger systematic differences between the predicted and actual prices, especially for the highest-priced apartments (their predicted listing price is up to 10**1.7 ≈ 50 times smaller than the actual listing price).
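To make the dex arithmetic above explicit, here is a small sketch that converts between dex and multiplicative factors. The helper names are hypothetical and used only for illustration.

```python
import math


def dex_to_factor(dex):
    """Convert a difference in log10 units (dex) to a multiplicative factor."""
    return 10.0 ** dex


def factor_to_dex(factor):
    """Convert a multiplicative factor to dex."""
    return math.log10(factor)


rmse_factor = dex_to_factor(0.145)  # about 1.40, i.e. roughly a 40% variation
worst_factor = dex_to_factor(1.7)   # about 50

print(f"0.145 dex -> factor {rmse_factor:.2f}")
print(f"1.7 dex   -> factor {worst_factor:.1f}")
```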
Step 3 — reducing the model bias with minority oversampling
In reality, choosing a proper method to mitigate the model bias usually depends on business requirements. Namely, what are the consequences if our model significantly overestimates or underestimates the actual rental price? What are the “accepted” limits? Are there any service-level agreements (SLAs) regarding that?
As an example, assume that our SLA requires the median model price to deviate from the actual price by no more than 1.0 dex (a factor of 10). As we see from the screenshot above, this is violated for high-priced apartments (> 10**4 = 10,000 EUR for two people and two nights). A simple correction would be oversampling of high-priced apartments so that the model pays more attention to such samples, for example, by increasing the number of apartments priced at about 1,000 EUR by a factor of 30.
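One way to sketch this kind of oversampling is to repeat the high-priced rows of the training set. The column name `price`, the function name, the use of a simple "at or above the threshold" rule, and the tiny example table are assumptions for illustration; the notebook may select and duplicate the listings differently.

```python
import numpy as np
import pandas as pd


def oversample_high_priced(df, price_col="price", threshold=1000.0, factor=30, seed=0):
    """Return a shuffled copy of df in which every row priced at or above
    `threshold` appears `factor` times; all other rows appear once."""
    counts = np.where(df[price_col].to_numpy() >= threshold, factor, 1)
    positions = np.repeat(np.arange(len(df)), counts)  # row positions, with repeats
    oversampled = df.iloc[positions]
    return oversampled.sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Hypothetical toy listings (prices in EUR for two people and two nights)
listings = pd.DataFrame(
    {"price": [80.0, 120.0, 250.0, 1500.0], "bedrooms": [1, 1, 2, 4]}
)
train = oversample_high_priced(listings)
print(len(train), int((train["price"] >= 1000.0).sum()))
```

Duplicating rows is probably best applied to the training split only, so that copies of the same listing do not end up in both the training and the evaluation data.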
As expected, the oversampled model performs much better (now within the SLA limits) for high-priced listings and still performs well for both low-priced and medium-priced listings:
Data Scientist with strong background in physics and math, modelling and statistics.
1 year ago
How are you going to deal with overfitting? The problem is that you artificially added data that may not exist. If the real data are too different from your oversampled case, then you have introduced an arbitrary bias, haven't you?
Curious Software Engineer | Technical Writer | A.I Enthusiast | Content Creator
1 year ago
Love this
Data scientist | Technical Writer at Towards Data Science | Geospatial data analysis
1 year ago
Happy that my comment inspired you to write a follow-up. It has really helped me a lot with my problem.