Understanding downstream impact calculation as part of causal inference analysis
Mochamad Kautzar Ichramsyah
Using propensity score matching (PSM) or Mahalanobis distance matching (MDM)
What is the downstream impact (DSI) calculation?
Downstream impact (DSI) calculation refers to estimating the effect that results from a decision we made: a benefit if the effect is positive, or a loss if it is negative. Usually we run experiments such as A/B tests to learn the effect a new treatment will have before we implement it, so we know in advance whether it will bring benefits or losses. However, A/B testing is not always feasible before a product is launched at 100%, for reasons such as:
- Expensive cost: the treatment itself may be costly to run even on a test group.
- No guarantee against contamination between the groups (control vs treatment).
- Disturbing the equilibrium: improving one specific product on our platform might negatively affect another, similar product.
In these conditions we turn to downstream impact calculation, as explained above. There are many ways to perform this analysis; in this post I want to share three methods my team uses to solve problems in our company. We will explore them through an example: investigating the effectiveness of a premium subscription program for a retail company with both online and offline stores.
Online and offline retail stores example
Let’s say you work at the company mentioned above and you want to test the effect of a premium subscription program.
First of all, it’s hard to randomize which individual users are assigned to the treatment group. Second, testing this is very expensive: the subscription program is not free to operate, and it gives users extra benefits such as free shipping on online purchases and additional cashback as a percentage of the amount purchased, which would send the cost of an experiment through the ceiling.
I generated a dummy dataset of the company’s daily transactions to make the explanation easier to follow.
As we can see in Image 3, the difference in daily revenue before and after the subscription program launched is not noticeable. Image 4 uses boxplots to visualize the spread of the revenue amount for each consumer type:
I know it’s confusing. My team was confused by this too, so we went looking for a better method to calculate the effect of the subscription program launch: did it benefit us, or was it a loss?
Limitations of the traditional approaches
- Before vs after comparison: any difference is confounded with seasonality and time trends, so we cannot attribute it to the program alone.
- Subscribers vs non-subscribers: users self-select into the subscription, so the two groups differ systematically even before the program (selection bias).
Twin pairing using propensity score matching (PSM)
After doing some research, we decided to use PSM to build a “twin pair” across the two user groups, non-subscribers and subscribers. The simple explanation is that we need to:
- Choose baseline variables (covariates) observed before the program launch.
- Score each user on those covariates.
- Match each subscriber with the non-subscriber whose score is most similar.
Baseline variables: Covariates
The next question is: how can we decide which users make the best twin pair? This is where propensity score matching (PSM) comes in. The idea is to generate a score for each non-subscriber and subscriber, then set two users as a twin pair if their scores are close enough, while checking that the matched groups are balanced on each variable used to generate the score.
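For reference (this is the standard definition, not something specific to our setup), the propensity score is the probability of being treated given the baseline covariates, usually estimated with a logistic regression:

$$e(x) = \Pr(T = 1 \mid X = x), \qquad \log\frac{e(x)}{1 - e(x)} = \beta_0 + \beta^\top x$$

Two users with similar $e(x)$ are comparable even if their individual covariates differ.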
How to measure distance
Once we have the appropriate covariates, we need to calculate the ‘distance’ between a subscriber and the set of non-subscribers, and pick the closest one to become the twin.
When measuring the distance, one usually reaches for Euclidean distance, which is really intuitive to understand. But it suffers from at least two problems:
- It is sensitive to the scale of each variable: a covariate measured in large units dominates the distance.
- It ignores the correlation between variables, treating every direction in covariate space as equally informative.
It’s really easy to do this using the MatchIt package in R, which provides nonparametric preprocessing for parametric causal inference. The package and how to use it are explained very clearly in its documentation. In short, it uses logistic regression to generate the propensity score, which then decides which users become a twin pair across the two groups, non-subscribers and subscribers.
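As a minimal sketch (the data frame txn, the treatment flag is_subscriber, and covariates such as monthly_spend, tenure_months, and n_orders are hypothetical names, not from the original analysis), PSM with MatchIt looks like this:

```r
# install.packages("MatchIt")
library(MatchIt)

# distance = "glm" fits a logistic regression of the treatment flag on the
# baseline covariates, then does 1:1 nearest-neighbor matching on that score.
psm_out <- matchit(
  is_subscriber ~ monthly_spend + tenure_months + n_orders,
  data     = txn,
  method   = "nearest",
  distance = "glm"
)

summary(psm_out)                   # balance diagnostics, incl. standardized mean differences
matched_df <- match.data(psm_out)  # matched sample with weights and pair (subclass) ids
```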
However, PSM has limitations as a matching approach: it has been criticized many times and has become less recommended for implying causality. The main reason we looked at other methods is that our balancing results were still not good enough. Most of the baseline variables still had a Standardized Mean Difference (SMD) near or beyond +/- 0.1, which indicates that imbalance remained in our data. One possible reason for the poor matching result is that our propensity score model was not good enough.
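For context, the SMD this balance check refers to is typically computed per covariate as the difference in group means scaled by the pooled standard deviation:

$$\mathrm{SMD} = \frac{\bar{x}_{\text{subscribers}} - \bar{x}_{\text{non-subscribers}}}{\sqrt{\left(s^2_{\text{subscribers}} + s^2_{\text{non-subscribers}}\right)/2}}$$

Values within $\pm 0.1$ are the conventional threshold for acceptable balance.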
Twin pairing using the Mahalanobis distance method (MDM)
To overcome the previously mentioned problem, we used another method called Mahalanobis distance matching (MDM). The main difference between MDM and PSM is that we no longer collapse the covariates into a proxy variable (the propensity score); instead, we use all the baseline variables directly and calculate the distance between each subscriber and the non-subscribers. The distance calculated is not a regular Cartesian distance but a Mahalanobis distance.
Instead of Euclidean distance, we can use Mahalanobis distance, which has the following advantages:
- It standardizes each variable, so the result does not depend on the units of measurement.
- It accounts for the covariance between variables, so correlated covariates are not double-counted.
In short, the Mahalanobis distance generalizes the z-score to two or more dimensions. This is particularly useful when there is strong covariance between the variables; high covariance means the variables are strongly correlated. When such correlation exists, the same Cartesian distance between two pairs of points can correspond to different Mahalanobis distances: the distance between two points that lie along the direction of the correlation is smaller, in Mahalanobis terms, than the distance between two points that lie perpendicular to that direction. This yields a fairer distance between points, which in our implementation results in a much better twin-pair estimation. A subscriber’s non-subscriber twin is the one with the smallest Mahalanobis distance to that subscriber.
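Formally, the Mahalanobis distance between points $x$ and $y$ with covariance matrix $S$ is $d(x, y) = \sqrt{(x - y)^\top S^{-1} (x - y)}$. A minimal sketch in base R with made-up covariates (the variable names are illustrative, not from the article):

```r
# Two baseline covariates with strong positive correlation
set.seed(42)
spend  <- rnorm(500, mean = 100, sd = 20)
orders <- 0.1 * spend + rnorm(500, sd = 1)   # correlated with spend
X <- cbind(spend, orders)
S <- cov(X)

# mahalanobis() returns the SQUARED distance of each row of X from `center`,
# under covariance matrix `cov`.
subscriber <- c(spend = 120, orders = 12)
d2 <- mahalanobis(X, center = subscriber, cov = S)

# The non-subscriber "twin" is the row with the smallest distance
twin_idx <- which.min(d2)
X[twin_idx, ]
```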
Using MDM, we were able to generate a more accurate matching, indicated by a minimal SMD for every baseline variable. As can be seen in the love plot below, all of the baseline variables have an SMD close to 0.
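In MatchIt, switching from PSM to MDM is a one-argument change, and the love plot can be drawn with the cobalt package. Again a sketch, reusing the hypothetical data frame and covariates from the PSM example above:

```r
library(MatchIt)
library(cobalt)   # for love.plot()

# Same call as before, but match on Mahalanobis distance over the raw
# covariates instead of a logistic-regression propensity score.
mdm_out <- matchit(
  is_subscriber ~ monthly_spend + tenure_months + n_orders,
  data     = txn,
  method   = "nearest",
  distance = "mahalanobis"
)

# Love plot: SMD per covariate before vs after matching,
# with the conventional +/- 0.1 balance thresholds marked.
love.plot(mdm_out, thresholds = c(m = 0.1), abs = FALSE)
```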
Evaluate the matching results
After finding the twin pairs, we can evaluate how good our matching process is in several ways:
- Compare the density plot of each covariate between control and treatment, before and after matching.
- Check the SMD of each covariate after matching.
A successful matching process is indicated by similar density plots between control and treatment on each covariate, and by post-matching SMDs ranging from -0.1 to 0.1.
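Both checks can be run directly on the MatchIt output; a sketch using the mdm_out object from above (cobalt’s bal.tab is one common way to get the SMD table):

```r
library(cobalt)   # bal.tab() for balance tables

# Density plots of each covariate, treated vs control, in the matched sample
plot(mdm_out, type = "density", interactive = FALSE)

# Balance table of standardized mean differences, with the +/- 0.1 threshold
bal.tab(mdm_out, thresholds = c(m = 0.1))
```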
Calculating Impact
Finally, with a good matching result in hand, we can calculate the impact of the subscription program (treatment) on the outcome, without interference from the baseline variables.
To estimate the impact, we could run one of the following:
- A simple difference in mean outcomes between the matched subscribers and their twins.
- A paired test (e.g. a paired t-test) on the outcome within each twin pair.
- A regression of the outcome on the treatment indicator in the matched sample, which also lets us adjust for any residual covariate imbalance.
A sketch of the regression option follows.
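This sketch uses the matched data from the MDM example above with a hypothetical outcome column daily_revenue, following the approach recommended in the MatchIt documentation: an outcome regression with standard errors clustered on the matched pair, via the sandwich and lmtest packages:

```r
library(lmtest)    # coeftest()
library(sandwich)  # vcovCL(): cluster-robust covariance

matched_df <- match.data(mdm_out)

# Outcome regressed on treatment in the matched sample; `weights` comes
# from match.data() and is all 1s for 1:1 matching without replacement.
fit <- lm(daily_revenue ~ is_subscriber, data = matched_df,
          weights = weights)

# Standard errors clustered on the matched pair (subclass); the coefficient
# on is_subscriber estimates the treatment effect in the matched sample.
coeftest(fit, vcov. = vcovCL, cluster = ~subclass)
```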
Additional sharing: the synthetic control method (SCM), another part of causal inference analysis
SCM is a statistical method used to estimate the causal effect of a binary treatment on observational panel (longitudinal) data. It creates an artificial control group by taking a weighted average of untreated units in such a way that it reproduces the characteristics of the treated unit before the intervention (treatment).
The synthetic control acts as the counterfactual for a treated unit, and the estimated treatment effect is the difference between the observed outcome in the post-treatment period and the synthetic control’s outcome. SCM allows us to do causal inference when we have as few as one treated unit and many control units observed over time; the untreated units combined form the synthetic (control) unit.
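A minimal sketch of the core idea (not the full method, which also matches on covariates; dedicated packages such as Synth implement it properly): find non-negative weights summing to 1 over the control units that best reproduce the treated unit’s pre-treatment outcomes, here via a softmax reparameterization and optim() on toy data:

```r
# y1_pre: pre-treatment outcomes of the treated unit (length T0)
# Y0_pre: T0 x J matrix of pre-treatment outcomes of J control units
set.seed(1)
T0 <- 30; J <- 8
Y0_pre <- matrix(rnorm(T0 * J, mean = 100, sd = 10), nrow = T0)
y1_pre <- Y0_pre %*% rep(1 / J, J) + rnorm(T0, sd = 2)  # toy treated unit

# Softmax keeps the weights on the simplex: w_j >= 0, sum(w) = 1
softmax <- function(z) exp(z) / sum(exp(z))
loss <- function(z) {
  w <- softmax(z)
  sum((y1_pre - Y0_pre %*% w)^2)   # pre-treatment fit
}

opt <- optim(rep(0, J), loss, method = "BFGS")
w <- softmax(opt$par)
round(w, 3)   # weights defining the synthetic control

# Post-treatment, the effect estimate is the observed treated outcome
# minus the synthetic outcome: y1_post - Y0_post %*% w
```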
For the best explanation of this method, read https://towardsdatascience.com/understanding-synthetic-control-methods-dd9a291885a1. I learned a lot from it, and I don’t think there is a better explanation of SCM.
Conclusion
In this article, we have explored better approaches to calculating downstream impact:
- Twin pairing using propensity score matching (PSM)
- Twin pairing using Mahalanobis distance matching (MDM)
- The synthetic control method (SCM)
The advantage of these methods for calculating the effect of changes in your product is that they carefully decide which users should be compared, setting them as twin pairs before calculating the effect. And yes, of course, they can be your primary option when A/B testing is not feasible for measuring the effect of changes you made in your product.
Thank you for reading!
Also, I want to thank Abdul Rachim Winata, Ahmad Yusuf Albadri, Philip Thomas, Rajeev NCSTR, Gaurav Khanna, and many others who helped my team learn and use these methods to solve a lot of problems in our company, and for their help, review, and feedback on this article.
I am still learning to write; mistakes are unavoidable even when I try my best. If you find any problems or mistakes, please let me know!
I write more articles about data analytics on my Medium page; feel free to read and learn there. This post is a repost from here.