Causal Inference Application for Effective Root Cause(s) Analytics
Vinamra Vikram Vishen
Data Leader | Bandhan AMC | ZEE | American Express | FMS Delhi MBA
<Article is laid out in 4 Parts, ~1.5K words, and is a 20-minute read. Suitable for tenured professionals working with digital-native companies>

Part I: Context

As part of digital-native experiences, organizations set up key business goals, or KPIs (Key Performance Indicators). Often these goals are an effect of multiple business decisions (changes in marketing, offers, product quality), external factors (changes in competitor approach), and product experiences (changes in task success and responsiveness, funnel effectiveness).

This is an outright complex problem to solve, considering the factors carry different weights and sensitivities to the outcome business KPIs. Additionally, the factors have built-in interactions that are material to the change in the outcome KPI.
Part II: Challenges with ‘traditional’ RCA (Root Cause Analysis) approaches

In traditional RCA approaches (one-at-a-time factor attribution, hypothesis testing, 5 Whys, etc.), while the cause-effect relationship is respected, we may often end up assigning more importance (read: weightage) to a factor than is true.

[Table: Key Differences between RCA and Causal Inference]

This is especially true for factors that nose-dived more, or were newly introduced, in the period coinciding with the business KPI drop. Acting on them is a grave mistake that may fix the KPI in the short term but often hurts long-term KPIs or strategic outcomes for the enterprise.
Some of the consequences can result in decision churn, in terms of actions like the following:

- We may inadvertently lose out by pulling back a product feature that was launched exactly when the business KPI changed course.
- We may turn off a business campaign or marketing push while a competitor's strategy or launch was the core reason for the drop.

Experienced analytics teams often take holdouts, or simulate holdouts, to mitigate the risk of such wrong decision making. In practice, however, such approaches are still tough to implement in complex systems (e.g., a true holdout is rarely available for all causal factors).

At the end of the day, this is still removed from the reality that there are often multiple causes that interact with each other and move the outcome to varying degrees.
Part III: Causal-Based RCA Explained

Let us understand with a real-world example:

A cornerstone metric for any platform will be session success engagement (for OTT (Over the Top), e.g., a session engaged with at least one video watched).

This is a core experience KPI, but it feeds into the business KPI of time spent, which drives revenue for the platform. Let us say we see a sudden drop in this KPI; how do we approach this as a problem?
Step Zero: Establish the ‘trueness’ of the KPI drop (session success % drop)

The key step before proceeding is to establish that the drop in session engagement is real. We first build a basic understanding of this KPI in terms of its ‘learnt volatility’ and its short-term and long-term moving trends. We then build a framework that classifies deviations into red alert, yellow alert, and no alert, and we review this framework over time to make it more robust. This is a key step to filter out gradual trend declines and similar movements that are potentially not a cause for concern.
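For illustration, here is a minimal sketch in Python of such a red/yellow/no-alert classification built on learnt volatility. It assumes a daily Session Success % series; the 28-day window and z-score thresholds are illustrative choices, not prescriptions.

```python
import numpy as np
import pandas as pd

def classify_kpi_alerts(kpi: pd.Series, window: int = 28,
                        yellow_z: float = 2.0, red_z: float = 3.0) -> pd.DataFrame:
    # Learnt volatility: trailing mean and std, excluding the current day.
    rolling_mean = kpi.rolling(window, min_periods=window).mean().shift(1)
    rolling_std = kpi.rolling(window, min_periods=window).std().shift(1)
    z = (kpi - rolling_mean) / rolling_std

    # Only downside deviations trigger alerts in this sketch; the first `window`
    # days have no baseline yet and default to 'no alert'.
    alert = np.select([z <= -red_z, z <= -yellow_z], ["red", "yellow"], default="no alert")
    return pd.DataFrame({"session_success_pct": kpi, "z_score": z, "alert": alert})

# Usage: alerts = classify_kpi_alerts(daily_kpi); alerts[alerts["alert"] != "no alert"]
```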
It is notable that while the causal model explains the KPI as-is, whether it is dropping or not, establishing the ‘trueness’ of a KPI drop is still a vital step before further analysis.

It is also helpful to set context on the number of times this aberration has occurred for the KPI in the recent past.
Step 1: Feature Selection

Vital to explaining this will be the discovery experiences (and player experiences, in the case of the OTT example). However, let us also explore more up-funnel factors to make the RCA concrete.

Factors that explain session exits (a broad but not exhaustive list; a minimal modelling-frame sketch follows the list):

1) Traffic: traffic mix (new vs repeat, paid vs organic, paid contribution by channel)
2) Competitor factors: major competitor actions/product launches etc. (quantify new installs and user base in a relative sense; a lot of panel data sources are available for this)
3) External events: major external events that may take up the user's time share (e.g., election day)
4) New product launches: incumbent new product launches and their relative quality
5) Offers and promotions / price changes
6) Personalization: landing page experiences, any change in user personalization
7) User journey: product discovery experiences (e.g., navigation, search task success, any new feature release that interacts with discovery)
8) User journey: product details experiences (e.g., for OTT this may be the content details page)
9) Tech task success: e.g., app task success, video task success in the case of OTT
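As a minimal sketch of assembling these factors into a daily modelling frame (Python with pandas; every column name here is a hypothetical placeholder, not the platform's actual instrumentation):

```python
import pandas as pd

# Hypothetical column names, mirroring the factor list above.
FEATURE_COLUMNS = [
    "paid_traffic_share_pct",         # 1) traffic mix
    "competitor_install_share_pct",   # 2) competitor factors (panel data)
    "major_external_event_flag",      # 3) external events
    "new_launch_quality_index",       # 4) incumbent product launches
    "offer_depth_pct",                # 5) offers, promotions, price changes
    "personalized_landing_share_pct", # 6) personalization
    "search_task_success_pct",        # 7) discovery journey
    "details_page_conversion_pct",    # 8) product details experience
    "video_start_success_pct",        # 9) tech task success
]

def build_model_frame(daily: pd.DataFrame):
    """One row per day: explanatory factors X and the outcome KPI y."""
    X = daily[FEATURE_COLUMNS].astype(float)
    y = daily["session_success_pct"].astype(float)
    return X, y
```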
Step 2: Model Selection, Performance & Interpretation
Model Selection & Performance

- We propose the Ridge Regression technique, as it helped us not only solve issues related to multicollinearity (unlike OLS or PLS) but also avoid reducing the number of feature attributes (unlike Lasso).

- Notably, our prediction of the core KPI, Session Success %, is not as good as that of other, more innovative models. However, our sole objective is not to predict the core KPI with the highest accuracy. It is to have a high degree of confidence in the coefficients of the selected explanatory features, to establish the weights of the equation amidst correlated features, and to avoid feature reduction. In our view, Ridge addresses all of these chief concerns, balancing predictability with keeping the factors explainable.
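A minimal sketch of this setup, assuming scikit-learn and the X, y frame from the feature-selection step (the alpha grid and the number of time-series splits are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_ridge_explainer(X, y):
    # Standardizing the features makes the Ridge coefficients comparable in magnitude.
    model = make_pipeline(
        StandardScaler(),
        RidgeCV(alphas=np.logspace(-3, 3, 25), cv=TimeSeriesSplit(n_splits=5)),
    )
    model.fit(X, y)
    ridge = model.named_steps["ridgecv"]
    # Sign gives directionality, magnitude gives relative weight; no feature is dropped.
    coefficients = dict(zip(X.columns, ridge.coef_))
    return model, coefficients
```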
Interpretation

Let us assume a hypothetical equation, with coefficients and directionality, of, say, only three attributes or features:

- Feature 1: our app's user share vis-à-vis the industry's, in % (expected: the higher this metric, the better the end KPI)
- Feature 2: new user vs repeat user share, in % (expected: the higher this metric, the lower the end KPI)
- Feature 3: discovery journey, share of ‘more explore’ behaviour in % (expected: the higher this metric, the better the end KPI)
Let us say the equation comes out as below:

Session Success % = [0.1 × Feature_1] + [(-0.2) × Feature_2] + [0.3 × Feature_3] + Error

This clearly suggests that a unit change in Feature 3 has the most powerful positive impact on the end outcome, and that Feature 2 is the negatively contributing factor.
We will also build alerting control charts for each of Features 1, 2, and 3, in addition to the outcome KPI, Session Success %. This will help us conclude immediately which explanatory variable broke and led the outcome KPI to break down. It will also help us see immediately which other feature is trending up and supporting the outcome KPI.
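To make that read-out concrete, here is a minimal sketch of attributing a KPI change to the features as coefficient times feature change. All numbers are illustrative, mirroring the hypothetical equation above, and the sketch assumes the coefficients are on comparable (standardized) scales.

```python
# Illustrative coefficients from the hypothetical equation above.
coefficients = {"feature_1": 0.1, "feature_2": -0.2, "feature_3": 0.3}

# Illustrative feature levels in a healthy baseline period vs the drop period.
baseline_period = {"feature_1": 1.20, "feature_2": 0.10, "feature_3": 0.40}
drop_period = {"feature_1": 1.10, "feature_2": 0.90, "feature_3": 0.30}

# Contribution of each feature to the KPI change = coefficient * change in the feature.
contributions = {
    name: coefficients[name] * (drop_period[name] - baseline_period[name])
    for name in coefficients
}

# The most negative contribution flags the explanatory variable that 'broke'.
for name, value in sorted(contributions.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:+.3f} contribution to the Session Success % change")
```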
Step 3: Decision Framework

- We track significant deviations not only in the outcome KPI, ‘session success %’, but also in each of the individual explanatory attributes (as listed in feature selection), so that we can be proactive in explaining quickly what broke.

- We can then overlay the model interpretation to quickly check the top attributes that ‘negatively’ affect the outcome: are they up significantly?

- Which of the attributes that ‘positively’ affect the outcome are up, and how are they supporting the outcome or mitigating the drop?

Combining the above three pointers, we are in a better position to shape the resolution, with well-rounded information on the KPI overall.
Here is how the output may look, with each feature carrying appropriate weight and directionality to help guide the decision framework.
Part IV: Other Applications of the Causal Inference Framework

Performance Optimization
- The framework helps identify factors that positively or negatively affect video viewership. This knowledge allows for effective resource allocation and strategy optimization to improve session engagement and lower session exits, per the example above.
Experimentation at Scale

- Predictor impact can help optimize other metrics as well. By experimenting with predictor values, we can fine-tune predictors for desired outcomes without complex development or a clean holdout. This is possible because we already have a good level of confidence in the simulated equation of the business KPI and how it moves with the real world.
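A minimal sketch of such a what-if read-out, assuming the fitted pipeline 'model' and feature frame 'X' from the earlier steps (the feature name and target value are hypothetical):

```python
import pandas as pd

def simulate_what_if(model, baseline_row: pd.DataFrame, feature: str, new_value: float) -> float:
    """Return the implied change in the outcome KPI if `feature` moved to `new_value`."""
    scenario = baseline_row.copy()
    scenario[feature] = new_value
    # Difference between the predicted KPI under the scenario and under the baseline.
    return float(model.predict(scenario)[0]) - float(model.predict(baseline_row)[0])

# Usage (hypothetical feature and value):
# uplift = simulate_what_if(model, X.tail(1), "search_task_success_pct", 92.0)
```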
Informed Decision-making

- Analysing predictor influence informs decisions on ad campaigns, user experience, content selection, and targeting. This prioritizes efforts and investments in the areas with the greatest impact on the essential engagement metrics.
Causal Inference

- Causal inference models reveal direct relationships between the predictors and the outcome ratio, beyond correlation. Identifying causality enables confident predictor changes with predictable effects on the KPI metrics.
(Contributing Authors)
Vinamra Vikram Vishen, SVP and Head of CX Analytics at ZEE
Balaji Selvaraj, Senior CX Analyst at ZEE