Inferring Causality in Financial Data (1/3)


Causality is the key to unlocking maximum business value in any process. Identifying causal association networks among multiple variables and quantifying causal strength are key challenges in the analysis of complex dynamical systems. Data generating processes range from purely linear to highly synergistic, which complicates the nature of the relationships among the various data series.

This series of articles focuses on this problem and surveys the contemporary methods that can be used for detecting causality. The current article gives a basic description of the problem statement; the second article describes the various methodologies available, and the last one builds a novel approach for identifying causality.

A data scientist can use a variety of methods to estimate the causal effect of a factor. The “levels of evidence ladder” is a useful mental model for introducing the ideas of causal inference. The ladder indicates the level of proof each method provides: the higher a method sits on the ladder, the easier it is to compute estimates that constitute evidence of a strong causal relationship.

Methods at the top of the ladder typically (but not always) require more focus on the experimentation setup. At the other end, methods at the bottom of the ladder rely on observational data and require more focus on robustness checks (more on this later).


Ideally, causal inference over random variables representing different events can be arrived at by running controlled experiments. The most common example is two variables, each representing one alternative of an A/B test, each with a set of samples/observations associated with it. Since there is no free lunch, these methods come with their own costs and limitations: such experiments can be run, but they are often prohibitively expensive. Further, for phenomena such as natural calamities or financial markets, RCTs (Randomized Controlled Trials) cannot be executed (we cannot run an experiment comparing earthquake vs. non-earthquake regions). Alas, we cannot use them on a time series dataset either, since history cannot be altered.
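As an illustration of this top rung, here is a minimal sketch (on made-up data, not from any real experiment) of how the two arms of a hypothetical A/B test could be compared with a simple two-sample t-test:

```python
# Minimal A/B test comparison on synthetic data (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=1000)    # baseline variant
treatment = rng.normal(loc=10.3, scale=2.0, size=1000)  # variant with a small lift

# Welch's t-test: does the observed difference in means look like noise?
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"estimated lift: {treatment.mean() - control.mean():.3f}, p-value: {p_value:.4f}")
```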

Quasi-experiments

Sometimes it’s just not possible to set up an experiment, and A/B tests will not work in every situation. A quasi-experiment (rung two of the ladder) is an experiment where your treatment and control groups are divided by a natural process that isn’t truly random, but is considered close enough to compute estimates.

Quasi-experiments frequently occur in product companies, for example, when a feature rollout happens at different dates in different countries, or if eligibility for a new feature is dependent on the behavior of other features (like in the case of a deprecation). In order to compute causal estimates when the control group is divided using a non-random criterion, you’ll use different methods that correspond to different assumptions on how “close” you are to the random situation.


One of the commonly used methods in this case is linear regression with fixed effects. The assumption here is that we have collected data on all factors that divide individuals between the treatment and control groups. If that is true, then a simple linear regression on the metric of interest, controlling for these factors, gives a good estimate of the causal effect of being in the treatment group.
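A minimal sketch of this idea on synthetic data (the column names `metric`, `treated`, and `country` are hypothetical): country drives both rollout eligibility and the outcome, and adding country fixed effects lets the regression recover the treatment effect.

```python
# Linear regression with fixed effects on synthetic data (illustrative only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
country = rng.choice(["US", "DE", "IN"], size=n)
# Treatment probability differs by country -> non-random assignment.
roll_prob = pd.Series(country).map({"US": 0.7, "DE": 0.4, "IN": 0.2}).to_numpy()
treated = (rng.random(n) < roll_prob).astype(int)
country_effect = pd.Series(country).map({"US": 1.0, "DE": 0.5, "IN": -0.5}).to_numpy()
metric = 2.0 * treated + country_effect + rng.normal(size=n)   # true effect = 2.0
df = pd.DataFrame({"metric": metric, "treated": treated, "country": country})

# Regress the metric on treatment, controlling for country fixed effects.
model = smf.ols("metric ~ treated + C(country)", data=df).fit()
print(model.params["treated"])   # should recover an estimate close to 2.0
```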

Another common approach is difference-in-differences, which relies on the parallel trends assumption: in the absence of treatment, the difference between the treatment and control groups stays constant over time. Plotting both groups’ metrics over time can help check the validity of this assumption.
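For illustration, here is a minimal difference-in-differences sketch on synthetic data (not taken from any real rollout): the causal effect is read off the coefficient on the treated × post interaction term, and the estimate is only valid under the parallel trends assumption above.

```python
# Difference-in-differences on synthetic data (illustrative only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 4000
treated = rng.integers(0, 2, n)   # 1 = group that eventually gets the feature
post = rng.integers(0, 2, n)      # 1 = observation after the rollout date
effect = 1.5                      # true (synthetic) treatment effect
metric = 5 + 2 * treated + 1 * post + effect * treated * post + rng.normal(size=n)
df = pd.DataFrame({"metric": metric, "treated": treated, "post": post})

did = smf.ols("metric ~ treated * post", data=df).fit()
print(did.params["treated:post"])   # DiD estimate of the effect, close to 1.5
```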

Counterfactuals

There will be cases when you’ll want to try to detect causal factors from data that only consists of observations of the treatment. A classic example in tech is estimating the effect of a new feature that was released to the entire user base at once: no A/B test was done, and there is no one who could serve as the control group. In this case, you can try making a counterfactual estimation.


The idea behind counterfactual estimation is to create a model that allows you to compute a counterfactual control group. In other words, you estimate what would have happened had this feature not existed. Computing such an estimate is not always simple.
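One simple (and purely hypothetical) way to build such a counterfactual is to fit a model on pre-launch data using a related series that was not affected by the launch, predict the post-launch period, and read the effect off the gap between actuals and the prediction. A rough sketch on synthetic data:

```python
# Counterfactual estimation on synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
t = 200                                         # total days; launch at day 150
unaffected = np.cumsum(rng.normal(size=t))      # related series untouched by the launch
metric = 3.0 + 0.8 * unaffected + rng.normal(scale=0.5, size=t)
metric[150:] += 2.0                             # synthetic lift caused by the launch

# Fit on the pre-launch period, then predict the post-launch counterfactual.
model = LinearRegression().fit(unaffected[:150, None], metric[:150])
counterfactual = model.predict(unaffected[150:, None])
print("estimated lift:", (metric[150:] - counterfactual).mean())  # close to 2.0
```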


Hence, ascertaining causality is a critical task with many facets. The goal in time series causal discovery is to uncover the complex dynamical system generating the process and to reliably estimate the causal links, including their time lags. This blog series walks through the various techniques used for time series causal discovery.

Importance of Methodology and Key Questions to Ask

In a system comprising dozens to hundreds of variables (e.g., different regional climate indices or stock market data), correlations will arise not only because of direct causal effects but also because of autocorrelation effects within each time series, indirect links, or common drivers.
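A quick illustration of why this matters (synthetic data, purely for intuition): two completely independent random walks will routinely show a large and “significant” correlation, which is why raw correlation is a poor basis for causal claims in time series.

```python
# Spurious correlation between two independent random walks (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.cumsum(rng.normal(size=500))   # independent random walk 1
y = np.cumsum(rng.normal(size=500))   # independent random walk 2

r, p = stats.pearsonr(x, y)
print(f"correlation between two unrelated series: r = {r:.2f}, p = {p:.2g}")
```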

A good causal discovery method detects as many true causal relationships as possible (high detection power) while controlling the number of false positives (incorrect link detections). Data-driven causal inference in such systems is challenging, since datasets are often high dimensional and nonlinear with limited sample sizes.

Given a finite time series sample, every causal discovery method has to balance the trade-off between too many false positives (incorrect link detections) and too few true positives (correct link detections). A causality method ideally controls false positives at a predefined significance level (e.g., 5%) and maximizes detection power.

The power of a method to detect a causal link depends on the available sample size, the significance level, the dimensionality of the problem (e.g., the number of coefficients in an autoregressive model), and the effect size, which, here, is the magnitude of the effect as measured by the test statistic (e.g., the partial correlation coefficient).

To strengthen the credibility of causal interpretations, we need to include more variables that might explain a spurious relationship, but this leads to lower power to detect true causal links due to higher dimensionality and possibly lower effect size. Low detection power also implies that causal effect estimates become less reliable. Ideally, we want to condition only on the few relevant variables that actually explain a relationship.
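As an illustrative sketch of what conditioning does (synthetic data, using a simple residual-based partial correlation rather than any specific published method): when a common driver Z explains the X–Y relationship, the raw correlation is large but the partial correlation given Z collapses toward zero.

```python
# Raw vs. partial correlation under a common driver (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=n)                    # common driver
x = 0.8 * z + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)          # no direct X -> Y link

def residuals(a, b):
    # Residual of a after removing the linear effect of b.
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)

print("raw correlation:    ", stats.pearsonr(x, y)[0])
print("partial correlation:", stats.pearsonr(residuals(x, z), residuals(y, z))[0])
```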

In light of the above considerations, the typical methods applicable to cross-sectional data are of little use for financial data. I will end this article by posing a few questions which the next article will address:

  • Is the causality linear or non-linear? Is there any possibility of a feedback effect in the data?
  • Is the set of variables in the data sufficient to capture the entire data generating process?
  • Are there any latent variables? Are there any confounding variables in the data structure?
  • What is the ideal time frame over which the observed causality holds? What if the causality varies across different time periods?
  • How reliable is the method when it comes to the false positive rate?

