The Problem of Data Snooping in Financial Analysis and Machine Learning
Tamer Khraisha (Ph.D.)
Software Engineer | O'Reilly Author | Financial Data Management and Technology
In non-experimental fields such as economics and finance, statistics is the central tool for inferring relationships from data. However, several studies and scholars have pointed out that common statistical practices can be misleading, particularly in financial analysis.
In this article, I will shed light on the problem of data snooping in finance, explain why it happens, and provide a list of references for further reading.
Introductory example
To illustrate the problem of data snooping, I will borrow an interesting example from (Lo, 1994). The example is based on a mathematical discovery by the French mathematician Pierre de Fermat. Fermat discovered that prime numbers p (numbers greater than 1 that are not a product of two smaller natural numbers, e.g. 3, 5, 7) have the following property: when 2^(p-1) is divided by any odd prime p, there is always a remainder of 1. For example, when 2^(3-1) = 2^2 = 4 is divided by 3, the quotient is 1 with a remainder of 1. The converse, however, is not true: a number q for which 2^(q-1) divided by q leaves a remainder of 1 is not necessarily prime. Composite numbers that pass this kind of test are called "Carmichael numbers", and among the first 10,000 integers there are only seven of them: 561, 1,105, 1,729, 2,465, 2,821, 6,601, and 8,911. Lo illustrates that a stock-selection strategy consisting of picking those stocks whose CUSIP contains one of these seven Carmichael numbers achieved remarkable performance over the period 1926-1991. For example, the figure below illustrates the extraordinary performance of the stock with CUSIP 03110510.
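Fermat's test and the seven Carmichael numbers above are easy to verify directly. The following sketch (plain Python, no dependencies) confirms that each of the seven is composite yet still passes the base-2 version of Fermat's test:

```python
def is_prime(n):
    """Trial division: True if n is prime."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

carmichael = [561, 1105, 1729, 2465, 2821, 6601, 8911]

for n in carmichael:
    # pow(2, n - 1, n) computes the remainder of 2^(n-1) divided by n
    passes_fermat = pow(2, n - 1, n) == 1
    print(n, "prime:", is_prime(n), "passes base-2 Fermat test:", passes_fermat)
```

Every line reports "prime: False" together with "passes base-2 Fermat test: True", which is exactly the trap in Lo's example: passing the test looks like strong evidence of primality, just as the CUSIP strategy's backtest looks like strong evidence of skill.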
Here comes the data-snooping issue: is this a valid finding? Does stock performance really depend on what kind of number is embedded in a CUSIP? The answer is obviously no. Some might say that why something works is irrelevant as long as it does work, but this kind of logical positivism can be dangerous and misleading, especially when applied to a field where one cannot perform controlled experiments. Beyond the Carmichael example, spurious patterns of this sort can be found with other techniques, in particular machine learning tools such as neural networks and other non-linear methods.
Model-centric data snooping
One of the main drivers of data snooping in financial analysis is that many researchers use the same data over and over, so that it will always seem possible to find some pattern in it (Lo and MacKinlay, 1990). Finance is a model-centric field, mainly due to the limited availability of data and the impossibility of controlled experiments. For example, a large number of studies in finance rely on well-known datasets such as CRSP (Center for Research in Security Prices) for stock price data and COMPUSTAT for company fundamentals (Davis, 1994). The more studies are conducted on the same data, the more likely it becomes that successive findings are spurious, thus increasing the risk of data snooping. In their remarkable work, Harvey et al. (2016) illustrated how this repeated use of the same data with different models to find market factors (characteristics that are common to different stocks) implies that many findings in finance are likely false. The main reason provided by the authors is that researchers do not adjust the significance hurdles used in their study for the number of already existing studies on the subject. To decrease the risk of data snooping in future studies, the authors recommend raising the threshold for statistical significance.
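The point about unadjusted significance hurdles is easy to simulate. The sketch below uses hypothetical numbers (300 candidate factors, 600 observations each, zero true premium everywhere) and counts how many pure-noise "factors" clear the conventional |t| > 1.96 hurdle versus a stricter |t| > 3.0 hurdle of the kind Harvey et al. advocate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_obs = 300, 600

# Candidate "factors" drawn from pure noise: the true premium is zero,
# so every statistically significant result is a false discovery.
samples = rng.standard_normal((n_factors, n_obs))
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n_obs))

print("clear |t| > 1.96:", int((np.abs(t_stats) > 1.96).sum()))  # roughly 5% of 300
print("clear |t| > 3.00:", int((np.abs(t_stats) > 3.0).sum()))   # far fewer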
Research culture drives data snooping
In his presidential address to the American Finance Association, Campbell Harvey, the association's 2016 president, pointed out that research and publication practices in finance contribute to data snooping: journals reward novel, statistically significant results, which encourages researchers to search across specifications until significance appears and to report only the tests that "work" (Harvey, 2017).
Data-driven data snooping
In a model-centric culture, the same data is used to validate different models, which, as illustrated previously, can lead to data-snooping issues. Crucially, the reverse setup suffers from the same problem. In a data-driven approach, the model is fixed but the data varies: snooping then arises from scanning many datasets or subsamples until the fixed model appears to work. Data-driven analysis is at the core of machine learning techniques.
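A minimal sketch of this failure mode, assuming a deliberately naive one-day momentum rule and purely random price series: the model is held fixed while we scan many datasets, and the best-looking result will still appear "profitable" by construction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed "model": go long after an up day, short after a down day.
def strategy_return(prices):
    rets = np.diff(prices) / prices[:-1]
    signal = np.sign(rets[:-1])          # position taken from yesterday's move
    return float(np.mean(signal * rets[1:]))

# Scan 500 pure-noise price series (random walks) with the same fixed rule.
series = np.cumprod(1 + rng.normal(0, 0.01, size=(500, 250)), axis=1)
scores = [strategy_return(p) for p in series]

# The best of 500 noise series looks "profitable" despite zero true edge.
print("best average daily return:", max(scores))
```

Reporting only the best series out of the 500 scanned is the data-driven mirror image of testing 500 models on one dataset and reporting only the best model.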
As Feldman et al. (2015) put it: "Pessimists allege that Big Data may bring an end to social science research. One fear is that scholars will focus on pattern recognition rather than developing theory or engaging in hypothesis-driven empirical research. As it becomes easier to manipulate large numbers of records it is seductive to keep collecting more and more observations, matching ever more and more diverse sources — the potential is unlimited. Resources may be diverted to never-ending data projects rather than focusing on questions that are answerable with currently available data. Moreover, with a sufficiently large sample it is simply easier to find associations and make dubious claims. Another worry is that rather than focusing on interesting questions researchers will limit their inquiry to questions they are able to examine rather than consider the more socially relevant questions, becoming like the proverbial drunk who seeks their car keys under the lamp post because it is easiest to look there."
Recently, machine learning interpretability has started to gain attention (Molnar, 2020). This is an important step toward alleviating the risk of false discoveries. In fields like finance, however, interpreting a model still requires domain knowledge, experience, and judgment.
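As one illustration of interpretability tooling, the sketch below applies permutation importance to a toy regression, assuming scikit-learn is available; the data is synthetic and the setup is purely illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic regression: 5 features, only 2 of which actually matter.
X, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X, y)

# Permutation importance: how much does shuffling each feature hurt the fit?
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
```

Tools like this tell you *which* inputs a model relies on, but whether that reliance is economically sensible is still a domain-knowledge question — which is the article's point.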
To illustrate the role of domain knowledge, experience, and judgment in detecting false discoveries, let's take the well-known finding in finance published by Cieslak et al. (2019). In the United States, the Federal Reserve has a branch called the Federal Open Market Committee (FOMC), which is responsible for the Fed's open market operations (the purchase and sale of securities in the open market by the Federal Reserve). The FOMC holds eight regularly scheduled meetings per year, and an FOMC cycle refers to the period of time between two meetings. At these meetings, the Committee reviews economic and financial conditions, determines the appropriate stance of monetary policy, and assesses the risks to its long-run goals of price stability and sustainable economic growth. It is therefore expected that the market would react to any formal communication by the FOMC. Interestingly, Cieslak et al. (2019) found that even weeks of the FOMC cycle exhibit positive excess returns, while odd weeks exhibit excess returns that are negative or close to zero. The even/odd week variable was modeled as a dummy (0/1). If we were to judge this finding quickly, it would seem obvious that the order of the week (even/odd) is simply an insignificant data pattern with no interpretation or theory behind it. In this case, however, there is actually an explanation: equity premia are earned in the even weeks because the Federal Reserve holds other meetings during those weeks, and people attending these meetings informally leak information to the public about the intentions of the FOMC, driving market reactions ahead of the FOMC's formal communications.
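The even/odd-week dummy regression can be sketched on synthetic data. The 0.002 weekly premium below is an assumed, illustrative number, not the paper's estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic weekly excess returns with an assumed even-week premium.
n = 400
even_week = (np.arange(n) % 2 == 0).astype(float)   # dummy: 1 in even weeks, 0 in odd
returns = 0.002 * even_week + rng.normal(0.0, 0.01, n)

# OLS of returns on [constant, dummy] via least squares.
X = np.column_stack([np.ones(n), even_week])
beta, *_ = np.linalg.lstsq(X, returns, rcond=None)
print(f"odd-week mean: {beta[0]:.4f}, even-week premium: {beta[1]:.4f}")
```

The dummy coefficient recovers the even-week premium; in the real study, the hard part is not this regression but judging whether the pattern has an economic mechanism behind it.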
Can data snooping be completely eliminated?
According to Lo (1994), data snooping cannot be completely eliminated; it is an unavoidable challenge in non-experimental fields like finance. The first step toward addressing it is keeping it in mind when conducting financial analysis. In some cases, like the Fermat example provided at the beginning, it is straightforward to see that a finding is spurious. In other cases, the bias might be too subtle to detect. Partial solutions that can help include using novel data and avoiding over-analyzed datasets, adopting stricter significance thresholds, applying robust multiple-testing corrections, and introducing theoretical restrictions based on financial theory, judgment, or experience.
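Of these remedies, multiple-testing corrections are the most mechanical to apply. A sketch using the Holm correction, assuming statsmodels is available, on 100 p-values drawn from pure noise:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)

# 100 p-values from pure-noise tests: any "discovery" is false.
pvals = rng.uniform(size=100)
naive = int((pvals < 0.05).sum())   # unadjusted discoveries at the 5% level

# Holm step-down correction controls the family-wise error rate.
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="holm")

print("unadjusted discoveries:", naive)
print("after Holm correction:", int(reject.sum()))
```

The correction rarely lets a pure-noise p-value through, at the cost of reduced power, which is precisely the trade-off behind Harvey et al.'s call for stricter hurdles.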
References
Cieslak, A., Morse, A., & Vissing-Jorgensen, A. (2019). Stock returns over the FOMC cycle. The Journal of Finance, 74(5), 2201-2248.
Davis, J. L. (1994). The cross-section of realized stock returns: The pre-COMPUSTAT evidence. The Journal of Finance, 49(5), 1579-1593.
Dimson, E., & Marsh, P. (1990). Volatility forecasting without data-snooping. Journal of Banking & Finance, 14(2-3), 399-421.
Feldman, M., Kenney, M., & Lissoni, F. (2015). The new data frontier: Special issue of research policy. Research Policy, 44(9), 1629-1632.
Harvey, C. R. (2017). Presidential address: The scientific outlook in financial economics. The Journal of Finance, 72(4), 1399-1440.
Harvey, C. R., Liu, Y., & Zhu, H. (2016). … and the cross-section of expected returns. The Review of Financial Studies, 29(1), 5-68.
Lo, A. W., & MacKinlay, A. C. (1990). Data-snooping biases in tests of financial asset pricing models. The Review of Financial Studies, 3(3), 431-467.
Lo, A. (1994). Data-snooping biases in financial analysis. Blending Quantitative and Traditional Equity Analysis. Charlottesville, VA: Association for Investment Management and Research, 59-66.
Molnar, C. (2020). Interpretable machine learning. Lulu.com.