The Problem of Data Snooping in Financial Analysis and Machine Learning

In non-experimental fields such as economics and finance, using statistics to infer relationships in data is a central practice. However, several scholars have pointed out that common statistical practices can be misleading. In financial analysis in particular, the problem of data snooping, finding a pattern that seems significant but is in fact spurious, is a major challenge. Given the analytical tools and technologies available to researchers today, enough time and trial and error will almost always turn up some pattern in the data. Although data snooping affects all non-experimental sciences, finance is especially exposed because a large number of studies are conducted over and over again on the same data. Data snooping manifests itself in a variety of ways: sometimes it is easy to detect, while in other cases it is so subtle that it may go unnoticed. Given the recent interest in applying more sophisticated machine learning tools to financial analysis, understanding the risk of data snooping is essential for conducting reliable studies and finding meaningful patterns.

In this article, I will shed light on the problem of data snooping in finance, explain why it happens, and provide a list of references for further reading.

Introductory example

To illustrate the problem of data snooping, I will borrow an interesting example from (Lo, 1994). The example is based on a discovery by the French mathematician Pierre de Fermat. Fermat found that prime numbers p (numbers greater than 1 that are not a product of two smaller natural numbers, e.g. 3, 5, 7) have the following property: when 2^(p-1) is divided by p, the remainder is always 1. For example, when 2^(3-1) = 2^2 = 4 is divided by 3, the quotient is 1 with a remainder of 1. This holds for every odd prime. However, the converse is not true: if a number q is such that 2^(q-1) divided by q leaves a remainder of 1, q is not necessarily prime. Indeed, there are non-prime numbers that satisfy this property; such numbers are called "Carmichael numbers". In the first 10,000 integers there are only seven Carmichael numbers: 561, 1,105, 1,729, 2,465, 2,821, 6,601, and 8,911. Lo illustrates that a stock selection strategy consisting of picking the stocks with one of these seven Carmichael numbers in their CUSIP achieved remarkable performance over the period 1926-1991. For example, the figure below shows the extraordinary performance of the stock with CUSIP 03110510.

[Figure: cumulative performance of the stock with CUSIP 03110510]
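Both claims are easy to check directly. Here is a minimal Python sketch (my own, not from Lo's paper) that verifies Fermat's property for the odd primes below 10,000 and confirms that the seven composite numbers above pass the same test, using trial division and Python's built-in modular exponentiation:

```python
def is_prime(n):
    """Trial division; adequate for small n."""
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

# Fermat's property holds for every odd prime p below 10,000:
assert all(pow(2, p - 1, p) == 1 for p in range(3, 10_000) if is_prime(p))

# ...yet the seven composite (Carmichael) numbers quoted above pass the same test:
for q in (561, 1105, 1729, 2465, 2821, 6601, 8911):
    assert not is_prime(q) and pow(2, q - 1, q) == 1
print("all seven composites satisfy 2^(q-1) mod q == 1")
```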
Here comes the data-snooping issue: is this a valid finding? Does stock performance really depend on the type of number included in the stock's CUSIP? The answer is obviously no. Some might argue that why something works is irrelevant as long as it works, but this kind of logical positivism can be dangerous and misleading, especially in a field where one cannot perform controlled experiments. Beyond the Carmichael example, similar spurious patterns can be found using other techniques, in particular machine learning tools such as neural networks and non-linear methods.

Model-centric data snooping

One of the main drivers of data snooping in financial analysis is the repeated use of the same data by many researchers, such that it will almost always seem possible to find a pattern in the data (Lo and MacKinlay, 1990). Finance is a model-centric field, mainly because data availability is limited and controlled experiments are impossible. For example, a large number of studies in finance rely on well-known datasets such as CRSP (Center for Research in Security Prices) for stock price data and COMPUSTAT for company fundamentals (Davis, 1994). The more studies are conducted on the same data, the more likely it becomes that successive findings are spurious, increasing the risk of data snooping. In their remarkable work, (Harvey et al., 2016) showed how this repeated use of the same data with different models to find market factors (characteristics that are common to different stocks) implies that many findings in finance are likely false. The main reason, the authors argue, is that researchers do not adjust the significance hurdles in their studies for the number of studies already conducted on the subject. To decrease the risk of data snooping in future studies, the authors recommend raising the threshold for statistical significance (t-statistic) to 3 and relying on multiple hypothesis testing techniques.
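To illustrate what such adjustments look like in practice, here is a minimal sketch with hypothetical factor t-statistics, using a normal approximation for p-values together with the standard Bonferroni and Benjamini-Hochberg corrections (common multiple-testing techniques, not the exact procedures of Harvey et al., 2016):

```python
import math

# Hypothetical t-statistics from factor studies run on the same dataset.
t_stats = {"momentum": 3.4, "value": 2.5, "carmichael_cusip": 2.0,
           "size": 1.9, "lunar_phase": 1.6}

def two_sided_p(t):
    """Two-sided p-value, normal approximation to the t-statistic."""
    return math.erfc(abs(t) / math.sqrt(2))

m, alpha = len(t_stats), 0.05

# Harvey et al.'s simple hurdle: demand |t| > 3 instead of ~2.
print("t > 3 hurdle :", [k for k, t in t_stats.items() if abs(t) > 3.0])

# Bonferroni: each test must clear alpha / m rather than alpha.
print("Bonferroni   :", [k for k, t in t_stats.items() if two_sided_p(t) < alpha / m])

# Benjamini-Hochberg: keep the largest k with p_(k) <= alpha * k / m.
ranked = sorted(t_stats, key=lambda k: two_sided_p(t_stats[k]))
k_max = max((i for i, k in enumerate(ranked, 1)
             if two_sided_p(t_stats[k]) <= alpha * i / m), default=0)
print("Benj.-Hochb. :", ranked[:k_max])
```

With these made-up numbers, only one factor clears the raised hurdles, even though several look "significant" at the conventional 5% level when tested in isolation.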

Research culture drives data snooping

In his 2016 presidential address to the American Finance Association, published in The Journal of Finance (Harvey, 2017), Campbell Harvey pointed out that the research and publication practices of the finance community contribute to the problem of data snooping. On the one hand, journal editors expect researchers to submit only positive findings (the paper found a pattern) and reject negative ones (the paper didn't find any pattern). Given such expectations, researchers may deliberately put all their effort into hunting for patterns that show some sort of statistical significance. This practice is often referred to as p-hacking, and it is a form of data snooping: the researcher conducts many tests, keeps the ones whose statistical significance could constitute publication material, and discards the rest. In the search for market factors, this means that only a small fraction of the factors tested make it to publication, while insignificant factors are neglected. This in turn drives other practices that lead to false discoveries. First, negative findings are not used in adjusting significance levels, so the usual cutoff values may no longer be appropriate (Harvey et al., 2016). Second, the research paradigm becomes biased toward building on existing findings rather than conducting novel tests. By trying to build on existing results, the researcher conditions the model (estimator) selection on the data and on prior findings rather than conducting independent research. When the selection of a model is influenced by the data, the risk of false discovery rises because the researcher may be asking the wrong question.
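A tiny simulation illustrates the mechanics of p-hacking: generate many pure-noise "strategies", then report only the best one (a hypothetical setup, not any published study's design):

```python
import random
import statistics

random.seed(42)

N_STRATEGIES = 200  # number of specifications the researcher quietly tries
N_DAYS = 252        # one year of daily returns

def t_stat(returns):
    """t-statistic of the mean return against zero."""
    se = statistics.stdev(returns) / len(returns) ** 0.5
    return statistics.mean(returns) / se

# Every "strategy" is pure noise: zero true mean, 1% daily volatility.
best = max(t_stat([random.gauss(0.0, 0.01) for _ in range(N_DAYS)])
           for _ in range(N_STRATEGIES))
print(f"best t-stat out of {N_STRATEGIES} noise strategies: {best:.2f}")
# Reporting only this maximum routinely clears the usual 1.96 cutoff,
# even though no strategy has any real edge.
```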

Data-driven data snooping

In a model-centric culture, the same data is used to validate different models and, as illustrated above, this can lead to data snooping. Crucially, the converse setting is not immune either. In a data-driven approach, the model is fixed but the data varies. Data-driven analysis is at the core of machine learning: in a typical machine learning project, a lot of data is available along with a fixed set of modeling tools that can be trained on the data to recognize patterns. Machine learning tends to focus on prediction, giving less importance to classical statistical significance and causal interpretability, which in turn can lead to data snooping. Machine learning models are very powerful, and with a large number of hyperparameter tuning options available, sooner or later a pattern is very likely to be found in the data (see the sketch after the quote below). In (Dimson and Marsh, 1990), the authors showed that more complex models are more prone to data snooping bias. By blindly relying on complex machine learning techniques, researchers may be driven to focus on finding any regularity in the data, irrespective of its foundation. As (Feldman et al., 2015) put it:

Pessimists allege that Big Data may bring an end to social science research. One fear is that scholars will focus on pattern recognition rather than developing theory or engaging in hypothesis-driven empirical research. As it becomes easier to manipulate large numbers of records it is seductive to keep collecting more and more observations, matching ever more and more diverse sources — the potential is unlimited. Resources may be diverted to never-ending data projects rather than focusing on questions that are answerable with currently available data. Moreover, with a sufficiently large sample it is simply easier to find associations and make dubious claims. Another worry is that rather than focusing on interesting questions researchers will limit their inquiry to questions they are able to examine rather than consider the more socially relevant questions, becoming like the proverbial drunk who seeks their car keys under the lamp post because it is easiest to look there.
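To make the hyperparameter-tuning point referenced above concrete, here is a sketch in which a toy moving-average crossover strategy is "tuned" on a driftless random walk; the strategy, parameter grid, and data are all hypothetical:

```python
import random

random.seed(0)

def random_walk(n, vol=0.01):
    """Daily returns of a driftless random walk: nothing real to find."""
    return [random.gauss(0.0, vol) for _ in range(n)]

def sma(xs, w):
    """Simple moving average with a growing window at the start."""
    return [sum(xs[max(0, i - w + 1):i + 1]) / min(w, i + 1) for i in range(len(xs))]

def strategy_pnl(returns, fast, slow):
    """Total return of a long/flat moving-average crossover strategy."""
    prices, level = [], 100.0
    for r in returns:
        level *= 1.0 + r
        prices.append(level)
    f, s = sma(prices, fast), sma(prices, slow)
    # Hold the asset on day i+1 whenever the fast average exceeds the slow one on day i.
    return sum(returns[i + 1] for i in range(len(returns) - 1) if f[i] > s[i])

train = random_walk(500)
grid = [(f, s) for f in range(2, 20, 3) for s in range(20, 100, 10)]
best = max(grid, key=lambda p: strategy_pnl(train, *p))

print("best params   :", best)
print("in-sample PnL :", round(strategy_pnl(train, *best), 4))
print("out-of-sample :", round(strategy_pnl(random_walk(500), *best), 4))
# Picking the best of 48 parameter pairs guarantees a good-looking
# in-sample number; on fresh noise the expected edge is zero.
```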

Recently, machine learning interpretability has started to gain attention (Molnar, 2020). This is an important step toward alleviating the risk of false discoveries. However, in fields like finance, model interpretability still needs to be complemented by domain knowledge of theory, intuition, and judgment in order to validate scientific findings and provide insights beyond feature interpretation.

To illustrate the role of domain knowledge, experience, and judgment in detecting false discoveries, consider the well-known finding published by (Cieslak et al., 2019). In the United States, the Federal Reserve has a committee called the Federal Open Market Committee (FOMC), which is responsible for the Fed's open market operations (the purchase and sale of securities in the open market by the Federal Reserve). The FOMC holds eight regularly scheduled meetings per year, and an FOMC cycle refers to the period between two meetings. At these meetings, the Committee reviews economic and financial conditions, determines the appropriate stance of monetary policy, and assesses the risks to its long-run goals of price stability and sustainable economic growth. It is therefore expected that the market would react to any formal communication by the FOMC. Interestingly, the authors found that even weeks of the FOMC cycle exhibit positive excess stock returns, while odd weeks exhibit excess returns that are negative or close to zero. The even/odd week variable was modeled as a dummy (0/1). If we were to judge this finding quickly, it would seem obvious that the parity of the week is simply an insignificant data pattern with no interpretation or theory behind it. In this case, however, there turns out to be an explanation: the equity premium is earned in even weeks because the Federal Reserve holds other meetings during those weeks, and people attending these meetings informally leak information about the FOMC's intentions to the public, driving market reactions ahead of the FOMC's formal communications.
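For concreteness, a simplified construction of such a dummy variable might look as follows; the meeting dates here are hypothetical placeholders, and this is not the exact variable definition used by Cieslak et al. (2019):

```python
from datetime import date

# Hypothetical FOMC meeting dates; the real calendar is published by the Fed.
fomc_meetings = [date(2019, 1, 30), date(2019, 3, 20), date(2019, 5, 1)]

def even_week_dummy(day: date, meetings) -> int:
    """1 if `day` falls in an even week (0, 2, 4, ...) of the FOMC cycle,
    counted from the most recent meeting; 0 otherwise."""
    past = [m for m in meetings if m <= day]
    if not past:
        raise ValueError("no meeting on or before this date")
    week_in_cycle = (day - max(past)).days // 7  # week 0 starts at the meeting
    return 1 if week_in_cycle % 2 == 0 else 0

print(even_week_dummy(date(2019, 2, 20), fomc_meetings))  # 21 days -> week 3 -> 0
```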

Can data snooping be completely eliminated?

According to (Lo, 1994), data snooping cannot be completely eliminated; it is an unavoidable challenge in non-experimental fields like finance. The first step toward addressing it is simply keeping it in mind when conducting financial analysis. In some cases, like the Fermat example at the beginning of this article, it is straightforward to see that a finding is spurious; in other cases the bias may be too subtle to detect. Partial remedies include training on novel data and avoiding over-analyzed datasets, raising significance hurdles, applying robust multiple-testing techniques, and imposing theoretical restrictions based on financial theory, judgment, or experience. A simple structural safeguard is sketched below.
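As one example of the first remedy, a strictly chronological split with a holdout sample that is evaluated only once limits how much the final test data can be snooped. This is a minimal sketch; the modeling steps are left abstract:

```python
def time_split(series, train_frac=0.6, val_frac=0.2):
    """Chronological train/validation/holdout split; no shuffling,
    so no future information leaks into earlier segments."""
    n = len(series)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return series[:i], series[i:j], series[j:]

returns = list(range(1000))  # stand-in for a time-ordered return series
train, val, holdout = time_split(returns)

# Tune models and hyperparameters freely against `val`...
# ...but evaluate on `holdout` exactly once, for the final model only.
# Re-using the holdout across many candidates quietly reintroduces
# the data snooping this split was meant to prevent.
```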

References

Cieslak, A., Morse, A., & Vissing-Jorgensen, A. (2019). Stock returns over the FOMC cycle. The Journal of Finance, 74(5), 2201-2248.

Davis, J. L. (1994). The cross-section of realized stock returns: The pre-COMPUSTAT evidence. The Journal of Finance, 49(5), 1579-1593.

Dimson, E., & Marsh, P. (1990). Volatility forecasting without data-snooping. Journal of Banking & Finance, 14(2-3), 399-421.

Feldman, M., Kenney, M., & Lissoni, F. (2015). The new data frontier: Special issue of research policy. Research Policy, 44(9), 1629-1632.

Harvey, C. R. (2017). Presidential address: The scientific outlook in financial economics. The Journal of Finance, 72(4), 1399-1440.

Harvey, C. R., Liu, Y., & Zhu, H. (2016). … and the cross-section of expected returns. The Review of Financial Studies, 29(1), 5-68.

Lo, A. W., & MacKinlay, A. C. (1990). Data-snooping biases in tests of financial asset pricing models. The Review of Financial Studies, 3(3), 431-467.

Lo, A. W. (1994). Data-snooping biases in financial analysis. Blending Quantitative and Traditional Equity Analysis. Charlottesville, VA: Association for Investment Management and Research, 59-66.

Molnar, C. (2020). Interpretable machine learning. Lulu.com.
