Decoding the concept of probability of backtest overfitting: a step-by-step guide with Python scripts and visual aids.

The goal of this article is twofold:

1) to simplify the concepts behind the very popular paper "The Probability of Backtest Overfitting" (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2326253) by Bailey (https://www.dhirubhai.net/in/david-h-bailey-72b471/), Borwein (https://www.dhirubhai.net/in/jonathan-borwein-2157a835/), de Prado (https://www.dhirubhai.net/in/lopezdeprado/) and Zhu (https://www.dhirubhai.net/in/qiji-zhu-1669b71a/), and to present them in a way that is easily understandable to readers,

2) to visualize these concepts to further aid in comprehension.

Many scientific articles rely on complicated notation and specialized terminology, which can be a barrier for readers who are unfamiliar with the field. By breaking down the ideas into simple, step-by-step explanations and providing Python code for readers to replicate the process, I hope to make the concepts more accessible. Additionally, by visualizing the concepts, I aim to make them even easier to grasp.

Defining the model and collecting data:

The first ingredient is a Parametric Model, which means that the strategy is a function of the data sets and of n parameters.

These parameters have predefined values that determine the model’s combination space, which is represented by all possible n_plet combinations [p1, p2, …, pn].

Each of these unique n_plets generates a specific trading signal, and every signal corresponds to a particular value of the fitness function, which we select to calibrate the model.

Calibrating the model involves identifying the n_plet that maximizes the fitness function across the parameter space.

For instance, a very basic seasonal model in which, each day, I open a trade at a specific hour and close it at another specific hour is described by 2 parameters: the entry hour and the exit hour. The parameter space is formed by all the pairs (T_entry, T_exit).

If we assume the instrument trades 23 hours each day, there are 23*22 = 506 possible pairs, or 2_plets.
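As a minimal sketch (assuming the 23 tradable hours are labelled 0 to 22 and that entry and exit hours must differ), the parameter space can be enumerated like this:

from itertools import permutations

# All ordered (entry_hour, exit_hour) pairs with entry != exit: 23 * 22 = 506
two_plets = list(permutations(range(23), 2))
print(len(two_plets))  # 506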

The initial step is to generate a Matrix that encompasses all the time series of returns associated with the parametric space. The number of columns in this Matrix is equivalent to the total possible parameter combinations (e.g., 23*22 in the previous simple example). Meanwhile, the number of rows is determined by the number of timestamps in the return time series.

For instance, suppose we have ten years of price observations, and each year has 250 trading days. In that case, the number of rows in the Matrix would be (250*10) - 1 = 2499 (one observation is lost in the transition from prices to returns).
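A minimal sketch of such a Matrix, here filled with artificial returns just to fix the shapes (in a real application each column would hold the return series produced by one (T_entry, T_exit) pair):

import numpy as np
import pandas as pd

# One column per (entry, exit) pair, one row per return timestamp
two_plets = [(t_in, t_out) for t_in in range(23) for t_out in range(23) if t_in != t_out]
n_obs = 250 * 10 - 1  # ten years of daily prices -> 2499 returns

rng = np.random.default_rng(0)
SUMMARY = pd.DataFrame(
    rng.normal(0.0, 0.01, size=(n_obs, len(two_plets))),  # placeholder returns
    columns=pd.MultiIndex.from_tuples(two_plets, names=["t_entry", "t_exit"]),
)
print(SUMMARY.shape)  # (2499, 506)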



Defining the CSVC.

CSVC stands for combinatorially symmetric cross-validation.

The CSVC procedure involves dividing the dataset into 2N equal parts. Half of them are used for calibrating the model, and the other half is used to evaluate the performance of the strategy on unseen data. This procedure is repeated for every possible way in which N parts can be chosen among the 2N parts; the number of such ways is given by the binomial coefficient C(2N,N), as described here: https://en.wikipedia.org/wiki/Combination

To illustrate this concept, let’s consider an example where the dataset is divided into 6 parts. In this case, 3 parts will be used for the in-sample/training/calibration set, and the remaining 3 parts will be used for the out-of-sample/testing/validation set. The number of ways in which 3 parts can be selected from a set of 6 is given by the binomial coefficient C(6,3) = 20. The figure below shows all 20 possible combinations:

[Figure: the 20 possible partitions; red = in-sample parts, green = out-of-sample parts]
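A sketch of how these partitions can be enumerated (the same itertools approach is used in the appendix script):

from itertools import combinations

n_parts = 6
all_parts = set(range(n_parts))

# Every way of picking 3 in-sample parts out of 6; the remaining 3 form the out-of-sample set
partitions = [(set(in_sample), all_parts - set(in_sample))
              for in_sample in combinations(range(n_parts), n_parts // 2)]
print(len(partitions))  # C(6,3) = 20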


On each partition, the target function is calculated for all the possible parameter combinations, both on the red (in-sample) and on the green (out-of-sample) areas.

To make this clear, let’s picture the methodology for a model with 5 possible parameter combinations. Each combination gives rise to a strategy and hence to an equity line. The target function is calculated separately on the dataset intervals corresponding to the red areas and on those corresponding to the green areas.

[Figures: the 5 equity lines, with the target function evaluated separately on the red (in-sample) and green (out-of-sample) intervals]

Assuming that the framework is now clear, the next step is to rank the parameter combinations based on the value of the target function in both the in-sample and out-of-sample datasets.

[Table: the out-of-sample ranks OSRik, for in-sample rank i and partition k]

In the table, OSRik is the out-of-sample rank of the parameter combination that ranks ith in sample, for the kth in-sample/out-of-sample partition.

Considering only the first row of the previous table, we end up with a list of out-of-sample ranks corresponding to the best in-sample solution. In other words, we have a list that tells us, for every partition of the dataset, how the parameter combination that maximizes the target function in sample performs out of sample compared to all the other parameter combinations.

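As a toy illustration with made-up Sharpe values for 5 parameter combinations on a single partition, the out-of-sample rank of the in-sample winner can be read off like this:

import pandas as pd

# Hypothetical target-function (Sharpe) values on one partition
sharpe_is = pd.Series({"p1": 1.8, "p2": 0.4, "p3": 1.1, "p4": -0.2, "p5": 0.7})
sharpe_os = pd.Series({"p1": -0.5, "p2": 0.9, "p3": 0.3, "p4": 0.1, "p5": 1.2})

best_is = sharpe_is.idxmax()                       # "p1" wins in sample
os_ranking = sharpe_os.sort_values(ascending=False).index
print(best_is, os_ranking.get_loc(best_is))        # p1 4 -> worst of the 5 out of sample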

The idea is that if a model is not overfit, the parameter combination that performs best in sample will not consistently rank low compared to the other parameter combinations when tested out of sample.

The next step consists in using the list of out-of-sample rankings defined above to assess a “probability of overfitting”.

An Example:

Let's apply the idea above to a set of 5000 artificially generated return series, each comprising 1000 observations. If this were a real-world model, it would mean that there are 5000 possible parameter combinations and that the dataset covers a span of 1000 days.


Let us then decide to divide the 1000 timestamps into 16 parts. The number of ways to choose 16/2 = 8 elements from a set of 16 is given by the binomial coefficient C(16,8) = 12870. This means the number of in-sample/out-of-sample partitions in our example is 12870.

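The count is easy to verify with the standard library:

from math import comb

print(comb(16, 8))  # 12870 in-sample/out-of-sample partitions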

To represent the vector of out-of-sample rankings OSRik in a more mathematically elegant way, we can use the logit function.


Specifically, if the rank of the best in-sample parameter combination in the out-of-sample set is in the bottom half, then the logit value is negative; otherwise, it is positive.
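A sketch of the transformation as implemented in the appendix script below (note that this variant differs from the formula ln(w/(1-w)) used in the source paper, but its sign behaves the same way):

import numpy as np

def logit(os_rank, n_combinations):
    # Relative out-of-sample rank of the in-sample winner: 0 = best, -> 1 = worst
    rel_rank = os_rank / n_combinations
    # Positive when the winner lands in the top half (rel_rank < 0.5), negative otherwise
    return np.log(1.0 / (0.5 + rel_rank))

print(logit(0, 5000))     # best possible rank  -> ~ +0.69
print(logit(4999, 5000))  # worst possible rank -> ~ -0.41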

Plotting a histogram of the logit values helps us visualize the model's overfitting. If the histogram is concentrated on the negative side, the model is likely overfit; if it is concentrated on the positive side, the in-sample winner tends to keep outperforming out of sample, which is the behaviour of a model that is not overfit.

In our specific example, the histogram looks like this:

No alt text provided for this image

To calculate the probability of overfitting, we simply divide the number of partitions where the logit value is less than zero by the total number of partitions.
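In code, given the list of logit values from all the partitions, this is a one-liner (a sketch using numpy):

import numpy as np

def probability_of_overfitting(logits):
    # Fraction of partitions in which the in-sample winner ranks in the
    # bottom half out of sample (negative logit)
    return float(np.mean(np.asarray(logits) < 0.0))

print(probability_of_overfitting([0.3, -0.1, -0.4, 0.2]))  # 0.5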

Appendix: Python script.

In this section you will find the Python script that allows you to calculate the probability of overfitting, given:

  1. A dataframe of returns where each column represents a time series of returns for a given parameter combination of the model -> labelled as SUMMARY
  2. The number of parts into which the dataset is split -> labelled as n_partitions

from itertools import combinations
import multiprocessing
import time

import matplotlib.pyplot as plt
import bottleneck as bn
import numpy as np
import pandas as pd

class oft:

    def __init__(self, SUMMARY, n_partitions):
        # Initialize instance variables
        self.SUMMARY = SUMMARY  # A pandas dataframe: one column of returns per parameter combination
        self.n_partitions = n_partitions  # An even integer: the number of parts the dataset is split into
        self.len_sample = len(self.SUMMARY.columns)  # The number of parameter combinations
        self.oft_combinations = self.create_IS_OS_combinations()  # An iterable of (IS, OS) index-set tuples
        self.logits = run_simulation(self.calc_logit, self.oft_combinations)  # One logit per partition
        self.show_results()  # Plot and save the logit histogram

    def create_IS_OS_combinations(self):
        # Create index sets for in-sample (IS) and out-of-sample (OS) combinations
        combs_IS = list(combinations(range(self.n_partitions), self.n_partitions // 2))
        combs_OS = [list(set(range(self.n_partitions)) - set(comb)) for comb in combs_IS]
        # Split the rows of the dataframe into contiguous partitions
        splitted = np.array_split(self.SUMMARY.index, self.n_partitions)
        # Combine the partitions into IS and OS index sets
        OS = [np.concatenate([splitted[i] for i in comb]) for comb in combs_OS]
        IS = [np.concatenate([splitted[i] for i in comb]) for comb in combs_IS]
        # Return an iterable of tuples containing the IS and OS sets
        return zip(IS, OS)

    def find_OS_rank_of_best_IS(self, tuple_IS_OS):
        # Given an IS and OS combination, find the OS rank of the best IS combination
        IS, OS = tuple_IS_OS
        # Rank all parameter combinations by Sharpe ratio on the IS and on the OS set
        ranked_sharpe_IS = self.rank(self.calc_sharpe, self.SUMMARY, IS)
        ranked_sharpe_OS = self.rank(self.calc_sharpe, self.SUMMARY, OS)
        # Find the position of the best IS combination in the ranked OS set
        position_of_best_IS_in_OS = np.where(ranked_sharpe_OS.index == ranked_sharpe_IS.index[0])[0][0]
        return position_of_best_IS_in_OS

    def rank(self, function, SUMMARY, mappa):
        # Apply a function to each column (parameter combination) restricted to the given
        # index set, and return the results sorted from best to worst
        combs = SUMMARY.loc[mappa].T.values
        return pd.DataFrame([function(i) for i in combs]).sort_values(by=0, ascending=False)

    def calc_logit(self, k):
        # Given an IS and OS combination, calculate the logit
        os_rank = self.find_OS_rank_of_best_IS(k)
        rel_os_rank = os_rank / self.len_sample
        # Note: this variant differs from the paper's ln(w/(1-w)), but the sign behaves
        # the same way: positive if the IS winner lands in the top half out of sample
        logit = np.log(1 / (0.5 + rel_os_rank))
        return logit

    def calc_sharpe(self, pl):
        # Per-period Sharpe ratio: sum(pl)/sqrt(N*sum(pl**2) - sum(pl)**2) is
        # algebraically equivalent to mean(pl)/std(pl)
        N = len(pl)
        summa = bn.nansum(pl)
        if summa != 0:
            return summa / np.sqrt(N * bn.ss(pl) - summa ** 2)
        return 0.0  # keep the ranking well-defined for flat equity lines

    def show_results(self):
        df = pd.DataFrame(self.logits)
        fig, ax = plt.subplots(figsize=(15, 5))
        df.hist(bins=400, ax=ax)
        ax.set_xlabel('LOGIT OF THE OUT-OF-SAMPLE RANK OF THE BEST PARAMETER COMBINATION')
        ax.set_ylabel('number of occurrences')
        ax.set_title("Logit Histogram")
        # Save the figure to a file with the desired resolution
        plt.savefig('histogram_2.png', dpi=500)


def wrapper(args):
    # Unpack a [function, argument] pair produced below and call function(argument)
    return args[0](*args[1:])


def run_simulation(funct, combs):
    # Evaluate funct on every IS/OS combination in parallel
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    start_time = time.time()
    print("Calibrating")
    results = pool.map(wrapper, [[funct, comb] for comb in combs])
    pool.close()
    print("--- %s seconds ---" % (time.time() - start_time))
    return results
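A minimal usage sketch, assuming the script above is saved as a single file (the artificial SUMMARY below just mimics the 5000-combinations example; the __main__ guard is needed because multiprocessing re-imports the module in its worker processes, and with 12870 partitions the run takes a while):

if __name__ == "__main__":
    # 5000 artificial return series of 1000 observations each, as in the example above
    rng = np.random.default_rng(42)
    SUMMARY = pd.DataFrame(rng.normal(0.0, 0.01, size=(1000, 5000)))

    model = oft(SUMMARY, n_partitions=16)
    logits = np.asarray(model.logits)
    print("Probability of overfitting:", np.mean(logits < 0))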
Sergey Frolov

Quantitative Researcher

1 month

How did you come up with this elegant logit representation? In the source article the authors use ln(rank/(1-rank)). I checked several more implementations of this method and they use the formula from the source article. Now I see that the result is quite similar for both formulas. Thank you

Martin Mayer-Krebs, CQF

Quantitative Portfolio Manager

1 year

Very interesting article Francesco Landolfi. Thanks for writing it! Could it be applied to k-fold cross-validation in general and not only to CSVC in particular?

Carlos Castro

Quant Intern @ MassPRIM - MSc. Finance, MSc. Data Science

1 year

How is it that using OOS periods earlier in time than the in-sample ones does not introduce look-ahead bias?

Piotr Pomorski, PhD, CFA

Machine Learning & AI | Data Science | Quant Research

1 year

Insightful, thanks! Isn’t it dangerous though to perform calibration/validation/testing on all possible window combinations? Won’t that result in de facto parameter overfitting over the data used?
