ML Classification Algorithms to Predict Market Movements and Backtesting
Sabir Jana, CFA
Head of Quantitative Research at FinSharpe | IIT Madras | 25 years of Industry experience
In this article, we build stock trading strategies on several machine learning classification algorithms that predict the direction of the market. To analyze performance, we first run a simple vectorized backtest and then test the best-performing strategy with Backtrader to get a more realistic picture. You can find the Jupyter notebook used in this article on my GitHub page. The overall approach is as follows:
- Gathering Historical Pricing Data.
- Feature Engineering.
- Build and Apply Classification Machine Learning Algorithms.
- Backtesting of Selected Strategy using Backtrader.
- Performance Analysis of Backtesting.
Gathering Historical Pricing Data
We use the Nifty-50 index for this analysis. We download daily adjusted closing prices with the yfinance Python library, calculate daily log returns, and derive the market direction from the sign of those returns. We then plot the closing prices and daily returns as a quick check of the data. Let’s go through the code:
```python
# make the necessary imports
import datetime
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import yfinance as yf
from matplotlib import pyplot as plt
from sklearn import linear_model
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import pyfolio as pf
import backtrader as bt
from backtrader.feeds import PandasData

# set the style and ignore warnings
# (on recent matplotlib releases this style is named 'seaborn-v0_8-colorblind')
plt.style.use('seaborn-colorblind')
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')

# this is to display images in notebook
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# ticker and the start and end dates for testing
ticker = '^NSEI'  # Nifty 50 benchmark
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2020, 7, 31)

# download ticker 'Adj Close' price from yahoo finance
# (newer yfinance releases may need auto_adjust=False to return 'Adj Close')
stock = yf.download(ticker, progress=True, actions=True, start=start, end=end)['Adj Close']
stock = pd.DataFrame(stock)
stock.rename(columns={'Adj Close': ticker}, inplace=True)
stock.head(2)

# calculate daily log returns and market direction
stock['returns'] = np.log(stock[ticker] / stock[ticker].shift(1))
stock.dropna(inplace=True)
stock['direction'] = np.sign(stock['returns']).astype(int)
stock.head(3)

# visualize the closing price and daily returns
fig, ax = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
ax[0].plot(stock[ticker], label=f'{ticker} Adj Close')
ax[0].set(title=f'{ticker} Closing Price', ylabel='Price')
ax[0].grid(True)
ax[0].legend()
ax[1].plot(stock['returns'], label='Daily Returns')
ax[1].set(title=f'{ticker} Daily Returns', ylabel='Returns')
ax[1].grid(True)
ax[1].legend()
plt.tight_layout()
plt.savefig('images/chart1', dpi=300)
```
Code commentary:
- Make the necessary imports.
- Set the ticker to the Nifty-50 index (^NSEI), with start and end dates of 2010-01-01 and 2020-07-31.
- Download daily Adj Close data with the help of yfinance from Yahoo Finance.
- Calculate daily log returns and market direction using np.sign().astype(int).
- Visualize daily closing prices and log returns.
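The return and direction steps above can be checked by hand on a tiny series. The sketch below uses made-up prices (an assumption, not Nifty-50 data) and shows two details worth knowing: an unchanged close yields a direction of 0 rather than +1 or -1, and log returns add up, so exponentiating their sum recovers the gross return over the whole period.

```python
import numpy as np
import pandas as pd

# hypothetical closing prices, used only for illustration
prices = pd.Series([100.0, 102.0, 101.0, 101.0, 104.0], name="close")

# daily log return ln(P_t / P_{t-1}); the first value is NaN and is dropped
log_returns = np.log(prices / prices.shift(1)).dropna()

# market direction: +1 for an up day, -1 for a down day, 0 for an unchanged close
direction = np.sign(log_returns).astype(int)

# log returns are additive, so exp(sum) recovers the gross return over the period
total_gross_return = np.exp(log_returns.sum())

print(direction.tolist())            # [1, -1, 0, 1]
print(round(total_gross_return, 4))  # 1.04, i.e. 104 / 100
```

Days with a zero return are rare on an index, but where they occur the classifiers would see a third class label.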
Feature Engineering
In this section, we create the feature variables used to predict the market direction. We take five lags of the log-return series and digitize each one into a binary value (0 for a negative return, 1 otherwise). These binary features are the inputs from which the classifiers predict an upward or a downward market movement (+1 or -1). The Python code is as follows:
```python
# define the number of lags
lags = [1, 2, 3, 4, 5]

# compute lagged log returns
cols = []
for lag in lags:
    col = f'rtn_lag{lag}'
    stock[col] = stock['returns'].shift(lag)
    cols.append(col)

stock.dropna(inplace=True)
stock.head(2)

# function to transform the lag returns to binary values (0,+1)
def create_bins(data, bins=[0]):
    global cols_bin
    cols_bin = []
    for col in cols:
        col_bin = col + '_bin'
        data[col_bin] = np.digitize(data[col], bins=bins)
        cols_bin.append(col_bin)

create_bins(stock)
stock[cols + cols_bin].head(2)
```
Code commentary:
- Compute five lagged return columns by shifting the returns series by one to five days, so that each row pairs the current day's return with the returns of the previous five days.
- Define a function that transforms the lagged returns into binary values (0, 1) using np.digitize().
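To make the two steps above concrete, here is a minimal sketch on a handful of made-up returns (the values and the choice of two lags are assumptions for brevity). It shows that shift(k) places the return from k days earlier on each row, and that np.digitize with bins=[0] maps negative values to 0 and values of zero or above to 1.

```python
import numpy as np
import pandas as pd

# hypothetical daily log returns, used only for illustration
df = pd.DataFrame({"returns": [0.010, -0.020, 0.005, 0.000, -0.010, 0.030]})

# lag k holds the return observed k days earlier, so each row only sees past data
for lag in (1, 2):
    df[f"rtn_lag{lag}"] = df["returns"].shift(lag)

# the first rows have no history for the lags and are dropped
df = df.dropna().reset_index(drop=True)

# with bins=[0], np.digitize maps x < 0 to 0 and x >= 0 to 1
for lag in (1, 2):
    df[f"rtn_lag{lag}_bin"] = np.digitize(df[f"rtn_lag{lag}"], bins=[0])

print(df[["rtn_lag1", "rtn_lag1_bin", "rtn_lag2", "rtn_lag2_bin"]])
```

Note that a lagged return of exactly zero falls into the upper bin (1), because np.digitize treats each bin edge as the start of the next bin by default.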
Build and Apply Classification Machine Learning Algorithms
Now we use logistic regression, Gaussian naive Bayes, a support vector machine (SVM), a random forest, and a multi-layer perceptron (MLP) classifier to predict the market direction as +1 or -1. Please refer to the sklearn documentation for details on these and other algorithms. We then evaluate the performance of each model using vectorized backtesting and visualize the cumulative returns. Let’s go through the Python code:
```python
# create a dictionary of selected algorithms
models = {
    'log_reg': linear_model.LogisticRegression(),
    'gauss_nb': GaussianNB(),
    'svm': SVC(),
    'random_forest': RandomForestClassifier(max_depth=10, n_estimators=100),
    'MLP': MLPClassifier(max_iter=500),
}

# function that fits all models
def fit_models(data):
    mfit = {model: models[model].fit(data[cols_bin], data['direction'])
            for model in models.keys()}
    return mfit

# function that predicts (derives all position values) from the fitted models
def derive_positions(data):
    for model in models.keys():
        data['pos_' + model] = models[model].predict(data[cols_bin])

# function to evaluate all trading strategies
def evaluate_strats(data):
    global strategy_rtn
    strategy_rtn = []
    for model in models.keys():
        col = 'strategy_' + model
        data[col] = data['pos_' + model] * data['returns']
        strategy_rtn.append(col)
    strategy_rtn.insert(0, 'returns')

# fit the models
fit_models(stock)

# derive all position values
derive_positions(stock)

# evaluate all trading strategies by multiplying predicted directions by actual daily returns
evaluate_strats(stock)

# calculate total return and std. deviation of each strategy
print('\nTotal Returns: \n')
print(stock[strategy_rtn].sum().apply(np.exp))
print('\nAnnual Volatility:')
print(stock[strategy_rtn].std() * 252 ** 0.5)

# number of trades over time for highest and second highest return strategy
print('Number of trades SVM = ', (stock['pos_svm'].diff() != 0).sum())
print('Number of trades Random Forest = ', (stock['pos_random_forest'].diff() != 0).sum())

# vectorized backtesting of the resulting trading strategies and visualize the performance over time
ax = stock[strategy_rtn].cumsum().apply(np.exp).plot(
    figsize=(12, 6), title='Machine Learning Classifiers Return Comparison')
ax.set_ylabel("Cumulative Returns")
ax.grid(True)
plt.tight_layout()
plt.savefig('images/chart2', dpi=300)
```
Code commentary:
- Create a dictionary of selected algorithms.
- Define a function that fits all models with direction column as the dependent variable and _bin columns as feature variables.
- Define a function that predicts all position values from the fitted models.
- Define a function to evaluate all trading strategies.
- Next, we fit the models, predict positions, and evaluate all trading strategies by multiplying predicted directions by actual daily returns.
- Calculate the total return and standard deviation of each strategy.
- Calculate the number of trades over time for the highest and second-highest return strategies.
- Run a vectorized backtest of the resulting trading strategies and visualize the performance over time.
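The functions above fit each model and predict positions on the same rows, so the reported returns are in-sample. One way to get an out-of-sample view is a chronological split, which is an addition here and not part of the original notebook. The sketch below uses a hypothetical helper, chronological_split, and a synthetic frame standing in for the feature data.

```python
import numpy as np
import pandas as pd

def chronological_split(data: pd.DataFrame, train_frac: float = 0.75):
    """Split a time-ordered frame into an earlier and a later part (hypothetical helper)."""
    cutoff = int(len(data) * train_frac)
    # no shuffling: every training row precedes every test row
    return data.iloc[:cutoff].copy(), data.iloc[cutoff:].copy()

# synthetic stand-in for the feature frame built earlier
idx = pd.bdate_range("2020-01-01", periods=20)
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    {"rtn_lag1_bin": rng.integers(0, 2, size=20),
     "direction": rng.choice([-1, 1], size=20)},
    index=idx,
)

train, test = chronological_split(frame, train_frac=0.75)
print(len(train), len(test))                   # 15 5
print(train.index.max() < test.index.min())    # True
```

With the article's objects, the models would then be fitted on train[cols_bin] and train['direction'], and positions predicted only on test[cols_bin], so the backtest covers dates the models have not seen.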
We can see that the support vector machine model has given the maximum total return over time, with annual volatility comparable to the other models. However, it would be premature to deploy any such strategy based on vectorized backtesting results. Some of the reasons are listed below:
- The number of trades is quite high, and vectorized backtesting doesn’t account for costs such as trading commission and market slippage.
- The strategy takes both long and short positions; however, short selling may not be feasible for a number of reasons.
Hence, our backtesting needs to be more realistic and event-driven to address the above gaps.
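The effect of trading costs can be roughly sketched within the vectorized framework itself. The example below uses made-up returns and positions, and assumes a proportional cost of 0.1% per unit traded, where a flip from +1 to -1 counts as two units. It is an illustration of the cost drag only, not a substitute for an event-driven backtest.

```python
import numpy as np
import pandas as pd

# hypothetical daily log returns and predicted positions (+1 long, -1 short)
returns = pd.Series([0.010, -0.005, 0.007, 0.012, 0.004])
position = pd.Series([1, -1, -1, 1, 1])

# gross strategy log return: position held that day times the day's log return
gross = position * returns

# units traded per day: opening the first position is 1 unit, a flip is 2 units
turnover = position.diff().abs().fillna(position.abs())

# assumed proportional cost of 0.1% per unit traded, applied in log terms
cost = 0.001
net = gross + np.log(1 - cost * turnover)

print("trades:", int((turnover > 0).sum()))
print("gross total:", round(float(np.exp(gross.sum())), 4))
print("net total:  ", round(float(np.exp(net.sum())), 4))
```

The more often the position changes, the larger the gap between the gross and net totals, which is why a high trade count deserves attention before trusting a vectorized result.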
Backtesting of Selected Strategy using Backtrader
In this section, we take our best-performing model, i.e. the support vector machine (SVM), and backtest it using the Python library Backtrader. The backtesting strategy is as follows:
- We start with an initial capital of 100,000 and a trading commission of 0.1%.
- We buy when the predicted value is +1 and sell (only if stock is in possession) when the predicted value is -1.
- All-in strategy: when creating a buy order, buy as many shares as possible.
- Short selling is not allowed.
Let’s go through the Python code:
```python
# fetch the daily pricing data from yahoo finance
# (newer yfinance releases may need auto_adjust=False to return 'Adj Close')
prices = yf.download(ticker, progress=True, actions=True, start=start, end=end)
prices.head(2)

# rename the columns as needed for Backtrader
prices.drop(['Close', 'Dividends', 'Stock Splits'], inplace=True, axis=1)
prices.rename(columns={'Open': 'open', 'High': 'high', 'Low': 'low',
                       'Adj Close': 'close', 'Volume': 'volume'}, inplace=True)
prices.head(3)

# add the predicted column to prices dataframe. This will be used as signal for buy or sell
# note: 'strategy_svm' holds position * daily return; the raw predicted direction is in 'pos_svm'
predictions = stock['strategy_svm']
predictions = pd.DataFrame(predictions)
predictions.rename(columns={'strategy_svm': 'predicted'}, inplace=True)
prices = predictions.join(prices, how='right').dropna()
prices.head(2)

OHLCV = ['open', 'high', 'low', 'close', 'volume']

# class to define the columns we will provide
class SignalData(PandasData):
    """
    Define pandas DataFrame structure
    """
    cols = OHLCV + ['predicted']

    # create lines
    lines = tuple(cols)

    # define parameters
    params = {c: -1 for c in cols}
    params.update({'datetime': None})
    params = tuple(params.items())
```
Code commentary:
- Fetch the daily pricing data from Yahoo Finance and rename the columns to the OHLCV format needed by Backtrader.
- Take the SVM strategy returns (the strategy_svm column) from the stock dataframe and join them to the prices dataframe as the predicted column. The sign of this column’s value is the signal to buy or sell when placing an order.
- Define a custom SignalData class for dataframe columns to be fed to Backtrader.
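The join step above is easy to misread, so here is a minimal sketch with made-up prices and signals (all values are assumptions). Joining with how='right' keeps every row of the price frame, and dropna() then removes the days for which no signal exists, such as the first days lost to the lag features.

```python
import pandas as pd

idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"])

# hypothetical OHLCV rows, one per trading day
prices = pd.DataFrame(
    {"open": [10.0, 10.2, 10.1, 10.4],
     "high": [10.3, 10.4, 10.5, 10.6],
     "low": [9.9, 10.0, 10.0, 10.2],
     "close": [10.2, 10.1, 10.4, 10.5],
     "volume": [1000, 1200, 900, 1100]},
    index=idx,
)

# hypothetical signal; the first day has no value
predicted = pd.DataFrame({"predicted": [1.0, -1.0, 1.0]}, index=idx[1:])

# how='right' keeps every price row; dropna removes rows that have no signal
merged = predicted.join(prices, how="right").dropna()

print(merged.columns.tolist())
print(len(merged))  # 3
```

The resulting frame has the signal column first, followed by the OHLCV columns, and shares one date index, which is the layout the SignalData feed expects to read by column name.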
Now we define the MLStrategy class for the backtesting strategy; it must inherit from bt.Strategy. Because the market direction is predicted from data available at day t’s close, we set cheat_on_open=True when creating the bt.Cerebro object. This means the number of shares we want to buy is based on day t+1’s open price. As a result, we define the next_open method instead of next within the Strategy class.
```python
# define backtesting strategy class
class MLStrategy(bt.Strategy):
    params = dict()

    def __init__(self):
        # keep track of open, close prices and predicted value in the series
        self.data_predicted = self.datas[0].predicted
        self.data_open = self.datas[0].open
        self.data_close = self.datas[0].close

        # keep track of pending orders/buy price/buy commission
        self.order = None
        self.price = None
        self.comm = None

    # logging function
    def log(self, txt):
        '''Logging function'''
        dt = self.datas[0].datetime.date(0).isoformat()
        print(f'{dt}, {txt}')

    def notify_order(self, order):
        if order.status in [order.Submitted, order.Accepted]:
            # order already submitted/accepted - no action required
            return

        # report executed order
        if order.status in [order.Completed]:
            if order.isbuy():
                self.log(f'BUY EXECUTED --- Price: {order.executed.price:.2f}, '
                         f'Cost: {order.executed.value:.2f}, '
                         f'Commission: {order.executed.comm:.2f}')
                self.price = order.executed.price
                self.comm = order.executed.comm
            else:
                self.log(f'SELL EXECUTED --- Price: {order.executed.price:.2f}, '
                         f'Cost: {order.executed.value:.2f}, '
                         f'Commission: {order.executed.comm:.2f}')

        # report failed order
        elif order.status in [order.Canceled, order.Margin, order.Rejected]:
            self.log('Order Failed')

        # set no pending order
        self.order = None

    def notify_trade(self, trade):
        if not trade.isclosed:
            return
        self.log(f'OPERATION RESULT --- Gross: {trade.pnl:.2f}, Net: {trade.pnlcomm:.2f}')

    # We have set cheat_on_open = True. This means that we calculated the signals on day t's close price,
    # but calculated the number of shares we wanted to buy based on day t+1's open price.
    def next_open(self):
        if not self.position:
            if self.data_predicted[0] > 0:
                # calculate the max number of shares ('all-in')
                size = int(self.broker.getcash() / self.data_open[0])
                # buy order
                self.log(f'BUY CREATED --- Size: {size}, Cash: {self.broker.getcash():.2f}, '
                         f'Open: {self.data_open[0]}, Close: {self.data_close[0]}')
                self.buy(size=size)
        else:
            if self.data_predicted[0] < 0:
                # sell order
                self.log(f'SELL CREATED --- Size: {self.position.size}')
                self.sell(size=self.position.size)
```
Code commentary:
- The function __init__ keeps references to the open, close, and predicted series and initializes the pending order, buy price, and commission.
- The function notify_order logs executed and failed orders and clears the pending order.
- The function notify_trade logs the gross and net profit and loss once a trade is closed.
- The function next_open acts on the predicted value. If we don’t hold any position and the predicted value is greater than zero, it checks the available cash, calculates the maximum number of shares that can be bought at the open price, and places the buy order. If we hold a position and the predicted value is less than zero, it places a sell order for the whole position.
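The all-in sizing in next_open can be worked through with plain arithmetic. The sketch below assumes an open price of 10,000 (a made-up value) together with the article's 100,000 capital and 0.1% commission. It shows that dividing cash by price ignores the commission, so the order value plus fee can exceed the available cash. The second helper, size_with_commission, is a hypothetical variant and not part of the original strategy.

```python
def all_in_size(cash: float, price: float) -> int:
    """Largest whole number of shares whose cost, before commission, fits in cash."""
    return int(cash / price)

def size_with_commission(cash: float, price: float, commission: float) -> int:
    """Hypothetical variant that also leaves room for the proportional commission."""
    return int(cash / (price * (1 + commission)))

# open_price is an assumed value chosen to make the shortfall visible
cash, open_price, commission = 100_000.0, 10_000.0, 0.001

size = all_in_size(cash, open_price)
cost = size * open_price
fee = cost * commission

print(size, cost, fee)            # shares, order value, commission
print(cash - cost - fee)          # negative: cash does not cover cost plus fee
print(size_with_commission(cash, open_price, commission))
```

How a simulated broker treats such a shortfall depends on its configuration; in the strategy above, an order that ends in the Margin status would be logged as 'Order Failed'.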
Next, we instantiate SignalData and Cerebro objects and add prices dataframe, MLStrategy, initial capital, commission, and pyfolio analyzer. Finally, we run the backtest and capture the results.
```python
# instantiate SignalData class
data = SignalData(dataname=prices)

# instantiate Cerebro, add strategy, data, initial cash, commission and pyfolio for performance analysis
cerebro = bt.Cerebro(stdstats=False, cheat_on_open=True)
cerebro.addstrategy(MLStrategy)
cerebro.adddata(data, name=ticker)
cerebro.broker.setcash(100000.0)
cerebro.broker.setcommission(commission=0.001)
cerebro.addanalyzer(bt.analyzers.PyFolio, _name='pyfolio')

# run the backtest
print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())
backtest_result = cerebro.run()
print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue())
```
Performance Analysis of Backtesting
We will analyze the performance statistics using pyfolio. pyfolio is a Python library for performance and risk analysis of financial portfolios, developed by Quantopian Inc.
```python
# Extract inputs for pyfolio
strat = backtest_result[0]
pyfoliozer = strat.analyzers.getbyname('pyfolio')
returns, positions, transactions, gross_lev = pyfoliozer.get_pf_items()
returns.name = 'Strategy'
returns.head(2)

# get benchmark returns
benchmark_rets = stock['returns']
benchmark_rets.index = benchmark_rets.index.tz_localize('UTC')
benchmark_rets = benchmark_rets.filter(returns.index)
benchmark_rets.name = 'Nifty-50'
benchmark_rets.head(2)

# get performance statistics for strategy
pf.show_perf_stats(returns)

# plot performance for strategy vs benchmark
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(16, 9), constrained_layout=True)
axes = ax.flatten()

pf.plot_drawdown_periods(returns=returns, ax=axes[0])
axes[0].grid(True)
pf.plot_rolling_returns(returns=returns, factor_returns=benchmark_rets,
                        ax=axes[1], title='Strategy vs Nifty-50')
axes[1].grid(True)
pf.plot_drawdown_underwater(returns=returns, ax=axes[2])
axes[2].grid(True)
pf.plot_rolling_sharpe(returns=returns, ax=axes[3])
axes[3].grid(True)

# fig.suptitle('Strategy vs Nifty-50 (Buy and Hold)', fontsize=16, y=0.990)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.savefig('images/chart3', dpi=300)
```
Code commentary:
- We extract inputs needed for pyfolio from the backtesting result.
- Get the benchmark daily returns to compare and contrast with the strategy.
- Get performance statistics for the strategy using pyfolio show_perf_stats.
- Visualize drawdowns, cumulative returns, underwater plot, and rolling Sharpe ratio.
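For readers who want to see what the headline statistics mean, the sketch below computes a few of them by hand on made-up daily returns. It assumes 252 trading days per year and a zero risk-free rate; pyfolio's own conventions may differ in detail, so treat this as an illustration of the definitions and not a reproduction of its output.

```python
import numpy as np
import pandas as pd

# hypothetical daily simple returns, used only for illustration
rets = pd.Series([0.01, -0.02, 0.015, 0.005, -0.01, 0.02])

# cumulative return: compound the daily returns
wealth = (1 + rets).cumprod()
cumulative_return = wealth.iloc[-1] - 1

# annualised return, assuming 252 trading days per year
annual_return = wealth.iloc[-1] ** (252 / len(rets)) - 1

# annualised Sharpe ratio with a zero risk-free rate (an assumption)
sharpe = rets.mean() / rets.std() * np.sqrt(252)

# drawdown: fall from the running peak of the wealth curve
drawdown = wealth / wealth.cummax() - 1
max_drawdown = drawdown.min()

print(f"cumulative return: {cumulative_return:.4f}")
print(f"annual return:     {annual_return:.4f}")
print(f"sharpe ratio:      {sharpe:.2f}")
print(f"max drawdown:      {max_drawdown:.4f}")
```

The underwater plot is simply the drawdown series drawn over time, and the rolling Sharpe ratio applies the same ratio over a moving window.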
Let’s analyze the performance of our strategy. The annual return is just 3.9% and the cumulative return is 48%, compared with the 8.86-times total return we observed in the vectorized backtest. If we plot a few other performance measures against the benchmark, we can see that our strategy does not beat a simple buy-and-hold strategy.
So the obvious question is: why? First, we paid a large amount in commission because of the high number of trades. Second, we did not allow short selling in the Backtrader backtest.
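As a rough cross-check of the figures quoted above, the annual and cumulative numbers can be related by compounding over the backtest window. The sketch below uses calendar years between the stated start and end dates, which is an assumption; pyfolio annualises by trading days, so small differences are expected.

```python
from datetime import date

# backtest window stated earlier in the article
start, end = date(2010, 1, 1), date(2020, 7, 31)
years = (end - start).days / 365.25

# compounding the reported 3.9% annual return over the window
implied_cumulative = (1 + 0.039) ** years - 1

# annualising the 8.86x total return reported for the vectorized backtest
implied_annual_vectorized = 8.86 ** (1 / years) - 1

print(f"years: {years:.2f}")
print(f"implied cumulative return at 3.9% a year: {implied_cumulative:.1%}")
print(f"implied annual return for 8.86x total:    {implied_annual_vectorized:.1%}")
```

The implied cumulative return lands close to the reported 48%, and the vectorized result corresponds to an annual return of over 20%, which puts the gap between the two backtests in perspective.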
In conclusion, vectorized backtesting results may often look great on paper; however, we need to consider all aspects of implementation shortfall and feasibility before we decide to implement such a strategy. Also, keep in mind that the capital market is not just about machine learning; otherwise, all data scientists would have become super-rich by now.
Happy investing and do leave your comments on the article!
Please Note: This analysis is only for educational purposes and the author is not liable for any of your investment decisions.
References:
- Python for Finance 2e: Mastering Data-Driven Finance by Yves Hilpisch
- Python for Finance Cookbook: Over 50 recipes for applying modern Python libraries to financial data analysis by Eryk Lewinson
- Machine Learning for Algorithmic Trading by Stefan Jansen