Time-series Classification using Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM)

One of the most striking facts about neural networks is that they can approximate a wide range of functions, which is why they are often referred to as universal function approximators. The ability to compute an arbitrary function is truly remarkable, and almost any process you can imagine can be thought of as function computation. Consider the problem of naming a piece of music based on a short sample of the piece. That can be thought of as computing a function. Or consider the problem of translating a Chinese text into English. Or something as simple as y = sin(x).

[Figure: plot of y = sin(x)]

The investment and trading world is brimming with functions, whether it be a stochastic model like Black-Scholes to price options or a simple linear function like the Fama-French factor model to understand portfolio risk. Of all these functions, the most challenging is the time series of price itself. Price can't be modeled as a single function; it is formed by the interactions of multiple entities (traders, investors, policy makers, speculators…) trying to increase their utility from different sources of information and mandates. We see a wide range of price behavior in the market: sometimes it fits an ARIMA model very well, but when there's momentum in the market a GARCH model looks like a better fit to volatility, and a few days later, when the price is range-bound, a Vasicek model looks really attractive. Because of this varied behavior we need models that are agnostic to our statistical assumptions and can learn complex non-linear relations from data. Moreover, price itself is so noisy that we often need derivatives or exogenous variables to model future returns, and most simple statistical tools fail at this.

But that doesn't mean statistical models are not useful. In fact, models like autoregression (AR), moving average (MA), Holt-Winters, ARIMA, Theta, etc., do really well in univariate time series analysis when the underlying data-generating process is static. I would highly recommend reading this paper by Spyros Makridakis, which compares these statistical models against machine learning models, to really understand their significance.

In this article I look at a classical supervised learning problem, understanding the function f that connects x and y, from a multivariate time series perspective using synthetic data.

[Figure: the target y as an unknown function f of the features x]

In the trading domain this function would look something like:

2-day forward returns = f(previous day returns, news sentiment, previous day traded volume)

Where function f could be linear, non-linear or anything imaginable under the sun.

In this article I will create a few random functions, change the distribution of the target y, introduce lags in x, and see how well the ML models can predict the target y as a classification problem. The reason I am framing this as a classification problem rather than regression is that I will be using this model as the brain for a Deep Reinforcement Learning agent in one of my projects (github).

A few requirements for my models are:

  • Support Multi-class
  • Non-linear
  • Be able to perform well under heavily imbalanced data
  • Be able to adapt to a wide range of signal-to-noise ratios
  • Be able to support custom loss functions

And these are the models that I will be comparing in this study:

  • Random Forest
  • XGBoost
  • LightGBM
  • 1-D Convolutional Neural Networks (CNN) Architecture
  • Long Short Term Memory (LSTM) plus CNN Architecture

Dataset

[Figure: definition of the synthetic data-generating function with features x1 … x5 and target y]

Where the features (x1.... x5) are drawn from a standard normal distribution, so we don't have to worry about stationarity. The target variable y is sampled and grouped to create the required balanced or unbalanced classes. The sample size is 10,000 with a train-test split of 80:20.
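As a concrete stand-in for the setup above, here is a minimal sketch of how such a synthetic classification dataset can be generated. The particular non-linear function, the 2-step lag and the quantile binning are my own illustrative choices, not the exact functions used in the experiments:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 10_000

# Five i.i.d. standard-normal features, so stationarity is not a concern.
X = rng.standard_normal((n, 5))

# A made-up non-linear target with a 2-step lag on x1 (illustrative only;
# np.roll wraps around at the edges, which is harmless for a synthetic demo).
signal = 0.5 * np.roll(X[:, 0], 2) + np.sin(X[:, 1]) + 0.3 * X[:, 2] * X[:, 3]
noisy = signal + 0.1 * rng.standard_normal(n)

# Bin the continuous target into 3 classes: quantile bins give balanced
# classes, fixed thresholds would give imbalanced ones.
y = np.digitize(noisy, np.quantile(noisy, [1 / 3, 2 / 3]))

# 80:20 split without shuffling, so the temporal order is preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)
```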

Let's start with the three tree-based ensemble models. With time-series-aware hyper-parameter tuning (a minimal sketch of the setup follows), they give the test-set results below for the following functions.
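Before looking at the results, this is roughly what a time-series-aware tuning loop looks like for the Random Forest case. The parameter grid and scoring metric are illustrative, and XGBoost (XGBClassifier) and LightGBM (LGBMClassifier) can be dropped into the same pattern through their scikit-learn wrappers:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Walk-forward CV folds instead of random shuffling, so each model is
# validated on data that comes after its training window.
tscv = TimeSeriesSplit(n_splits=5)

param_grid = {"n_estimators": [200, 500], "max_depth": [5, 10, None]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    cv=tscv,
    scoring="f1_macro",
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```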

[Figure: test-set results of the three tree-based models on the synthetic functions]

We can see that as we move from a linear function to more complex ones, the models hold up quite well; even with some noise they were still able to give good results. But the moment I introduced time lags in the last function, the models started underperforming. This makes perfect sense given that these models don't have any kind of memory. The Efficient Market Hypothesis is only a theoretical concept; in reality time lags are important, they exist in the market, and we often see a delayed response in price to a range of economic and trading events. So we can't get away without modelling the time lag.


A few common feature engineering approaches for time series involve deriving new features such as (a minimal pandas sketch follows the list):

  • Date features like day of week, day of month etc.
  • AutoRegressive features like optimal lag and lag-features interaction
  • Different types of exponentially weighted moving averages
  • Aggregation of past information (different time groups and time intervals)
  • Target transformations and differencing, etc.
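As a minimal sketch of the ideas in the list above (the column name, lags and window sizes are arbitrary, and the snippet assumes a DataFrame with a DatetimeIndex):

```python
import pandas as pd

def add_time_series_features(df: pd.DataFrame, col: str = "x1") -> pd.DataFrame:
    """Illustrative lag / rolling / date features for one column."""
    out = df.copy()
    out["dow"] = out.index.dayofweek                       # date feature
    for lag in (1, 2, 5):                                  # autoregressive lags
        out[f"{col}_lag{lag}"] = out[col].shift(lag)
    out[f"{col}_ewm10"] = out[col].ewm(span=10).mean()     # exponentially weighted MA
    out[f"{col}_roll5_mean"] = out[col].rolling(5).mean()  # aggregation of past info
    out[f"{col}_diff1"] = out[col].diff()                  # differencing
    return out.dropna()

# Usage: df must have a DatetimeIndex and a column named "x1".
# features = add_time_series_features(df)
```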

I won't go deep into it, but Cesium is a good package that's used in multiple papers to create standard time series features. Feature importance goes hand-in-hand with feature engineering (both model-dependent and model-independent), and two techniques that I have found interesting are SHAP (based on Shapley values) and LIME. A rigorous exploration and feature engineering exercise would definitely help us approximate the time-lagged function better.

But in our case we are trying to model a time series like a stock price, which has a mind of its own and keeps changing based on events and participants. Having a fixed set of features is not a great strategy, especially if you want to incorporate it into a Reinforcement Learning framework. So let's look at a few models that have feature engineering built in.

Deep Learning : 1D - Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN)


Wait, did I say CNN? Isn't that used for images?

Yes, they are quite prominent in the image domain, but recently 1D-CNNs have been shown to provide good results on challenging sequential datasets with little or no feature engineering. CNN layers not only fold feature engineering into one framework, they are able to extract features and create informative representations of time series automatically. They are highly noise-resistant models, and they can extract very informative deep features that are independent of time. Long Short-Term Memory networks (LSTM, a type of RNN) are also pretty good at extracting patterns in the input feature space when the input data spans long sequences; the unique gated architecture of LSTMs gives them the ability to manipulate their memory state so as to store long-term information. For these reasons I will be using a CNN and a CNN plus LSTM architecture. The Netherlands eScience Center has selected a few architectures (CNN and CNN+LSTM) that have worked well for their time-series datasets and compiled them into a Python package called Mcfly; I have used architectures from the Mcfly package for further testing.
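For orientation, here is a minimal Keras sketch of a 1D-CNN plus LSTM classifier in the spirit of the Mcfly architectures. The layer types are standard Keras, but the filter counts and layer sizes are my own guesses rather than Mcfly's defaults:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(n_timesteps: int, n_features: int, n_classes: int) -> tf.keras.Model:
    """A small CNN + LSTM classifier for multivariate time-series windows."""
    inputs = layers.Input(shape=(n_timesteps, n_features))
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.LSTM(32)(x)                       # summarises the conv features over time
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_cnn_lstm(n_timesteps=30, n_features=5, n_classes=3)
```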

Here are the test-set results for some functions (with lags) trained and tested on the CNN and CNN+LSTM architectures provided by Mcfly. No hyper-parameter tuning is done on the models; instead, 10 architectures (for each of CNN and CNN+LSTM) with random parameter initialization are trained for 20 epochs on 20% of the training set. Once the best model is chosen, that model is run for 200 epochs with a train:validation:test split of 70:10:20 on 10,000 samples.
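Mcfly performs this screening internally; the loop below is only a rough sketch of the same procedure, where `candidate_builders` stands in (hypothetically) for Mcfly's random architecture generator:

```python
import numpy as np

def select_best_model(candidate_builders, X_tr, y_tr, X_val, y_val,
                      screen_epochs=20, final_epochs=200, subset_frac=0.2):
    """Screen several randomly generated architectures on a subset of the
    training data, then fully train the best-scoring one."""
    n_subset = int(len(X_tr) * subset_frac)
    best_model, best_acc = None, -np.inf
    for build in candidate_builders:            # e.g. 10 CNN / CNN+LSTM builders
        model = build()
        model.fit(X_tr[:n_subset], y_tr[:n_subset],
                  epochs=screen_epochs, verbose=0)
        acc = model.evaluate(X_val, y_val, verbose=0)[1]   # [loss, accuracy]
        if acc > best_acc:
            best_model, best_acc = model, acc
    best_model.fit(X_tr, y_tr, validation_data=(X_val, y_val),
                   epochs=final_epochs, verbose=0)
    return best_model
```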

[Figure: test-set results of the CNN and CNN+LSTM architectures on the lagged functions]

The deep learning models seem to hold up pretty well with the introduction of lags. Even though the CNN-only architecture has better overall scores, the CNN_LSTM architecture seems to be more stable under increased noise.

The main limitation of deep learning is the amount of data required to train new models. All the previous experiments involved a sample size of 10,000; if we are looking at the daily close price of a company as one sample, getting to 10,000 would require 40 years of data, which would limit our time series universe.

So to overcome this limitation, I decided to test whether one neural network architecture can model multiple time series with similar characteristics on a combination of smaller datasets.

[Figure: the 4 synthetic functions, differing only in their coefficients and lags]

Let's consider the case of the above 4 functions, each with a 10% signal-to-noise ratio. If you look at them closely, the only difference between them is the coefficients and the lag (an attempt to characterize the idiosyncrasies of different stock prices). Let's stack them one after the other for each point in time to create a total of 10,000 samples, add an additional feature to distinguish between these functions (added as an embedding layer in Keras, so we get a dynamic vector representation instead of a static one-hot encoded feature set), and train for 400 epochs.
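A minimal Keras sketch of this setup might look like the following. The layer sizes and embedding dimension are illustrative; the point is only that the series id enters through an Embedding layer and is concatenated with the CNN+LSTM features:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_multi_series_model(n_timesteps, n_features, n_series, n_classes):
    """One network for several related series: the series id goes through an
    Embedding layer and is merged with the CNN + LSTM features."""
    ts_in = layers.Input(shape=(n_timesteps, n_features), name="window")
    id_in = layers.Input(shape=(1,), name="series_id")

    x = layers.Conv1D(64, 3, padding="same", activation="relu")(ts_in)
    x = layers.LSTM(32)(x)

    e = layers.Embedding(input_dim=n_series, output_dim=4)(id_in)  # learned series vector
    e = layers.Flatten()(e)

    merged = layers.Concatenate()([x, e])
    out = layers.Dense(n_classes, activation="softmax")(merged)

    model = models.Model([ts_in, id_in], out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```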

[Figure: test-set results of the stacked multi-series experiment]

The best performing model in this scenario turned out to be CNN_LSTM, which shows that we can mix multiple time series with similar underlying processes to overcome the issue of limited data. Now let's test 2 scenarios that we often deal with:

  1. The dynamics of the existing 4 functions changed (to replicate a change in price behavior, a.k.a. a regime shift, which is very common in the equity market).
  2. A new similar function was introduced (to mimic the addition of a new security to our analysis).

I will retrain our existing CNN_LSTM model (with the weights from the previous experiment) and check whether the retrained model is able to adapt to these 2 scenarios with just 10% of the initial sample size.
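A minimal sketch of this retraining step, assuming a saved checkpoint from the previous experiment (the file name, data variables and learning rate below are placeholders):

```python
from tensorflow.keras.optimizers import Adam

# Reuse the weights learned in the previous experiment and fine-tune on the
# small new sample (hypothetical checkpoint path and illustrative settings).
model.load_weights("cnn_lstm_previous_experiment.h5")
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_new_train, y_new_train,            # ~1,000 samples (10% of the original)
          validation_data=(X_new_val, y_new_val),
          epochs=200, verbose=0)
```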

1) We changed the 4 functions a little and retrained the model on a sample size of 1,000 points. Interestingly, our existing model picked up the change in dynamics with just 200 epochs of training.

[Figure: test-set results after retraining on the modified functions]

2) We introduced a new function similar to the existing 4 functions and compared the performance of a newly trained model vs. the retrained model on a sample size of 1,000 for 200 epochs. We can see that the retrained CNN_LSTM model was able to learn the new function as well.

[Figure: newly trained vs. retrained model results on the new function]

Conclusion

In the beginning I thought it would be silly to work on synthetic data, but this turned out to be a really interesting experience. Neural networks have often been dubbed a black box because, unlike a tree-based model, it is very hard to interpret their predictions. Even though there has been a lot of research in the space of model explainability, it still helps to experiment with these models on synthetic datasets specific to the domain. Some of the commonly observed characteristics of price, like mean reversion, clustering, lag, etc., can be formulated using well-known stochastic models or econometric processes, so that we have a baseline model architecture before developing it further on real data.
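For instance, a mean-reverting baseline series can be simulated with an Ornstein-Uhlenbeck process; the parameters below are arbitrary:

```python
import numpy as np

def ornstein_uhlenbeck(n=1_000, theta=0.1, mu=0.0, sigma=0.2, dt=1.0, seed=0):
    """Euler discretisation of dX = theta*(mu - X)*dt + sigma*dW (mean reversion)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = x[t - 1] + theta * (mu - x[t - 1]) * dt \
               + sigma * np.sqrt(dt) * rng.standard_normal()
    return x
```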

One of the reasons I started working on synthetic data was to design custom loss functions for my neural network architectures. When I work with a real dataset, especially something that involves stock prices or their derivatives, there is so much randomness that it is often very difficult to tell whether the underlying dynamics have changed during the training period. So keeping a lot of variables under control and running experiments has helped me immensely to understand these models and their limitations.
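As an example of the kind of custom loss I mean, here is a sketch of a class-weighted categorical cross-entropy for Keras; the weights are illustrative and the function assumes one-hot encoded labels:

```python
import tensorflow as tf

def weighted_categorical_crossentropy(class_weights):
    """Cross-entropy that up-weights rare classes (weights are illustrative)."""
    w = tf.constant(class_weights, dtype=tf.float32)

    def loss(y_true, y_pred):                       # y_true is one-hot encoded
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)
        ce = -y_true * tf.math.log(y_pred)
        return tf.reduce_sum(ce * w, axis=-1)

    return loss

# model.compile(optimizer="adam",
#               loss=weighted_categorical_crossentropy([1.0, 5.0, 5.0]))
```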

Next Step

With the advancement of NLP and other sequence-based machine learning research, it is really promising to try this new research in the financial domain, and the 2 directions I found most interesting are:

Attention Network:

The attention mechanism used in the state-of-the-art NLP architectures OpenAI GPT-2 and Google BERT has given these 2 models remarkable NLP capabilities and performance compared to their predecessors without attention. One of the main reasons is that it closely resembles how humans read text and understand context. I believe some of the challenges of maintaining two networks in the Dueling Network Architecture in Deep Reinforcement Learning can be overcome by using the attention mechanism, because the underlying goals of both are similar. For these reasons I would also like to test the performance of attention-based CNN and CNN_LSTM networks on more complicated synthetic time series.
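A rough sketch of what such a network could look like in Keras, using the built-in MultiHeadAttention layer (available in recent TensorFlow 2 releases) on top of the LSTM outputs; this is not the full transformer stack used in GPT-2 or BERT:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_attention_cnn_lstm(n_timesteps, n_features, n_classes):
    """CNN + LSTM with self-attention over the LSTM output sequence."""
    inputs = layers.Input(shape=(n_timesteps, n_features))
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.LSTM(32, return_sequences=True)(x)
    x = layers.MultiHeadAttention(num_heads=2, key_dim=16)(x, x)  # self-attention
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```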

Transfer learning:

One of the biggest challenges with deep learning is the amount of data required for training; in other domains we would overcome this problem using techniques such as augmentation. But with a time series like market returns, it is not that straightforward to generate samples that have the same underlying structure as the original series. There is a lot of interesting research coming out on using auto-encoders and GANs to overcome this problem, but something I found quite interesting to try is transfer learning. In this article I tried retraining the CNN_LSTM architecture on a new function, but when there is a big difference in the underlying dynamics of the new time series, convergence isn't assured; applying transfer learning in those situations is worth a try. This paper gives an insight into the use of transfer learning on time series using Dynamic Time Warping (DTW) to identify similar datasets.
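For reference, a plain dynamic-programming DTW distance (not the method from the cited paper, just the basic idea) can be written in a few lines of NumPy and used as a rough similarity screen between candidate source and target series:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """O(n*m) dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])
```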

Tools used

There were a lot of experiments, and most of the deep learning was done in the cloud, so I had to use a combination of local and cloud resources to run them. Fortunately, MLflow has been of great help in tracking and packaging these experiments. I used the MBATS platform (github), which is a dockerized architecture, to run and track the experiments. With MLflow you have the amazing option to package experiments locally and run them in AWS SageMaker, Databricks, or Azure ML; I used Azure (free $200 signup credits) for most of the deep learning models. I also used Google Colab Pro for initial prototyping ($10/month for a P100 GPU, nothing can beat that price).
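A minimal example of the kind of MLflow tracking I mean; the run name, parameters, metric values and artifact file are placeholders:

```python
import mlflow

# Log one training run; names and values below are my own placeholders.
with mlflow.start_run(run_name="cnn_lstm_lagged_function"):
    mlflow.log_param("architecture", "CNN_LSTM")
    mlflow.log_param("epochs", 200)
    mlflow.log_metric("test_accuracy", 0.87)     # placeholder value
    mlflow.log_artifact("model_summary.txt")     # hypothetical local file
```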

