How to create an RNN (Recurrent Neural Network) capable of predicting the behavior of the stock markets and cryptocurrencies
Juan David Tuta Botero
Data Science | Machine Learning | Artificial Intelligence
In this article, we will see how to build an artificial intelligence using an RNN (Recurrent Neural Network) to predict the future price of cryptocurrencies. Due to the highly volatile and speculative behavior of these assets, the precision of the network is not optimal. Still, it gives us an academic approach for creating more complex networks with different uses, whether that is buying and selling shares, forecasting weather events, or even modeling biological processes.
Time Series Forecasting
First of all, we are going to try to understand the mathematical foundation of these methods. For that, we need to define what a time series is: it is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive, equally spaced points in time; thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
Time series analysis comprises methods for analyzing time-series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. While regression analysis is often employed to test relationships between one or more different time series, that type of analysis is not usually called "time series analysis", which refers in particular to relationships between different points in time within a single series. Interrupted time series analysis is used to detect changes in the evolution of a time series from before to after some intervention that may affect the underlying variable.
Preprocessing Data
For this guide, we are going to use a cryptocurrency database covering the period from 2012 to 2018; you can download it from this link. The first thing is to look at our database and understand what type of data is relevant for our analysis and what is not, and then re-interpret it so that the network can understand it much better. In this database we observe several columns, including the Open, High, Low, and Close prices of each period, and we need to understand what each of them means.
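Assuming the CSV has been downloaded locally, a minimal sketch for loading and inspecting it could look like the following (the filename is a placeholder, and the presence of an "Unnamed: 0" index column is an assumption based on the cleaning function used later):

import pandas as pd

# Hypothetical filename: use whichever CSV you downloaded from the link above.
df = pd.read_csv('crypto_prices_2012_2018.csv')

# Inspect the available columns and the first few rows.
print(df.columns)
print(df.head())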
Clean the database
The first step is to see which data is useful and which is not. At a first look at the database, we can see several fields filled with NaN; these values mean the data was not reported, and they could affect our neural network in a significant way. We can decide to do one of three things.
1. We can replace all the unreported values with the mean of their respective columns. The big problem with this solution is that, when plotting values such as Open, High, and Low, we see that the initial values have a mean well below the final values, which would introduce a significant deviation in the first values as well as in the last ones; therefore, this solution is discarded.
2. The second proposal is to set these values to 0, or to leave them as they are. Converting them to 0 would seriously distort the interpretation of the network in these ranges, since a price of zero is a real (and wrong) value, while leaving them as np.nan would propagate undefined values through every later computation. Neither option is acceptable here.
3. The third option is to eliminate the rows with missing data, or to interpolate to estimate them. Interpolation is not a great idea either, because we do not know what happened in those ranges and, given the highly volatile behavior of these assets, we would likely introduce errors that could affect our network. So the best decision is to eliminate those rows.
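Before dropping anything, it can help to quantify how much data is actually missing; a quick check (assuming the raw data has already been loaded into df) might be:

# Number of missing (NaN) values per column.
print(df.isna().sum())

# Fraction of rows that contain at least one NaN.
print(df.isna().any(axis=1).mean())

With that confirmed, the function used to drop those rows is the following.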
def removing_empty_data(data_frame):
  """
  Function that removes the rows with empty values.
  Args:
    data_frame: a pandas.DataFrame with the crypto price information
  Return:
    new_data_frame: the same DataFrame without the rows that contain NaN
  """
  columns = data_frame.columns
  # Rows where the price column is NaN fail the > 0 comparison and are dropped.
  not_nan = data_frame[columns[1]] > 0
  new_data_frame = data_frame.loc[not_nan].copy()
  new_data_frame.reset_index(inplace=True)
  new_data_frame.drop(columns=["index", "Unnamed: 0"],
                      inplace=True)
  return new_data_frame
If we apply the recently created function to our dataset, we will see the new, filtered data.
df = removing_empty_data(df)
df
Great, now that all our data is cleansed, let's look at some summary information about it using pandas' DataFrame.describe method.
df.describe().transpose()
Well, looking at this information, nothing strange comes to mind, such as unreasonable minimum values in any of the columns, but it is good practice to check this kind of information before doing anything else.
Better interpretation
Now we must remember that this information is going to pass through a machine, and we must help it as much as we can to get a global understanding of the variables we are using. In the field of forecasting there is an interesting concept called seasonality, which shows up in many cases, for example in stock prices: a company that sells winter clothes will probably post better numbers in winter than in summer. We will try to check whether this concept applies to our cryptocurrency problem using the Fourier transform.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

fft = tf.signal.rfft(df['Open'])
f_per_dataset = np.arange(0, len(fft))

n_samples_h = len(df['Open'])
hours_per_year = 24 * 365.2524
years_per_dataset = n_samples_h / hours_per_year

f_per_year = f_per_dataset / years_per_dataset
plt.step(f_per_year, np.abs(fft))
plt.xscale('log')
Due to the erratic behavior of the market, we see no specific points with a strong trend; rather, we see a greater escalation as we move along the frequency axis. This is to be expected, because the price of cryptocurrencies depends heavily on speculation. If we looked instead at, for example, a company that sells winter clothing, we would get a graph with a peak at a frequency of one year and another at one day.
If we did find a graph of that style, it would be much easier for the network to work with sine and cosine features than with a raw time value, so we could make the following change.
# timestamp_s is assumed to hold the timestamp of each row, expressed in seconds.
day = 24 * 60 * 60
year = 365.2425 * day

df['Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
df['Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
df['Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
df['Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))

plt.plot(np.array(df['Day sin'])[:25])
plt.plot(np.array(df['Day cos'])[:25])
plt.xlabel('Time [h]')
plt.title('Time of day signal')
In the end, we would have some time graphs of this style.
Since that is not the case here, we continue. Next, it is good to create and separate the training, validation, and test sets. We will use a (70%, 20%, 10%) split. Note that the data is not randomly shuffled before splitting. This is for two reasons:
1. It ensures that it is still possible to split the data into consecutive sample windows.
2. It ensures that the validation/test results are more realistic, being evaluated on the data collected after the model is trained.
column_indices = {name: i for i, name in enumerate(df.columns)}

n = len(df)
train_df = df[0:int(n*0.7)]
val_df = df[int(n*0.7):int(n*0.9)]
test_df = df[int(n*0.9):]

num_features = df.shape[1]
It is important to scale features before training a neural network. Normalization is a common way to do this scaling: subtract the mean and divide by the standard deviation of each feature. The mean and standard deviation should only be calculated using the training data, so that the models do not have access to the values in the validation and test sets. It is also arguable that the model should not have access to future values in the training set during training, and that this normalization should be done using moving averages. That is not the focus of this tutorial, and the validation and test sets ensure that you get (somewhat) honest metrics. So, for the sake of simplicity, this tutorial uses a simple average.
import seaborn as sns

train_mean = train_df.mean()
train_std = train_df.std()

train_df = (train_df - train_mean) / train_std
val_df = (val_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std

df_std = (df - train_mean) / train_std
df_std = df_std.melt(var_name='Column', value_name='Normalized')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df_std)
_ = ax.set_xticklabels(df.keys(), rotation=90)
Now take a look at the distribution of the features. Some features have long tails, but there are no obvious errors such as unrealistic values.
Data windowing
The models in this tutorial will make a set of predictions based on a window of consecutive samples from the data. The main features of the input windows are the width (number of time steps) of the input and label windows, the time offset between them, and which features are used as inputs, labels, or both.
This tutorial builds a variety of models (including Linear, DNN, CNN, and RNN models) and uses them both for single-output and multi-output predictions, and for single-time-step and multi-time-step predictions.
This section focuses on implementing the data windowing so that it can be reused for all of those models.
Depending on the task and type of model, you may want to generate a variety of data windows: for example, a single prediction 24 hours into the future given the last 24 hours of history, or a prediction one hour into the future given the last six hours of history.
The rest of this section defines a WindowGenerator class. This class can handle the indexes and offsets, split windows of features into (features, labels) pairs, plot the content of the resulting windows, and efficiently generate batches of these windows from the training, evaluation, and test data using tf.data.Datasets.
1. Indexes and offsets
Start by creating the WindowGenerator class. The __init__ method includes all the necessary logic for the input and label indices. It also takes the training, evaluation, and test DataFrames as input. These will be converted to tf.data.Datasets of windows later.
class WindowGenerator():
  def __init__(self, input_width, label_width, shift,
               train_df=train_df, val_df=val_df, test_df=test_df,
               label_columns=None):
    # Store the raw data.
    self.train_df = train_df
    self.val_df = val_df
    self.test_df = test_df

    # Work out the label column indices.
    self.label_columns = label_columns
    if label_columns is not None:
      self.label_columns_indices = {name: i for i, name in
                                    enumerate(label_columns)}
    self.column_indices = {name: i for i, name in
                           enumerate(train_df.columns)}

    # Work out the window parameters.
    self.input_width = input_width
    self.label_width = label_width
    self.shift = shift

    self.total_window_size = input_width + shift

    self.input_slice = slice(0, input_width)
    self.input_indices = np.arange(self.total_window_size)[self.input_slice]

    self.label_start = self.total_window_size - self.label_width
    self.labels_slice = slice(self.label_start, None)
    self.label_indices = np.arange(self.total_window_size)[self.labels_slice]

  def __repr__(self):
    return '\n'.join([
        f'Total window size: {self.total_window_size}',
        f'Input indices: {self.input_indices}',
        f'Label indices: {self.label_indices}',
        f'Label column name(s): {self.label_columns}'])
Here is the code to create the 2 windows shown in the diagrams at the start of this section:
w1 = WindowGenerator(input_width=24, label_width=1, shift=24,
                     label_columns=['Open'])
w1
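This article also uses a second, narrower window called w2 that is never defined explicitly; based on the shapes printed later (a total window size of 7, split into 6 input steps and 1 label step), a definition consistent with the rest of the code would be:

w2 = WindowGenerator(input_width=6, label_width=1, shift=1,
                     label_columns=['Open'])
w2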
2. Split
Given a list of consecutive inputs, the split_window method will convert them to a window of inputs and a window of labels. The example w2 you defined earlier will be split like this:
This diagram doesn't show the features axis of the data, but this split_window function also handles the label_columns, so it can be used for both the single-output and multi-output examples.
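The split_window method itself is not shown here; a sketch of it, following the TensorFlow time-series tutorial that this section is based on, looks like this:

def split_window(self, features):
  # Slice every window into an input part and a label part.
  inputs = features[:, self.input_slice, :]
  labels = features[:, self.labels_slice, :]
  if self.label_columns is not None:
    # Keep only the requested label columns.
    labels = tf.stack(
        [labels[:, :, self.column_indices[name]] for name in self.label_columns],
        axis=-1)

  # Slicing doesn't preserve static shape information, so set the shapes
  # manually. This way the tf.data.Datasets are easier to inspect.
  inputs.set_shape([None, self.input_width, None])
  labels.set_shape([None, self.label_width, None])

  return inputs, labels

WindowGenerator.split_window = split_window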
# Stack three slices, each the length of the total window.
example_window = tf.stack([np.array(train_df[:w2.total_window_size]),
                           np.array(train_df[100:100+w2.total_window_size]),
                           np.array(train_df[200:200+w2.total_window_size])])

example_inputs, example_labels = w2.split_window(example_window)

print('All shapes are: (batch, time, features)')
print(f'Window shape: {example_window.shape}')
print(f'Inputs shape: {example_inputs.shape}')
print(f'Labels shape: {example_labels.shape}')
Typically, data in TensorFlow is packed into arrays where the outermost index is across examples (the "batch" dimension). The middle indices are the "time" or "space" (width, height) dimension(s). The innermost indices are the features.
The code above took a batch of three 7-time-step windows, with all of our features at each time step. It splits them into a batch of 6-time-step multi-feature inputs and a 1-time-step, 1-feature label. The label only has one feature because the WindowGenerator was initialized with label_columns=['Open']. Initially, this tutorial will build models that predict single output labels.
3. Plot
Here is a plot method that allows simple visualization of the split window:
w2.example = example_inputs, example_labels
def plot(self, model=None, plot_col='Open', max_subplots=3):
  # Default to plotting the 'Open' column, which is the label used in this tutorial.
  inputs, labels = self.example
  plt.figure(figsize=(12, 8))
  plot_col_index = self.column_indices[plot_col]
  max_n = min(max_subplots, len(inputs))
  for n in range(max_n):
    plt.subplot(max_n, 1, n+1)
    plt.ylabel(f'{plot_col} [normed]')
    plt.plot(self.input_indices, inputs[n, :, plot_col_index],
             label='Inputs', marker='.', zorder=-10)

    if self.label_columns:
      label_col_index = self.label_columns_indices.get(plot_col, None)
    else:
      label_col_index = plot_col_index

    if label_col_index is None:
      continue

    plt.scatter(self.label_indices, labels[n, :, label_col_index],
                edgecolors='k', label='Labels', c='#2ca02c', s=64)
    if model is not None:
      predictions = model(inputs)
      plt.scatter(self.label_indices, predictions[n, :, label_col_index],
                  marker='X', edgecolors='k', label='Predictions',
                  c='#ff7f0e', s=64)

    if n == 0:
      plt.legend()

  plt.xlabel('Time [h]')
WindowGenerator.plot = plot
w2.plot()
This plot aligns inputs, labels, and (later) predictions based on the time that each item refers to.
4. Create tf.data.Datasets
Finally, this make_dataset method will take a time series DataFrame and convert it to a tf.data.Dataset of (input_window, label_window) pairs using the tf.keras.utils.timeseries_dataset_from_array function:
def make_dataset(self, data):
  data = np.array(data, dtype=np.float32)
  ds = tf.keras.utils.timeseries_dataset_from_array(
      data=data,
      targets=None,
      sequence_length=self.total_window_size,
      sequence_stride=1,
      shuffle=True,
      batch_size=32,)

  ds = ds.map(self.split_window)

  return ds
WindowGenerator.make_dataset = make_dataset
The?WindowGenerator?object holds training, validation, and test data.
Add properties for accessing them as?tf.data.Datasets using the?make_dataset?method you defined earlier. Also, add a standard example batch for easy access and plotting:
@property
def train(self):
  return self.make_dataset(self.train_df)

@property
def val(self):
  return self.make_dataset(self.val_df)

@property
def test(self):
  return self.make_dataset(self.test_df)

@property
def example(self):
  """Get and cache an example batch of `inputs, labels` for plotting."""
  result = getattr(self, '_example', None)
  if result is None:
    # No example batch was found, so get one from the `.train` dataset.
    result = next(iter(self.train))
    # And cache it for next time.
    self._example = result
  return result

WindowGenerator.train = train
WindowGenerator.val = val
WindowGenerator.test = test
WindowGenerator.example = example
Now, the?WindowGenerator?object gives you access to the?tf.data.Dataset?objects, so you can easily iterate over the data.
The Dataset.element_spec property tells you the structure, data types, and shapes of the dataset elements.
# Each element is an (inputs, label) pair.
w2.train.element_spec
Iterating over a Dataset yields concrete batches:
for example_inputs, example_labels in w2.train.take(1):
? print(f'Inputs shape (batch, time, features): {example_inputs.shape}')
? print(f'Labels shape (batch, time, features): {example_labels.shape}')
Single-step models
The simplest model you can build on this sort of data is one that predicts a single feature's value—1 time step (one hour) into the future based only on the current conditions.
So, start by building models to predict the?Open?value one hour into the future.
Recurrent neural network
A Recurrent Neural Network (RNN) is a type of neural network well-suited to time series data. RNNs process a time series step-by-step, maintaining an internal state from time-step to time step.
You can learn more in the Text generation with an RNN tutorial and the Recurrent Neural Networks (RNN) with Keras guide.
In this tutorial, you will use an RNN layer called Long Short-Term Memory (tf.keras.layers.LSTM).
An important constructor argument for all Keras RNN layers, such as tf.keras.layers.LSTM, is the return_sequences argument. This setting can configure the layer in one of two ways: if it is False (the default), the layer only returns the output of the final time step, giving the model time to warm up its internal state before making a single prediction; if it is True, the layer returns an output for each input, which is useful for stacking RNN layers or for training a model on multiple time steps simultaneously.
lstm_model = tf.keras.models.Sequential([
    # Shape [batch, time, features] => [batch, time, lstm_units]
    tf.keras.layers.LSTM(32, return_sequences=True),
    # Shape => [batch, time, features]
    tf.keras.layers.Dense(units=1)])
With return_sequences=True, the model can be trained on 24 hours of data at a time.
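The code below relies on a wide_window, a compile_and_fit helper, and the val_performance/performance dictionaries that are not defined in the article. A minimal sketch of them, following the pattern of the TensorFlow time-series tutorial (the 24-step window width, the patience value, and MAX_EPOCHS are assumptions), could be:

MAX_EPOCHS = 20

def compile_and_fit(model, window, patience=2):
  # Stop training early when the validation loss stops improving.
  early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                    patience=patience,
                                                    mode='min')

  model.compile(loss=tf.keras.losses.MeanSquaredError(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=[tf.keras.metrics.MeanAbsoluteError()])

  history = model.fit(window.train, epochs=MAX_EPOCHS,
                      validation_data=window.val,
                      callbacks=[early_stopping])
  return history

# A wide window: 24 input steps and 24 label steps, shifted by one step.
wide_window = WindowGenerator(input_width=24, label_width=24, shift=1,
                              label_columns=['Open'])

# Dictionaries to collect the evaluation metrics of each model.
val_performance = {}
performance = {}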
import IPython.display

print('Input shape:', wide_window.example[0].shape)
print('Output shape:', lstm_model(wide_window.example[0]).shape)

history = compile_and_fit(lstm_model, wide_window)

IPython.display.clear_output()
val_performance['LSTM'] = lstm_model.evaluate(wide_window.val)
performance['LSTM'] = lstm_model.evaluate(wide_window.test, verbose=0)

wide_window.plot(lstm_model)
Multi-step models
Both the single-output and multiple-output models in the previous sections made?single time step predictions, one hour into the future.
This section looks at how to expand these models to make?multiple time step predictions.
In a multi-step prediction, the model needs to learn to predict a range of future values. Thus, unlike a single-step model, where only a single future point is predicted, a multi-step model predicts a sequence of the future values.
There are two rough approaches to this: single-shot predictions, where the entire time series is predicted at once, and autoregressive predictions, where the model only makes single-step predictions and its output is fed back as its input.
In this section, all the models will predict?all the features across all output time steps.
For the multi-step model, the training data again consists of hourly samples. However, here, the models will learn to predict 24 hours into the future, given 24 hours of the past.
Here is a WindowGenerator object that generates these slices from the dataset:
OUT_STEPS = 24
multi_window = WindowGenerator(input_width=24,
                               label_width=OUT_STEPS,
                               shift=OUT_STEPS)
multi_window.plot()
multi_window
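The article stops at the window definition; as an illustration of the single-shot approach, a model along the following lines (the layer sizes and the multi_lstm_model name are assumptions, following the same pattern as the single-step LSTM above) could be trained on multi_window:

multi_lstm_model = tf.keras.Sequential([
    # Shape [batch, time, features] => [batch, lstm_units].
    # return_sequences=False keeps only the last output of the sequence.
    tf.keras.layers.LSTM(32, return_sequences=False),
    # Shape => [batch, out_steps*features].
    tf.keras.layers.Dense(OUT_STEPS*num_features,
                          kernel_initializer=tf.initializers.zeros()),
    # Shape => [batch, out_steps, features].
    tf.keras.layers.Reshape([OUT_STEPS, num_features])
])

history = compile_and_fit(multi_lstm_model, multi_window)
multi_window.plot(multi_lstm_model)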
Conclusions
With the results we obtained, we can get a clear sense of the incredible randomness in the markets for currencies and assets, even more so in the case of cryptocurrencies. But using this work as a base, we can reproduce it for more stable markets, where we could obtain much better results.
In the analysis of the data, we could also use PCA (Principal Component Analysis), which would almost certainly keep the first three columns of our database, but it is also interesting to see whether the closing price of an asset could influence its behavior on the next day. These are interesting considerations to analyze in the future. If you are interested in reviewing the code, you can access this notebook in Google Colab, where all the executed code is, or my GitHub, where you will see the main functions.
Bibliography
https://www.tensorflow.org/tutorials/structured_data/time_series