How to create an RNN (Recurrent Neural Network) capable of predicting the behavior of the stock markets and cryptocurrencies

In this article, we will see how to build an artificial intelligence using an RNN (Recurrent Neural Network) to predict the future price of cryptocurrencies. Due to the highly volatile behavior of these assets, the precision of the network is not optimal. Still, it gives us an academic approach to creating more complex networks with different uses, whether that is buying and selling shares, forecasting weather events, or even modeling biological processes.

Time Series Forecasting

First of all, we are going to try to understand the mathematical foundation of these methods. For that, we need to define what a time series is: a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time; thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.

Time series analysis comprises methods for analyzing time-series data to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. While regression analysis is often employed in such a way as to test relationships between one or more different time series, this type of analysis is not usually called "time series analysis", which refers in particular to relationships between different points in time within a single series. Interrupted time series analysis is used to detect changes in the evolution of a time series from before to after some intervention which may affect the underlying variable.

Time series: random data plus trend
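As a quick illustration, a series like the one in that caption (random data plus a trend) can be generated and indexed in time order with pandas; this is a minimal sketch with arbitrary names and values:

import numpy as np
import pandas as pd

# A hypothetical time series: a linear trend plus Gaussian noise,
# indexed at equally spaced (hourly) points in time.
index = pd.date_range("2012-01-01", periods=1000, freq="h")
values = 0.05 * np.arange(1000) + np.random.normal(0, 1, 1000)
series = pd.Series(values, index=index, name="price")
print(series.head())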

Preprocessing Data

For this guide, we are going to use a cryptocurrency database covering the time range from 2012 to 2018; you can download it from this link. The first thing is to look at our database and understand what type of data is relevant for our analysis and what is not, and then re-interpret it so that the network can understand it much better. In this database, we observe several columns with the following meanings:

  • The start time of the time window in Unix time
  • The open price in USD at the start of the time window
  • The high price in USD within the time window
  • The low price in USD within the time window
  • The close price in USD at the end of the time window
  • The amount of BTC transacted in the time window
  • The amount of Currency (USD) transacted in the time window
  • The volume-weighted average price in USD for the time window
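
To follow along, the dataset can be loaded with pandas; this is a sketch, and the filename is a placeholder for whatever the downloaded file is called:

import pandas as pd

# Hypothetical filename; replace with the path of the downloaded CSV.
df = pd.read_csv('bitstamp_2012-2018.csv')
print(df.head())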

Clean the database

The first step is to see what data is useful and what is not. At a first look at the database, we can see several fields that are filled with NaN. These values mean that they were not reported, and they could affect our neural network in a significant way, so we can consider three options.

1. We can replace all the unreported values with the mean of their respective columns. The big problem with this solution is that, when plotting values such as Open, High, and Low, we see that the initial values have a mean well below the final values, which would introduce a significant deviation in the first values as well as the final ones; therefore, this solution is discarded.

2. The second proposal is to set these values to 0 or leave them as they are. If we convert them to 0, this could seriously alter the network's interpretation in these ranges, since 0 would be read as an actual price of zero rather than a missing value. Leaving them as np.nan is not viable either, because NaN values propagate through the computations and cannot be used for training.

3. The third option is to eliminate the missing data, or to interpolate to find it. Interpolation is not very good either, because we do not know what happens in these ranges and, knowing the highly volatile behavior of these assets, it is likely that we would introduce errors that could affect our network. So the best decision is to eliminate the missing rows, using the function shown after the short sketch below.
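
For reference, here is a minimal sketch of what the three options would look like in pandas, using the df loaded above:

# 1. Impute the missing values with the column means.
df_mean = df.fillna(df.mean())
# 2. Replace the missing values with 0.
df_zero = df.fillna(0)
# 3. Drop the incomplete rows (or interpolate with df.interpolate()).
df_drop = df.dropna()

The function we actually use is the following: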

def removing_empty_data(data_frame):
  """
  Function that removes the rows with empty values.
  Args:
    data_frame: a pandas.DataFrame with the crypto price information
  Return:
    new_data_frame: the DataFrame without the empty rows
  """
  columns = data_frame.columns
  # Comparisons against NaN are False, so this mask drops the empty rows.
  not_nan = data_frame[columns[1]] > 0
  new_data_frame = data_frame.loc[not_nan].copy()

  new_data_frame.reset_index(inplace=True)
  new_data_frame.drop(columns=["index", "Unnamed: 0"],
                      inplace=True)

  return new_data_frame

If we apply the newly created function to our dataset, we can see the new filtered data:

df = removing_empty_data(df)
df

Great, now that all our data is cleansed, let's look at some information about it using the describe method of pandas.DataFrame.

df.describe().transpose()        

Well, looking at this information, nothing unusual stands out, such as implausible minimum values in any of the columns, but it is good practice to check this kind of information before doing anything else.

Better interpretation

Now we must remember that this information is going to pass through a machine, and we must help it as much as we can to get a global understanding of the variables we are using. First, in the field of prediction there is an interesting concept called seasonality, which appears in many cases, for example in stock markets. One simple example would be a company that sells winter clothes: the company would probably post better numbers in winter than in summer. We will try to check whether this concept applies to our cryptocurrency problem using the Fourier transform.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

fft = tf.signal.rfft(df['Open'])
f_per_dataset = np.arange(0, len(fft))

n_samples_h = len(df['Open'])
hours_per_year = 24 * 365.2524
years_per_dataset = n_samples_h / hours_per_year

f_per_year = f_per_dataset / years_per_dataset
plt.step(f_per_year, np.abs(fft))
plt.xscale('log')

Due to the erratic behavior of the market, we see no specific frequencies with a strong peak; rather, we see a greater escalation as time goes by. This is to be expected, because the price of cryptocurrencies depends heavily on speculation. If we plotted, for example, the data of a company that sells winter clothing, we would get a graph with peaks at frequencies of one year and one day.


And if we did find a graph of that style, it is much easier for the network to use sine and cosine features than a raw timestamp value, so we could make the following change.

# timestamp_s is assumed to hold the Unix timestamps (in seconds)
# from the first column of the dataset.
day = 24 * 60 * 60
year = 365.2425 * day

df['Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
df['Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
df['Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
df['Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))

plt.plot(np.array(df['Day sin'])[:25])
plt.plot(np.array(df['Day cos'])[:25])
plt.xlabel('Time [h]')
plt.title('Time of day signal')

In the end, we would have periodic time signals of this style.


Since this is not the case, we continue. It is now time to create and separate the training sets. We will use a (70%, 20%, 10%) split for the training, validation, and test sets. Note that the data is not randomly shuffled before splitting. This is for two reasons:

  • It ensures that it is still possible to split the data into windows of consecutive samples.
  • It ensures that the validation/test results are more realistic, being evaluated on data collected after the model was trained.

column_indices = {name: i for i, name in enumerate(df.columns)}

n = len(df)
train_df = df[0:int(n * 0.7)]
val_df = df[int(n * 0.7):int(n * 0.9)]
test_df = df[int(n * 0.9):]

num_features = df.shape[1]

It is important to scale features before training a neural network. Normalization is a common way to do this scaling: subtract the mean and divide by the standard deviation of each feature. The mean and standard deviation should only be calculated using the training data so that the models do not have access to the values in the validation and test sets. It is also arguable that the model should not have access to future values in the training set during training, and that this normalization should be done using moving averages. That's not the focus of this tutorial, and the test and validation sets ensure you get (somewhat) honest metrics. So, for the sake of simplicity, this tutorial uses a simple average.

import seaborn as sns

train_mean = train_df.mean()
train_std = train_df.std()

train_df = (train_df - train_mean) / train_std
val_df = (val_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std

df_std = (df - train_mean) / train_std
df_std = df_std.melt(var_name='Column', value_name='Normalized')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df_std)
_ = ax.set_xticklabels(df.keys(), rotation=90)

Now take a look at the distribution of the features. Some features have long tails, but there are no obvious errors such as unrealistic values.

Data windowing

The models in this tutorial will make a set of predictions based on a window of consecutive samples from the data. The main features of the input windows are:

  • The width (number of time steps) of the input and label windows.
  • The time offset between them.
  • Which features are used as inputs, labels, or both.

This tutorial builds a variety of models (including Linear, DNN, CNN, and RNN models), and uses them for both:

  • Single-output, and multi-output predictions.
  • Single-time-step and multi-time-step predictions.

This section focuses on implementing the data windowing so that it can be reused for all of those models.

Depending on the task and type of model you may want to generate a variety of data windows. Here are some examples:

  1. To make a single prediction 24 hours into the future, given 24 hours of history, you might define a window like this:


The rest of this section defines a WindowGenerator class. This class can:

  1. Handle the indexes and offsets as described above.
  2. Split windows of features into (features, labels) pairs.
  3. Plot the content of the resulting windows.
  4. Efficiently generate batches of these windows from the training, evaluation, and test data, using tf.data.Datasets.

1. Indexes and offsets

Start by creating the WindowGenerator class. The __init__ method includes all the necessary logic for the input and label indices. It also takes the training, evaluation, and test DataFrames as input. These will be converted to tf.data.Datasets of windows later.

class WindowGenerator():
  def __init__(self, input_width, label_width, shift,
               train_df=train_df, val_df=val_df, test_df=test_df,
               label_columns=None):
    # Store the raw data.
    self.train_df = train_df
    self.val_df = val_df
    self.test_df = test_df

    # Work out the label column indices.
    self.label_columns = label_columns
    if label_columns is not None:
      self.label_columns_indices = {name: i for i, name in
                                    enumerate(label_columns)}
    self.column_indices = {name: i for i, name in
                           enumerate(train_df.columns)}

    # Work out the window parameters.
    self.input_width = input_width
    self.label_width = label_width
    self.shift = shift

    self.total_window_size = input_width + shift

    self.input_slice = slice(0, input_width)
    self.input_indices = np.arange(self.total_window_size)[self.input_slice]

    self.label_start = self.total_window_size - self.label_width
    self.labels_slice = slice(self.label_start, None)
    self.label_indices = np.arange(self.total_window_size)[self.labels_slice]

  def __repr__(self):
    return '\n'.join([
        f'Total window size: {self.total_window_size}',
        f'Input indices: {self.input_indices}',
        f'Label indices: {self.label_indices}',
        f'Label column name(s): {self.label_columns}'])

Here is the code to create the two windows used in this section:

w1 = WindowGenerator(input_width=24, label_width=1, shift=24,
                     label_columns=['Open'])
w1

# w2 is used in the split example below: 6 input steps,
# a 1-step label, and a shift of 1.
w2 = WindowGenerator(input_width=6, label_width=1, shift=1,
                     label_columns=['Open'])
w2

2. Split

Given a list of consecutive inputs, the split_window method will convert them to a window of inputs and a window of labels. The example w2 you defined earlier will be split like this:


Note that the features axis of the data is not shown here, but the split_window function also handles the label_columns, so it can be used for both the single-output and multi-output examples.
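
The split_window body is not shown in the article; the version below is taken from the TensorFlow time series tutorial that this section follows (see the bibliography):

def split_window(self, features):
  inputs = features[:, self.input_slice, :]
  labels = features[:, self.labels_slice, :]
  if self.label_columns is not None:
    labels = tf.stack(
        [labels[:, :, self.column_indices[name]]
         for name in self.label_columns],
        axis=-1)

  # Slicing doesn't preserve static shape information, so set
  # the shapes manually to make the datasets easier to inspect.
  inputs.set_shape([None, self.input_width, None])
  labels.set_shape([None, self.label_width, None])

  return inputs, labels

WindowGenerator.split_window = split_window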

# Stack three slices, the length of the total window.
example_window = tf.stack([np.array(train_df[:w2.total_window_size]),
                           np.array(train_df[100:100+w2.total_window_size]),
                           np.array(train_df[200:200+w2.total_window_size])])

example_inputs, example_labels = w2.split_window(example_window)

print('All shapes are: (batch, time, features)')
print(f'Window shape: {example_window.shape}')
print(f'Inputs shape: {example_inputs.shape}')
print(f'Labels shape: {example_labels.shape}')

Typically, data in TensorFlow is packed into arrays where the outermost index is across examples (the "batch" dimension). The middle indices are the "time" or "space" (width, height) dimension(s). The innermost indices are the features.

The code above took a batch of three 7-time-step windows, with the full set of features at each time step. It splits them into a batch of 6-time-step inputs and a 1-time-step, 1-feature label. The label only has one feature because the WindowGenerator was initialized with label_columns=['Open']. Initially, this tutorial will build models that predict single output labels.

3. Plot

Here is a plot method that allows simple visualization of the split window:

w2.example = example_inputs, example_labels

def plot(self, model=None, plot_col='Open', max_subplots=3):
  inputs, labels = self.example
  plt.figure(figsize=(12, 8))
  plot_col_index = self.column_indices[plot_col]
  max_n = min(max_subplots, len(inputs))
  for n in range(max_n):
    plt.subplot(max_n, 1, n+1)
    plt.ylabel(f'{plot_col} [normed]')
    plt.plot(self.input_indices, inputs[n, :, plot_col_index],
             label='Inputs', marker='.', zorder=-10)

    if self.label_columns:
      label_col_index = self.label_columns_indices.get(plot_col, None)
    else:
      label_col_index = plot_col_index

    if label_col_index is None:
      continue

    plt.scatter(self.label_indices, labels[n, :, label_col_index],
                edgecolors='k', label='Labels', c='#2ca02c', s=64)
    if model is not None:
      predictions = model(inputs)
      plt.scatter(self.label_indices, predictions[n, :, label_col_index],
                  marker='X', edgecolors='k', label='Predictions',
                  c='#ff7f0e', s=64)

    if n == 0:
      plt.legend()

  plt.xlabel('Time [h]')

WindowGenerator.plot = plot
w2.plot()

This plot aligns inputs, labels, and (later) predictions based on the time that each item refers to.


4. Create tf.data.Datasets

Finally, this make_dataset method will take a time series DataFrame and convert it to a tf.data.Dataset of (input_window, label_window) pairs using the tf.keras.utils.timeseries_dataset_from_array function:

def make_dataset(self, data):
  data = np.array(data, dtype=np.float32)
  ds = tf.keras.utils.timeseries_dataset_from_array(
      data=data,
      targets=None,
      sequence_length=self.total_window_size,
      sequence_stride=1,
      shuffle=True,
      batch_size=32,)

  ds = ds.map(self.split_window)

  return ds

WindowGenerator.make_dataset = make_dataset

The WindowGenerator object holds training, validation, and test data.

Add properties for accessing them as tf.data.Datasets using the make_dataset method you defined earlier. Also, add a standard example batch for easy access and plotting:

@property
def train(self):
  return self.make_dataset(self.train_df)

@property
def val(self):
  return self.make_dataset(self.val_df)

@property
def test(self):
  return self.make_dataset(self.test_df)

@property
def example(self):
  """Get and cache an example batch of `inputs, labels` for plotting."""
  result = getattr(self, '_example', None)
  if result is None:
    # No example batch was found, so get one from the `.train` dataset
    result = next(iter(self.train))
    # And cache it for next time
    self._example = result
  return result

WindowGenerator.train = train
WindowGenerator.val = val
WindowGenerator.test = test
WindowGenerator.example = example

Now, the WindowGenerator object gives you access to the tf.data.Dataset objects, so you can easily iterate over the data.

The Dataset.element_spec property tells you the structure, data types, and shapes of the dataset elements.

# Each element is an (inputs, label) pair.
w2.train.element_spec        

Iterating over a Dataset yields concrete batches:

for example_inputs, example_labels in w2.train.take(1):
  print(f'Inputs shape (batch, time, features): {example_inputs.shape}')
  print(f'Labels shape (batch, time, features): {example_labels.shape}')

Single-step models

The simplest model you can build on this sort of data is one that predicts a single feature's value—1 time step (one hour) into the future based only on the current conditions.

So, start by building models to predict the Open value one hour into the future.

Recurrent neural network

A Recurrent Neural Network (RNN) is a type of neural network well-suited to time series data. RNNs process a time series step by step, maintaining an internal state from time step to time step.

You can learn more in the Text generation with an RNN tutorial and the Recurrent Neural Networks (RNN) with Keras guide.

In this tutorial, you will use an RNN layer called Long Short-Term Memory (tf.keras.layers.LSTM).

An important constructor argument for all Keras RNN layers, such as tf.keras.layers.LSTM, is the return_sequences argument. This setting can configure the layer in one of two ways:

  1. If False, the default, the layer only returns the output of the final time step, giving the model time to warm up its internal state before making a single prediction:


  2. If True, the layer returns an output for each input. This is useful for:

  • Stacking RNN layers.
  • Training a model on multiple time steps simultaneously.
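
As a quick sanity check, here is a small sketch (with an arbitrary random input) showing the output shapes in the two modes:

# Arbitrary input: a batch of 32 sequences, 24 time steps, 8 features.
demo_inputs = tf.random.normal([32, 24, 8])

last_only = tf.keras.layers.LSTM(16)(demo_inputs)
per_step = tf.keras.layers.LSTM(16, return_sequences=True)(demo_inputs)

print(last_only.shape)  # (32, 16): only the final time step
print(per_step.shape)   # (32, 24, 16): one output per time step

With that in mind, here is the model used in this article: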

lstm_model = tf.keras.models.Sequential([
    # Shape [batch, time, features] => [batch, time, lstm_units]
    tf.keras.layers.LSTM(32, return_sequences=True),
    # Shape => [batch, time, features]
    tf.keras.layers.Dense(units=1)])

With return_sequences=True, the model can be trained on 24 hours of data at a time.
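
Note that wide_window is not defined anywhere in the article; following the TensorFlow tutorial this section is based on, it would be a window with 24 input steps, 24 label steps, and a shift of 1:

# Assumed definition of the wide window used below.
wide_window = WindowGenerator(
    input_width=24, label_width=24, shift=1,
    label_columns=['Open'])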

print('Input shape:', wide_window.example[0].shape)
print('Output shape:', lstm_model(wide_window.example[0]).shape)        
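
Likewise, the compile_and_fit helper used next is not defined in the article; a reasonable version, following the same TensorFlow tutorial, is:

MAX_EPOCHS = 20

def compile_and_fit(model, window, patience=2):
  # Stop early once the validation loss stops improving.
  early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                    patience=patience,
                                                    mode='min')

  model.compile(loss=tf.keras.losses.MeanSquaredError(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=[tf.keras.metrics.MeanAbsoluteError()])

  history = model.fit(window.train, epochs=MAX_EPOCHS,
                      validation_data=window.val,
                      callbacks=[early_stopping])
  return history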
import IPython.display

val_performance = {}
performance = {}

history = compile_and_fit(lstm_model, wide_window)

IPython.display.clear_output()
val_performance['LSTM'] = lstm_model.evaluate(wide_window.val)
performance['LSTM'] = lstm_model.evaluate(wide_window.test, verbose=0)

wide_window.plot(lstm_model)

Multi-step models

Both the single-output and multiple-output models in the previous sections made single-time-step predictions, one hour into the future.

This section looks at how to expand these models to make multiple-time-step predictions.

In a multi-step prediction, the model needs to learn to predict a range of future values. Thus, unlike a single-step model, where only a single future point is predicted, a multi-step model predicts a sequence of future values.

There are two rough approaches to this:

  1. Single-shot predictions where the entire time series is predicted at once.
  2. Autoregressive predictions where the model only makes single-step predictions and its output is fed back as its input.

In this section, all the models will predict all the features across all output time steps.

For the multi-step model, the training data again consists of hourly samples. However, here, the models will learn to predict 24 hours into the future, given 24 hours of the past.

Here is a WindowGenerator object that generates these slices from the dataset:

OUT_STEPS = 24
multi_window = WindowGenerator(input_width=24,
                               label_width=OUT_STEPS,
                               shift=OUT_STEPS)

multi_window.plot()
multi_window
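
The article stops before training a multi-step model, but a single-shot LSTM in the style of the same TensorFlow tutorial would look like this sketch, reusing compile_and_fit from above:

multi_lstm_model = tf.keras.Sequential([
    # Shape [batch, time, features] => [batch, lstm_units].
    tf.keras.layers.LSTM(32, return_sequences=False),
    # Shape => [batch, out_steps*features].
    tf.keras.layers.Dense(OUT_STEPS * num_features,
                          kernel_initializer=tf.initializers.zeros()),
    # Shape => [batch, out_steps, features].
    tf.keras.layers.Reshape([OUT_STEPS, num_features])
])

history = compile_and_fit(multi_lstm_model, multi_window)
multi_window.plot(multi_lstm_model)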

Conclusions

With the results we obtained, we get a clear picture of the incredible randomness in the markets for currencies and other assets, even more so for cryptocurrencies. But using this same base, we can recreate the approach for more stable markets, where we could have much better results.

In the analysis of the data, we could also use PCA (Principal Component Analysis), which would almost certainly keep the first three columns of our database, but it is also interesting to see whether the closing price of an asset could influence its behavior on the next day. These are interesting considerations to analyze in the future. If you are interested in reviewing the code, you can access the notebook in Google Colab, where all the executed code is, or my GitHub, where you will see the main functions.
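
As a pointer for that future analysis, here is a minimal sketch of the PCA step with scikit-learn; the choice of three components is illustrative:

from sklearn.decomposition import PCA

# Fit PCA on the normalized training features (illustrative).
pca = PCA(n_components=3)
components = pca.fit_transform(train_df)
print(pca.explained_variance_ratio_)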

Bibliography

https://www.tensorflow.org/tutorials/structured_data/time_series
