Recurrent Neural Networks: What Are They and How Do They Work?
The feedforward neural networks discussed in the previous article (https://www.dhirubhai.net/pulse/neural-networks-how-work-mantas-lukauskas) cannot capture how data changes over time, so they cannot be used in this case. Recurrent neural networks were developed for this purpose: they capture not only the current state but also past states. An example is translating text into a foreign language, where the correct inflection of a word, or the word itself, is chosen depending on the surrounding words in the sentence. To solve this task, recurrent neural networks use cycles that allow previous states to be preserved. The presented structure of a recurrent neural network uses the following notation: A is a part of the neural network, x_t is the network's input value at time t, and h_t is the network's output value at time t. Although a recurrent neural network is understood as a cycle, it can also be represented unrolled, as a single process consisting of several parts.
Sometimes recurrent neural networks are highly efficient for specific tasks. For example, when recurrent neural networks are used to translate text from one language to another, it is usually enough to know a few of the preceding words to select the correct inflection of a word, or to pick the right meaning of a word that has several. The figure below shows how these neural networks operate: when moving from one part of the network to the next, the previous value is saved and merged with the newly received value in the new part of the network, and the hyperbolic tangent function is then applied to obtain the output value.
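To make this recurrence concrete, here is a minimal NumPy sketch of a single step of a simple recurrent cell (the weight names and shapes are illustrative, not taken from any particular library):

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a simple recurrent cell: combine the previous
    hidden state with the new input and squash the result with tanh."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Illustrative dimensions: 3 input features, 4 hidden units
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                        # initial hidden state
for x_t in rng.normal(size=(5, 3)):    # a sequence of 5 time steps
    h = rnn_step(x_t, h, W_x, W_h, b)  # the hidden state carries past information forward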
In such cases, the gap between the point where the relevant input is provided and the point where it is used is small, so recurrent neural networks can learn this dependency. The figure below illustrates such a small gap between the input values and the output value.
However, there are cases where a larger context is needed to predict the output value. Simple recurrent neural networks then become difficult to apply in practice. The figure below illustrates a large gap between the input values and the output value.
For this reason, a new type of neural network was developed that can capture information from the distant past. These networks are called long short-term memory (LSTM) neural networks and are discussed below.
Long Short-Term Memory (LSTM)
This type of neural network was first introduced in 1997 by Hochreiter and Schmidhuber. Because of their ability to capture information from the distant past, these networks have recently become particularly common in practice. They also have a chain-like structure, but their repeating unit is built quite differently from that of a simple recurrent neural network. Instead of a single layer, as in simple recurrent networks, an LSTM unit contains four layers that interact in a particular way. The structure of long short-term memory neural networks is presented in the figure below.
In the figure above, the following notation is used: a green square is a neural network layer, and a blue circle is a pointwise operation between vectors (multiplication, addition, etc.). The figure also shows the directions of the vectors: two arrows joining indicates that the vectors are concatenated, and one arrow splitting into two indicates that the vector is copied. A key element of these networks is the horizontal line running through the whole unit, which carries the cell state. An illustration of this element is provided in the figure below.
The following section explains how long short-term memory networks operate, with an overview of each step performed inside them. The first layer of an LSTM network (see Figure 2.11) decides which of the previously collected information should be discarded. This is done with a sigmoid layer called the forget gate layer. This layer takes h_(t-1) and x_t as inputs and produces an output value between 0 and 1.
These values are produced for each element of the previously accumulated state C_(t-1), where 1 means the state should be kept and 0 means it should be discarded. Applied to forecasting economic indicators, this makes it possible to discard the values of certain previous years. When forecasting investment attractiveness, for example, the result may not depend on indicator values from 3 years ago; in that case, the output of this layer for those values would be 0.
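In the same notation as the other gate formulas below, this forget gate is usually written as:

f_t = σ(W_f · [h_(t-1), x_t] + b_f)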
The second step of this algorithm is to decide which information will be retained. It consists of two parts. The first part is a sigmoid layer called the input gate layer; it decides which values will be updated:
i_t = σ(W_i · [h_(t-1), x_t] + b_i)
The second part is a hyperbolic tangent layer, which creates a vector of new candidate values, C̃_t, that can be added to the state:
C̃_t = tanh(W_C · [h_(t-1), x_t] + b_C)
In the next step, these values are combined to update the state. In the third step, the old state C_(t-1) is updated to the new state C_t (see Figure 2.13). All the necessary calculations were performed in the previous steps, so only the state update itself remains:
C_t = f_t * C_(t-1) + i_t * C̃_t
First, the old state is multiplied by f_t; this product allows unnecessary information to be forgotten. Then i_t * C̃_t is added, which contributes the new candidate values scaled by how much each element of the state should be updated. This is where the actual forgetting and updating of information takes place.
The last step is to decide what the output will be (see Figure 2.14). The output value depends on the state value, but it is a filtered version of it. First, a sigmoid layer determines which part of the state value will be output:
o_t = σ(W_o · [h_(t-1), x_t] + b_o)
The hyperbolic tangent function is then applied to the state value to bring it into the range between -1 and 1:
h_t = o_t * tanh(C_t)
This value is multiplied by the output of the sigmoid layer obtained earlier, so that the final output contains only the chosen part of the state value.
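The equations above can be collected into a single step function. Below is a minimal NumPy sketch of one LSTM step; the weight names and shapes are illustrative only and do not come from any specific library:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the gate equations in the text.
    W and b hold the weights/biases of the f, i, C~ and o layers."""
    z = np.concatenate([h_prev, x_t])       # [h_(t-1), x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde      # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(C_t)                # new hidden state
    return h_t, C_t

# Illustrative dimensions: 3 input features, 4 hidden units
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 7)) for k in "fiCo"}
b = {k: np.zeros(4) for k in "fiCo"}

h, C = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):         # a sequence of 5 time steps
    h, C = lstm_step(x_t, h, C, W, b)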
Gated Recurrent Unit (GRU)
Another modification of recurrent neural networks, which differs considerably from the LSTM modification, is the gated recurrent unit (GRU) network. It is similar to an LSTM network in that it also uses various gates to control the flow of information. One of the main differences between LSTM and GRU is that GRU has no separate memory cell.
Neural networks of this type do not have separate forget and input gates but combine them into a single update gate. They also merge the cell state and the hidden state. In the first step of these networks, the update gate z_t at time t is calculated:
z_t = σ(W_z · [h_(t-1), x_t])
In this step, the new value x_t and the state h_(t-1) from the previous period are used. They are combined, and a sigmoid function maps the result to the range of 0 to 1. The update gate allows the model to determine how much information from past time periods should be kept and used in the future. This element is beneficial because it helps avoid the vanishing gradient problem.
In the second step, a reset gate is used. This step is particularly important in this type of network because it indicates what portion of past information should be discarded. It is calculated as:
r_t = σ(W_r · [h_(t-1), x_t])
As can be seen, the formula used in this step is the same as the formula used in the first step; the only difference is the weights used. The third step of this algorithm is to calculate the current memory content. In this step, the previously calculated reset gate is used together with the new input to keep the required information from the past. This is done using the formula:
h̃_t = tanh(W · [r_t * h_(t-1), x_t])
First, x_t and h_(t-1) are multiplied by their weights. The Hadamard product between the reset gate r_t and the past state h_(t-1) is then calculated. This product determines which information from the previous period should be discarded: when the reset gate r_t is close to 0, information from past periods is dropped and only the more recent information is used. The results of these operations are summed, and the hyperbolic tangent nonlinearity is applied to the sum. The last step calculates the vector h_t, which stores the information of the current network unit and passes it on to the next unit. To achieve this, the update gate mentioned above is used: this part of the network decides how much of the new candidate content and how much of the past information should be kept. This is done using the formula:
h_t = (1 - z_t) * h_(t-1) + z_t * h̃_t
First, the Hadamard product between (1 - z_t) and h_(t-1) is computed; then the Hadamard product between z_t and h̃_t is computed. Finally, the sum of these two products gives the information at time t, which is passed to the next unit of the neural network.
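As with the LSTM above, these equations can be collected into one step function. Below is a minimal NumPy sketch of a single GRU step, again with illustrative weight names and shapes (biases are omitted, matching the formulas above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W):
    """One GRU step following the gate equations in the text."""
    z_in = np.concatenate([h_prev, x_t])           # [h_(t-1), x_t]
    z_t = sigmoid(W["z"] @ z_in)                   # update gate
    r_t = sigmoid(W["r"] @ z_in)                   # reset gate
    cand_in = np.concatenate([r_t * h_prev, x_t])  # [r_t * h_(t-1), x_t]
    h_tilde = np.tanh(W["h"] @ cand_in)            # candidate content
    return (1 - z_t) * h_prev + z_t * h_tilde      # new hidden state

# Illustrative dimensions: 3 input features, 4 hidden units
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 7)) for k in "zrh"}

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):                # a sequence of 5 time steps
    h = gru_step(x_t, h, W)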
This is the theory, but how can this be put into practice?
Let's take a simple example in TensorFlow/Keras and show how we can implement an LSTM (a simple RNN or GRU would work the same way).
First, let's import our libraries:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
OK, now we have our libraries, so it's time to use them for our task :) First, let's plot our data to check that everything looks fine.
import pandas as pd
import seaborn as sns

df = pd.read_csv('airline-passengers.csv')
df['Month'] = pd.to_datetime(df.Month)
sns.lineplot(data=df, x="Month", y="Passengers")
After that, we need to scale our data. I used MinMaxScaler from sklearn.preprocessing and split the data into train and test sets, as shown in the sketch below.
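A minimal sketch of this preprocessing step, assuming a sliding-window setup in which each sample contains look_back previous values (the create_dataset helper and the 67/33 split are my own illustrative choices, not necessarily what was used originally):

from sklearn.preprocessing import MinMaxScaler

# Scale the passenger counts to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(df[['Passengers']].values.astype('float32'))

# Split into train and test sets (illustrative 67/33 split)
train_size = int(len(dataset) * 0.67)
train, test = dataset[:train_size], dataset[train_size:]

def create_dataset(data, look_back=1):
    """Turn a series into samples of look_back values and the next value."""
    X, y = [], []
    for i in range(len(data) - look_back):
        X.append(data[i:i + look_back, 0])
        y.append(data[i + look_back, 0])
    return np.array(X), np.array(y)

look_back = 1
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)

# LSTM layers expect inputs shaped as [samples, time steps, features]
trainX = np.reshape(trainX, (trainX.shape[0], 1, look_back))
testX = np.reshape(testX, (testX.shape[0], 1, look_back))

With trainX and trainY prepared, let's create the simplest LSTM model and train it: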
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
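As mentioned above, switching to a GRU or a simple RNN only requires swapping the recurrent layer; a sketch of the GRU variant, with everything else unchanged:

# GRU variant of the same model
model = Sequential()
model.add(layers.GRU(4, input_shape=(1, look_back)))  # or layers.SimpleRNN(4, ...)
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')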
After 100 epochs of training, the RMSE on the train and test datasets is:
Train Score: 22.82 RMSE
Test Score: 48.83 RMSE
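These scores come from comparing predictions with the actual values on the original scale. A minimal sketch of how such an RMSE can be computed (the inverse scaling step is needed because the model was trained on scaled data):

import math
from sklearn.metrics import mean_squared_error

# Predict and map both predictions and targets back to the original scale
trainPredict = scaler.inverse_transform(model.predict(trainX))
testPredict = scaler.inverse_transform(model.predict(testX))
trainY_orig = scaler.inverse_transform(trainY.reshape(-1, 1))
testY_orig = scaler.inverse_transform(testY.reshape(-1, 1))

trainScore = math.sqrt(mean_squared_error(trainY_orig, trainPredict))
testScore = math.sqrt(mean_squared_error(testY_orig, testPredict))
print('Train Score: %.2f RMSE' % trainScore)
print('Test Score: %.2f RMSE' % testScore)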
Finally, we can visualize this model's predictions; one possible sketch is below.
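This sketch reuses the predictions computed above and shifts them so they line up with the original series (the offsets assume the create_dataset helper sketched earlier):

import matplotlib.pyplot as plt

# Shift train and test predictions so they align with the original time axis
trainPlot = np.empty_like(dataset)
trainPlot[:] = np.nan
trainPlot[look_back:len(trainPredict) + look_back] = trainPredict

testPlot = np.empty_like(dataset)
testPlot[:] = np.nan
testPlot[len(trainPredict) + 2 * look_back:len(dataset)] = testPredict

plt.plot(scaler.inverse_transform(dataset), label='Actual passengers')
plt.plot(trainPlot, label='Train predictions')
plt.plot(testPlot, label='Test predictions')
plt.legend()
plt.show()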