Recurrent Neural Networks (#RNN) and #LSTM – Deep Learning

What do you do if the patterns in your data change with time? In that case, your best bet is to use a recurrent neural network. This deep learning model has a simple structure with a built-in feedback loop, allowing it to act as a forecasting engine. Could a net scan traffic footage and immediately flag a collision, for example? With a recurrent net, real-time tasks like this are now possible.

The recurrent neural net (RNN) owes much of its modern success to Sepp Hochreiter and Juergen Schmidhuber, whose long short-term memory (LSTM) architecture made these nets practical to train. Applications of these nets are extremely versatile, ranging from speech recognition to driverless cars.

The other deep nets (MLP, DBN, and CNN) are known as feedforward networks: the signal flows in only one direction, from input to output, one layer at a time.

In contrast, an RNN has a feedback loop: the net’s output is fed back into the net along with the next input. Since a basic RNN has just one layer of neurons, it is structurally one of the simplest types of nets. The output of that layer is added to the next input and fed back into the same layer, which is typically the only layer in the entire network.

You can think of this process as a passage through time. Consider four time steps: at t = 1, the net takes the output of time t = 0 and sends it back into the net along with the next input. The net repeats this for t = 2, t = 3, and so on.
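
To make the loop concrete, here is a minimal NumPy sketch of a single recurrent layer unrolled over four time steps. The layer sizes, weight names, and the tanh activation are illustrative assumptions, not details from this article.

```python
import numpy as np

# Illustrative sizes: 3 input features, 5 hidden units.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(5, 3))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(5, 5))   # hidden-to-hidden (feedback) weights
b_h = np.zeros(5)

def rnn_step(x_t, h_prev):
    """Combine the new input with the previous output and feed it back in."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll the same layer over 4 time steps (t = 0, 1, 2, 3).
h = np.zeros(5)                     # state before anything has been seen
inputs = rng.normal(size=(4, 3))    # a sequence of 4 input vectors
for t, x_t in enumerate(inputs):
    h = rnn_step(x_t, h)            # the output at t becomes part of the input at t + 1
    print(t, h.round(3))
```

The single `rnn_step` function is reused at every step, which is exactly the “one layer fed back into itself” picture described above.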

Unlike feedforward nets, a recurrent net can receive a sequence of values as input, and it can also produce a sequence of values as output. The ability to operate on sequences opens these nets up to a wide variety of applications.

Here are some sample applications for different input-output scenarios:

1) Image Captioning - Single input, sequence of outputs:

When the input is singular and the output is a sequence, a potential application is image captioning.


2) Document Classification - Sequence of inputs, single output:

A sequence of inputs with a single output can be used for document classification.


3) Video Classification - Sequence of inputs, sequence of outputs:

A sequence of inputs with a sequence of outputs suits frame-by-frame video processing. If a time delay is introduced, the same setup can statistically forecast future values, such as demand in supply chain planning.
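
To make the three scenarios concrete, here is a rough sketch of the array shapes involved. The sizes and variable names are made-up placeholders, not anything prescribed by a particular library.

```python
import numpy as np

# Made-up sizes for illustration only.
T, F, C = 10, 64, 5   # time steps, features per step, number of classes

# 1) Image captioning: one input vector, a sequence of output tokens.
single_image = np.zeros(F)            # shape (F,)
caption_tokens = np.zeros((T, C))     # shape (T, C): one token distribution per step

# 2) Document classification: a sequence of word vectors, one class label.
document = np.zeros((T, F))           # shape (T, F)
class_scores = np.zeros(C)            # shape (C,)

# 3) Video classification / forecasting: a sequence in, a sequence out.
video_frames = np.zeros((T, F))       # shape (T, F)
per_frame_labels = np.zeros((T, C))   # shape (T, C); with a time delay this
                                      # becomes a forecast of future values
```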


As we’ve seen with other deep learning models, stacking RNNs on top of each other forms a net capable of more complex output than a single RNN working alone. That said, an RNN is typically an extremely difficult net to train. Since these nets use backpropagation, we once again run into the problem of the vanishing gradient, and unfortunately it is exponentially worse for an RNN. The reason is that each time step is the equivalent of an entire layer in a feedforward network, so training an RNN over 100 time steps is like training a 100-layer feedforward net. This leads to exponentially small gradients and a decay of information through time.

There are several ways to address this problem, the most popular of which is gating. Gating is a technique that helps the net decide when to forget the current input and when to remember it for future time steps. The most popular gating architectures today are the #GRU and the #LSTM. Besides gating, there are also a few other techniques, such as gradient clipping, steeper gates, and better optimizers.
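
Of those extra techniques, gradient clipping is the simplest to illustrate. Strictly speaking, clipping is aimed at the mirror-image problem of exploding gradients, but it is commonly used alongside gating when training recurrent nets. Below is a minimal NumPy sketch; the threshold of 5.0 is an arbitrary assumption.

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: an oversized gradient gets scaled back before the weight update.
grads = [np.array([300.0, -400.0]), np.array([0.1, 0.2])]
clipped = clip_by_norm(grads, max_norm=5.0)
print([g.round(4) for g in clipped])   # roughly [3, -4] and [0.001, 0.002]
```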

When it comes to training a recurrent net, GPUs are an obvious choice over an ordinary CPU. This was validated by a research team at Indico, which uses these nets on text processing tasks like sentiment analysis and helpfulness extraction. The team found that GPUs were able to train the nets 250 times faster! That’s the difference between one day of training and over eight months!

So under what circumstances would you use a recurrent net over a feedforward net?

We know that a feedforward net outputs one value, which in many cases is a class or a prediction. A recurrent net is suited to time series data, where the output can be the next value in a sequence, or the next several values. So the answer depends on whether the application calls for classification, regression, or forecasting.

LONG SHORT-TERM MEMORY ( #LSTM )

The long short-term memory (LSTM) network is the most popular solution to the vanishing gradient problem.

LSTMs were created to deal with the vanishing gradient problem, so let’s have a brief reminder of the issue. As we propagate the error back through the network, it has to pass through the unrolled temporal loop: the hidden layer connected to itself across time by the recurrent weight w_rec. Because this weight is applied many times on top of itself, the gradient declines rapidly. As a result, the weights of the layers on the far left (the earliest time steps) are updated much more slowly than the weights of the layers on the far right (the latest ones). This creates a domino effect, because the weights of the far-left layers define the inputs to the far-right layers. The whole training of the network therefore suffers, and this is what we call the vanishing gradient problem.
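
A tiny numeric example makes the decay concrete; the 0.9 weight and the step counts are arbitrary illustrative numbers.

```python
w_rec = 0.9                      # a recurrent weight slightly below 1
for steps in (10, 50, 100):
    print(steps, w_rec ** steps)
# 10 -> ~0.349, 50 -> ~0.00515, 100 -> ~0.0000266
# The error signal reaching the earliest time steps is vanishingly small,
# so their weights barely get updated.
```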


A WALK THROUGH THE ARCHITECTURE

Now we are ready to look into the LSTM architecture step by step:

  1. A new input x_t and the output from the previous time step, h_(t-1), come in.
  2. These values are combined and passed through a sigmoid activation function, which decides whether the forget valve should be open, closed, or open to some extent.
  3. The same values (vectors of values, really) go in parallel through a tanh layer operation, which decides what candidate value to pass to the memory pipeline, and through another sigmoid layer operation, which decides whether, and to what extent, that value is let into the memory pipeline.
  4. Then we have the memory flowing through the top pipeline. If the forget valve is open and the memory valve is closed, the memory does not change; if the forget valve is closed and the memory valve is open, the memory is updated completely.
  5. Finally, x_t and h_(t-1) are combined to decide what part of the memory pipeline becomes the output of this module.

That’s basically what’s happening within the LSTM network. As you can see, it has a pretty straightforward architecture.
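
For readers who prefer code, here is a minimal NumPy sketch of one LSTM step that follows the five points above. The weight names, sizes, and random initialisation are illustrative assumptions; a real implementation would also learn bias terms and train all of these weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 input features, 4 hidden units.
n_in, n_h = 3, 4
rng = np.random.default_rng(1)
# One weight matrix per valve, acting on [h_(t-1), x_t] concatenated.
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(n_h, n_h + n_in)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])   # step 1: combine h_(t-1) and x_t
    f = sigmoid(W_f @ z)                # step 2: forget valve (0 = closed, 1 = open)
    i = sigmoid(W_i @ z)                # step 3: memory valve ...
    c_tilde = np.tanh(W_c @ z)          # ... and the candidate value for the memory
    c = f * c_prev + i * c_tilde        # step 4: update the memory pipeline
    o = sigmoid(W_o @ z)                # step 5: output valve
    h = o * np.tanh(c)                  # part of the memory becomes the output
    return h, c

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(4, n_in)):  # run 4 time steps
    h, c = lstm_step(x_t, h, c)
print(h.round(3), c.round(3))
```

Notice that with the forget valve f near 1 and the memory valve i near 0, the memory c passes through unchanged, exactly as described in step 4.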

Hope you learned something from this article.

Happy Learning. Happy Sharing.
