Brief intro to RNN

RNN stands for recurrent neural network. First, let me explain a neural network.

Define NN

A neural network is a computer system modelled on the human brain and nervous system. The key component of a neural network is the neuron, which is essentially a decision-making unit. In maths, there is a function called the sigmoid which can replicate a neuron. It is also known as the “S”-shaped curve. It basically normalises the input, i.e. given any number in the range (-∞, ∞) it produces a real number in the range (0, 1).
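
A minimal sketch of the sigmoid in plain Python/NumPy (the example values are for illustration only):

import numpy as np

def sigmoid(x):
    # maps any real number to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0.00005, 0.5, ~0.99995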


In the real world, we make decisions all the time. We take a number of inputs and then decide yes or no, in other words 1 or 0.

The basic neural network contains an input layer for gathering any sort of input (audio, video, image, text, etc.), a hidden layer that captures certain patterns from the raw data, and finally an output layer that makes a decision. Each layer can have any number of neurons. Such a system is called a feedforward network because the data keeps moving from one layer to the next. It is like a mesh or grid in which every neuron in one layer is connected to every neuron in the next layer, but neurons within the same layer are not connected to each other. The connections in these networks are nothing but weights. These weights are adjusted so that the output of the neural network matches the required output. The adjustment is done iteratively, working backwards from the error (backpropagation), until there is little or no difference between the network's output and the desired output.
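
As an illustrative sketch (using the Keras Sequential API that appears later in this article; the layer sizes here are arbitrary assumptions), a small feedforward network with one hidden layer might look like:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
# input layer -> hidden layer: every input is connected to every hidden neuron
model.add(Dense(64, input_dim=10))
model.add(Activation('sigmoid'))
# hidden layer -> output layer: a single neuron deciding yes (1) or no (0)
model.add(Dense(1))
model.add(Activation('sigmoid'))
# the weights are adjusted iteratively to minimise the error between the
# network's output and the desired output
model.compile(loss='binary_crossentropy', optimizer='sgd')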

One of the lacunae of this network is the lack of communication between neurons in the same layer. This means every neuron in that layer goes through the learning mechanism independently. If instead the learning in one neuron were passed on to its neighbouring neuron, it would reduce redundant learning and even simplify the network. The other problem is the fixed input size imposed by the structure of the input layer. There are inputs whose size is not predetermined. For example, translation between two languages requires a variable input size, so a feedforward neural network will not work for sequence-to-sequence learning, where both the input and output sizes vary.

Define RNN

A recurrent neural network (RNN) is a class of neural network where connections between units form a directed cycle. The input sequence x is fed to the RNN state s, and the output is o.


Let's look at a single unit of an RNN. The input x(t) and the state of the previous cell s(t-1) are fed to the current cell. The next state s(t) is computed as follows:


s(t) = g(s(t-1)*W + x(t)*U)   # where g is the sigmoid activation function
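
A minimal NumPy sketch of this update rule (the state and input sizes, and the sequence length, are arbitrary assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

input_dim, state_dim = 3, 4                 # assumed sizes, for illustration only
W = np.random.randn(state_dim, state_dim)   # state-to-state weights
U = np.random.randn(input_dim, state_dim)   # input-to-state weights

s = np.zeros(state_dim)                     # initial state s(0)
for x_t in np.random.randn(5, input_dim):   # a toy input sequence of length 5
    s = sigmoid(s.dot(W) + x_t.dot(U))      # s(t) = g(s(t-1)*W + x(t)*U)
print(s)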

In general, RNNs are unable to learn long-term dependencies well for two reasons:

a) the gradients tend to vanish, or

b) the gradients tend to explode.

If we take the example “I grew up in France… I speak fluent French”, the desired output after “fluent” is “French”. But an RNN will likely predict “I grew up in France… I speak fluent English” because

a) “English” occurs more commonly after “fluent”, and

b) it does not take long-term dependencies into account.

There are basically three approaches:

a) Control the gradient using a clipping function so that it does not explode (see the sketch after this list).

b) Use second-order derivatives, which help with vanishing gradients, though there is no guarantee.

c) Build more sophisticated activation functions.
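
For option (a), a minimal sketch of gradient clipping with a Keras optimiser (the learning rate and clipping thresholds are arbitrary choices, assuming the same Keras API used later in this article):

from keras import optimizers

# cap the overall gradient norm at 1.0 so the updates cannot explode
sgd_clipnorm = optimizers.SGD(lr=0.01, clipnorm=1.0)

# alternatively, clip each gradient component to the range [-0.5, 0.5]
sgd_clipvalue = optimizers.SGD(lr=0.01, clipvalue=0.5)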

Considering the last option, the first attempt at building a sophisticated activation function resulted in the invention of the LSTM (long short-term memory) unit by Hochreiter and Schmidhuber in 1997. In 2014, the GRU (gated recurrent unit) was developed by Cho et al.

Both LSTM and GRU are characterised by gates, each with a specific function. RNNs using either of these recurrent units have been found to perform better on various tasks such as machine translation and speech recognition.

LSTM vs GRU

Applications

LSTM and GRU have found their way into a myriad of applications such as chatbots, music generation, machine translation, and speech recognition.

Implementation

The tools for designing neural networks with LSTM or GRU units include TensorFlow, Keras, Torch, Theano, etc.

—> TensorFlow, Keras and Theano can be imported in Python and installed using pip.

—> Torch requires Lua, and the installation steps on Mac/Ubuntu are as follows:

git clone https://github.com/torch/distro.git ~/torch --recursive

cd ~/torch; bash install-deps;

./install.sh

source ~/.bash_profile

luarocks install nn

luarocks install nngraph

luarocks install nninit

luarocks install optim

luarocks install luautf8

Installation of CUDA

Apart from this, CUDA is required to run on GPUs, and the installation commands are as follows:

if ! dpkg-query -W cuda; then

curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb

sudo dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb

sudo apt-get update

sudo apt-get install cuda

fi

#download cudnn-8.0-linux-x64-v5.1.solitairetheme8 .

tar -xvf cudnn-8.0-linux-x64-v5.1.solitairetheme8

export LD_LIBRARY_PATH=/home/$USERNAME/cuda/lib64/

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"

GPUs can either be bought or rented on Google Cloud Platform or Amazon AWS. The Tesla K80 is the preferred GPU.
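
Once CUDA is set up, a quick sanity check (assuming a TensorFlow 1.x installation) confirms that TensorFlow can see the GPU:

from tensorflow.python.client import device_lib

# lists the CPU and GPU devices visible to TensorFlow; a working CUDA setup
# should show a device of type 'GPU' (e.g. the Tesla K80)
print(device_lib.list_local_devices())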

Codebase

I use Keras and TensorFlow. For the sake of convenience I will show Keras here. Keras is an API built on top of TensorFlow.

Designing an LSTM in Python

# imports needed for the code below (assuming Keras with the TensorFlow backend)
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation
from keras import optimizers

def create_lstm_model(INPUT_DIM, OUTPUT_DIM):

  in_out_neurons = INPUT_DIM    # dimension of each input vector

  out_n = OUTPUT_DIM            # dimension of the output layer

  hidden_neurons_stage_1 = 500  # number of LSTM units in the hidden layer

  

  print("Creating model...")

  model = Sequential()

  print("Adding LSTM ...layer 1")

  model.add(LSTM(hidden_neurons_stage_1, input_dim=in_out_neurons))

  #model.add(BatchNormalization())

  model.add(Activation('tanh'))

  print("Adding dropout ...")

  model.add(Dropout(0.4))

   

  print("adding output layer...")

  model.add(Dense(out_n, input_dim=hidden_neurons_stage_1))

  #model.add(BatchNormalization())

  print("adding activation...")

  #keras.layers.advanced_activations.LeakyReLU(alpha=0.3)

  model.add(Activation("relu"))



  print("compiling...")

  adagrad1 = optimizers.Adagrad(lr=0.0001, clipnorm=0.5)  # low learning rate; gradient norm clipped at 0.5

  model.compile(loss="mean_squared_error", optimizer=adagrad1,metrics=['accuracy'])

  print(model.summary())

  print("compiled!")

  return model
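
Working backwards from the parameter counts in the summary below, the model appears to have been built with INPUT_DIM = 100 and OUTPUT_DIM = 100, so an equivalent call would be:

model = create_lstm_model(100, 100)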


Output of this code

____________________________________________________________________________________________________
Layer (type)                 Output Shape          Param #     Connected to
====================================================================================================
lstm_1 (LSTM)                (None, 500)           1202000     lstm_input_1[0][0]
____________________________________________________________________________________________________
activation_1 (Activation)    (None, 500)           0           lstm_1[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)          (None, 500)           0           activation_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)              (None, 100)           50100       dropout_1[0][0]
____________________________________________________________________________________________________
activation_2 (Activation)    (None, 100)           0           dense_1[0][0]
====================================================================================================
Total params: 1,252,100
Trainable params: 1,252,100
Non-trainable params: 0

None
compiled!


The design begins with an input of dimension in_out_neurons. This is followed by the weight matrix into the hidden layer, whose dimensions are given by the input dimension and the size of the hidden layer.

Then comes the hidden layer, which contains 500 recurrent LSTM units. The hidden layer is followed by an activation layer which, like a sigmoid, normalises the output values; in our case tanh, which maps values to the range [-1, 1]. The activation layer is followed by a dropout layer, which reduces over-fitting by randomly dropping a fraction of the neurons (here 40%) during training. The activation and dropout layers have the same size as the hidden layer.

Hidden layer size = Activation layer size = Dropout layer size = 500

The output layer uses a ReLU activation, which maps the output to the range [0, ∞). The total number of parameters is 1,252,100. The optimiser used is Adagrad, which has

a) a low learning rate of 0.0001, and

b) clipnorm=0.5, in case the gradient starts exploding to huge values.

Once I train, I get the learned weights for the weight matrices. Using these weights, I rebuild the model at runtime and get predictions on the test set.
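
A minimal sketch of that workflow (the file name, batch size, and epoch count are placeholder assumptions; X_train, y_train, and X_test are assumed to be prepared elsewhere, and nb_epoch follows the older Keras API style used above):

# train and persist the learned weights
model = create_lstm_model(100, 100)
model.fit(X_train, y_train, batch_size=32, nb_epoch=10)
model.save_weights('lstm_weights.h5')

# at prediction time, rebuild the same architecture and load the saved weights
model_runtime = create_lstm_model(100, 100)
model_runtime.load_weights('lstm_weights.h5')
predictions = model_runtime.predict(X_test)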

The LSTM in this same model can be replaced with a GRU or a bidirectional LSTM.
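
A sketch of those substitutions, reusing the sizes from create_lstm_model (500 hidden units, input dimension 100) and assuming a Keras version that ships the Bidirectional wrapper in keras.layers.wrappers:

from keras.layers import GRU, LSTM
from keras.layers.wrappers import Bidirectional

# alternative 1: a drop-in GRU with the same number of hidden units
gru_layer = GRU(500, input_dim=100)

# alternative 2: a bidirectional LSTM that reads the sequence in both directions
bilstm_layer = Bidirectional(LSTM(500), input_shape=(None, 100))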

References:

  1. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling"
  2. "Understanding LSTM Networks" by Christopher Olah (colah)
  3. Keras tutorial
  4. Activation functions in neural networks
