Brief intro to RNN
Bibek Behera
Empowering Conversational AI Innovation at FloBot: Customized Chatbot Solutions for Every Industry.
RNN stands for recurrent neural network. First, let me explain a neural network.
Define NN
A neural network is a computer system modelled on the human brain and nervous system. The key component of a neural network is a neuron, which is essentially a decision-making unit. In maths, there is a function called the sigmoid which can replicate a neuron. Its graph is the well-known “S”-shaped curve. It basically normalises the input, i.e. given any number in the range (-∞, ∞), it generates a real number in the range (0, 1).
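As a quick illustration, a minimal sketch of the sigmoid in Python (using NumPy) looks like this:

import numpy as np

def sigmoid(x):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-10), sigmoid(0), sigmoid(10))  # roughly 0.000045, 0.5, 0.999955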
In the real world, we make decisions all the time. We take a number of inputs and then decide yes or no, in other words 1 or 0.
The basic neural network contains an input layer for gathering any sort of input (audio, video, image, text, etc.), a hidden layer that captures certain patterns from the raw data, and finally an output layer that makes a decision. Each layer can have any number of neurons. Such a system is called a feedforward network because the data keeps moving from one layer to the next. It is like a mesh or grid where all neurons in the previous layer are connected to all neurons in the next layer, but neurons in the same layer are not connected to each other. The connections in these networks are nothing but weights. These weights are calculated in such a manner that the output of the neural network matches the required output. The calculation is done by propagating the error backwards in an iterative fashion until there is little or no error between the output of the neural network and the desired output.
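A rough sketch of such a feedforward network in Keras could look like the following (the layer sizes of 10 inputs, 5 hidden neurons and 1 output are arbitrary, chosen purely for illustration):

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(5, input_dim=10))   # weights from the 10-dimensional input to 5 hidden neurons
model.add(Activation('sigmoid'))    # each hidden neuron behaves like a sigmoid unit
model.add(Dense(1))                 # weights from the hidden layer to a single output neuron
model.add(Activation('sigmoid'))    # the output is a decision score in (0, 1)
model.compile(loss='binary_crossentropy', optimizer='sgd')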
One of the lacunae of this network is the lack of communication between neurons in the same layer. This means every neuron in that layer goes through the learning mechanism independently. If the learning in one neuron were passed on to the neighbouring neuron, it would reduce redundant learning and even simplify the neural network. The other problem is the fixed input size imposed by the structure of the input layer. There are inputs whose size is not predetermined. For example, translation between two languages requires a variable input size, so a feedforward neural network will not work for sequence-to-sequence learning, where both input and output sizes vary.
Define RNN
A recurrent neural network (RNN) is a class of neural network where connections between units form a directed cycle. The input sequence x is fed to the RNN, whose state is s, and the output is o.
Let’s look at a single unit of an RNN. The input x(t) and the state of the previous cell, s(t-1), are fed to the current cell. The next state s(t) is described as follows:-
s(t) = g(s(t-1)*W + x(t)*U)  # where g is the sigmoid function
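A minimal sketch of this update in Python (NumPy), with made-up dimensions of 3 for the input and 4 for the state, would be:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

U = np.random.randn(3, 4)  # input-to-state weights
W = np.random.randn(4, 4)  # recurrent state-to-state weights

def rnn_step(s_prev, x_t):
    # s(t) = g(s(t-1)*W + x(t)*U)
    return sigmoid(np.dot(s_prev, W) + np.dot(x_t, U))

s = np.zeros(4)                    # initial state s(0)
for x_t in np.random.randn(5, 3):  # a toy sequence of 5 inputs
    s = rnn_step(s, x_t)           # the same weights are reused at every step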
In general, RNNs are unable to model long-term dependencies well because of two reasons:
a) the gradients tend to vanish, or
b) the gradients explode.
If we take the example “I grew up in France… I speak fluent French”, the desired output after “fluent” is “French”. But an RNN will predict “I grew up in France… I speak fluent English” because
a) “English” occurs more commonly after “fluent”, and
b) it does not take long-term dependencies into account.
There are basically three solutions:
a) Control the gradient using a clipping function so that it does not explode (a minimal sketch follows this list).
b) Use 2nd-order derivatives, which take care of vanishing gradients, though there is no guarantee.
c) Build sophisticated activation functions.
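For option (a), gradient clipping is a one-line change in Keras: the clipnorm argument rescales any gradient whose norm exceeds the threshold. This is the same parameter used in the full model later in this article.

from keras import optimizers

# clip the gradient whenever its L2 norm exceeds 0.5,
# preventing the exploding-gradient problem
clipped_adagrad = optimizers.Adagrad(lr=0.0001, clipnorm=0.5)
# model.compile(loss='mean_squared_error', optimizer=clipped_adagrad)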
Considering the last option, the first attempt at building sophisticated activation functions resulted in the invention of the LSTM (long short-term memory) unit by Hochreiter and Schmidhuber in 1997. In 2014, the GRU (gated recurrent unit) was proposed by Cho et al.
Both LSTM and GRU are characterised by gates that have certain functionalities. RNNs using either of these recurrent units have been found to perform better on various tasks such as machine translation and speech recognition.
LSTM vs GRU
An LSTM unit has three gates (input, forget and output) plus a separate memory cell, while a GRU has only two gates (reset and update) and merges the memory cell into the hidden state, giving it fewer parameters. In practice both perform comparably on sequence tasks (see the Chung et al. paper in the references).
Applications
LSTM and GRU have found their way into a myriad of applications such as chatbots, music generation, machine translation and speech recognition.
Implementation
The tools available for designing neural networks using LSTM or GRU are TensorFlow, Keras, Torch, Theano, etc.
—> TensorFlow, Keras and Theano can be imported in Python and installed using pip.
—> Torch requires Lua, and the installation commands on Mac/Ubuntu are as follows:-
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh
source ~/.bash_profile
luarocks install nn
luarocks install nngraph
luarocks install nninit
luarocks install optim
luarocks install luautf8
Installation of CUDA
Apart from this, CUDA is required to run on GPUs, and the installation commands are as follows:-
if ! dpkg-query -W cuda; then
  curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  sudo dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  sudo apt-get update
  sudo apt-get install cuda
fi
# download cudnn-8.0-linux-x64-v5.1.solitairetheme8 (the cuDNN archive) from the NVIDIA developer site
tar -xvf cudnn-8.0-linux-x64-v5.1.solitairetheme8
export LD_LIBRARY_PATH=/home/$USERNAME/cuda/lib64/
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
GPUs can either be bought or rented on Google Cloud Platform or Amazon AWS. The Tesla K80 is the preferred GPU.
Codebase
I use Keras and TensorFlow. For the sake of convenience I will show Keras here. Keras is an API built on top of TensorFlow.
Designing an LSTM in Python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation
from keras import optimizers

def create_lstm_model(INPUT_DIM, OUTPUT_DIM):
    in_out_neurons = INPUT_DIM
    out_n = OUTPUT_DIM
    hidden_neurons_stage_1 = 500
    print("Creating model...")
    model = Sequential()
    print("Adding LSTM ...layer 1")
    # hidden layer: 500 LSTM units reading the input sequence
    model.add(LSTM(hidden_neurons_stage_1, input_dim=in_out_neurons))
    #model.add(BatchNormalization())
    model.add(Activation('tanh'))
    print("Adding dropout ...")
    model.add(Dropout(0.4))
    print("adding output layer...")
    model.add(Dense(out_n, input_dim=hidden_neurons_stage_1))
    #model.add(BatchNormalization())
    print("adding activation...")
    #keras.layers.advanced_activations.LeakyReLU(alpha=0.3)
    model.add(Activation("relu"))
    print("compiling...")
    # Adagrad with a low learning rate and gradient clipping (clipnorm)
    adagrad1 = optimizers.Adagrad(lr=0.0001, clipnorm=0.5)
    model.compile(loss="mean_squared_error", optimizer=adagrad1, metrics=['accuracy'])
    print(model.summary())
    print("compiled!")
    return model
Output of this code
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
lstm_1 (LSTM)                    (None, 500)           1202000     lstm_input_1[0][0]
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 500)           0           lstm_1[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 500)           0           activation_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 100)           50100       dropout_1[0][0]
____________________________________________________________________________________________________
activation_2 (Activation)        (None, 100)           0           dense_1[0][0]
====================================================================================================
Total params: 1,252,100
Trainable params: 1,252,100
Non-trainable params: 0
None
compiled!
The design begins with the input of dimension in_out_neurons. This is followed by the weight matrix into the hidden layer. The weight matrix prior to the hidden layer has its dimensions determined by the dimension of the input sequence and the size of the hidden layer.
Then comes the hidden layer, where there are 500 recurrent LSTM units. The hidden layer is followed by an activation layer which, like the sigmoid function, normalises the output values; in our case tanh, which normalises values to the range [-1, 1]. The activation layer is followed by a dropout layer, which reduces over-fitting by randomly dropping a fraction of the neurons during training so that the network does not rely too heavily on any single neuron. The activation and dropout layers have the same size as the hidden layer.
Size of hidden layer = size of activation layer = size of dropout layer (500)
The output layer has a ReLU activation, which maps the output into the range [0, ∞). The total number of parameters is 1,252,100. The optimiser used is Adagrad, which has
a) a low learning rate of 0.0001, and
b) clipnorm=0.5 to clip the gradient norm in case the gradients start exploding.
Once I train the model, I get the weights for the weight matrices. Using these weights I rebuild the model at runtime and get predictions on the test set.
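A minimal sketch of that workflow in the Keras 1-style API used above might look like this; the data, shapes and filename here are all hypothetical, and the dimensions of 100 simply match the summary shown earlier:

import numpy as np

# dummy data: 64 training sequences and 16 test sequences,
# each 20 timesteps long with 100 features per timestep
X_train = np.random.randn(64, 20, 100)
y_train = np.random.randn(64, 100)
X_test = np.random.randn(16, 20, 100)

model = create_lstm_model(100, 100)
model.fit(X_train, y_train, nb_epoch=10, batch_size=32)
model.save_weights('lstm_weights.h5')          # hypothetical filename

# at runtime, rebuild the same architecture and reload the learned weights
runtime_model = create_lstm_model(100, 100)
runtime_model.load_weights('lstm_weights.h5')
predictions = runtime_model.predict(X_test)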
The LSTM layer in this same model can be replaced with a GRU or a bidirectional LSTM.
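As a sketch (not a standalone script), the following lines could replace the model.add(LSTM(...)) line inside create_lstm_model, with the rest of the function left unchanged:

from keras.layers import GRU, LSTM
from keras.layers.wrappers import Bidirectional

# a) GRU variant: same interface as LSTM, fewer parameters
model.add(GRU(hidden_neurons_stage_1, input_dim=in_out_neurons))

# b) bidirectional LSTM variant: reads the sequence forwards and backwards
model.add(Bidirectional(LSTM(hidden_neurons_stage_1),
                        input_shape=(None, in_out_neurons)))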
References:-
- Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling” (2014)
- Christopher Olah, “Understanding LSTM Networks” (colah’s blog)
- Keras tutorial
- Activation functions in neural networks