A Comparison of DNN, CNN and LSTM using TF/Keras


A quick look at the different neural network architectures, their advantages and disadvantages.

Introducing CNN and LSTM

Before we get into the details of my comparison, here is an introduction to, or rather, my understanding of, the other neural network architectures. We all understand deep neural networks, which are simply a set of neurons per layer, interconnected sequentially to another set of neurons in the next layer, and so on.

Each neuron implements the equation y = f(Wx + b) for inputs x and output y, where f is the non-linear activation function, W is the weight matrix and b is the bias. Here is a picture from https://playground.tensorflow.org/


CNN

A convolutional neural network, CNN, adds additional "filtering" layers, where the filter weights (or convolution kernels, if you prefer fancier words :) can be learned in addition to the weights and biases of each neuron. It is still back propagation that is doing this job for us, but we shall not make it too easy for the trusty workhorse that is backprop!

Here is a picture I made in PowerPoint to explain the CNN. There are better pictures on the web with cool graphics, but I don't want to copy the hard work of someone else. When I am creating my content, I have to create my own illustrations too! Which is why content creation is a hard job. Despite that, the internet today is built by people who have created awesome content because they had fun doing so!


As you can see in the above picture, a CNN has several parallel filters which can be tuned to extract different features of interest. But of course, we won't design the filters by hand as we do in Signal Processing; we will let back propagation compute the filter weights.

Those readers who are familiar with Signal Processing can make the connection to filter banks to separate high and low frequencies. This idea plays an important role in compressing images, where filter banks can be used to separate low and high frequencies, and only low frequencies need to be kept. Let us not digress, however.

The input vector is filtered by each of these "convolutional" layers. They "convolve" the input vector with a kernel (the filter impulse response). Convolution is one of the fundamental operations in linear systems, as fundamental as multiplication is to numbers. In fact, the convolution operation is exactly the same as polynomial multiplication. If you multiply two polynomials and evaluate the result at x = 10, you get your regular long multiplication of numbers. I digress again.
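To see that convolution really is polynomial multiplication, here is a two-line NumPy check (the polynomials are my own illustrative picks):

```python
import numpy as np

# Coefficients of (x^2 + 2x + 3) and (4x + 5), highest power first.
p = np.array([1, 2, 3])
q = np.array([4, 5])

# Convolving the coefficient vectors multiplies the polynomials.
prod = np.convolve(p, q)
print(prod)  # [ 4 13 22 15]  ->  4x^3 + 13x^2 + 22x + 15

# Evaluating both sides at x = 10 reproduces long multiplication: 123 * 45.
print(np.polyval(prod, 10), 123 * 45)  # 5535 5535
```

The digits 1, 2, 3 and 4, 5 are just the coefficients read off at x = 10, which is why the convolution result evaluated at 10 equals 123 × 45.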

Each convolutional layer then generates its own output vector, so the amount of data grows by a factor of K if we have K parallel filters. To reduce the dimensionality, we use a "pooling" layer that computes the MAX/MIN or the average over a certain number of samples. We then concatenate the outputs of all the pooling layers and pass them through a dense layer to generate the output.

RNN and LSTM

An LSTM (Long Short Term Memory) is a type of Recurrent Neural Network (RNN), where the same network is trained through a sequence of inputs across "time". I say "time" in quotes because this is just a way of splitting the input vector into time sequences, and then looping through the sequences to train the network.

Since it is the same network, or rather the same set of neurons, that is trained at every time instance, we need a way of passing "state information" across time. The state the neurons evolve to at one time instance is used as an additional input to the neurons at the next time instance. Hopefully, the picture below illustrates this.

If we replace the single dense layer in RNN with an “LSTM layer”, we get an LSTM network. There are excellent explanatory articles on the web explaining RNN and LSTM — here is one from Colah’s blog: “Understanding LSTM”.

The RNN or LSTM captures the dependency across time sequences in the input vector. The same effect can be accomplished with a DNN, but that would require collecting the input vectors across time and feeding them to one large layer, resulting in a much larger set of parameters to train than an RNN needs.


Comparing Time Series Prediction

With that introduction to CNN and RNN, let us get into the main topic of this article — comparing DNN, CNN and RNN/LSTM. We will pick time series prediction as the problem we want to solve, but with a twist! Once the networks are trained, we will evaluate not only their prediction based on input samples, but also append the predicted samples as input to see how well the network generates the time series. New predictions based on old predictions — now that is a good challenge! Other than being a fun experiment to do, this also has practical applications.

For example, channel estimation in WLAN happens during the preamble, but the estimate needs to be used for demodulation until the whole packet ends. If it is a very long packet, the channel will be slowly changing over time, and towards the end of the packet we would be left with a poor estimate if we don't track the channel variations. Since we don't get additional training symbols to estimate the channel during the payload, we need to "predict" the channel variations to update the channel estimate. The estimate is updated based on the prediction and is then used again for the next prediction, so if one of the predictions is erroneous, that error gets propagated to future predictions.

As is the norm with ML practitioners, I am using the Jupyter notebook to write this article and the associated code. Let us go ahead and import the usuals.

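The original import cell isn't visible in this export; "the usuals" for a notebook like this are presumably NumPy, Matplotlib and TensorFlow/Keras:

```python
# The usual suspects for a TF/Keras time-series notebook.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

print(tf.__version__)
```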


Let us use the sum of sinusoids as the input time series. I quite like this data. Even with superposition of just three sinusoids, the time series looks random enough!

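The exact amplitudes and frequencies used in the original notebook aren't visible here, so the three sinusoids below are illustrative choices; any three incommensurate frequencies give a similarly "random-looking" series:

```python
import numpy as np

N = 4000                # total number of samples in the series
n = np.arange(N)

# Superposition of three sinusoids at unrelated frequencies.
x = (np.sin(2 * np.pi * 0.013 * n)
     + np.sin(2 * np.pi * 0.037 * n)
     + np.sin(2 * np.pi * 0.002 * n))

print(x.shape)  # (4000,)
```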

DNN Prediction Performance

We will start with the DNN. We are going to feed the DNN 64 samples of the time series, and the DNN needs to predict the 65th sample. Taking the time series data that is 4000 samples long, we split it into overlapping sequences of 64 samples to generate ~4000 batches (in other words, ~4000 input vectors, each 64 samples long). Feel free to copy the code into your Python or Colab environment to run it and get a feel for what we are doing.

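The windowing step can be sketched as follows (the helper name and the stand-in series are mine; the original code isn't visible in this export):

```python
import numpy as np

def make_windows(x, win=64):
    """Split a 1-D series into overlapping windows of `win` samples,
    each paired with the sample that immediately follows it."""
    X = np.array([x[i:i + win] for i in range(len(x) - win)])
    y = x[win:]
    return X, y

x = np.sin(2 * np.pi * 0.013 * np.arange(4000))  # stand-in series
X, y = make_windows(x, 64)
print(X.shape, y.shape)  # (3936, 64) (3936,)
```

A 4000-sample series yields 4000 − 64 = 3936 overlapping windows, which is the "~4000 batches" above.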

The DNN model is a 3-layer sequential network, with the first layer having 32 neurons, the second 8 and the third 1.

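A Keras sketch of that 64 → 32 → 8 → 1 network (the activations in the original screenshot aren't visible, so relu is my assumption; the layer sizes match the parameter count of 2353 quoted later):

```python
from tensorflow import keras

# 3-layer dense network: 64 inputs -> 32 -> 8 -> 1 output.
model = keras.Sequential([
    keras.layers.Input(shape=(64,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# (64*32 + 32) + (32*8 + 8) + (8*1 + 1) = 2080 + 264 + 9 = 2353
print(model.count_params())  # 2353
```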

Let us go ahead and train the model now, easy peasy :)

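Training is a single fit call. A runnable sketch with a stand-in series; the epoch count and batch size in the original aren't visible, so these are illustrative:

```python
import numpy as np
from tensorflow import keras

# Stand-in data with the same shapes as the article's (windows of 64).
x = np.sin(2 * np.pi * 0.013 * np.arange(4000))
X = np.array([x[i:i + 64] for i in range(len(x) - 64)])
y = x[64:]

model = keras.Sequential([
    keras.layers.Input(shape=(64,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# A short run for illustration; the article trains for longer.
history = model.fit(X, y, epochs=2, batch_size=32, verbose=0)
print(len(history.history["loss"]))  # 2
```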

We now generate test data exactly the same way we generated the training data and use it to evaluate the network. As you see, the predictions match the expected output to a reasonable degree of accuracy.


But predictions based on input samples are easy. We have 64 input samples to predict the 65th. Let us evaluate how well the DNN performs if it must predict 65th sample based on 64 of its past predictions! All we do here is run the model to get each prediction, append the prediction to the input and repeat this in a loop. The output below shows pretty good performance. We can visually see that the generated output based on its own past predictions matches the input pattern. This means we can turn-off the input to the network at any point and let the network run on its own outputs to generate subsequent outputs, like a signal generator. So far so good!


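The prediction-feeding-prediction loop can be sketched generically. To keep the sketch self-contained without a trained network, I use a closed-form stand-in predictor (for a pure sinusoid, x[n] = 2·cos(w)·x[n−1] − x[n−2]); with the trained Keras model you would use something like `lambda w: float(model.predict(w[None, :], verbose=0))` instead:

```python
import numpy as np

def free_run(predict, seed, n_steps):
    """Generate n_steps new samples by repeatedly predicting the next
    sample from the last len(seed) outputs: predictions feed predictions."""
    buf = list(seed)
    out = []
    for _ in range(n_steps):
        nxt = predict(np.array(buf[-len(seed):]))
        buf.append(nxt)
        out.append(nxt)
    return np.array(out)

# Stand-in predictor: exact one-step recurrence for a single sinusoid.
w = 2 * np.pi * 0.01
predict = lambda window: 2 * np.cos(w) * window[-1] - window[-2]

seed = np.sin(w * np.arange(64))       # first 64 true samples
gen = free_run(predict, seed, 100)     # next 100 self-generated samples
print(np.allclose(gen, np.sin(w * np.arange(64, 164)), atol=1e-6))  # True
```

Note how any one-step error would feed back into the buffer and compound, which is exactly what we are stress-testing in the networks.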

Prediction using CNN

Let us build a CNN now, but use only 16 inputs at a time to predict the next sample. The "convolution" should already be capable of extracting the time correlation between samples, and we are using 3 different filters, each with a kernel of 4 taps. The code below is fairly well commented, so let us just quickly get past training and validation to the interesting part.

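A plausible Keras version of that CNN. Only the 3 filters of 4 taps each are stated in the article; the pooling and dense sizes below are my guesses, so the parameter count here (176) may differ slightly from the 192 quoted next:

```python
from tensorflow import keras

# 1-D CNN over 16 input samples: 3 parallel filters, kernel size 4,
# pooling to shrink the data, then a small dense head.
model = keras.Sequential([
    keras.layers.Input(shape=(16, 1)),
    keras.layers.Conv1D(filters=3, kernel_size=4, activation="relu"),
    keras.layers.MaxPooling1D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
print(model.count_params())  # 176 for this particular configuration
```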

Again, prediction based on input samples is pretty good. But did you, dear reader, notice that the CNN only needs 192 parameters compared to the 2353 we had for the DNN? That is an order of magnitude smaller! Training is a bit slower though: 16 s compared to 11 s for the DNN.

Let us find out how good the CNN's "signal generation" capability is. Running the code below, we see that the CNN outputs slowly "decay" when they are generated based on past predictions. I know, saying "decay" in quotes is not a very scientific analysis, but this is just a fun experiment!

While the CNN does a pretty good job of prediction with just 192 parameters, it is not as good at perpetual signal generation as the DNN. Maybe increasing the CNN size will make it better? Easy to get the answer: just try it out! Onwards to LSTM then.


LSTM Prediction

Getting the data ready for the LSTM depends on how far we want to "look back", in other words, the number of input sequences the LSTM consumes before generating an output. For our example, we will use a lookback of 4 sequences, each 8 samples long. Note that the Keras LSTM layer requires the input tensor to be of shape (batch_size, lookback=4, input_size=8), so we take samples 0 to 31 for the first batch, samples 1 to 32 for the second, and so on, concatenated into one vector which we then reshape to the appropriate dimensions.

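A sketch of the reshaping and the model. The number of LSTM units in the original isn't visible; 12 units gives roughly half the DNN's parameter count (1021 vs 2353), matching what the article reports:

```python
import numpy as np
from tensorflow import keras

# Reshape the series into (batch, lookback=4, input_size=8) tensors:
# batch i covers samples i .. i+31, split into 4 sub-sequences of 8.
x = np.sin(2 * np.pi * 0.013 * np.arange(4000))  # stand-in series
lookback, size = 4, 8
win = lookback * size                            # 32 samples per prediction
X = np.array([x[i:i + win].reshape(lookback, size)
              for i in range(len(x) - win)])
y = x[win:]
print(X.shape, y.shape)  # (3968, 4, 8) (3968,)

model = keras.Sequential([
    keras.layers.Input(shape=(lookback, size)),
    keras.layers.LSTM(12),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
print(model.count_params())  # 1021 with 12 units
```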

We see that prediction based on input samples is spot on, but training is a lot slower, even though the number of parameters is about half of what we had for the DNN.

And if we run the LSTM in signal-generation mode, it seems to do fairly well, but it still misses the low-frequency modulation that the DNN managed to capture. This is just an artifact of the DNN looking at 64 samples while the LSTM looks at only 32. Go ahead and try increasing the lookback to 8, making the LSTM train on 64 samples per output, and you will see that it does as well as the DNN.

By the way, if you did try the above experiment of changing the lookback, you would notice another cool fact about LSTM. The number of parameters that we must train stays the same when you change the lookback. This means you can look at very long sequences of inputs without increasing the size of your network — therein lies its power!
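That invariance is easy to check directly: since the same weights are reused at every time step, the parameter count does not depend on how many steps we unroll (units and input size below are my illustrative choices):

```python
from tensorflow import keras

def lstm_model(lookback, input_size=8, units=12):
    """Same LSTM architecture, parameterised only by the lookback."""
    return keras.Sequential([
        keras.layers.Input(shape=(lookback, input_size)),
        keras.layers.LSTM(units),
        keras.layers.Dense(1),
    ])

# Doubling the lookback leaves the parameter count unchanged.
print(lstm_model(lookback=4).count_params(),
      lstm_model(lookback=8).count_params())
```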

LSTMs have been used very successfully on a wide variety of problems in speech recognition and NLP where we have to look at long term history. Imagine doing that with DNN and you would have a monster network at hand.


Concluding Remarks

Wrapping up, we see that for the simple time series prediction problem we chose to experiment on, all three networks perform similarly. If we evaluate how well they generate new predictions from their own previous predictions, we again see that, as long as each network is trained on the same number of input samples, the performance is similar.


CNN can be used to reduce the number of parameters we need to train without sacrificing performance — the power of combining signal processing and deep learning! But training is a wee bit slower than it is for DNN.

The LSTM required more parameters than the CNN, but only about half as many as the DNN. While it is the slowest to train, its advantage comes from being able to look at long sequences of inputs without increasing the network size.

And that, dear reader, brings us to the end of this article. I thank you for your time and hope you got a bit of insight in return.

