BxD Primer Series: Long Short-Term Memory (LSTM) Neural Networks
Hey there!
Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Long Short-Term Memory (LSTM) Neural Networks. Let’s get started:
The What:
Long Short-Term Memory (LSTM) is a specialized RNN architecture (check our edition on RNNs here) that can capture long-term dependencies in sequential data. LSTMs are well suited for tasks such as natural language processing, speech recognition, and time-series prediction.
LSTM networks are composed of cells that store information over time, and gates that control the flow of information into and out of cells. Each gate uses a sigmoid activation function to decide how much of the input and previous cell state to keep or forget, while a separate tanh layer proposes the new candidate information to store in cell state.
Each cell has three gates: input gate, forget gate, and output gate.
- Input gate controls how much new information to add to cell state.
- Forget gate controls how much information to discard from cell state.
- Output gate controls how much of cell state goes to output and next time step.
LSTM networks can be stacked to form deeper networks, with each layer receiving input from previous layer and passing its output to next layer. The output of last layer can be fed into a fully connected layer or softmax layer for classification or prediction tasks.
LSTM networks have two main advantages over traditional RNNs.
- Able to handle long-term dependencies better, because they selectively store or forget information in the cell state.
- Less prone to vanishing gradient problem, which can occur in RNNs when gradients become too small and cause the network to stop learning.
Difference from Traditional RNNs:
A traditional RNN (Recurrent Neural Network) processes sequential data by applying the same set of weights to each input at each time step, and passing the output to the next time step. This creates a feedback loop in the network, where output at each time step depends on previous inputs and current input.
However, traditional RNNs suffer from the vanishing gradient problem, which occurs when the gradients of loss function become very small as they are propagated back through time, causing the network to stop learning. This happens because the gradient is multiplied by the same recurrent weights (and by activation derivatives) at every time step. When these factors are smaller than 1, the gradient shrinks roughly exponentially with the length of the sequence.
In contrast, an LSTM (Long Short-Term Memory) network uses a memory cell and gates to control the flow of information through the network. Memory cell allows the network to selectively remember or forget information over time, while gates control the amount of information that enters and leaves the cell.
By selectively storing or forgetting information over time, an LSTM network is better able to handle long-term dependencies in sequential data. Additionally, the cell state is updated additively, so the gradient flowing along the cell state is scaled by the forget gate rather than repeatedly multiplied by the same weight matrix. When the forget gate stays close to 1, the gradient survives across many time steps, which makes the vanishing gradient problem less severe in LSTM networks compared to traditional RNNs.
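To make the contrast concrete, here is a small numeric sketch. It uses a single scalar as a stand-in for the per-step gradient factor: 0.5 for a plain RNN and 0.99 for an LSTM forget gate that is mostly open. Both values are assumptions chosen for illustration; in a real network the factors are matrices and change at every time step.

```python
def gradient_after(steps, factor):
    """Scale an initial gradient of 1.0 by the same factor at every time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

rnn_factor = 0.5    # assumed size of (weight * activation derivative) in a plain RNN
lstm_factor = 0.99  # assumed forget gate value along the LSTM cell-state path

for steps in (10, 50, 100):
    rnn_grad = gradient_after(steps, rnn_factor)
    lstm_grad = gradient_after(steps, lstm_factor)
    print(f"{steps:>3} steps: RNN {rnn_grad:.3e}   LSTM {lstm_grad:.3e}")
```

After 100 steps the first gradient is effectively zero, while the second is still a usable fraction of its starting value.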
Basic Architecture:
LSTM networks use a memory cell to store information over time. The cell state can carry a value forward largely unchanged across many time steps, and it changes only when the gates selectively add or remove information. This allows the network to remember important features of the input sequence.
The LSTM cell is designed to address the issue of vanishing gradients in traditional RNNs. It is made up of several components that work together to selectively store or forget information over time. Components of an LSTM cell are:
- Input gate takes as input the current input to cell and previous output of cell, and outputs a value between 0 and 1. This value is then used to control how much of input should be added to cell state.
- Forget gate takes as input the current input to cell and previous output of cell, and outputs a value between 0 and 1. This value is then used to control how much of previous cell state should be retained.
- Cell state is a vector that stores information that has been accumulated over time. Cell state is updated at each time step based on input and forget gates, as well as current input to cell.
- Output gate takes as input the current input to cell and previous output of cell, and outputs a value between 0 and 1. This value is then used to control how much of current cell state should be output.
Bidirectional LSTM:
Bidirectional LSTM (BiLSTM) processes input sequence in both forward and backward directions, combining the outputs of two directions at each time step. This allows the network to capture information from both past and future context.
In a traditional LSTM network, the hidden state at each time step depends only on previous time step's hidden state and current input.
In a BiLSTM, there are two separate hidden states at each time step: one that processes the input sequence in forward order, and another that processes it in reverse order. Forward hidden state depends on previous forward hidden state and current forward input, while the reverse hidden state depends on previous reverse hidden state and current reverse input.
Once the input sequence has been processed in both directions, the forward and reverse hidden states at each time step are concatenated to form a combined hidden state. This combined hidden state is then used to make a prediction or produce an output at each time step.
BiLSTM can capture context from both past and future directions, which can be useful in tasks such as speech recognition, where current sound may depend on both preceding and following sounds. It is also useful in natural language processing tasks such as named entity recognition or sentiment analysis, where the meaning of a word or phrase may depend on both preceding and following words.
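The two-direction processing described above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names (`lstm_step`, `run_lstm`, `bilstm`), the choice to stack all four gate weight matrices into one matrix `W`, and the small random weights are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W stacks the input, forget, update and output weights row-wise."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:n])            # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    u = np.tanh(z[2 * n:3 * n])    # update (candidate) vector
    o = sigmoid(z[3 * n:4 * n])    # output gate
    c = f * c_prev + i * u
    h = o * np.tanh(c)
    return h, c

def run_lstm(xs, W, b, n_hidden):
    """Run an LSTM over xs (shape: time x features) and return all hidden states."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    hs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        hs.append(h)
    return np.stack(hs)

def bilstm(xs, params_fwd, params_bwd, n_hidden):
    """Concatenate forward and backward hidden states at each time step."""
    h_fwd = run_lstm(xs, *params_fwd, n_hidden)
    # Process the reversed sequence, then flip back so index t lines up with x_t.
    h_bwd = run_lstm(xs[::-1], *params_bwd, n_hidden)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
n_in, n_hidden, T = 3, 4, 5

def init_params():
    # Hypothetical small random weights; each direction gets its own parameters.
    W = rng.normal(0.0, 0.1, size=(4 * n_hidden, n_in + n_hidden))
    b = np.zeros(4 * n_hidden)
    return W, b

params_fwd = init_params()
params_bwd = init_params()
xs = rng.normal(size=(T, n_in))
out = bilstm(xs, params_fwd, params_bwd, n_hidden)
print(out.shape)  # (5, 8): T time steps, 2 * n_hidden features per step
```

Note that the two directions do not share weights, and that the backward outputs are flipped back before concatenation so that both halves of each row describe the same time step.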
The How:
Let's assume that we have a sequence of input values x_1, x_2, ..., x_T. At each time step t, the input x_t is fed into LSTM cell, along with previous hidden state h_{t-1} and previous memory cell content c_{t-1}. LSTM cell then computes a new hidden state h_t and a new memory cell content c_t based on the input, previous state and cell content.
• Input gate determines which new information to add to LSTM cell. It is controlled by input gate activation vector i_t:
i_t = σ(W_i · [x_t, h_{t-1}] + b_i)
Where:
- W_i is the weight matrix for input gate
- [x_t, h_{t-1}] is concatenation of input x_t and previous hidden state h_{t-1}
- b_i is the bias vector for input gate
- σ is the sigmoid function, which squashes the output of gate to a value between 0 and 1
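As a rough sketch, the input gate described above (a sigmoid applied to W_i times the concatenation [x_t, h_{t-1}], plus b_i) can be written in NumPy like this. The dimensions, the random weights and the function name `input_gate` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(x_t, h_prev, W_i, b_i):
    """i_t = sigmoid(W_i . [x_t, h_{t-1}] + b_i)"""
    concat = np.concatenate([x_t, h_prev])   # [x_t, h_{t-1}]
    return sigmoid(W_i @ concat + b_i)

rng = np.random.default_rng(0)
n_in, n_hidden = 2, 3                        # assumed sizes
W_i = rng.normal(0.0, 0.5, size=(n_hidden, n_in + n_hidden))
b_i = np.zeros(n_hidden)

x_t = np.array([0.5, -1.0])
h_prev = np.array([0.1, 0.0, -0.2])

i_t = input_gate(x_t, h_prev, W_i, b_i)
print(i_t)  # one value strictly between 0 and 1 per hidden unit
```

The forget and output gates have the same form, each with its own weight matrix and bias vector.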
• Forget gate determines which information to discard from previous memory cell content c_{t-1}. It is controlled by forget gate activation vector f_t:
f_t = σ(W_f · [x_t, h_{t-1}] + b_f)
Where:
- W_f is the weight matrix for forget gate
- b_f is the bias vector for forget gate
• Current memory content: LSTM updates its memory cell c_t by selectively adding or discarding information. The update vector u_t determines which new information to add:
u_t = tanh(W_c · [x_t, h_{t-1}] + b_c)
Where:
- W_c is the weight matrix for memory cell update
- b_c is the bias vector for memory cell update
- tanh is the hyperbolic tangent function, which maps each element of the update vector to a value between -1 and 1
• Updated memory content: Current memory cell content c_t is updated based on the input gate and forget gate activations and update vector:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ u_t
Operator (⊙) denotes element-wise multiplication.
Above equation calculates new memory cell content c_t as a gated sum of previous memory cell content c_{t-1} (scaled by f_t) and the new information vector i_t ⊙ u_t.
- Input gate activation vector i_t controls which elements of u_t should be added to memory cell content
- Forget gate activation vector f_t controls which elements of c_{t-1} should be retained
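A small worked example may help here. The gate values below are hand-picked (not produced by a trained network) so that the effect of each gate on the memory cell is easy to read off: the first element is kept as is, the second is replaced, and the third is a half-and-half blend.

```python
import numpy as np

c_prev = np.array([2.0, -1.0, 0.5])   # previous memory cell content c_{t-1}
f_t = np.array([1.0, 0.0, 0.5])       # forget gate: keep all, drop all, keep half
i_t = np.array([0.0, 1.0, 0.5])       # input gate: add nothing, add all, add half
u_t = np.array([0.9, 0.4, -0.8])      # update vector (tanh output, between -1 and 1)

# Element-wise: retained old content plus admitted new content.
c_t = f_t * c_prev + i_t * u_t
print(c_t)  # approximately [2.0, 0.4, -0.15]
```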
• Output gate determines which information to output from the memory cell content c_t to the next hidden state h_t. It is controlled by the output gate activation vector o_t:
o_t = σ(W_o · [x_t, h_{t-1}] + b_o)
Where:
- W_o is the weight matrix for output gate
- b_o is the bias vector for output gate
• New hidden state h_t is computed as a non-linear function of memory cell content c_t and output gate activation vector o_t:
h_t = o_t ⊙ tanh(c_t)
Output of network can be the last hidden state (in one-to-one and many-to-one scenarios) or a combination of hidden states from all time steps (in one-to-many and many-to-many scenarios).
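Putting the steps above together, a forward pass over a whole sequence might look like the following NumPy sketch. The parameter dictionary `p`, the function names, the zero initial states and the random weights are assumptions for this example; the computation inside `lstm_cell` follows the gate definitions from this section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, p):
    """Compute (h_t, c_t) from x_t, h_{t-1} and c_{t-1}."""
    v = np.concatenate([x_t, h_prev])           # [x_t, h_{t-1}]
    i_t = sigmoid(p["W_i"] @ v + p["b_i"])      # input gate
    f_t = sigmoid(p["W_f"] @ v + p["b_f"])      # forget gate
    u_t = np.tanh(p["W_c"] @ v + p["b_c"])      # update vector
    c_t = f_t * c_prev + i_t * u_t              # updated memory content
    o_t = sigmoid(p["W_o"] @ v + p["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                    # new hidden state
    return h_t, c_t

def lstm_forward(xs, p, n_hidden):
    """Run the cell over xs (shape: time x features), starting from zero states."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    hs = []
    for x_t in xs:
        h, c = lstm_cell(x_t, h, c, p)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(42)
n_in, n_hidden, T = 3, 4, 6                     # assumed sizes
p = {}
for g in "ifco":
    p[f"W_{g}"] = rng.normal(0.0, 0.1, size=(n_hidden, n_in + n_hidden))
    p[f"b_{g}"] = np.zeros(n_hidden)

xs = rng.normal(size=(T, n_in))
all_h = lstm_forward(xs, p, n_hidden)   # every hidden state (many-to-many style output)
last_h = all_h[-1]                      # final hidden state (many-to-one style output)
print(all_h.shape, last_h.shape)        # (6, 4) (4,)
```

In practice `last_h` or `all_h` would be passed to a further layer, such as a fully connected layer, to produce the final prediction.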
Weight Initialization Techniques:
Several methods can be used to initialize the weights of an LSTM network. The choice affects how well the network trains and can help prevent gradients from becoming too small or too large at the beginning of training.
• Random initialization: In this method, weights are randomly initialized using a uniform or normal distribution.
→ Uniform Distribution:
For each weight w_ij in weight matrix W:
- w_ij is randomly sampled from a uniform distribution U(a, b), where a and b are lower and upper bounds, respectively.
- The bounds a and b can be specified based on desired range of weights.
Uniform distribution is defined by the probability density function:
f(x) = 1 / (b - a) for a ≤ x ≤ b, and f(x) = 0 otherwise
→ Normal Distribution:
For each weight w_ij in the weight matrix W:
- w_ij is randomly sampled from a normal distribution N(μ, σ^2), where μ is the mean and σ^2 is the variance.
- Mean μ and variance σ^2 can be set according to specific requirements.
Normal distribution is defined by the probability density function:
f(x) = (1 / (σ * sqrt(2π))) * exp(-(x - μ)^2 / (2 * σ^2))
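Both random sampling schemes above can be sketched with NumPy's random generator. The matrix shape, the bounds a = -0.1 and b = 0.1, and the standard deviation 0.05 are arbitrary values chosen for illustration, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (4, 7)   # assumed (n_hidden, n_in + n_hidden) for a single gate

# Uniform: each w_ij ~ U(a, b)
a, b = -0.1, 0.1
W_uniform = rng.uniform(low=a, high=b, size=shape)

# Normal: each w_ij ~ N(mu, sigma^2).
# Note that `scale` is the standard deviation sigma, not the variance.
mu, sigma = 0.0, 0.05
W_normal = rng.normal(loc=mu, scale=sigma, size=shape)

print(W_uniform.min(), W_uniform.max())
print(W_normal.mean(), W_normal.std())
```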
• Xavier initialization, also known as Glorot initialization, involves sampling the initial weights from a Gaussian distribution with zero mean and a variance that is specifically calculated to keep the variance of input and output signals approximately equal.
Let's denote the weights of a layer as W:
- W is a matrix with dimensions (n_out, n_in)
- n_out is the number of output connections
- n_in is the number of input connections.
Variance for Xavier initialization:
variance = 2 / (n_in + n_out)
Once we have the variance, we can sample initial weights from a Gaussian distribution with zero mean and this calculated variance:
W ~ N(0, variance)
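A minimal NumPy sketch of this sampling step follows. The function name `xavier_normal` and the layer sizes are assumptions for the example. The sampler expects a standard deviation, so the square root of the variance is passed in.

```python
import numpy as np

def xavier_normal(n_out, n_in, rng):
    """Sample W with shape (n_out, n_in) from N(0, 2 / (n_in + n_out))."""
    variance = 2.0 / (n_in + n_out)
    return rng.normal(loc=0.0, scale=np.sqrt(variance), size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_normal(n_out=256, n_in=512, rng=rng)

# The empirical variance should be close to the target variance.
print(W.var(), 2.0 / (512 + 256))
```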
• He initialization works well with ReLU activation functions.
Let's consider a neural network layer with n_input neurons. In He initialization, the initial weights are sampled from a Gaussian distribution with zero mean and a variance of:
variance = 2 / n_input
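In code, He initialization differs from plain normal sampling only in how the variance is chosen. The sketch below assumes a weight matrix of shape (n_out, n_input); `n_out`, the function name `he_normal` and the sizes are hypothetical and used only for illustration.

```python
import numpy as np

def he_normal(n_out, n_input, rng):
    """Sample W with shape (n_out, n_input) from N(0, 2 / n_input)."""
    variance = 2.0 / n_input
    return rng.normal(loc=0.0, scale=np.sqrt(variance), size=(n_out, n_input))

rng = np.random.default_rng(0)
W = he_normal(n_out=256, n_input=512, rng=rng)

# The empirical variance should be close to 2 / n_input.
print(W.var(), 2.0 / 512)
```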
• Orthogonal initialization: Orthogonal initialization involves initializing LSTM weights with an orthogonal matrix, which is a matrix whose columns are orthonormal (mutually orthogonal and of unit length).
If we denote the weight matrix as W, then orthogonal initialization ensures that the columns of W satisfy following condition:
W^T * W = I
Where W^T is transpose of W and I is the identity matrix.
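Because an orthogonal matrix preserves the length of any vector it multiplies, repeated multiplication by W neither shrinks nor amplifies the signal. This mitigates potential issues of vanishing or exploding gradients early in the training process. The strategy is often applied to the recurrent (hidden-to-hidden) part of the weights, and it helps LSTM cells maintain a stable memory representation.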
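One common way to obtain such a matrix is to take the QR decomposition of a random Gaussian matrix, as in the sketch below. It assumes a square n x n matrix (for example, the hidden-to-hidden block of one gate); the function name `orthogonal_init` is hypothetical.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Return an n x n matrix W with W^T W = I, built from a QR decomposition."""
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    # Flip column signs so the diagonal of r is positive; q stays orthogonal.
    q = q * np.sign(np.diag(r))
    return q

rng = np.random.default_rng(0)
W = orthogonal_init(8, rng)

# Check the defining property and the norm-preserving behaviour.
v = rng.normal(size=8)
print(np.allclose(W.T @ W, np.eye(8)))
print(np.linalg.norm(v), np.linalg.norm(W @ v))
```

The two printed norms agree up to floating-point error, which is the property that keeps signals from shrinking or growing under repeated multiplication.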
Parameters in LSTMs:
- Weights (W_i, W_f, W_c, W_o): Weights of LSTM are initialized using any of the techniques described above and then learned over time to fit the data available for network training.
- Bias terms (b_i, b_f, b_c, b_o): Bias vectors shift the pre-activation of each gate and of the memory cell update, and are learned together with the weights.
- Number of LSTM units: Increasing the number of units can increase model's capacity to learn complex patterns in data, but it also increases computation time and risk of overfitting.
- Number of layers: Increasing the number of layers can increase model's ability to learn hierarchical features, but it also increases computation time and risk of overfitting.
- Learning rate: A high learning rate can lead to faster convergence but can also cause the model to overshoot optimal solution, while a low learning rate can result in slow convergence or getting stuck in local minima.
- Time steps in the input sequence: A larger number of time steps can increase model's ability to capture long-term dependencies, but it also increases computation time and risk of overfitting.
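To see how the number of units drives model size, the trainable parameters of a single LSTM layer can be counted directly from the weight and bias shapes used in this post: four weight matrices of shape (n_hidden, n_in + n_hidden) and four bias vectors of length n_hidden. The function below is a sketch under that assumption; some libraries keep additional bias vectors, so their reported counts can differ.

```python
def lstm_param_count(n_in, n_hidden):
    """Count weights and biases for the input, forget, update and output parts."""
    per_gate = n_hidden * (n_in + n_hidden) + n_hidden   # one W and one b
    return 4 * per_gate

print(lstm_param_count(n_in=10, n_hidden=32))   # 5504
print(lstm_param_count(n_in=10, n_hidden=64))   # 19200
```

Doubling the number of units here more than triples the parameter count, because the recurrent part of each weight matrix grows quadratically with n_hidden.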
The Why:
Reasons for using LSTMs:
- LSTMs are specifically designed to handle long-term dependencies in sequential data, such as speech or time-series data.
- LSTMs can handle variable-length sequences of data, such as text or audio, because the same cell is applied at every time step. Padding is only needed when sequences of different lengths are batched together.
- Memory cell in LSTMs allows them to selectively store or discard information over time, which helps in tasks that involve filtering noise or irrelevant information from input sequence.
- LSTMs can be used in a bidirectional architecture. This allows them to capture both past and future dependencies in the input sequence for tasks such as speech recognition or language translation.
- LSTMs can be fine-tuned on a small amount of task-specific data, after being pre-trained on a large amount of related data, making them suitable for transfer learning scenarios.
The Why Not:
Reasons for not using LSTMs:
- For some applications, simpler architectures such as feedforward neural networks or convolutional neural networks may be sufficient, and LSTMs may not offer any significant advantages over these simpler models.
- LSTMs can be slow to train, particularly if dataset is large or the architecture is complex. This can be a disadvantage in applications where real-time processing is required.
- LSTMs are inherently sequential models, which makes them difficult to parallelize and can lead to slower training times compared to other types of neural networks.
In next edition, we will cover Gated Recurrent Unit (GRU) Neural Networks.
Let us know your feedback!
Until then,
Have a great time!