Why LSTM?

Because a simple RNN suffers from two main problems:

1) Vanishing Gradient Problem

2) Exploding Gradient Problem

What is the vanishing gradient problem, or the problem of losing long-term memory (i.e., the problem of failing to memorize long sentences)?

Example: 'Maharashtra' is a beautiful state. It has a total of 35 districts and is called the Land of Saints. Mumbai is its capital city. Pune, another major city, is renowned for its educational institutions and thriving IT industry. The state has a diverse culture. The language spoken is 'Marathi'.

Suppose we are working on a Next Word Predictor project. When we want to predict 'Marathi', the word that obviously contributes most to that prediction is 'Maharashtra'. An RNN gives importance to the preceding words, but the farther back a word sits, the less it contributes to predicting the next word (this happens because of the gradient problems we will see below). In our example, if 'Maharashtra' were close to 'Marathi', it would contribute strongly to the prediction; but since it sits far from 'Marathi', there is a greater chance that our next word predictor gives a wrong prediction.

This issue arises from either the vanishing gradient problem or the exploding gradient problem. When we pass text word by word (token by token) to an RNN over increasing time steps (t1, t2, ..., tn), the network computes its outputs in the forward pass, and during backpropagation the weights are updated through gradients using the chain rule. For long text, the gradient becomes a long chain of multiplied terms. When those terms are fractions (values between 0 and 1), multiplying them repeatedly drives the gradient toward zero, and weight updates effectively stop after a certain point; this is called the Vanishing Gradient Problem. In contrast, when the terms are greater than 1, the gradient values grow rapidly and cause large, unstable weight updates; this is called the Exploding Gradient Problem.
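To make this concrete, here is a tiny, hand-wavy Python sketch (not the exact RNN math; the single per-step factor stands in for the chain-rule Jacobian terms) showing how repeated multiplication over 50 time steps shrinks or blows up a gradient:

```python
# Toy illustration: the gradient reaching the earliest time step is a
# product of one factor per time step (chain rule over the sequence).
def gradient_after(steps, factor):
    grad = 1.0
    for _ in range(steps):
        grad *= factor  # one chain-rule factor per time step
    return grad

print(gradient_after(50, 0.9))  # ~0.005 -> vanishing gradient
print(gradient_after(50, 1.1))  # ~117.4 -> exploding gradient
```

With a factor of 0.9, almost no gradient reaches the first tokens, so a distant word like 'Maharashtra' stops influencing learning; with a factor of 1.1, the gradient is over a hundred times larger, destabilizing the weight updates.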

Let's understand this mathematically,

The gradient calculation involves unfolding the network over time steps and applying the chain rule to compute gradients at each time step (the number of time steps equals the number of tokens).


Let's denote:

  • L: Loss function
  • h_t: Hidden state at time step t
  • x_t: Input at time step t
  • y_t: Output at time step t
  • θ: Parameters of the RNN

Gradient Calculation
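Using the symbols defined above, the standard backpropagation-through-time form of the gradient (written here in LaTeX notation as a sketch of the usual derivation) is:

```latex
\frac{\partial L}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial \theta},
\qquad
\frac{\partial L_t}{\partial \theta} = \sum_{k=1}^{t}
\frac{\partial L_t}{\partial y_t}\,
\frac{\partial y_t}{\partial h_t}
\left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
\frac{\partial h_k}{\partial \theta}
```

The product term over ∂h_j/∂h_{j-1} is the culprit: when its factors are smaller than 1 it shrinks toward zero (vanishing), and when they are larger than 1 it grows without bound (exploding) as the distance t − k increases.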

This process continues until the gradient is computed for all time steps. However, in practice, the gradient computation can suffer from the vanishing or exploding gradient problem, especially for long sequences.

To solve this problem, deep learning researchers developed an advanced recurrent architecture called LSTM.

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture that is particularly effective at capturing long-range dependencies and mitigating the vanishing gradient problem commonly encountered in standard RNNs.

Here are some reasons why LSTM networks are often preferred:

  1. Memory Cell: LSTMs have a special memory cell that can maintain information over long periods of time. This memory cell is designed to preserve information over multiple time steps, allowing LSTMs to remember important past information and selectively update it when necessary.
  2. Gating Mechanisms: LSTMs use gating mechanisms to control the flow of information into and out of the memory cell. These consist of three gates: the input gate, the forget gate, and the output gate. They regulate the flow of information by determining how much should be stored, forgotten, or passed to the next time step (a minimal sketch of these gates appears after this list).
  3. Gradient Flow: LSTMs are designed to facilitate the flow of gradients during training. The gating mechanisms help alleviate the vanishing gradient problem by allowing gradients to propagate more easily through time. This makes it easier for LSTMs to learn long-range dependencies and capture patterns in sequential data.
  4. Versatility: LSTMs are versatile and can be applied to various tasks such as sequence modeling, language modeling, machine translation, speech recognition, and more. Their effectiveness in capturing long-range dependencies makes them suitable for tasks where understanding context over long sequences is crucial.
  5. Empirical Success: LSTMs have been empirically shown to outperform traditional RNNs on a wide range of tasks. Their ability to capture long-term dependencies and handle sequential data effectively has made them a popular choice in the machine learning and natural language processing communities.
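To make the memory cell and the three gates concrete, here is a minimal, illustrative LSTM step in Python/NumPy. This is a sketch only: the function name `lstm_step`, the sizes H and X, and the random weights are all hypothetical, and real frameworks such as Keras or PyTorch implement this (with per-gate biases and careful initialization) internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (simplified sketch). W maps the concatenated
    [h_prev, x] vector to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])        # forget gate: how much old memory to keep
    i = sigmoid(z[H:2*H])      # input gate: how much new info to write
    o = sigmoid(z[2*H:3*H])    # output gate: how much memory to expose
    g = np.tanh(z[3*H:4*H])    # candidate memory content
    c = f * c_prev + i * g     # memory cell: additive update
    h = o * np.tanh(c)         # hidden state passed to the next step
    return h, c

# Tiny usage example with hypothetical sizes and random weights
H, X = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * H, H + X)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, X)):   # 5 time steps of input
    h, c = lstm_step(x, h, c, W, b)
print(h)
```

Note the additive update `c = f * c_prev + i * g`: because the memory cell is updated by addition rather than repeated multiplication, gradients can flow through it across many time steps, which is exactly the gradient-flow property described in point 3.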

We will see this in the upcoming blog: What is LSTM, and how does it work? Then you will get a better idea.

Shameless self-promotion:

If you find this helpful, please comment, and let's discuss more concepts.



