Why LSTM?
CHETAN SALUNKE
Because a simple RNN suffers from two main problems:
1) Vanishing Gradient Problem
2) Exploding Gradient Problem
What is the vanishing gradient problem, or the problem of losing long-term memory (the problem of failing to remember long sentences)?
Example: 'Maharashtra' is a beautiful state. It has a total of 35 districts and is called the Land of Saints. Mumbai is its capital city. Pune, another major city, is renowned for its educational institutions and thriving IT industry. The state has a diverse culture. The language spoken is 'Marathi'.
Suppose we are working on a Next Word Predictor project. To predict the word 'Marathi', the most informative word is clearly 'Maharashtra'. An RNN does give importance to the preceding words, but the farther back a word appears in the sequence, the less it contributes to predicting the next word (this is because of the gradient problem that we will see further). In our example, if 'Maharashtra' were close to 'Marathi' it would contribute strongly to the prediction, but here it is far away from 'Marathi', so there is a higher chance our next word predictor will give a wrong prediction.
The above issue happens because of either the vanishing gradient problem or the exploding gradient problem. When we pass text word by word (token by token) to an RNN over increasing time steps (t1, t2, ..., tn), the network computes activations in the forward pass, and during backpropagation the weights are updated using gradients computed via the chain rule. For long texts, the gradient becomes a product of many terms; repeatedly multiplying fractions (values between 0 and 1) drives the gradient toward zero, so the weight updates effectively stop after a certain number of time steps. This is called the Vanishing Gradient Problem. In contrast, if these terms are greater than 1, the gradient grows rapidly and causes improper, unstable weight updates. This is called the Exploding Gradient Problem.
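Here is a minimal NumPy sketch (with purely illustrative numbers, not from this article) of how repeatedly multiplying per-step factors across time steps either shrinks a gradient toward zero or blows it up:

```python
import numpy as np

def backprop_factor(step_factor: float, time_steps: int) -> float:
    """Product of one per-step gradient factor repeated over `time_steps` steps.

    The factor stands in for the size of d h_t / d h_{t-1} during
    backpropagation through time; the values used below are illustrative only.
    """
    return float(np.prod(np.full(time_steps, step_factor)))

for steps in (5, 20, 50):
    small = backprop_factor(0.5, steps)   # fraction between 0 and 1
    large = backprop_factor(1.5, steps)   # value greater than 1
    print(f"{steps:>2} steps | 0.5^n = {small:.2e} | 1.5^n = {large:.2e}")

# As the number of time steps grows:
#   factors < 1 drive the product toward zero  -> vanishing gradient
#   factors > 1 make the product blow up       -> exploding gradient
```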
Let's understand this mathematically.
The gradient calculation involves unfolding the network over the time steps and applying the chain rule to compute gradients at each time step (the number of time steps equals the number of tokens).
Let's denote:
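Since the original notation is not reproduced here, the following is a sketch of the standard BPTT formulation for a simple RNN (my own symbols, assuming a tanh hidden state):

```latex
% Hidden state and total loss over n time steps (standard simple-RNN form)
h_t = \tanh(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b), \qquad L = \sum_{t=1}^{n} L_t

% Chain rule for the gradient of the loss at step t w.r.t. the recurrent weights
\frac{\partial L_t}{\partial W_{hh}}
  = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t}
    \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
    \frac{\partial h_k}{\partial W_{hh}},
\qquad
\frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}\!\left(1 - h_i^{2}\right) W_{hh}
```

The repeated product of Jacobians is the culprit: when their norms stay below 1 the product shrinks toward zero, and when they exceed 1 it grows without bound.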
This process continues until the gradient is computed for all time steps. However, in practice, the gradient computation can suffer from the vanishing or exploding gradient problem, especially for long sequences.
To solve this problem, deep learning researchers developed an advanced RNN architecture called LSTM.
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture that are particularly effective in capturing long-range dependencies and mitigating the vanishing gradient problem, which is commonly encountered in standard RNNs.
There are several reasons why LSTM networks are often preferred. We will cover them in the upcoming blog, "What is LSTM and how does it work?", where you will get a better idea.
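As a small preview, here is a hypothetical Keras sketch (layer sizes and names are illustrative assumptions, not from this article) showing how a next-word-predictor model would swap a SimpleRNN layer for an LSTM layer:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000  # illustrative vocabulary size

# Simple RNN version: prone to vanishing/exploding gradients on long contexts.
rnn_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.SimpleRNN(128),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])

# LSTM version: the gated cell state helps carry "Maharashtra" all the way to "Marathi".
lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])

lstm_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```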
Shameless self-promotion:
If you found this helpful, please comment and let's discuss more concepts.
NLP || DEEP LEARNING || GEN AI || LLMS