CHATBOT


Chatbot: A Generative-Based Approach

During my internship as a data science intern, I developed a naive chatbot using a sequence-to-sequence model built from LSTM-based recurrent neural networks. I am sharing this tutorial, written explicitly for deep learning enthusiasts, to give a basic insight into how a chatbot can be developed with the help of recurrent neural networks.

Please refer to the slides above, as the pictures and illustrations for the article below are yet to be added.

WHAT IS A CHATBOT?

A Chatbot is a program that communicates with us.

A chatbot is a service, powered by rules and sometimes artificial intelligence, that we interact with via a chat interface.


Some chatterbots use sophisticated natural language processing systems, but many simpler systems scan for keywords within the input, then pull a reply with the most matching keywords, or the most similar wording pattern, from a database.

Today, chatbots are part of virtual assistants such as Google Assistant, Siri, and Cortana, and are accessed via many organizations' apps, websites, and instant messaging platforms such as Facebook Messenger.




WHY DO WE NEED CHATBOTS?

Trends show that users are spending more and more time on messaging apps. Chatbots can handle numerous conversations at once without requiring a person on the other end to answer messages by hand.


Apps consume a large share of a device's memory, so users do not want to install a separate app for every purpose. Trends show that over 90% of apps are uninstalled after their first use. Developing a chatbot takes significantly less time, and it is also easier to maintain and less expensive than an app.






TAXONOMY OF MODELS


Retrieval-based models (easier) use a repository of predefined responses and some kind of heuristic to pick an appropriate response based on the input and context.

I. They use a heuristic that can be as simple as a rule-based expression match; they don't generate any new text.

II. Or the heuristic can be as complex as an ensemble of machine learning classifiers.

III. They just pick a response from a fixed set.

IV. They don't make any grammatical mistakes.

V. In an open domain, however, it is impossible to build a repository of handcrafted responses for every possible input.

Generative models (harder) don’t rely on pre-defined responses. They generate new responses from scratch. Generative models are typically based on Machine Translation techniques, but instead of translating from one language to another, we “translate” from an input to an output (response).

I. A huge amount of data is needed to train the model.

II. On long text, these models make grammatical mistakes.

III. Even in a closed domain, generative models are tougher to train than retrieval-based models.

Encoder and Decoder




The encoder data will be the text from one side of the conversation, and the decoder data will be the responses. We tokenize each sentence by chopping it into words and giving every word a token ID, so that data retrieval is faster, and then train the model.
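As a minimal sketch of this step (the toy conversation pairs and names below are assumptions for illustration, not the project's actual data), assigning token IDs could look like this:

```python
# Minimal sketch: build a vocabulary and map each word to a token ID.
# The toy conversation pairs below are assumed for illustration.
encoder_texts = ["how are you ?", "what is your name ?"]   # one side of the conversation
decoder_texts = ["i am fine .", "i am a chatbot ."]         # the responses

def tokenize(sentence):
    """Chop a sentence into word tokens (whitespace split, for simplicity)."""
    return sentence.lower().split()

# Give every distinct word a token ID so lookups during training are fast.
vocab = {}
for sentence in encoder_texts + decoder_texts:
    for word in tokenize(sentence):
        if word not in vocab:
            vocab[word] = len(vocab)

def encode(sentence):
    """Replace each word with its token ID."""
    return [vocab[word] for word in tokenize(sentence)]

print(encode("how are you ?"))   # e.g. [0, 1, 2, 3]
```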

APPLICATIONS OF RECURRENT NEURAL NETWORKS IN NATURAL LANGUAGE PROCESSING TASKS

1. They allow us to score arbitrary sentences based on how likely they are to occur in the real world. This gives us a measure of grammatical and semantic correctness. (Used for machine translation.)

2. They allow us to generate new text. (Used for language modelling, i.e. the chatbot.)


 IDEA BEHIND RNN

I. To make use of sequential information.

II. In traditional NN, we assume that all inputs (and outputs) are independent of each other.

III. For NLP tasks this is a bad idea: if you want to predict the next word in a sentence, you had better know which words came before it.

IV. RNNs have a “memory” which captures information about what has been calculated so far.

 






WORKING PRINCIPLE OF RNN




1. x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to the second word of a sentence.

2. s_t is the hidden state at time step t. It is calculated from the previous hidden state and the current input:

s_t = f(U x_t + W s_{t-1})

where the function f is a non-linearity such as tanh or ReLU.

3. o_t is the output at step t:

o_t = softmax(V s_t)
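A rough NumPy sketch of this recurrence (the dimensions, initialization, and word index below are assumed purely for illustration):

```python
import numpy as np

# Assumed toy dimensions: vocabulary of 8000 words, hidden state of size 100.
vocab_size, hidden_size = 8000, 100

U = np.random.randn(hidden_size, vocab_size) * 0.01   # input  -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
V = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, s_prev):
    """One time step: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = softmax(V @ s_t)
    return s_t, o_t

# x_t is a one-hot vector for the current word; s_0 starts as zeros.
x_t = np.zeros(vocab_size); x_t[42] = 1.0   # arbitrary word index, for illustration
s_t, o_t = rnn_step(x_t, np.zeros(hidden_size))
```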


IMPORTANT POINTS ON RNN

I. Unlike a traditional deep neural network, a RNN shares the same parameters (U, V, W above) across all steps. This greatly reduces the total number of parameters we need to learn.

II. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps.

III. Certain types of RNNs, like LSTMs and GRUs (a simplified version of the LSTM), were specifically designed to overcome the vanishing gradient problem (difficulty in learning long-term dependencies).


LONG SHORT-TERM MEMORY NETWORKS (LSTM)

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!







THE CORE IDEA BEHIND LSTMS

I. The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

II. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

III. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through.


 


STEP-BY-STEP LSTM WALK THROUGH

I. To decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.”

It looks at h(t-1) and x(t), and outputs a number between 0 and 1 for each number in the cell state C(t-1).



 

II. To decide what new information we're going to store in the cell state.


a)   A sigmoid layer called the “input gate layer” decides which values we’ll update.


b)  A Tanh layer creates a vector of new candidate values, ~C(t), that could be added to the state. In the next step, we’ll combine these two to create an update to the state.





III. To update the old cell state C(t-1) into the new cell state C(t).


a)   Multiply the old state by f(t), forgetting the things we decided to forget earlier.


b) Then we add i(t) * ~C(t). These are the new candidate values, scaled by how much we decided to update each state value.





IV. We need to decide what we're going to output.


a)   First, we run a sigmoid layer which decides what parts of the cell state we’re going to output.


b) We put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
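Putting the four steps above together, one LSTM time step can be sketched in NumPy roughly as follows (the toy sizes and random weights are assumptions for illustration, not the project's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the walkthrough above; p holds the weights."""
    z = np.concatenate([h_prev, x_t])            # look at h(t-1) and x(t) together
    f_t = sigmoid(p["Wf"] @ z + p["bf"])         # forget gate: what to throw away
    i_t = sigmoid(p["Wi"] @ z + p["bi"])         # input gate: which values to update
    c_tilde = np.tanh(p["Wc"] @ z + p["bc"])     # candidate values ~C(t)
    c_t = f_t * c_prev + i_t * c_tilde           # new cell state C(t)
    o_t = sigmoid(p["Wo"] @ z + p["bo"])         # output gate
    h_t = o_t * np.tanh(c_t)                     # push C(t) to [-1, 1], then filter it
    return h_t, c_t

# Toy sizes and random weights, assumed purely for illustration.
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for k in ("Wf", "Wi", "Wc", "Wo")}
p.update({k: np.zeros(hidden) for k in ("bf", "bi", "bc", "bo")})

h, c = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), p)
```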



 

GATED RECURRENT UNITS (GRU)

1.   A GRU has two gates, an LSTM has three gates.


2.   GRUs don’t possess and internal memory (C(t)) that is different from the exposed hidden state. They don’t have the output gate that is present in LSTM.


3.   The input and forget gates are coupled by an update gate z and the reset gate r is applied directly to the previous hidden state. Thus, the responsibility of the reset gate in a LSTM is really split up into both r and z.


4.   We don’t apply a second nonlinearity when computing the output.



 ADDING A SECOND GRU LAYER

 1. Adding a second layer to our network allows our model to capture higher-level interactions.

2. We are likely to see diminishing returns after 2-3 layers, and unless we have a huge amount of data (which we don't), more layers are unlikely to make a big difference and may lead to overfitting.
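As a rough illustration with the Keras API (the hyperparameters below are assumed, not tuned values from the project), stacking a second GRU layer only requires the first layer to return its full sequence:

```python
import tensorflow as tf

# Assumed toy hyperparameters, purely for illustration.
vocab_size, embed_dim, hidden_size = 8000, 128, 256

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.GRU(hidden_size, return_sequences=True),  # first GRU layer feeds its whole sequence upward
    tf.keras.layers.GRU(hidden_size),                          # second GRU layer captures higher-level interactions
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```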





GRU VS LSTM

a.    In many tasks both architectures yield comparable performance and tuning hyperparameters like layer size is probably more important than picking the ideal architecture.


b.   GRUs have fewer parameters (U and W are smaller) and thus may train a bit faster or need less data to generalize.


c.    On the other hand, if you have enough data, the greater expressive power of LSTMs may lead to better results.



 PRE-PROCESSING THE DATA

1)  TOKENIZE TEXT

We want to make predictions on a per-word basis. This means we must tokenize our comments into sentences, and sentences into words.

The sentence “He left!” should be 3 tokens: “He”, “left”, “!”.


2)  REMOVE INFREQUENT WORDS


Most words in our text will only appear one or two times. It's a good idea to remove these infrequent words, as having a huge vocabulary will make our model slow to train.


3)  PADDING

Before training, we convert the variable-length sequences in the dataset into fixed-length sequences by padding. We use a few special symbols to fill in the sequence.

1. EOS : End of sentence

2. PAD : Filler

3. GO : Start decoding

4. UNK : Unknown; word not in vocabulary

 Consider the following query-response pair:

Q : How are you?

A : I am fine.

Assuming that we would like our sentences (queries and responses) to be of fixed length, 10, this pair will be converted to:

Q : [ PAD, PAD, PAD, PAD, PAD, PAD, “?”, “you”, “are”, “How” ]

A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
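A small sketch of this padding scheme (the helper names are made up for illustration) that reproduces the pair above:

```python
PAD, GO, EOS = "PAD", "GO", "EOS"

def pad_query(tokens, length):
    """Queries are reversed and left-padded, as in the example above."""
    return [PAD] * (length - len(tokens)) + list(reversed(tokens))

def pad_response(tokens, length):
    """Responses get a GO symbol, an EOS symbol, then right-padding."""
    padded = [GO] + tokens + [EOS]
    return padded + [PAD] * (length - len(padded))

print(pad_query(["How", "are", "you", "?"], 10))
# ['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', '?', 'you', 'are', 'How']
print(pad_response(["I", "am", "fine", "."], 10))
# ['GO', 'I', 'am', 'fine', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD']
```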

4)  BUCKETING


·       If the largest sentence in our dataset is of length 100, we need to encode all our sentences to be of length 100 in order not to lose any words. Now, what happens to "How are you?"? There will be 97 PAD symbols in the encoded version of the sentence, which will overshadow the actual information in the sentence.


·       Bucketing kind of solves this problem, by putting sentences into buckets of different sizes.


Consider this list of buckets: [ (5,10), (10,15), (20,25), (40,50) ]


If the length of a query is 4 and the length of its response is 4 (as in our previous example), we put this sentence in the bucket (5,10). The query will be padded to length 5 and the response will be padded to length 10.


If we are using the bucket (5,10), our sentences will be encoded to:


 Q : [ PAD, “?”, “you”, “are”, “How” ]

A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
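A minimal sketch of choosing a bucket for a query-response pair (the helper below is hypothetical, but it follows the bucket list above):

```python
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

def choose_bucket(query_tokens, response_tokens):
    """Pick the smallest bucket that both sequences fit into
    (the response needs 2 extra slots for the GO and EOS symbols)."""
    for q_size, r_size in BUCKETS:
        if len(query_tokens) <= q_size and len(response_tokens) + 2 <= r_size:
            return (q_size, r_size)
    raise ValueError("sequence pair too long for every bucket")

print(choose_bucket(["How", "are", "you", "?"], ["I", "am", "fine", "."]))  # (5, 10)
```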


WORD EMBEDDING

CO-OCCURRENCE MATRIX

 Since deep learning loves math, we’re going to represent each word as a d-dimensional vector.


Here, there are 6 distinct words, so each word will be represented as a 6-dimensional vector.



Extracting the rows from this matrix can give us a simple initialization of our word vectors.
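As a small sketch, and assuming the example sentence is "I love NLP and I like dogs" (chosen to match the 6 distinct words and the counts discussed below), the co-occurrence matrix could be built like this:

```python
import numpy as np

# Assumed example sentence with 6 distinct words, as referenced above.
tokens = "I love NLP and I like dogs".split()
vocab = sorted(set(tokens))                     # ['I', 'NLP', 'and', 'dogs', 'like', 'love']
index = {word: i for i, word in enumerate(vocab)}

# Count co-occurrences of adjacent words (window of 1), symmetrically.
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for left, right in zip(tokens, tokens[1:]):
    cooc[index[left], index[right]] += 1
    cooc[index[right], index[left]] += 1

# Each row of the matrix is a simple 6-dimensional word vector.
print(vocab)
print(cooc)
```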

 INFERENCE FROM THE ABOVE EXAMPLE

I. Notice that the words ‘love’ and ‘like’ both contain 1’s for their counts with nouns (NLP and dogs).

II. They also have 1’s for the count with “I”, thus indicating that the words must be some sort of verb.

III. With a larger dataset than just one sentence, it can be imagined that this similarity will become more clear as ‘like’, ‘love’, and other synonyms will begin to have similar word vectors, because of the fact that they are used in similar contexts.

LIMITATION

I. The dimensionality of each word will increase linearly with the size of the corpus.

II. If we had a million words (not really a lot by NLP standards), we'd have a million-by-million matrix, which would be extremely sparse (lots of 0's). Definitely not the best in terms of storage efficiency. The Word2Vec approach below is an alternative.

 WORD2VEC APPROACH

Word2Vec operates on the idea that we want to predict the surrounding words of every word.

 We’re going to look at the first 3 words of this sentence.

Window size m=3.

The goal is to take the center word, ‘love’, and predict the words that come before and after it, by maximizing the log probability of any context word given the current center word.


The objective function is the average log probability of the context words given the center word:

J = (1/T) Σ_t Σ_{-m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)

where the probability of an outer word o given the center word c is the softmax:

p(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c)

The above cost function is basically saying that we’re going to add the log probabilities of ‘I’ and ‘love’ as well as ‘NLP’ and ‘love’ (where ‘love’ is the center word in both cases).

v_c is the word vector of the center word. Every word has two vector representations (u_o and u_w): one for when the word is used as the center word and one for when it's used as the outer word. The vectors are trained with stochastic gradient descent.


Word2Vec seeks to find vector representations of different words by maximizing the log probability of context words given a center word and modifying the vectors through SGD.
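In practice, a skip-gram model like this can be trained with a library such as gensim; a minimal sketch (the toy corpus and hyperparameters are assumed, and the argument names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Assumed toy corpus: a list of tokenized sentences.
sentences = [
    ["i", "love", "nlp"],
    ["i", "like", "dogs"],
    ["deep", "learning", "loves", "math"],
]

# sg=1 selects the skip-gram objective; window is the context size m.
model = Word2Vec(sentences, vector_size=100, window=3, sg=1, min_count=1, epochs=50)

print(model.wv["love"][:5])            # first few dimensions of the learned vector
print(model.wv.most_similar("love"))   # nearby words in the embedding space
```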

The most interesting contribution of Word2Vec was the appearance of linear relationships between different word vectors.



After training, the word vectors seemed to capture different grammatical and semantic concepts.

It’s pretty incredible how these linear relationships could be formed through a simple objective function and optimization technique.




SEQUENCE TO SEQUENCE MODEL FOR CHATBOT

The sequence-to-sequence model has become the go-to model for dialogue systems and machine translation.

It consists of two RNNs (recurrent neural networks, LSTM or GRU):

I. An encoder

II. A decoder

Encoder

1.   The encoder takes a sequence (sentence) as input and processes one symbol (word) at each time step.

2.   Its objective is to convert a sequence of symbols into a fixed size feature vector that encodes only the important information in the sequence while losing the unnecessary information.

3.   You can visualize data flow in the encoder along the time axis, as the flow of local information from one end of the sequence to another.

4.   Each hidden state influences the next hidden state and the final hidden state can be seen as the summary of the sequence. This state is called the context or thought vector, as it represents the intention of the sequence.

5.   From the context, the decoder generates another sequence, one symbol(word) at a time. Here, at each time step, the decoder is influenced by the context and the previously generated symbols.
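A rough Keras sketch of this encoder-decoder wiring (layer sizes, names, and the training setup are assumptions for illustration, not the project's exact model):

```python
import tensorflow as tf

vocab_size, embed_dim, hidden_size = 8000, 128, 256   # assumed toy sizes

# Encoder: reads the input sequence and keeps only its final states (the "thought vector").
encoder_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(encoder_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(hidden_size, return_state=True)(enc_emb)

# Decoder: generates the response one word at a time, conditioned on the context.
decoder_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(decoder_inputs)
dec_out, _, _ = tf.keras.layers.LSTM(hidden_size, return_sequences=True,
                                     return_state=True)(dec_emb, initial_state=[state_h, state_c])
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(dec_out)

model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```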


 DUAL ENCODER LSTM ALGORITHM FOR SEQ2SEQ

1.   Both the context and the response text are split into words, and each word is embedded into a vector. The word embeddings are initialized with Word2Vec skip-gram vectors and are fine-tuned during training.


2.   Both the embedded context and response are fed into the same Recurrent Neural Network word-by-word. The RNN generates a vector representation that, loosely speaking, captures the “meaning” of the context and response (c and r in the picture). We can choose how large these vectors should be, but let’s say we pick 256 dimensions.



3.   We multiply c with a matrix M to “predict” a response r’. If c is a 256-dimensional vector, then M is a 256×256 dimensional matrix, and the result is another 256-dimensional vector, which we can interpret as a generated response. The matrix M is learned during training.




4.   We measure the similarity of the predicted response r’ and the actual response r by taking the dot product of these two vectors.



5.   A large dot product means the vectors are similar and that the response should receive a high score.


6.   We then apply a sigmoid function to convert that score into a probability.
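The scoring in steps 3-6 can be sketched as follows (here c, r, and M are random placeholders standing in for the encoded context, the encoded response, and the learned matrix):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

dim = 256                                     # chosen context/response vector size
rng = np.random.default_rng(0)

c = rng.normal(size=dim)                      # encoded context (from the shared RNN, assumed)
r = rng.normal(size=dim)                      # encoded candidate response
M = rng.normal(scale=0.01, size=(dim, dim))   # learned during training

r_pred = M @ c                                # "predicted" response r'
score = r_pred @ r                            # dot product: large when r' and r are similar
probability = sigmoid(score)                  # convert the score into a probability
```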








 REFERENCES

https://github.com/Marsan-Ma/chat_corpus

 (Sources of data for trial ChatBot)  


https://karpathy.github.io/2015/05/21/rnn-effectiveness/

https://cs231n.github.io/optimization-1/

https://colah.github.io/posts/2015-08-Backprop/

https://cs231n.github.io/optimization-2/

https://neuralnetworksanddeeplearning.com/chap2.html

https://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture2.pdf

https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

https://sebastianruder.com/word-embeddings-1/

https://suriyadeepan.github.io/2016-06-28-easy-seq2seq/

 https://www.tensorflow.org/tutorials/seq2seq

https://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

https://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/

https://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/

https://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-



