NMT Architecture

In my previous post, I shared a high-level understanding of the NMT (Neural Machine Translation) architecture.

So, continuing from there:

Staying with the same language-translation context, let's see how its different parts work together. At a high level, we have four major components:

  1. Embedding Layer - the vector (numerical) representation of the text data
  2. Encoder - understands the source language and condenses the patterns it learns into what we call a context/thought vector
  3. Context Vector - the summarized representation of the source sentence produced by the encoder
  4. Decoder - responsible for decoding the context vector into the desired translation


Let's connect the dots between the embedding layer, the encoder, the context vector, and the decoder:

  1. We use two word-embedding layers, one for the source language and the other for the target, to better represent the semantics of the words in the respective languages.


  2. The encoder is responsible for generating a thought vector (or context vector) that represents what the source sentence means.

  • The encoder is an RNN cell.
  • At time step t_0, the encoder is initialized with a zero vector by default. After processing the sequence of source words, it produces the context vector, which is its final hidden state.


  3. The idea of the context vector is to concisely represent a source-language sentence.

  • In contrast to the encoder's state, which is initialized with zeros, the decoder's initial state is the context vector itself.
  • This is the link between the encoder and the decoder, and it makes the whole model end-to-end differentiable.


  4. The decoder is responsible for decoding the context vector into the desired translation. Our decoder is an RNN as well.

  • The context vector is the only piece of information available to the decoder about the source sentence, making it the crucial link between the encoder and the decoder.
  • After being initialized with the context vector as its initial state, the decoder learns the patterns in the target text.
  • Although it is possible for the encoder and decoder to share the same set of weights, it is usually better to use two different networks. This increases the number of parameters in the model, allowing it to learn the translations more effectively.
  • For prediction, we apply a softmax over the target vocabulary to predict each word (see the code sketch after this list).

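Here is a minimal sketch of how these four pieces wire together in Keras. The vocabulary sizes, embedding dimension, and GRU size below are assumed values chosen only for illustration; my notebook (linked at the end) uses its own settings.

```python
# Minimal encoder-decoder sketch: two embeddings, a GRU encoder whose final
# state is the context vector, and a GRU decoder + softmax over the target vocab.
# All sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

src_vocab_size = 8000   # assumed source vocabulary size
tgt_vocab_size = 8000   # assumed target vocabulary size
embedding_dim = 256     # assumed embedding dimension
hidden_units = 512      # assumed GRU state size

# 1. Two separate embedding layers, one per language.
src_embedding = layers.Embedding(src_vocab_size, embedding_dim)
tgt_embedding = layers.Embedding(tgt_vocab_size, embedding_dim)

# 2. Encoder: a GRU whose final hidden state acts as the context vector.
encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="source_ids")
encoder_gru = layers.GRU(hidden_units, return_state=True, name="encoder_gru")
_, context_vector = encoder_gru(src_embedding(encoder_inputs))  # zero initial state by default

# 3. + 4. Decoder: another GRU initialized with the context vector,
# followed by a softmax over the target vocabulary at every time step.
decoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="target_ids")
decoder_gru = layers.GRU(hidden_units, return_sequences=True, name="decoder_gru")
decoder_states = decoder_gru(tgt_embedding(decoder_inputs), initial_state=context_vector)
predictions = layers.Dense(tgt_vocab_size, activation="softmax", name="softmax_out")(decoder_states)

nmt_model = tf.keras.Model([encoder_inputs, decoder_inputs], predictions)
nmt_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
nmt_model.summary()
```

Note how the only tensor flowing from the encoder side to the decoder side is `context_vector`, which is exactly why the whole model stays end-to-end differentiable.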

The full NMT system, with the details of how the GRU cell in the encoder connects to the GRU cell in the decoder and how the softmax layer outputs predictions, is shown below:

Source: NLP with TensorFlow by

We can also add an attention mechanism to our decoder, which I briefly discussed in my previous post. In brief, adding attention gives the decoder access to the encoder's hidden states at every step, so it can learn more about the source sentence.
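As a small preview, here is a sketch of a Bahdanau-style attention step, assuming the decoder's current state and the full sequence of encoder states from a model like the one sketched above (layer sizes are again illustrative, not the exact ones from my notebook):

```python
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(tf.keras.layers.Layer):
    """Scores every encoder state against the current decoder state."""

    def __init__(self, units):
        super().__init__()
        self.W_enc = layers.Dense(units)
        self.W_dec = layers.Dense(units)
        self.v = layers.Dense(1)

    def call(self, decoder_state, encoder_states):
        # decoder_state: (batch, hidden), encoder_states: (batch, src_len, hidden)
        score = self.v(tf.nn.tanh(
            self.W_enc(encoder_states) + self.W_dec(decoder_state)[:, tf.newaxis, :]
        ))                                      # (batch, src_len, 1)
        weights = tf.nn.softmax(score, axis=1)  # attention over source positions
        context = tf.reduce_sum(weights * encoder_states, axis=1)  # weighted sum of encoder states
        return context, weights
```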

In my next post, I'll discuss the attention mechanism in more detail and explain why a single context vector is not sufficient to produce good-quality translations.


BTW, if you are interested in learning more about this, here is my very in-depth notebook on this topic, explaining the concepts and code implementation in great detail.

GitHub Link: Seq2Seq Learning - Implementing NMT System

