BERT - Who?
BERT - Bidirectional Encoder Representations from Transformers, isn’t that a tongue twister!
Five years ago, Google published a paper titled "Attention Is All You Need". Google has been at the forefront of every major AI initiative, whether it was TensorFlow then or Transformers now.
This paper introduced a groundbreaking new neural network architecture that led to ChatGPT and many other models, turning the page to a new era of AI development.
You might ask, what’s so groundbreaking about this paper?
The paper advances the use of the attention mechanism, which is the key building block of the Transformer model.
Seq2Seq -
Before the Transformer, Seq2Seq models built on recurrent (and later convolutional) networks were the building blocks for most NMT (Neural Machine Translation) problems. Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. But they come with a limitation on long sequences: their ability to retain the first elements fades as new elements are incorporated into the sequence. Seq2Seq models use an Encoder-Decoder architecture under the hood. For an NMT problem, when a sentence is provided as input, the model processes each word in turn, producing a hidden state that is passed to the next encoder step along with the next word in the sequence. This repeats until the end of the sentence, so the final hidden state becomes the context vector, biased towards the later words and lacking awareness of the overall sentence construct. When the decoder relies on this single context vector, the limitation shows up in the translation output.
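A minimal sketch of that bottleneck (illustrative only, with toy sizes; not the code of any particular NMT system):

```python
# The encoder compresses the whole input sentence into one final hidden state,
# and the decoder must translate from that single context vector alone.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128        # toy sizes for illustration

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)

src = torch.randint(0, vocab_size, (1, 12))            # a 12-token "sentence"
_, context = encoder(embedding(src))                   # context: (1, 1, hidden_dim)

# The decoder sees ONLY this fixed-size context vector, no matter how long the
# input was -- this is the "lack of awareness" described above.
tgt = torch.randint(0, vocab_size, (1, 1))             # first decoder input token
out, _ = decoder(embedding(tgt), context)
print(context.shape, out.shape)
```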
Attention -
Now, the Attention concept, which is the key differentiator, avoids relying on the last hidden state of the encoder alone and instead uses a weighted function of all the encoder hidden states, allowing the decoder to assign higher or lower weights to certain elements of the input for each element of the output. Attention comes with its own limitations: it is still sequential in nature, meaning the encoder and decoder must wait for step (t-1) to complete before step t can start, so training times are longer and computationally intense.
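A small sketch of the idea (illustrative, not taken from the paper): the decoder scores every encoder hidden state against its current state and takes a weighted sum instead of using only the last state.

```python
import torch
import torch.nn.functional as F

hidden_dim, src_len = 128, 12                          # toy sizes
encoder_states = torch.randn(src_len, hidden_dim)      # one vector per input word
decoder_state = torch.randn(hidden_dim)                # current decoder hidden state

scores = encoder_states @ decoder_state                # one score per input word
weights = F.softmax(scores, dim=0)                     # higher weight = more relevant
context = weights @ encoder_states                     # weighted sum over ALL states
print(weights.shape, context.shape)                    # (12,) and (128,)
```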
Transformers -
This is where Transformers come in, to address the drawbacks of the attention-augmented Seq2Seq model. The Transformer, when introduced, was a transduction model relying entirely on self-attention to compute representations of its input and output, without any recurrent sequence model. It allows parallelization by replacing recurrent units with just weighted sums and activations.
Self Attention -
As the architecture shows, the Transformer is not just an attention model but a self-attention model. Self-attention is the process of applying the attention mechanism of the input vector with itself. Essentially, in an NMT problem, each word in the input sentence flows through its own path in the encoder; dependencies between words are created in the self-attention layer rather than the feed-forward layer, which allows parallel execution.
How does self-attention work in execution?
Three vectors are computed for each input: a query, a key, and a value. Each is the product of a learned weight matrix and the input's position-aware embedding, i.e. q_i = x_i · W_Q, k_i = x_i · W_K and v_i = x_i · W_V. Each word's query is scored against every word's key, the scores are scaled and passed through a softmax, and the result weights the values.
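Here is a sketch of scaled dot-product self-attention following the "Attention Is All You Need" formulation (sizes and weight matrices are illustrative, randomly initialised rather than learned):

```python
import math
import torch
import torch.nn.functional as F

seq_len, d_model, d_k = 6, 64, 64                      # toy dimensions
x = torch.randn(seq_len, d_model)                      # embedded + positional inputs

W_q = torch.randn(d_model, d_k)                        # projection matrices
W_k = torch.randn(d_model, d_k)                        # (learned during training)
W_v = torch.randn(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v                    # queries, keys, values
scores = Q @ K.T / math.sqrt(d_k)                      # every word scores every word
attn = F.softmax(scores, dim=-1)                       # each row sums to 1
out = attn @ V                                         # contextualised representations
print(attn.shape, out.shape)                           # (6, 6) and (6, 64)
```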
Multi-Head Self Attention -
When we extrapolate this from a single set of query, key, and value projections to several sets run in parallel, each attending to the inputs in its own way, we get the multi-head self-attention architecture: the heads capture different kinds of relationships between inputs, and their outputs are concatenated and projected back to the model dimension.
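Multi-head self-attention in a few lines, using PyTorch's built-in module (a sketch; the head count and dimensions are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 8, 6
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)                   # (batch, sequence, embedding)
out, weights = mha(x, x, x)                            # self-attention: Q = K = V = x
print(out.shape, weights.shape)                        # (1, 6, 64) and (1, 6, 6)
```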
Recap -
Seq2Seq models are great at many things and were pioneering, but we needed better models to address drawbacks like the bias towards later words and the lack of awareness of the input sentence construct. The Attention architecture is a step forward, presenting all the hidden states of the inputs to the decoder, but it is memory-intense and offers no parallelization. Transformer models address this with self-attention, which carries forward three summary vectors (query, key, value) for each input in relation to all the other inputs and their outputs, enabling parallel processing. A single self-attention head captures only one view of the inputs, and inputs may have other correlations when associated with each other, so multi-head self-attention repeats the process across several heads, allowing richer relations and context to be captured. Let's call all these pieces a Transformer block for ease of reference.
What does all of this have to do with BERT?
BERT consists of a stack of Transformer blocks, pre-trained on a large corpus of 800 million words from English books and 2.5 billion words of text from English Wikipedia articles.
Pre-training of the BERT network includes two key tasks: masked language modelling (masking) and next sentence prediction.
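To get a quick taste of the masking objective, here is a sketch using the Hugging Face transformers library (assuming it is installed; the pre-trained model is downloaded on first run):

```python
from transformers import pipeline

# BERT fills in the [MASK] token using both left and right context.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
```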
BERT comes with many other features baked into pre-training, like WordPiece tokenization, special tokens prepended to the input ([CLS], [SEP]), and a single task-specific output layer added for fine-tuning, which allow the model to perform at human level on various language-based tasks.
The largest BERT model uses 24 Transformer blocks, 1024 embedding dimensions and 16 attention heads, altogether around 340 million parameters.
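You can sanity-check those numbers by building a BERT-large style configuration with the Hugging Face transformers library (a sketch; the weights are randomly initialised, and the exact count depends on vocabulary size and whether the pre-training heads are included):

```python
from transformers import BertConfig, BertModel

config = BertConfig(num_hidden_layers=24, hidden_size=1024,
                    num_attention_heads=16, intermediate_size=4096)
model = BertModel(config)

# Prints a figure in the hundreds of millions, in line with BERT-large.
print(sum(p.numel() for p in model.parameters()) / 1e6, "million parameters")
```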
BERT was a marvel of creation and human ingenuity in the AI space. I hope this article generates some interest in learning more, or at least some understanding of the recipe behind this model. There are many other forks along the path of BERT and Transformers to talk about, which I will save to study and write about another day.
Peace out!