BERT - Who?

BERT - Bidirectional Encoder Representations from Transformers. Isn't that a tongue twister!

Five years ago, Google published the paper "Attention Is All You Need". Google has been at the forefront of every major AI initiative, whether it was TensorFlow then or Transformers now.


This paper introduced a new neural network architecture that has been groundbreaking: it led to ChatGPT and many other models, and turned the page in the AI world to a new era of development.

You might ask, what’s so groundbreaking about this paper?

This paper describes advances in the use of the attention mechanism, which is the main improvement behind the Transformer model.

Seq2Seq -

Before the Transformer, Seq2Seq models built on RNNs (and sometimes CNNs) were the building blocks for most NMT (Neural Machine Translation) problem statements. Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. But these models come with a limitation on long sequences: their ability to retain the first elements is lost as new elements are incorporated into the sequence. Seq2Seq models use an Encoder & Decoder architecture under the hood, which means that for an NMT problem, when a sentence is provided as input, the model processes each word in the sentence to create a hidden state, which is passed to the next encoder step along with the next word in the sequence. This repeats until the end of the sentence, ending up with the final hidden state as the context vector, with an imbalance toward the later words and a lack of awareness of the input sentence construct. When the Decoder uses this context vector, the limitation becomes visible in the translation output.
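
To make that concrete, here is a minimal sketch (a toy RNN encoder in NumPy with made-up sizes and random weights, not a real NMT model) of how each word updates a single hidden state that becomes the context vector handed to the decoder:

    # Toy Seq2Seq RNN encoder: strictly sequential, ends with one fixed-size summary.
    import numpy as np

    hidden_size, embed_size = 8, 4
    W_h = np.random.randn(hidden_size, hidden_size) * 0.1   # hidden-to-hidden weights
    W_x = np.random.randn(hidden_size, embed_size) * 0.1    # input-to-hidden weights

    sentence = [np.random.randn(embed_size) for _ in range(6)]  # 6 embedded words

    h = np.zeros(hidden_size)                    # initial hidden state
    for x_t in sentence:                         # one step per word, cannot be parallelized
        h = np.tanh(W_h @ h + W_x @ x_t)         # new state mixes old state + current word

    context_vector = h                           # everything the decoder gets to see
    print(context_vector.shape)                  # (8,) - a single fixed-size summary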

[Image: Seq2Seq Model Encoder & Decoder]

Attention -

Now, the Attention concept, which is the key differentiator, avoids using only the last hidden state of the encoder and instead uses a weighted function of all the encoder hidden states, allowing the decoder to assign higher or lower weights to certain elements of the input for each element of the output. Attention comes with its own set of limitations: it is still sequential in nature, which means the encoder and decoder must wait for step (t-1) to be completed before step t can be computed. So training times are longer and the computation is intensive.
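
As a rough sketch (toy NumPy code with made-up sizes, simplified to a dot-product score rather than the exact additive attention of the original papers), the decoder scores every encoder hidden state against its current state and takes a softmax-weighted sum instead of relying on the last state alone:

    # Toy attention step: score all encoder states, then take a weighted sum.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    encoder_states = np.random.randn(6, 8)      # one hidden state per input word
    decoder_state  = np.random.randn(8)         # decoder state at the current output step

    scores  = encoder_states @ decoder_state    # how relevant is each input word?
    weights = softmax(scores)                   # normalized attention weights
    context = weights @ encoder_states          # weighted sum over ALL encoder states

    print(weights.round(2), context.shape)      # 6 weights, context of shape (8,)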

[Image: Attention Model]

Transformers -

This is where the Transformer comes in, to address the drawbacks of the Attention model. The Transformer, when introduced, was a transduction model relying entirely on self-attention to compute representations of its input and output, without sequence-aligned RNNs. The Transformer allows for parallelization by not using recurrent units, relying instead on just weighted sums and activations.

[Image: Transformer model architecture]

Self-Attention -

As you can see from the architecture above, the Transformer is not just an attention model but a Self-Attention model. Self-Attention is the process of applying the attention mechanism to the input sequence with itself. Essentially, in an NMT problem, each word in the input sentence flows through its own path in the encoder; the dependencies between positions are created in the self-attention layer rather than in the feed-forward layer, thus allowing parallel execution.


How does self-attention work in practice?

In the build logic, three vectors are considered: query, key, and value. For each input, all three vectors are created as the product of weight matrices learned during pre-training and the input's position-encoded embedding vector, i.e. (a short sketch follows the list below):

  • Query - the input's vector that is compared with every other vector to establish the weights for its own output.
  • Key - the input's vector that every other input compares against, establishing the weights for the outputs of the other input vectors.
  • Value - the input's vector that is combined in the weighted sum to produce each output vector.
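
Putting the three vectors together, a minimal single-head self-attention sketch (toy NumPy code; the random matrices below stand in for the pre-trained projection weights) looks like this:

    # Single-head scaled dot-product self-attention over a toy sequence.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    seq_len, d_model = 5, 16
    X = np.random.randn(seq_len, d_model)          # embedded + position-encoded inputs

    W_q = np.random.randn(d_model, d_model) * 0.1  # stand-ins for pre-trained weights
    W_k = np.random.randn(d_model, d_model) * 0.1
    W_v = np.random.randn(d_model, d_model) * 0.1

    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # query, key, value for every input at once

    scores  = Q @ K.T / np.sqrt(d_model)           # every word attends to every word
    weights = softmax(scores, axis=-1)             # each row sums to 1
    output  = weights @ V                          # weighted sum of value vectors

    print(output.shape)                            # (5, 16) - all positions computed in parallel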

Multi-Head Self-Attention -

When we extrapolate this from a single attention computation over individual inputs to several computations run in parallel, each looking at the inputs in combination with the other inputs (much like moving from single words to n-grams in an NMT problem), we get the multi-head self-attention architecture.
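
A sketch of the multi-head version (again toy NumPy code with assumed sizes): each head gets its own smaller projections, runs the same attention computation, and the head outputs are concatenated and projected back to the model dimension:

    # Multi-head self-attention: several smaller heads in parallel, then concatenate.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    seq_len, d_model, n_heads = 5, 16, 4
    d_head = d_model // n_heads
    X = np.random.randn(seq_len, d_model)

    heads = []
    for _ in range(n_heads):                                   # each head has its own projections
        W_q, W_k, W_v = (np.random.randn(d_model, d_head) * 0.1 for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        heads.append(weights @ V)                              # (seq_len, d_head) per head

    W_o = np.random.randn(d_model, d_model) * 0.1
    output = np.concatenate(heads, axis=-1) @ W_o              # back to (seq_len, d_model)
    print(output.shape)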

Recap -

Seq2Seq models are great at many things and are pioneering models, but we needed better models to address drawbacks like the imbalance and lack of awareness of the input sentence construct. The Attention architecture is a step forward, presenting all the hidden states of the inputs to the decoder, but it is memory-intensive and offers no parallelization. The Transformer addresses this with Self-Attention, which, instead of one sequentially built vector, carries forward three summary vectors for each input in relation to all the other input vectors and their output vectors, thus enabling parallel processing. But self-attention operates at the individual-input level, and an input may have other correlations that only appear in association with other inputs; multi-head self-attention repeats the self-attention computation across several heads, allowing those relations and contexts to be captured. Let's call all these pieces a Transformer block for ease of reference.

What does all of this have to do with BERT?

BERT consists of a stack of Transformer blocks, and this stack is pre-trained on a large corpus: 800 million words from English books and 2.5 billion words of text from English Wikipedia articles.

Pre-training of the BERT network includes two key tasks: Masking (masked language modeling) and Next Sentence Prediction.

  • Masking allows the model to learn a representation of every word in the sequence, because training includes predicting the original words behind masked or randomly substituted tokens.
  • Next Sentence Prediction involves predicting whether a candidate second sentence actually follows the first, where candidates are drawn from random places in the corpus alongside the right match (a toy illustration follows this list).
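
As a toy illustration (hypothetical sentences and a simplified version of the 15% masking rule; the real pipeline operates on WordPiece tokens and sometimes substitutes random words instead of [MASK]), the two kinds of training examples can be constructed like this:

    # Toy construction of BERT-style pre-training examples.
    import random
    random.seed(0)

    sentence_a        = "the cat sat on the mat".split()
    sentence_b_next   = "it purred happily".split()          # true next sentence -> label "IsNext"
    sentence_b_random = "stocks fell sharply today".split()  # random sentence    -> label "NotNext"

    def mask_tokens(tokens, mask_prob=0.15):
        """Replace ~15% of ordinary tokens with [MASK]; remember the originals as labels."""
        masked, labels = [], []
        for tok in tokens:
            if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
                masked.append("[MASK]")
                labels.append(tok)      # the model must predict the original word here
            else:
                masked.append(tok)
                labels.append(None)     # nothing to predict at this position
        return masked, labels

    is_next_pair  = ["[CLS]", *sentence_a, "[SEP]", *sentence_b_next, "[SEP]"]    # label: IsNext
    not_next_pair = ["[CLS]", *sentence_a, "[SEP]", *sentence_b_random, "[SEP]"]  # label: NotNext

    masked, labels = mask_tokens(is_next_pair)
    print(masked)   # input with some words hidden behind [MASK]
    print(labels)   # the original words at masked positions, None elsewhere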

BERT comes with many other pre-training and fine-tuning conventions, like WordPiece tokenization, special prepended tokens, and a single task-specific output layer, which allow the model to perform at human level on various language-based tasks.
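
For instance, using the Hugging Face transformers library (not part of the original BERT release; just a convenient way to see the WordPiece splits and the special tokens, assuming the package and the bert-base-uncased weights are available):

    # Illustration of WordPiece tokenization and the special tokens BERT adds.
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    print(tokenizer.tokenize("BERT handles tokenization gracefully"))
    # rarer words get split into WordPieces, e.g. something like ['token', '##ization']

    encoded = tokenizer("BERT handles tokenization gracefully")
    print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
    # a [CLS] token is prepended and a [SEP] token appended automatically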

The largest BERT model uses 24 Transformer blocks, 1024 embedding dimensions, and 16 attention heads, altogether about 340 million parameters.
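
A rough back-of-the-envelope check of that parameter count (the ~30K WordPiece vocabulary, 512-position limit, and 4x feed-forward width are standard published values, used here as assumptions; LayerNorm and pooler parameters are ignored):

    # Approximate parameter count for the largest BERT configuration.
    hidden, layers, heads = 1024, 24, 16
    vocab, max_pos, ffn = 30_522, 512, 4 * 1024

    embeddings  = (vocab + max_pos + 2) * hidden      # token + position + segment embeddings
    attention   = 4 * (hidden * hidden + hidden)      # Q, K, V and output projections (with biases)
    feedforward = 2 * (hidden * ffn) + ffn + hidden   # two linear layers with biases
    per_block   = attention + feedforward

    total = embeddings + layers * per_block
    print(f"~{total / 1e6:.0f}M parameters")          # ~334M, close to the ~340M quoted above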

BERT was a marvel of creation and human ingenuity within the AI space. I hope this article generates some interest in learning more, or at least gives some understanding of the recipe behind this model. There are many other forks in the building path of BERT and Transformers to talk about, which I will save to study and write about on a different day.

Peace out!
