Bidirectional Encoder Representations from Transformers

BERT (Bidirectional Encoder Representations from Transformers) is a paper published in 2018 by researchers at Google AI Language. It caused a stir in the Machine Learning community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.

BERT’s key technical innovation is applying the bidirectional training of the Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts, which looked at a text sequence either from left to right or with combined left-to-right and right-to-left training. The paper’s results show that a bidirectionally trained language model can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named Masked LM (MLM) which allows bidirectional training in models where it was previously impossible.

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of the Transformer are described in the Google paper “Attention Is All You Need”.
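
As a rough illustration (not the authors’ code), an encoder-only stack like the one BERT uses can be assembled from PyTorch’s built-in Transformer encoder layers; the sizes below loosely mirror BERT-Base (12 layers, hidden size 768, 12 attention heads) and are only indicative.

```python
# Minimal sketch: BERT keeps only the encoder half of the Transformer.
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

# One batch of 2 sequences, each 128 token embeddings of size 768.
token_embeddings = torch.randn(2, 128, 768)
final_states = encoder(token_embeddings)
print(final_states.shape)  # torch.Size([2, 128, 768])
```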

What makes it Bidirectional?

We usually create a language model by training it on an auxiliary task, one that helps the model develop a contextual understanding of words. More often than not, such tasks involve predicting the next word or words in close vicinity of each other. Such training methods cannot be extended to bidirectional models, because they would allow each word to indirectly “see itself”: when you approach the same sentence again from the opposite direction, you already know what to expect. A case of data leakage.

In such a situation, the model could trivially predict the target word. Additionally, we could not guarantee that the model, once fully trained, had actually learnt the contextual meaning of the words rather than just optimizing these trivial predictions.

So how does BERT manage to pre-train bidirectionally? It does so by using a procedure called Masked LM. More details on it later, so read on, my friend.

Pre-training BERT

The BERT model is pre-trained on the following two unsupervised tasks.

1. Masked Language Model (MLM)

This task enables the deep bidirectional learning aspect of the model. Some percentage of the input tokens are masked (replaced with the [MASK] token) at random, and the model tries to predict these masked tokens, not the entire input sequence. The model’s outputs at the masked positions are then fed into a softmax over the vocabulary to get the final output words.
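
To make this concrete, here is a small illustrative example (assuming the Hugging Face transformers library is available; this is not the paper’s own code) of asking a pre-trained BERT to fill in a masked token:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Softmax over the vocabulary at the masked position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_index].softmax(dim=-1).argmax()
print(tokenizer.decode([predicted_id.item()]))  # likely "paris"
```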

Masking, however, creates a mismatch between the pre-training and fine-tuning tasks, because the [MASK] token never appears in most downstream tasks. This is mitigated by a subtle twist in how we mask the input tokens.

Approximately 15% of the tokens are selected for masking during training, but not all of the selected tokens are actually replaced with the [MASK] token. Instead, a selected token is replaced (as sketched in the code after the list):


  • 80% of the time with the [MASK] token.
  • 10% of the time with a random token.
  • 10% of the time with the original token, left unchanged.
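
A minimal sketch of this 80/10/10 rule (my own helper, not the reference implementation; it assumes token ids, the [MASK] id and the vocabulary size are already known):

```python
import random

def mask_tokens(tokens, mask_id, vocab_size, mask_prob=0.15):
    """Return (masked_tokens, labels); labels are -100 at positions the model need not predict."""
    masked, labels = list(tokens), [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:          # select ~15% of tokens
            labels[i] = tok                      # model must predict the original token
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                masked[i] = mask_id
            elif r < 0.9:                        # 10%: replace with a random token
                masked[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return masked, labels
```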


2. Next Sentence Prediction (NSP)

The language model on its own does not directly capture the relationship between two sentences, which is relevant in many downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI). The model is taught sentence relationships by training on a binarized NSP task.

In this task, two sentences, A and B, are chosen for each pre-training example (a small sketch of how the pairs are built follows the list).


  • 50% of the time B is the actual next sentence that follows A.
  • 50% of the time B is a random sentence from the corpus.
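
A rough sketch of how such sentence pairs could be drawn (illustrative only, not the paper’s data pipeline; it assumes corpus is a list of documents, each a list of at least two sentences):

```python
import random

def make_nsp_pair(corpus):
    """Return (sentence_a, sentence_b, label) for the binarized NSP task."""
    doc = random.choice(corpus)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:                     # 50%: the real next sentence
        return sent_a, doc[i + 1], "IsNext"
    other = random.choice(random.choice(corpus))  # 50%: a random sentence
    return sent_a, other, "NotNext"
```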


Training — Inputs and Outputs.

The model is trained on both of the above tasks simultaneously. This is made possible by a clever design of the inputs and outputs.

Inputs

The model needs to handle both a single sentence and a pair of sentences packed together unambiguously in one token sequence. The authors note that a “sentence” can be an arbitrary span of contiguous text rather than an actual linguistic sentence. A [SEP] token is used to separate the two sentences, together with a learnt segment embedding indicating whether a token is part of segment A or segment B.
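
For illustration, the Hugging Face transformers tokenizer (assumed installed; not part of the original paper) shows exactly this packing: a special classification token at the start (discussed under Outputs below), [SEP] tokens as separators, and segment ids distinguishing A from B:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(encoded["token_type_ids"])  # 0s for segment A, 1s for segment B
```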

Problem #1: All the inputs are fed in one step. Unlike RNNs, in which inputs are fed sequentially, the model has no built-in way to preserve the ordering of the input tokens, and the order of words in every language is significant, both semantically and syntactically.

Problem #2: In order to perform the Next Sentence Prediction task properly, we need to be able to distinguish between sentences A and B. Fixing the lengths of the sentences would be too restrictive and a potential bottleneck for various downstream tasks.

Both of these problems are solved by adding embeddings containing the required information to our original token embeddings and using the sum as the input to the BERT model (a sketch of this follows the list). The following embeddings are added to the token embeddings:


  • Segment embeddings: they provide information about which sentence a particular token is a part of.
  • Position embeddings: they provide information about the order of the tokens in the input.
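
A minimal sketch of how these three embeddings are combined (the layer names and sizes here are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768
token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(2, hidden)       # segment A = 0, segment B = 1
position_emb = nn.Embedding(max_len, hidden)

# Toy sequence of the form [CLS] ... [SEP] ... [SEP] (ids are from bert-base-uncased).
input_ids = torch.tensor([[101, 1996, 3899, 102, 2009, 2001, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(embeddings.shape)  # torch.Size([1, 7, 768])
```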


Outputs

How does one predict outputs for two different tasks simultaneously? By using separate FFNN + softmax layers built on top of the outputs of the last encoder, at the positions of the relevant input tokens. We will refer to the outputs of the last encoder as final states.

The first input token is always a special classification token, [CLS]. The final state corresponding to this token is used as the aggregate sequence representation for classification tasks and, for Next Sentence Prediction, is fed into an FFNN + softmax layer that predicts probabilities for the labels “IsNext” and “NotNext”.

The final states corresponding to the [MASK] tokens are fed into an FFNN + softmax over the vocabulary to predict the original words that were masked.
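
Putting the two heads together, a bare-bones sketch (names and sizes are my own, for illustration only) might look like this:

```python
import torch
import torch.nn as nn

hidden, vocab_size = 768, 30522
nsp_head = nn.Linear(hidden, 2)           # IsNext / NotNext from the [CLS] state
mlm_head = nn.Linear(hidden, vocab_size)  # word prediction at each masked position

final_states = torch.randn(1, 128, hidden)   # (batch, seq_len, hidden) from the encoder
nsp_logits = nsp_head(final_states[:, 0])    # [CLS] is position 0
mlm_logits = mlm_head(final_states)          # (1, 128, vocab_size)
nsp_probs = nsp_logits.softmax(dim=-1)
print(nsp_probs.shape, mlm_logits.shape)
```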

Fine-tuning BERT

Fine-tuning on various downstream tasks is done by swapping in the appropriate inputs and outputs. In the general run of things, to train task-specific models we add an extra output layer on top of the existing BERT and fine-tune the resulting model, all parameters, end to end. A positive consequence of only adding input/output layers, rather than changing the BERT model itself, is that only a minimal number of parameters need to be learned from scratch, making the procedure fast, cost- and resource-efficient.
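
As an illustration of this recipe (using the Hugging Face transformers library, which is assumed to be available and is not part of the original paper), fine-tuning BERT for a two-class sentence classification task amounts to adding one classification head and training everything end to end:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The classification head is initialised from scratch; the encoder is pre-trained.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()

batch = tokenizer(["great movie!", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # fine-tune all parameters
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```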

Just to give you an idea of how fast and efficient it is, the authors claim that all the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.
