Bidirectional Encoder Representations from Transformers

BERT (Bidirectional Encoder Representations from Transformers) is a paper published in 2018 by researchers at Google AI Language. It caused a stir in the Machine Learning community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.

BERT’s key technical innovation is applying the bidirectional training of the Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts, which looked at a text sequence either from left to right or with combined left-to-right and right-to-left training. The paper’s results show that a bidirectionally trained language model can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named Masked LM (MLM) which allows bidirectional training in models where it was previously impossible.

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of the Transformer are described in the Google paper “Attention Is All You Need”.
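
As a rough illustration (not the authors’ code), an encoder-only stack like the one BERT uses can be assembled from PyTorch’s built-in Transformer encoder layers; the sizes below loosely mirror BERT-Base (12 layers, hidden size 768, 12 attention heads) and are only indicative.

```python
# Minimal sketch: BERT keeps only the encoder half of the Transformer.
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

# One batch of 2 sequences, each 128 token embeddings of size 768.
token_embeddings = torch.randn(2, 128, 768)
final_states = encoder(token_embeddings)
print(final_states.shape)  # torch.Size([2, 128, 768])
```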

What makes it Bidirectional?

We usually create a language model by training it on an auxiliary task, one that helps the model develop a contextual understanding of words. More often than not, such tasks involve predicting the next word or words in close vicinity of each other. Such training methods cannot be extended to bidirectional models, because they would allow each word to indirectly “see itself”: when you approach the same sentence again from the opposite direction, you already know what to expect. A case of data leakage.

In such a situation, the model could trivially predict the target word. Additionally, we could not guarantee that the model, once fully trained, had actually learnt the contextual meaning of the words rather than just optimizing these trivial predictions.

So how does BERT manage to pre-train bidirectionally? It does so by using a procedure called Masked LM. More details on it later, so read on, my friend.

Pre-training BERT

The BERT model is pre-trained on the following two unsupervised tasks.

1. Masked Language Model (MLM)

This task enables the deep bidirectional learning aspect of the model. Some percentage of the input tokens are masked (replaced with the [MASK] token) at random, and the model tries to predict these masked tokens, not the entire input sequence. The model’s outputs at the masked positions are then fed into a softmax over the vocabulary to get the final output words.
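
To make this concrete, here is a small illustrative example (assuming the Hugging Face transformers library is available; this is not the paper’s own code) of asking a pre-trained BERT to fill in a masked token:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Softmax over the vocabulary at the masked position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_index].softmax(dim=-1).argmax()
print(tokenizer.decode([predicted_id.item()]))  # likely "paris"
```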

Masking, however, creates a mismatch between the pre-training and fine-tuning tasks, because the [MASK] token never appears in most downstream tasks. This is mitigated by a subtle twist in how we mask the input tokens.

Approximately 15% of the tokens are selected for masking during training, but not all of the selected tokens are actually replaced with the [MASK] token. Instead, a selected token is replaced (as sketched in the code after the list):


  • 80% of the time with the [MASK] token.
  • 10% of the time with a random token.
  • 10% of the time with the original token, left unchanged.
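
A minimal sketch of this 80/10/10 rule (my own helper, not the reference implementation; it assumes token ids, the [MASK] id and the vocabulary size are already known):

```python
import random

def mask_tokens(tokens, mask_id, vocab_size, mask_prob=0.15):
    """Return (masked_tokens, labels); labels are -100 at positions the model need not predict."""
    masked, labels = list(tokens), [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:          # select ~15% of tokens
            labels[i] = tok                      # model must predict the original token
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                masked[i] = mask_id
            elif r < 0.9:                        # 10%: replace with a random token
                masked[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return masked, labels
```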


2. Next Sentence Prediction (NSP)

The language model on its own does not directly capture the relationship between two sentences, which is relevant in many downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI). The model is taught sentence relationships by training on a binarized NSP task.

In this task, two sentences, A and B, are chosen for each pre-training example (a small sketch of how the pairs are built follows the list).


  • 50% of the time B is the actual next sentence that follows A.
  • 50% of the time B is a random sentence from the corpus.
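
A rough sketch of how such sentence pairs could be drawn (illustrative only, not the paper’s data pipeline; it assumes corpus is a list of documents, each a list of at least two sentences):

```python
import random

def make_nsp_pair(corpus):
    """Return (sentence_a, sentence_b, label) for the binarized NSP task."""
    doc = random.choice(corpus)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:                     # 50%: the real next sentence
        return sent_a, doc[i + 1], "IsNext"
    other = random.choice(random.choice(corpus))  # 50%: a random sentence
    return sent_a, other, "NotNext"
```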


Training — Inputs and Outputs.

The model is trained on both of the above tasks simultaneously. This is made possible by a clever design of the inputs and outputs.

Inputs

The model needs to handle both a single sentence and a pair of sentences packed together unambiguously in one token sequence. The authors note that a “sentence” can be an arbitrary span of contiguous text rather than an actual linguistic sentence. A [SEP] token is used to separate the two sentences, together with a learnt segment embedding indicating whether a token is part of segment A or segment B.
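
For illustration, the Hugging Face transformers tokenizer (assumed installed; not part of the original paper) shows exactly this packing: a special classification token at the start (discussed under Outputs below), [SEP] tokens as separators, and segment ids distinguishing A from B:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(encoded["token_type_ids"])  # 0s for segment A, 1s for segment B
```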

Problem #1: All the inputs are fed in one step. Unlike RNNs, in which inputs are fed sequentially, the model has no built-in way to preserve the ordering of the input tokens, and the order of words in every language is significant, both semantically and syntactically.

Problem #2: In order to perform the Next Sentence Prediction task properly, we need to be able to distinguish between sentences A and B. Fixing the lengths of the sentences would be too restrictive and a potential bottleneck for various downstream tasks.

Both of these problems are solved by adding embeddings containing the required information to our original token embeddings and using the sum as the input to the BERT model (a sketch of this follows the list). The following embeddings are added to the token embeddings:


  • Segment embeddings: they provide information about which sentence a particular token is a part of.
  • Position embeddings: they provide information about the order of the tokens in the input.
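
A minimal sketch of how these three embeddings are combined (the layer names and sizes here are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768
token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(2, hidden)       # segment A = 0, segment B = 1
position_emb = nn.Embedding(max_len, hidden)

# Toy sequence of the form [CLS] ... [SEP] ... [SEP] (ids are from bert-base-uncased).
input_ids = torch.tensor([[101, 1996, 3899, 102, 2009, 2001, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(embeddings.shape)  # torch.Size([1, 7, 768])
```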


Outputs

How does one predict outputs for two different tasks simultaneously? By using separate FFNN + softmax layers built on top of the outputs of the last encoder, at the positions of the relevant input tokens. We will refer to the outputs of the last encoder as final states.

The first input token is always a special classification token, [CLS]. The final state corresponding to this token is used as the aggregate sequence representation for classification tasks and, for Next Sentence Prediction, is fed into an FFNN + softmax layer that predicts probabilities for the labels “IsNext” and “NotNext”.

The final states corresponding to the [MASK] tokens are fed into an FFNN + softmax over the vocabulary to predict the original words that were masked.
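
Putting the two heads together, a bare-bones sketch (names and sizes are my own, for illustration only) might look like this:

```python
import torch
import torch.nn as nn

hidden, vocab_size = 768, 30522
nsp_head = nn.Linear(hidden, 2)           # IsNext / NotNext from the [CLS] state
mlm_head = nn.Linear(hidden, vocab_size)  # word prediction at each masked position

final_states = torch.randn(1, 128, hidden)   # (batch, seq_len, hidden) from the encoder
nsp_logits = nsp_head(final_states[:, 0])    # [CLS] is position 0
mlm_logits = mlm_head(final_states)          # (1, 128, vocab_size)
nsp_probs = nsp_logits.softmax(dim=-1)
print(nsp_probs.shape, mlm_logits.shape)
```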

Fine-tuning BERT

Fine-tuning on various downstream tasks is done by swapping in the appropriate inputs and outputs. In the general run of things, to train task-specific models we add an extra output layer on top of the existing BERT and fine-tune the resulting model, all parameters, end to end. A positive consequence of only adding input/output layers, rather than changing the BERT model itself, is that only a minimal number of parameters need to be learned from scratch, making the procedure fast, cost- and resource-efficient.
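
As an illustration of this recipe (using the Hugging Face transformers library, which is assumed to be available and is not part of the original paper), fine-tuning BERT for a two-class sentence classification task amounts to adding one classification head and training everything end to end:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The classification head is initialised from scratch; the encoder is pre-trained.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()

batch = tokenizer(["great movie!", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # fine-tune all parameters
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```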

Just to give you an idea of how fast and efficient it is, the authors claim that all the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.
