登录查看更多内容

BERT

Darshika Srivastava

Associate Project Manager @ HuQuo | MBA,Amity Business School

发布日期: 2023年11月14日

Introduction

Picture this – you’re working on a really cool data science project and have applied the latest state-of-the-art library to get a pretty good result. And boom! A few days later, there’s a new state-of-the-art framework in town that has the potential to further improve your model.

That is not a hypothetical scenario – it’s the reality (and thrill) of working in the field of Natural Language Processing (NLP)! The last two years have been mind-blowing in terms of breakthroughs. I get to grips with one framework and another one, potentially even better, comes along.

Google’s BERT is one such NLP framework. I’d stick my neck out and say it’s perhaps the most influential one in recent times (and we’ll see why pretty soon).

It’s not an exaggeration to say that BERT has significantly altered the NLP landscape. Imagine using a single model that is trained on a large unlabelled dataset to achieve State-of-the-Art results on 11 individual NLP tasks. And all of this with little fine-tuning. That’s BERT! It’s a tectonic shift in how we design NLP models.

BERT has inspired many recent NLP architectures, training approaches and language models, such as Google’s TransformerXL, OpenAI’s GPT-2, XLNet, ERNIE2.0, RoBERTa, etc.

I aim to give you a comprehensive guide to not only BERT but also what impact it has had and how this is going to affect the future of NLP research. And yes, there’s a lot of Python code to work on, too!

Ready to start your data science journey?

Master 23+ tools & learn 50+ real-world projects to transform your career in Data Science.

Enroll Now

Note: In this article, we are going to talk a lot about Transformers. If you aren’t familiar with it, feel free to read this article first – How do Transformers Work in NLP? A Guide to the Latest State-of-the-Art Models.

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.

That sounds way too complex as a starting point. But it does summarize what BERT does pretty well so let’s break it down:

First, it’s easy to get that BERT stands for Bidirectional Encoder Representations from Transformers. Each word here has a meaning to it and we will encounter that one by one in this article. For now, the key takeaway from this line is – BERT is based on the Transformer architecture.
Second, BERT is pre-trained on a large corpus of unlabelled text including the entire Wikipedia(that’s 2,500 million words!) and Book Corpus (800 million words). This pre-training step is half the magic behind BERT’s success. This is because as we train a model on a large text corpus, our model starts to pick up the deeper and intimate understandings of how the language works. This knowledge is the swiss army knife that is useful for almost any NLP task.
Third, BERT is a “deeply bidirectional” model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase.

BERT Example

The bidirectionality of a model is important for truly understanding the meaning of a language. Let’s see an example to illustrate this. There are two sentences in this example and both of them involve the word “bank”:

If we try to predict the nature of the word “bank” by only taking either the left or the right context, then we will be making an error in at least one of the two given examples.

One way to deal with this is to consider both the left and the right context before making a prediction. That’s exactly what BERT does! We will see later in the article how this is achieved.

And finally, the most impressive aspect of BERT. We can fine-tune it by adding just a couple of additional output layers to create state-of-the-art models for a variety of NLP tasks.

From Word2Vec to BERT: NLP’s Quest for Learning Language Representations

One of the biggest challenges in natural language processing is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labelled training examples.

Word2Vec and GloVe

The quest for learning language representations by pre-training models on large unlabelled text data started from word embeddings like Word2Vec and GloVe. These embeddings changed the way we performed NLP tasks. We now had embeddings that could capture contextual relationships among words.

These embeddings were used to train models on downstream NLP tasks and make better predictions. This could be done even with less task-specific data by utilizing the additional information from the embeddings itself.

One limitation of these embeddings was the use of very shallow Language Models. This meant there was a limit to the amount of information they could capture and this motivated the use of deeper and more complex language models (layers of LSTMs and GRUs).

Another key limitation was that these models did not take the context of the word into account. Let’s take the above “bank” example. The same word has different meanings in different contexts, right? However, an embedding like Word2Vec will give the same vector for “bank” in both the contexts.

That’s valuable information we are losing.

Enter ELMO and ULMFiT

ELMo was the NLP community’s response to the problem of Polysemy?– same words having different meanings based on their context. From training shallow feed-forward networks (Word2vec), we graduated to training word embeddings using layers of complex Bi-directional LSTM architectures. This meant that the same word can have multiple ELMO embeddings based on the context it is in.

That’s when we started seeing the advantage of pre-training as a training mechanism for NLP.

ULMFiT took this a step further. This framework could train language models that could be fine-tuned to provide excellent results even with fewer data (less than 100 examples) on a variety of document classification tasks. It is safe to say that ULMFiT cracked the code to transfer learning in NLP.

This is when we established the golden formula for transfer learning in NLP:

Transfer Learning in NLP = Pre-Training and Fine-Tuning

Most of the NLP breakthroughs that followed ULMFIT tweaked components of the above equation and gained state-of-the-art benchmarks.

OpenAI’s GPT

OpenAI’s GPT extended the methods of pre-training and fine-tuning that were introduced by ULMFiT and ELMo. GPT essentially replaced the LSTM-based architecture for Language Modeling with a Transformer-based architecture.

The GPT model could be fine-tuned to multiple NLP tasks beyond document classification, such as common sense reasoning, semantic similarity, and reading comprehension.

GPT also emphasized the importance of the Transformer framework, which has a simpler architecture and can train faster than an LSTM-based model. It is also able to learn complex patterns in the data by using the Attention mechanism.

OpenAI’s GPT validated the robustness and usefulness of the Transformer architecture by achieving multiple State-of-the-Arts.

And this is how Transformer inspired BERT and all the following breakthroughs in NLP.

Now, there were some other crucial breakthroughs and research outcomes that we haven’t mentioned yet, such as semi-supervised sequence learning. This is because they are slightly out of the scope of this article but feel free to read the linked paper to know more about it.

Moving onto BERT

So, the new approach to solving NLP tasks became a 2-step process:

Train a language model on a large unlabelled text corpus (unsupervised or semi-supervised)
Fine-tune this large model to specific NLP tasks to utilize the large repository of knowledge this model has gained (supervised)

With that context, let’s understand how BERT takes over from here to build a model that will become a benchmark of excellence in NLP for a long time.

How Does BERT Work?

Let’s look a bit closely at BERT and understand why it is such an effective method to model language. We’ve already seen what BERT can do earlier – but how does it do it? We’ll answer this pertinent question in this section:

1. BERT’s Architecture

The BERT architecture builds on top of Transformer. We currently have two variants available:

BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
BERT Large: 24 layers (transformer blocks), 16 attention heads and, 340 million parameters

Source

领英推荐

Master Large Language Models in 10 Weeks with AI…

Ritesh Kanjee 3 个月前

WHAT IS TEXT STEMMING IN NLP?

Shanti A 3 年前

Exploring Text Analytics: Unveiling Insights from…

Kumar Nachiket 1 年前

The BERT Base architecture has the same model size as OpenAI’s GPT for comparison purposes. All of these Transformer layers are Encoder-only blocks.

If your understanding of the underlying architecture of the Transformer is hazy, I will recommend that you read about it here.

Now that we know the overall architecture of BERT, let’s see what kind of text processing steps are required before we get to the model building phase.

2. Text Preprocessing

The developers behind BERT have added a specific set of rules to represent the input text for the model. Many of these are creative design choices that make the model even better.

For starters, every input embedding is a combination of 3 embeddings:

Position Embeddings: BERT learns and uses positional embeddings to express the position of words in a sentence. These are added to overcome the limitation of Transformer which, unlike an RNN, is not able to capture “sequence” or “order” information
Segment Embeddings: BERT can also take sentence pairs as inputs for tasks (Question-Answering). That’s why it learns a unique embedding for the first and the second sentences to help the model distinguish between them. In the above example, all the tokens marked as EA belong to sentence A (and similarly for EB)
Token Embeddings: These are the embeddings learned for the specific token from the WordPiece token vocabulary

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.

Such a comprehensive embedding scheme contains a lot of useful information for the model.

These combinations of preprocessing steps make BERT so versatile. This implies that without making any major change in the model’s architecture, we can easily train it on multiple kinds of NLP tasks.

3. Pre-training Tasks

BERT is pre-trained on two NLP tasks:

Masked Language Modeling
Next Sentence Prediction

Let’s understand both of these tasks in a little more detail!

Masked Language Modeling (Bi-directionality)

BERT is designed as a?deeply bidirectional?model. The network effectively captures information from both the right and left context of a token from the first layer itself and all the way through to the last layer.

Traditionally, we had language models either trained to predict the next word in a sentence (right-to-left context used in GPT) or language models that were trained on a left-to-right context. This made our models susceptible to errors due to loss in information.

Predicting the word in a sequence

ELMo tried to deal with this problem by training two LSTM language models on left-to-right and right-to-left contexts and shallowly concatenating them. Even though it greatly improved upon existing techniques, it wasn’t enough.

That’s where BERT greatly improves upon both GPT and ELMo. Look at the below image:

The arrows indicate the information flow from one layer to the next. The green boxes at the top indicate the final contextualized representation of each input word.

It’s evident from the above image: BERT is bi-directional, GPT is unidirectional (information flows only from left-to-right), and ELMO is shallowly bidirectional.

This is where the Masked Language Model comes into the picture.

About Masked Language Models

Let’s say we have a sentence – “I love to read data science blogs on Analytics Vidhya”. We want to train a bi-directional language model. Instead of trying to predict the next word in the sequence, we can build a model to predict a missing word from within the sequence itself.

Let’s replace “Analytics” with “[MASK]”. This is a token to denote that the token is missing. We’ll then train the model in such a way that it should be able to predict “Analytics” as the missing token: “I love to read data science blogs on [MASK] Vidhya.”

This is the crux of a Masked Language Model. The authors of BERT also include some caveats to further improve this technique:

To prevent the model from focusing too much on a particular position or tokens that are masked, the researchers randomly masked 15% of the words
The masked words were not always replaced by the masked tokens [MASK] because the [MASK] token would never appear during fine-tuning

So, the researchers used the below technique:

80% of the time the words were replaced with the masked token [MASK]
10% of the time the words were replaced with random words
10% of the time the words were left unchanged

I have shown how to implement a Masked Language Model in Python in one of my previous articles here:

Introduction to PyTorch-Transformers: An Incredible Library for State-of-the-Art NLP (with Python code)

Next Sentence Prediction

Masked Language Models (MLMs) learn to understand the relationship between words. Additionally, BERT is also trained on the task of Next Sentence Prediction for tasks that require an understanding of the relationship between sentences.

A good example of such a task would be question answering systems.

The task is simple. Given two sentences – A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence?

Since it is a binary classification task, the data can be easily generated from any corpus by splitting it into sentence pairs. Just like MLMs, the authors have added some caveats here too. Let’s take this with an example:

Consider that we have a text dataset of 100,000 sentences. So, there will be 50,000 training examples or pairs of sentences as the training data.

For 50% of the pairs, the second sentence would actually be the next sentence to the first sentence
For the remaining 50% of the pairs, the second sentence would be a random sentence from the corpus
The labels for the first case would be?‘IsNext’?and?‘NotNext’?for the second case

And this is how BERT is able to become a true task-agnostic model. It combines both the Masked Language Model (MLM) and the Next Sentence Prediction (NSP) pre-training tasks.

Implementing BERT for Text Classification in Python

Your mind must be whirling with the possibilities BERT has opened up. There are many ways we can take advantage of BERT’s large repository of knowledge for our NLP applications.

One of the most potent ways would be fine-tuning it on your own task and task-specific data. We can then use the embeddings from BERT as embeddings for our text documents.

In this section, we will learn how to use BERT’s embeddings for our NLP task. We’ll take up the concept of fine-tuning an entire BERT model in one of the future articles.

For extracting embeddings from BERT, we will use a really useful open source project called Bert-as-Service:

Running BERT can be a painstaking process since it requires a lot of code and installing multiple packages. That’s why this open-source project is so helpful because it lets us use BERT to extract encodings for each sentence in just two lines of code.

Installing BERT-As-Service

BERT-As-Service works in a simple way. It creates a BERT server which we can access using the Python code in our notebook. Every time we send it a sentence as a list, it will send the embeddings for all the sentences.

We can install the server and client via pip. They can be installed separately or even on?different?machines:

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Note that the server MUST be running on?Python >= 3.5?with?TensorFlow >= 1.10?(one-point-ten).

Also, since running BERT is a GPU intensive task, I’d suggest installing the bert-serving-server on a cloud-based GPU or some other machine that has high compute capacity.

Now, go back to your terminal and download a model listed below. Then, uncompress the zip file into some folder, say /tmp/english_L-12_H-768_A-12/.

要查看或添加评论，请登录

Darshika Srivastava的更多文章

METADATA

2025年3月19日

METADATA

WHAT IS METADATA? Often referred to as data that describes other data, metadata is structured reference data that helps…
SSL

2025年3月18日

SSL

What is SSL? SSL, or Secure Sockets Layer, is an encryption-based Internet security protocol. It was first developed by…
BLOATWARE

2025年3月17日

BLOATWARE

What is bloatware? How to identify and remove it Unwanted pre-installed software -- also known as bloatware -- has long…
Data Democratization

2025年3月15日

Data Democratization

What is Data Democratization? Unlocking the Power of Data Cultures For Businesses Data is a vital asset in today's…
Rooting

2025年3月13日

Rooting

What is Rooting? Rooting is the process by which users of Android devices can attain privileged control (known as root…
Data Strategy

2025年3月12日

Data Strategy

What is a Data Strategy? A data strategy is a long-term plan that defines the technology, processes, people, and rules…
Product

2025年3月11日

Product

What is the Definition of Product? Ask a few people that question, and their specific answers will vary, but they’ll…
API

2025年3月10日

API

What is an API? APIs are mechanisms that enable two software components to communicate with each other using a set of…
Apple Intelligence

2025年3月8日

Apple Intelligence

What Is Apple Intelligence? Apple Intelligence is an artificial intelligence developed by Apple Inc. Relying on a…
Snowflake

2025年3月6日

Snowflake

What is the Snowflake Data Platform? While data is a core asset for modern enterprises, technology’s ability to scale…

See all articles

BERT

Darshika Srivastava

Associate Project Manager @ HuQuo | MBA,Amity Business School

Introduction

Ready to start your data science journey?

Table of contents

What is BERT?

BERT Example

From Word2Vec to BERT: NLP’s Quest for Learning Language Representations

Word2Vec and GloVe

Enter ELMO and ULMFiT

OpenAI’s GPT

Moving onto BERT

How Does BERT Work?

1. BERT’s Architecture

领英推荐

2. Text Preprocessing

3. Pre-training Tasks

Masked Language Modeling (Bi-directionality)

About Masked Language Models

Next Sentence Prediction

Implementing BERT for Text Classification in Python

Installing BERT-As-Service

Darshika Srivastava的更多文章

社区洞察

其他会员也浏览了

How to Clean and Preprocess Text Data Using Pandas for NLP Tasks

?? ALBERT: Transforming NLP with Lightweight Innovation ??

Word Embedding: Unveiling the Hidden Semantics of Words

Discover Your Westeros Legacy: Use NLP to Find Your Affiliation with the 7 Great Houses

Power of Data with Semantics: How Semantic Analysis is Revolutionizing Data Science

NLP Text Preprocessing Techniques in Python for Sentiment Analysis

Unraveling the Power of Machine Learning in NLP: A Beginner's Guide to MLP

Data Science Case Study 2: NLP Complaint Classification

Fine Tuning OPEN AI GPT 3 Transformer Model for Custom Dataset

Foundational Papers in NLP: Bi-Directional Attention Flow (BIDAF) network - Seo et al 2016.

Introduction

Ready to start your data science journey?

Table of contents

What is BERT?

BERT Example

From Word2Vec to BERT: NLP’s Quest for Learning Language Representations

Word2Vec and GloVe

Enter ELMO and ULMFiT

OpenAI’s GPT

Moving onto BERT

How Does BERT Work?

1. BERT’s Architecture

领英推荐

2. Text Preprocessing

3. Pre-training Tasks

Masked Language Modeling (Bi-directionality)

About Masked Language Models

Next Sentence Prediction

Implementing BERT for Text Classification in Python

Installing BERT-As-Service

Darshika Srivastava的更多文章

METADATA

SSL

BLOATWARE

Data Democratization

Rooting

Data Strategy

Product

API

Apple Intelligence

Snowflake

社区洞察

其他会员也浏览了

How to Clean and Preprocess Text Data Using Pandas for NLP Tasks

?? ALBERT: Transforming NLP with Lightweight Innovation ??

Word Embedding: Unveiling the Hidden Semantics of Words

Discover Your Westeros Legacy: Use NLP to Find Your Affiliation with the 7 Great Houses

Power of Data with Semantics: How Semantic Analysis is Revolutionizing Data Science

NLP Text Preprocessing Techniques in Python for Sentiment Analysis

Unraveling the Power of Machine Learning in NLP: A Beginner's Guide to MLP

Data Science Case Study 2: NLP Complaint Classification

Fine Tuning OPEN AI GPT 3 Transformer Model for Custom Dataset

Foundational Papers in NLP: Bi-Directional Attention Flow (BIDAF) network - Seo et al 2016.