BERT for easier NLP/NLU [code included]
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
In this article, we introduce BERT and show how to use it for better NLP/NLU tasks; sentiment classification is presented as a case study with code.
Content:
- What is the problem?
- Pre-training is the solution
- What is BERT?
- Why is BERT different?
- How is BERT trained?
- BERT Architecture
- Pre-training and fine-tuning procedures
- Code: Let's BERT
1) What is the problem?
One of the biggest challenges in natural language processing (NLP) is the shortage of labeled training data for many specific tasks, while modern deep learning-based NLP models improve when trained on millions, or even billions, of annotated training examples.
2) Pre-training is the solution
To help close this gap, a variety of techniques have been developed for training general-purpose language representation models on the enormous amount of unannotated text available (known as pre-training). The pre-trained model can then be fine-tuned on small, task-specific datasets for tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.
3) What is BERT?
As mentioned in the original paper, BERT stands for Bidirectional Encoder Representations from Transformers.
BERT learns a representation of text by pre-training on massive unlabeled texts in a bidirectional fashion.
"Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers"
Accordingly, the pre-trained BERT model can be fine-tuned by adding a single additional output layer, yielding state-of-the-art models for a wide range of NLP tasks.
"BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks"
BERT is open source! Anyone can train their own state-of-the-art models in a few hours.
BERT and related pre-trained models are often considered NLP's VGGNet.
What is Word embedding?
- Word embedding converts words into vectors (embeddings, lists of numbers) such that similar words are near each other in the vector space. Word embeddings can be learned by pre-training models on unlabeled text data. Generally speaking, these learned vector representations of words are crucial for robust and accurate NLP models.
Pre-trained representations can be:
- Context-free: models such as word2vec or GloVe generate a single, fixed embedding (vector) for each word in the vocabulary, independent of the context in which the word appears at test time.
- Contextual: models generate a representation of each word based on the other words in the sentence (see the sketch below).
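For illustration, a context-free model assigns the word "bank" the same vector in "bank of the river" and "bank deposit", while a contextual model like BERT produces different vectors for the two occurrences. A minimal sketch of the contextual case, assuming the Hugging Face Transformers library and the bert-base-uncased checkpoint (not part of the original article's code):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return BERT's contextual vector for the token "bank" in the given sentence
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_river = bank_vector("he sat on the bank of the river")
v_money = bank_vector("she deposited cash at the bank")
# Similarity is well below 1.0: same word, different contexts, different vectors
print(torch.cosine_similarity(v_river, v_money, dim=0))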
4) Why is BERT different?
BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.
BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.
Contextual Language models can be:
- Causal language model (CLM): predicts the next token based on the previous ones (e.g., GPT)
- Masked language model (MLM): predicts a masked token based on the surrounding context tokens (e.g., BERT); see the sketch below
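As a quick illustration of masked language modeling, BERT can be asked to fill in a [MASK] token directly. A minimal sketch using the Hugging Face Transformers fill-mask pipeline with the bert-base-uncased checkpoint (an assumption; not code from the article):

from transformers import pipeline

# The pipeline returns the most likely tokens for the [MASK] position
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The man went to the [MASK] to buy some milk."):
    print(prediction["token_str"], round(prediction["score"], 3))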
5) How is BERT trained?
BERT is pre-trained using two unsupervised tasks:
1) Masked Language Model (MLM)
Unidirectional models are trained by predicting each word conditioned on the previous words in the sentence.
However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.
To solve this problem, BERT uses a straightforward technique: it masks out some of the words in the input and then conditions each word bidirectionally on its context to predict the masked words.
2) Next Sentence Prediction (NSP)
BERT also learns to model relationships between sentences during pre-training: given two sentences A and B, is B the sentence that actually follows A in the corpus, or just a random sentence?
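A minimal sketch of the NSP task using the pre-trained NSP head, assuming a recent version of the Hugging Face Transformers library (illustrative, not from the article):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "He went to the store."
sentence_b = "He bought a gallon of milk."
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits

# Index 0: B actually follows A; index 1: B is a random sentence
print(torch.softmax(logits, dim=1))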
6) BERT Architecture
BERT is basically a trained Transformer Encoder stack. For more information about transformers, read this introductory article Transformers without pain.
Like the vanilla Transformer encoder, BERT takes a sequence of tokens as input; each encoder layer applies self-attention, passes the outcome through a feed-forward network, and hands the result to the next encoder layer, and so on.
Models:
- BERTBASE (L=12, H=768, A=12, Total Parameters=110M)
- BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).
where L is the number of Transformer blocks (layers), H is the hidden size, and A is the number of self-attention heads.
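These hyper-parameters can be checked programmatically from the published checkpoints; a minimal sketch assuming the Hugging Face Transformers library (not part of the original article):

from transformers import BertConfig

# Inspect the architecture of the base checkpoint
config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers,    # L = 12
      config.hidden_size,          # H = 768
      config.num_attention_heads)  # A = 12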
Input/Output Representations
To make BERT handle a variety of down-stream tasks, the input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., Question and Answer) in one token sequence.
A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together. WordPiece embeddings are used with a 30,000 token vocabulary.
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
Sentence pairs are packed together into a single sequence. The sentences are differentiated in two ways.
- First, we separate them with a special token ([SEP]).
- Second, a learned embedding is added to every token indicating whether it belongs to sentence A or sentence B.
The input embeddings are the sum of the token embeddings, the segment embeddings, and the position embeddings.
[CLS] is a special symbol added in front of every input example (used for classification), and [SEP] is a special separator token (e.g. separating questions/answers).
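The tokenizer handles these conventions automatically; a minimal sketch assuming the Hugging Face Transformers BertTokenizer (illustrative, not code from the article):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair adds [CLS] / [SEP] and the segment (token type) ids
enc = tokenizer("Where is the cat?", "The cat is on the mat.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'where', 'is', 'the', 'cat', '?', '[SEP]', 'the', 'cat', 'is', 'on', 'the', 'mat', '.', '[SEP]']
print(enc["token_type_ids"])  # 0s for sentence A, 1s for sentence B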
7) Pre-training and fine-tuning procedures
- Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks.
- During fine-tuning, all parameters are fine-tuned (see the sketch below).
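A minimal fine-tuning sketch using Hugging Face Transformers, where a classification head is added on top of the pre-trained encoder and all parameters are updated (the texts and hyper-parameters below are illustrative, not the paper's exact recipe):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Adds a randomly initialized output layer for 2-class classification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()

texts = ["a touching and very funny film", "the plot is strictly routine"]  # illustrative examples
labels = torch.tensor([1, 0])
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# All parameters (pre-trained encoder + new output layer) are fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer.zero_grad()
outputs = model(**enc, labels=labels)  # returns loss and logits
outputs.loss.backward()
optimizer.step()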
Illustrations of Fine-tuning BERT on Different Tasks:
8) Code: Let's BERT
In this example, we use BERT for the sentiment analysis task, in three different settings:
- Baseline bidirectional LSTM (70%)
- BERT as a feature extractor for [CLS] (82%)
- BERT as a feature extractor for the full sequence representation (85%)
Dataset: download the SST2 dataset for sentiment classification.
import pandas as pd

df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df = df[:2000]
print(df.shape)
df.head(10)

0  a stirring , funny and finally transporting re...  1
1  apparently reassembled from the cutting room f...  0
2  they presume their audience wo n't sit still f...  0
3  this is a visually stunning rumination on love...  1
4  jonathan parker 's bartleby should have been t...  1
5  campanella gets the tone just right funny in t...  1
6  a fan film that for the uninitiated plays bett...  0
7  b art and berling are both superb , while hupp...  1
8  a little less extreme than in the past , with ...  0
9  the film is strictly routine                       0
Classes are balanced
The data split: 80% for training and 20% for testing
1- Baseline (70%): here we use a bidirectional LSTM on the data to classify the text.
from tensorflow import keras
from tensorflow.keras import layers

voc_size = 1000  # vocabulary size (assumed; consistent with the 8,000 embedding parameters below)

inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(voc_size, 8)(inputs)
encoded = layers.Bidirectional(layers.LSTM(16))(x)
x = layers.Dropout(0.5)(encoded)
outputs = layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, None)]            0
_________________________________________________________________
embedding (Embedding)        (None, None, 8)           8000
_________________________________________________________________
bidirectional (Bidirectional (None, 32)                3200
_________________________________________________________________
dropout (Dropout)            (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 2)                 66
=================================================================
Total params: 11,266
Trainable params: 11,266
Non-trainable params: 0
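The article does not show how the raw sentences are turned into the padded integer sequences the baseline expects, nor the training call; a minimal sketch assuming the Keras Tokenizer and the same vocabulary size as above:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

texts, y = df[0].values, df[1].values

# Map words to integer ids and pad to a common length
tok = Tokenizer(num_words=voc_size, oov_token="<unk>")
tok.fit_on_texts(texts)
X = pad_sequences(tok.texts_to_sequences(texts), padding="post")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=32)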
t-SNE of the testing data (two classes)
2- BERT as a feature extractor for [CLS] (82%)
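The snippets below assume that a pre-trained BERT tokenizer and model have already been loaded. The article does not show which checkpoint it uses, so bert-base-uncased (hidden size 768) is assumed here; a minimal loading sketch:

import numpy as np
import torch
from transformers import BertTokenizer, BertModel

# Pre-trained tokenizer and encoder, used below as a frozen feature extractor
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")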
Data preparation for BERT
# Data preparation:
# 1) split words into tokens
# 2) pad sequences
# 3) add [CLS] at start and [SEP] at end
# 4) convert tokens to ids
# now the data is ready for the BERT model
tokenized = df[0].apply((lambda x: bert_tokenizer.encode(x, add_special_tokens=True)))

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
print(np.array(padded).shape)

attention_mask = np.where(padded != 0, 1, 0)
print(attention_mask.shape)
Extract the [CLS] features for each input sample
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = bert_model(input_ids, attention_mask=attention_mask)

X_bert = last_hidden_states[0][:,0,:].numpy()  # the [CLS] output
X_train, X_test, y_train, y_test = train_test_split(X_bert, y, test_size=0.2, random_state=42)
Train a very simple model on the extracted [CLS] features
s_input = keras.Input(shape=(X_bert.shape[1],), dtype='float32')
s_encoded = layers.Dense(32)(s_input)
s_output = layers.Dense(2, activation='softmax')(s_encoded)
s_model = keras.Model(s_input, s_output)
s_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 768)]             0
_________________________________________________________________
dense_1 (Dense)              (None, 32)                24608
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66
=================================================================
Total params: 24,674
Trainable params: 24,674
Non-trainable params: 0
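The compile/fit step is not shown in the article; a minimal training sketch, assuming integer labels and the 80/20 split prepared above:

# Train the small classifier on the frozen [CLS] features
s_model.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
s_model.fit(X_train, y_train,
            validation_data=(X_test, y_test),
            epochs=20, batch_size=32)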
t-SNE of the testing data (two classes)
3- BERT as a feature extractor for the full sequence representation (85%)
Here, instead of using only the [CLS] output, we use the whole output sequence and feed it to a bidirectional LSTM.
X_seq_bert = last_hidden_states[0][:,1:,:].numpy()  # (samples, seq, feat)
X_train, X_test, y_train, y_test = train_test_split(X_seq_bert, y, test_size=0.2, random_state=42)

inputs = keras.Input(shape=(None, X_bert.shape[1]), dtype="float32")
encoded = layers.Bidirectional(layers.LSTM(16))(inputs)
x = layers.Dropout(0.1)(encoded)
outputs = layers.Dense(2, activation="softmax")(x)
bs_model = keras.Model(inputs, outputs)
bs_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_3 (InputLayer)         [(None, None, 768)]       0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 32)                100480
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 66
=================================================================
Total params: 100,546
Trainable params: 100,546
Non-trainable params: 0
t-SNE of the testing data (two classes)
The source code is available as a Colab notebook.
Bonus: Hugging Face Transformers
Hugging Face Transformers provides thousands of pre-trained models to perform tasks on text such as classification, information extraction, question answering, summarization, translation, and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with seamless integration between them, allowing you to train your models with one and then load them for inference with the other.
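For example, a ready-made sentiment classifier is available through the pipeline API (a minimal sketch; the default checkpoint is chosen by the library):

from transformers import pipeline

# Downloads a default model fine-tuned for sentiment analysis
classifier = pipeline("sentiment-analysis")
print(classifier("the film is strictly routine"))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]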
References:
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
Code BERT End to End (Fine-tuning + Predicting) with Cloud TPU
Code A Visual Notebook to Using BERT for the First Time
Regards