BERT for easier NLP/NLU [code included]

In this article, we introduce BERT and show how to use it to improve NLP / NLU tasks. Sentiment classification is presented as a case study, with code.

Content:

  1. What is the problem?
  2. Pre-training is the solution
  3. What is BERT?
  4. Why is BERT different?
  5. How is BERT trained?
  6. BERT Architecture
  7. Pre-training and fine-tuning procedures
  8. Code: Let's BERT


1) What is the problem?

One of the biggest challenges in natural language processing (NLP) is the shortage of labeled training data for many distinct tasks, while modern deep-learning-based NLP models perform best when trained on millions, or even billions, of annotated training examples.

2) Pre-training is the solution

To help close this gap, a variety of techniques have been developed for training general-purpose language representation models on enormous amounts of unannotated text (known as pre-training). The pre-trained model can then be fine-tuned on small labeled datasets for different tasks, such as question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.

3) What is BERT?

As mentioned in the original paper, BERT stands for Bidirectional Encoder Representations from Transformers.

BERT learns a representation of text by pre-training on massive amounts of unlabeled text in a bidirectional fashion.

"Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers"


Accordingly, the pre-trained BERT model can be fine-tuned with just one additional output layer, yielding state-of-the-art models for a wide range of NLP tasks.

"BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks"

BERT is open source! Anyone can train their own state-of-the-art models in a few hours of training.

BERT and similar models are often considered the VGGNet of NLP: general-purpose pre-trained backbones that other systems build on.


What is Word embedding?

  • Word embedding converts words into vectors (embeddings, lists of numbers) such that similar words lie near each other in the vector space. Word embeddings can be learned by pre-training models on unlabeled text data. Generally speaking, these learned vector representations of words are crucial for robust and accurate NLP models.

Pre-trained representations can be:

  • Context-free: such as word2vec or GloVe that generates a single/fixed word embedding (vector) representation for each word in the vocabulary (independent of the context of that word at test time)
  • Contextual: generates a representation of each word based on the other words in the sentence.
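
To make this difference concrete, here is a minimal sketch showing that a contextual model gives the word "bank" a different vector in each sentence (it assumes the transformers and torch packages and the bert-base-uncased checkpoint, none of which are prescribed by this section of the article):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # Encode the sentence and locate the target word among the tokens
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    idx = tokens.index(word)
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[0, idx]  # contextual vector for this occurrence

v1 = word_vector("i deposited cash at the bank", "bank")
v2 = word_vector("we sat on the bank of the river", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: the two "bank" vectors differ

A context-free model such as word2vec or GloVe would return the same single vector for "bank" in both sentences.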

4) Why is BERT different?

BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.

BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.

Contextual language models can be:

  • Causal language model (CLM): predicts the next token based on the previous ones (GPT).
  • Masked language model (MLM): predicts a masked token based on the surrounding context tokens (BERT).
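
A quick way to see the two objectives side by side is the transformers pipeline API (an illustrative sketch; the gpt2 and bert-base-uncased checkpoints are chosen here only for illustration and are not part of the original article):

from transformers import pipeline

# Causal LM (GPT-style): continue the text from left to right
generator = pipeline("text-generation", model="gpt2")
print(generator("The movie was surprisingly", max_length=12, num_return_sequences=1))

# Masked LM (BERT-style): fill in a masked token using both left and right context
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("The movie was surprisingly [MASK] and well acted."))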


5) How is BERT trained?

BERT is pre-trained using two unsupervised tasks:

1) Masked Language Model (MLM)

Unidirectional models are trained by predicting each word conditioned on the previous words in the sentence.

However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.

To solve this problem, BERT uses a straightforward technique: some of the input words are masked out, and each masked word is then predicted bidirectionally from the surrounding context.
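
In the paper, 15% of the input tokens are selected, and most of them are replaced with a special [MASK] token (a fraction are kept unchanged or replaced with random tokens; the sketch below omits that refinement). A rough sketch of how such a training input could be built, assuming the transformers BertTokenizer:

import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "the man went to the store to buy a gallon of milk"
tokens = tokenizer.tokenize(text)

# Randomly select ~15% of the positions and replace them with [MASK];
# the original tokens at those positions become the prediction targets.
masked = tokens[:]
targets = {}
for i in range(len(tokens)):
    if random.random() < 0.15:
        targets[i] = tokens[i]
        masked[i] = tokenizer.mask_token  # "[MASK]"

print(masked)   # input sequence fed to BERT
print(targets)  # labels the model must predict at the masked positions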


2) Next Sentence Prediction (NSP)

BERT also learns to model relationships between sentences during pre-training: given two sentences A and B, is B the actual sentence that follows A in the corpus, or just a random sentence?

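A minimal sketch of this objective using the transformers library (assuming the BertForNextSentencePrediction head and the bert-base-uncased checkpoint; in the library's convention, class 0 means "B follows A"):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sent_a = "The man went to the store."
sent_b = "He bought a gallon of milk."       # plausible continuation
sent_c = "Penguins are flightless birds."    # random sentence

for candidate in (sent_b, sent_c):
    enc = tokenizer(sent_a, candidate, return_tensors="pt")
    with torch.no_grad():
        logits = nsp_model(**enc).logits
    # index 0 = "is the next sentence", index 1 = "is a random sentence"
    print(candidate, torch.softmax(logits, dim=1)[0].tolist())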

6) BERT Architecture

BERT is basically a trained Transformer Encoder stack. For more information about transformers, read this introductory article Transformers without pain.

Like the vanilla Transformer encoder, BERT takes a sequence of words as input; each encoder layer applies self-attention, passes the result through a feed-forward network, and hands it to the next encoder layer, and so on.

Models:

  1. BERT-BASE (L=12, H=768, A=12, total parameters = 110M)
  2. BERT-LARGE (L=24, H=1024, A=16, total parameters = 340M)

where L is the number of Transformer blocks, H is the hidden size, and A is the number of self-attention heads.
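
These sizes can be checked by loading the public checkpoints and counting parameters (a quick sketch assuming the transformers package; the counts come out close to the 110M / 340M figures above):

from transformers import BertModel

for name in ("bert-base-uncased", "bert-large-uncased"):
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(name, f"{n_params / 1e6:.0f}M parameters")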

Input/Output Representations

To make BERT handle a variety of down-stream tasks, the input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., Question and Answer) in one token sequence.

A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together. WordPiece embeddings are used with a 30,000 token vocabulary.

The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.

Sentence pairs are packed together into a single sequence. The sentences are differentiated in two ways.

  • First, we separate them with a special token ([SEP]).
  • Second, a learned embedding is added to every token indicating whether it belongs to sentence A or sentence B.

The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings

[CLS] is a special symbol added in front of every input example (used for classification), and [SEP] is a special separator token (e.g. separating questions/answers).
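
The snippet below shows how a tokenizer builds this representation for a sentence pair (a sketch assuming the transformers BertTokenizer and the bert-base-uncased vocabulary): [CLS] and [SEP] are inserted automatically, and token_type_ids marks whether each token belongs to segment A or B.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("Where is the cat?", "The cat is on the mat.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'where', 'is', 'the', 'cat', '?', '[SEP]', 'the', 'cat', 'is', 'on', 'the', 'mat', '.', '[SEP]']
print(enc["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens
print(enc["attention_mask"])   # 1 for real tokens, 0 for padding (none here)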

7) Pre-training and fine-tuning procedures

  • Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks.
  • During fine-tuning, all parameters are fine-tuned.
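
For example, fine-tuning for sentence classification amounts to loading the pre-trained weights, placing a small classification head on top of [CLS], and training everything end to end. A minimal sketch using the Hugging Face transformers library (note that this differs from the code later in this article, which keeps BERT frozen as a feature extractor):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The pre-trained encoder weights are reused; only the 2-class output head is new (randomly initialized)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# During fine-tuning, all parameters (encoder + new head) are updated end to end,
# typically for 2-4 epochs with a small learning rate such as 2e-5, as in the paper.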

Illustrations of Fine-tuning BERT on Different Tasks:



8) Code: Let's BERT

In this example, we use BERT for the sentiment analysis task, in three different settings:

  1. Baseline bidirectional LSTM (70%)
  2. BERT as a feature extractor for [CLS] (82%)
  3. BERT as a feature extractor for the full sequence representation (85%)

Dataset: download SST2 dataset for Sentiment Classification

import pandas as pd

# Load the SST2 training split (column 0: sentence text, column 1: sentiment label 0/1)
# and keep the first 2000 rows
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df = df[:2000]
print(df.shape)

df.head(10)



0	a stirring , funny and finally transporting re...	1
1	apparently reassembled from the cutting room f...	0
2	they presume their audience wo n't sit still f...	0
3	this is a visually stunning rumination on love...	1
4	jonathan parker 's bartleby should have been t...	1
5	campanella gets the tone just right funny in t...	1
6	a fan film that for the uninitiated plays bett...	0
7	b art and berling are both superb , while hupp...	1
8	a little less extreme than in the past , with ...	0
9	the film is strictly routine	0

Classes are balanced


The data split: 80% for training and 20% for testing

1- Baseline (70%): here we apply a bidirectional LSTM to the data to classify the text.

from tensorflow import keras
from tensorflow.keras import layers

voc_size = 1000  # assumed vocabulary size (consistent with the 8,000 embedding parameters in the summary below)

# Simple baseline: embedding + bidirectional LSTM over the integer token ids
inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(voc_size, 8)(inputs)
encoded = layers.Bidirectional(layers.LSTM(16))(x)
x = layers.Dropout(0.5)(encoded)
outputs = layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.summary()



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 8)           8000      
_________________________________________________________________
bidirectional (Bidirectional (None, 32)                3200      
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense (Dense)                (None, 2)                 66        
=================================================================
Total params: 11,266
Trainable params: 11,266
Non-trainable params: 0
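
The preprocessing and training calls for the baseline are not shown in the article; a minimal sketch of those steps (assuming the Keras Tokenizer for converting text to integer ids, with the voc_size defined above) could look like:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Convert the sentences to padded integer sequences
tok = Tokenizer(num_words=voc_size, oov_token="<unk>")
tok.fit_on_texts(df[0])
X_ids = pad_sequences(tok.texts_to_sequences(df[0]), padding="post")
y = df[1].values  # sentiment labels (0/1)

X_train, X_test, y_train, y_test = train_test_split(X_ids, y, test_size=0.2, random_state=42)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=32)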



t-SNE of the two classes of the testing data

2- BERT as a feature extractor for [CLS] (82%)

Data preparation for BERT

# Data preparation: 
# 1) split words into tokens
# 2) pad sequences 
# 3) add [CLS] at start and [SEP] at end
# 4) convert tokens to ids
# now the data is ready for the BERT model 


import numpy as np
import torch
from transformers import BertTokenizer, BertModel

# Assumption: the pre-trained bert-base-uncased checkpoint (768-dim hidden states)
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

tokenized = df[0].apply((lambda x: bert_tokenizer.encode(x, add_special_tokens=True)))
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
print(padded.shape)
attention_mask = np.where(padded != 0, 1, 0)
print(attention_mask.shape)

Extract the [CLS] feature vector for each input sample

from sklearn.model_selection import train_test_split

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)
with torch.no_grad():
    last_hidden_states = bert_model(input_ids, attention_mask=attention_mask)

X_bert = last_hidden_states[0][:,0,:].numpy()  # the [CLS] output (one 768-dim vector per sample)
y = df[1].values                               # sentiment labels (0/1)
X_train, X_test, y_train, y_test = train_test_split(X_bert, y, test_size=0.2, random_state=42)

Train a very simple model on the extracted [CLS] features

s_input = keras.Input(shape=(X_bert.shape[1],), dtype='float32')
s_encoded = layers.Dense(32)(s_input)
s_output = layers.Dense(2, activation='softmax')(s_encoded)
s_model = keras.Model(s_input, s_output)
s_model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         [(None, 768)]             0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                24608     
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66        
=================================================================
Total params: 24,674
Trainable params: 24,674
Non-trainable params: 0



t-SNE of the two classes of the testing data

3- BERT as a feature extractor for the full sequence representation (85%)

Here, instead of using only the [CLS] output, we use the whole output sequence and feed it to a bidirectional network.

X_seq_bert = last_hidden_states[0][:,1:,:].numpy()  # per-token vectors, [CLS] dropped: (samples, seq, 768)
X_train, X_test, y_train, y_test = train_test_split(X_seq_bert, y, test_size=0.2, random_state=42)

inputs = keras.Input(shape=(None, X_bert.shape[1]), dtype="float32")
encoded = layers.Bidirectional(layers.LSTM(16))(inputs)
x = layers.Dropout(0.1)(encoded)
outputs = layers.Dense(2, activation="softmax")(x)
bs_model = keras.Model(inputs, outputs)
bs_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_3 (InputLayer)         [(None, None, 768)]       0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 32)                100480    
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 66        
=================================================================
Total params: 100,546
Trainable params: 100,546
Non-trainable params: 0

t-SNE of the two classes of the testing data

The source code is available as a Colab notebook.


Bonus: Hugging Face Transformers

Transformers provides thousands of pre-trained models to perform tasks on text such as classification, information extraction, question answering, summarization, translation, and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with seamless integration between them, allowing you to train your models with one and then load them for inference with the other.
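
For example, the sentiment task above can be run in a couple of lines with a ready fine-tuned model (a sketch using the library's default sentiment-analysis pipeline checkpoint):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("This movie was absolutely wonderful!"))
# something like: [{'label': 'POSITIVE', 'score': 0.99...}]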



References:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)

Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing (Google AI Blog)

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning), Jay Alammar

Code: BERT End to End (Fine-tuning + Predicting) with Cloud TPU

Code: A Visual Notebook to Using BERT for the First Time


Regards
