BERT for easier NLP/NLU [code included]
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
In this article, we introduce BERT and show how to use it for better NLP/NLU tasks; sentiment classification is presented as a case study with code.
Content:
- What is the problem?
- Pre-training is the solution
- What is BERT?
- Why is BERT different?
- How is BERT trained?
- BERT Architecture
- Pre-training and fine-tuning procedures
- Code: Let's BERT
1) What is the problem?
One of the biggest challenges in natural language processing (NLP) is the shortage of labeled training data for many specific tasks, while modern deep learning-based NLP models improve when trained on millions, or even billions, of annotated training examples.
2) Pre-training is the solution
To help close this gap, a variety of techniques have been developed for training general-purpose language representation models on the enormous amount of unannotated text available (known as pre-training). The pre-trained model can then be fine-tuned on small, task-specific datasets for tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.
3) What is BERT?
As mentioned in the original paper, BERT stands for Bidirectional Encoder Representations from Transformers.
BERT learns a representation of text by pre-training on massive unlabeled texts in a bidirectional fashion.
"Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers"
Accordingly, the pre-trained BERT model can be fine-tuned by adding a single additional output layer, yielding state-of-the-art models for a wide range of NLP tasks.
"BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks"
BERT is open source! Anyone can train their own state-of-the-art models in a few hours.
BERT and related pre-trained models are often considered NLP's VGGNet.
What is Word embedding?
- Word embedding converts words into vectors (embeddings, lists of numbers) such that similar words are near each other in the vector space. Word embeddings can be learned by pre-training models on unlabeled text data. Generally speaking, these learned vector representations of words are crucial for robust and accurate NLP models.
Pre-trained representations can be:
- Context-free: models such as word2vec or GloVe generate a single, fixed embedding (vector) for each word in the vocabulary, independent of the context in which the word appears at test time.
- Contextual: models generate a representation of each word based on the other words in the sentence (see the sketch below).
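For illustration, a context-free model assigns the word "bank" the same vector in "bank of the river" and "bank deposit", while a contextual model like BERT produces different vectors for the two occurrences. A minimal sketch of the contextual case, assuming the Hugging Face Transformers library and the bert-base-uncased checkpoint (not part of the original article's code):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return BERT's contextual vector for the token "bank" in the given sentence
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_river = bank_vector("he sat on the bank of the river")
v_money = bank_vector("she deposited cash at the bank")
# Similarity is well below 1.0: same word, different contexts, different vectors
print(torch.cosine_similarity(v_river, v_money, dim=0))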
4) Why is BERT different?
BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.
BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.
Contextual Language models can be:
- Causal language model (CLM): predicts the next token based on the previous ones (e.g., GPT)
- Masked language model (MLM): predicts a masked token based on the surrounding context tokens (e.g., BERT); see the sketch below
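As a quick illustration of masked language modeling, BERT can be asked to fill in a [MASK] token directly. A minimal sketch using the Hugging Face Transformers fill-mask pipeline with the bert-base-uncased checkpoint (an assumption; not code from the article):

from transformers import pipeline

# The pipeline returns the most likely tokens for the [MASK] position
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The man went to the [MASK] to buy some milk."):
    print(prediction["token_str"], round(prediction["score"], 3))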
5) How is BERT trained?
BERT is pre-trained using two unsupervised tasks:
1) Masked Language Model (MLM)
Unidirectional models are trained by predicting each word conditioned on the previous words in the sentence.
However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.
To solve this problem, BERT uses a straightforward technique: it masks out some of the words in the input and then conditions each word bidirectionally on its context to predict the masked words.
2) Next Sentence Prediction (NSP)
BERT also learns to model relationships between sentences during pre-training: given two sentences A and B, is B the sentence that actually follows A in the corpus, or just a random sentence?
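A minimal sketch of the NSP task using the pre-trained NSP head, assuming a recent version of the Hugging Face Transformers library (illustrative, not from the article):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "He went to the store."
sentence_b = "He bought a gallon of milk."
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits

# Index 0: B actually follows A; index 1: B is a random sentence
print(torch.softmax(logits, dim=1))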
6) BERT Architecture
BERT is basically a trained Transformer Encoder stack. For more information about transformers, read this introductory article Transformers without pain.
Like the vanilla Transformer encoder, BERT takes a sequence of tokens as input; each encoder layer applies self-attention, passes the outcome through a feed-forward network, and hands the result to the next encoder layer, and so on.
Models:
- BERTBASE (L=12, H=768, A=12, Total Parameters=110M)
- BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).
where L is the number of Transformer blocks (layers), H is the hidden size, and A is the number of self-attention heads.
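These hyper-parameters can be checked programmatically from the published checkpoints; a minimal sketch assuming the Hugging Face Transformers library (not part of the original article):

from transformers import BertConfig

# Inspect the architecture of the base checkpoint
config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers,    # L = 12
      config.hidden_size,          # H = 768
      config.num_attention_heads)  # A = 12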
Input/Output Representations
To make BERT handle a variety of down-stream tasks, the input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., Question and Answer) in one token sequence.
A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together. WordPiece embeddings are used with a 30,000 token vocabulary.
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
Sentence pairs are packed together into a single sequence. The sentences are differentiated in two ways.
- First, we separate them with a special token ([SEP]).
- Second, a learned embedding is added to every token indicating whether it belongs to sentence A or sentence B.
The input embeddings are the sum of the token embeddings, the segment embeddings, and the position embeddings.
[CLS] is a special symbol added in front of every input example (used for classification), and [SEP] is a special separator token (e.g. separating questions/answers).
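The tokenizer handles these conventions automatically; a minimal sketch assuming the Hugging Face Transformers BertTokenizer (illustrative, not code from the article):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair adds [CLS] / [SEP] and the segment (token type) ids
enc = tokenizer("Where is the cat?", "The cat is on the mat.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'where', 'is', 'the', 'cat', '?', '[SEP]', 'the', 'cat', 'is', 'on', 'the', 'mat', '.', '[SEP]']
print(enc["token_type_ids"])  # 0s for sentence A, 1s for sentence B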
7) Pre-training and fine-tuning procedures
- Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks.
- During fine-tuning, all parameters are fine-tuned (see the sketch below).
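A minimal fine-tuning sketch using Hugging Face Transformers, where a classification head is added on top of the pre-trained encoder and all parameters are updated (the texts and hyper-parameters below are illustrative, not the paper's exact recipe):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Adds a randomly initialized output layer for 2-class classification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()

texts = ["a touching and very funny film", "the plot is strictly routine"]  # illustrative examples
labels = torch.tensor([1, 0])
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# All parameters (pre-trained encoder + new output layer) are fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer.zero_grad()
outputs = model(**enc, labels=labels)  # returns loss and logits
outputs.loss.backward()
optimizer.step()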
Illustrations of Fine-tuning BERT on Different Tasks:
8) Code: Let's BERT
In this example, we use BERT for the sentiment analysis task, in three different settings:
- Baseline bidirectional LSTM (70%)
- BERT as a feature extractor for [CLS] (82%)
- BERT as a feature extractor for the full sequence representation (85%)
Dataset: download the SST2 dataset for sentiment classification.
import pandas as pd

df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df = df[:2000]
print(df.shape)
df.head(10)

0  a stirring , funny and finally transporting re...  1
1  apparently reassembled from the cutting room f...  0
2  they presume their audience wo n't sit still f...  0
3  this is a visually stunning rumination on love...  1
4  jonathan parker 's bartleby should have been t...  1
5  campanella gets the tone just right funny in t...  1
6  a fan film that for the uninitiated plays bett...  0
7  b art and berling are both superb , while hupp...  1
8  a little less extreme than in the past , with ...  0
9  the film is strictly routine                       0
Classes are balanced
The data split: 80% for training and 20% for testing
1- Baseline (70%): here we use a bidirectional LSTM on the data to classify the text.
from tensorflow import keras
from tensorflow.keras import layers

voc_size = 1000  # vocabulary size (assumed; consistent with the 8,000 embedding parameters below)

inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(voc_size, 8)(inputs)
encoded = layers.Bidirectional(layers.LSTM(16))(x)
x = layers.Dropout(0.5)(encoded)
outputs = layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, None)]            0
_________________________________________________________________
embedding (Embedding)        (None, None, 8)           8000
_________________________________________________________________
bidirectional (Bidirectional (None, 32)                3200
_________________________________________________________________
dropout (Dropout)            (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 2)                 66
=================================================================
Total params: 11,266
Trainable params: 11,266
Non-trainable params: 0
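The article does not show how the raw sentences are turned into the padded integer sequences the baseline expects, nor the training call; a minimal sketch assuming the Keras Tokenizer and the same vocabulary size as above:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

texts, y = df[0].values, df[1].values

# Map words to integer ids and pad to a common length
tok = Tokenizer(num_words=voc_size, oov_token="<unk>")
tok.fit_on_texts(texts)
X = pad_sequences(tok.texts_to_sequences(texts), padding="post")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=32)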
t-SNE of the testing data (two classes)
2- BERT as a feature extractor for [CLS] (82%)
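The snippets below assume that a pre-trained BERT tokenizer and model have already been loaded. The article does not show which checkpoint it uses, so bert-base-uncased (hidden size 768) is assumed here; a minimal loading sketch:

import numpy as np
import torch
from transformers import BertTokenizer, BertModel

# Pre-trained tokenizer and encoder, used below as a frozen feature extractor
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")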
Data preparation for BERT
# Data preparation:
# 1) split words into tokens
# 2) pad sequences
# 3) add [CLS] at start and [SEP] at end
# 4) convert tokens to ids
# now the data is ready for the BERT model
tokenized = df[0].apply((lambda x: bert_tokenizer.encode(x, add_special_tokens=True)))

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
print(np.array(padded).shape)

attention_mask = np.where(padded != 0, 1, 0)
print(attention_mask.shape)
Extract the [CLS] features for each input sample
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = bert_model(input_ids, attention_mask=attention_mask)

X_bert = last_hidden_states[0][:,0,:].numpy()  # the [CLS] output
X_train, X_test, y_train, y_test = train_test_split(X_bert, y, test_size=0.2, random_state=42)
Train a very simple model on the extracted [CLS] features
s_input = keras.Input(shape=(X_bert.shape[1],), dtype='float32')
s_encoded = layers.Dense(32)(s_input)
s_output = layers.Dense(2, activation='softmax')(s_encoded)
s_model = keras.Model(s_input, s_output)
s_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 768)]             0
_________________________________________________________________
dense_1 (Dense)              (None, 32)                24608
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66
=================================================================
Total params: 24,674
Trainable params: 24,674
Non-trainable params: 0
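The compile/fit step is not shown in the article; a minimal training sketch, assuming integer labels and the 80/20 split prepared above:

# Train the small classifier on the frozen [CLS] features
s_model.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
s_model.fit(X_train, y_train,
            validation_data=(X_test, y_test),
            epochs=20, batch_size=32)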
t-SNE of the testing data (two classes)
3- BERT as a feature extractor for the full sequence representation (85%)
Here, instead of using only the [CLS] output, we use the whole output sequence and feed it to a bidirectional LSTM.
X_seq_bert = last_hidden_states[0][:,1:,:].numpy()  # (samples, seq, feat)
X_train, X_test, y_train, y_test = train_test_split(X_seq_bert, y, test_size=0.2, random_state=42)

inputs = keras.Input(shape=(None, X_bert.shape[1]), dtype="float32")
encoded = layers.Bidirectional(layers.LSTM(16))(inputs)
x = layers.Dropout(0.1)(encoded)
outputs = layers.Dense(2, activation="softmax")(x)
bs_model = keras.Model(inputs, outputs)
bs_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_3 (InputLayer)         [(None, None, 768)]       0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 32)                100480
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 66
=================================================================
Total params: 100,546
Trainable params: 100,546
Non-trainable params: 0
t-SNE of the testing data (two classes)
The source code is available as a Colab notebook.
Bonus: Hugging Face Transformers
Hugging Face Transformers provides thousands of pre-trained models to perform tasks on text such as classification, information extraction, question answering, summarization, translation, and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with seamless integration between them, allowing you to train your models with one and then load them for inference with the other.
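For example, a ready-made sentiment classifier is available through the pipeline API (a minimal sketch; the default checkpoint is chosen by the library):

from transformers import pipeline

# Downloads a default model fine-tuned for sentiment analysis
classifier = pipeline("sentiment-analysis")
print(classifier("the film is strictly routine"))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]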
References:
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
Code BERT End to End (Fine-tuning + Predicting) with Cloud TPU
Code A Visual Notebook to Using BERT for the First Time
Regards