HOW TO USE TRANSFORMERS FOR REAL-LIFE PROBLEMS USING DIFFERENT TRANSFORMER MODELS
Shivam Jha
AI Team Lead | NLP | Computer Vision | Machine Learning | GCP | DevOps | Full Stack Web Development | Full Stack Mobile Application Development | Prompt Engineer | LLM | Langchain
In this article, I discuss some use cases of transformers, the different types of models, a few terms related to transformers, data preprocessing, fine-tuning models, and how to use models with a custom dataset. I have collected information from the internet and explained it in simple language, so you don't have to look anywhere else to learn this.
The Python package I use here is called transformers, and you can use it to solve real-world problems.
I have written about transformers in my previous article, so please read it before progressing to this one; it will help you understand what follows.
https://www.dhirubhai.net/pulse/long-live-transformer-shivam-jha
Summary Of Some Use Cases
Sequence Classification
In sequence classification, a sequence is classified into one of a given number of classes
In this case, we are determining if a sequence is positive or negative
from transformers import pipeline

nlp = pipeline("sentiment-analysis")

result = nlp("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: NEGATIVE, with score: 0.9991

result = nlp("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999
Extractive Question Answering
Extractive Question Answering is the task of extracting an answer from a text, called the context, given a question
from transformers import pipeline

nlp = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = nlp(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'the task of extracting an answer from a text given a question.', score: 0.6226, start: 34, end: 96

result = nlp(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'SQuAD dataset,', score: 0.5053, start: 147, end: 161
Language Modeling
The task of language modeling involves fitting a model to a corpus, which can be domain-specific. Transformers are trained using a variant of language modeling, e.g., masked language modeling for BERT and causal language modeling for GPT-2
Masked Language Modeling
The purpose of masked language modeling is to mask tokens in a sequence with a masking token and then prompt the model to fill that mask with an appropriate token. The model can take into account both the right context (tokens on the right of the mask) and the left context (tokens on the left of the mask)
from transformers import pipeline
from pprint import pprint

nlp = pipeline("fill-mask")
pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))

[{'score': 0.1792745739221573,
  'sequence': '<s>HuggingFace is creating a tool that the community uses to solve NLP tasks.</s>',
  'token': 3944,
  'token_str': 'Ġtool'},
 {'score': 0.11349421739578247,
  'sequence': '<s>HuggingFace is creating a framework that the community uses to solve NLP tasks.</s>',
  'token': 7208,
  'token_str': 'Ġframework'},
 {'score': 0.05243554711341858,
  'sequence': '<s>HuggingFace is creating a library that the community uses to solve NLP tasks.</s>',
  'token': 5560,
  'token_str': 'Ġlibrary'},
 {'score': 0.03493533283472061,
  'sequence': '<s>HuggingFace is creating a database that the community uses to solve NLP tasks.</s>',
  'token': 8503,
  'token_str': 'Ġdatabase'},
 {'score': 0.02860250137746334,
  'sequence': '<s>HuggingFace is creating a prototype that the community uses to solve NLP tasks.</s>',
  'token': 17715,
  'token_str': 'Ġprototype'}]
Text Generation
Text generation (also called open-ended text generation) aims to create coherent portions of text that follow from a given context
from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))

[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
In this case, the model generates a continuation of the given context with a maximum total length of 50 tokens. Because do_sample=False, decoding is greedy, so the output is deterministic rather than random
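As a quick sketch (the top_k value here is just an illustrative choice), you can enable sampling instead to get more varied, non-deterministic continuations:

from transformers import pipeline

text_generator = pipeline("text-generation")  # GPT-2 by default

# do_sample=True switches from greedy decoding to sampling;
# top_k=50 restricts sampling to the 50 most likely next tokens
print(text_generator("As far as I am concerned, I will",
                     max_length=50, do_sample=True, top_k=50))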
Named Entity Recognition
The Named Entity Recognition (NER) task involves classifying tokens according to a class, for example, identifying a token as a person, an organization, or a location
from transformers import pipeline

nlp = pipeline("ner")

sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

print(nlp(sequence))

[{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
 {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
 {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
 {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
 {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
 {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
 {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
 {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
 {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
 {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
 {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
 {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}]
The example above uses a pipeline to do named entity recognition, specifically identifying tokens as belonging to one of nine classes
Summarization
Summarization condenses an article or document into a shorter text containing its most important points
from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney's Office by Immigration and Customs Enforcement and the Department of Homeland Security's
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]
Translation
I love this use case
Translation is the task of translating a text from one language to another
from transformers import pipeline

translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
Summary Of The Different Types Of Models
The following is a summary of the models available in Transformers
Autoregressive models
Autoregressive models are pretrained on the classic language modeling task: guess the next token after reading all the previous ones. They correspond to the decoder of the original transformer model, and a mask is placed on top of the full sentence so that the attention heads can only see what came before, not what came after. Although these models can be fine-tuned for a variety of tasks, the most natural application is text generation. GPT is a typical example of such a model
For Example
Original GPT
This is the first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
The library provides versions of the model for language modeling and multitask language modeling/multiple-choice classification
GPT-2
This is a new and improved GPT, pretrained on WebText (web pages from outgoing links on Reddit with at least 3 karma).
Models for language modeling and multitask language modeling/multiple-choice classification are available in the library.
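As a minimal sketch (assuming the gpt2 checkpoint from the model hub), this is how the language modeling version of GPT-2 can be loaded and used directly, without the pipeline wrapper:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode a prompt and let the causal language model continue it greedily
inputs = tokenizer("Transformers are useful because", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=30, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))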
Autoencoding models
Autoencoding models are pretrained by corrupting the input tokens in some way and then trying to reconstruct the original sentence. They correspond to the encoder of the original transformer model, and they get access to the full inputs without any mask, so they usually build a bidirectional representation of the whole sentence. While they can be fine-tuned and perform well on many tasks such as text generation, their most natural application is sentence classification or token classification. BERT is a typical example of such a model
For Example
BERT
It uses random masking to corrupt the inputs. More precisely, during pretraining, a given percentage of tokens (usually 15%) is masked by the following (a short usage sketch follows this list):
- A special mask token with a probability of 0.8
- A random token different from the one masked with a probability of 0.1
- The same token with a probability of 0.1
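As a minimal sketch of the masked language modeling version in practice (bert-base-uncased is just one checkpoint you could pick), you can let BERT fill in its [MASK] token:

from transformers import pipeline

# fill-mask pipeline with an explicit BERT checkpoint
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Paris is the [MASK] of France."))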
ALBERT
The same as BERT, but with a few tweaks:
- Embedding size E is different from hidden size H, which is justified because embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens)
- Layers are split into groups that share parameters (to save memory).
- Next sentence prediction is replaced by sentence order prediction: two consecutive sentences (A and B) are fed in, either in the original order (A then B) or swapped (B then A), and the model must predict whether they have been swapped
A version of the model is available for masked language modeling, token classification, sentence classification, multiple-choice classification, and question answering.
RoBERTa
Same as BERT with better pretraining tricks:
- Masking is dynamic: tokens are changed at each epoch, while BERT does it once and for all
- No NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of contiguous texts together to reach 512 tokens (so the sentences are arranged in an order that may span several documents)
- Train with larger batches
- Use BPE with bytes as subunits, not characters (because of Unicode characters)
A version of the model is available for masked language modeling, token classification, sentence classification, multiple-choice classification, and question answering
NOTE:
The only difference between autoregressive and autoencoding models lies in how they are pretrained. Therefore, autoencoding and autoregressive models can be built using the same architecture
Sequence-to-sequence models
Sequence-to-sequence models use both the encoder and the decoder of the original transformer. They can be fine-tuned for various tasks, but are best suited to translation, summarization, and question answering. The original transformer is an example of this type of model (for translation only)
For Example
BART
BART is a sequence-to-sequence model with both an encoder and a decoder. The encoder is fed a corrupted version of the tokens, while the decoder is fed the original tokens (but has a mask to hide future words, like a regular transformer decoder, which I discussed in my transformers article). For the pretraining task, the following transformations are applied to the encoder's input:
- Mask random tokens (like in BERT)
- Delete random tokens
- Mask a span of k tokens with a single mask token (an insertion of a mask token is a span of 0 tokens)
- Permute sentences
- Rotate the document to make it start at a specific token
A version of this model is provided by the library for conditional generation and sequence classification
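As a rough sketch of the conditional generation version (assuming the facebook/bart-large-cnn checkpoint, which is fine-tuned for summarization), BART can be used like this:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

text = "Hugging Face is a company based in New York City. Its headquarters are in DUMBO, very close to the Manhattan Bridge."

# The encoder reads the document; the decoder generates the summary token by token
inputs = tokenizer(text, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))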
Multimodal models
There is one multimodal model in the library, and it has not been pretrained in the self-supervised fashion like the others
For Example
MMBT
In multimodal settings, this model combines a text and an image to make predictions. It uses the embeddings of the tokenized text and the final activations of a ResNet pretrained on images (taken after the pooling layer), which go through a linear layer (to map them to the hidden state dimension)
The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the model know which part of the input vector corresponds to the text and which to the image
The pretrained model can only be used for classification
Retrieval-based models
To answer open-domain questions, for example, some models use document retrieval during (pre)training and inference
For Example
DPR
The Dense Passage Retrieval (DPR) system is a set of tools and models for state-of-the-art open-domain question answering.
DPR consists of three models:
- Question encoder: encode questions as vectors
- Context encoder: encode contexts as vectors
- Reader: extract the answer to the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).
In DPR's pipeline (not yet implemented), a retrieval step finds the top k contexts for a certain question, and then the reader is called with the question and the retrieved documents to get the answer
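As a sketch of the retrieval step only (assuming the facebook/dpr-question_encoder-single-nq-base and facebook/dpr-ctx_encoder-single-nq-base checkpoints), you can encode a question and a few contexts as dense vectors and rank the contexts by similarity:

import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "Where is Hugging Face based?"
contexts = ["Hugging Face is a company based in New York and Paris.",
            "The Eiffel Tower was completed in 1889."]

# Encode the question and each context as dense vectors
q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
ctx_emb = ctx_encoder(**ctx_tokenizer(contexts, padding=True, return_tensors="pt")).pooler_output

# Higher dot product = more relevant context for the question
scores = torch.matmul(q_emb, ctx_emb.T)
print(scores)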
I have discussed only some examples of the different types of models here; you will find more in the Transformers documentation
Preprocessing Data For Different Types Of Models
The Transformers library provides tools to preprocess data, the main one being the tokenizer. You can create a tokenizer using the tokenizer class associated with the model you want to use, or directly with the AutoTokenizer class
Tokenizers begin by splitting a given text into words (or parts of words, punctuation symbols, etc.) called tokens. It will then convert the tokens into numbers so that a tensor can be built from them and fed to the model. The model will also be able to add additional inputs if required
The from_pretrained() method allows you to download the vocabulary used during pretraining or fine-tuning a given model automatically
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
Base use
PreTrainedTokenizer has many methods, but the only one you need to remember for preprocessing is its __call__: you simply feed your sentence to your tokenizer object
encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This returns a dictionary mapping strings to lists of integers. The input_ids correspond to each token in our sentence. We will see below what the attention_mask is used for, and in the next section what the token_type_ids are used for
The tokenizer can decode a list of token ids back into a proper sentence
tokenizer.decode(encoded_input["input_ids"])

"[CLS] Hello, I'm a single sentence! [SEP]"
The tokenizer automatically added some special tokens that the model expects. Not all models need them; for example, if we had created our tokenizer using gpt2-medium instead of bert-base-cased, we would have seen the same sentence as the original one after decoding. By passing add_special_tokens=False, you can disable this behavior (which is only recommended if you added those special tokens yourself)
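As a tiny sketch, here is what disabling the special tokens looks like with the same tokenizer:

# Disable the automatic special tokens (only do this if you add them yourself)
encoded = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
print(tokenizer.decode(encoded["input_ids"]))  # no [CLS]/[SEP] in the decoded text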
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the tokenizer:
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}
A dictionary is returned once again, this time with lists of lists of integers as values.
To build a batch of sentences to feed the model, you may want to send several sentences at a time to the tokenizer:
- To pad each sentence to the maximum length there is in your batch.
- To truncate each sentence to the maximum length the model can accept (if applicable).
- To return tensors.
All of this can be done using the following options when feeding your list of sentences to the tokenizer:
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch)

{'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
                      [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
                      [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
The function returns a dictionary with string keys and tensor values. Now we can see what the attention_mask is all about: it shows which tokens the model should pay attention to and which ones it should ignore (because they represent padding here)
The command above will throw a warning if your model does not have a maximum length. It can be safely ignored. Alternatively, you can pass verbose=False to stop the tokenizer from throwing such warnings
Preprocessing Pairs Of Sentences
You may need to feed your model a pair of sentences. For instance, if you want to classify if two sentences in a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is then represented like this: [CLS] Sequence A [SEP] Sequence B [SEP]
Two sentences can be encoded in the format expected by your model by providing the two sentences as two arguments (not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw earlier). Again, this will return a dict string to a list of integers
encoded_input = tokenizer("How old are you?", "I'm 6 years old")
print(encoded_input)

{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The token_type_ids have one purpose: they tell the model which part of the input corresponds to the first sentence and which part corresponds to the second sentence. Note that token_type_ids are not required by all models; by default, a tokenizer will only return the inputs that its associated model expects. You can force the return (or non-return) of any of those special arguments by using return_input_ids or return_token_type_ids
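For example, as a small sketch, you can tell the tokenizer not to return token_type_ids at all:

# Force the tokenizer not to return token_type_ids for this call
encoded = tokenizer("How old are you?", "I'm 6 years old", return_token_type_ids=False)
print(encoded.keys())  # 'token_type_ids' is no longer in the result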
By decoding the token ids, we can see that the special tokens have been correctly added
tokenizer.decode(encoded_input["input_ids"])

"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
When you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the list of first sentences, and the list of second sentences
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
                             "And I should be encoded with the second sentence",
                             "And I go with the very last one"]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
               [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
We can see that it returns a dictionary with each value being a list of lists of integers.
We can double-check what is fed to the model by decoding each list in input_ids one by one
for ids in encoded_inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
Similarly, you can automatically pad your inputs to the maximum sentence length in your batch, truncate them to the maximum length the model can accept, and return tensors directly as follows
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
What you always wanted to know about padding and truncation
- Padding is controlled by the padding argument, which can be a boolean or a string:
- True or 'longest' to pad to the longest sequence in the batch (doing no padding if you only provide a single sequence)
- 'max_length' to pad to a length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). If you only provide a single sequence, padding will still be applied to it.
- False or 'do_not_pad' to not pad the sequences. As we have seen before, this is the default behavior.
- Truncation controls the truncation. It can be a boolean or a string which should be:
- 'only_first' truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- 'only_second' truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will only truncate the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided
- True or 'longest_first' truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will truncate token by token, removing a token from the longest sequence in the pair until the proper length is reached
- False or 'do_not_truncate' to not truncate the sequences. As we have seen before, this is the default behavior
- max_length to control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. A few of these combinations are sketched below
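As a sketch of a few such combinations (the max_length values here are arbitrary illustrations):

# Pad everything to max_length=32 and only truncate the second sentence of each pair
batch = tokenizer(batch_sentences, batch_of_second_sentences,
                  padding="max_length", max_length=32,
                  truncation="only_second", return_tensors="pt")

# Pad to the longest sequence in the batch; truncate pairs token by token down to 16 tokens
batch = tokenizer(batch_sentences, batch_of_second_sentences,
                  padding=True, truncation="longest_first",
                  max_length=16, return_tensors="pt")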
If your inputs are already split into words (pre-tokenized), pass is_split_into_words=True; you can still add padding, truncation, and directly return tensors as before:
batch = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True, padding=True, truncation=True, return_tensors="pt")
Fine-tuning a pretrained model
Pretrained models from the Transformers library can be fine-tuned. In TensorFlow, models can be trained directly with Keras and the fit method. Since PyTorch does not have a generic training loop, the Transformers library provides a Trainer API that lets you fine-tune or train a model from scratch. Later, we will also see how the entire training loop can be written natively in PyTorch
Preparing the datasets
The Datasets library will be used to download and preprocess the IMDB datasets. This part will be covered pretty quickly. The focus of this tutorial is on training, so you should refer to the Datasets documentation or the Preprocessing data tutorial for more information.
First, we can use the load_dataset function to download and cache the dataset:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")
load_dataset downloads and caches the dataset; it works similarly to the from_pretrained method we saw for models and tokenizers
raw_datasets is a dictionary-like object with three keys: "train", "test" and "unsupervised" (corresponding to the three splits of that dataset). We will use the "train" split for training and the "test" split for validation.
We will need a tokenizer to preprocess our data:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
We saw in Preprocessing data that we could prepare text inputs for the model with the following command (this is an example, not a command you can execute)
inputs = tokenizer(sentences, padding="max_length", truncation=True)
All samples will have the maximum length that the model can accept (here 512), either by padding or truncating them.
Instead, we can use the map method to apply these preprocessing steps to all the splits of our dataset at once
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
To facilitate faster training, we will generate a small subset of the training and validation set:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
We will always use small_train_dataset and small_eval_dataset in all the examples below. To train or evaluate on the full dataset, just replace them with their full equivalents.
PyTorch fine-tuning with the Trainer API
As PyTorch does not provide a training loop, the Transformers library provides a Trainer API that is optimized for Transformers models. The system offers a wide range of training options as well as features such as logging, gradient accumulation, and mixed-precision
Let's define our model first
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
There will be a warning about some of the pretrained weights not being used and some weights being randomly initialized. That's because we are replacing the BERT model's pretraining head with a classification head that is randomly initialized. This model will be fine-tuned on our task, transferring the knowledge of the pretrained model to it (this is known as transfer learning)
To define our Trainer, we first need to create a TrainingArguments object. This class contains all the hyperparameters we can tune for the Trainer and the flags to activate the different training options it supports. Let us begin with the default settings; all we need to provide is the directory where the checkpoints will be saved
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer")
The Trainer can then be instantiated as follows:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
We can fine-tune our model by calling
trainer.train()
There will be a progress bar showing how long it will take to complete the training (if you have access to a GPU). It won't actually tell you anything about how well (or badly) your model is performing as by default, there is no evaluation during training, and we didn't instruct the trainer to compute any metrics. Let's see how we can accomplish this!
The Trainer needs a compute_metrics function that takes predictions and labels (grouped in a named tuple named EvalPrediction) and returns a dictionary containing string items (the names of the metrics) and float values (the metrics values).
In the Datasets library, the load_metric function provides an easy way to get the common metrics used in NLP. We simply use accuracy here. Lastly, we define the compute_metrics function that just converts logits into predictions (remember that all Transformers models return logits) and feeds them into the compute method of this metric
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
It must receive a tuple (with logits and labels) and return a dictionary with string keys (the name of the metric) and float values. A call to this function will be made at the end of each evaluation phase on the whole array of predictions/labels
To check that this works in practice, let's create a new Trainer with our fine-tuned model and evaluate it
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.evaluate()
In our case, it showed an accuracy of 87.5%.
The following is how you should define your training arguments if you want to fine-tune your model and report the evaluation metrics regularly (for example, at the end of each epoch):
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")
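Putting the pieces together, a sketch of the full setup that reports accuracy at the end of every epoch looks like this (reusing the model, datasets, and compute_metrics defined above):

trainer = Trainer(
    model=model,
    args=training_args,                  # evaluation_strategy="epoch" from above
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,     # the accuracy function defined earlier
)
trainer.train()  # accuracy is now reported after each training epoch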
Fine-tuning with custom datasets
This section shows how to use Transformers models with your own datasets.
IMDb Reviews and Sequence Classification
The data is organized into pos and neg folders, with one text file per example. Let's create a function that can read this
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir / label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)  # neg -> 0, pos -> 1
    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
In addition to the train and test sets, let's also create a validation set that we can use for evaluation and tuning without affecting the test set results. It is very easy to create such splits with Sklearn
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
Okay, we've read in our dataset. Let's talk about tokenization now. Our classifier will be trained using pre-trained DistilBert models, so let's use the DistilBert tokenizer
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
We can now pass our texts directly to the tokenizer. We'll pass the parameters truncation=True and padding=True, which will ensure that all of our sequences are padded to the same length and are truncated to no longer exceed the maximum input length of the model. In this way, we can feed batches of sequences into the model at the same time
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
Let's create a Dataset object from our labels and encodings. In PyTorch, this is done by subclassing torch.utils.data.Dataset and implementing __len__ and __getitem__; in TensorFlow, you pass the input encodings and labels to the from_tensor_slices constructor method. Putting the data in this format lets us batch the data so that the keys in each batch encoding correspond to the parameters of the forward() method of the model we will train
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
Now that our datasets are ready, we can fine-tune on them either with the Trainer (or TFTrainer), as covered in the training section above, or with native PyTorch/TensorFlow
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                  # the instantiated Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)

trainer.train()
Fine-tuning with native PyTorch/TensorFlow
If you prefer, you can also write the training loop yourself in native PyTorch
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()
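The loop above only trains the model; as a sketch, a matching evaluation loop on val_dataset (computing accuracy by hand, with gradients disabled) could look like this:

import torch
from torch.utils.data import DataLoader

val_loader = DataLoader(val_dataset, batch_size=64)
correct, total = 0, 0
with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        preds = outputs.logits.argmax(dim=-1)   # predicted class per example
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"validation accuracy: {correct / total:.4f}")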
I only discussed sentiment analysis here. The transformers package can also be used to solve other problems with a custom dataset, such as the question answering task discussed above, in the same way I showed for sentiment analysis.