HOW TO USE TRANSFORMERS FOR REAL-LIFE PROBLEMS USING DIFFERENT TRANSFORMER MODELS
Shivam Jha
AI Team Lead | NLP | Computer Vision | Machine Learning | GCP | DevOps | Full Stack Web Development | Full Stack Mobile Application Development | Prompt Engineer | LLM | Langchain
In this article, I discuss some use cases of transformers, the different types of models, a few terms related to transformers, data preprocessing, fine-tuning models, and how to use models with a custom dataset. I have collected information from the internet and explained it in simple language, so you don't have to look anywhere else to learn this.
The Python package I use here is called transformers, and you can use it to solve real-world problems.
I have written about transformers in my previous article, so please read it before progressing to this one; it will help you understand what follows.
https://www.dhirubhai.net/pulse/long-live-transformer-shivam-jha
Summary Of Some Use Cases
Sequence Classification
In sequence classification, a sequence is classified into one of a given number of classes
In this case, we are determining if a sequence is positive or negative
from transformers import pipeline

nlp = pipeline("sentiment-analysis")

result = nlp("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: NEGATIVE, with score: 0.9991

result = nlp("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999
Extractive Question Answering
Extractive Question Answering is the task of extracting an answer from a text, called the context, given a question
from transformers import pipeline

nlp = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = nlp(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'the task of extracting an answer from a text given a question.', score: 0.6226, start: 34, end: 96

result = nlp(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'SQuAD dataset,', score: 0.5053, start: 147, end: 161
Language Modeling
The task of language modeling involves fitting a model to a corpus, which can be domain-specific. Transformers are trained using a variant of language modeling, e.g., masked language modeling for BERT and causal language modeling for GPT-2
Masked Language Modeling
The purpose of masked language modeling is to mask tokens in a sequence with a masking token and then prompt the model to fill that mask with an appropriate token. The model can take into account both the right context (tokens on the right of the mask) and the left context (tokens on the left of the mask)
from transformers import pipeline
from pprint import pprint

nlp = pipeline("fill-mask")
pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))

[{'score': 0.1792745739221573,
  'sequence': '<s>HuggingFace is creating a tool that the community uses to solve NLP tasks.</s>',
  'token': 3944,
  'token_str': 'Ġtool'},
 {'score': 0.11349421739578247,
  'sequence': '<s>HuggingFace is creating a framework that the community uses to solve NLP tasks.</s>',
  'token': 7208,
  'token_str': 'Ġframework'},
 {'score': 0.05243554711341858,
  'sequence': '<s>HuggingFace is creating a library that the community uses to solve NLP tasks.</s>',
  'token': 5560,
  'token_str': 'Ġlibrary'},
 {'score': 0.03493533283472061,
  'sequence': '<s>HuggingFace is creating a database that the community uses to solve NLP tasks.</s>',
  'token': 8503,
  'token_str': 'Ġdatabase'},
 {'score': 0.02860250137746334,
  'sequence': '<s>HuggingFace is creating a prototype that the community uses to solve NLP tasks.</s>',
  'token': 17715,
  'token_str': 'Ġprototype'}]
Text Generation
Text generation (also called open-ended text generation) aims to create coherent portions of text that follow from a given context
from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))

[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
In this case, the model generates a continuation of the given context with a maximum total length of 50 tokens. Because do_sample=False, decoding is greedy, so the output is deterministic rather than random
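As a quick sketch (the top_k value here is just an illustrative choice), you can enable sampling instead to get more varied, non-deterministic continuations:

from transformers import pipeline

text_generator = pipeline("text-generation")  # GPT-2 by default

# do_sample=True switches from greedy decoding to sampling;
# top_k=50 restricts sampling to the 50 most likely next tokens
print(text_generator("As far as I am concerned, I will",
                     max_length=50, do_sample=True, top_k=50))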
Named Entity Recognition
The Named Entity Recognition (NER) task involves classifying tokens according to a class, for example, identifying a token as a person, an organization, or a location
from transformers import pipeline

nlp = pipeline("ner")

sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

print(nlp(sequence))

[{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
 {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
 {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
 {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
 {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
 {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
 {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
 {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
 {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
 {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
 {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
 {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}]
The example above uses a pipeline to do named entity recognition, specifically identifying tokens as belonging to one of nine classes
Summarization
Summarization condenses an article or document into a shorter text containing its most important points
from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney's Office by Immigration and Customs Enforcement and the Department of Homeland Security's
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]
Translation
I love this use case
Translation is the task of translating a text from one language to another
from transformers import pipeline

translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
Summary Of The Different Types Of Models
The following is a summary of the models available in Transformers
Autoregressive models
Autoregressive models are pretrained on the classic language modeling task: guess the next token after reading all the previous ones. They correspond to the decoder of the original transformer model, and a mask is placed on top of the full sentence so that the attention heads can only see what came before, not what came after. Although these models can be fine-tuned for a variety of tasks, the most natural application is text generation. GPT is a typical example of such a model
For Example
Original GPT
This is the first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
The library provides versions of the model for language modeling and multitask language modeling/multiple-choice classification
GPT-2
This is a new and improved GPT, pretrained on WebText (web pages from outgoing links on Reddit with at least 3 karma).
Models for language modeling and multitask language modeling/multiple-choice classification are available in the library.
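As a minimal sketch (assuming the gpt2 checkpoint from the model hub), this is how the language modeling version of GPT-2 can be loaded and used directly, without the pipeline wrapper:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode a prompt and let the causal language model continue it greedily
inputs = tokenizer("Transformers are useful because", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=30, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))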
Autoencoding models
Autoencoding models are pretrained by corrupting the input tokens in some way and then trying to reconstruct the original sentence. They correspond to the encoder of the original transformer model, and they get access to the full inputs without any mask, so they usually build a bidirectional representation of the whole sentence. While they can be fine-tuned and perform well on many tasks such as text generation, their most natural application is sentence classification or token classification. BERT is a typical example of such a model
For Example
BERT
It uses random masking to corrupt the inputs. More precisely, during pretraining, a given percentage of tokens (usually 15%) is masked by the following (a short usage sketch follows this list):
- A special mask token with a probability of 0.8
- A random token different from the one masked with a probability of 0.1
- The same token with a probability of 0.1
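As a minimal sketch of the masked language modeling version in practice (bert-base-uncased is just one checkpoint you could pick), you can let BERT fill in its [MASK] token:

from transformers import pipeline

# fill-mask pipeline with an explicit BERT checkpoint
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Paris is the [MASK] of France."))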
ALBERT
The same as BERT, but with a few tweaks:
- Embedding size E is different from hidden size H, which is justified because embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens)
- Layers are split into groups that share parameters (to save memory).
- Next sentence prediction is replaced by sentence order prediction: two consecutive sentences (A and B) are fed in, either in the original order (A then B) or swapped (B then A), and the model must predict whether they have been swapped
A version of the model is available for masked language modeling, token classification, sentence classification, multiple-choice classification, and question answering.
RoBERTa
Same as BERT with better pretraining tricks:
- Masking is dynamic: tokens are changed at each epoch, while BERT does it once and for all
- No NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of contiguous texts together to reach 512 tokens (so the sentences are arranged in an order that may span several documents)
- Train with larger batches
- Use BPE with bytes as subunits, not characters (because of Unicode characters)
A version of the model is available for masked language modeling, token classification, sentence classification, multiple-choice classification, and question answering
NOTE:
The only difference between autoregressive and autoencoding models lies in how they are pretrained. Therefore, autoencoding and autoregressive models can be built using the same architecture
Sequence-to-sequence models
Sequence-to-sequence models use both the encoder and the decoder of the original transformer. They can be fine-tuned for various tasks, but are best suited to translation, summarization, and question answering. The original transformer is an example of this type of model (for translation only)
For Example
BART
BART is a sequence-to-sequence model with both an encoder and a decoder. The encoder is fed a corrupted version of the tokens, while the decoder is fed the original tokens (but has a mask to hide future words, like a regular transformer decoder, which I discussed in my transformers article). For the pretraining task, the following transformations are applied to the encoder's input:
- Mask random tokens (like in BERT)
- Delete random tokens
- Mask a span of k tokens with a single mask token (an insertion of a mask token is a span of 0 tokens)
- Permute sentences
- Rotate the document to make it start at a specific token
A version of this model is provided by the library for conditional generation and sequence classification
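As a rough sketch of the conditional generation version (assuming the facebook/bart-large-cnn checkpoint, which is fine-tuned for summarization), BART can be used like this:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

text = "Hugging Face is a company based in New York City. Its headquarters are in DUMBO, very close to the Manhattan Bridge."

# The encoder reads the document; the decoder generates the summary token by token
inputs = tokenizer(text, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))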
Multimodal models
There is one multimodal model in the library, and it has not been pretrained in the self-supervised fashion like the others
For Example
MMBT
In multimodal settings, this model combines a text and an image to make predictions. It uses the embeddings of the tokenized text and the final activations of a ResNet pretrained on images (taken after the pooling layer), which go through a linear layer (to map them to the hidden state dimension)
The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the model know which part of the input vector corresponds to the text and which to the image
The pretrained model can only be used for classification
Retrieval-based models
To answer open-domain questions, for example, some models use document retrieval during (pre)training and inference
For Example
DPR
The Dense Passage Retrieval (DPR) system is a set of tools and models for state-of-the-art open-domain question answering.
DPR consists of three models:
- Question encoder: encode questions as vectors
- Context encoder: encode contexts as vectors
- Reader: extract the answer to the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).
In DPR's pipeline (not yet implemented), a retrieval step finds the top k contexts for a certain question, and then the reader is called with the question and the retrieved documents to get the answer
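As a sketch of the retrieval step only (assuming the facebook/dpr-question_encoder-single-nq-base and facebook/dpr-ctx_encoder-single-nq-base checkpoints), you can encode a question and a few contexts as dense vectors and rank the contexts by similarity:

import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "Where is Hugging Face based?"
contexts = ["Hugging Face is a company based in New York and Paris.",
            "The Eiffel Tower was completed in 1889."]

# Encode the question and each context as dense vectors
q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
ctx_emb = ctx_encoder(**ctx_tokenizer(contexts, padding=True, return_tensors="pt")).pooler_output

# Higher dot product = more relevant context for the question
scores = torch.matmul(q_emb, ctx_emb.T)
print(scores)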
I have discussed only some examples of the different types of models here; you will find more in the Transformers documentation
Preprocessing Data For Different Types Of Models
The Transformers library provides tools to preprocess data, the main one being the tokenizer. You can create a tokenizer using the tokenizer class associated with the model you want to use, or directly with the AutoTokenizer class
Tokenizers begin by splitting a given text into words (or parts of words, punctuation symbols, etc.) called tokens. It will then convert the tokens into numbers so that a tensor can be built from them and fed to the model. The model will also be able to add additional inputs if required
The from_pretrained() method allows you to download the vocabulary used during pretraining or fine-tuning a given model automatically
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
Base use
PreTrainedTokenizer has many methods, but the only one you need to remember for preprocessing is its __call__: you simply feed your sentence to your tokenizer object
encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This returns a dictionary mapping strings to lists of integers. The input_ids correspond to each token in our sentence. We will see below what the attention_mask is used for, and in the next section what the token_type_ids are used for
The tokenizer can decode a list of token ids back into a proper sentence
tokenizer.decode(encoded_input["input_ids"])

"[CLS] Hello, I'm a single sentence! [SEP]"
The tokenizer automatically added some special tokens that the model expects. Not all models need them; for example, if we had created our tokenizer using gpt2-medium instead of bert-base-cased, we would have seen the same sentence as the original one after decoding. By passing add_special_tokens=False, you can disable this behavior (which is only recommended if you added those special tokens yourself)
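As a tiny sketch, here is what disabling the special tokens looks like with the same tokenizer:

# Disable the automatic special tokens (only do this if you add them yourself)
encoded = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
print(tokenizer.decode(encoded["input_ids"]))  # no [CLS]/[SEP] in the decoded text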
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the tokenizer:
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}
A dictionary is returned once again, this time with lists of lists of integers as values.
To build a batch of sentences to feed the model, you may want to send several sentences at a time to the tokenizer:
- To pad each sentence to the maximum length there is in your batch.
- To truncate each sentence to the maximum length the model can accept (if applicable).
- To return tensors.
All of this can be done using the following options when feeding your list of sentences to the tokenizer:
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch)

{'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
                      [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
                      [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
The function returns a dictionary with string keys and tensor values. Now we can see what the attention_mask is all about: it shows which tokens the model should pay attention to and which ones it should ignore (because they represent padding here)
The command above will throw a warning if your model does not have a maximum length. It can be safely ignored. Alternatively, you can pass verbose=False to stop the tokenizer from throwing such warnings
Preprocessing Pairs Of Sentences
You may need to feed your model a pair of sentences. For instance, if you want to classify if two sentences in a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is then represented like this: [CLS] Sequence A [SEP] Sequence B [SEP]
Two sentences can be encoded in the format expected by your model by providing the two sentences as two arguments (not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw earlier). Again, this will return a dict string to a list of integers
encoded_input = tokenizer("How old are you?", "I'm 6 years old")
print(encoded_input)

{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The token_type_ids have one purpose: they tell the model which part of the input corresponds to the first sentence and which part corresponds to the second sentence. Note that token_type_ids are not required by all models; by default, a tokenizer will only return the inputs that its associated model expects. You can force the return (or non-return) of any of those special arguments by using return_input_ids or return_token_type_ids
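For example, as a small sketch, you can tell the tokenizer not to return token_type_ids at all:

# Force the tokenizer not to return token_type_ids for this call
encoded = tokenizer("How old are you?", "I'm 6 years old", return_token_type_ids=False)
print(encoded.keys())  # 'token_type_ids' is no longer in the result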
By decoding the token ids, we can see that the special tokens have been correctly added
tokenizer.decode(encoded_input["input_ids"])

"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
When you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the list of first sentences, and the list of second sentences
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
                             "And I should be encoded with the second sentence",
                             "And I go with the very last one"]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
               [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
We can see that it returns a dictionary with each value being a list of lists of integers.
We can double-check what is fed to the model by decoding each list in input_ids one by one
for ids in encoded_inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
Similarly, you can automatically pad your inputs to the maximum sentence length in your batch, truncate them to the maximum length the model can accept, and return tensors directly as follows
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
What you always wanted to know about padding and truncation
- Padding is controlled by the padding argument, which can be a boolean or a string:
- True or 'longest' to pad to the longest sequence in the batch (doing no padding if you only provide a single sequence)
- 'max_length' to pad to a length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). If you only provide a single sequence, padding will still be applied to it.
- False or 'do_not_pad' to not pad the sequences. As we have seen before, this is the default behavior.
- Truncation controls the truncation. It can be a boolean or a string which should be:
- 'only_first' truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- 'only_second' truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will only truncate the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided
- True or 'longest_first' truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will truncate token by token, removing a token from the longest sequence in the pair until the proper length is reached
- False or 'do_not_truncate' to not truncate the sequences. As we have seen before, this is the default behavior
- max_length to control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. A few of these combinations are sketched below
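As a sketch of a few such combinations (the max_length values here are arbitrary illustrations):

# Pad everything to max_length=32 and only truncate the second sentence of each pair
batch = tokenizer(batch_sentences, batch_of_second_sentences,
                  padding="max_length", max_length=32,
                  truncation="only_second", return_tensors="pt")

# Pad to the longest sequence in the batch; truncate pairs token by token down to 16 tokens
batch = tokenizer(batch_sentences, batch_of_second_sentences,
                  padding=True, truncation="longest_first",
                  max_length=16, return_tensors="pt")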
If your inputs are already split into words (pre-tokenized), pass is_split_into_words=True; you can still add padding, truncation, and directly return tensors as before:
batch = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True, padding=True, truncation=True, return_tensors="pt")
Fine-tuning a pretrained model
Pretrained models from the Transformers library can be fine-tuned. In TensorFlow, models can be trained directly with Keras and the fit method. Since PyTorch does not have a generic training loop, the Transformers library provides a Trainer API that lets you fine-tune or train a model from scratch. Later, we will also see how the entire training loop can be written natively in PyTorch
Preparing the datasets
The Datasets library will be used to download and preprocess the IMDB datasets. This part will be covered pretty quickly. The focus of this tutorial is on training, so you should refer to the Datasets documentation or the Preprocessing data tutorial for more information.
First, we can use the load_dataset function to download and cache the dataset:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")
load_dataset downloads and caches the dataset; it works similarly to the from_pretrained method we saw for models and tokenizers
raw_datasets is a dictionary-like object with three keys: "train", "test" and "unsupervised" (corresponding to the three splits of that dataset). We will use the "train" split for training and the "test" split for validation.
We will need a tokenizer to preprocess our data:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
We saw in Preprocessing data that we could prepare text inputs for the model with the following command (this is an example, not a command you can execute)
inputs = tokenizer(sentences, padding="max_length", truncation=True)
All samples will have the maximum length that the model can accept (here 512), either by padding or truncating them.
Instead, we can use the map method to apply these preprocessing steps to all the splits of our dataset at once
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
To facilitate faster training, we will generate a small subset of the training and validation set:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
We will always use small_train_dataset and small_eval_dataset in all the examples below. To train or evaluate on the full dataset, just replace them with their full equivalents.
PyTorch fine-tuning with the Trainer API
As PyTorch does not provide a training loop, the Transformers library provides a Trainer API that is optimized for Transformers models. The system offers a wide range of training options as well as features such as logging, gradient accumulation, and mixed-precision
Let's define our model first
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
There will be a warning about some of the pretrained weights not being used and some weights being randomly initialized. That's because we are replacing the BERT model's pretraining head with a classification head that is randomly initialized. This model will be fine-tuned on our task, transferring the knowledge of the pretrained model to it (this is known as transfer learning)
To define our Trainer, we first need to create a TrainingArguments object. This class contains all the hyperparameters we can tune for the Trainer and the flags to activate the different training options it supports. Let us begin with the default settings; all we need to provide is the directory where the checkpoints will be saved
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer")
The Trainer can then be instantiated as follows:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
We can fine-tune our model by calling
trainer.train()
There will be a progress bar showing how long it will take to complete the training (if you have access to a GPU). It won't actually tell you anything about how well (or badly) your model is performing as by default, there is no evaluation during training, and we didn't instruct the trainer to compute any metrics. Let's see how we can accomplish this!
The Trainer needs a compute_metrics function that takes predictions and labels (grouped in a named tuple named EvalPrediction) and returns a dictionary containing string items (the names of the metrics) and float values (the metrics values).
In the Datasets library, the load_metric function provides an easy way to get the common metrics used in NLP. We simply use accuracy here. Lastly, we define the compute_metrics function that just converts logits into predictions (remember that all Transformers models return logits) and feeds them into the compute method of this metric
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
It must receive a tuple (with logits and labels) and return a dictionary with string keys (the name of the metric) and float values. A call to this function will be made at the end of each evaluation phase on the whole array of predictions/labels
To check that this works in practice, let's create a new Trainer with our fine-tuned model and evaluate it
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.evaluate()
In our case, it showed an accuracy of 87.5%.
The following is how you should define your training arguments if you want to fine-tune your model and report the evaluation metrics regularly (for example, at the end of each epoch):
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")
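Putting the pieces together, a sketch of the full setup that reports accuracy at the end of every epoch looks like this (reusing the model, datasets, and compute_metrics defined above):

trainer = Trainer(
    model=model,
    args=training_args,                  # evaluation_strategy="epoch" from above
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,     # the accuracy function defined earlier
)
trainer.train()  # accuracy is now reported after each training epoch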
Fine-tuning with custom datasets
This section shows how to use Transformers models with your own datasets.
IMDb Reviews and Sequence Classification
The data is organized into pos and neg folders, with one text file per example. Let's create a function that can read this
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir / label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)  # neg -> 0, pos -> 1
    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
In addition to the train and test sets, let's also create a validation set that we can use for evaluation and tuning without affecting the test set results. It is very easy to create such splits with Sklearn
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
Okay, we've read in our dataset. Let's talk about tokenization now. Our classifier will be trained using pre-trained DistilBert models, so let's use the DistilBert tokenizer
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
We can now pass our texts directly to the tokenizer. We'll pass the parameters truncation=True and padding=True, which will ensure that all of our sequences are padded to the same length and are truncated to no longer exceed the maximum input length of the model. In this way, we can feed batches of sequences into the model at the same time
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
Let's create a Dataset object from our labels and encodings. In PyTorch, this is done by subclassing torch.utils.data.Dataset and implementing __len__ and __getitem__; in TensorFlow, you pass the input encodings and labels to the from_tensor_slices constructor method. Putting the data in this format lets us batch the data so that the keys in each batch encoding correspond to the parameters of the forward() method of the model we will train
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
Now that our datasets are ready, we can fine-tune on them either with the Trainer (or TFTrainer), as covered in the training section above, or with native PyTorch/TensorFlow
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                  # the instantiated Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)

trainer.train()
Fine-tuning with native PyTorch/TensorFlow
If you prefer, you can also write the training loop yourself in native PyTorch
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()
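The loop above only trains the model; as a sketch, a matching evaluation loop on val_dataset (computing accuracy by hand, with gradients disabled) could look like this:

import torch
from torch.utils.data import DataLoader

val_loader = DataLoader(val_dataset, batch_size=64)
correct, total = 0, 0
with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        preds = outputs.logits.argmax(dim=-1)   # predicted class per example
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"validation accuracy: {correct / total:.4f}")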
I only discussed sentiment analysis here. The transformers package can also be used to solve other problems with a custom dataset, such as the question answering task discussed above, in the same way I showed for sentiment analysis.