Optimizing BERT Model for Low-Resource Scenarios: How to Run a Faster BERT Model on a Small Server or Cloud Instance
Deep learning models have become increasingly popular and influential in recent years, but they also come with challenges. One of the main challenges is their high computational cost and memory requirement, which can limit their applicability in low-resource scenarios, such as running on a small server or a cloud instance with limited resources. In this article, we will deploy a BERT base model on a single-core instance with only 1 GB of RAM. By the end, you will better understand how to run a faster and more efficient deep learning model on a small server or cloud instance. I assume you have basic MLOps (Machine Learning Operations) knowledge, such as building a machine learning pipeline and serving it with TensorFlow Serving.
Step 1. Preparation
To begin with, you have to install these packages:
transformers
torch
tensorflow
Also make sure Docker is installed on your device.
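If you want a quick sanity check that everything installed correctly, a short, optional snippet like this prints the installed versions:
# Optional sanity check: confirm the packages import and print their versions
import tensorflow as tf
import torch
import transformers
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("tensorflow:", tf.__version__)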
Step 2. Save model in TensorFlow SavedModel format
Next, we download the model, convert it to TensorFlow if it is a PyTorch checkpoint, and save it in the SavedModel format. SavedModel is a format for serializing and deserializing TensorFlow models: it contains a complete description of the model architecture, weights, hyperparameters, signatures, and assets. SavedModel allows models to be used across different platforms and frameworks, such as TensorFlow Serving, TensorFlow Lite, TensorFlow.js, and TensorFlow Hub. In this article, we will use nateraw/bert-base-uncased-imdb, a PyTorch sentiment-classification model, but you can use any model trained on any task. Because it is a PyTorch checkpoint, we have to set the parameter "from_pt" to True to convert the model from PyTorch to TensorFlow.
from transformers import TFBertForSequenceClassification
# nateraw/bert-base-uncased-imdb is a sequence-classification model, so we load it with the matching TF head
model = TFBertForSequenceClassification.from_pretrained("nateraw/bert-base-uncased-imdb", from_pt=True)
# saved_model=True also exports a TensorFlow SavedModel under model/saved_model/1
model.save_pretrained("model", saved_model=True)
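If you want to double-check what was exported, you can load the SavedModel back and inspect its serving signature. This is only a quick sketch, assuming the default export path model/saved_model/1 that save_pretrained uses:
import tensorflow as tf
# Load the exported SavedModel and print the inputs and outputs of its serving signature
loaded = tf.saved_model.load("model/saved_model/1")
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)
print(serving_fn.structured_outputs)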
Step 3. Creating the Docker Image for serving the model
In this part, instead of using the standard TensorFlow Serving image, we will use Intel Optimized TensorFlow Serving. In my experience, deploying with Intel Optimized TensorFlow Serving increases my BERT model's performance by 1.7-2x, on top of the gain that TensorFlow Serving already provides over serving the model with Flask. The Intel® Optimization of TensorFlow Serving leverages the Intel® oneAPI Deep Neural Network Library (Intel® oneDNN) to perform inference significantly faster than the default installation on Intel® processors.
Building the Docker image for serving with Intel Optimized TensorFlow Serving is the same as with standard TensorFlow Serving. We only change the base image in our Dockerfile to "intel/intel-optimized-tensorflow-serving:latest" and add several configurations that you can customize.
FROM intel/intel-optimized-tensorflow-serving:latest
# Expose ports
# gRPC
EXPOSE 8500
# REST
EXPOSE 8501
# The only required piece is the model name in order to differentiate endpoints
ENV MODEL_NAME=bert
# Set where models should be stored in the container
ENV MODEL_BASE_PATH=/models
# copy the model to the working directory
COPY ./model/saved_model ${MODEL_BASE_PATH}/${MODEL_NAME}
# Setting MKL environment variables can improve performance.
# https://www.tensorflow.org/guide/performance/overview
# Read about Tuning MKL for the best performance
# Add export MKLDNN_VERBOSE=1 to the below script,
# to see MKL messages in the docker logs when you send predict request.
ENV OMP_NUM_THREADS=2
ENV KMP_BLOCKTIME=1
ENV KMP_SETTINGS=1
ENV KMP_AFFINITY='granularity=fine,verbose,compact,1,0'
ENV MKLDNN_VERBOSE=1
# No. of physical cores
ENV TENSORFLOW_INTRA_OP_PARALLELISM=2
# No. of sockets
ENV TENSORFLOW_INTER_OP_PARALLELISM=2
ENV PORT=8501
# Create a script that runs the model server so we can use environment variable
# while also passing in arguments from the docker command line
RUN echo '#!/bin/bash \n\n\
tensorflow_model_server --port=8500 --rest_api_port=${PORT} \
--tensorflow_intra_op_parallelism=${TENSORFLOW_INTRA_OP_PARALLELISM} \
--tensorflow_inter_op_parallelism=${TENSORFLOW_INTER_OP_PARALLELISM} \
--model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} \
"$@"' > /usr/bin/tf_serving_entrypoint.sh \
&& chmod +x /usr/bin/tf_serving_entrypoint.sh
ENTRYPOINT ["/usr/bin/tf_serving_entrypoint.sh"]s
Then build the docker image as usual with this command:
docker build -t <image name> .
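Before deploying, it is worth running the image locally and checking that the model loads, for example with docker run -d -p 8501:8501 <image name>. Once the container is up, TensorFlow Serving's model status endpoint should report the version as AVAILABLE; a minimal check could look like this:
import requests
# Query TensorFlow Serving's model status endpoint for the "bert" model
# (assumes the container is running locally with port 8501 published)
r = requests.get("http://localhost:8501/v1/models/bert")
print(r.json())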
Step 4. Deploy
In this article, I use DigitalOcean's App Platform with 1 vCPU and 1 GB of RAM, but you can deploy the image on any other platform. To make it easier to update and redeploy, I first push the image to the DigitalOcean Container Registry:
c:\Users\Lenovo> doctl auth init
c:\Users\Lenovo> doctl registry create <your-registry-name>
c:\Users\Lenovo> doctl registry login
c:\Users\Lenovo> docker tag <your image name> registry.digitalocean.com/<your-registry-name>/<your image name>
c:\Users\Lenovo> docker push registry.digitalocean.com/<your-registry-name>/<your image name>
For more detailed steps on this part, you can check this guide: Build and Deploy Your First Image to Your First Cluster :: DigitalOcean Documentation
In the DigitalOcean dashboard, you can create a new App Platform app, select "DigitalOcean Container Registry" as your source, and select your image name.
Then set any other configurations you need. You can leave the port at 8080 for easier access, or change it to anything you want.
Step 5. Query the Model through the REST API
To query your model, you have to tokenize your input, format it as JSON, and send it to <your model API URL>/v1/models/bert:predict.
from transformers import BertTokenizerFast, BertConfig
import requests
import json
import numpy as np
sentence = "I love the new TensorFlow update in transformers."
# Load the corresponding tokenizer of our SavedModel
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
# Load the model config of our SavedModel
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")
# Tokenize the sentence
batch = tokenizer(sentence)
# Convert the batch into a proper dict
batch = dict(batch)
# Put the example into a list of size 1, that corresponds to the batch size
batch = [batch]
# The REST API needs a JSON that contains the key instances to declare the examples to process
input_data = {"instances": batch}
# Query the REST API; the path follows http://host:port/v1/models/<model name>:predict
# (replace localhost:8501 with your deployed model's URL if you are not running locally)
r = requests.post("http://localhost:8501/v1/models/bert:predict", data=json.dumps(input_data))
# Parse the JSON result. The predictions are contained in a list under the root key "predictions";
# since there is only one example, take the first element of the list
result = json.loads(r.text)["predictions"][0]
# The returned values are the raw logits for each label, so the argmax gives the index of the predicted label
label_id = int(np.argmax(result))
# Print the proper LABEL with its index
print(config.id2label[label_id])
You can improve the performance further by including the preprocessing step in your TensorFlow Extended (TFX) pipeline and by using gRPC instead of REST, as sketched below.
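As an illustration of the gRPC path, here is a minimal sketch. It assumes the grpcio and tensorflow-serving-api packages are installed, that the gRPC port 8500 is reachable, and that the SavedModel's serving signature takes the tokenizer's int32 inputs and returns an output tensor named "logits" (check your own signature, e.g. with the inspection snippet from Step 2, and adjust the names accordingly):
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
from transformers import BertTokenizerFast, BertConfig

tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")
# Tokenize and cast to int32, which is what the exported serving signature is assumed to expect
batch = tokenizer("I love the new TensorFlow update in transformers.", return_tensors="np")
inputs = {name: np.asarray(value, dtype=np.int32) for name, value in batch.items()}
# Open an insecure channel to the gRPC port (8500) and build a Predict request
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = "bert"
request.model_spec.signature_name = "serving_default"
for name, value in inputs.items():
    request.inputs[name].CopyFrom(tf.make_tensor_proto(value))
# Send the request and read back the logits (the output name may differ; check your signature)
response = stub.Predict(request, 10.0)
logits = np.array(response.outputs["logits"].float_val)
print(config.id2label[int(np.argmax(logits))])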
This article discussed some techniques and strategies for optimizing a BERT model for low-resource scenarios. With only a few small changes, we roughly doubled the model's inference performance. I hope it helps you leverage the power of BERT for your natural language processing tasks without compromising on performance or cost.