Optimizing BERT Model for Low-Resource Scenarios: How to Run a Faster BERT Model on a Small Server or Cloud Instance
Deep learning models have become increasingly popular and influential in recent years, but they also come with challenges. One of the main challenges is their high computational cost and memory requirement, which can limit their applicability in low-resource scenarios, such as running on a small server or a cloud instance with limited resources. In this article, we will deploy a BERT base model on a single-core instance with only 1 GB of RAM. By the end, you will better understand how to run a faster and more efficient deep learning model on a small server or cloud instance. I assume you have basic MLOps (Machine Learning Operations) knowledge, such as building a machine learning pipeline and serving it with TensorFlow Serving.
Step 1. Preparation
To begin with, you have to install these packages:
transformers
torch
tensorflow
Also make sure Docker is installed on your device.
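If you want a quick sanity check that everything installed correctly, a short, optional snippet like this prints the installed versions:
# Optional sanity check: confirm the packages import and print their versions
import tensorflow as tf
import torch
import transformers
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("tensorflow:", tf.__version__)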
Step 2. Save model in TensorFlow SavedModel format
Next, we download the model, convert it to TensorFlow if it is a PyTorch checkpoint, and save it in the SavedModel format. SavedModel is a format for serializing and deserializing TensorFlow models: it contains a complete description of the model architecture, weights, hyperparameters, signatures, and assets. SavedModel allows models to be used across different platforms and frameworks, such as TensorFlow Serving, TensorFlow Lite, TensorFlow.js, and TensorFlow Hub. In this article, we will use nateraw/bert-base-uncased-imdb, a PyTorch sentiment-classification model, but you can use any model trained on any task. Because it is a PyTorch checkpoint, we have to set the parameter "from_pt" to True to convert the model from PyTorch to TensorFlow.
from transformers import TFBertForSequenceClassification
# nateraw/bert-base-uncased-imdb is a sequence-classification model, so we load it with the matching TF head
model = TFBertForSequenceClassification.from_pretrained("nateraw/bert-base-uncased-imdb", from_pt=True)
# saved_model=True also exports a TensorFlow SavedModel under model/saved_model/1
model.save_pretrained("model", saved_model=True)
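If you want to double-check what was exported, you can load the SavedModel back and inspect its serving signature. This is only a quick sketch, assuming the default export path model/saved_model/1 that save_pretrained uses:
import tensorflow as tf
# Load the exported SavedModel and print the inputs and outputs of its serving signature
loaded = tf.saved_model.load("model/saved_model/1")
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)
print(serving_fn.structured_outputs)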
Step 3. Creating the Docker Image for serving the model
In this part, instead of using the standard TensorFlow Serving image, we will use Intel Optimized TensorFlow Serving. In my experience, deploying with Intel Optimized TensorFlow Serving increases my BERT model's performance by 1.7-2x, on top of the gain that TensorFlow Serving already provides over serving the model with Flask. The Intel® Optimization of TensorFlow Serving leverages the Intel® oneAPI Deep Neural Network Library (Intel® oneDNN) to perform inference significantly faster than the default installation on Intel® processors.
Building the Docker image for serving with Intel Optimized TensorFlow Serving is the same as with standard TensorFlow Serving. We only change the base image in our Dockerfile to "intel/intel-optimized-tensorflow-serving:latest" and add several configurations that you can customize.
FROM intel/intel-optimized-tensorflow-serving:latest
# Expose ports
# gRPC
EXPOSE 8500
# REST
EXPOSE 8501
# The only required piece is the model name in order to differentiate endpoints
ENV MODEL_NAME=bert
# Set where models should be stored in the container
ENV MODEL_BASE_PATH=/models
# copy the model to the working directory
COPY ./model/saved_model ${MODEL_BASE_PATH}/${MODEL_NAME}
# Setting MKL environment variables can improve performance.
# https://www.tensorflow.org/guide/performance/overview
# Read about Tuning MKL for the best performance
# Add export MKLDNN_VERBOSE=1 to the below script,
# to see MKL messages in the docker logs when you send predict request.
ENV OMP_NUM_THREADS=2
ENV KMP_BLOCKTIME=1
ENV KMP_SETTINGS=1
ENV KMP_AFFINITY='granularity=fine,verbose,compact,1,0'
ENV MKLDNN_VERBOSE=1
# No. of physical cores
ENV TENSORFLOW_INTRA_OP_PARALLELISM=2
# No. of sockets
ENV TENSORFLOW_INTER_OP_PARALLELISM=2
ENV PORT=8501
# Create a script that runs the model server so we can use environment variable
# while also passing in arguments from the docker command line
RUN echo '#!/bin/bash \n\n\
tensorflow_model_server --port=8500 --rest_api_port=${PORT} \
--tensorflow_intra_op_parallelism=${TENSORFLOW_INTRA_OP_PARALLELISM} \
--tensorflow_inter_op_parallelism=${TENSORFLOW_INTER_OP_PARALLELISM} \
--model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} \
"$@"' > /usr/bin/tf_serving_entrypoint.sh \
&& chmod +x /usr/bin/tf_serving_entrypoint.sh
ENTRYPOINT ["/usr/bin/tf_serving_entrypoint.sh"]s
Then build the docker image as usual with this command:
docker build -t <image name> .
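Before deploying, it is worth running the image locally and checking that the model loads, for example with docker run -d -p 8501:8501 <image name>. Once the container is up, TensorFlow Serving's model status endpoint should report the version as AVAILABLE; a minimal check could look like this:
import requests
# Query TensorFlow Serving's model status endpoint for the "bert" model
# (assumes the container is running locally with port 8501 published)
r = requests.get("http://localhost:8501/v1/models/bert")
print(r.json())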
Step 4. Deploy
In this article, I use DigitalOcean's App Platform with 1 vCPU and 1 GB of RAM, but you can deploy the image on any other platform. To make it easier to update and redeploy, I first push the image to the DigitalOcean Container Registry:
c:\Users\Lenovo> doctl auth init
c:\Users\Lenovo> doctl registry create <your-registry-name>
c:\Users\Lenovo> doctl registry login
c:\Users\Lenovo> docker tag <your image name> registry.digitalocean.com/<your-registry-name>/<your image name>
c:\Users\Lenovo> docker push registry.digitalocean.com/<your-registry-name>/<your image name>
For more detailed steps on this part, you can check this guide: Build and Deploy Your First Image to Your First Cluster :: DigitalOcean Documentation
In the DigitalOcean dashboard, you can create a new App Platform app, select "DigitalOcean Container Registry" as your source, and select your image name.
Then set any other configurations you need. You can leave the port at 8080 for easier access, or change it to anything you want.
Step 5. Query the Model through the REST API
To query your model, you have to tokenize your input, format it as JSON, and send it to <your model API URL>/v1/models/bert:predict.
from transformers import BertTokenizerFast, BertConfig
import requests
import json
import numpy as np
sentence = "I love the new TensorFlow update in transformers."
# Load the corresponding tokenizer of our SavedModel
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
# Load the model config of our SavedModel
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")
# Tokenize the sentence
batch = tokenizer(sentence)
# Convert the batch into a proper dict
batch = dict(batch)
# Put the example into a list of size 1, that corresponds to the batch size
batch = [batch]
# The REST API needs a JSON that contains the key instances to declare the examples to process
input_data = {"instances": batch}
# Query the REST API; the path follows http://host:port/v1/models/<model name>:predict
# (replace localhost:8501 with your deployed model's URL if you are not running locally)
r = requests.post("http://localhost:8501/v1/models/bert:predict", data=json.dumps(input_data))
# Parse the JSON result. The predictions are contained in a list under the root key "predictions";
# since there is only one example, take the first element of the list
result = json.loads(r.text)["predictions"][0]
# The returned values are the raw logits for each label, so the argmax gives the index of the predicted label
label_id = int(np.argmax(result))
# Print the proper LABEL with its index
print(config.id2label[label_id])
You can improve the performance further by including the preprocessing step in your TensorFlow Extended (TFX) pipeline and by using gRPC instead of REST, as sketched below.
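As an illustration of the gRPC path, here is a minimal sketch. It assumes the grpcio and tensorflow-serving-api packages are installed, that the gRPC port 8500 is reachable, and that the SavedModel's serving signature takes the tokenizer's int32 inputs and returns an output tensor named "logits" (check your own signature, e.g. with the inspection snippet from Step 2, and adjust the names accordingly):
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
from transformers import BertTokenizerFast, BertConfig

tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")
# Tokenize and cast to int32, which is what the exported serving signature is assumed to expect
batch = tokenizer("I love the new TensorFlow update in transformers.", return_tensors="np")
inputs = {name: np.asarray(value, dtype=np.int32) for name, value in batch.items()}
# Open an insecure channel to the gRPC port (8500) and build a Predict request
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = "bert"
request.model_spec.signature_name = "serving_default"
for name, value in inputs.items():
    request.inputs[name].CopyFrom(tf.make_tensor_proto(value))
# Send the request and read back the logits (the output name may differ; check your signature)
response = stub.Predict(request, 10.0)
logits = np.array(response.outputs["logits"].float_val)
print(config.id2label[int(np.argmax(logits))])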
This article discussed some techniques and strategies for optimizing a BERT model for low-resource scenarios. With only a few small changes, we roughly doubled the model's inference performance. I hope it helps you leverage the power of BERT for your natural language processing tasks without compromising on performance or cost.