TensorFlow Serving API & gRPC

To serve models to production applications, you can expose them through a REST API or gRPC. gRPC is a high-performance, strongly typed RPC framework that sends binary messages over HTTP/2, while REST is a simpler, stateless, text-based style of API that runs over HTTP, usually with JSON or XML payloads.

Here are some differences between gRPC and REST:

  • Protocol: gRPC uses HTTP/2 for transport, while REST typically uses HTTP/1.1.
  • Data format: gRPC employs Protocol Buffers for serialisation, while REST usually leverages JSON or XML.
  • API design: gRPC is based on the RPC (Remote Procedure Call) paradigm, while REST follows the architectural constraints of the Representational State Transfer model.
  • Streaming: gRPC supports bidirectional streaming, whereas REST is limited to request-response communication patterns.
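
To make the data-format point concrete, here is a minimal sketch that compares the size of the same 28x28 input encoded as JSON text and as packed 32-bit floats. The struct packing is only a stand-in for a binary encoding such as Protocol Buffers, not the actual gRPC wire format, and the pixel values are made up.

```python
import json
import struct

# Made-up 28x28 "image" with float pixel values in [0, 1).
pixels = [i / 784 for i in range(28 * 28)]

# Text encoding, as a REST/JSON client would send it.
json_bytes = json.dumps({"instances": [pixels]}).encode("utf-8")

# Binary encoding: 4 bytes per value (stand-in for a binary tensor payload).
binary_bytes = struct.pack(f"<{len(pixels)}f", *pixels)

print(f"JSON:   {len(json_bytes)} bytes")
print(f"Binary: {len(binary_bytes)} bytes")
```

The exact ratio depends on how many digits each value needs as text, so treat the numbers as an illustration rather than a benchmark.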

TensorFlow Serving sits in front of the ML model: it receives client requests and returns the model's responses from the back end. It is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes deploying new algorithms and experiments easy while keeping the same server architecture and APIs. It provides out-of-the-box integration with TensorFlow models but can be extended to serve other types of models. More information can be found in the official TensorFlow Serving documentation.

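The TensorFlow Serving client API for Python can be installed on Ubuntu by using pip. Note that this package only provides the client-side modules; the "tensorflow_model_server" binary used later is installed separately, for example from the TensorFlow Serving APT repository or the Docker image: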

pip install tensorflow-serving-api        

Build & Save a model

Let's assume that we trained an ML model like the following:


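from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28, 1]),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
# Assumed compile settings (not shown in the original); adjust as needed
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))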

and we make some predictions:

np.round(model.predict(X_new), 2)        

And the prediction response is the following:

Now we can save the model:

import os
import tensorflow as tf

# Generate the model version
model_version = "0001"
model_name = "my_mnist_model"
model_path = os.path.join(model_name, model_version)

# Save the model in the SavedModel format
tf.saved_model.save(model, model_path)
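
TensorFlow Serving expects each model version in its own numeric subdirectory under the model's base path and, by default, serves the highest version number. A minimal sketch of that lookup follows; "latest_version" is a hypothetical helper, and the directories are empty placeholders in a temporary folder rather than a real SavedModel.

```python
import os
import tempfile

def latest_version(base_path):
    """Return the name of the highest numeric version directory, or None."""
    versions = [
        name for name in os.listdir(base_path)
        if name.isdigit() and os.path.isdir(os.path.join(base_path, name))
    ]
    # Compare numerically so that "0010" ranks above "0002".
    return max(versions, key=int) if versions else None

# Demo with empty placeholder directories (no real SavedModel inside).
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "my_mnist_model")
    for version in ("0001", "0002", "0010"):
        os.makedirs(os.path.join(base, version))
    os.makedirs(os.path.join(base, "notes"))  # non-numeric name, ignored
    newest = latest_version(base)

print(newest)
```

Keeping zero-padded names such as "0001" makes the directories sort naturally in a file listing, while the comparison itself is done on the integer value.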

Saving the model creates a directory with the necessary files. To manage that directory we can use "shutil". Shutil stands for "shell utilities" and provides a comprehensive set of functions for file and directory operations. Whether you need to copy, move, rename, or delete files and directories, the shutil module offers user-friendly and efficient functions for the job. Listing the directory tree itself only needs os.walk from the os module.

First, we will need to import the module:

import shutil
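
As a small illustration of where shutil helps, the sketch below copies one model version to a new version number and then removes the old one. The directory and file names are placeholders created in a temporary folder, not the real exported model.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "my_mnist_model")
    v1 = os.path.join(base, "0001")
    os.makedirs(v1)
    with open(os.path.join(v1, "saved_model.pb"), "wb") as f:
        f.write(b"placeholder")  # stand-in for a real SavedModel file

    # Copy version 0001 to 0002, including every file inside it.
    v2 = os.path.join(base, "0002")
    shutil.copytree(v1, v2)
    versions_after_copy = sorted(os.listdir(base))

    # Remove the old version once it is no longer needed.
    shutil.rmtree(v1)
    versions_after_cleanup = sorted(os.listdir(base))

print(versions_after_copy)
print(versions_after_cleanup)
```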


And we can explore the tree of the directory:

for root, dirs, files in os.walk(model_name):
    indent = '    ' * root.count(os.sep)
    print('{}{}/'.format(indent, os.path.basename(root)))
    for filename in files:
        print('{}{}'.format(indent + '    ', filename))        

And the output will be:

And by using "saved_model_cli", we can check the signatures:

!saved_model_cli show --dir {model_path} --tag_set serve        

And the output will be:

Use TensorFlow Serving

The server needs to know the model's base directory (the folder that contains the version subdirectories), so we first store it in an environment variable:

os.environ["MODEL_DIR"] = os.path.split(os.path.abspath(model_path))[0]        

Then we run the server in the background, with gRPC on port 8500 (the default) and the REST API on port 8501:

%%bash --bg
nohup tensorflow_model_server \
     --port=8500 \
     --rest_api_port=8501 \
     --model_name=my_mnist_model \
     --model_base_path="${MODEL_DIR}" >server.log 2>&1

And we can check the listening ports:

!lsof -i -P -n | grep LISTEN        

The output should show ports 8500 (gRPC) and 8501 (REST).
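
The same check can be done from Python. This is a minimal sketch, assuming the server runs on localhost with the gRPC port 8500 and the REST port 8501 mentioned above; "port_is_open" is a hypothetical helper.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in (8500, 8501):
    state = "listening" if port_is_open("localhost", port) else "not reachable"
    print(f"port {port}: {state}")
```

An open port only shows that something is listening; it does not confirm that the model has finished loading.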

Because the REST API accepts JSON, we need to serialise the input data as JSON:

import json

# Build the request body expected by the predict endpoint
input_data_json = json.dumps({
    "signature_name": "serving_default",
    "instances": X_new.tolist(),
})

And now we can use TensorFlow Serving's REST API to make predictions:

import requests

SERVER_URL = 'http://localhost:8501/v1/models/my_mnist_model:predict'  # plain HTTP: the server above is started without TLS
response = requests.post(SERVER_URL, data=input_data_json)
response.raise_for_status() # raise an exception in case of error
response = response.json()

y_proba = np.array(response["predictions"])

and the output will be:
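
The "predictions" field holds one list of class probabilities per input instance. Here is a minimal sketch of turning such a response into class labels. The response body is hand-written with made-up probabilities and only four classes for brevity; it is not real server output.

```python
import json

# Illustrative response body; the probabilities are made up.
response_text = json.dumps({
    "predictions": [
        [0.01, 0.02, 0.90, 0.07],
        [0.60, 0.30, 0.05, 0.05],
    ]
})

predictions = json.loads(response_text)["predictions"]

# The index of the largest probability in each row is the predicted class.
labels = [max(range(len(row)), key=row.__getitem__) for row in predictions]
print(labels)
```

With NumPy available, the same step on the real response would be an argmax over the last axis of y_proba.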


Using the gRPC API requires the prediction module from the serving API. We build a PredictRequest and copy the input data into it as a tensor:

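from tensorflow_serving.apis.predict_pb2 import PredictRequest

request = PredictRequest()
request.model_spec.name = model_name
request.model_spec.signature_name = "serving_default"
input_name = model.input_names[0]
# Attach the input data; without this the request carries no inputs
request.inputs[input_name].CopyFrom(tf.make_tensor_proto(X_new))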

and then we send the request over a gRPC channel:

import grpc
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:8500')
predict_service = prediction_service_pb2_grpc.PredictionServiceStub(channel)
response = predict_service.Predict(request, timeout=10.0)        

The call returns a PredictResponse protocol buffer that contains the model outputs.

We can also convert the output tensor to a NumPy array:

output_name = model.output_names[0]
outputs_proto = response.outputs[output_name]
y_proba = tf.make_ndarray(outputs_proto)

and the output will be:
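
Conceptually, a tensor message carries a shape and a flat buffer of values, and converting it back to an array means reshaping that buffer. The sketch below illustrates the idea with plain Python and struct. It is a simplified stand-in, not the real TensorProto format or the tf.make_ndarray implementation, and the helper names are hypothetical.

```python
import struct

def pack_matrix(rows):
    """Flatten a 2-D list of floats into (shape, little-endian float32 bytes)."""
    shape = (len(rows), len(rows[0]))
    flat = [value for row in rows for value in row]
    return shape, struct.pack(f"<{len(flat)}f", *flat)

def unpack_matrix(shape, content):
    """Rebuild the 2-D list from the shape and the flat byte buffer."""
    n_rows, n_cols = shape
    flat = struct.unpack(f"<{n_rows * n_cols}f", content)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

# Values chosen to be exactly representable as float32.
original = [[0.0, 0.25, 0.75], [0.5, 0.5, 0.0]]
shape, content = pack_matrix(original)
restored = unpack_matrix(shape, content)
print(shape, len(content), restored == original)
```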

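The REST API is the most common choice because it is simple to call from almost any client, while gRPC's binary encoding and HTTP/2 transport tend to suit large payloads and latency-sensitive services better. Consider the advantages and disadvantages of both options before choosing.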

#tensorflowserving #restapi #grpc #machinelearning

