Explore the Power of Task-Specific Transformer Models with Amazon SageMaker and Hugging Face
Gary Stafford
Principal Solutions Architect @AWS | Data Analytics and Generative AI Specialist | Experienced Technology Leader, Consultant, CTO, COO, President | 10x AWS Certified
Explore the efficiencies of specialized transformer models available from Hugging Face to accomplish specific NLP tasks in Amazon SageMaker using real-time and batch inference
In this post, we’ll dive into the capabilities of specialized, task-specific transformer models and how they can tackle common natural language processing (NLP) and computer vision (CV) challenges. Using the power of Amazon SageMaker, we’ll perform real-time and batch inference with transformer models deployed from Hugging Face. Moreover, we’ll examine the advantages of hosting task-specific transformer models on Amazon SageMaker, and compare them to alternative solutions, including fully managed AI and Generative AI services offered on AWS.
For this post, the term task-specific describes specialized transformer models that are optimized to accomplish specific tasks, such as audio classification, speech emotion recognition (SER), monocular depth estimation, machine language translation, Segment Anything Model (SAM) for image masking, programming code completion, or one of my favorites, wine quality tabular classification.
Source Code
The source code used in this post’s demonstration is open-sourced and available on GitHub. I suggest starting with the project’s Jupyter Notebook, which contains all the examples in the post.
Transformers
According to NVIDIA, “A transformer model is a neural network that learns context and thus meaning by tracking relationships in sequential data like the words in this sentence. Transformer models apply an evolving set of mathematical techniques, called attention or self-attention, to detect subtle ways even distant data elements in a series influence and depend on each other.”
According to Wikipedia, transformer models have had great success with natural language processing (NLP) and computer vision (CV) tasks.
Transformer models are often categorized by the modality and tasks they are designed to handle, such as text, vision, audio, video, and multimodal.
Getting Started with Hugging Face and SageMaker
For this demonstration, we will use an open-source transformer model available on Hugging Face, the well-known “platform where the machine learning community collaborates on models, datasets, and applications.”
We will host the Hugging Face transformer model for inference on Amazon SageMaker, which allows users to “build, train, and deploy machine learning models for any use case with fully managed infrastructure, tools, and workflows.” Hugging Face has direct API integrations with Amazon SageMaker.
Machine Translations
For this demo, we will use the Helsinki-NLP/opus-mt-en-zh model. The Language Technology Research Group at the University of Helsinki developed the transformer model to perform English-to-Chinese machine translation. The research group has published over 1,440 models and eight datasets on Hugging Face. The opus-mt-en-zh model had over 85k downloads in February 2024, while its counterpart, the opus-mt-zh-en Chinese-to-English model, had over 2.7M downloads in the same month!
Model Size
Based on the size of the pytorch_model.bin file, the opus-mt-en-zh model is 312 MB. Comparatively, typical large language models (LLMs) can range from tens to hundreds of GBs. For example, the 175-billion-parameter GPT-3 requires 350 GB of disk space at 2 bytes/parameter.
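As a quick back-of-the-envelope check on those figures, the on-disk size of a dense model is roughly its parameter count multiplied by the bytes per parameter. The short sketch below reproduces the numbers above; the ~78M parameter count for opus-mt-en-zh is my own approximation, back-solved from the 312 MB file size, not an official figure.
# rough on-disk size estimate: parameter count x bytes per parameter
def model_size_gb(parameters: float, bytes_per_param: int) -> float:
    return parameters * bytes_per_param / 1e9  # decimal GB

# GPT-3: 175 billion parameters at 2 bytes/parameter ~= 350 GB
print(f"GPT-3 (2 bytes/param):         ~{model_size_gb(175e9, 2):,.0f} GB")

# opus-mt-en-zh: ~78M parameters at 4 bytes/parameter ~= 0.31 GB (312 MB)
print(f"opus-mt-en-zh (4 bytes/param): ~{model_size_gb(78e6, 4):.2f} GB")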
Alternative to Open Source Task-Specific Models
Leading model builders like AI21 Labs offer task-specific models for distinct use cases as an alternative to open-source models. AI21’s models specialize in paraphrasing, grammatical error correction (GEC), text improvement, summarization, text segmentation, contextual answers, semantic search, and embeddings. According to AI21 Labs, “AI21 Studio’s Task-Specific Models offer a range of powerful tools. These models have been specifically designed for their respective tasks and provide high-quality results while optimizing efficiency…As specialized models, each was optimized for a dedicated purpose, making it significantly more efficient than building it from scratch and much more cost-effective.”
Fully-managed Alternatives on AWS
Amazon Translate
Alternatively, you may choose fully-managed AI services, such as Amazon Translate, instead of deploying the transformer model on Amazon SageMaker for machine translation. AWS describes Amazon Translate as “a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation.” Amazon Translate provides both real-time and batch translation capabilities.
Amazon Bedrock
Another option is using a general-purpose text-based Generative AI foundation model with Amazon Bedrock, described as the “easiest way to build and scale generative AI applications with foundation models.” Amazon Bedrock offers real-time inference and, most recently, batch inference, allowing you to run multiple inference requests asynchronously.
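For reference, the same kind of request can be made programmatically through the Bedrock Runtime API rather than the console. Below is a minimal sketch using the Anthropic Claude 3 Haiku Messages format; the model ID and prompt are illustrative assumptions, and the model must already be enabled in your account and Region.
import json

import boto3

# assumes Claude 3 Haiku access is enabled in this account/Region
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

client_bedrock = boto3.client("bedrock-runtime")

prompt = (
    "Translate the following English text to Chinese: "
    "A heart filled with anger has no room for love."
)

response = client_bedrock.invoke_model(
    modelId=MODEL_ID,
    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": prompt}],
        }
    ),
    contentType="application/json",
    accept="application/json",
)

# the translated text is returned in the first content block
response_body = json.loads(response["body"].read())
print(response_body["content"][0]["text"])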
The challenge is finding a general-purpose generative text model that can consistently and accurately accomplish your specific NLP or CV task, such as translating English to Chinese. Below is an example of using the latest Anthropic Claude 3 Haiku model in the Amazon Bedrock Text Playground. The translation was successful, but the quality of the results was mixed.
Below is another example using Mistral AI’s Mixtral 8x7B Instruct model. The machine translation results were less accurate than those of other foundation models tested. Further, note the extraneous translations into additional languages that were not requested, including German, Italian, Japanese, Korean, Russian, and Spanish; this adds cost and time.
A second attempt with Mistral AI’s Mixtral 8x7B Instruct model at a lower temperature also gave bizarre results, translating my request but also providing several more translations of other English texts. Again, this inconsistency and extra output add cost and time.
These examples demonstrate the trade-offs of relying on a general-purpose foundation model for specialized tasks. The pros and cons of one method or model over another come down to a few important considerations.
Fully managed AI services often excel at ease of use and lower cost for tasks with lower volumes (smaller datasets). However, you may find that with very high volumes (large datasets), deploying task-specific models is more cost-efficient and provides more flexibility at scale than fully managed services. Select the right tool for the job.
Service Quotas
Before starting the demonstration, based on your budget, ensure you have 1–2 instances available for real-time inference and batch transforms. In this post, I have arbitrarily used a mix of ml.p3.2xlarge, ml.g5.12xlarge, ml.g4dn.2xlarge, and ml.g4dn.8xlarge GPU-based instances for inference. You can use Service Quotas in the AWS Management Console to check your available instance types and request additional instances if necessary.
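If you prefer to check quotas from the notebook rather than the console, the Service Quotas API can be queried with boto3. The sketch below lists SageMaker quotas whose names mention the instance types used in this post; matching quota names by substring is an assumption about how the quotas are labeled, so verify the results against the console.
import boto3

client_sq = boto3.client("service-quotas")

# instance types used for inference in this post
instance_types = ["ml.g5.12xlarge", "ml.g4dn.2xlarge", "ml.g4dn.8xlarge"]

# page through all SageMaker quotas and print the ones that mention those types
paginator = client_sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if any(it in quota["QuotaName"] for it in instance_types):
            print(f'{quota["QuotaName"]}: {quota["Value"]:.0f}')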
Dataset
We will use the Quotes - 500k dataset, available on Kaggle, for the demonstration. The dataset contains 500,000 quotes from well-known authors. There is no need to download the entire dataset. Due to its poor quality, the data required considerable cleansing to ensure we could perform inference without issues. I have included a clean set of 10,000 quotes for the demonstration in the GitHub project.
Amazon SageMaker Studio
All code used for this demo is contained in a Jupyter Notebook, built and managed in Amazon SageMaker Studio, the latest web-based experience for running ML workflows on AWS. According to AWS, Studio offers a suite of integrated development environments (IDEs), including Code Editor (based on Code-OSS, Visual Studio Code - Open Source), a new JupyterLab application, RStudio, and Amazon SageMaker Studio Classic.
Getting Started
Using the supplied Jupyter Notebook, first install or update the required Python packages for your environment.
%%sh
python3 -m pip install sagemaker boto3 botocore jsonlines -Uq
Next, deploy the Hugging Face-based transformer model as an Amazon SageMaker real-time inference endpoint. AWS states, “Real-time inference is ideal for inference workloads with real-time, interactive, low-latency requirements. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference.”
Given the close integration of Hugging Face with SageMaker, we can pull a copy of the model artifacts and deploy them to a real-time inference endpoint with only a few lines of boilerplate code using the HuggingFaceModel class’s deploy() method. Below is an example of deploying the model to a single ml.g5.12xlarge instance for real-time inference.
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel
# hugging face model
HF_MODEL_ID = "Helsinki-NLP/opus-mt-en-zh"
try:
    role = sagemaker.get_execution_role()
except ValueError:
    client_iam = boto3.client("iam")
    role = client_iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# hub model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": HF_MODEL_ID,
    "HF_TASK": "translation",
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version="4.37.0",
    pytorch_version="2.1.0",
    py_version="py310",
    env=hub,
    role=role,
)

# deploy the model to a single instance
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type="ml.g5.12xlarge",  # ec2 instance type
)
Once complete, the model’s real-time inference endpoint appears in the SageMaker Studio Deployments > Endpoints tab. IMPORTANT: this endpoint will persist, and you will continue to pay for it until you delete it.
You will need the name of the model endpoint to perform inference. The name, for example, huggingface-pytorch-inference-2024-03-18-00-22-55-664, can be obtained in the Endpoints > Details tab (shown above) or from within the notebook by running the following command:
# output contains endpoint name
predictor.endpoint_context()
Using the predict() method, we can test the deployed model’s real-time inference endpoint:
predictor.predict(
    {
        "inputs": "A heart filled with anger has no room for love.",
    }
)
You should see results similar to the following:
[{'translation_text': '充满愤怒的心没有爱的空间'}]
You can validate the accuracy of the translation results using several methods, including Google Translate. It is important to note that this post is focused on how to use the models, not on the choice of models or their performance. Lacking proficiency in the Chinese language, I cannot recommend this model over other similar models for machine translation.
To reuse the existing real-time endpoint for inference, we can create a HuggingFacePredictor from the endpoint name and call its predict() method. You should get the same results as the previous inference method.
from sagemaker.huggingface.model import HuggingFacePredictor
SAGEMAKER_ENDPOINT = "<your_endpoint_name>"
session = sagemaker.session.Session()
predictor = HuggingFacePredictor(
    endpoint_name=SAGEMAKER_ENDPOINT, sagemaker_session=session
)

predictor.predict(
    {
        "inputs": "A heart filled with anger has no room for love",
        "parameters": {"max_length": 1024, "min_length": 1},
    }
)
Using SageMaker Runtime API for Real-time Inference
As an alternative to the HuggingFacePredictor class, we can use the Amazon SageMaker Runtime API, calling the SageMakerRuntime.Client class’s invoke_endpoint() method to perform real-time inference.
import boto3
client_smr = boto3.client("sagemaker-runtime")
# reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint.html
response = client_smr.invoke_endpoint(
    EndpointName=SAGEMAKER_ENDPOINT,
    Body=bytes(
        '{"inputs": "A heart filled with anger has no room for love."}',
        "utf-8",
    ),
    ContentType="application/json",
)
# decodes and prints the response body:
print(response["Body"].read().decode("utf-8"))
You should get the same results as the previous two inference methods.
[{"translation_text":"充满愤怒的心没有爱的空间"}]
Bulk Inference with Real-time Inference Endpoint
A real-time inference endpoint is suitable for one-off translations or for exposing translation as part of a customer-facing application. Although not optimal, it can also be used for low-volume bulk translations. To demonstrate this, I imported the cleansed dataset from the project’s CSV file and stored subsets of quotes in a series of Python lists.
import csv

# load the cleansed quotes from the project's CSV file
with open("./_prelims/quotes_10k_clean.csv", "r") as file:
    data = list(csv.reader(file, delimiter=","))

quotes = []
for row in data:
    # skip longer quotes ("model_max_length": 512 tokens)
    if len(row[0]) > 1024:
        continue
    quotes.append(row)

# create lists of varying lengths of quotes for testing
quotes_10 = [column[0] for column in quotes[1:11]]
quotes_100 = [column[0] for column in quotes[1:101]]
quotes_1k = [column[0] for column in quotes[1:1001]]
quotes_10k = [column[0] for column in quotes[1:10001]]
You can then translate each quote by iterating over the list, calling the real-time endpoint, and writing the results back to an in-memory Python list of dictionaries, which could later be written to Amazon S3. Below, we see an example of iterating over a list of 1,000 quotes.
import json

import boto3

client_smr = boto3.client("sagemaker-runtime")

translations = []  # holds the translations

for idx, quote in enumerate(quotes_1k):
    try:
        # serialize the request body (handles quoting/escaping safely)
        payload = json.dumps({"inputs": quote})
        response = client_smr.invoke_endpoint(
            EndpointName=SAGEMAKER_ENDPOINT,
            Body=bytes(payload, "utf-8"),
            ContentType="application/json",
        )
        response_str = response["Body"].read().decode("utf-8")
        # parse the JSON response rather than eval()'ing it
        response_dict = json.loads(response_str)
        translation_text = response_dict[0]["translation_text"]
        translations.append({"input": quote, "output": translation_text})
    except client_smr.exceptions.ModelError as e:
        print(e)
    print(f"Translating quote: {idx}/1000", end="\r")
CPU times: user 3.04 s, sys: 222 ms, total: 3.26 s
Wall time: 6min 20s
The output will look similar to the following:
[
{
"input": "A friend is someone who knows all about you and still loves you.",
"output": "朋友是了解你的一切 仍然爱你的人"
},
{
"input": "It is better to be hated for what you are than to be loved for what you are not.",
"output": "更好地被憎恨你是什么 比被爱 不是你是什么。"
},
{
"input": "Love all, trust a few, do wrong to none.",
"output": "爱,信任少数人,对任何人做错事"
},
{
"input": "You love me. Real or not real? I tell him, Real.",
"output": "你爱我,真的还是不是真的?"
},
{
"input": "Love is like the wind, you can't see it but you can feel it.",
"output": "爱就像风,你看不到它,但你能感觉到它。"
}
]
Inference Results
In my tests, using (1) ml.g5.2xlarge instance with (1) NVIDIA A10G GPU and 24 GiB of GPU memory, 1,000 translations took an average of 6min 20s, or about 2.63 transactions/second (0.38 s/t). Using (1) ml.g5.12xlarge instance with (4) NVIDIA A10G GPUs and 96 GiB of GPU memory, 1,000 translations took an average of 4min 25s, or about 3.77 transactions/second (0.265 s/t). Not bad for a series of sequential inference endpoint invocations with no parallelization. Further evaluation could be done to optimize the model’s performance while maintaining or reducing the inference costs. This does not include I/O time for Amazon S3 to store translation results.
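One straightforward optimization, which I did not apply in these tests, is issuing the endpoint invocations concurrently instead of sequentially. Below is a minimal sketch using Python's ThreadPoolExecutor against the same endpoint and quote list; the worker count of 8 is an arbitrary assumption you would want to tune against the endpoint's capacity.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

client_smr = boto3.client("sagemaker-runtime")

def translate(quote: str) -> dict:
    # invoke the real-time endpoint for a single quote
    response = client_smr.invoke_endpoint(
        EndpointName=SAGEMAKER_ENDPOINT,
        Body=json.dumps({"inputs": quote}).encode("utf-8"),
        ContentType="application/json",
    )
    result = json.loads(response["Body"].read().decode("utf-8"))
    return {"input": quote, "output": result[0]["translation_text"]}

# issue up to 8 invocations at a time (arbitrary worker count)
with ThreadPoolExecutor(max_workers=8) as executor:
    translations = list(executor.map(translate, quotes_1k))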
Amazon SageMaker Batch Transform
In contrast to real-time inference, batch transform lets us get inferences from large datasets without the need for a persistent endpoint. With batch transform, SageMaker handles initializing compute instances and distributing the inference workload between them.
I have written the quotes to JSON Lines format files in order to prepare the data for batch transform. I found JSON Lines easier to work with than CSV for batch transform jobs using the quotes.
{"input": "A friend is someone who knows all about you and still loves you."}
{"input": "It is better to be hated for what you are than to be loved for what you are not."}
{"inputs": "Love all, trust a few, do wrong to none."}
{"inputs": "You love me. Real or not real? I tell him, Real."}
{"inputs": "Love is like the wind, you can't see it but you can feel it."}
Since SageMaker’s batch transform can distribute the inference workloads across compute instances, I broke the list of 10,000 quotes into four JSON Lines files, each containing 2,500 quotes. You don’t need to create the files yourself (as demonstrated below); they are part of the GitHub project.
import jsonlines

filename = "./10k_quotes/quotes_10k_1.jsonl"

items = []
for quote in quotes_10k[0:2500]:
    items.append({"inputs": quote})

with jsonlines.open(filename, "w") as writer:
    writer.write_all(items)
To prepare for batch transforms, copy the JSON Lines files from your local copy of the GitHub project into your Amazon S3 bucket.
from sagemaker.s3 import S3Uploader, s3_path_join
files = [
"quotes/quotes_10.jsonl",
"quotes/quotes_100.jsonl",
"quotes/quotes_1k.jsonl",
"quotes/quotes_10k.jsonl",
]
for file in files:
    input_s3_path = s3_path_join("s3://", S3_BUCKET, "input_batch", "quotes")
    s3_file_uri = S3Uploader.upload(file, input_s3_path)
    print(f"{file} uploaded to {s3_file_uri}")
files = [
"10k_quotes/quotes_10k_1.jsonl",
"10k_quotes/quotes_10k_2.jsonl",
"10k_quotes/quotes_10k_3.jsonl",
"10k_quotes/quotes_10k_4.jsonl",
]
for file in files:
    input_s3_path = s3_path_join("s3://", S3_BUCKET, "input_batch", "10k_quotes")
    s3_file_uri = S3Uploader.upload(file, input_s3_path)
    print(f"{file} uploaded to {s3_file_uri}")
Next, start a batch transform job. Hugging Face has documentation on using this method for batch transform jobs. The following batch transform job will use (2) ml.g4dn.8xlarge instances to process the (4) JSON Lines files, each containing 2,500 quotes.
%%time
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel
try:
    role = sagemaker.get_execution_role()
except ValueError:
    client_iam = boto3.client("iam")
    role = client_iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# hub model configuration
hub = {
    "HF_MODEL_ID": HF_MODEL_ID,
    "HF_TASK": "translation",
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version="4.37.0",
    pytorch_version="2.1.0",
    py_version="py310",
    env=hub,
    role=role,
)

output_s3_path = f"s3://{S3_BUCKET}/output_batch"
s3_data_input = f"s3://{S3_BUCKET}/input_batch/10k_quotes/"

# starts batch transform job and uses S3 data as input
batch_job = huggingface_model.transformer(
    accept="application/json",
    assemble_with="Line",
    instance_count=2,
    instance_type="ml.g4dn.8xlarge",
    output_path=output_s3_path,
    strategy="SingleRecord",
)

batch_job.transform(
    content_type="application/json",
    data=s3_data_input,
    split_type="Line",
    logs=False,
)
Batch Transform Results
Using (2) ml.g4dn.8xlarge instances, costing $2.72/hr., to process (4) JSON Lines files, each containing 2,500 quotes, for a total of 10k quotes, with a SingleRecord strategy, the ~34-minute job achieved 5.55 translations/second. This instance type uses (1) NVIDIA T4 GPU with 16 GB of GPU memory.
Using (4) smaller ml.g4dn.2xlarge instances, costing just $0.94/hr., to process (8) JSON Lines files, each containing 2,500 quotes, for a total of 20k quotes, the ~32-minute job achieved 10.41 translations/second. This instance type also uses (1) NVIDIA T4 GPU with 16 GB of GPU memory but with one-quarter of the vCPUs and memory of the ml.g4dn.8xlarge instance.
Lastly, again using (4) smaller ml.g4dn.2xlarge instances to process (20) JSON Lines files, each containing 2,500 quotes, for a total of 50k quotes, the ~73-minute job achieved nearly identical results of 10.07 translations/second.
The batch transform jobs write the JSON Lines output to the same Amazon S3 bucket. One JSON Lines output file will be created for each input file.
The translation results should look as follows:
[{"translation_text":"朋友是了解你的一切 仍然爱你的人"}]
[{"translation_text":"我们接受我们认为我们应得的爱。"}]
[{"translation_text":"爱,信任少数人,对任何人做错事"}]
[{"translation_text":"被某人深爱 给了你力量 而爱一个人深爱 给了你勇气"}]
[{"translation_text":"爱就像风,你看不到它,但你能感觉到它。"}]
As an alternative to using the HuggingFaceModel class, we can use the lower-level Amazon SageMaker API, calling the SageMaker.Client class’s create_transform_job() method to perform a batch transform. The parameters are nearly identical between the two methods.
import sagemaker
import boto3
import time
session = sagemaker.session.Session()
client_sm = boto3.client("sagemaker")
try:
    role = sagemaker.get_execution_role()
except ValueError:
    client_iam = boto3.client("iam")
    role = client_iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

output_s3_path = f"s3://{S3_BUCKET}/output_batch"
s3_data_input = f"s3://{S3_BUCKET}/input_batch/10k_quotes/"

model_name = "<your_deployed_model_name>"
batch_job_name = f"quotes-batch-{int(time.time())}-10k"

# launch batch transform job
response = client_sm.create_transform_job(
    TransformJobName=batch_job_name,
    ModelName=model_name,
    BatchStrategy="SingleRecord",
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_data_input,
            }
        },
        "ContentType": "application/json",
        "SplitType": "Line",
    },
    TransformOutput={
        "S3OutputPath": output_s3_path,
        "AssembleWith": "Line",
        "Accept": "application/json",
    },
    TransformResources={
        "InstanceType": "ml.g4dn.8xlarge",
        "InstanceCount": 2,
    },
)

print(response["TransformJobArn"])
The model_name variable is the name of the model that will be used for the batch transform job. You can use the model you deployed to the real-time inference endpoint earlier in the demonstration. Models can be found in the SageMaker console under the Inference > Models tab.
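If you would rather not open the console, a quick way to find the model name from the notebook is the SageMaker API's list_models() call; a minimal sketch:
import boto3

client_sm = boto3.client("sagemaker")

# list the most recently created SageMaker models
for model in client_sm.list_models(SortBy="CreationTime", SortOrder="Descending")["Models"]:
    print(model["ModelName"])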
To track the progress of your batch transform job, you can use this helpful polling routine, adapted from the pytorch_flores_batch_transform example in the Amazon SageMaker samples on GitHub:
%%time
while True:
    response = client_sm.describe_transform_job(
        TransformJobName=batch_job_name
    )
    status = response["TransformJobStatus"]
    if status == "Completed":
        print(f"Transform job ended with status: {status}")
        break
    if status == "Failed":
        message = response["FailureReason"]
        print("Transform failed with the following error: {}".format(message))
        raise Exception("Transform job failed")
    print(f"Transform job is still in status: {status}...", end="\r")
    time.sleep(30)
Below, using Amazon CloudWatch, you can observe the GPU and GPU memory utilization for the (2) ml.g4dn.8xlarge instances across the run time of the 10,000-quote batch transform job.
Here are similar GPU metrics for a batch transform job using (4) ml.g4dn.2xlarge instances across the run time of the 50,000-quote batch transform job. These metrics can be used to tune the batch size, file count, and instance type and count for both performance and cost.
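The same numbers can also be pulled programmatically if you want to compare runs side by side. Below is a minimal sketch that queries average GPUUtilization for a transform job from the /aws/sagemaker/TransformJobs namespace; the Host dimension value follows a <job-name>/algo-<n> pattern, which is an assumption you should confirm against the dimensions shown in your CloudWatch console.
from datetime import datetime, timedelta, timezone

import boto3

client_cw = boto3.client("cloudwatch")

# assumes the Host dimension is "<transform-job-name>/algo-1"
response = client_cw.get_metric_statistics(
    Namespace="/aws/sagemaker/TransformJobs",
    MetricName="GPUUtilization",
    Dimensions=[{"Name": "Host", "Value": f"{batch_job_name}/algo-1"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=2),
    EndTime=datetime.now(timezone.utc),
    Period=300,  # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')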
Delete Real-time Inference Endpoint
Important: Don't forget to delete your real-time inference endpoint(s) or you will continue to be charged hourly for each instance. Optionally, you can delete the associated endpoint configuration(s) and the model(s).
import boto3
sagemaker_client = boto3.client("sagemaker")
# delete endpoint
sagemaker_client.delete_endpoint(EndpointName=SAGEMAKER_ENDPOINT)
# delete endpoint configuration
endpoint_config_name = "<your_endpoint_config_name>"
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
# delete model
model_name = "<model_name>"
sagemaker_client.delete_model(ModelName=model_name)
Conclusion
In this post, we explored how to use specialized, task-specific transformer models to accomplish common NLP tasks. We also learned how to use Amazon SageMaker to deploy Hugging Face transformer models for real-time inference and batch transforms (batch inference). Lastly, we compared this approach with fully managed AI and Generative AI services on AWS.
This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, images, logos, and brands are the property of their respective owners.