Using NLP with AWS SageMaker
Nick Gupta
Senior ML Engineer @ Amex
Hello, Everyone! Today, we are going to learn how to use Natural Language Processing (NLP) with AWS SageMaker. This tutorial is designed to help you leverage SageMaker's powerful ML capabilities and build a text classification model using NLP.
Prerequisites:
Let's dive in!
Step 1: Setting up your Environment
First, log in to your AWS Management Console and navigate to the SageMaker service.
Create a new notebook instance (under the 'Notebook' section) and give it a name. Choose an instance type that suits your workload; 'ml.t2.medium' is sufficient for this example.
Wait for the instance to be ready; this may take a couple of minutes. Once it is ready, open Jupyter.
Step 2: Data Preparation
Next, we need a dataset. For this tutorial, we will use the 'Sentiment140' dataset, which is commonly used for sentiment analysis. It contains 1.6 million tweets, each with a numeric polarity label: 0 for negative and 4 for positive (the small test set also uses 2 for neutral).
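import pandas as pd
# Assuming the data file is named 'Sentiment140.csv'.
# The standard Sentiment140 download has no header row and is Latin-1 encoded;
# the column names below are an assumption, so adjust them if your copy differs.
columns = ['polarity', 'id', 'date', 'query', 'user', 'text']
data = pd.read_csv('Sentiment140.csv', encoding='latin-1', header=None, names=columns)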
The 'Sentiment140' dataset, a popular dataset for sentiment analysis, can be obtained from the following sources:
Make sure to review any terms of use or licensing restrictions when downloading and using the data. It's crucial to respect the data usage policies set by the dataset provider.
The data must be split into training and validation datasets. We will use 80% of the data for training and the remaining 20% for validation.
from sklearn.model_selection import train_test_split
train_data, validation_data = train_test_split(data, test_size=0.2, random_state=42)
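To make the roles of test_size and random_state concrete, here is a minimal standard-library sketch of a seeded 80/20 split. It is not scikit-learn's implementation; the `split_rows` helper and the sample rows are hypothetical, and the sketch only illustrates that a fixed seed makes the shuffle, and therefore the split, reproducible.

```python
import random

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed, then hold out the last test_size fraction."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded RNG gives the same order every run
    n_validation = int(round(len(shuffled) * test_size))
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]

rows = [f"tweet {i}" for i in range(10)]   # hypothetical stand-in for the dataset
train_rows, validation_rows = split_rows(rows)
print(len(train_rows), len(validation_rows))  # prints: 8 2
```

Calling `split_rows(rows)` again with the same seed returns the same two lists, which is what makes experiments repeatable.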
Next, we will upload the data to an S3 bucket.
import sagemaker
session = sagemaker.Session()
# Specify your bucket name
bucket_name = 'your-bucket-name'
# Writing to an s3:// path with pandas requires the s3fs package
train_data.to_csv(f's3://{bucket_name}/train.csv', index=False)
validation_data.to_csv(f's3://{bucket_name}/validation.csv', index=False)
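S3 URIs always use forward slashes, whereas `os.path.join` uses the local OS separator (a backslash on Windows). A small helper, sketched here under the hypothetical name `s3_uri`, builds the paths the same way on every platform.

```python
import posixpath

def s3_uri(bucket, *key_parts):
    """Join a bucket name and key parts into an s3:// URI using forward slashes."""
    key = posixpath.join(*key_parts) if key_parts else ""
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"

print(s3_uri("your-bucket-name", "train.csv"))
print(s3_uri("your-bucket-name", "output"))
```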
Step 3: Training the Model
For NLP tasks, SageMaker provides several built-in algorithms like BlazingText, Seq2Seq, and more. Here, we will use BlazingText for text classification.
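BlazingText's supervised mode expects a plain-text file with one example per line: the label prefixed with `__label__`, followed by the space-separated tokens of the sentence. A raw CSV export does not follow this layout, so the data needs converting first. A minimal sketch of that conversion follows; the `to_blazingtext_line` helper, the lowercase whitespace tokenization, and the mapping of polarity codes to names are illustrative assumptions.

```python
# Assumed mapping from Sentiment140 polarity codes to readable label names.
POLARITY_NAMES = {0: "negative", 2: "neutral", 4: "positive"}

def to_blazingtext_line(polarity, text):
    """Format one example as '__label__<name> token token ...'."""
    tokens = text.lower().split()      # naive whitespace tokenization (an assumption)
    return f"__label__{POLARITY_NAMES[polarity]} " + " ".join(tokens)

# Hypothetical rows standing in for (polarity, tweet text) pairs from the dataset.
examples = [(4, "I love AWS SageMaker!"), (0, "My training job   failed again")]
lines = [to_blazingtext_line(p, t) for p, t in examples]
print("\n".join(lines))
```

Writing such lines to the training and validation files, one per tweet, should give BlazingText input it can parse.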
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()
# Define the locations of the training and validation data
s3_train_data = 's3://{}/train.csv'.format(bucket_name)
s3_validation_data = 's3://{}/validation.csv'.format(bucket_name)
# Define the model output location
s3_output_location = 's3://{}/output'.format(bucket_name)
# Look up the BlazingText container image (SageMaker Python SDK v2 API)
container = sagemaker.image_uris.retrieve("blazingtext", session.boto_region_name)
# Configure the training job (SDK v2 drops the 'train_' prefix from these arguments)
bt_model = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    volume_size=5,
    max_run=360000,
    input_mode='File',
    output_path=s3_output_location,
    sagemaker_session=session)
# Set the hyperparameters
bt_model.set_hyperparameters(mode="supervised",
                             epochs=10,
                             min_count=2,
                             learning_rate=0.05,
                             vector_dim=10,
                             early_stopping=True,
                             patience=4,
                             min_epochs=5)
# Define the data channels; early stopping requires a validation channel
train_channel = TrainingInput(s3_train_data, distribution='FullyReplicated',
                              content_type='text/plain', s3_data_type='S3Prefix')
validation_channel = TrainingInput(s3_validation_data, distribution='FullyReplicated',
                                   content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_channel, 'validation': validation_channel}
# Start the training job
bt_model.fit(inputs=data_channels, logs=True)
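The `early_stopping`, `patience`, and `min_epochs` hyperparameters work together: roughly, training stops once validation accuracy has failed to improve for `patience` epochs in a row, but not before `min_epochs` epochs have run. The following is a conceptual sketch of that rule, not BlazingText's internal code; the `stopping_epoch` function and the accuracy values are hypothetical.

```python
def stopping_epoch(val_accuracies, patience=4, min_epochs=5):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Stop only after the minimum number of epochs has been reached.
        if epoch >= min_epochs and epochs_without_improvement >= patience:
            return epoch
    return len(val_accuracies)  # ran every epoch without triggering the rule

# Hypothetical validation accuracies for ten epochs.
history = [0.70, 0.74, 0.76, 0.75, 0.76, 0.75, 0.74, 0.75, 0.73, 0.72]
print(stopping_epoch(history))  # prints: 7
```

In this made-up run the best accuracy is reached at epoch 3, and the four epochs that follow bring no improvement, so training ends at epoch 7 instead of running all ten.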
Step 4: Deploy the Model
Now that our model is trained, we can deploy it to make predictions.
text_classifier = bt_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
This might take a few minutes.
Step 5: Predicting with the Model
Now we can use the deployed model to infer sentiments of tweets.
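import json
from sagemaker.serializers import JSONSerializer
# Send requests as JSON (SageMaker Python SDK v2)
text_classifier.serializer = JSONSerializer()
# The sample tweet to predict sentiment
tweet = "I love AWS SageMaker. It's awesome!"
# The payload is a JSON object whose "instances" key holds a list of sentences
payload = {"instances": [tweet]}
response = text_classifier.predict(payload)
predictions = json.loads(response)
print(json.dumps(predictions, indent=2))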
This returns the predicted label for the tweet along with its probability.
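BlazingText typically returns one entry per input sentence, with parallel `label` and `prob` lists. A small sketch of turning that into readable output follows; the `sample_response` value is made up for illustration and not captured from a real endpoint, and `top_prediction` is a hypothetical helper.

```python
import json

# Made-up response body in the shape BlazingText is documented to return.
sample_response = '[{"label": ["__label__4"], "prob": [0.97]}]'

def top_prediction(entry, prefix="__label__"):
    """Return (label without prefix, probability) for the most likely class."""
    best = max(range(len(entry["prob"])), key=lambda i: entry["prob"][i])
    label = entry["label"][best]
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label, entry["prob"][best]

for entry in json.loads(sample_response):
    label, prob = top_prediction(entry)
    print(f"{label}: {prob:.2f}")
```

The label that comes back is whatever followed `__label__` in the training file, so with raw Sentiment140 polarity codes a `4` would indicate a positive tweet.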
Step 6: Clean Up
Finally, remember to delete the SageMaker endpoint to avoid unnecessary charges.
text_classifier.delete_endpoint()
And that's it! You have just built a text classification model using NLP with AWS SageMaker. I hope this tutorial has been helpful to both beginners and advanced users alike. Feel free to comment with any questions or insights.
Remember, this is just the beginning. AWS SageMaker offers a lot more features and capabilities that you can explore. Keep learning and experimenting!