AWS SageMaker on a Budget: Train Models with Custom Docker Images and Spot Instances
Md. Moniruzzaman
AWS SageMaker is a fully managed service that simplifies building, training, and deploying machine learning models at scale. One of its standout features is the ability to use custom Docker images for training your models. In this article, we’ll explore how to build and deploy a custom model training pipeline in AWS SageMaker using your own Docker container.
While SageMaker provides pre-built containers for popular deep learning frameworks such as TensorFlow, PyTorch, MXNet, HuggingFace, and XGBoost, there are times when you may want to use a custom algorithm for your training. If the provided SageMaker managed containers don’t suit your needs, you can easily bring your own container for a personalized training solution.
How SageMaker Manages Your Infrastructure
Simplifying Training with SageMaker Managed Spot Instances
Training machine learning models on spot instances can be tricky, since these instances may be terminated with little notice (as little as two minutes). This can disrupt a long-running training job. AWS SageMaker makes it manageable by automatically backing up your training checkpoints to S3.
If a training instance is interrupted, SageMaker restarts the job on a new instance, copying your data and the latest checkpoints over so that training resumes where it left off, provided your training script saves checkpoints and loads the latest one on startup (covered later in this article). With SageMaker's managed spot training, you can train your models cost-effectively without losing progress.
Understanding SageMaker's Container Folder Structure
SageMaker organizes training data and model artifacts inside containers under the /opt/ml directory. Here’s a breakdown of the key directories:
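- /opt/ml/input/data/<channel_name>: input files for each channel defined in InputDataConfig (in this article, code and training_data)
- /opt/ml/input/config: job configuration files such as hyperparameters.json and resourceconfig.json
- /opt/ml/model: where your script writes the final model artifacts
- /opt/ml/checkpoints: the default local checkpoint path, continuously synced to the S3 location given in CheckpointConfig
- /opt/ml/output: where your script can write a failure file describing why a job failed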
After training, SageMaker packages the contents of /opt/ml/model into a tar archive (model.tar.gz) and uploads it to the S3 location specified in OutputDataConfig for easy retrieval.
Writing the Dockerfile
To create a custom Docker container for your training job, start by defining your Dockerfile. Below is an example of a base Dockerfile with CUDA and cuDNN support:
# Base image with CUDA and cuDNN
FROM nvidia/cuda:12.6.2-cudnn-devel-ubuntu22.04
# Set environment variables (DEBIAN_FRONTEND must be set before apt-get
# runs so that package installs never prompt for input)
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=TRUE \
    PYTHONDONTWRITEBYTECODE=TRUE \
    PATH=/usr/local/cuda/bin:$PATH \
    PYTHONPATH=/opt/ml/input/data/code \
    CUDA_VISIBLE_DEVICES=0
# Install system dependencies
RUN apt-get -y update && apt-get install -y --no-install-recommends \
    libusb-1.0-0-dev \
    libudev-dev \
    build-essential \
    python3 \
    python3-pip \
    python3-dev \
    ca-certificates \
    openssl \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    cmake && \
    rm -rf /var/lib/apt/lists/*
# Link python3 to python
RUN ln -s /usr/bin/python3 /usr/bin/python
# Set the working directory where SageMaker mounts the code channel
WORKDIR /opt/ml/input/data/code
# Copy requirements.txt into the container
COPY requirements.txt ./
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
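For reference, a minimal requirements.txt for a PyTorch training job might look like the following; the exact packages and versions are placeholders that depend on your model:
torch
torchvision
numpy
boto3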
Avoiding Frequent Container Rebuilds
If you package your training code inside the container, you'll need to rebuild the image every time the code changes, which slows down iteration. Instead, SageMaker lets you upload your training code to S3 and deliver it to the container as an input channel, so you can iterate without rebuilding, as shown below.
If you prefer to bake your code into the image, SageMaker fully supports that as well. Either way, you have the flexibility to choose what works best for your workflow.
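For example, you might sync your local source directory to the S3 prefix that the code channel (defined later in InputDataConfig) points at; the bucket and prefix here are placeholders:
aws s3 sync ./src s3://your-bucket/sagemaker/code/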
Building and Pushing the Docker Image
After writing your Dockerfile, you can build and push the container image to AWS Elastic Container Registry (ECR):
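If the ECR repository doesn't exist yet, create it once before pushing, using the same placeholders as the commands below:
aws ecr create-repository --repository-name {Custom_Image_Name} --region {Your_AWS_Region}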
1. Build the Docker image:
docker build -t {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com/{Custom_Image_Name}:{tag} .
2. Log in to AWS ECR:
aws ecr get-login-password --region {Your_AWS_Region} | docker login --username AWS --password-stdin {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com
3. Push the image:
docker push {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com/{Custom_Image_Name}:{tag}
Now your custom container is ready to use in SageMaker!
Constructing the SageMaker Training Job JSON Object
Once your image is pushed to ECR, configure the SageMaker training job by building its request payload (here as a Python dictionary):
def configure_training_job():
    return {
        'TrainingJobName': JOB_NAME,
        'HyperParameters': HYPERPARAMETERS,
        'AlgorithmSpecification': {
            'TrainingImage': ECR_CONTAINER_URL,
            'TrainingInputMode': 'File',
            'ContainerEntrypoint': ['python'],
            'ContainerArguments': ['/opt/ml/input/data/code/train.py'],
        },
        'RoleArn': SAGEMAKER_ROLE,
        'InputDataConfig': [
            {
                'ChannelName': 'code',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': uris['code'],
                        'S3DataDistributionType': 'FullyReplicated'
                    }
                },
                'InputMode': 'File',
            },
            {
                'ChannelName': 'training_data',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': uris['training_data'],
                        'S3DataDistributionType': 'FullyReplicated'
                    }
                },
                'ContentType': 'image/jpeg',
                'CompressionType': 'None'
            }
        ],
        'OutputDataConfig': {
            'S3OutputPath': S3_OUTPUT_PATH  # placeholder, e.g. 's3://your-bucket/models/'
        },
        'ResourceConfig': {
            'InstanceType': INSTANCE_TYPE,
            'InstanceCount': INSTANCE_COUNT,
            'VolumeSizeInGB': MEMORY_VOLUME
        },
        'StoppingCondition': {
            'MaxWaitTimeInSeconds': 86400 * 4,  # must be >= MaxRuntimeInSeconds
            'MaxRuntimeInSeconds': 86400 * 4
        },
        'EnableManagedSpotTraining': True,
        'CheckpointConfig': {
            'S3Uri': S3_CHECKPOINT_PATH,  # placeholder, e.g. 's3://your-bucket/checkpoints/'
            'LocalPath': '/opt/ml/checkpoints'
        },
        'Environment': {
            'ENV': 'SageMaker'
        },
        'Tags': [{
            'Key': 'JOB_NAME',
            'Value': JOB_NAME
        }]
    }
Here we set the container's ENTRYPOINT to python and pass the training script as its argument, so the container executes train.py, which will be available at /opt/ml/input/data/code/train.py:
'AlgorithmSpecification': {
    'TrainingImage': ECR_CONTAINER_URL,
    'TrainingInputMode': 'File',
    'ContainerEntrypoint': ['python'],
    'ContainerArguments': ['/opt/ml/input/data/code/train.py'],
}
The code input channel below is how train.py and its helper files become available inside the Docker container: SageMaker downloads all of the Python scripts from the S3 location into /opt/ml/input/data/code before the job starts.
'InputDataConfig': [
    {
        'ChannelName': 'code',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': uris['code'],
                'S3DataDistributionType': 'FullyReplicated'
            }
        },
        'InputMode': 'File',
    }
]
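The uris dictionary referenced in the configuration is simply a map from channel names to S3 prefixes. A minimal sketch, with placeholder bucket and prefixes:
uris = {
    'code': 's3://your-bucket/sagemaker/code/',
    'training_data': 's3://your-bucket/datasets/images/'
}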
Note: HYPERPARAMETERS is a JSON object; SageMaker expects a flat map in which both keys and values are strings.
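A minimal sketch, with hypothetical hyperparameter names (note that even numeric values are passed as strings):
HYPERPARAMETERS = {
    'epochs': '50',
    'batch_size': '32',
    'learning_rate': '0.001'
}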
Handling Spot Instance Interruptions
Training jobs on spot instances can be interrupted at any time. To safeguard against lost progress, save a model checkpoint after each epoch. When a spot instance is reclaimed and SageMaker restarts the job, your training can then resume from the latest checkpoint, ensuring no work is lost.
Here's how to manage checkpoints and resume training:
import os
import logging

import torch

logger = logging.getLogger(__name__)

def save_checkpoint(epoch, model, training_components, checkpoints_file_path):
    """Persist the model and optimizer state at the end of an epoch."""
    checkpoint_name = f"epoch_{epoch}.pth"
    checkpoint_path = os.path.join(checkpoints_file_path, checkpoint_name)
    checkpoint_data = {
        'epoch': epoch,
        'model': model.state_dict(),
        'optim': training_components.optimizer.state_dict()
    }
    torch.save(checkpoint_data, checkpoint_path)
    logger.info(f"Epoch {epoch}: saved checkpoint to {checkpoint_path}")

def find_latest_epoch_file(checkpoints_location):
    """Return the newest checkpoint file and the epoch to resume from."""
    files = [f for f in os.listdir(checkpoints_location)
             if f.startswith("epoch_") and f.endswith(".pth")]
    if not files:
        return None, 1  # no checkpoint yet: start from epoch 1
    latest_file = max(files, key=lambda f: int(f.split('_')[1].split('.')[0]))
    latest_epoch = int(latest_file.split('_')[1].split('.')[0]) + 1
    return latest_file, latest_epoch
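A minimal sketch of how these helpers fit into the training loop; CHECKPOINTS_PATH, NUM_EPOCHS, and train_one_epoch are assumed stand-ins for your own code:
latest_file, start_epoch = find_latest_epoch_file(CHECKPOINTS_PATH)
if latest_file is not None:
    # Restore model and optimizer state before continuing training
    checkpoint = torch.load(os.path.join(CHECKPOINTS_PATH, latest_file))
    model.load_state_dict(checkpoint['model'])
    training_components.optimizer.load_state_dict(checkpoint['optim'])

for epoch in range(start_epoch, NUM_EPOCHS + 1):
    train_one_epoch(model, training_components)  # your per-epoch training step
    save_checkpoint(epoch, model, training_components, CHECKPOINTS_PATH)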
Here's a brief explanation of the code:
- save_checkpoint serializes the epoch number along with the model and optimizer state to epoch_<n>.pth in the checkpoint directory, which SageMaker continuously syncs to the S3 location given in CheckpointConfig.
- find_latest_epoch_file scans that directory for epoch_*.pth files and returns the most recent one together with the epoch to resume from, or None and epoch 1 when no checkpoint exists yet.
Creating the SageMaker Training Job
Once your configuration is complete, launch the SageMaker training job:
import boto3

def main():
    sage_maker_client = boto3.client('sagemaker', region_name='eu-north-1')
    training_job_config = configure_training_job()
    sage_maker_client.create_training_job(**training_job_config)

if __name__ == "__main__":
    main()
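After submitting, you can check on the job programmatically. A small sketch using the same boto3 client, for example inside main() after the create call:
status = sage_maker_client.describe_training_job(
    TrainingJobName=JOB_NAME
)['TrainingJobStatus']
print(f"Training job status: {status}")  # e.g. InProgress, Completed, Failed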
With this setup, your model will train on SageMaker, and spot instance interruptions will no longer disrupt your training process.
Bonus: you can also set up Amazon EventBridge to receive notifications about your training job status and forward updates to Slack or email!
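A minimal sketch of such a rule with boto3; the rule name and SNS topic ARN are placeholders, and the topic would handle forwarding to Slack or email:
import json
import boto3

events = boto3.client('events', region_name='eu-north-1')

# Fire on every SageMaker training job state change
events.put_rule(
    Name='sagemaker-training-status',
    EventPattern=json.dumps({
        'source': ['aws.sagemaker'],
        'detail-type': ['SageMaker Training Job State Change'],
    }),
)

# Send matching events to an SNS topic (placeholder ARN)
events.put_targets(
    Rule='sagemaker-training-status',
    Targets=[{'Id': 'notify', 'Arn': 'arn:aws:sns:eu-north-1:123456789012:training-alerts'}],
)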
Conclusion
In this tutorial, we've walked through the entire process of building and running a custom model training pipeline on AWS SageMaker with your own Docker image. By leveraging SageMaker's managed infrastructure, we can focus on building sophisticated machine learning models while AWS takes care of the heavy lifting: provisioning resources, handling spot instances, and seamlessly managing interruptions.
Why Use AWS SageMaker Spot Instances?
Spot Instances are an excellent choice for machine learning training jobs, especially when dealing with large datasets or computationally expensive models. They can cut costs by up to 90% compared to on-demand instances, making them an attractive option for budget-conscious projects.
With AWS SageMaker, using Spot Instances becomes hassle-free due to automatic checkpointing and seamless resumption of training. If a spot instance is interrupted, SageMaker will restore your training from the last saved checkpoint, eliminating the risk of losing progress. This built-in resilience ensures that you can take advantage of the cost savings without compromising the reliability of your training pipeline.
Whether you’re working with popular deep learning frameworks or your own custom training algorithm, SageMaker’s flexibility allows you to bring your own Docker containers, providing complete control over the training environment. Additionally, with easy access to Amazon S3 for storing model artifacts and the ability to scale efficiently, SageMaker makes it easier to train your models faster and more cost-effectively.
By the end of this guide, you should have the knowledge to confidently implement a scalable and reliable machine learning training pipeline on AWS SageMaker. With SageMaker handling the infrastructure, including the efficient use of Spot Instances, you can focus on what truly matters: innovating and advancing your machine learning models.
Happy building!