AWS SageMaker on a Budget: Train Models with Custom Docker Images and Spot Instances


AWS SageMaker is a fully managed service that simplifies building, training, and deploying machine learning models at scale. One of its standout features is the ability to use custom Docker images for training your models. In this article, we’ll explore how to build and deploy a custom model training pipeline in AWS SageMaker using your own Docker container.

While SageMaker provides pre-built containers for popular deep learning frameworks such as TensorFlow, PyTorch, MXNet, HuggingFace, and XGBoost, there are times when you may want to use a custom algorithm for your training. If the provided SageMaker managed containers don’t suit your needs, you can easily bring your own container for a personalized training solution.

How SageMaker Manages Your Infrastructure

  1. Uploading Code and Dependencies: Your training script and dependencies are uploaded to Amazon S3.
  2. Provisioning Infrastructure: SageMaker provisions the necessary compute resources in a fully managed cluster.
  3. Container Setup: Your specified container image (e.g., a PyTorch container) is pulled and instantiated on each training instance.
  4. Code Deployment: The training code is retrieved from S3 and made available inside the container.
  5. Dataset Access: The training data is pulled from S3 and made accessible to the container.
  6. Training Execution: SageMaker starts the training process on the provisioned instances.
  7. Model Storage: Once training completes, the model artifacts are saved to a designated S3 location.

Simplifying Training with SageMaker Managed Spot Instances

Training machine learning models on spot instances can be tricky since these instances may be terminated with little notice (as short as two minutes!). This can potentially disrupt your training. However, AWS SageMaker makes it simple by automatically backing up your training checkpoints to S3.

If a training instance is interrupted, SageMaker restarts the job and resumes from the last checkpoint. It handles copying your data and the latest checkpoint to a new instance, so you don’t lose any progress. With SageMaker’s managed spot training, you get the cost savings of spot capacity without losing work to interruptions.

Understanding SageMaker's Container Folder Structure

SageMaker organizes training data and model artifacts inside containers under the /opt/ml directory. Here’s a breakdown of the key directories:

  • Input Data: Training data is stored in /opt/ml/input/data/<channel_name>/.
  • Model Artifacts: Trained model files are saved in /opt/ml/model/.

After training, SageMaker packages the model into a tar archive and uploads it to the designated S3 location for easy retrieval.


Figure: SageMaker container folder structure
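
To make these paths concrete, here is a minimal sketch of how a training script might read its inputs and write its artifacts. The channel name training_data matches the one used later in this article; the model and file name are placeholders:

import os

import torch

# SageMaker mounts each input channel under /opt/ml/input/data/<channel_name>/
training_data_dir = '/opt/ml/input/data/training_data'
image_files = [f for f in os.listdir(training_data_dir) if f.lower().endswith(('.jpg', '.jpeg'))]

model = torch.nn.Linear(10, 2)  # placeholder model; train it on image_files here

# Anything written to /opt/ml/model/ is tarred by SageMaker and uploaded to the S3 output path
model_dir = '/opt/ml/model'
os.makedirs(model_dir, exist_ok=True)
torch.save(model.state_dict(), os.path.join(model_dir, 'model.pth'))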

Writing the Dockerfile

To create a custom Docker container for your training job, start by defining your Dockerfile. Below is an example of a base Dockerfile with CUDA and cuDNN support:

# Base image with CUDA and cuDNN
FROM nvidia/cuda:12.6.2-cudnn-devel-ubuntu22.04

# Install dependencies
RUN apt-get -y update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    libusb-1.0-0-dev \
    libudev-dev \
    build-essential \
    python3 \
    python3-pip \
    python3-dev \
    ca-certificates \
    openssl \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    cmake && \
    rm -fr /var/lib/apt/lists/*

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=TRUE \
    PYTHONDONTWRITEBYTECODE=TRUE \
    PATH=/usr/local/cuda/bin:$PATH \
    PYTHONPATH=/opt/ml/input/data/code \
    CUDA_VISIBLE_DEVICES=0

# Link python3 to python
RUN ln -s /usr/bin/python3 /usr/bin/python

# Set working directory for SageMaker code
WORKDIR /opt/ml/input/data/code

# Copy requirements.txt to the container
COPY requirements.txt ./

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt        
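
The requirements.txt copied above simply lists whatever your training code needs; a hypothetical example might be:

torch
torchvision
numpy
boto3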

Avoiding Frequent Container Rebuilds

If you package your training code inside the container, you’ll need to rebuild it every time you make changes. This can slow down your workflow. Instead, SageMaker allows you to upload your training code externally, so you can easily iterate without rebuilding the entire container.
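
A minimal sketch of that upload step, assuming a placeholder bucket and prefix that the 'code' input channel will later point to:

import os

import boto3

def upload_code(local_dir='src', bucket='my-training-bucket', prefix='jobs/my-job/code'):
    # Upload every file in local_dir to s3://<bucket>/<prefix>/ so the 'code' channel can pull it
    s3 = boto3.client('s3')
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(local_path, local_dir)}"
            s3.upload_file(local_path, bucket, key)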

If you prefer to package your code inside the container, SageMaker fully supports that as well. Either way, you’ll have flexibility to choose what works best for your workflow.

Building and Pushing the Docker Image

After writing your Dockerfile, you can build and push the container image to AWS Elastic Container Registry (ECR):

1. Build the Docker image

docker build -t {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com/{Custom_Image_Name}:{tag} .        

2. Login to AWS ECR:

aws ecr get-login-password --region {Your_AWS_Region} | docker login --username AWS --password-stdin {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com        

3. Push the image:

docker push {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com/{Custom_Image_Name}:{tag}        

Now your custom container is ready to use in SageMaker!

Constructing the SageMaker Training Job JSON Object

Once your image is pushed to ECR, define the training job configuration as a JSON-style dictionary that will be passed to the CreateTrainingJob API:

def configure_training_job():
    return {
        'TrainingJobName': JOB_NAME,
        'HyperParameters': HYPERPARAMETERS,
        'AlgorithmSpecification': {
            'TrainingImage': ECR_CONTAINER_URL,
            'TrainingInputMode': 'File',
            "ContainerEntrypoint": ["python"],
            "ContainerArguments": ["/opt/ml/input/data/code/train.py"],
        },
        'RoleArn': SAGEMAKER_ROLE,
        'InputDataConfig': [
            {
                'ChannelName': 'code',
                'DataSource': {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": uris['code'],
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "InputMode": "File",
            },
            {
                'ChannelName': 'training_data',
                'DataSource': {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": uris['training_data'],
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                'ContentType': 'image/jpg',
                'CompressionType': 'None'
            }
        ],
        'OutputDataConfig': {
            "S3OutputPath": <outputpath>
        },
        'ResourceConfig': {
            "InstanceType": INSTANCE_TYPE,
            "InstanceCount": INSTANCE_COUNT,
            "VolumeSizeInGB": MEMORY_VOLUME
        },
        'StoppingCondition': {
            'MaxWaitTimeInSeconds': 86400 * 4,
            'MaxRuntimeInSeconds': 86400 * 4
        },
        'EnableManagedSpotTraining': True,
        'CheckpointConfig': {
            'S3Uri': <s3 path for storing checkpoints>,
            'LocalPath': '/opt/ml/checkpoints'
        },
        "Environment": {
            'ENV': 'SageMaker'
        },
        "Tags": [{
            'Key': 'JOB_NAME',
            'Value': JOB_NAME
        }]
    }
        

Here we set the container entrypoint to python and pass the training script as an argument, so the container runs the train.py file located at /opt/ml/input/data/code/train.py.

        'AlgorithmSpecification': {
            'TrainingImage': ECR_CONTAINER_URL,
            'TrainingInputMode': 'File',
            "ContainerEntrypoint": ["python"],
            "ContainerArguments": ["/opt/ml/input/data/code/train.py"],
        }        

This is how train.py and its helper files become available inside the container: SageMaker downloads every Python script under the code channel’s S3 prefix into /opt/ml/input/data/code/ before the job starts.

        'InputDataConfig': [
            {
                'ChannelName': 'code',
                'DataSource': {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": uris['code'],
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "InputMode": "File",
            }
        ]

Note: HYPERPARAMETERS is a JSON object, a dictionary whose keys and values must all be strings.
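
SageMaker also writes this map into the container at /opt/ml/input/config/hyperparameters.json (all values arrive as strings), so train.py can read it back. The specific keys shown here are hypothetical:

import json

# SageMaker drops the HyperParameters map here as a JSON file of string values
with open('/opt/ml/input/config/hyperparameters.json') as f:
    hyperparameters = json.load(f)

epochs = int(hyperparameters.get('epochs', '10'))
learning_rate = float(hyperparameters.get('learning_rate', '0.001'))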

Handling Spot Instance Interruptions

Training jobs on spot instances can be interrupted at any time. To safeguard against lost progress, you can save your model checkpoints after each epoch. When SageMaker reclaims the spot instance, your training will resume from the latest checkpoint, ensuring no work is lost.

Here's how to manage checkpoints and resume training:

import logging
import os

import torch

logger = logging.getLogger(__name__)

def save_checkpoint(epoch, model, training_components, checkpoints_file_path):
    # Persist the model and optimizer state so training can resume after a spot interruption
    checkpoint_name = f"epoch_{epoch}.pth"
    checkpoint_path = os.path.join(checkpoints_file_path, checkpoint_name)
    checkpoint_data = {
        'epoch': epoch,
        'model': model.state_dict(),
        'optim': training_components.optimizer.state_dict()
    }
    torch.save(checkpoint_data, checkpoint_path)
    logger.info(f"Epoch {epoch}: Saving checkpoint to {checkpoint_path}")

def find_latest_epoch_file(checkpoints_location):
    # Look for files named epoch_<N>.pth and return the newest one plus the epoch to resume from
    files = [file for file in os.listdir(checkpoints_location) if file.startswith("epoch_") and file.endswith(".pth")]
    if not files:
        return None, 1
    latest_file = max(files, key=lambda x: int(x.split('_')[1].split('.')[0]))
    latest_epoch = int(latest_file.split('_')[1].split('.')[0]) + 1
    return latest_file, latest_epoch

Here's a brief explanation of the code:

  • save_checkpoint: Saves the model and optimizer state at the end of each epoch to a checkpoint file (epoch_{epoch}.pth), allowing training to resume later.
  • find_latest_epoch_file: Finds the most recent checkpoint file in the given checkpoints directory and returns its filename along with the next epoch to continue from. If no checkpoint exists, training starts from epoch 1.
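
Here is a minimal sketch of how these helpers could be wired into the training loop. train_one_epoch and training_components stand in for your own code, and /opt/ml/checkpoints matches the LocalPath set in CheckpointConfig above:

import os

import torch

def train(model, training_components, checkpoints_dir='/opt/ml/checkpoints', total_epochs=100):
    os.makedirs(checkpoints_dir, exist_ok=True)

    # Resume from the newest checkpoint if SageMaker restored one after an interruption
    latest_file, start_epoch = find_latest_epoch_file(checkpoints_dir)
    if latest_file is not None:
        checkpoint = torch.load(os.path.join(checkpoints_dir, latest_file))
        model.load_state_dict(checkpoint['model'])
        training_components.optimizer.load_state_dict(checkpoint['optim'])

    for epoch in range(start_epoch, total_epochs + 1):
        train_one_epoch(model, training_components)  # placeholder for your per-epoch training step
        save_checkpoint(epoch, model, training_components, checkpoints_dir)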

Creating the SageMaker Training Job

Once your configuration is complete, launch the SageMaker training job:

import boto3

def main():
    # Submit the training job defined in configure_training_job()
    sage_maker_client = boto3.client('sagemaker', region_name='eu-north-1')
    training_job_config = configure_training_job()
    sage_maker_client.create_training_job(**training_job_config)

if __name__ == "__main__":
    main()

With this setup, your model will train on SageMaker, and spot instance interruptions will no longer disrupt your training process.
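
If you want to follow the job from the same script, a simple polling loop with describe_training_job works. This is a rough sketch; the job name must match the TrainingJobName used in the configuration:

import time

import boto3

def wait_for_job(job_name, region='eu-north-1', poll_seconds=60):
    client = boto3.client('sagemaker', region_name=region)
    while True:
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description['TrainingJobStatus']  # InProgress, Completed, Failed, Stopping, Stopped
        print(f"{job_name}: {status}")
        if status in ('Completed', 'Failed', 'Stopped'):
            return description
        time.sleep(poll_seconds)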

Bonus: You can also set up AWS EventBridge to receive notifications about your training job status, sending updates to Slack or via email!
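
As a rough sketch of that setup, an EventBridge rule can match SageMaker training job state changes and route them to a target such as an SNS topic that fans out to Slack or email. The rule name and SNS topic ARN below are placeholders:

import json

import boto3

events = boto3.client('events', region_name='eu-north-1')

# Match state changes for SageMaker training jobs (Completed, Failed, Stopped, ...)
events.put_rule(
    Name='sagemaker-training-status',
    EventPattern=json.dumps({
        'source': ['aws.sagemaker'],
        'detail-type': ['SageMaker Training Job State Change'],
    }),
    State='ENABLED',
)

# Send matching events to an existing SNS topic (ARN is a placeholder)
events.put_targets(
    Rule='sagemaker-training-status',
    Targets=[{'Id': 'notify-sns', 'Arn': 'arn:aws:sns:eu-north-1:123456789012:training-alerts'}],
)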


Conclusion

In this tutorial, we’ve walked through the entire process of building, training, and deploying a custom model training pipeline using AWS SageMaker and Docker. By leveraging SageMaker’s managed infrastructure, we can focus on building sophisticated machine learning models while AWS takes care of the heavy lifting—provisioning resources, handling spot instances, and seamlessly managing interruptions.

Why Use AWS SageMaker Spot Instances?

Spot Instances are an excellent choice for machine learning training jobs, especially when dealing with large datasets or computationally expensive models. They provide significant cost savings, sometimes as much as 90% compared to on-demand instances, making them an attractive option for budget-conscious projects.

With AWS SageMaker, using Spot Instances becomes hassle-free due to automatic checkpointing and seamless resumption of training. If a spot instance is interrupted, SageMaker will restore your training from the last saved checkpoint, eliminating the risk of losing progress. This built-in resilience ensures that you can take advantage of the cost savings without compromising the reliability of your training pipeline.

Whether you’re working with popular deep learning frameworks or your own custom training algorithm, SageMaker’s flexibility allows you to bring your own Docker containers, providing complete control over the training environment. Additionally, with easy access to Amazon S3 for storing model artifacts and the ability to scale efficiently, SageMaker makes it easier to train your models faster and more cost-effectively.

By the end of this guide, you should have the knowledge to confidently implement a scalable and reliable machine learning pipeline on AWS SageMaker, enabling you to deploy models at scale with minimal hassle. With SageMaker handling the infrastructure, including the efficient use of Spot Instances, you can focus on what truly matters—innovating and advancing your machine learning models.

Happy building!
