AWS SageMaker on a Budget: Train Models with Custom Docker Images and Spot Instances


AWS SageMaker is a fully managed service that simplifies building, training, and deploying machine learning models at scale. One of its standout features is the ability to use custom Docker images for training your models. In this article, we’ll explore how to build and deploy a custom model training pipeline in AWS SageMaker using your own Docker container.

While SageMaker provides pre-built containers for popular deep learning frameworks such as TensorFlow, PyTorch, MXNet, HuggingFace, and XGBoost, there are times when you may want to use a custom algorithm for your training. If the provided SageMaker managed containers don’t suit your needs, you can easily bring your own container for a personalized training solution.

How SageMaker Manages Your Infrastructure

  1. Uploading Code and Dependencies: Your training script and dependencies are uploaded to Amazon S3.
  2. Provisioning Infrastructure: SageMaker provisions the necessary compute resources in a fully managed cluster.
  3. Container Setup: Your specified container image (e.g., a PyTorch container) is pulled and instantiated on each training instance.
  4. Code Deployment: The training code is retrieved from S3 and made available inside the container.
  5. Dataset Access: The training data is pulled from S3 and made accessible to the container.
  6. Training Execution: SageMaker starts the training process on the provisioned instances.
  7. Model Storage: Once training completes, the model artifacts are saved to a designated S3 location.

Simplifying Training with SageMaker Managed Spot Instances

Training machine learning models on spot instances can be tricky since these instances may be terminated with little notice (as short as two minutes!). This can potentially disrupt your training. However, AWS SageMaker makes it simple by automatically backing up your training checkpoints to S3.

If a training instance is interrupted, SageMaker restarts the job and resumes from the last checkpoint. It handles copying your data and the latest checkpoint to a new instance, so you don’t lose any progress. With SageMaker’s managed spot training, you get the cost savings of spot capacity without losing work to interruptions.

Understanding SageMaker's Container Folder Structure

SageMaker organizes training data and model artifacts inside containers under the /opt/ml directory. Here’s a breakdown of the key directories:

  • Input Data: Training data is stored in /opt/ml/input/data/<channel_name>/.
  • Model Artifacts: Trained model files are saved in /opt/ml/model/.

After training, SageMaker packages the model into a tar archive and uploads it to the designated S3 location for easy retrieval.


Figure: SageMaker container folder structure
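
To make these paths concrete, here is a minimal sketch of how a training script might read its inputs and write its artifacts. The channel name training_data matches the one used later in this article; the model and file name are placeholders:

import os

import torch

# SageMaker mounts each input channel under /opt/ml/input/data/<channel_name>/
training_data_dir = '/opt/ml/input/data/training_data'
image_files = [f for f in os.listdir(training_data_dir) if f.lower().endswith(('.jpg', '.jpeg'))]

model = torch.nn.Linear(10, 2)  # placeholder model; train it on image_files here

# Anything written to /opt/ml/model/ is tarred by SageMaker and uploaded to the S3 output path
model_dir = '/opt/ml/model'
os.makedirs(model_dir, exist_ok=True)
torch.save(model.state_dict(), os.path.join(model_dir, 'model.pth'))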

Writing the Dockerfile

To create a custom Docker container for your training job, start by defining your Dockerfile. Below is an example of a base Dockerfile with CUDA and cuDNN support:

# Base image with CUDA and cuDNN
FROM nvidia/cuda:12.6.2-cudnn-devel-ubuntu22.04

# Install dependencies
RUN apt-get -y update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    libusb-1.0-0-dev \
    libudev-dev \
    build-essential \
    python3 \
    python3-pip \
    python3-dev \
    ca-certificates \
    openssl \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    cmake && \
    rm -fr /var/lib/apt/lists/*

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=TRUE \
    PYTHONDONTWRITEBYTECODE=TRUE \
    PATH=/usr/local/cuda/bin:$PATH \
    PYTHONPATH=/opt/ml/input/data/code \
    CUDA_VISIBLE_DEVICES=0

# Link python3 to python
RUN ln -s /usr/bin/python3 /usr/bin/python

# Set working directory for SageMaker code
WORKDIR /opt/ml/input/data/code

# Copy requirements.txt to the container
COPY requirements.txt ./

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt        
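
The requirements.txt copied above simply lists whatever your training code needs; a hypothetical example might be:

torch
torchvision
numpy
boto3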

Avoiding Frequent Container Rebuilds

If you package your training code inside the container, you’ll need to rebuild it every time you make changes. This can slow down your workflow. Instead, SageMaker allows you to upload your training code externally, so you can easily iterate without rebuilding the entire container.
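
A minimal sketch of that upload step, assuming a placeholder bucket and prefix that the 'code' input channel will later point to:

import os

import boto3

def upload_code(local_dir='src', bucket='my-training-bucket', prefix='jobs/my-job/code'):
    # Upload every file in local_dir to s3://<bucket>/<prefix>/ so the 'code' channel can pull it
    s3 = boto3.client('s3')
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(local_path, local_dir)}"
            s3.upload_file(local_path, bucket, key)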

If you prefer to package your code inside the container, SageMaker fully supports that as well. Either way, you’ll have flexibility to choose what works best for your workflow.

Building and Pushing the Docker Image

After writing your Dockerfile, you can build and push the container image to AWS Elastic Container Registry (ECR):

1. Build the Docker image

docker build -t {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com/{Custom_Image_Name}:{tag} .        

2. Login to AWS ECR:

aws ecr get-login-password --region {Your_AWS_Region} | docker login --username AWS --password-stdin {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com        

3. Push the image:

docker push {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com/{Custom_Image_Name}:{tag}        

Now your custom container is ready to use in SageMaker!

Constructing the SageMaker Training Job JSON Object

Once your image is pushed to ECR, define the training job configuration as a JSON-style dictionary that will be passed to the CreateTrainingJob API:

def configure_training_job():
    return {
        'TrainingJobName': JOB_NAME,
        'HyperParameters': HYPERPARAMETERS,
        'AlgorithmSpecification': {
            'TrainingImage': ECR_CONTAINER_URL,
            'TrainingInputMode': 'File',
            "ContainerEntrypoint": ["python"],
            "ContainerArguments": ["/opt/ml/input/data/code/train.py"],
        },
        'RoleArn': SAGEMAKER_ROLE,
        'InputDataConfig': [
            {
                'ChannelName': 'code',
                'DataSource': {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": uris['code'],
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "InputMode": "File",
            },
            {
                'ChannelName': 'training_data',
                'DataSource': {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": uris['training_data'],
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                'ContentType': 'image/jpg',
                'CompressionType': 'None'
            }
        ],
        'OutputDataConfig': {
            "S3OutputPath": <outputpath>
        },
        'ResourceConfig': {
            "InstanceType": INSTANCE_TYPE,
            "InstanceCount": INSTANCE_COUNT,
            "VolumeSizeInGB": MEMORY_VOLUME
        },
        'StoppingCondition': {
            'MaxWaitTimeInSeconds': 86400 * 4,
            'MaxRuntimeInSeconds': 86400 * 4
        },
        'EnableManagedSpotTraining': True,
        'CheckpointConfig': {
            'S3Uri': <s3 path for storing checkpoints>,
            'LocalPath': '/opt/ml/checkpoints'
        },
        "Environment": {
            'ENV': 'SageMaker'
        },
        "Tags": [{
            'Key': 'JOB_NAME',
            'Value': JOB_NAME
        }]
    }
        

Here we set the container entrypoint to python and pass the training script as an argument, so the container runs the train.py file located at /opt/ml/input/data/code/train.py.

        'AlgorithmSpecification': {
            'TrainingImage': ECR_CONTAINER_URL,
            'TrainingInputMode': 'File',
            "ContainerEntrypoint": ["python"],
            "ContainerArguments": ["/opt/ml/input/data/code/train.py"],
        }        

This is how train.py and its helper files become available inside the container: SageMaker downloads every Python script under the code channel’s S3 prefix into /opt/ml/input/data/code/ before the job starts.

        'InputDataConfig': [
            {
                'ChannelName': 'code',
                'DataSource': {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": uris['code'],
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "InputMode": "File",
            }
        ]

Note: HYPERPARAMETERS is a JSON object, a dictionary whose keys and values must all be strings.
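
SageMaker also writes this map into the container at /opt/ml/input/config/hyperparameters.json (all values arrive as strings), so train.py can read it back. The specific keys shown here are hypothetical:

import json

# SageMaker drops the HyperParameters map here as a JSON file of string values
with open('/opt/ml/input/config/hyperparameters.json') as f:
    hyperparameters = json.load(f)

epochs = int(hyperparameters.get('epochs', '10'))
learning_rate = float(hyperparameters.get('learning_rate', '0.001'))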

Handling Spot Instance Interruptions

Training jobs on spot instances can be interrupted at any time. To safeguard against lost progress, you can save your model checkpoints after each epoch. When SageMaker reclaims the spot instance, your training will resume from the latest checkpoint, ensuring no work is lost.

Here's how to manage checkpoints and resume training:

import logging
import os

import torch

logger = logging.getLogger(__name__)

def save_checkpoint(epoch, model, training_components, checkpoints_file_path):
    # Persist the model and optimizer state so training can resume after a spot interruption
    checkpoint_name = f"epoch_{epoch}.pth"
    checkpoint_path = os.path.join(checkpoints_file_path, checkpoint_name)
    checkpoint_data = {
        'epoch': epoch,
        'model': model.state_dict(),
        'optim': training_components.optimizer.state_dict()
    }
    torch.save(checkpoint_data, checkpoint_path)
    logger.info(f"Epoch {epoch}: Saving checkpoint to {checkpoint_path}")

def find_latest_epoch_file(checkpoints_location):
    # Look for files named epoch_<N>.pth and return the newest one plus the epoch to resume from
    files = [file for file in os.listdir(checkpoints_location) if file.startswith("epoch_") and file.endswith(".pth")]
    if not files:
        return None, 1
    latest_file = max(files, key=lambda x: int(x.split('_')[1].split('.')[0]))
    latest_epoch = int(latest_file.split('_')[1].split('.')[0]) + 1
    return latest_file, latest_epoch

Here's a brief explanation of the code:

  • save_checkpoint: Saves the model and optimizer state at the end of each epoch to a checkpoint file (epoch_{epoch}.pth), allowing training to resume later.
  • find_latest_epoch_file: Finds the most recent checkpoint file in the given checkpoints directory and returns its filename along with the next epoch to continue from. If no checkpoint exists, training starts from epoch 1.
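
Here is a minimal sketch of how these helpers could be wired into the training loop. train_one_epoch and training_components stand in for your own code, and /opt/ml/checkpoints matches the LocalPath set in CheckpointConfig above:

import os

import torch

def train(model, training_components, checkpoints_dir='/opt/ml/checkpoints', total_epochs=100):
    os.makedirs(checkpoints_dir, exist_ok=True)

    # Resume from the newest checkpoint if SageMaker restored one after an interruption
    latest_file, start_epoch = find_latest_epoch_file(checkpoints_dir)
    if latest_file is not None:
        checkpoint = torch.load(os.path.join(checkpoints_dir, latest_file))
        model.load_state_dict(checkpoint['model'])
        training_components.optimizer.load_state_dict(checkpoint['optim'])

    for epoch in range(start_epoch, total_epochs + 1):
        train_one_epoch(model, training_components)  # placeholder for your per-epoch training step
        save_checkpoint(epoch, model, training_components, checkpoints_dir)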

Creating the SageMaker Training Job

Once your configuration is complete, launch the SageMaker training job:

import boto3

def main():
    # Submit the training job defined in configure_training_job()
    sage_maker_client = boto3.client('sagemaker', region_name='eu-north-1')
    training_job_config = configure_training_job()
    sage_maker_client.create_training_job(**training_job_config)

if __name__ == "__main__":
    main()

With this setup, your model will train on SageMaker, and spot instance interruptions will no longer disrupt your training process.
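
If you want to follow the job from the same script, a simple polling loop with describe_training_job works. This is a rough sketch; the job name must match the TrainingJobName used in the configuration:

import time

import boto3

def wait_for_job(job_name, region='eu-north-1', poll_seconds=60):
    client = boto3.client('sagemaker', region_name=region)
    while True:
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description['TrainingJobStatus']  # InProgress, Completed, Failed, Stopping, Stopped
        print(f"{job_name}: {status}")
        if status in ('Completed', 'Failed', 'Stopped'):
            return description
        time.sleep(poll_seconds)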

Bonus: You can also set up AWS EventBridge to receive notifications about your training job status, sending updates to Slack or via email!
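
As a rough sketch of that setup, an EventBridge rule can match SageMaker training job state changes and route them to a target such as an SNS topic that fans out to Slack or email. The rule name and SNS topic ARN below are placeholders:

import json

import boto3

events = boto3.client('events', region_name='eu-north-1')

# Match state changes for SageMaker training jobs (Completed, Failed, Stopped, ...)
events.put_rule(
    Name='sagemaker-training-status',
    EventPattern=json.dumps({
        'source': ['aws.sagemaker'],
        'detail-type': ['SageMaker Training Job State Change'],
    }),
    State='ENABLED',
)

# Send matching events to an existing SNS topic (ARN is a placeholder)
events.put_targets(
    Rule='sagemaker-training-status',
    Targets=[{'Id': 'notify-sns', 'Arn': 'arn:aws:sns:eu-north-1:123456789012:training-alerts'}],
)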


Conclusion

In this tutorial, we’ve walked through the entire process of building, training, and deploying a custom model training pipeline using AWS SageMaker and Docker. By leveraging SageMaker’s managed infrastructure, we can focus on building sophisticated machine learning models while AWS takes care of the heavy lifting—provisioning resources, handling spot instances, and seamlessly managing interruptions.

Why Use AWS SageMaker Spot Instances?

Spot Instances are an excellent choice for machine learning training jobs, especially when dealing with large datasets or computationally expensive models. They provide significant cost savings, sometimes as much as 90% compared to on-demand instances, making them an attractive option for budget-conscious projects.

With AWS SageMaker, using Spot Instances becomes hassle-free due to automatic checkpointing and seamless resumption of training. If a spot instance is interrupted, SageMaker will restore your training from the last saved checkpoint, eliminating the risk of losing progress. This built-in resilience ensures that you can take advantage of the cost savings without compromising the reliability of your training pipeline.

Whether you’re working with popular deep learning frameworks or your own custom training algorithm, SageMaker’s flexibility allows you to bring your own Docker containers, providing complete control over the training environment. Additionally, with easy access to Amazon S3 for storing model artifacts and the ability to scale efficiently, SageMaker makes it easier to train your models faster and more cost-effectively.

By the end of this guide, you should have the knowledge to confidently implement a scalable and reliable machine learning pipeline on AWS SageMaker, enabling you to deploy models at scale with minimal hassle. With SageMaker handling the infrastructure, including the efficient use of Spot Instances, you can focus on what truly matters—innovating and advancing your machine learning models.

Happy building!
