AWS S3 Multipart Uploading


Hello everyone, I am Milind Verma, and in this article I am going to show how you can perform a multipart upload to an S3 bucket using Python and Boto3.

In this era of cloud technology, we all work with huge data sets on a daily basis, and part of the job is to transfer that data with low latency :). Amazon Simple Storage Service (S3) can store objects up to 5 TB in size, yet a single PUT operation can upload objects of at most 5 GB. Amazon recommends that for objects larger than 100 MB, customers should consider using multipart upload.
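For context, a multipart upload at the API level is a three-step flow: initiate the upload, upload the individual parts, and then complete it. Boto3's transfer manager (which we use below) automates all of this, but a minimal sketch of the raw calls, with a hypothetical bucket and file name, looks like this:

import boto3

# A minimal sketch of the raw multipart flow that the transfer manager
# automates; 'my-bucket' and 'big-file.bin' are hypothetical names.
client = boto3.client('s3')
resp = client.create_multipart_upload(Bucket='my-bucket', Key='big-file.bin')
upload_id = resp['UploadId']

parts = []
part_number = 1
with open('big-file.bin', 'rb') as f:
    while True:
        chunk = f.read(25 * 1024 * 1024)  # parts must be at least 5 MB, except the last
        if not chunk:
            break
        part = client.upload_part(Bucket='my-bucket', Key='big-file.bin',
                                  PartNumber=part_number, UploadId=upload_id,
                                  Body=chunk)
        parts.append({'PartNumber': part_number, 'ETag': part['ETag']})
        part_number += 1

client.complete_multipart_upload(Bucket='my-bucket', Key='big-file.bin',
                                 UploadId=upload_id,
                                 MultipartUpload={'Parts': parts})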

Multipart upload is available through the AWS SDKs, the AWS CLI, and the S3 REST API; we will be using the Python SDK (Boto3) for this guide. Before we start, make sure your environment is ready to work with Python and Boto3: install a supported version of Python and the boto3 package.
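If your environment is not ready yet, the setup might look something like this (assuming pip is available and you use the AWS CLI to configure credentials; you can also put them in ~/.aws/credentials directly):

[root@localhost items]# pip3 install boto3
[root@localhost items]# aws configure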

I have created a program that we can use as a Linux command to upload data from an on-premises machine to S3.

The command supports a help option:

-h: prints the help text for the command, describing its usage.

[root@localhost items]# s3mulp  -h


This program is used to upload files to AWS S3 using multipart upload.


Usage: s3mulp [-s LOCAL_DIRECTORY] [ -b BUCKET_NAME] [-d DEST_DIRECTORY] [-ext FILES_EXTENSIONS]
where:
    LOCAL_DIRECTORY is the path to the local directory
    which contains the local files to be transferred.


    BUCKET_NAME is the name of the destination S3 bucket.


    DEST_DIRECTORY (optional) is the path inside the destination
    bucket that the files need to be transferred to. Note that
    it should not start with '/'. If it is not specified,
    files will be uploaded to the main bucket directory.


    FILES_EXTENSIONS (optional) is a comma-separated list of
    extensions of the files in LOCAL_DIRECTORY that need to be
    transferred. If FILES_EXTENSIONS is not specified, all files
    in the directory are uploaded (except files whose names
    start with '.').


This program uploads files only; folders are ignored.


Enclose all arguments with quotation marks, as shown
in the example below.


Example:
s3mulp -s "/Users/abc/xyz/" -b "bucket_3" -d "2018/Nov/" -ext "png,csv"
This will upload all png and csv files in the local directory 'xyz'
to the directory '2018/Nov/' inside bucket_3.


Now, let's start with the Python script.

First, we import the libraries used in this code:

import boto3
from boto3.s3.transfer import TransferConfig
import os
import threading
import sys

boto3 is the AWS SDK for Python, used to connect to AWS from Python code.

TransferConfig is used to set the multipart transfer configuration, including multipart_threshold, multipart_chunksize, max_concurrency, and whether to use threads.

os, threading, and sys are standard-library modules used here for path handling, the progress lock, and the command-line arguments.

Now we wrap everything in a function, which keeps the code easy to manage. It starts by configuring the multipart transfer parameters:

def multipart_upload_boto3():

    # multipart_threshold : multipart uploads/downloads only happen if the
    # size of a transfer is larger than 25 MB
    # multipart_chunksize : each part is 25 MB
    # (TransferConfig sizes are given in bytes)
    config = TransferConfig(multipart_threshold=25 * 1024 * 1024,
                            max_concurrency=10,
                            multipart_chunksize=25 * 1024 * 1024,
                            use_threads=True)
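If you would rather follow Amazon's 100 MB guideline from the introduction, the same configuration might look like this (an illustrative variant, not the settings this script uses):

    config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                            max_concurrency=10,
                            multipart_chunksize=100 * 1024 * 1024,
                            use_threads=True)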

This code reads the command-line arguments at runtime. Here I also include the help option, which prints the command usage (the module docstring shown earlier):

    cmd_args = sys.argv

    if len(cmd_args) == 2 and ('-h' in cmd_args or '--help' in cmd_args):
        print(__doc__)
        sys.exit()

Now we create the S3 resource so that we can connect to S3 through the Python SDK.

    # create a resource instance
    s3 = boto3.resource('s3')
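A side note: boto3.resource('s3') is the higher-level, object-oriented interface; the upload_file method we call on s3.Object below uses the same transfer manager as the lower-level client, so the TransferConfig above applies either way.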

Next we handle the command options described earlier:

-b: the name of the destination bucket.

-d: the destination directory inside the bucket.

-s: the source directory on the local machine.

-ext: upload only the files whose extensions match the given list.

    if len(cmd_args) > 3:

        if '-b' in cmd_args:
            b_index = cmd_args.index('-b')
            # the name of the destination bucket in AWS
            bucket = cmd_args[b_index + 1]
        else:
            print("ERROR: specify the [-b] option with a bucket name")
            sys.exit(1)

        if '-d' in cmd_args:
            d_index = cmd_args.index('-d')
            # the destination folder inside the destination bucket
            dest_directory = cmd_args[d_index + 1]
        else:
            dest_directory = ''

        if '-s' in cmd_args:
            l_index = cmd_args.index('-s')
            # the source folder on the local machine
            local_directory = cmd_args[l_index + 1]
        else:
            print("ERROR: specify the [-s] option with a local directory")
            sys.exit(1)

        if '-ext' in cmd_args:
            ext_index = cmd_args.index('-ext')
            # str.endswith accepts a tuple, so any listed extension matches
            extensions = tuple(cmd_args[ext_index + 1].split(','))
            files_list = [
                x for x in os.listdir(local_directory) if (
                    not x.startswith(".") and
                    os.path.isfile(os.path.join(local_directory, x))
                    and x.endswith(extensions))
            ]
        else:
            files_list = [
                x for x in os.listdir(local_directory) if (
                    not x.startswith(".") and
                    os.path.isfile(os.path.join(local_directory, x)))
            ]
    else:
        print(__doc__)
        sys.exit(1)
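As an aside, the same option handling could be written with Python's standard argparse module, which provides validation and a --help message for free. A minimal sketch (hypothetical, not part of this script):

import argparse

def parse_args():
    # hypothetical argparse equivalent of the manual option handling above
    parser = argparse.ArgumentParser(
        description="Upload files to AWS S3 using multipart upload.")
    parser.add_argument('-s', dest='local_directory', required=True,
                        help="local directory containing the files to upload")
    parser.add_argument('-b', dest='bucket', required=True,
                        help="name of the destination S3 bucket")
    parser.add_argument('-d', dest='dest_directory', default='',
                        help="destination path inside the bucket (no leading '/')")
    parser.add_argument('-ext', dest='extensions', default=None,
                        help="comma-separated list of file extensions to upload")
    return parser.parse_args()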

Finally, we loop over the gathered files, build each file's source path in the local directory and its destination path inside the bucket, and upload it with the parameters we configured.

    for f in files_list:
        # get the source file path
        src_path = os.path.join(local_directory, f)

        # specify the destination path inside the bucket
        dest_path = os.path.join(dest_directory, f)

        # upload the file
        s3.Object(bucket, dest_path).upload_file(
            src_path,
            ExtraArgs={'ContentType': 'text/plain'},
            Config=config,
            Callback=ProgressPercentage(src_path))
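One caveat: the call above hard-codes ContentType as 'text/plain', which is not right for binary files such as PNGs. If you want the correct Content-Type per file, one option is the standard mimetypes module; a small sketch of that variation (my suggestion, not part of the original script):

import mimetypes

# guess the Content-Type from the file name; fall back to a generic
# binary type when the extension is unknown
content_type = mimetypes.guess_type(src_path)[0] or 'application/octet-stream'
s3.Object(bucket, dest_path).upload_file(
    src_path,
    ExtraArgs={'ContentType': content_type},
    Config=config,
    Callback=ProgressPercentage(src_path))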

This class reports the progress percentage while files are uploading to S3; boto3 invokes it with the number of bytes transferred so far:

class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()


    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)" % (
                    self._filename, self._seen_so_far, self._size,
                    percentage))
            sys.stdout.flush()
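Because the transfer is configured with use_threads=True, several of boto3's worker threads may invoke this callback concurrently; that is why the running byte count is guarded by a lock.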

Finally, we call the function when the script is run directly:

if __name__ == '__main__':
    multipart_upload_boto3()

And that's it: the script uploads the files to S3, using multipart upload for anything above the configured threshold.
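To verify the result, you can list the destination prefix with the AWS CLI (using the bucket and prefix from the example above):

[root@localhost items]# aws s3 ls s3://bucket_3/2018/Nov/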

FINAL CODE:

#! /usr/bin/env python3


"""
This program is used to upload files to AWS S3 using multipart upload.


Usage: s3mulp [-s LOCAL_DIRECTORY] [ -b BUCKET_NAME] [-d DEST_DIRECTORY] [-ext FILES_EXTENSIONS]
where:
    LOCAL_DIRECTORY is the path to the local directory
    which contains the local files to be transferred.


    BUCKET_NAME is the name of the destination S3 bucket.


    DEST_DIRECTORY (optional) is the path inside the destination
    bucket that the files need to be transferred to. Note that
    it should not start with '/'. If it is not specified,
    files will be uploaded to the main bucket directory.


    FILES_EXTENSIONS (optional) is a comma-separated list of
    extensions of the files in LOCAL_DIRECTORY that need to be
    transferred. If FILES_EXTENSIONS is not specified, all files
    in the directory are uploaded (except files whose names
    start with '.').


This program uploads files only; folders are ignored.


Enclose all arguments with quotation marks, as shown
in the example below.


Example:
s3mulp -s "/Users/abc/xyz/" -b "bucket_3" -d "2018/Nov/" -ext "png,csv"
This will upload all png and csv files in the local directory 'xyz'
to the directory '2018/Nov/' inside bucket_3.
"""


import boto3
from boto3.s3.transfer import TransferConfig
import os
import threading
import sys


def multipart_upload_boto3():

    # multipart_threshold : multipart uploads/downloads only happen if the
    # size of a transfer is larger than 25 MB
    # multipart_chunksize : each part is 25 MB
    # (TransferConfig sizes are given in bytes)
    config = TransferConfig(multipart_threshold=25 * 1024 * 1024,
                            max_concurrency=10,
                            multipart_chunksize=25 * 1024 * 1024,
                            use_threads=True)


    cmd_args = sys.argv


    if len(cmd_args) == 2 and ('-h' in cmd_args or '--help' in cmd_args):
        print(__doc__)
        sys.exit()


    # create a resource instance
    s3 = boto3.resource('s3')


    if len(cmd_args) > 3:
        if '-b' in cmd_args:
            b_index = cmd_args.index('-b')
            # the name of the destination bucket in AWS
            bucket = cmd_args[b_index + 1]
        else:
            print("ERROR: specify the [-b] option with a bucket name")
            sys.exit(1)

        if '-d' in cmd_args:
            d_index = cmd_args.index('-d')
            # the destination folder inside the destination bucket
            dest_directory = cmd_args[d_index + 1]
        else:
            dest_directory = ''

        if '-s' in cmd_args:
            l_index = cmd_args.index('-s')
            # the source folder on the local machine
            local_directory = cmd_args[l_index + 1]
        else:
            print("ERROR: specify the [-s] option with a local directory")
            sys.exit(1)

        if '-ext' in cmd_args:
            ext_index = cmd_args.index('-ext')
            # str.endswith accepts a tuple, so any listed extension matches
            extensions = tuple(cmd_args[ext_index + 1].split(','))
            files_list = [
                x for x in os.listdir(local_directory) if (
                    not x.startswith(".") and
                    os.path.isfile(os.path.join(local_directory, x))
                    and x.endswith(extensions))
            ]
        else:
            files_list = [
                x for x in os.listdir(local_directory) if (
                    not x.startswith(".") and
                    os.path.isfile(os.path.join(local_directory, x)))
            ]
    else:
        print(__doc__)
        sys.exit(1)
    # loop through the desired source files
    for f in files_list:
        # get the source file path
        src_path = os.path.join(local_directory, f)

        # specify the destination path inside the bucket
        dest_path = os.path.join(dest_directory, f)

        # upload the file
        s3.Object(bucket, dest_path).upload_file(
            src_path,
            ExtraArgs={'ContentType': 'text/plain'},
            Config=config,
            Callback=ProgressPercentage(src_path))


class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)" % (
                    self._filename, self._seen_so_far, self._size,
                    percentage))
            sys.stdout.flush()



if __name__ == '__main__':
    multipart_upload_boto3()

OUTPUT (without the -ext option):

[Screenshots of the command output]

OUTPUT (with the -ext option):

This will only upload files with the given extensions.

[Screenshots of the command output]

Hope you all like it :)

Thanks, and feel free to ask any questions.


