AWS S3 Multipart Uploading
Hello guys, I am Milind Verma, and in this article I am going to show how you can perform a multipart upload to an S3 bucket using Python and Boto3.
In this era of cloud technology, we all work with huge data sets on a daily basis. Part of our job description is to transfer data with low latency :). Amazon Simple Storage Service (S3) can store objects up to 5 TB in size, yet a single PUT operation can upload an object of at most 5 GB. Amazon recommends that, for objects larger than 100 MB, customers consider using the multipart upload capability.
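For context, here is a minimal sketch of what multipart upload does under the hood, using the low-level boto3 client. The bucket and key names are placeholders for illustration, not part of the program we build below:

import boto3

s3_client = boto3.client('s3')

# start a multipart upload (hypothetical bucket/key, for illustration only)
mpu = s3_client.create_multipart_upload(Bucket='my-bucket', Key='big-file.bin')

parts = []
with open('big-file.bin', 'rb') as f:
    part_number = 1
    while True:
        chunk = f.read(25 * 1024 * 1024)  # read the file in 25 MB parts
        if not chunk:
            break
        resp = s3_client.upload_part(
            Bucket='my-bucket', Key='big-file.bin',
            PartNumber=part_number, UploadId=mpu['UploadId'], Body=chunk)
        parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
        part_number += 1

# stitch the uploaded parts together into the final object
s3_client.complete_multipart_upload(
    Bucket='my-bucket', Key='big-file.bin', UploadId=mpu['UploadId'],
    MultipartUpload={'Parts': parts})

The managed transfer layer we use in this article (upload_file with a TransferConfig) handles all of this for us, including threading and retries.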
Multipart upload/download is available through the AWS SDKs, the AWS CLI, and the S3 REST API; this guide uses the Python SDK. Before we start, you need an environment ready to work with Python and Boto3: install a supported version of Python and the boto3 package.
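If you are unsure whether your setup is ready, a quick check like this (assuming your credentials are configured via aws configure or environment variables) confirms that boto3 can reach AWS:

import boto3

# print the installed SDK version and the identity of the configured credentials
print("boto3 version:", boto3.__version__)
print(boto3.client('sts').get_caller_identity()['Arn'])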
I have created a program that we can use as a Linux command to upload data from on-premises to S3.
The command includes a help option:
-h: prints usage information for the command.
[root@localhost items]# s3mulp -h
This program is used to upload files to AWS S3 using multipart upload.

Usage:
    s3mulp [-s LOCAL_DIRECTORY] [-b BUCKET_NAME] [-d DEST_DIRECTORY]
           [-ext FILES_EXTENSIONS]

where:
    LOCAL_DIRECTORY is the path to the local directory which contains
    the local files to be transferred.

    BUCKET_NAME is the name of the destination S3 bucket.

    DEST_DIRECTORY (optional) is the path inside the destination bucket
    that the files need to be transferred to. Note that it should not
    start with '/'. If it is not specified, files will be uploaded to
    the main bucket directory.

    FILES_EXTENSIONS (optional) is the extensions of the files in
    LOCAL_DIRECTORY that need to be transferred. Extensions should be
    separated by ',' only. If FILES_EXTENSIONS is not specified, all
    files in the directory are uploaded (except files whose names
    start with '.').

This program uploads files only; folders are ignored.
Enclose all arguments in quotation marks, as shown in the example below.

Example:
    s3mulp -s "/Users/abc/xyz/" -b "bucket_3" -d "2018/Nov/" -ext "png,csv"

This will upload all png and csv files in the local directory 'xyz'
to the directory '2018/Nov/' inside bucket_3.
Now, let's start with the Python script.
First, we import the libraries used in this code.
import boto3
from boto3.s3.transfer import TransferConfig
import os
import threading
import sys
boto3 is used for connecting to the AWS cloud from Python.
TransferConfig is used to set the multipart configuration, including multipart_threshold, multipart_chunksize, max_concurrency, and use_threads.
sys is used to read the command-line arguments passed to the script.
Now we wrap the logic in a function, which keeps the code easy to manage. The function begins by configuring the multipart settings.
def multipart_upload_boto3():
    # multipart_threshold: multipart transfers only happen if the size
    # of a transfer is larger than 25 MB
    # multipart_chunksize: each part is 25 MB
    # (TransferConfig sizes are in bytes, hence 25 * 1024 * 1024)
    config = TransferConfig(multipart_threshold=25 * 1024 * 1024,
                            max_concurrency=10,
                            multipart_chunksize=25 * 1024 * 1024,
                            use_threads=True)
This code reads the command-line parameters at runtime. Here I also handle the help option, which prints the command usage.
    cmd_args = sys.argv
    if len(cmd_args) == 2 and ('-h' in cmd_args or '--help' in cmd_args):
        print(__doc__)
        sys.exit()
Now we create the S3 resource so that we can connect to S3 using the Python SDK.
    # create a resource instance
    s3 = boto3.resource('s3')
Next, the function parses the options used in the command. These options are:
-b for the bucket name.
-d for the destination directory.
-s for the source directory.
-ext to upload only the files whose extensions match the given list.
    if len(cmd_args) > 3:
        if '-b' in cmd_args:
            b_index = cmd_args.index('-b')
            # the destination bucket name in AWS
            bucket = cmd_args[b_index + 1]
        else:
            print("ERROR: specify [-b] option with bucket name")
            sys.exit(1)  # the bucket name is required

        if '-d' in cmd_args:
            d_index = cmd_args.index('-d')
            # the destination folder in the destination bucket
            dest_directory = cmd_args[d_index + 1]
        else:
            dest_directory = ''

        if '-s' in cmd_args:
            l_index = cmd_args.index('-s')
            # the local source folder
            local_directory = cmd_args[l_index + 1]
        else:
            print("ERROR: specify [-s] with the local dir")
            sys.exit(1)  # the source directory is required
        if '-ext' in cmd_args:
            ext_index = cmd_args.index('-ext')
            extensions = tuple(cmd_args[ext_index + 1].split(','))
            files_list = [
                x for x in os.listdir(local_directory)
                if (not x.startswith(".")
                    and os.path.isfile(os.path.join(local_directory, x))
                    and x.endswith(extensions))
            ]
        else:
            files_list = [
                x for x in os.listdir(local_directory)
                if (not x.startswith(".")
                    and os.path.isfile(os.path.join(local_directory, x)))
            ]
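As a side note, roughly the same filtering can be written with pathlib. This is just an equivalent sketch, not part of the program, and the helper name list_upload_candidates is hypothetical:

from pathlib import Path

def list_upload_candidates(local_directory, extensions=None):
    # hypothetical helper: return plain files in local_directory,
    # skipping dotfiles; extensions is an optional tuple like ('png', 'csv')
    candidates = []
    for p in Path(local_directory).iterdir():
        if p.name.startswith('.') or not p.is_file():
            continue
        if extensions and p.suffix.lstrip('.') not in extensions:
            continue
        candidates.append(p.name)
    return candidates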
Finally, we loop over the gathered files, build the local source path and the destination path inside the bucket, and upload each file with all the parameters we configured.
        for f in files_list:
            # get the source file path
            src_path = os.path.join(local_directory, f)
            # build the destination path inside the bucket
            dest_path = os.path.join(dest_directory, f)
            # upload the file
            s3.Object(bucket, dest_path).upload_file(
                src_path,
                ExtraArgs={'ContentType': 'text/plain'},
                Config=config,
                Callback=ProgressPercentage(src_path)
            )
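One caveat: the script hardcodes ContentType as 'text/plain', which is wrong for png or csv files. A sketch of a more general approach for the upload call inside the loop, using the standard-library mimetypes module, would be:

import mimetypes

# guess the MIME type from the file name; fall back to a generic
# binary type when the name gives no hint
content_type = mimetypes.guess_type(src_path)[0] or 'application/octet-stream'

s3.Object(bucket, dest_path).upload_file(
    src_path,
    ExtraArgs={'ContentType': content_type},
    Config=config,
    Callback=ProgressPercentage(src_path)
)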
This class reports the progress percentage while a file is uploading to S3.
class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify, we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)" % (
                    self._filename, self._seen_so_far,
                    self._size, percentage))
            sys.stdout.flush()
Here we are calling the function.
if __name__ == '__main__':
    multipart_upload_boto3()
And that's it: this uploads the files in the folder to S3 using multipart upload.
FINAL CODE:
#!/usr/bin/env python3
"""
This program is used to upload files to AWS S3 using multipart upload.

Usage:
    s3mulp [-s LOCAL_DIRECTORY] [-b BUCKET_NAME] [-d DEST_DIRECTORY]
           [-ext FILES_EXTENSIONS]

where:
    LOCAL_DIRECTORY is the path to the local directory which contains
    the local files to be transferred.

    BUCKET_NAME is the name of the destination S3 bucket.

    DEST_DIRECTORY (optional) is the path inside the destination bucket
    that the files need to be transferred to. Note that it should not
    start with '/'. If it is not specified, files will be uploaded to
    the main bucket directory.

    FILES_EXTENSIONS (optional) is the extensions of the files in
    LOCAL_DIRECTORY that need to be transferred. Extensions should be
    separated by ',' only. If FILES_EXTENSIONS is not specified, all
    files in the directory are uploaded (except files whose names
    start with '.').

This program uploads files only; folders are ignored.
Enclose all arguments in quotation marks, as shown in the example below.

Example:
    s3mulp -s "/Users/abc/xyz/" -b "bucket_3" -d "2018/Nov/" -ext "png,csv"

This will upload all png and csv files in the local directory 'xyz'
to the directory '2018/Nov/' inside bucket_3.
"""

import boto3
from boto3.s3.transfer import TransferConfig
import os
import threading
import sys


class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify, we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)" % (
                    self._filename, self._seen_so_far,
                    self._size, percentage))
            sys.stdout.flush()


def multipart_upload_boto3():
    # multipart_threshold: multipart transfers only happen if the size
    # of a transfer is larger than 25 MB
    # multipart_chunksize: each part is 25 MB
    # (TransferConfig sizes are in bytes, hence 25 * 1024 * 1024)
    config = TransferConfig(multipart_threshold=25 * 1024 * 1024,
                            max_concurrency=10,
                            multipart_chunksize=25 * 1024 * 1024,
                            use_threads=True)

    cmd_args = sys.argv
    if len(cmd_args) == 2 and ('-h' in cmd_args or '--help' in cmd_args):
        print(__doc__)
        sys.exit()

    # create a resource instance
    s3 = boto3.resource('s3')

    if len(cmd_args) > 3:
        if '-b' in cmd_args:
            b_index = cmd_args.index('-b')
            # the destination bucket name in AWS
            bucket = cmd_args[b_index + 1]
        else:
            print("ERROR: specify [-b] option with bucket name")
            sys.exit(1)

        if '-d' in cmd_args:
            d_index = cmd_args.index('-d')
            # the destination folder in the destination bucket
            dest_directory = cmd_args[d_index + 1]
        else:
            dest_directory = ''

        if '-s' in cmd_args:
            l_index = cmd_args.index('-s')
            # the local source folder
            local_directory = cmd_args[l_index + 1]
        else:
            print("ERROR: specify [-s] with the local dir")
            sys.exit(1)

        if '-ext' in cmd_args:
            ext_index = cmd_args.index('-ext')
            extensions = tuple(cmd_args[ext_index + 1].split(','))
            files_list = [
                x for x in os.listdir(local_directory)
                if (not x.startswith(".")
                    and os.path.isfile(os.path.join(local_directory, x))
                    and x.endswith(extensions))
            ]
        else:
            files_list = [
                x for x in os.listdir(local_directory)
                if (not x.startswith(".")
                    and os.path.isfile(os.path.join(local_directory, x)))
            ]

        # loop through the desired source files
        for f in files_list:
            # get the source file path
            src_path = os.path.join(local_directory, f)
            # build the destination path inside the bucket
            dest_path = os.path.join(dest_directory, f)
            # upload the file
            s3.Object(bucket, dest_path).upload_file(
                src_path,
                ExtraArgs={'ContentType': 'text/plain'},
                Config=config,
                Callback=ProgressPercentage(src_path)
            )


if __name__ == '__main__':
    multipart_upload_boto3()
OUTPUT (without the -ext option): all files in the local directory (except those starting with '.') are uploaded.
OUTPUT (with the -ext option): only the files with the given extensions are uploaded.
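To double-check what actually landed in the bucket, you can list the objects under the destination prefix. A minimal sketch, with the bucket and prefix taken from the earlier example:

import boto3

s3 = boto3.resource('s3')

# list everything under the destination prefix we uploaded to
for obj in s3.Bucket('bucket_3').objects.filter(Prefix='2018/Nov/'):
    print(obj.key, obj.size)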
Hope you all like it :)
Thanks, and feel free to ask any questions.