How to transfer data with AWS DataSync the easy way

AWS DataSync is a great, secure AWS service for moving data between on-premises storage and AWS storage services. With DataSync, you can copy data between Network File System (NFS) or Server Message Block (SMB) file servers, Amazon Simple Storage Service (S3), Amazon Elastic File System (EFS), and Amazon FSx for Windows File Server.

To get a basic understanding of DataSync, please follow this AWS documentation.

Recently, I was working on migrating gigabytes of data from on-premises to the cloud and on transferring a specific set of data from one S3 bucket to an S3 bucket in an external account. I evaluated the available AWS data transfer services in terms of time and cost, and I decided to use DataSync for this requirement. I want to highlight some pros and cons of DataSync based on my experience.

Pros:

  • DataSync automatically scales and handles moving files and objects, scheduling data transfers, monitoring the progress of transfers, encrypting data, and verifying data transfers.
  • With DataSync you pay only for the amount of data copied, with no minimum commitments or upfront fees.
  • DataSync includes encryption and integrity validation to help make sure your data arrives securely, intact, and ready to use.
  • You can schedule your tasks using the AWS DataSync console or AWS Command Line Interface (CLI), without needing to write and run scripts to manage repeated transfers. Task scheduling automatically runs tasks on the schedule you configure.
  • If a DataSync task is interrupted (for instance, if the network connection goes down or the DataSync agent is restarted), the next run of the task transfers the missing files, and the data will be complete and consistent at the end of that run.
  • You can specify exclude filters, include filters, or both to determine which files, folders, or objects get transferred each time your task runs.
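
For example, when starting a task run with the AWS SDK, filters can be passed as simple patterns. Below is a minimal boto3 sketch; the task ARN and the include/exclude patterns are placeholders for illustration only.

import boto3

datasync = boto3.client("datasync")

# Placeholder task ARN; the filter patterns below are illustrative.
response = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:<REGION>:<ACCNT ID>:task/<TASK ID>",
    # Copy only these two top-level folders...
    Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": "/reports|/exports"}],
    # ...and skip temporary files anywhere in the tree.
    Excludes=[{"FilterType": "SIMPLE_PATTERN", "Value": "*.tmp"}],
)
print(response["TaskExecutionArn"])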

Cons:

  • Fewer how-to articles and a lack of open-source automation solutions.
  • Not a cost- or time-effective option for transferring petabytes of data.
  • If you are using the DataSync scheduler to copy data at regular intervals, there is no provision to pause that schedule other than canceling the task.
  • For scheduler frequency, the minimum interval is 1 hour. If you have a requirement to sync your data more often than every hour, you need to develop your own custom solution (which I ended up doing and will discuss in my upcoming blog).

DataSync can be used for data transfer in two scenarios: on-premises to Cloud and Cloud to Cloud. Both scenarios are discussed below, along with the code to automate some of the steps.

Scenario 1: On-premises to Cloud Data Migration


We used AWS DataSync to transfer on-premises data to Amazon S3. After analyzing the available services, we decided to go with DataSync because, along with transferring the data, it manages data integrity via checksum verification and provides include and exclude patterns for transferring only the specific data we need.

Steps to set up the DataSync agent on the on-premises side

To transfer data between on-premises storage systems and AWS storage services, deploy a DataSync agent and associate it with your AWS account via the Management Console or API. The agent is used to access your NFS server or SMB file share to read data from or write data to. Please follow the steps mentioned in the AWS blog for agent setup on the on-premises side.

Steps to follow on the destination account

1. Go to the DataSync service in the AWS Management Console in the destination account and select "Create Agent".

2. Create a destination S3 bucket in the destination account/region with default attributes.

3. Create an SNS topic "datasync-notify" to notify users in case of any DataSync transfer failure.

4. Create a DataSync role [s3_data_sync_access] in IAM and allow DataSync to read from and write to your Amazon S3 bucket. The role's trust policy must allow the DataSync service (datasync.amazonaws.com) to assume it. The following example policy grants DataSync the minimum permissions to read and write data to your S3 bucket (replace <YourS3BucketArn> with the ARN of your bucket); a scripted version of this step follows the policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Effect": "Allow",
            "Resource": "<YourS3BucketArn>"
        },
        {
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:GetObjectTagging",
                "s3:PutObjectTagging",
                "s3:PutObject"
            ],
            "Effect": "Allow",
            "Resource": "<YourS3BucketArn>/*"
        }
    ]
}
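
If you prefer to script this step, here is a minimal boto3 sketch that creates the role with the required DataSync trust relationship and attaches the policy above as an inline policy. The file name datasync-s3-access.json is a hypothetical local copy of that policy with <YourS3BucketArn> already substituted.

import json
import boto3

iam = boto3.client("iam")

# Trust policy so the DataSync service can assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "datasync.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="s3_data_sync_access",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Allows DataSync to read and write the destination S3 bucket",
)

# Attach the bucket-access policy shown above as an inline policy.
with open("datasync-s3-access.json") as f:
    iam.put_role_policy(
        RoleName="s3_data_sync_access",
        PolicyName="s3-data-sync-access-policy",
        PolicyDocument=f.read(),
    )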

5. Set up a DataSync source location (NFS server) on the destination account.

6. Set up a DataSync destination location (S3) on the destination account.

7. Create a DataSync task to initiate the data transfer, with the specified parameters for source location, destination location, settings, and task logging.

8. Start execution of the DataSync task.

You can also use the Lambda code in my Git repo to automate steps 5 through 8; a sketch of such a handler follows the input and output examples below.

Lambda Input:


{
    "sourceLocation": "NAME OF SOURCE DIRECTORY",
    "destinationLocation": "NAME OF DESTINATION BUCKET",
    "AgnetARN": "arn:aws:datasync:<REGION>:<ACCNT ID>:agent/<AGENT ID>",
    "NFSServer": "ServerHostname"
}

Note: NFSServer is the IP address or Domain Name System (DNS) name of the NFS server. The agent installed on-premises uses this hostname to mount the NFS server over the network.

Lambda Output:

{
    "status": "TRANSFERRING",
    "taskid": "arn:aws:datasync:region:account-id:task/task-id"
}


Scenario 2: S3 to S3 Cross Account Data Migration


To transfer specific S3 data to an S3 bucket in an external account, we can again rely on DataSync to move the data across AWS accounts; I found it works better than S3 copy or S3 sync for cross-account copies.

You need to follow certain steps on both the source and the destination account in order to start a seamless data transfer.

Steps to set up DataSync on the destination account

1. On the destination AWS account, create the S3 bucket where the data should be copied.

2. Make sure the source and destination buckets are in the same AWS Region.

3. Apply the following S3 bucket policy to the destination bucket (replace <sourcebucketaccount> and <destinationbucket> with appropriate values).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BucketPolicyForDataSync",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<sourcebucketaccount>:role/datasync-role-source",
                    "arn:aws:iam::<sourcebucketaccount>:root"
                ]
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:PutObject",
                "s3:GetObjectTagging",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::<destinationbucket>",
                "arn:aws:s3:::<destinationbucket>/*"
            ]
        }
    ]
}

Steps to set up DataSync on the source account

  1. Create the "datasync-role-source" IAM role in the source S3 bucket's AWS account.
  2. Attach these two AWS managed policies to the datasync-role-source role:

- AWSDataSyncFullAccess

- AWSDataSyncReadOnlyAccess

  3. Create a custom DataSync policy for the specific source and destination buckets (replace <sourcebucket> and <destinationbucket> with appropriate values).

"Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads"
                ],
                "Effect": "Allow",
                "Resource": 
                      "arn:aws:s3:::<sourcebucket>",
                      "arn:aws:s3:::<destinationbucket>"
              },
              {
                  "Action": [
                      "s3:AbortMultipartUpload",
                      "s3:DeleteObject",
                      "s3:GetObject",
                      "s3:ListMultipartUploadParts",
                      "s3:PutObjectTagging",
                      "s3:GetObjectTagging",
                      "s3:PutObject"
                  ],
                  "Effect": "Allow",
                  "Resource": 
                        "arn:aws:s3:::<sourcebucket>/*",
                        "arn:aws:s3:::<destinationbucket>/*"
              }
          ]
      }        

  4. Attach the custom policy above to the datasync-role-source role.
  5. Set up a DataSync destination location (S3) in the source account, pointing at the destination bucket.
  6. Set up a DataSync source location (S3) in the source account.
  7. Create a DataSync task to initiate the data transfer, with the specified parameters for source location, destination location, settings, and task logging.
  8. Start execution of the DataSync task.

You can also use the Lambda code in my Git repo to automate steps 3 through 8; a sketch of such a handler follows the input and output examples below.

Lambda Input:

{
    "SourceBucketName": "NAME OF SOURCE BUCKET",
    "external_bucket": "NAME OF DESTINATION BUCKET"
}

Lambda Output:

{
    "status": "TRANSFERRING",
    "taskid": "arn:aws:datasync:region:account-id:task/task-id"
}

Very soon, I will publish a CloudFormation template to automate all the above steps on the source account.
