Replicating data from RDS SQL Server to Redshift using AWS DMS in compliance with PCI DSS
Hi everyone. I recently designed and built an offline data analytics cloud architecture for my company on AWS. I chose a data warehouse for analytics for two reasons: 1) a data warehouse stores data from multiple sources centrally, in a uniform format; 2) a data warehouse keeps compute-heavy analysis requests away from the operational database, which prevents the database from becoming a system bottleneck.
My offline data analytics architecture includes several critical steps, shown in the following illustration.
1) Move selected data from multiple sources to the data warehouse
2) Centralize the data in a tabular repository for analytics
3) Analyze the data with an Amazon Redshift cluster and Python-based analytic apps
4) Visualize BI data with Amazon QuickSight
5) Model the data with Amazon SageMaker
Because my work domain is fintech, my primary goal is securing Payment Card Industry (PCI) and Personally Identifiable Information (PII) data in the whole system. So my concern is how to protect sensitive data throughout the data flows of the architecture. Today, I will introduce the first step in my architecture: replicating data from RDS for SQL Server to Redshift using AWS DMS in a secure way.
Overview of the data replication
AWS DMS is a cloud service for replicating data from relational databases to a data warehouse. The replication procedure is shown in figure 1. DMS provisions replication servers in a Multi-AZ high availability cluster, where we run the replication task. The DMS system has two endpoints: a source endpoint, which connects to the database and extracts structured data, and a target endpoint, which connects to Amazon Redshift and loads the data into the data warehouse. DMS detects source schema changes and loads only the newly generated tables into the target as the source data grows.
Secure the data replication procedure in the preparation stage
1. SSL configuration in RDS and Amazon Redshift per PCI requirement 4
Before we can protect data in transit with SSL, we need to enable the SSL option in RDS and Amazon Redshift, which we set in their parameter groups. Note that you need to create a new custom parameter group for this configuration, because you can’t modify the default parameter groups.
To configure SSL connections on RDS:
Go to the AWS console and navigate to RDS.
Go to Parameter Groups and find the one attached to your instance.
Set the rds.force_ssl parameter to 1 (true).
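If you prefer the CLI, here is a minimal sketch of the same change; my-sqlserver-params is a placeholder for your own custom parameter group name.

# rds.force_ssl is a static parameter, so it takes effect after the instance reboots.
aws rds modify-db-parameter-group --db-parameter-group-name my-sqlserver-params --parameters "ParameterName=rds.force_ssl,ParameterValue=1,ApplyMethod=pending-reboot"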
To configure SSL connections on a Redshift Cluster:
Go to the AWS console and navigate to Redshift.
Go to Parameter Groups under the Workload management tab and find the one attached to your cluster.
Set the require_ssl parameter to true.
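The equivalent CLI call is sketched below, again with a placeholder parameter group name; the cluster applies the change after a reboot.

aws redshift modify-cluster-parameter-group --parameter-group-name my-redshift-params --parameters ParameterName=require_ssl,ParameterValue=true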
2. Protect data at rest with a KMS encryption key per PCI requirement 3
The RDS database and Amazon Redshift both store customers’ data on disk. The DMS replication server may also spill data to its EBS volume when it processes large transactions. It’s necessary to encrypt the data at rest with an AWS managed KMS key. If you have a compliance requirement to keep full control over the keys, select a customer managed key (CMK) instead.
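As an illustration, this is how encryption at rest could be requested at cluster creation time with the CLI; the cluster identifier, credentials, and key alias are all placeholders, and RDS offers the analogous --storage-encrypted and --kms-key-id flags on create-db-instance.

# Provision the Redshift cluster with KMS encryption enabled from day one.
aws redshift create-cluster --cluster-identifier my-dw --node-type dc2.large --master-username awsuser --master-user-password 'REPLACE_ME' --encrypted --kms-key-id alias/your-cmk --no-publicly-accessible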
3. Use AWS IAM roles to explicitly allow/deny authorized service requests under the least privilege principle
Per PCI requirement 7, we recommend assigning only the specific IAM permissions and IAM roles that a data engineer/DBA needs to use AWS DMS, per the AWS guidance on IAM permissions for DMS. Moreover, we need to set up one additional service role if we choose Amazon Redshift as our target, because AWS DMS uses an Amazon S3 bucket to transfer data to the Amazon Redshift database. You need to authorize Amazon Redshift to access the S3 bucket on your behalf with the IAM role dms-access-for-endpoint. Set up the role as follows:
1) Create a JSON file dmsAssumeRolePolicyDocument3.json with the following IAM policy.
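The trust policy itself is short; the version below follows the AWS DMS documentation for this role and lets both DMS and Redshift assume it.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "dms.amazonaws.com",
          "redshift.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}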
2) Create the service role using the AWS CLI
aws iam create-role --role-name dms-access-for-endpoint --assume-role-policy-document file://your_folder_path/dmsAssumeRolePolicyDocument3.json --profile your_profile --region us-east-2
3) Attach the AmazonDMSRedshiftS3Role policy to the dms-access-for-endpoint role
aws iam attach-role-policy --role-name dms-access-for-endpoint --policy-arn arn:aws:iam::aws:policy/service-role/AmazonDMSRedshiftS3Role --profile your_profile --region us-east-2
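To double-check the setup, you can list the policies attached to the role; the profile and region placeholders match the commands above.

aws iam list-attached-role-policies --role-name dms-access-for-endpoint --profile your_profile --region us-east-2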
4. Execute the whole data replication procedure in private subnets
Per PCI DSS 1.2.1, we should restrict inbound and outbound traffic for the cardholder data environment (CDE), so we host our data stores inside private subnets without public internet access. Placing all the data stores in private subnets improves the security posture of our system.
We can uncheck the Publicly accessible option in the network configuration of the DMS replication instance to place it in the chosen private subnet.
To provision our Redshift cluster inside private subnets, we need to create a new cluster subnet group that contains only the private subnets, as the sketch below shows.
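A minimal sketch of that subnet group with the CLI; the group name and subnet IDs are placeholders for your own private subnets.

aws redshift create-cluster-subnet-group --cluster-subnet-group-name private-dw-subnets --description "Private subnets for the Redshift cluster" --subnet-ids subnet-0aaa1111 subnet-0bbb2222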
For RDS, we should choose a private subnet and uncheck the Public access option to limit exposure to attacks from the public internet.
5. Protect the database, DMS replication server, and data warehouse from unauthorized access with security groups (firewalls)
Per PCI DSS 2.2.2, we should allow only the necessary services, protocols, and ports into our CDE, enforced with security groups (firewalls). We use strict inbound and outbound rules in our security groups. The rules are the following.
For the RDS SQL Server:
Inbound: allow TCP 1433 only from the security group of the DMS replication instance (plus any application clients that need it).
Outbound: because security groups are stateful, responses to allowed inbound requests are sent back without outbound allow rules, so we didn’t set up any outbound rules.
For the DMS replication server:
Inbound: none needed, because the replication instance initiates all connections to the source and target itself.
Outbound: we left the default outbound rules in place so the instance can reach the SQL Server and Redshift endpoints.
For Redshift:
Inbound: allow TCP 5439 only from the security group of the DMS replication instance and from the analytics clients that query the cluster.
Outbound: because security groups are stateful, responses to allowed inbound requests are sent back without outbound allow rules, so we didn’t set up any outbound rules.
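As a sketch, the two inbound rules above can be created with the CLI as follows; all security group IDs are placeholders.

# Let only the DMS replication instance reach SQL Server (1433) and Redshift (5439).
aws ec2 authorize-security-group-ingress --group-id sg-0rds0example0001 --protocol tcp --port 1433 --source-group sg-0dms0example0001
aws ec2 authorize-security-group-ingress --group-id sg-0dw00example0001 --protocol tcp --port 5439 --source-group sg-0dms0example0001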
6. Store database authentication credentials in AWS Secrets Manager
Per PCI 8.2.1, AWS highly recommends storing the credentials of the database and Amazon Redshift as secrets in AWS Secrets Manager, encrypted with a KMS key. This reduces the risk of credential breaches through encryption and access controls; storing plaintext credentials in the DMS configuration is not a best practice.
To access your credentials from DMS, specify the secret in Secrets Manager and an IAM role that can read and decrypt the secret. The IAM role should follow the zero trust and least privilege principles.
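A hedged sketch of creating such a secret; the secret name, key alias, and connection values are placeholders, and the username/password/host/port keys are the ones DMS expects when it reads a secret.

aws secretsmanager create-secret --name prod/dms/sqlserver --kms-key-id alias/your-cmk --secret-string '{"username":"dms_user","password":"REPLACE_ME","host":"your-rds-endpoint.us-east-2.rds.amazonaws.com","port":1433}'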
7. Create a dedicated DMS user in SQL Server with limited privileges
Per PCI requirement 7, we recommend creating an independent SQL Server user and assigning only the necessary permissions to it, with a T-SQL script like the sketch below.
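This is a minimal sketch for a full-load-only task; the login name, database name, and password are placeholders, and ongoing replication (CDC) would need the extra permissions listed in the AWS DMS documentation.

USE master;
CREATE LOGIN dms_user WITH PASSWORD = 'REPLACE_WITH_STRONG_PASSWORD';
USE your_db;
CREATE USER dms_user FOR LOGIN dms_user;
-- Read-only access is enough for a full-load-only migration.
ALTER ROLE db_datareader ADD MEMBER dms_user;
GRANT VIEW DEFINITION TO dms_user;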
Creating the DMS replication instance
DMS uses an instance to initiate the connections between the RDS SQL Server and Amazon Redshift, transfer the migration data, and cache the processed data. We need to specify the instance name, instance type, DMS engine version, VPC, Multi-AZ high availability, network subnet group, encryption key, etc., to create the instance.
Note: I highly recommend choosing the Multi-AZ deployment if you replicate data for critical workloads. AWS DMS automatically provisions and maintains a synchronous standby replica of the replication instance in a different Availability Zone. This approach provides data redundancy, eliminates I/O freezes, and minimizes latency spikes.
To protect your data, please uncheck the Publicly accessible option.
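For reference, the same instance could be provisioned with one CLI call; every identifier below is a placeholder, and --no-publicly-accessible plus the private subnet group implement the network guidance from the preparation stage.

aws dms create-replication-instance --replication-instance-identifier dms-pci-instance --replication-instance-class dms.t3.medium --multi-az --no-publicly-accessible --replication-subnet-group-identifier private-dms-subnets --vpc-security-group-ids sg-0dms0example0001 --kms-key-id alias/your-cmk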
Create Source and Target Endpoints
A DMS endpoint provides connection, data store type, and location information about your data store. AWS Database Migration Service uses this information to connect to your data store and migrate the data to the target.
1. Our source will be the RDS SQL Server
We need to specify the following information in the configuration of the source endpoint.
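As a sketch, the source endpoint could also be created with the CLI; the server name, database, and credentials are placeholders, --ssl-mode require matches the SSL setup from the preparation stage, and DMS can alternatively read the credentials from the Secrets Manager secret created earlier.

aws dms create-endpoint --endpoint-identifier sqlserver-source --endpoint-type source --engine-name sqlserver --server-name your-rds-endpoint.us-east-2.rds.amazonaws.com --port 1433 --database-name your_db --username dms_user --password 'REPLACE_ME' --ssl-mode require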
2. Our target will be Amazon Redshift
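The matching target endpoint sketch; the cluster address, database, credentials, and account ID are placeholders, and the service access role is the dms-access-for-endpoint role from the preparation stage.

aws dms create-endpoint --endpoint-identifier redshift-target --endpoint-type target --engine-name redshift --server-name your-cluster.abc123.us-east-2.redshift.amazonaws.com --port 5439 --database-name dev --username awsuser --password 'REPLACE_ME' --ssl-mode require --service-access-role-arn arn:aws:iam::111122223333:role/dms-access-for-endpoint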
3. Test the configuration of your endpoints.
You can test your endpoint configuration and verify that it is valid in the endpoint Connections tab. The test runs for about a minute, and you should see a successful message in the Status column.
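The same test can be kicked off from the CLI; both ARNs below are placeholders.

aws dms test-connection --replication-instance-arn arn:aws:dms:us-east-2:111122223333:rep:EXAMPLE --endpoint-arn arn:aws:dms:us-east-2:111122223333:endpoint:EXAMPLE
# Poll the result until the status becomes "successful".
aws dms describe-connections --filters Name=endpoint-arn,Values=arn:aws:dms:us-east-2:111122223333:endpoint:EXAMPLE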
If you face errors, ensure that the security group of your RDS SQL Server allows inbound traffic on port 1433 and your Redshift cluster allows inbound traffic on port 5439, both from the security group of the DMS replication instance. Please also verify that the IAM role or the secret used by DMS is correct.
Creating the replication task
We are at the last step! The last step is creating the task that replicates the schema and data from the source SQL Server to the data warehouse.
1. Enter the following parameter values in the Create database migration task screen:
Because we are only running offline data analytics and have no replication latency requirement, we only need to choose the Migrate existing data migration type. If you need to analyze real-time data, please choose the option Migrate existing data and replicate ongoing changes (CDC).
2. In the Task settings panel, select Wizard and enter the following values (leave everything else as default):
The DMS logs in CloudWatch help you troubleshoot errors during DMS replication. I recommend keeping the log level at the default first, then switching to detailed debug when you face issues during replication. Increasing log verbosity can fill up the available disk space, so do so only when needed for debugging.
3. In the Table mappings panel, select Wizard mode, press the Add new selection rule button, and select the schema you want to migrate from the source SQL Server.
The selection rules help you transfer schemas and tables based on your analytic needs. You can set up a series of rules to include or exclude specific data in the replication to the target Redshift; the JSON equivalent is sketched below. You must specify at least one include rule in this setting.
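For reference, the JSON that Wizard mode generates for a single include rule looks like this; the schema name dbo is a placeholder, and % matches every table in it.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-dbo",
      "object-locator": {
        "schema-name": "dbo",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}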
4. In the Migration task startup configuration, select Automatically on create, then scroll to the bottom of the screen and click the Create task button to create the task and start the data replication.
5. Monitor the task until the status changes to Load complete (or Load complete, replication ongoing if you enabled CDC) and the progress reaches 100%.
Troubleshooting
When you hit errors during the data replication, check the Table statistics section of the database migration task and look at the Load state column to see which tables have errors. Then go to the log group dms-tasks-your_task_name in CloudWatch to see the detailed error.
Alternatively, you can also go to the Redshift query editor and run the following query to display the records of all Amazon Redshift load errors.
SELECT * FROM stl_load_errors;
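A narrower variant that returns only the most recent failures and the columns you typically need:

SELECT starttime, filename, line_number, colname, err_reason FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;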
Congratulations! You’re all set! In my future blogs, I will move to the second phase of my design and introduce how to create structured tables and a data catalog from S3 raw data with AWS Glue.