Replicating data from RDS SQL Server to Redshift using AWS DMS in compliance with PCI DSS
Hi everyone. I recently designed and built an offline data analytics cloud architecture for my company on AWS. I chose a data warehouse for analytics for two reasons: 1) a data warehouse stores data from multiple sources centrally, in a uniform format; 2) a data warehouse keeps compute-heavy analysis requests away from the operational database, which prevents the database from becoming a system bottleneck.
My offline data analytics architecture includes several critical steps, shown in the following illustration.
1) Move selected data from multiple sources to the data warehouse
2) Centralize the data in a tabular repository for analytics
3) Analyze the data with an Amazon Redshift cluster and Python-based analytic apps
4) Visualize BI data with Amazon QuickSight
5) Model the data with Amazon SageMaker
Because my work domain is fintech, my primary goal is securing Payment Card Industry (PCI) and Personally Identifiable Information (PII) data in the whole system. So my concern is how to protect sensitive data throughout the data flows of the architecture. Today, I will introduce the first step in my architecture: replicating data from RDS for SQL Server to Redshift using AWS DMS in a secure way.
Overview of the data replication
AWS DMS is a cloud service for replicating data from relational databases to a data warehouse. The replication procedure is shown in figure 1. DMS provisions replication servers in a Multi-AZ high availability cluster, where we run the replication task. The DMS system has two endpoints: a source endpoint, which connects to the database and extracts structured data, and a target endpoint, which connects to Amazon Redshift and loads the data into the data warehouse. DMS detects source schema changes and loads only the newly generated tables into the target as the source data grows.
Secure the data replication procedure in the preparation stage
1. SSL configuration in RDS and Amazon Redshift per PCI requirement 4
Before we can protect data in transit with SSL, we need to enable the SSL option in RDS and Amazon Redshift, which we set in their parameter groups. Note that you need to create a new custom parameter group for this configuration, because you can’t modify the default parameter groups.
To configure SSL connections on RDS:
Go to the AWS console and navigate to RDS.
Go to Parameter Groups and find the one attached to your instance.
Set the rds.force_ssl parameter to 1 (true).
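If you prefer the CLI, here is a minimal sketch of the same change; my-sqlserver-params is a placeholder for your own custom parameter group name.

# rds.force_ssl is a static parameter, so it takes effect after the instance reboots.
aws rds modify-db-parameter-group --db-parameter-group-name my-sqlserver-params --parameters "ParameterName=rds.force_ssl,ParameterValue=1,ApplyMethod=pending-reboot"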
To configure SSL connections on a Redshift Cluster:
Go to the AWS console and navigate to Redshift.
Go to Parameter Groups under the Workload management tab and find the one attached to your cluster.
Set the require_ssl parameter to true.
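The equivalent CLI call is sketched below, again with a placeholder parameter group name; the cluster applies the change after a reboot.

aws redshift modify-cluster-parameter-group --parameter-group-name my-redshift-params --parameters ParameterName=require_ssl,ParameterValue=true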
2. Protect data at rest with a KMS encryption key per PCI requirement 3
The RDS database and Amazon Redshift both store customers’ data on disk. The DMS replication server may also spill data to its EBS volume when it processes large transactions. It’s necessary to encrypt the data at rest with an AWS managed KMS key. If you have a compliance requirement to keep full control over the keys, select a customer managed key (CMK) instead.
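As an illustration, this is how encryption at rest could be requested at cluster creation time with the CLI; the cluster identifier, credentials, and key alias are all placeholders, and RDS offers the analogous --storage-encrypted and --kms-key-id flags on create-db-instance.

# Provision the Redshift cluster with KMS encryption enabled from day one.
aws redshift create-cluster --cluster-identifier my-dw --node-type dc2.large --master-username awsuser --master-user-password 'REPLACE_ME' --encrypted --kms-key-id alias/your-cmk --no-publicly-accessible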
3. Use AWS IAM roles to explicitly allow/deny authorized service requests under the least privilege principle
Per PCI requirement 7, we recommend assigning only the specific IAM permissions and IAM roles that a data engineer/DBA needs to use AWS DMS, per the AWS guidance on IAM permissions for DMS. Moreover, we need to set up one additional service role if we choose Amazon Redshift as our target, because AWS DMS uses an Amazon S3 bucket to transfer data to the Amazon Redshift database. You need to authorize Amazon Redshift to access the S3 bucket on your behalf with the IAM role dms-access-for-endpoint. Set up the role as follows:
1) Create a JSON file dmsAssumeRolePolicyDocument3.json with the following IAM policy.
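The trust policy itself is short; the version below follows the AWS DMS documentation for this role and lets both DMS and Redshift assume it.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "dms.amazonaws.com",
          "redshift.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}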
2) Create the service role using the AWS CLI
aws iam create-role --role-name dms-access-for-endpoint --assume-role-policy-document file://your_folder_path/dmsAssumeRolePolicyDocument3.json --profile your_profile --region us-east-2
3) Attach the AmazonDMSRedshiftS3Role policy to the dms-access-for-endpoint role
aws iam attach-role-policy --role-name dms-access-for-endpoint --policy-arn arn:aws:iam::aws:policy/service-role/AmazonDMSRedshiftS3Role --profile your_profile --region us-east-2
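To double-check the setup, you can list the policies attached to the role; the profile and region placeholders match the commands above.

aws iam list-attached-role-policies --role-name dms-access-for-endpoint --profile your_profile --region us-east-2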
4. Execute the whole data replication procedure in private subnets
Per PCI DSS 1.2.1, we should restrict inbound and outbound traffic for the cardholder data environment (CDE), so we host our data stores inside private subnets without public internet access. Placing all the data stores in private subnets improves the security posture of our system.
We can uncheck the Publicly accessible option in the network configuration of the DMS replication instance to place it in the chosen private subnet.
To provision our Redshift cluster inside private subnets, we need to create a new cluster subnet group that contains only the private subnets, as the sketch below shows.
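A minimal sketch of that subnet group with the CLI; the group name and subnet IDs are placeholders for your own private subnets.

aws redshift create-cluster-subnet-group --cluster-subnet-group-name private-dw-subnets --description "Private subnets for the Redshift cluster" --subnet-ids subnet-0aaa1111 subnet-0bbb2222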
For RDS, we should choose a private subnet and uncheck the Public access option to limit exposure to attacks from the public internet.
5. Protect the database, DMS replication server, and data warehouse from unauthorized access with security groups (firewalls)
Per PCI DSS 2.2.2, we should allow only the necessary services, protocols, and ports into our CDE, enforced with security groups (firewalls). We use strict inbound and outbound rules in our security groups. The rules are the following.
For the RDS SQL Server:
Inbound: allow TCP 1433 only from the security group of the DMS replication instance (plus any application clients that need it).
Outbound: because security groups are stateful, responses to allowed inbound requests are sent back without outbound allow rules, so we didn’t set up any outbound rules.
For the DMS replication server:
Inbound: none needed, because the replication instance initiates all connections to the source and target itself.
Outbound: we left the default outbound rules in place so the instance can reach the SQL Server and Redshift endpoints.
For Redshift:
Inbound: allow TCP 5439 only from the security group of the DMS replication instance and from the analytics clients that query the cluster.
Outbound: because security groups are stateful, responses to allowed inbound requests are sent back without outbound allow rules, so we didn’t set up any outbound rules.
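As a sketch, the two inbound rules above can be created with the CLI as follows; all security group IDs are placeholders.

# Let only the DMS replication instance reach SQL Server (1433) and Redshift (5439).
aws ec2 authorize-security-group-ingress --group-id sg-0rds0example0001 --protocol tcp --port 1433 --source-group sg-0dms0example0001
aws ec2 authorize-security-group-ingress --group-id sg-0dw00example0001 --protocol tcp --port 5439 --source-group sg-0dms0example0001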
6. Store database authentication credentials in AWS Secrets Manager
Per PCI 8.2.1, AWS highly recommends storing the credentials of the database and Amazon Redshift as secrets in AWS Secrets Manager, encrypted with a KMS key. This reduces the risk of credential breaches through encryption and access controls; storing plaintext credentials in the DMS configuration is not a best practice.
To access your credentials from DMS, specify the secret in Secrets Manager and an IAM role that can read and decrypt the secret. The IAM role should follow the zero trust and least privilege principles.
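A hedged sketch of creating such a secret; the secret name, key alias, and connection values are placeholders, and the username/password/host/port keys are the ones DMS expects when it reads a secret.

aws secretsmanager create-secret --name prod/dms/sqlserver --kms-key-id alias/your-cmk --secret-string '{"username":"dms_user","password":"REPLACE_ME","host":"your-rds-endpoint.us-east-2.rds.amazonaws.com","port":1433}'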
7. Create a dedicated DMS user in SQL Server with limited privileges
Per PCI requirement 7, we recommend creating an independent SQL Server user and assigning only the necessary permissions to it, with a T-SQL script like the sketch below.
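This is a minimal sketch for a full-load-only task; the login name, database name, and password are placeholders, and ongoing replication (CDC) would need the extra permissions listed in the AWS DMS documentation.

USE master;
CREATE LOGIN dms_user WITH PASSWORD = 'REPLACE_WITH_STRONG_PASSWORD';
USE your_db;
CREATE USER dms_user FOR LOGIN dms_user;
-- Read-only access is enough for a full-load-only migration.
ALTER ROLE db_datareader ADD MEMBER dms_user;
GRANT VIEW DEFINITION TO dms_user;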
Creating the DMS replication instance
DMS uses an instance to initiate the connections between the RDS SQL Server and Amazon Redshift, transfer the migration data, and cache the processed data. We need to specify the instance name, instance type, DMS engine version, VPC, Multi-AZ high availability, network subnet group, encryption key, etc., to create the instance.
Note: I highly recommend choosing the Multi-AZ deployment if you replicate data for critical workloads. AWS DMS automatically provisions and maintains a synchronous standby replica of the replication instance in a different Availability Zone. This approach provides data redundancy, eliminates I/O freezes, and minimizes latency spikes.
To protect your data, please uncheck the Publicly accessible option.
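For reference, the same instance could be provisioned with one CLI call; every identifier below is a placeholder, and --no-publicly-accessible plus the private subnet group implement the network guidance from the preparation stage.

aws dms create-replication-instance --replication-instance-identifier dms-pci-instance --replication-instance-class dms.t3.medium --multi-az --no-publicly-accessible --replication-subnet-group-identifier private-dms-subnets --vpc-security-group-ids sg-0dms0example0001 --kms-key-id alias/your-cmk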
Create Source and Target Endpoints
A DMS endpoint provides connection, data store type, and location information about your data store. AWS Database Migration Service uses this information to connect to your data store and migrate the data to the target.
1. Our source will be the RDS SQL Server
We need to specify the following information in the configuration of the source endpoint.
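As a sketch, the source endpoint could also be created with the CLI; the server name, database, and credentials are placeholders, --ssl-mode require matches the SSL setup from the preparation stage, and DMS can alternatively read the credentials from the Secrets Manager secret created earlier.

aws dms create-endpoint --endpoint-identifier sqlserver-source --endpoint-type source --engine-name sqlserver --server-name your-rds-endpoint.us-east-2.rds.amazonaws.com --port 1433 --database-name your_db --username dms_user --password 'REPLACE_ME' --ssl-mode require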
2. Our target will be Amazon Redshift
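The matching target endpoint sketch; the cluster address, database, credentials, and account ID are placeholders, and the service access role is the dms-access-for-endpoint role from the preparation stage.

aws dms create-endpoint --endpoint-identifier redshift-target --endpoint-type target --engine-name redshift --server-name your-cluster.abc123.us-east-2.redshift.amazonaws.com --port 5439 --database-name dev --username awsuser --password 'REPLACE_ME' --ssl-mode require --service-access-role-arn arn:aws:iam::111122223333:role/dms-access-for-endpoint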
3. Test the configuration of your endpoints.
You can test your endpoint configuration and verify that it is valid in the endpoint Connections tab. The test runs for about a minute, and you should see a successful message in the Status column.
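The same test can be kicked off from the CLI; both ARNs below are placeholders.

aws dms test-connection --replication-instance-arn arn:aws:dms:us-east-2:111122223333:rep:EXAMPLE --endpoint-arn arn:aws:dms:us-east-2:111122223333:endpoint:EXAMPLE
# Poll the result until the status becomes "successful".
aws dms describe-connections --filters Name=endpoint-arn,Values=arn:aws:dms:us-east-2:111122223333:endpoint:EXAMPLE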
If you face errors, ensure that the security group of your RDS SQL Server allows inbound traffic on port 1433 and your Redshift cluster allows inbound traffic on port 5439, both from the security group of the DMS replication instance. Please also verify that the IAM role or the secret used by DMS is correct.
Creating the replication task
We are at the last step! The last step is creating the task that replicates the schema and data from the source SQL Server to the data warehouse.
1. Enter the following parameter values in the Create database migration task screen:
Because we are only running offline data analytics and have no replication latency requirement, we only need to choose the Migrate existing data migration type. If you need to analyze real-time data, please choose the option Migrate existing data and replicate ongoing changes (CDC).
2. In the Task settings panel, select Wizard and enter the following values (leave everything else as default):
The DMS logs in CloudWatch help you troubleshoot errors during DMS replication. I recommend keeping the log level at the default first, then switching to detailed debug when you face issues during replication. Increasing log verbosity can fill up the available disk space, so do so only when needed for debugging.
3. In the Table mappings panel, select Wizard mode, press the Add new selection rule button, and select the schema you want to migrate from the source SQL Server.
The selection rules help you transfer schemas and tables based on your analytic needs. You can set up a series of rules to include or exclude specific data in the replication to the target Redshift; the JSON equivalent is sketched below. You must specify at least one include rule in this setting.
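For reference, the JSON that Wizard mode generates for a single include rule looks like this; the schema name dbo is a placeholder, and % matches every table in it.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-dbo",
      "object-locator": {
        "schema-name": "dbo",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}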
4. In the Migration task startup configuration, select Automatically on create, then scroll to the bottom of the screen and click the Create task button to create the task and start the data replication.
5. Monitor the task until the status changes to Load complete (or Load complete, replication ongoing if you enabled CDC) and the progress reaches 100%.
Troubleshooting
When you hit errors during the data replication, check the Table statistics section of the database migration task and look at the Load state column to see which tables have errors. Then go to the log group dms-tasks-your_task_name in CloudWatch to see the detailed error.
Alternatively, you can also go to the Redshift query editor and run the following query to display the records of all Amazon Redshift load errors.
SELECT * FROM stl_load_errors;
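A narrower variant that returns only the most recent failures and the columns you typically need:

SELECT starttime, filename, line_number, colname, err_reason FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;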
Congratulations! You’re all set! In my future blogs, I will move to the second phase of my design and introduce how to create structured tables and a data catalog from S3 raw data with AWS Glue.