Moving Mission-Critical SQL Server Workloads Effectively to Cloud with AWS DMS
Guillermo Wrba
Author of "Designing and Building Solid Microservice Ecosystems", independent consultant and solutions architect, evangelist of new technologies, distributed computing, and microservices.
Quite recently, I was involved in driving a big data-migration project, following a lift-and-shift approach, to move one of the big US-nationwide student meals management systems from the Rackspace cloud into the AWS Cloud - a system that had been aging for more than 15 years, built on a two-tier, siloed, very tightly coupled architecture.
Moving this kind of mission-critical production workload involved more than 20 TB, more than 1,000 SQL Server databases - at multiple maintenance levels - and thousands of tables, most of them with millions of records. That is something that certainly cannot be done without some kind of data migration framework; a manual migration could take months - or even years.
I chose AWS DMS as the best-matching migration framework for this engagement, not only because AWS stands behind the product, but also because the customer was already engaged with AWS, which greatly simplified the adoption of the framework.
I'm not going to explain here the internal architecture of AWS DMS or how the AWS MAP methodological framework works, since that would require a separate article; I want to focus on the migration strategy and what the migration architecture looked like in order to succeed with such a big workload migration - one that also had to take into consideration data privacy, migration performance, and automation, all following a time-sensitive approach (a two-month target).
We can classify the overarching data migration approach into three different tactics: one regarding host management, one regarding communication-link management between the two sites, and one regarding data migration itself, as shown below:
In our case, we decided to go with a dedicated-host approach, since the customer also wanted to migrate their existing SQL Servers 1:1 to EC2 SQL Server instances, following a BYOL approach.
In addition, we decided to go forward with the AWS DMS/SCT framework, which satisfies all of the existing requirements around migration automation, sensitive data management, and - most important - the given timeline.
Our cloud connectivity strategy of choice was an S2S VPN connection, which offered us up to 1.2 Gbps without sacrificing running costs. Direct Connect was proposed but discarded due to setup complexity and increased running costs.
The Big Data Migration
AWS DMS helped us quickly set up the replication instances, endpoints, and replication tasks by means of scripting - which I had to write myself in order to ensure that more than 1,000 tasks and more than 2,000 endpoints were appropriately deployed. Rather than a big-bang deployment, I chose to go with an incremental migration - in batches of 20-30 databases - which offered us better control over the whole migration. Issues can appear from nowhere, and it's better to have strict control over migration task progress and any possible error that requires attention.
In order to deal with the migration, a total of twenty (20) DMS replication instances ( dms.c5.2xlarge ) were deployed across two VPCs. This number allowed us to deal with the 1,000 databases to be migrated, with 50 replication tasks per replication instance, which matched the AWS limitations:
- No more than 100 endpoints per replication instance -> which represents a hard limit of 50 tasks per replication instance ( 50 tasks x 2 endpoints = 100 )
- No more than 1,000 endpoints per account -> with 100 endpoints per replication instance, that gives a maximum of 1000/100 = 10 replication instances per VPC/account
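The sizing arithmetic above can be sketched as follows. This is a minimal sketch: the quota values are the defaults discussed in this section and may differ for your account.

```python
# Fleet-sizing arithmetic under the default DMS quotas discussed above
# (the quota values are assumptions; check your own account's limits).
ENDPOINTS_PER_INSTANCE = 100   # hard limit per replication instance
ENDPOINTS_PER_ACCOUNT = 1000   # default per-account quota
ENDPOINTS_PER_TASK = 2         # one source + one target endpoint per task

def tasks_per_instance() -> int:
    return ENDPOINTS_PER_INSTANCE // ENDPOINTS_PER_TASK

def instances_needed(n_databases: int) -> int:
    # one task per database, rounded up to whole instances
    return -(-n_databases // tasks_per_instance())

def accounts_needed(n_databases: int) -> int:
    max_instances_per_account = ENDPOINTS_PER_ACCOUNT // ENDPOINTS_PER_INSTANCE
    return -(-instances_needed(n_databases) // max_instances_per_account)

print(tasks_per_instance())    # 50
print(instances_needed(1000))  # 20
print(accounts_needed(1000))   # 2
```

Running it with 1,000 databases reproduces the numbers used in this engagement: 50 tasks per instance, 20 replication instances, spread across 2 VPCs/accounts.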
SQL Server Replication/Distribution behind the scenes
I decided to go with the "full load with ongoing CDC" (CDC stands for change data capture) DMS migration mode, which essentially copies a snapshot from source to destination and, from there, replays all of the changes in the source database via the SQL Server replication mechanism. For this to work, replication must be configured on the source servers. I decided to go with replication through a distributor, because the source servers didn't have enough space to be their own distributors. For that, a separate distributor machine must be deployed in AWS; this machine receives the "change pushes" from the source databases and distributes them across the different AWS replication tasks, following a pub/sub mechanism. You can get rid of the distributor if you have enough space on your origin servers; however, it's good practice to deploy a distributor anyway, since it provides replication independence, giving you better control and better management of the storage allocation for the distribution database.
In the above diagram, the AWS DMS replication instance acts as a subscriber to the CDC events that get stored in the SQL distributor database. As a result, table transactions that happen on the source produce CDC records written to the transaction log; a TLOG publisher reads them and publishes them to the distributor-machine queue associated with the database. An AWS DMS instance subscribed to the queue reads the events and replays the CDC events against the target database. This results in all of the data changes happening at the source eventually being propagated to the target. Note that "eventually" here means there can be some delay between a change occurring at the source and its propagation - usually known as the data propagation time.
High-Level Migration "Transient" Architecture
A certain set of components needs to be set up for this migration to happen. Below is the architecture I used in this case, with all of the components involved. Note that for simplicity I'm not including here the 20 replication instances; however, in my real-world implementation a total of 20 replication instances were used, 10 per VPC/account, as discussed in earlier sections.
Note that after migration, all of this transient architecture gets removed; however, it must also be considered as part of the overall migration running costs, since otherwise we are not reflecting the real costs to the client.
Note also that in the above diagram, I used VPC peering across VPCs in order to ensure that all of the AWS DMS replication instances can access the databases both at the Rackspace data center and on the AWS-based SQL Server instances, covering east->west and west->east traffic. An RDS instance is also depicted there; this is because the client decided to go with RDS for certain specialized workloads.
Communication between both VPCs and Rackspace was done via a dedicated S2S VPN gateway, one per VPC, covering north->south and south->north traffic.
Note also that the SCT tool is responsible for the schema/code conversion, but migration tasks can also be started from the SCT tool itself, without any other requirement. Even though this is possible, in this case it was not feasible, since we are talking about 1,000+ databases; configuring endpoints and migration tasks by hand for such a number of databases turns out to be - at the least - overkill, if not entirely infeasible. This is why I fully recommend relying on AWS scripting for that - you can do it by leveraging the AWS API, the AWS SDK, or AWS CLI commands directly. There are numerous usage examples on the web that can drastically reduce the time required to spin up the infrastructure - especially when a high volume of data or a large number of databases is being handled, as in this case.
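The scripted approach can be sketched with boto3. The database names, server names, and ARNs below are hypothetical placeholders, and the actual API calls are left commented out; the point is how the per-database endpoint and task definitions are generated mechanically rather than clicked together in the console.

```python
# Sketch of scripted DMS endpoint/task creation for one migration batch.
# All identifiers, hostnames, and ARNs are hypothetical placeholders.
import json

def endpoint_kwargs(db: str, side: str, server: str) -> dict:
    # side is "source" or "target"; each task needs one endpoint of each
    return {
        "EndpointIdentifier": f"{db}-{side}",
        "EndpointType": side,
        "EngineName": "sqlserver",
        "ServerName": server,
        "Port": 1433,
        "DatabaseName": db,
    }

def task_kwargs(db: str, source_arn: str, target_arn: str, instance_arn: str) -> dict:
    # include every schema/table in the database (DB-level mapping)
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": f"{db}-full-load-cdc",
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),
    }

# One incremental wave of 20 databases (names are illustrative):
batch = [f"district{i:03d}" for i in range(1, 21)]
for db in batch:
    src = endpoint_kwargs(db, "source", "sql01.rackspace.example")
    tgt = endpoint_kwargs(db, "target", "sql01.aws.example")
    # dms = boto3.client("dms")
    # src_arn = dms.create_endpoint(**src)["Endpoint"]["EndpointArn"]
    # tgt_arn = dms.create_endpoint(**tgt)["Endpoint"]["EndpointArn"]
    # dms.create_replication_task(**task_kwargs(db, src_arn, tgt_arn, instance_arn))
```

The same generator functions can be reused across waves, which is what makes the batch-of-20-30 approach manageable at 1,000+ databases.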
Disabling constraints and clearing data before migration
One important aspect that usually goes unforeseen is disabling constraints and triggers before you run an actual migration task, particularly when you use CDC mode. Not doing so can cause the task to fail, simply because the migration task tries to inject data into the target without restriction; if we keep constraints enabled, they will cause injected data to fail due to dependencies. Keep them disabled until your migration is completed.
The same applies to table-based triggers. If we do not disable triggers, all data being populated by injection will fire trigger execution, causing data duplication in most cases. Keep triggers disabled until your migration is completed.
Lastly, you may want to use the "drop tables on target" action when configuring your task, so you do not need to worry about pre-existing data. That will work; however, all of your constraints will also be dropped and - which is worse - not re-created at all. If you decide to use this mode, be prepared to re-create all of the constraints manually once you conclude the migration.
I tend to use the "do nothing" mode, so that I can manually truncate all tables via script while keeping the constraints in place - but disabled - as described above. This works seamlessly and does not require re-creating anything.
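The pre-migration script I describe can be sketched as a simple T-SQL generator. The table names are illustrative, and the statements are standard SQL Server DDL for disabling constraints and triggers without dropping them.

```python
# Sketch: generate the per-table T-SQL that disables constraints and
# triggers (without dropping them) and clears pre-existing target data.
# Table names are illustrative placeholders.
def prep_statements(tables):
    stmts = []
    for schema, table in tables:
        qualified = f"[{schema}].[{table}]"
        # disable FK/CHECK constraints so injected rows don't fail on dependencies
        stmts.append(f"ALTER TABLE {qualified} NOCHECK CONSTRAINT ALL;")
        # disable DML triggers so injected rows don't fire them
        stmts.append(f"DISABLE TRIGGER ALL ON {qualified};")
        # clear pre-existing data (a DELETE works where TRUNCATE is
        # blocked by incoming foreign-key references)
        stmts.append(f"TRUNCATE TABLE {qualified};")
    return stmts

script = "\n".join(prep_statements([("dbo", "Students"), ("dbo", "Meals")]))
print(script)
```

After the migration completes, the mirror-image script re-enables everything with `ALTER TABLE ... WITH CHECK CHECK CONSTRAINT ALL;` and `ENABLE TRIGGER ALL ON ...;`.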
Lastly, be sure to enable CloudWatch logging for your replication tasks, at least at the default level. That will give you enough information in the event a task fails, so that you can better diagnose the problem, fix it, and finally retry the task. Before you retry, make sure to perform a data cleanup to avoid data duplication and further task failures.
LOB column support and fine-tuning memory usage
DMS supports LOB column migration, but you need to be sure to configure your LOB mode accordingly in the task configuration. Full LOB mode allows you to migrate any LOB column regardless of its size; however, it can behave detrimentally performance-wise while migrating, due to memory allocation for LOB objects. I have found that "Limited LOB" mode works best in most cases, provided that you configure the maximum LOB size correctly. I tend to use blocks of 1024 KB to cover any edge case (larger documents).
Another important factor that comes into play is the LOB chunk size. AWS recommends keeping the value at the default of 64 KB, since increasing it means that more memory needs to be allocated to read individual LOB data. I have used the proposed default without noticing any issues.
One important thing about LOB configuration is that memory usage on the DMS replication instance is impacted, which could eventually lead to an out-of-memory condition observable in the task logs or on the task summary page. The total memory usage depends on three factors:
- First, the number of parallel load workers. You can configure this number in the task configuration via "Maximum number of tables to load in parallel". We call it W.
- Second, the commit rate during load. This determines how many records are batched prior to a commit. We call it C.
- Lastly, the maximum LOB size. We call it L.
The total footprint is given by the formula N = W * C * L, expressed in kilobytes.
This number N is going to be allocated in memory inside the replication instance, so make sure that the instance size you have chosen is at least double this number.
For example, for a max LOB size of 2048 KB, 16 parallel load workers, and batches of 256 records, the allocated memory rises to N = 2048 KB * 16 * 256 = 8388608 KB, or 8 GB of memory. This number can grow very easily, so I recommend keeping an eye on it to avoid OOM situations and failed tasks.
Typically, you want to keep W and C as low as possible. I recommend playing with these numbers to find the best match for your task. This can be a trial-and-error exercise.
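The footprint formula is easy to wire into a sizing script, which is useful when iterating on W and C per task:

```python
# The N = W * C * L footprint formula from the text, in kilobytes.
# Instance sizing should leave roughly 2x headroom over N.
def lob_memory_kb(workers: int, commit_rate: int, max_lob_kb: int) -> int:
    return workers * commit_rate * max_lob_kb

n_kb = lob_memory_kb(workers=16, commit_rate=256, max_lob_kb=2048)
print(n_kb)                   # 8388608 (KB)
print(n_kb // (1024 * 1024))  # 8 (GB)
```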
Mapping Migration Tasks at the DB/Schema Level
One important decision is how to best map migration tasks: at the database level or at the schema level. Sometimes it's better to just map the task at the DB level, to avoid incurring additional endpoints and tasks that can make your transient migration architecture more complex, and also to reduce costs.
But in other cases it's better to split a DB into its different schemas and then map individual tasks at the schema level. This is recommended especially under the conditions below:
- The schemas are pretty heavy, with tables holding millions of records, across multiple schemas. In this case, it's better to separate them out, to avoid long-running tasks, and focus on short-running tasks that can be better controlled - and managed. One task per schema is recommended.
- There are schemas whose tables mainly hold LOB objects. Those can be better managed if tasks are targeted at the individual schema, since such tasks require differentiated tuning, as discussed in the sections above.
- There are schemas with hundreds of tables. In such cases, it's better to have schema-targeted tasks, so that each task can be tuned to its schema by assigning an increased number of parallel load workers to reduce the overall load time.
For the typical case, mapping at the DB level will suffice. Also, if there are too many databases - as in the case I described - it doesn't make much sense to target tasks at the schema level, since that would be penalized by a very high number of endpoints and tasks, which will easily exceed the AWS endpoint quota. We should avoid exceeding quotas whenever possible; however, if required, quotas can be increased via an AWS support ticket - of course, at an increased cost.
For the reference case I presented here, I used roughly two full accounts, with a total of two thousand endpoints and a thousand tasks, which sufficed for my requirements.
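The endpoint arithmetic behind this decision can be sketched as follows, assuming a dedicated endpoint pair per task (as in our setup) and the default per-account quota discussed earlier:

```python
# Compare DB-level vs schema-level task mapping against the default
# 1000-endpoints-per-account quota (assumed values; check your account).
ENDPOINT_QUOTA_PER_ACCOUNT = 1000

def footprint(n_databases: int, schemas_per_db: int = 1) -> dict:
    tasks = n_databases * schemas_per_db   # one task per mapped unit
    endpoints = tasks * 2                  # dedicated source + target per task
    accounts = -(-endpoints // ENDPOINT_QUOTA_PER_ACCOUNT)  # ceiling division
    return {"tasks": tasks, "endpoints": endpoints, "accounts": accounts}

print(footprint(1000))                    # DB-level mapping
print(footprint(1000, schemas_per_db=4))  # schema-level mapping
```

At 1,000 databases, DB-level mapping already needs 2,000 endpoints (two full accounts); splitting each database into even four schema-level tasks would quadruple that, which is why schema-level mapping only makes sense for a small number of heavy databases.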
Some Glitches
AWS DMS proved to be a pretty robust workload migration framework and solution, providing an end-to-end migration approach that encompasses many different use cases. It helps save time by letting you program and configure migration tasks that can be automated via the AWS Console or even programmatically, which helps the migration scale very well.
While migrating SQL Server, however, I found some glitches that I want to share:
- No support for assemblies. If you have assemblies on your source database server, they do not get converted by the SCT tool and are simply ignored. In such a case, you'll need to migrate them manually via SSMS or via scripting.
- Bad external-procedure conversion. When converting SQL Server external procedures, a bad conversion is applied that causes them to appear as invalid. If you face this situation, you can simply migrate them manually, get them re-created on the target database, and then refresh them on the right-hand (target) side of the SCT tool, so the change is incorporated.
- No timeout control on DMS replication tasks. If for any reason your replication instance gets disconnected from your source database, your replication task gets reset, which means the replication will start from scratch. This causes multiple issues.
- XML types are not always converted the right way and may remain invalid on the target. They need to be re-created manually on the target and imported on the right-hand (target) side of the SCT tool.
Configuring max LOB size for replication at SQL Server
The default SQL Server replication size for LOB columns is 64 KB. This means that unless you change this default value, trying to store content larger than 64 KB in such a column will lead to a misleading error condition that reads something like the following:
System.Data.SqlClient.SqlException (0x80131904): Length of LOB data
(607916) to be replicated exceeds configured maximum 65536.
Use the stored procedure sp_configure to increase the
configured maximum value for max text repl size option,
which defaults to 65536. A configured value of -1 indicates no limit,
other that the limit imposed by the data type.
As the error says, you should change this value from the default to -1; otherwise, users will complain about not being able to write LOBs larger than 64 KB during normal application flows - flows that usually worked fine before you enabled replication.
You can do so by means of scripting, or otherwise using SSMS by navigating to Instance -> Properties -> Advanced -> Miscellaneous -> Max Text Replication Size.
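The scripted route boils down to emitting a single `sp_configure` statement per source server, which is convenient when there are many of them. A minimal sketch:

```python
# Sketch: emit the sp_configure statement that lifts the 64 KB
# replication limit ("max text repl size") on a source server.
def max_text_repl_script(limit: int = -1) -> str:
    # -1 removes the limit (up to the data type's own maximum)
    return (
        f"EXEC sp_configure 'max text repl size', {limit};\n"
        "RECONFIGURE;"
    )

print(max_text_repl_script())
```

Run the generated script against each origin server (via sqlcmd or your tool of choice) before enabling replication.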
You should do this only on the origin database servers - it's not required at all on your target database servers, since it relates to replication functionality, which you enable on the primary server.
Choosing the appropriate DMS replication instance
This is also important. There's no particular rule about which DMS instance type you should use, since that depends completely on your migration dynamics and needs. Multiple factors contribute to requiring bigger and larger memory footprints, such as the presence of big LOB columns, the number of rows (millions), and the number of tables (more than a thousand); those are just examples. Also note that DMS uses internal memory for its own purposes, not directly related to any of the above variables. My recommendation is to start with smaller instances such as dms.c6i.xlarge (8 GB, 4 vCPU), and if you get frequent OOM conditions or experience throttling in migration task processing (i.e., migration taking too much time, or suffering from low network throughput), upgrade to dms.c6i.2xlarge. If that does not suffice, then dms.r6i.2xlarge represents the best option in terms of cost/benefit, since those instances have a larger memory capacity (64 GB) while keeping CPU bounded (8 cores), which is about right for most workload migrations.
You may want to avoid OOM conditions. One recommendation is to assign just one task at a time per replication instance for simplicity; you would be surprised how fast memory can be consumed, even by a single task, when the data being moved is as big as mentioned.
Sometimes you can experience OOM conditions even when running smaller tasks; this usually happens once your DMS instances have been running for a long time (days/weeks). In such cases I opted to reboot the instances programmatically once a week. Rebooting the instances can help free up all of the freeable memory that otherwise cannot be reused. Rebooting an instance takes no more than a minute, since the reboot is performed in place, so the instance is not recycled; after rebooting, all of the associated migration tasks will resume their execution, which usually takes 1-5 minutes, so you may need to wait some time before you see your tasks resume after a reboot.
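The weekly reboot can be scripted the same way as the rest of the provisioning. A minimal sketch - the ARNs are placeholder values, and the actual boto3 call is commented out so the sketch stays self-contained:

```python
# Sketch of the weekly programmatic reboot of the DMS fleet.
# ARNs below are hypothetical placeholders.
def reboot_batch(instance_arns):
    # build one reboot request per replication instance
    calls = [{"ReplicationInstanceArn": arn} for arn in instance_arns]
    # dms = boto3.client("dms")
    # for kwargs in calls:
    #     dms.reboot_replication_instance(**kwargs)
    return calls

arns = [f"arn:aws:dms:us-east-1:123456789012:rep:instance-{i}" for i in range(1, 21)]
print(len(reboot_batch(arns)))  # 20
```

Scheduling this once a week (e.g., from a cron job or an EventBridge-triggered Lambda) was enough to keep the long-running instances from drifting into OOM territory.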
Lastly, it's good practice to keep an eye on the CloudWatch metrics to ensure that the instances are healthy enough. Alternatively, you can create event subscriptions so that DMS informs you whenever, for example, a task fails due to an OOM condition; you can get alerts via email or via SMS - internally, DMS will create an SNS topic for you.
Table Truncation and Schema Mutation
While your migration is moving forward, it's advisable that engineering avoid any kind of schema change at all. Changing schemas means that you'll essentially need to convert your schema once again, truncate your target tables, and then start the migration from scratch, which will surely impact your migration timeline and critical path.
Ideally, the migration plan should include a "freeze" on the data structures, so that schemas aren't changed during that time, in order to avoid rework. The freeze on the schemas should be part of the migration strategy, and visibility should be given to the overall organization by upper management and/or the head of architecture for proper team alignment; otherwise, the impact on the migration can be quite high.
A second thing to advise is that, ideally, the applications shouldn't perform any TRUNCATE operation. You may wonder why. The explanation is simple: truncate operations are minimally logged and are not captured by the CDC pipeline; that means that if you truncate a table undergoing CDC, the truncation does not get propagated to the target migration database, which introduces a higher level of inconsistency.
This is why SQL Server prevents any truncate operation on a replicated table, issuing an error if such a truncate occurs, in order to avoid running into such an inconsistency. It's not a good idea at all to discover that your code has a TRUNCATE while running the migration, since in such a case users will complain that they cannot use the application as they did before, and that can be a real red flag.
In order to avoid this situation, I recommend performing a code review before you embark on the migration itself, which should be part of the "assess" phase of AWS MAP. Should you detect this condition, I recommend reaching out to the engineering heads and raising the issue. There are two possible solutions here, depending on how the engineering team is willing to contribute:
- If engineering is willing to contribute, we can plan to replace the TRUNCATE command with a DELETE. This solves the problem completely. The DELETE operation is fully logged and is the best recommendation for avoiding this situation during the migration phase.
- Engineering cannot change the code for whatever reason. In such a case, the truncation will fail. That needs to be raised and made visible. The only solution available here is to disable CDC for the database or databases where truncation is going to occur, for a certain negotiated window, and then re-enable CDC once the window is completed. This requires coordination between the migration team, the applications team, and architecture, and is not recommended at all.
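The TRUNCATE-to-DELETE replacement from the first option can be sketched mechanically. The regex below is a simplified illustration for plain statements, not a production-grade SQL rewriter; the table name is a placeholder.

```python
# Sketch: rewrite TRUNCATE statements into CDC-friendly DELETEs.
# DELETE is fully logged, so it flows through the CDC pipeline.
import re

def truncate_to_delete(sql: str) -> str:
    # naive rewrite; real code reviews should also check dynamic SQL
    return re.sub(r"\bTRUNCATE\s+TABLE\b", "DELETE FROM", sql, flags=re.IGNORECASE)

print(truncate_to_delete("TRUNCATE TABLE dbo.MealOrders;"))
# DELETE FROM dbo.MealOrders;
```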
Final Thoughts
I have tried above to give a quick walkthrough of the solution for this workload migration. There may be other ways of accomplishing the same, but this represents one standard way of performing a seamless site-to-cloud workload migration. One important point to consider: depending on the volume of data being transferred, it's important to keep an eye on the cloud connectivity strategy. S2S tunnels tend to be tricky, and sometimes configurations such as the DPD (dead peer detection) timeout can become a real pain to diagnose - I learned my own lesson on this.
In addition, depending on how old your source database is, moving data can also be challenging, especially in production environments, because aging technology on the backend tends to cause contention and bottlenecks when you try to orchestrate multiple migrations in parallel; this is another aspect that needs to be considered. My recommendation is to be cautious and proceed incrementally if possible, rather than going big-bang and trying to run big batches of tasks; that can lead to inconsistencies, unforeseen issues, and a lack of migration coordination. In my opinion, it's better to spend time discussing the overall strategy with the client, and also to get as much information as you can from the source site; it can be surprising how much you become aware of when assessing the source infrastructure. Following the AWS MAP methodology, this is part of the "assess" phase, where you learn about the customer's source environment, the customer's maturity in terms of AWS migration, and other facts - but that's part of a different conversation.