AWS S3 Migration
TrainingPeaks imports over 35 different file formats to help you analyze your workouts and to provide immediate feedback to athletes and coaches around the globe. Storage of this information is something we take seriously. Over the years TrainingPeaks has used several storage mechanisms for your workout files, including disk, database, MongoDB's Grid Filesystem (GridFS), and now Amazon S3.
Problem
To provide the reliability our customers expect from our platform, we run MongoDB in a High Availability (HA) configuration called a replica set, which duplicates data across multiple servers to guard against a single server failure. In addition, we maintain a mirrored instance that runs on a four-hour delay, plus hourly and daily backups, in case of catastrophic failure. Since moving to the Mongo GridFS implementation in 2012, TrainingPeaks has seen exponential growth in workout file storage. In 2014 we were storing over four terabytes in Mongo GridFS, with workout files going back to the year 2001. As our storage requirements grew, so too did our costs and maintenance complexities. Updates, maintenance, and backups took longer, carried more risk, and reduced our ability to innovate.
Solution
In 2014 we identified workout file storage as an area of high operational and infrastructure risk and started investigating alternatives. One alternative was Amazon's Simple Storage Service (S3), which became even more attractive after we moved the entire TrainingPeaks platform to Amazon Web Services (AWS) in August 2013. Ultimately S3 won out on a combination of reliability, simplicity, and cost, and a project was started to migrate all customer workout data from Mongo GridFS to S3.
Migration
A project like this is akin to replacing the engine of a car while going down the road at 60 miles per hour; it takes careful planning and testing. In late 2014 we started writing new incoming files to both Mongo GridFS and Amazon S3 in parallel. To ensure reliability, one of every 50 files uploaded was queued and verified as stored correctly in both places. To ensure performance, we monitored and compared the write performance. Millions of files later, with a few minor quirks worked out, we were confident enough to start migrating in our test environments.
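A minimal sketch of that dual-write pattern in Python, assuming boto3 for S3 and PyMongo's gridfs module; the bucket, database, and function names are hypothetical, and verification runs inline here rather than through a queue as described above:

```python
import hashlib
import random

import boto3
import gridfs
from pymongo import MongoClient

s3 = boto3.client("s3")
fs = gridfs.GridFS(MongoClient().trainingpeaks)  # hypothetical database name
BUCKET = "tp-workout-files-prod"                 # hypothetical bucket name

def store_workout_file(key: str, data: bytes) -> None:
    """Write the file to both stores, sampling ~1 in 50 writes for verification."""
    fs.put(data, filename=key)
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
    if random.randrange(50) == 0:
        verify(key, data)

def verify(key: str, data: bytes) -> None:
    """Confirm that both stored copies match the original bytes."""
    expected = hashlib.md5(data).hexdigest()
    s3_copy = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    mongo_copy = fs.get_last_version(filename=key).read()
    assert hashlib.md5(s3_copy).hexdigest() == expected
    assert hashlib.md5(mongo_copy).hexdigest() == expected
```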
The code to read files from S3 was implemented and deployed to our test environments behind feature flags, allowing us to switch back and forth between GridFS and S3. After thorough testing and performance monitoring, this was released to production but left disabled behind the feature flag until all old files were migrated from Mongo GridFS into S3.
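The read path then reduces to a single branch on the flag. A sketch, reusing the s3 client, fs handle, and BUCKET from the block above; feature_flags and the flag name stand in for whatever flag system gates the rollout:

```python
def read_workout_file(key: str) -> bytes:
    """Read from S3 when the flag is on, otherwise fall back to GridFS."""
    if feature_flags.is_enabled("read-workout-files-from-s3"):  # hypothetical flag API
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return fs.get_last_version(filename=key).read()
```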
The migration process ran in production for five days, migrating and verifying over 38 million workout files. It worked backwards in time to the first file uploaded for a workout, on November 22, 2001. When the migration completed, we flipped the feature flag in production and began reading workout files from S3. We continued to write files to Mongo GridFS in parallel in case any unforeseen problem required us to roll back. After a few weeks of constant monitoring without a single problem, we turned off the parallel writes to Mongo GridFS and began operating completely on Amazon S3.
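The heart of that process is a loop over GridFS ordered by upload date, newest to oldest. A simplified sketch under the same assumptions as the blocks above (batching, resumability, and error handling omitted):

```python
def migrate_all_files() -> None:
    """Walk GridFS newest-to-oldest, copy each file to S3, and verify the copy."""
    for grid_out in fs.find().sort("uploadDate", -1):
        data = grid_out.read()
        s3.put_object(Bucket=BUCKET, Key=grid_out.filename, Body=data)
        s3_copy = s3.get_object(Bucket=BUCKET, Key=grid_out.filename)["Body"].read()
        assert s3_copy == data  # verify before counting the file as migrated
```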
Cleanup of Mongo GridFS took a few more weeks. A cleanup process ran for six days deleting migrated files out of Mongo. After deleting the files, we started rolling new Mongo instances into our replica set and removing the old instances. Rolling in fresh instances is the only way to reclaim the storage space Mongo has allocated, since deleting data does not shrink its files on disk. In the end our Mongo storage space was reduced by over 92%.
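A conservative sketch of such a cleanup pass, again reusing the handles from above: each GridFS file is deleted only after S3 confirms it holds a copy.

```python
from botocore.exceptions import ClientError

def cleanup_migrated_files() -> None:
    """Delete a GridFS file only after confirming the object exists in S3."""
    for grid_out in fs.find():
        try:
            s3.head_object(Bucket=BUCKET, Key=grid_out.filename)
        except ClientError:
            continue  # no S3 copy found; keep the GridFS original
        fs.delete(grid_out._id)
```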
S3 Configuration and Use
Each of our environments (production, testing, and development) has its own S3 bucket that can only be written to by that environment. We use machine-level S3 permissions to prevent cross-talk between environments without exposing keys in configuration files, which also prevents deployment errors.
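For illustration, machine-level permissions like these can be expressed as an IAM policy attached to each environment's instance role; the bucket name below is hypothetical. Because credentials come from the role, AWS SDK clients pick them up automatically and no keys appear in configuration:

```python
import json

# Hypothetical IAM policy for the production instance role: it can read and
# write only the production bucket, so a misdeployed test box simply gets
# AccessDenied instead of touching production data.
PRODUCTION_S3_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::tp-workout-files-prod/*",
        }
    ],
}
print(json.dumps(PRODUCTION_S3_POLICY, indent=2))
```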
Files are written to S3 with a per-person key, following S3's recommended best practices to avoid hot spots in S3's keyspace that could degrade performance. Initial performance testing showed respectable 100 ms response times, and the actual times in production are well below this mark, even under constant heavy load.
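One way to build such keys (this exact scheme is an assumption, not necessarily our production one) is to hash the person ID and use the hash as a key prefix, spreading keys evenly across S3's keyspace per S3's published guidance at the time:

```python
import hashlib

def workout_file_key(person_id: int, file_id: str) -> str:
    """Build an S3 key whose hashed prefix distributes load across S3 partitions."""
    prefix = hashlib.md5(str(person_id).encode()).hexdigest()[:8]
    return f"{prefix}/{person_id}/{file_id}"

# e.g. workout_file_key(12345, "ride-2014-11-22.fit")
#   -> "827ccb0e/12345/ride-2014-11-22.fit"
```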
Our production S3 bucket is backed up to a secondary S3 bucket on a nightly basis. The backup bucket is configured with an automatic object lifecycle policy that archives objects to Amazon's Glacier data archiving service after 60 days, providing a secure, durable copy of our customer data.
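A policy like this can be applied with a single API call; a sketch using boto3 with a hypothetical backup bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Archive every object in the (hypothetical) backup bucket to Glacier
# once it is 60 days old.
s3.put_bucket_lifecycle_configuration(
    Bucket="tp-workout-files-backup",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier-after-60-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [{"Days": 60, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```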
Summary
At TrainingPeaks we take your data as seriously as you take your workouts. Our migration to S3 is a behind-the-scenes detail you shouldn't have to worry about, and one more way we provide the best training and analysis platform to help you meet your next challenge.