SaaS Operational Maturity
A New Platform
The TrainingPeaks platform has been around in one form or another since 1999. Since then, our user base has grown linearly, then exponentially, and the technology platform has gone through several revisions. In 2012 TrainingPeaks started planning its next technology jump, moving from the aging Adobe Flex-based version to an HTML5 version. Development began in late 2012, and on July 1, 2015, two and a half years later, the Flex version of TrainingPeaks was finally and happily retired.
Operational Maturity
A successful software company needs the appropriate level of operational maturity to succeed. Successful startup companies don’t launch with five nines of uptime and scalability to hundreds of millions of engaged users. Likewise, a product that is never updated and has daily downtime won’t be around long. As a SaaS business grows and matures, so should its operational processes.
In 2012 TrainingPeaks was on a hosted platform with quarterly deployments. Deployment was a mostly manual affair that took the entire development and QA teams a full day. Infrastructure changes were manual and painfully error-prone. Regressions were frequent and stability was a problem. A large percentage of development time was spent keeping things up and running, leaving little time for innovation and improvement.
Built in HTML5, JavaScript, and other modern web technologies, the new version of TrainingPeaks is more than just a pretty face. Its development also included a new back-end architecture to provide scalability, performance, and updated security. The new REST API was built with operational concerns in mind, deployment was automated, and monitoring was put in place, taking the platform to a new level of operational maturity. Currently the TrainingPeaks platform is deployed and updated weekly. Deploys take a few people less than an hour. The teams spend more time fixing bugs, improving features, and innovating, all while delivering those improvements to our customers every week.
What It Took
Five basic steps were key to maturing the platform:
- Developers focused on quality, best practice, and technical debt
- Continuous Integration with deployable, versioned artifacts
- Automated repeatable deployment of code and infrastructure
- Operational monitoring and situational awareness
- Feedback and continuous improvement
Best Practices, Quality, Paying Down Technical Debt
Software development is a lot like gardening: a software product requires constant tending. As business and customer requirements change, there is constant weeding and replanting to ensure a successful harvest. Developers spend more time reading, understanding, and modifying code than writing new code. Doing this efficiently requires a focus on quality, implementing best practices and SOLID principles, and giving your teams time to fix the root cause of problems instead of fixing symptoms.
TrainingPeaks implemented a new agile development process. Legacy code was evaluated and rewritten as necessary with best practices in mind. Additional tests were written. Pair coding, pull requests, and code reviews became the norm. When a bug is encountered, a failing unit test is written to verify the problem, and then the problem is fixed so that the test passes and helps prevent regressions. Developers are encouraged to spend 20% of their time addressing technical debt.
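To make the bug-fixing workflow concrete, here is a minimal sketch in Python. The function average_pace_min_per_km and the zero-distance bug are hypothetical and invented purely for illustration; they are not taken from the TrainingPeaks code base. The idea is that the first test was written while the bug still existed (and failed), and it now stays in the suite to guard against the bug coming back.

```python
import unittest
from typing import Optional


def average_pace_min_per_km(duration_min: float, distance_km: float) -> Optional[float]:
    """Hypothetical helper: average pace in minutes per kilometer.

    Returns None when no distance was recorded.
    """
    if distance_km <= 0:
        # The fix: the imagined buggy version divided unconditionally
        # and raised ZeroDivisionError for workouts with no distance.
        return None
    return duration_min / distance_km


class AveragePaceRegressionTest(unittest.TestCase):
    def test_zero_distance_returns_none(self):
        # Written first, while the bug still existed, so it failed.
        # After the fix it passes and keeps guarding against a regression.
        self.assertIsNone(average_pace_min_per_km(30.0, 0.0))

    def test_normal_workout(self):
        # Confirms the fix did not change the normal case.
        self.assertAlmostEqual(average_pace_min_per_km(50.0, 10.0), 5.0)
```

Saved as a file, the tests can be run with python -m unittest and wired into the CI build so every commit exercises them.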
Continuous Integration
Continuous Integration (CI) has been a key part of a good software development process for years. But CI is more than just making sure that your code compiles and your tests pass. To be successful, a development process should use CI to build deployable artifacts that can be tested on any system and promoted to production. Artifacts need to be versioned and tied to a source code commit. This allows your teams to figure out which version is running, when it was deployed, and what changes were made, which in turn enables quicker resolution of operational problems.
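One way to tie an artifact to a commit is to have the CI build stamp every artifact with a small manifest. The sketch below is an assumption about how that could look, not a description of the TrainingPeaks build: the ArtifactManifest fields, the 1.0.<build>+<short sha> version scheme, and the file naming are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ArtifactManifest:
    """Hypothetical metadata shipped alongside a build artifact."""
    name: str
    version: str
    commit: str
    built_at: str


def build_manifest(name: str, build_number: int, commit_sha: str,
                   now: Optional[datetime] = None) -> ArtifactManifest:
    # Assumed scheme: CI build number plus the short commit hash,
    # so the version alone points back to the exact source revision.
    now = now or datetime.now(timezone.utc)
    version = f"1.0.{build_number}+{commit_sha[:7]}"
    return ArtifactManifest(name=name, version=version,
                            commit=commit_sha, built_at=now.isoformat())


def artifact_filename(manifest: ArtifactManifest) -> str:
    # The same file is promoted unchanged from test to production.
    return f"{manifest.name}-{manifest.version}.zip"


def manifest_json(manifest: ArtifactManifest) -> str:
    # Serialized form that a running service could expose for "what is deployed?"
    return json.dumps(asdict(manifest), indent=2)
```

If the running application exposes this manifest, for example on a version endpoint, anyone investigating a problem can answer "what is running and which commit is it?" without guessing.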
Automated, Repeatable Deployment
Now that you have artifacts from your CI system, you need an automated, repeatable deployment process. A lot of operational problems are caused not by untested software but by incompatibilities between software and infrastructure. Making your infrastructure part of your automated deployment reduces problems and increases confidence in your deployments.
TrainingPeaks moved the platform from hosted hardware to Amazon Web Services (AWS) in May of 2013. A repeatable deployment process was needed to codify years' worth of manual, undocumented changes. Initial automation deployed to the same hardware in AWS and took our release cycle down to every two weeks. Adding automated infrastructure to the deployment, spinning up new servers in AWS for each new release, further improved our process and provided a simple rollback ability. Using feedback from our deployments to improve, we significantly reduced the deployment time and the staff needed. A rock-solid, repeatable process gave our QA team confidence, and we moved to weekly deployment.
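The rollback benefit of spinning up new servers for each release can be shown with a small in-memory sketch. It makes no AWS calls; provision and health_check are hypothetical callables standing in for whatever tooling creates and verifies servers, and the names Release and Environment are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Release:
    version: str
    servers: List[str]


@dataclass
class Environment:
    live: Optional[Release] = None      # release currently taking traffic
    previous: Optional[Release] = None  # kept around so rollback is a switch


def deploy(env: Environment, version: str,
           provision: Callable[[str], List[str]],
           health_check: Callable[[str], bool]) -> bool:
    """Bring up fresh servers for the release; switch only if all are healthy."""
    candidate = Release(version, provision(version))
    if not all(health_check(server) for server in candidate.servers):
        return False  # the live release was never touched
    env.previous, env.live = env.live, candidate
    return True


def rollback(env: Environment) -> bool:
    """Point traffic back at the previous release's servers."""
    if env.previous is None:
        return False
    # Swap, so the release that was rolled back stays available for inspection.
    env.live, env.previous = env.previous, env.live
    return True
```

Because the old servers are left alone until the new ones pass their checks, a bad release never replaces a good one, and rolling back is a switch rather than a rebuild.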
Operational Monitoring
Good, well-tested code and a good deployment process can only get you so far. You also need to know how your system is doing operationally. Errors and performance problems make for unhappy customers. Deployment tracking quickly identifies operational issues associated with new deployments. Versioned artifacts trace problems back to source changes.
At TrainingPeaks we started adding operational monitoring when moving to AWS. Replacing homegrown solutions with services such as New Relic, SolarWinds, and Rollbar enabled us to know about problems almost immediately, and usually before our customers do. Critical feedback from these and other tools enabled us to reduce errors and improve performance even while adding new features and handling significantly more users and traffic.
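The idea behind deployment tracking can be sketched in a few lines: compare the error count just before a deployment with the count just after it. This is an illustrative assumption, not how any of the services named above work internally; the 30-minute window and the thresholds in looks_like_regression are arbitrary values chosen for the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def errors_in_window(errors: List[datetime], start: datetime, end: datetime) -> int:
    # Count error timestamps in the half-open interval [start, end).
    return sum(1 for t in errors if start <= t < end)


def deployment_error_change(errors: List[datetime], deployed_at: datetime,
                            window: timedelta = timedelta(minutes=30)) -> Tuple[int, int]:
    """Return (errors before, errors after) for equal windows around a deployment."""
    before = errors_in_window(errors, deployed_at - window, deployed_at)
    after = errors_in_window(errors, deployed_at, deployed_at + window)
    return before, after


def looks_like_regression(before: int, after: int,
                          factor: float = 2.0, min_errors: int = 5) -> bool:
    # Assumed heuristic: enough errors to matter, and a clear jump after the deploy.
    return after >= min_errors and after > factor * before
```

Combined with versioned artifacts, a flagged deployment points straight at the commits that went out with it.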
Feedback and Continuous Improvement
This step is left for last, but it is the most critical. Without feedback, and without learning from that feedback to continuously improve, things will not mature.
At TrainingPeaks we hold deployment retros and a retro for any serious problem or interruption in service. Before major infrastructure changes we do pre-mortems, allowing us to identify and address risks before they become serious problems. Feedback goes back into development not only to improve what the customer sees, but also to improve our ability to support it. We continuously use this and other feedback channels, such as our customer success team and UserVoice, to improve how we deliver software.
Originally posted on Medium