Transitioning from on-premise to SAAS
Transitioning an on-premise deployed application to a (potentially multi-tenant) SAAS application running in the cloud is a daunting operation that should not be underestimated. This article is a high-level summary of some of the things we learned at Aprimo after going through this entire exercise twice with two Enterprise applications. We are not discussing the licensing or business impact, but solely focusing on the technical and operational challenges.
I hope that, after reading this article, you have a better understanding of what to take into account and what challenges you may expect when you also embark on that journey. I have categorized these topics in a few different areas and more than likely, all these areas will have to be worked on simultaneously. You'll have to prioritize within these streams depending on your needs, application, and starting point.
Supporting Multiple Cloud Platforms
If you are considering supporting multiple cloud providers (Azure, AWS, Google Cloud...), or adding your new features to your on-premise version as well, then there are some things you need to take into account:
- Feature set: This is especially important if you want to leverage one or more SAAS services. Different services mean different capabilities. An example: transcoding a video using Amazon Elastic Transcoder will behave differently from doing the same operation on Azure Media Services. The two may support different file formats and different codecs, produce files of different quality, handle subtitles differently... This means that you either have to develop for the lowest common denominator of all platforms, or you have to start differentiating your features depending on the platform you run on. The first option limits your feature set. The latter increases the development complexity.
- Development: From a development perspective, all these services have to be implemented, tested, and maintained individually. They may not only have different APIs, they may even require a different architecture so the differences in your code, your design, your testing, and your deployment could be substantial.
- Branching strategy: If you do not branch off your cloud code from your on-premise code, then you may have to develop, test, and release on all platforms at the same time if you want to support both. For smaller features, this may not be a big deal. But for bigger features where there may be fundamental architectural differences between platforms, the increase in complexity may be huge. At a minimum, it will slow down your releases. If you do decide to branch off your cloud code from your on-premise code, then new features have to be developed and tested twice, once in each branch. This may also slow you down.
- Operations: From an operational perspective, all these services and platforms have to be supported and monitored individually. This increases operational complexity and costs. For each platform, you need to set up and manage your networking, security, backups, disaster recovery, cost analysis, alerting...
- Release: On Azure, releasing your software may mean leveraging things like Azure DevOps. On AWS, CodeDeploy. On-premise, you may have used Windows Installer packages. As long as you stick to only one of these platforms, it's probably ok to use any of these technologies. However, if you do need to support multiple platforms, you may want to standardize as much as possible to ensure you only have a single thing to manage. For example, by running your application in containers and deploying using tools like Puppet or Chef.
- Architecture: It is more difficult to optimize for a particular platform (be it on-premise, or any cloud platform). This may mean worse performance, lower availability, and higher run-time costs. Going serverless is going to be especially difficult if you also need to support on-premise.
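The trade-off described under "Feature set" can be sketched in a few lines. This is only an illustration: the class names, the format lists, and the returned locations are made up, and a real implementation would call each platform's own transcoding service. It shows one platform-neutral interface, a helper that computes the feature set every platform supports, and a helper that checks a platform-specific capability before using it.

```python
from abc import ABC, abstractmethod


class Transcoder(ABC):
    """Platform-neutral interface; each platform gets its own implementation."""

    # Output formats this platform can produce (the values below are made up).
    supported_formats: frozenset = frozenset()

    @abstractmethod
    def transcode(self, source: str, target_format: str) -> str:
        """Return the location of the transcoded file."""


class PlatformATranscoder(Transcoder):
    supported_formats = frozenset({"mp4", "webm", "hls"})

    def transcode(self, source, target_format):
        # A real implementation would call platform A's transcoding service.
        return f"a://out/{source}.{target_format}"


class PlatformBTranscoder(Transcoder):
    supported_formats = frozenset({"mp4", "hls"})

    def transcode(self, source, target_format):
        # A real implementation would call platform B's transcoding service.
        return f"b://out/{source}.{target_format}"


def common_formats(transcoders):
    """Formats every platform supports: the safe but limited feature set."""
    return frozenset.intersection(*(t.supported_formats for t in transcoders))


def transcode_or_reject(transcoder, source, target_format):
    """Platform-specific feature set: check the capability before calling."""
    if target_format not in transcoder.supported_formats:
        raise ValueError(f"{target_format} is not available on this platform")
    return transcoder.transcode(source, target_format)
```

Exposing only `common_formats(...)` to users keeps behavior identical everywhere at the cost of features; calling `transcode_or_reject(...)` keeps every feature but pushes the platform differences into your product and your tests.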
Multi-Tenancy
The decision to convert a single-tenant app to a multi-tenant application impacts a lot more than your number of servers.
- Customization Capabilities. In a single-tenant environment, it is a lot easier to offer advanced customization capabilities as every deployed instance can be customized according to the needs of every tenant. Although functionally easier, operationally this makes your life more complicated because heavy customizations make it more difficult to maintain your service level objectives. For example, if requests start failing with an HTTP 500, how do you know with certainty this is because of your code and not some customization?
- Operational Costs. Running a single-tenant environment also impacts your monitoring and alerting as every time a new tenant is introduced, more resources will be added that have to be monitored, secured, backed up, alerted on...
- Scaling Impact. In a multi-tenant environment, the number of users with a session open on a web server could be significantly higher than in a single-tenant environment. Depending on the memory requirements of your application, this may create a bottleneck. If your application stores a lot of data in memory (either in session state or in some other cache), optimizations may be in order. Also, web applications that require session affinity will complicate (auto-)scaling and releasing with no downtime as it may be more difficult to kick users off a particular server. A workaround for that may be to store your session state out-of-process, but this may introduce performance and potentially serialization challenges.
- Failover. To be able to fail over a service, additional resources may have to be added. These resources increase the run-time cost per tenant. The cost impact of providing failover for all your servers or services may be substantially lower in a multi-tenant context.
- Run-time Cost. Having a single-tenant environment means that your run-time costs will increase proportionally with the number of tenants you have. Whereas in a multi-tenant environment, the cost increase per tenant may be orders of magnitude smaller.
- Security Impact. In a multi-tenant environment, a single web server will be used by multiple tenants. So, any tooling that could give customers access to data of another customer has to be removed or refactored. For example, in one of our applications, we had a legacy HTTP handler that customer admins could use to flush some static caches. In a multi-tenant environment, static caches contain information about multiple tenants, so this had to be removed. Unpublished API calls that you use only internally may also cause risks.
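As a rough sketch of the refactoring hinted at under "Security Impact", a static cache can be partitioned by tenant so that a flush only ever touches the caller's own entries. `TenantCache` and its methods are hypothetical names used for illustration, not code from our applications.

```python
class TenantCache:
    """In-process cache whose entries are partitioned per tenant."""

    def __init__(self):
        self._data = {}  # tenant_id -> {key: value}

    def set(self, tenant_id, key, value):
        self._data.setdefault(tenant_id, {})[key] = value

    def get(self, tenant_id, key, default=None):
        # Look up only inside the given tenant's partition.
        return self._data.get(tenant_id, {}).get(key, default)

    def flush(self, tenant_id):
        """Drop only this tenant's entries; return how many were removed."""
        return len(self._data.pop(tenant_id, {}))
```

The important part is that `tenant_id` comes from the authenticated request context and never from a parameter the caller can choose freely.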
Use a phased approach
This one is obvious: don't try to rearchitect your application, move away from virtual machines, change how you log, change how you release, provide failover... all from day 1. You'll likely fail. There's too much to learn, and too many things need a different approach and a different mindset from so many people in the organization.
You'll have to roll out these changes in steps, and that means the business may have to take a hit until the technology is ready again. This may mean a loss of features, too high run-time cost, too low availability, too low performance, too high operational overhead... It will be key that everybody in the organization is aligned with that fact. The first few months or year will become a constant challenge of trying to balance optimizing your architecture, optimizing your release, optimizing your costs, introducing new features, re-introducing lost features...
Release Strategy & Test Automation
Getting your release ready is one of the first things you should focus on. Don't expect it to be perfect from day 1. Among other things, it may take some time to get the naming right and to figure out where to put what logic. You'll want to ensure the release is easily versionable, testable and that it works properly with your branching strategy.
Automation. Automate your entire release from day 1. If there is anything that you need to do first, it's this. Without this, you don't stand a chance: releases will take too long, human errors will cause more problems, deployments will never truly be consistent... As a minimum, you need to have a fully automated continuous delivery pipeline that requires zero manual intervention right off the bat.
No downtime. If you're moving from on-premise to SAAS, you may have to make changes to your release and to the way you release features, to ensure you can do releases without any downtime for your users. For example, your database upgrades may require different tooling and/or a different process to ensure you can roll out without any downtime. You may need to ensure JavaScript and CSS files can get updated without forcing your users to reload the pages in the browser. Or maybe messages in queues have to be versioned so that old listeners don't pick up new messages...
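One possible shape for versioned queue messages, sketched with a made-up JSON envelope (the `version` and `payload` field names are assumptions, and no particular queueing product is implied): every message carries a version, and a listener only processes the versions it understands.

```python
import json

SUPPORTED_VERSIONS = frozenset({1})  # versions this (older) listener understands


def make_message(version, payload):
    """Wrap a payload in an envelope that carries its schema version."""
    return json.dumps({"version": version, "payload": payload})


def handle(raw, supported=SUPPORTED_VERSIONS):
    """Return the payload if the version is supported, otherwise None.

    None signals that the message should be left on the queue (or re-queued)
    for a listener that was deployed with the newer release.
    """
    message = json.loads(raw)
    if message.get("version") not in supported:
        return None
    return message["payload"]
```

During a rolling release, old and new listeners can then run side by side without an old listener misreading a message written by new code.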
Release often. You may release every 3, 6, or even 12 months in an on-premise world. You don't have that luxury anymore in SAAS. Customers will not only require you to introduce features quicker but more importantly, internal technical and operational reasons will also push you towards more frequent releases. More frequent releases reduce the risks of your releases (big releases are riskier than small releases). More frequent releases allow you to fix bugs more quickly. Doing more frequent releases will put a spotlight on and force you to optimize all the pain points in your SDLC: how you define requirements, design, test, plan, define milestones...
Automate Testing. If you release only every 6 months, it may be ok to spend two weeks in a QA cycle. But if you release every month, every week or even every day, spending two weeks in QA simply doesn't cut it anymore. So move your testing to the left and use automation instead of manual testing. This may mean that automated tests have to be created after the fact for legacy features. It may also mean that your QA/test team has to start working differently and may have to learn to write code instead of doing manual testing.
Branching Strategy. Because you release more often, features may have to be split into multiple smaller milestones. You can't afford to have a feature branch open for 6 months. Constantly keeping this branch up to date is going to be challenging because changes of other developers will start to conflict with yours at some point. On top of that, doing a big-bang release of something that has been in development for 6 months brings huge risks. Therefore, long-lived feature branches are best avoided.
Learning Curve. New features may require new infrastructure: services, virtual machines, containers, SQL databases, queues... All these resources need to be deployed and thus require updates to your release. The most scalable solution is to make it the responsibility of your development teams to update the release. However, most developers are not used to working with these technologies so making these changes will, for most of your teams, be a learning curve.
Performance & Availability
Here are some of the things to take into account that impact performance and availability.
Failover. Having parts of your application that cannot failover may have been ok in a single-tenant on-premise environment. Customers probably weren't willing to pay for the hardware anyway. From your perspective, only that single customer would have been impacted if a server went down anyway. That's no longer an option you have in a multi-tenant environment. A service's downtime there means all customers are down. Any piece of your code that does not support scaling out or failing over may have to be adapted.
Resiliency. Assume any dependency will go down in the cloud. Be it SQL databases, storage, key vaults... You have to build more resiliency in your code to deal with the fact that any of your dependencies may not be available. For SQL Server, you have to update your code to deal with transient failures. Other services (like key vaults) may have rate limitations that you may run into if you are too chatty. For the latter, putting a cache in front of these services may be necessary...
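The two techniques mentioned above, retrying transient failures and caching in front of a rate-limited service, could look roughly like this. It is a minimal sketch: `TransientError` stands in for whatever exception your database driver or SDK raises for retryable failures, and the attempt counts, delays, and TTL are arbitrary values.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for whatever your driver raises on a retryable failure."""


def with_retries(operation, attempts=4, base_delay=0.2, sleep=time.sleep):
    """Call operation(); on TransientError back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Jitter keeps many clients from retrying in lockstep.
            sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))


class TtlCache:
    """Remember values for ttl seconds to avoid hammering a rate-limited service."""

    def __init__(self, fetch, ttl=300.0, clock=time.monotonic):
        self._fetch, self._ttl, self._clock = fetch, ttl, clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or entry[0] <= now:
            # Missing or expired: fetch once and remember the result.
            entry = (now + self._ttl, self._fetch(key))
            self._entries[key] = entry
        return entry[1]
```

Only retry operations that are safe to repeat, and keep the TTL short enough that a rotated secret or changed setting is picked up within an acceptable time.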
Latency. There may be parts of your client-application that require low-latency networks. Transitioning to the cloud means much higher latency and these client applications may have to be updated accordingly to still function as expected.
Communication Channels. If your clients communicate with your server over something that doesn't use port 80 (HTTP) or 443 (HTTPS), they may no longer work (reliably).
Long-Running Requests. Your legacy application may have certain pages that may be configured to allow a longer execution timeout (for example to support some slow query). These things may have to be refactored because they make detecting performance issues more difficult. You could decide to exclude certain routes from your performance monitoring, but then any performance issue in those routes will never be caught. And these requests also reduce the scalability of your web-servers. All these long-running requests block threads which, once you have sufficient users, do become a problem. You may have to scale out your web-servers faster than you expected to.
Customization Capabilities
Any customization capabilities in your application that make use of one of the following items may also require updating.
Local Access. Having local configuration files may no longer be an option:
- Customers will no longer have access to these configuration files. For security or compliance reasons, many of your internal people who may be tasked with applying these customizations may also no longer have access to them. This configuration has to be moved to something (a UI or API) that both can access.
- In a multi-tenant environment, local configuration files may impact multiple customers. So any configuration that should only impact a single tenant has to be moved to something tenant-specific.
- Local configuration files are also incompatible with scaling. Changing a single configuration setting may mean updating it on 20 servers.
APIs. All the APIs your customers should use have to be converted to REST APIs. If you still have customizations that require your .NET APIs or Java Class Libraries for example, then all these customizations either have to be removed or refactored to require a Web API.
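For the local configuration files discussed under "Local Access", one way to make a setting tenant-specific is a central store with global defaults and per-tenant overrides. The sketch below uses a dictionary where a real system would use a database table or a configuration service; the setting names and the `TenantSettings` class are hypothetical.

```python
DEFAULTS = {"max_upload_mb": 100, "theme": "light"}  # hypothetical settings


class TenantSettings:
    """Central settings store: global defaults plus per-tenant overrides."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._overrides = {}  # tenant_id -> {setting name: value}

    def set(self, tenant_id, name, value):
        if name not in self._defaults:
            raise KeyError(f"unknown setting: {name}")
        self._overrides.setdefault(tenant_id, {})[name] = value

    def get(self, tenant_id, name):
        # A tenant override wins; otherwise fall back to the global default.
        return self._overrides.get(tenant_id, {}).get(name, self._defaults[name])
```

Because every server reads from the same store, changing a setting no longer means touching a file on each of them, and the same store can sit behind the UI or API that customers and internal staff use.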
Running Custom Code. Remove any feature that requires any custom code of a customer to run on your web servers. This code results in an inability to monitor the performance and availability of your application. If - every time there is slowness in your application - you first have to check if it's caused by your code or the customer's code, then you have practically closed the door for any efficient monitoring. The support and operational impact of having to investigate each occurrence are huge and should not be ignored.
Security & Legal
Security becomes much more important in a SAAS context because you are now responsible for the data: its consistency, its backups, its security... Here are a few ways this will impact you:
Log Files.
- You may have to go through all of your log entries to ensure that no sensitive or customer data is written to any of the log files. In an on-premise world, this information was written on servers of the customer so the security impact was much less severe.
- People who need to analyze issues may no longer have access to local log files. Also, having them locally makes analysis in a horizontally scaled-out environment much more difficult as you need to know which server to look on. And as servers/services/containers may be renewed during a release, you may not even have the logs of a previous release anymore if you store them locally. You can fix many of these challenges using tools like Azure Log Analytics or Amazon CloudWatch Logs to aggregate these log files in a central repository, but the challenge with these tools is that they may be a few seconds or minutes behind and they do come with a cost. Depending on how verbose your logging is, the cost may be considerable.
- In a multi-tenant environment, your log entries will only have any meaning to you if the log also contains the name of the tenant this entry applies to. If you come from a single-tenant environment, this probably will not be the case. To ensure all your log entries contain this tenant information, the existing code may have to be updated.
- The verbosity level of your logging may have to be updated. Writing too many log entries could have too big a performance or cost impact. For example, at the beginning of our SAAS adventure, we had an issue a few times where the verbosity of our logging, combined with the load caused by multi-tenancy, resulted in so much data that Application Insights brought some servers down.
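Two of the points above, keeping sensitive data out of the logs and stamping every entry with its tenant, can be handled in one place instead of at every call site. The sketch below uses Python's standard `logging` module; the masking pattern, the `TenantFilter` class, and the `get_tenant` callable are assumptions made for illustration, and a real pattern list would have to be derived from your own data.

```python
import logging
import re

# Hypothetical pattern: mask anything that looks like "password=..." or "token=...".
SENSITIVE = re.compile(r"\b(password|token)=\S+", re.IGNORECASE)


class TenantFilter(logging.Filter):
    """Attach the tenant name to each record and mask sensitive values."""

    def __init__(self, get_tenant):
        super().__init__()
        self._get_tenant = get_tenant  # callable returning the current tenant

    def filter(self, record):
        record.tenant = self._get_tenant()
        # Render the message once, then mask it; the args are cleared so the
        # masked text is not formatted a second time.
        record.msg = SENSITIVE.sub(r"\1=***", record.getMessage())
        record.args = ()
        return True


def build_logger(name, get_tenant, handler):
    """Create a logger whose output always starts with the tenant name."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler.setFormatter(logging.Formatter("%(tenant)s %(levelname)s %(message)s"))
    handler.addFilter(TenantFilter(get_tenant))
    logger.addHandler(handler)
    return logger
```

A filter like this is a safety net rather than a guarantee: it only catches the patterns you thought of, so reviewing what the code logs in the first place remains necessary.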
Hide Infrastructure. Any UI, API, documentation... that exposes anything related to the infrastructure that you are using internally has to be hidden or removed. The key thing to keep any SAAS environment in a manageable state is that you have to be able to control and automate the infrastructure and configuration that you are using. Exposing this to a customer not only impacts your manageability but also introduces security risks.
Developer Access. Often when a problem happens, you need the customer's data or configuration to be able to reproduce the problem. Especially with hard to reproduce issues. In SAAS your customers may ask you to contractually limit who can or cannot access their database. This means that your developers may not be able to access a customer's data to reproduce an issue.
Open Source. Your application is probably using dozens of open source components. It's near impossible nowadays to develop anything without using open source (assuming you wanted to :)). For SAAS, you will have to take out and replace any GPL flavored software to avoid legal consequences. You may also have to set up a process to ensure you stay up to date, because any known vulnerability in anything that you use may now become a vulnerability for you as well.
Operations
This section explains things to take into account when building an operations team.
- Operations Mindset. Doing operations requires a particular mindset and you may currently not have those people on board. Don't expect your developers to do operations. They have a different mindset and they should. A developer focuses on the long-term manageability of the platform. This means that any bug has to be fixed in a good way. An ops person has to focus on operational availability. Their focus is to fix the issue NOW. This conflict has to be there. A developer often gets satisfaction from building features. An operations person may get satisfaction from solving a problem. Your developers probably also didn't join your company expecting they would have to provide 24/7 support at some point. Operational people will more likely expect that.
- Alerting. You don't want to have your people constantly looking at screens and dashboards waiting for something to go wrong. Alerts are the only way to be notified of any issues in a manageable way. That means that alerting software has to be bought or built. A team has to be set up that responds to those alerts. Documentation has to be created so that the team knows how to respond to these alerts. That team needs to be trained continuously on the changes and new features. The product may have to be changed so that the necessary telemetry or data is available for alerting to use. Enabling your support/alert/operations team has to become just as much a deliverable for your R&D teams as the software itself.
- Feedback Loop to R&D. I'm starting from the assumption that you will begin by building an operations team next to your R&D team. Doing DevOps from day 1 may be a bridge too far. In that case, it is vital that you have the necessary feedback loops to share what the operations team has learned with your R&D teams. This will be crucial if you want any issues to result in bug fixes, product changes, or optimizations. In the beginning, this may be a free lunch as your key technical people may be involved in everything, but as your SAAS organization grows, more people get involved and you will need a structured solution for this problem.
- Eliminate Noise. Your log or telemetry may contain thousands of errors a day that may not even be noticed by your users. These errors may pollute your statistics, introduce noise in your monitoring, or cause false alerts. Errors that do mean something have to be fixed; anything else is noise and has to be eliminated.
Developing for SAAS
A major unexpected challenge was that we had to change the mindset of our R&D teams when creating new features. Creating SAAS software is NOT simply a different way of deploying and maintaining the system. It has to be written differently!
Many of the things mentioned earlier are not one-off projects. Feature development, design, bug fixing, troubleshooting, security... all have to be approached differently in SAAS and this must become second nature for your developers. Getting there will not be a free lunch.
Here are just a few topics that your developers may have to get used to:
- not writing sensitive data to the logs
- not forgetting to create alerts for every new feature
- providing documentation for operational people
- releasing features in a way that they don't cause any downtime
- expecting transient failures
- expecting rate limitations in dependencies being used
- considering the impact on other teams in the organization when introducing new technologies
- taking run-time costs into account when designing a new feature
- ...
Conclusion
The purpose of this article was not to give you a comprehensive checklist to move to SAAS but to give you a list of topics that you may not have thought of yet and that you may have to consider. Transitioning to SAAS is a daunting but thrilling and fun thing to do. Once you're there, you may never want to go back to on-premise :)