Canary Releasing on top of a centralized multi-tenant microservice platform
Guillermo Wrba
Author of "Designing and Building Solid Microservice Ecosystems", independent consultant and solutions architect, evangelist of new technologies, distributed computing and microservices.
Big-Bang and Blue/Green Releasing Approaches
Releasing new features is something that happens very frequently, whether or not our system is built on top of a microservice architecture. As our product evolves, new business requirements are gathered and translated into features and user stories that need to be documented and then turned into a solution, which may or may not involve microservices; in the end, microservices are just another option when it comes to implementing the real thing.
When it comes to releasing new features into our target production environment, we have several deployment options. The most common - and simplest - option is the "big-bang" approach, in which new features are deployed to production by first stopping the running services, then deploying the changes, and finally restarting the services. This approach is simple, but it has several drawbacks, the main one being that it involves a service interruption, so we cannot guarantee that our system will remain available through a deployment. Depending on the nature of our system - and the intricacies of its user interactions - this approach may or may not be feasible. For example, in a system that gathers information by subscribing to RSS feeds and then updates a central news database from time to time via a batch process, such an interruption may not cause any major problem to an end user, since the delay associated with the information being uploaded is already part of the expected end-user experience.
Now let's suppose that our "news" system has a new requirement to deliver the data in near real time; it's evident that in such a case, given the users' expectation of getting data as soon as possible, the big-bang approach would no longer be a good fit for our purpose.
This leads to our second releasing approach, known as "blue/green" deployment, in which two separate production environments are defined: the "blue" one, where the current system version is running and accessible to production users, and the "green" one, usually not visible to end users, where the release of new features takes place.
Following such an approach, the current software version runs in production on the separate "blue" environment; users can only connect to the blue side and cannot directly access the green side. If a new version of any software component needs to be deployed, the deployment must necessarily occur on the green side, and QA must be performed on that side only. Once the newly released components are fully tested and the new release is stable, traffic is switched over to the green side, so users can start using the newly deployed capabilities. Meanwhile, the blue side remains as a backup in case something goes wrong with the new build; in such a case, traffic is switched back to the blue side, falling back to the original condition.
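As a minimal illustration - and just one of several possible implementations, with all names here being hypothetical - the blue/green switch can be wired up in Kubernetes by keeping two Deployments running side by side (say news-api-blue and news-api-green) and repointing the production Service from one slot to the other by changing its selector:

```yaml
# Production Service for the hypothetical "news" system.
# Two Deployments stay up at the same time, labelled slot: blue and slot: green;
# flipping the selector below from "blue" to "green" cuts production traffic over,
# and flipping it back restores the original (blue) environment.
apiVersion: v1
kind: Service
metadata:
  name: news-api
spec:
  selector:
    app: news-api
    slot: blue        # change to "green" to switch traffic to the new build
  ports:
    - port: 80
      targetPort: 8080
```

Because both environments remain deployed, rolling back is just a matter of reverting the selector.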
Canary Releasing
Lastly, we have a modified version of blue/green, known as the "canary" release. Whereas the blue/green approach switches all users between environments (blue -> green) without making any distinction, the canary release performs an incremental switch by progressively moving users from the "blue" to the "green" side, so that at a certain point in time only a few users see the new build, while the rest still see the old one, depending on which side each user is sitting on. Once all users have been moved to the green side, the canary release is said to be complete.
This approach certainly has many benefits. One of the most remarkable is the ability to test the new build incrementally, so that if something unexpected happens, only a portion of the user population is affected and the rollout can be interrupted without the majority of users even noticing any issue.
Now, we need to think for a moment about how a canary release methodology can fit into a multi-tenant architecture. In a multi-tenant platform, multiple tenants are served simultaneously; for example, in a large hospital management system (HMS), every single hospital represents a tenant. Individual hospitals can have their own storage, settings and related services, and may or may not be aware of the presence of other adjacent - or sibling - hospitals. In summary, that's the very basic nature of a multi-tenant platform.
Now the problem of releasing new features with a canary approach becomes a little more complex, since not only must users be ramped up incrementally when deploying, but individual hospitals (tenants) should be as well; in other words, if a new build is released into production, we may not want to make it available to all the hospitals at once, but rather roll the new build out incrementally across hospitals.
If we were following a distributed single-tenant approach, the dilemma of how to implement the incremental rollout would be quite simple to solve, since individual tenants can be deployed to completely separate compute infrastructure, i.e. in the form of virtual machines or dedicated containers. Because every single tenant has its own compute infrastructure, we can implement an incremental rollout by simply deploying the new build across the multiple tenants incrementally over time; if for some reason the new build fails for a certain tenant, we can revert the change without necessarily affecting other adjacent tenants.
The above is true for a distributed single-tenant approach, but what about a centralized multi-tenant approach, where all the requests are handled uniformly by a shared microservice platform?
Canary Releasing in a centralized, multi-tenant platform
In such a centralized multi-tenant approach, tenants are not physical entities - contrary to the distributed single-tenant approach we described earlier - but logical entities inside our multi-tenant platform. Considering that we are also running our business services on top of domain microservices, we need to find a way for each tenant to pick either the old or the new microservice build at the moment the business logic is executed, depending on which side of the canary release the tenant is sitting on - remember our blue/green discussion above.
The solution presented below involves a layer-7 load balancer (L7LB) capable of context-based routing, together with a microservice versioning approach and an API Gateway responsible for relaying the HTTP headers of the request to the load balancer. As mentioned above, the L7LB should be capable of routing API requests to either the old or the new microservice build, which exposes the old or the new service API respectively. The decision of where to route a request is made at the L7LB level and is usually under the control of the DevOps team. In a typical incremental rollout scenario, the DevOps team performs the necessary configuration so that the load balancer directs requests to the "new" API build incrementally. Depending on the load balancer capabilities, this mapping can occur in two ways:
At the tenant level, or at the user level. For example, in our hypothetical hospital management system, the tenant-level mapping could be done via the HospitalID, so that at the load balancer configuration level every single HospitalID (present in the API request) can be mapped to either the "old" or the "new" microservice build. In addition, a userid attribute could be obtained from the user session cookies; based on both the TenantID and the UserID, the load balancer queries the dynamic routing table and, once the API version is determined, routes the request to the appropriate service (in k8s) that exposes that API version.
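Purely as an illustration - the structure and field names below (defaultVersion, tenantOverrides, and so on) are hypothetical and not tied to any specific load balancer product - such a dynamic routing table could look like this:

```yaml
# Hypothetical dynamic routing table consulted by the L7 load balancer.
# Every tenant starts on the current (old) build...
defaultVersion: v1
# ...while selected tenants are moved to the new build as the canary progresses.
tenantOverrides:
  - hospitalId: "HOSP-0042"
    apiVersion: v2
  - hospitalId: "HOSP-0107"
    apiVersion: v2
# User-level overrides take precedence, e.g. internal QA accounts always
# exercise the new build regardless of which hospital they belong to.
userOverrides:
  - userId: "qa.team@example.com"
    apiVersion: v2
```

Ramping up the canary then simply means adding more tenants (and eventually all of them) to the override list.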
The schematic below represents the canary deployment and a potential implementation of it. The centralized microservice platform can be invoked from multiple Operational Units (OU in the diagram), to which multiple users are connected. An API Gateway receives requests from multiple users belonging to multiple OUs and forwards the appropriate TenantID and UserID headers to the load balancer; based on the endpoint URL, the TenantID (represented here by the OUID, the Operational Unit identifier) and the UserID, the load balancer determines the resulting API version to be invoked. Note that for this to be possible, an appropriate API versioning policy must be in place.
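One way such a versioning policy could materialize in Kubernetes - again a sketch with hypothetical names rather than a prescription - is to keep one Deployment/Service pair per API version, so that the load balancer always has a distinct, routable target for the old and the new build:

```yaml
# Old build, kept running and addressable as its own Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: patient-api-v1
spec:
  replicas: 3
  selector:
    matchLabels: { app: patient-api, version: v1 }
  template:
    metadata:
      labels: { app: patient-api, version: v1 }
    spec:
      containers:
        - name: patient-api
          image: registry.example.com/patient-api:1.4.0   # hypothetical image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: patient-api-v1
spec:
  selector: { app: patient-api, version: v1 }
  ports:
    - port: 80
      targetPort: 8080
# An analogous patient-api-v2 Deployment/Service pair would exist for the new
# build, and the load balancer's routing table maps each tenant/user to one of
# the two Services.
```

With a service mesh in place, the same effect is usually achieved with a single Service plus per-version pod labels (subsets), which is the approach sketched in the next section.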
Implementing Canary: Istio to the rescue
In the above diagram we introduced some kind of dynamic "load balancer", capable of performing dynamic routing like a man-in-the-middle; configuring an L7 reverse proxy to perform this kind of dynamic routing can be quite complex, whether we choose NGINX, HAProxy or any other reverse proxy.
In the real world, that capability can easily be covered by introducing a service mesh such as Istio, in combination with its sidecar proxies, which reduces the implementation complexity quite a lot. There are several examples that demonstrate how canary releases can be implemented on top of Istio to distribute traffic across different k8s services based on pre-defined routing rules specified at the Istio level itself; below I'm sharing some links to what I consider good resources to begin with.
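To give a flavour of what those routing rules look like, here is a minimal sketch assuming a single Kubernetes Service (patient-api) backed by two Deployments labelled version: v1 and version: v2, and an API Gateway that forwards the tenant in an x-tenant-id header; all names and header keys are hypothetical:

```yaml
# Declares the two versions (subsets) of the service, distinguished by pod labels.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: patient-api
spec:
  host: patient-api
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
# Routes canary tenants to v2 based on the forwarded tenant header, and splits
# the remaining traffic by weight so the rollout can be ramped up gradually.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: patient-api
spec:
  hosts:
    - patient-api
  http:
    - match:
        - headers:
            x-tenant-id:
              exact: "HOSP-0042"   # a tenant already moved to the canary
      route:
        - destination:
            host: patient-api
            subset: v2
    - route:                       # everyone else: mostly old build, small canary share
        - destination:
            host: patient-api
            subset: v1
          weight: 90
        - destination:
            host: patient-api
            subset: v2
          weight: 10
```

Shifting the weights (90/10, 75/25, ... 0/100) and adding tenants to the header match is then how the DevOps team drives the incremental rollout described above.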
This will continue soon..... Stay Tuned!!!