Intune’s Journey to a Highly Scalable Globally Distributed Cloud Service
This post is the fourth and final part of a series (check out Part 1, Part 2, and Part 3) that I felt was really important to publish.
You might be asking why I wanted to dive into the architecture of Microsoft Endpoint Manager/Intune at this level. My reasoning is that, as you take dependencies on the cloud services you use every day, it’s important to know whether those services have been architected for scale. There will always be critical moments when you need to rely on the security, performance, and disaster recovery capabilities your organization requires.
Back in 2013, we knew that the initial architecture of Intune was not going to take us where we needed to go. This recognition led us to look around the industry to see if there was anything in the market that could help us accelerate toward the architecture we needed. The answer was a resounding “no.” The reason we didn’t find anything was that the competitors to Microsoft Endpoint Manager/Intune were not architected as modern cloud services; they were (and are) essentially client-server architectures that were then “hosted” – an architecture firmly rooted in the increasingly distant past.
As we work with you on your modern management needs, I can very confidently say that Microsoft Endpoint Manager gives you an ideal, modern, and forward-looking solution for your management and security needs. It is an architecture for the future. Believe me when I say that architecture matters.
This post looks at how we matured our services to perform reliable failover and disaster recovery, to be fault tolerant, and to roll out new features with minimal impact on customers’ use of existing features.
As noted in the last post, three huge things we learned leading up to this point were:
- It is critically important to set realistic and achievable goals for SLA and scale and then drive towards achieving each of these goals persistently. With this in mind, we found that the best outcomes happen when a set of engineers from across the org work together as a unit towards a specific and common goal. This is a great way to make incremental changes that accumulate into some really big and really positive effects.
- Continuous profiling is a critical element of cloud service performance. Profiling continuously lets you reduce resource consumption and tail latencies, and it indirectly benefits nearly every runtime aspect of a service.
- Microservices really help improve agility. Having the proper tooling to handle patches, deployments, dependencies, and resource management is a critical part of deploying and operating microservices in a high-scale distributed cloud service.
As called out in the previous blog, after reaching some of our key goals for availability/SLA, performance, scale, and agility (and consistently sustaining them through continued rapid growth in customers and devices), we were ready to move on to the next stage of maturity by investing in, and learning about, the following areas:
- Failover/disaster recovery
- Fault tolerance
- A/B deployments to roll out new features with minimal impact to existing features
- Future evolution
Here’s how we did it:
#1: Failover/Disaster Recovery
In September of 2018, Azure experienced an outage in the South Central Region of the U.S., which subsequently came to be known within Microsoft as the ‘San Antonio Outage’.
This outage was triggered by severe weather, including lightning strikes on Azure data centers. We had not previously seen or experienced an outage of this severity, particularly not one in a region that hosted one of our scale units (see Part 1 for the definition of a scale unit). When it occurred, we did not have highly reliable, automated failover and disaster recovery (DR) procedures in place, which meant that when dependent services such as Azure compute and storage were unavailable for an extended period, we had no option but to wait for all of our dependent Azure resources to recover.
This led us to invest heavily in failover and DR procedures within Microsoft Endpoint Manager/Intune. Amongst these investments was a regular schedule of DR drills in our preprod environments. By doing these drills repeatedly over the past year we have learned some very valuable things about gaps in fundamentals, areas where automation was really needed, testing gaps, and consistency gaps across microservices.
Our first DR drill took several months to prepare and perform – but, over time, we improved significantly, and we now perform a DR drill every 1-2 weeks.
These drills also exposed a few gaps in the way our infrastructure worked across all our microservices, and they drove some of the major improvements we made based on the DR drill data:
- First, many of our microservices provisioned their resources in an ad hoc manner. We quickly realized we needed a consistent way of identifying, provisioning, and inventorying microservice resources.
  - Impact: This resulted in a major effort to overhaul the way we provision all dependent resources of a microservice via a Unified Resource Management system across Intune (a minimal illustrative sketch follows this list).
- Second, we needed to move our clusters (scale units) from the existing Azure PaaS-based infrastructure to VMSS-based infrastructure that supports availability zones (AZs). Some of our production clusters are already on AZs, and we are on a path to migrate all of our existing clusters to AZ-enabled clusters within the next year.
  - Impact: This provides a huge benefit in resiliency to datacenter outages.
- Third, a DR process with stateful services is extremely complex to perform successfully. We need to move all our Intune services to a stateless model backed by external storage via SQL Azure/CosmosDB.
  - Impact: These storage models already solve the DR problem and make it much more seamless to perform failover and/or disaster recovery.
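To make the Unified Resource Management idea more concrete, here is a minimal, hypothetical sketch of declaring a microservice’s dependent resources in a single manifest so they can be provisioned, inventoried, and replayed against a DR region consistently. The class, resource kinds, names, and regions below are all invented for illustration – this is not Intune’s actual implementation, only the pattern.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceSpec:
    kind: str           # e.g. "sql-database", "storage-account", "redis-cache"
    name: str           # logical name, unique within the microservice
    region: str         # target Azure region
    zone_redundant: bool = True

# Hypothetical manifest: every dependency of one microservice, declared once.
SERVICE_MANIFEST = [
    ResourceSpec(kind="sql-database",    name="device-state",  region="southcentralus"),
    ResourceSpec(kind="storage-account", name="policy-blobs",  region="southcentralus"),
    ResourceSpec(kind="redis-cache",     name="session-cache", region="southcentralus"),
]

def provision(spec: ResourceSpec) -> None:
    """Idempotently ensure the resource exists (stubbed out for illustration)."""
    print(f"ensure {spec.kind}/{spec.name} in {spec.region} "
          f"(zone redundant: {spec.zone_redundant})")

def provision_all(manifest, region_override=None):
    """Provision every declared dependency; a DR drill can replay the same
    manifest against a secondary region simply by overriding the region."""
    for spec in manifest:
        target = spec if region_override is None else ResourceSpec(
            spec.kind, spec.name, region_override, spec.zone_redundant)
        provision(target)

if __name__ == "__main__":
    provision_all(SERVICE_MANIFEST)                                     # normal provisioning
    provision_all(SERVICE_MANIFEST, region_override="northcentralus")   # DR drill replay
```

The point of the pattern is that the same declaration drives day-to-day provisioning, inventory, and DR rehearsal, instead of each microservice doing it ad hoc.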
Given all of the above investments and efforts, we feel much more confident today in our ability to successfully and efficiently recover from a data center outage – and we are still investing to make our failover/DR capabilities even faster and more resilient.
#2: Fault Tolerance
Another thing we learned from our work with failover/DR was the limits of the fault-tolerance levels within Microsoft Endpoint Manager/Intune’s services and infrastructure. We discovered just how important this was when we saw service SLAs drop below three 9s while we were performing upgrades. During upgrades it’s typical to see processes go down, nodes brought down and back up, dependent services become unavailable, network connections drop, and a variety of unexpected faults occur. In other words, it’s total chaos. We needed to make sure that our services were resilient to any form of chaos we encountered.
Our investments in this area relied primarily on a great feature provided by Service Fabric for inducing chaos in a cluster. We used it to investigate many of our issues and to focus on improving upgrade SLAs. I consider the result very impressive – see Figure 1 below.
Figure 1: Upgrade SLA Improvements:
This improvement taught all of us a very important lesson: it’s important to induce faults into the system and test service resiliency all the time. It’s not safe or smart to sit in a reactive mode, waiting for upgrades to surface failures in order to find and fix code bugs – such faults need to be induced on a continuous basis. Based on what we learned here, we implemented several fault-injection and fault-tolerance mechanisms in our environments. These mechanisms (among others) included:
- Constantly running chaos:
  - In addition to what I described above, we also introduced faults at a deeper, individual-API level – for example, calls to Redis or Azure storage randomly failing (a minimal illustrative sketch of this kind of fault injection appears after this list).
- Constantly rotating secrets:
  - Typically, application secrets for accessing resources such as SQL or Azure storage are rotated every few months – sometimes once a year – depending on the secret. The implication is that the code and/or configuration handling the rotated secret is exercised only once every few months, or once a year.
  - Our experience showed that code exercised this infrequently tends to have lower quality, which results in more production issues that impact customers. Learning this led us to automate secret rotation and exercise it on a daily or weekly basis in preproduction environments.
  - Doing this not only helped us find and fix rare code-quality issues; in a couple of cases it actually exposed issues in external dependencies (such as Azure Active Directory). We were able to surface the bug to the external team and have it investigated and fixed well before the issue ever occurred in production.
- Auto mitigation:
  - Over time, we learned that it is not possible to catch every single issue in a preproduction environment, particularly those triggered by specific traffic patterns or payloads resulting from unique customer actions. The most common issues had to do with high resource consumption – high CPU, high memory, and so on – and the most common and quickest fix was to reboot the node or restart the service. Doing this manually often took a long time and extended the impact to customers, so we implemented an auto-mitigation system that automatically detects and mitigates these issues. In one instance, this single change saved us 200 person-hours in a single week!
  - The types of issues we can now automatically detect and mitigate are usually associated with high CPU or memory consumption, or an abnormal drop in request rate. We now automatically detect and mitigate roughly 35 incidents in a typical week, about half of which are in production. In worst-case scenarios, we have successfully mitigated as many as 315 incidents in a week, with about 150 in production. A majority of those would have turned into customer-impacting issues had we been unable to prevent them.
  - Overall, low available memory causes us to take automatic action, restarting either a service or a node. The auto-mitigation system has proven highly valuable in preventing customer-impacting issues, and we intend to extend its triggers to other conditions, such as disk I/O.
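Circling back to the first mechanism in the list above (constantly running chaos), here is a minimal, hypothetical sketch of per-API fault injection: a wrapper that randomly fails calls to a dependency such as Redis or Azure storage. The environment variables, fault rate, and function names are invented for illustration; the chaos tooling we actually use is built on Service Fabric’s chaos capabilities and is considerably more sophisticated.

```python
import os
import random
import functools

# Hypothetical knobs: gate chaos behind an environment switch so it only
# runs where intended, and fail a small fraction of calls at random.
CHAOS_ENABLED = os.getenv("CHAOS_ENABLED", "false") == "true"
FAULT_RATE = float(os.getenv("CHAOS_FAULT_RATE", "0.02"))   # ~2% of calls fail

class InjectedFault(Exception):
    """Raised instead of calling the real dependency."""

def chaos(dependency_name):
    """Decorator that randomly replaces a dependency call with a failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if CHAOS_ENABLED and random.random() < FAULT_RATE:
                raise InjectedFault(f"chaos: simulated {dependency_name} failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@chaos("redis")
def get_cached_policy(policy_id: str) -> str:
    # Real code would call the cache here; callers must tolerate
    # InjectedFault exactly as they would a genuine cache outage.
    return f"policy:{policy_id}"

if __name__ == "__main__":
    # With CHAOS_ENABLED=true, roughly 2% of these calls raise InjectedFault.
    for i in range(5):
        try:
            print(get_cached_policy(str(i)))
        except InjectedFault as fault:
            print(fault)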
#3: A/B Deployments
As we evolved our services across these fundamentals, one of the most difficult challenges we faced was the metric tracking our time to mitigate incidents.
Our goal was to mitigate in under an hour, but this was nearly impossible to achieve with our existing mitigation strategy, which was always to fix forward. The “fix forward” strategy involves the entire cycle of a feature or hotfix release: PR, build, test, and deploy via a safe rollout. Because each of these four stages added several hours, we needed an entirely different strategy if we were going to keep time to mitigate (TTM) under an hour. The solution we adopted was, instead of fixing forward, to roll back to a previous working version. This is typically referred to as “A/B deployments.” The exact way this works behind the scenes deserves a separate blog post of its own but, in a nutshell, it works like this (a minimal sketch of the routing idea follows the list):
- While we roll out a new version of the service, we keep the existing version of the microservice running in production until we gain confidence in the new version’s quality.
- A router service called Traffic Routing Service (TRS) knows how to route traffic to either the newer or the existing version of the service, based on certain signals and/or a manual switch.
- When an incident happens, we flip the switch in TRS and it instantaneously routes all traffic to the existing, stable version of the service. The switchover typically takes 2-3 minutes.
- This technique of A/B deployments provided us with huge improvements in TTM for microservice deployments.
- This also improved our agility because we can roll out to production even faster since we know that we can mitigate any issues quickly.
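To make the routing idea concrete, here is a minimal, hypothetical sketch of an A/B switch. It is not TRS itself (those internals would need their own post); it only illustrates the core trick – keep both versions deployed, and make “rollback” a routing change rather than a redeployment. The class and endpoint names are invented for illustration.

```python
import threading

class TrafficRouter:
    """Toy A/B router: two deployed versions, one switch."""

    def __init__(self, stable_endpoint: str, canary_endpoint: str):
        self._stable = stable_endpoint
        self._canary = canary_endpoint
        self._use_canary = True          # new version takes traffic by default
        self._lock = threading.Lock()

    def route(self) -> str:
        """Return the endpoint that should receive the next request."""
        with self._lock:
            return self._canary if self._use_canary else self._stable

    def rollback(self) -> None:
        """Flip all traffic back to the stable version (the 'switch')."""
        with self._lock:
            self._use_canary = False

router = TrafficRouter(stable_endpoint="svc-v41", canary_endpoint="svc-v42")
print(router.route())   # svc-v42 while the new build is under evaluation
router.rollback()       # incident detected: flip the switch
print(router.route())   # svc-v41 – no rebuild, no redeploy
```

Because the stable version never left production, flipping the switch completes in minutes rather than the hours a full fix-forward cycle takes.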
A very nice, unexpected side benefit of TRS was load balancing of requests to a microservice across all instances of that service running in the cluster. For some background: in Part 1, I described how multiple instances of multiple microservices run across many FE and MT nodes for availability, performance, and scale. As shown in that figure, an Azure load balancer routes traffic to these nodes, and it does a very good job of spreading traffic equally among the FE nodes. However, it has no concept of a microservice, so requests to a given microservice may not be balanced across that service’s instances. Achieving that balance requires a load balancer that understands microservices – and it turns out TRS automatically serves this purpose. We noticed significant drops in CPU due to this kind of ‘more intelligent’ load balancing – the following charts show how the traffic is balanced (Figure 2 and Figure 3).
In Figure 2, each color represents an Azure node, the vertical axis shows traffic (requests per second), and the horizontal axis is time. As the chart shows, before we implemented TRS some nodes had very little traffic (some had none at all) while others had very high traffic. The result of this imbalance was unpredictable availability and resource usage on any given VM, and difficulty in capacity planning. After we implemented TRS, the imbalance disappeared: Figure 3 shows the same nodes/VMs and how well traffic is now balanced. This balance improved availability and resource usage, and gave us a better way to predict when we need to add capacity to handle growth. Another very nice effect of TRS load balancing is the CPU improvement. Because traffic was smoothed across all the VMs, Figure 4 shows that the CPU improvement was not only uniform across all nodes (before/after June 22), but usage also dropped significantly, from roughly 750 to under 250. In this chart, each color represents a node, the Y-axis is % CPU, and the X-axis is time; the reduction after June 22 is self-evident. (A toy sketch of this service-aware load balancing follows the figures.)
Figure 2: Uneven Traffic Distribution Before TRS to Microservice:
Figure 3: Even Distribution of Traffic by TRS to Microservice (the different colors overlap so much that they appear as a single color):
Figure 4: CPU Improvement Due to Better Load Balancing of Traffic to Microservice:
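For illustration, here is a toy sketch of the difference a service-aware router makes. A node-level load balancer spreads connections across nodes without knowing which microservice instances live on them; a router that tracks instances per microservice can round-robin across exactly those instances. The class, service name, and addresses below are invented – this is not TRS code, just the balancing idea.

```python
import itertools
from collections import defaultdict

class ServiceAwareRouter:
    """Toy router that balances per microservice, not per node."""

    def __init__(self):
        self._instances = defaultdict(list)   # service name -> instance addresses
        self._cursors = {}                     # service name -> round-robin iterator

    def register(self, service: str, address: str) -> None:
        self._instances[service].append(address)
        self._cursors[service] = itertools.cycle(self._instances[service])

    def pick(self, service: str) -> str:
        """Return the next instance of the requested microservice."""
        return next(self._cursors[service])

router = ServiceAwareRouter()
for node in ("fe-node-1", "fe-node-2", "fe-node-3"):
    router.register("policy-service", f"{node}:policy")

# Requests rotate evenly across the three instances, regardless of which
# node the front door happened to pick first.
print([router.pick("policy-service") for _ in range(6)])
```

Spreading requests evenly across instances is what produced the flat traffic lines in Figure 3 and the CPU drop in Figure 4.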
Conclusion & Future Evolution
This concludes our 4-part blog series on Microsoft Endpoint Manager/Intune’s evolution, from the initial stages when we had just begun hosting our services on Azure at a small scale, to an industry-leading service in performance, scale, reliability, and availability. Needless to say, we are on a relentless mission to keep improving the service across fundamentals, features, and maturity. Many future investments are underway: moving from stateful to stateless services (probably a blog post of its own on the why and how), resource governance and isolation with containers, infinite scale-out, seamless failovers, and many others. We hope you enjoyed this blog series.