The Modern Internet is Broken, and No One Knows How to Fix It
Before you read on, there are a few things that I need to state up front:
1. While I work for AWS, I am not speaking on behalf of AWS. This is about something that I, personally, took an interest in, and I have been researching this topic for several weeks.
2. None of this should be viewed as an attack on Microsoft. The challenge they face is unprecedented and, frankly, I think they’ve done the best they can.
So… What Are You Talking About?
The famous industrialist J. Paul Getty said,
“If you owe the bank $100, that’s your problem. If you owe the bank $100,000,000, that’s the bank’s problem.”
If you don’t work in technology, odds are that you’ve never heard of Microsoft’s Azure Active Directory (Azure AD), though you have almost certainly interacted with it, unknowingly. If you do work in technology, particularly if you are responsible for the ongoing health of an IT environment, you are certainly aware of Azure AD. More precisely, you are aware that Azure AD has had some issues.
Azure AD is the far-and-away leading provider of business user identity management in the cloud. With over 650,000 customer companies in the United States alone, we, the IT community, have “loaned” Microsoft near-complete control over how our users authenticate to and access the cloud and on-premises applications that our customers and employees need to go about their business.
We are the bank, and we have a problem.
So… What is Azure AD?
Microsoft launched its Azure cloud platform in 2010, around the same time as Office 365, which became generally available in June 2011. Beyond providing the latest version of the Office desktop apps we’re all familiar with, Office 365 included cloud-hosted, Microsoft-managed versions of the extremely popular Exchange email platform, SharePoint document sharing server, and Lync (which became Skype for Business… which became Teams). Moving this kind of business data to the cloud required a new approach to how companies manage user authentication and authorization.
In simple terms, “identity and access management” systems are set up to control who can do what with which bits of information. If Diane from accounting wants to read her email or access a file, we (as the IT team) need to ensure Diane is who she says she is (she provides her username and her password), confirm she still works for the company, and confirm that she has “permissions” to access that file. We maintain a database of all of our employees on a server (or servers) called a Domain Controller. The Domain Controller contains a roster ("directory") of all of our employees, their job titles, who they work for… all the things we might need to know in order to grant them permission to certain things on our network.
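To make those three checks concrete, here is a minimal sketch in Python. It is an illustration only, not how Active Directory or Azure AD is implemented: the `DirectoryEntry` type, the `make_entry` and `can_access` helpers, and the sample user and file name are all hypothetical, and the "directory" is just an in-memory dictionary standing in for a Domain Controller.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """One row of a toy user directory (hypothetical, for illustration)."""
    username: str
    salt: bytes
    password_hash: bytes
    active: bool = True                      # still works for the company?
    permissions: set = field(default_factory=set)


def hash_password(password: str, salt: bytes) -> bytes:
    # Derive a salted hash so the directory never stores the raw password.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def make_entry(username: str, password: str, permissions) -> DirectoryEntry:
    salt = os.urandom(16)
    return DirectoryEntry(username, salt, hash_password(password, salt),
                          True, set(permissions))


def can_access(directory: dict, username: str, password: str, resource: str) -> bool:
    entry = directory.get(username)
    if entry is None:
        return False
    # 1. Authentication: is this person who they say they are?
    supplied = hash_password(password, entry.salt)
    if not hmac.compare_digest(entry.password_hash, supplied):
        return False
    # 2. Is the account still active (i.e. still an employee)?
    if not entry.active:
        return False
    # 3. Authorization: does the account have permission for this resource?
    return resource in entry.permissions


# Hypothetical sample data.
directory = {
    "diane": make_entry("diane", "correct horse", {"accounting/q3-report.xlsx"}),
}
print(can_access(directory, "diane", "correct horse", "accounting/q3-report.xlsx"))
print(can_access(directory, "diane", "wrong password", "accounting/q3-report.xlsx"))
```

The point of the sketch is the ordering: identity is proven first, then employment status, then permissions. If the system that answers those questions is unreachable, none of the three checks can complete.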
In 2011, as is often still the case today, Domain Controllers lived in on-premises data centers. By moving Exchange, SharePoint, and Lync to the cloud, the world needed a way to leverage that same directory of users, passwords, and permissions to control access to applications and data in the cloud, without duplicating the user directories already built. Thus, Microsoft provided Azure AD. By pairing on-premises Domain Controllers with Azure AD through a process called “AD Sync,” companies can provide web-based user authentication to services hosted in the Microsoft cloud, without routing that authentication and authorization back to the corporate data center. Azure AD is the backbone of “anytime, anywhere” access to Microsoft cloud services, and it’s so critical that you can’t use the “365” services without it. The basic version of Azure AD comes free with every single subscription to a Microsoft cloud service because it is 100% necessary for everything from accessing your email online to determining whether or not your license for PowerPoint is valid.
Today, millions of customers around the world use Office 365 – over 258 million users in April 2020. That’s 258 million people authenticating to Azure AD, multiple times per day, 365 days a year. Azure AD Basic (the version included with an Office 365 license) provides access to Microsoft Teams, which added 95 million users in 2020 alone, growing by nearly 900% since the start of the COVID-19 pandemic. Besides Exchange Online, SharePoint Online, and Teams, other Microsoft services that rely on Azure AD include OneDrive, Intune (a device management service), Microsoft Managed Desktop, the Azure Portal itself, Yammer, Stream, Sway, Power BI, GitHub, and a whole host of services used to secure end-user devices in the enterprise. Premium versions of Azure AD also allow companies to provide Single Sign-On (SSO), which lets you access multiple applications with the same corporate-issued credentials through Azure AD. That makes Azure AD the potential gateway to thousands of non-Microsoft apps, like Zoom, Jira, Slack, Workday, Asana, Salesforce, HubSpot, ServiceNow, Dropbox, Intuit, and DocuSign, to name just a small percentage of the possibilities.
Maybe you aren’t in the business world at all. Do you think you don’t interact with Azure AD? You might want to think again. Azure AD powers the user log-on experience for the over 90 million monthly active users on Xbox Live. Perhaps you’re more of a PlayStation fan? Companies of all types and sizes use Azure AD Business-to-Consumer (B2C) to run their user sign-up and administration systems. Even if you’ve never opened a Word document in your life, I can guarantee that you are interacting with Azure AD behind the scenes. How can I be so sure? Hint: You’re reading this on LinkedIn.
The raw numbers behind Azure AD usage are astounding. Just over a year ago, Microsoft reported that Azure AD was providing secure access to “250 million monthly active users, connecting over 1.4 million unique applications, and processing over 30 billion daily authentication requests.” Supporting this massive scale is an architecture of over 300,000 CPU cores (December 2019), designed to handle this enormous amount of traffic.
So… What’s the Problem?
Ever heard that other famous saying? Something about eggs and baskets? Slowly but surely, Azure AD has been established as the predominant identity system for authentication to millions of business-critical applications and untold petabytes of business data. Perhaps second only to the root DNS servers of the Internet, Azure AD is the largest single point of failure in the history of modern technology.
Despite the incredible amount of hardware and engineering that has gone into keeping Azure AD online, it has suffered a steady run of high-profile outages. Although, to be fair, with so many daily active users, every outage is “high-profile.” Microsoft promises a 99.9% uptime SLA for Azure AD, and this will improve to 99.99% for Azure AD Premium customers by April 1st. A 99.9% SLA leaves room for roughly 8.8 hours of downtime per year. In real-world terms, accepting it is the equivalent of saying, “We are ok with Azure AD being unavailable for 4+ hours, twice per year.”
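If you want to check that arithmetic yourself, here is a quick sketch in Python. The function name `allowed_downtime_minutes` is mine, and the 30-day month is an assumption added for illustration, since SLA terms and measurement periods vary by contract.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 365.0) -> float:
    """Minutes of downtime an availability SLA permits over a period."""
    unavailable_fraction = 1.0 - sla_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


for sla in (99.9, 99.99):
    yearly = allowed_downtime_minutes(sla)
    # Assumption: a 30-day month, used here only as a second reference point.
    monthly = allowed_downtime_minutes(sla, period_days=30)
    print(f"{sla}% uptime: {yearly / 60:.2f} hours per year, "
          f"{monthly:.1f} minutes per 30-day month")
```

At 99.9%, the yearly allowance works out to about 525.6 minutes, or 8.76 hours, which is where “4+ hours, twice per year” comes from. At 99.99%, the allowance shrinks to under an hour per year.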
Microsoft has made many improvements to Azure AD over the years, and continues to do so. However, flaws in the system are appearing more and more as the system is stretched to its apparent limits.
- Azure AD is designed to be a globally-distributed system, but in September 2018, a lightning strike at the “South Central US” (San Antonio) Azure facility took down access to Microsoft online services around the world.
- Partitioning of the Azure AD system is supposed to mean that a single failure can only impact 2% of users, but a code defect in September 2020 blocked access attempts for 83% of users in the US, 19% in Europe, and 63% in Australia.
- A new “Safe Deployment Process,” designed to mitigate the types of errors that caused the September 2020 outage, was only half implemented when a “configuration change on the backend storage layer” took Azure AD down for several hours earlier this month, impacting users around the world.
One other thing you likely aren’t aware of, unless you spend a lot of time in this world, is that Azure AD also powers machine-to-machine or service-to-service authentication for customers’ deployments running on Azure. If Azure AD goes down, customer-deployed resources on the Azure cloud can also stop functioning, because Azure virtual machines and other cloud resources can't authenticate to Key Vault or Azure Storage Accounts.
In addition to Azure AD being a “practical” single point of failure for so much of the business IT world, it also appears that Azure AD itself suffers from some quite literal internal single points of failure in its own infrastructure. Publicly available information suggests that much of the critical infrastructure on which Azure AD runs may be located in just one data center. If you are interested in a highly-detailed and thoughtful exploration of this problem, I recommend reading the article here (h/t Dan Patrick).
So… What’s the Solution?
Exploring and researching potential solutions to this problem only reinforced for me how serious this problem could be. In short, I don’t think that there are any good solutions. Why?
I’ve read a fair bit of anti-cloud fear-mongering online, the bulk of which appears to be from individuals representing companies that want to sell you more hardware. Regardless of what others might say, for the majority of companies out there, there really is no architecting “around” Azure AD. If you use Office 365 for anything other than the software licensing itself, if you are using Exchange Online (yes, even in Hybrid Mode) or any of the other Microsoft hosted services, when Azure AD goes down, authentication requests will be rejected.
You might be thinking, “Azure AD isn’t the only cloud identity provider out there,” and you would be right. Google, Okta, Centrify, Ping Identity, and others can provide identity management solutions and SSO to the majority of cloud-hosted and on-premises applications. Each one works by maintaining your directory information in its own proprietary systems, or by accessing the user and permissions information stored in your on-premises Domain Controllers, while user authentication requests are routed to the cloud identity provider’s (IdP) servers. They can even be integrated with Azure AD, allowing you to federate identity services back into Microsoft hosted applications. However, the simple fact remains – if (when) Azure AD becomes unavailable, your employees will not be able to authenticate (“sign in”) to any service that relies on Azure AD to confirm the employee is licensed for that application and allowed to access the data.
Perhaps you see yourself as a bit of a control freak. You’re an iconoclast, and you refuse to let your IT department depend on services provided by someone else. If self-reliance is really your thing, the alternatives are simple. All you need to do is purchase (or build), test, deploy, configure, and manage your own solutions for email, ticketing, voice, video, chat, conferencing, code repositories, file sharing, document management, CRM, ERP, business intelligence… and, oh yeah, identity. Naturally, you’ll need to restrict user access (and productivity) by exclusively allowing connections from your own network or a VPN, a strategy that should work perfectly in the reality of today’s work-from-anywhere requirements.
Then, of course, you’ll watch your IT department become five times larger than any other division at your company. You know, the teams that are actually doing the things that help you find, win, retain, and support your customers.
So… What’s the Bottom Line?
Truly, I don’t think there are any good answers out there. Intentionally or unintentionally, we’ve allowed Azure AD to become a fundamental and critical element to how business is conducted in the modern world. There is no one thing that a customer or a vendor can do to completely alleviate or avoid this problem. Hardly any company at all, save for one – ahem – or two very large cloud providers can say that they can operate internally at 100% capacity without Azure AD, and interacting with customers is a much different story. Elsewhere, behind the scenes, Azure AD powers much of the business world and a fair bit of the consumer’s online experience as well.
All we can do is hope Microsoft fixes it, and fixes it fast.
Do you have a differing opinion? Did I get anything wrong? Please leave your feedback in the comments below. I will be happy to correct any factual errors.
Follow me on Twitter @the_CXO
Senior Systems and Networking Engineer
4 years ago: Brian - wonderful analysis as always. The most recent outage made me personally question the design and dependence on Microsoft systems. Of course, this is the 3rd outage that has impacted my delivery of services to my users. I’ve had the lucky experience of being impacted by the outages in September 2018 (during a migration to Exchange Online), September 2020 (during a reopen of my workplace post-COVID required closure), and then recently in March 2021. Much like the potential failure you referenced with DNS, authentication is another likely weak link in many business system designs. I’m very interested to see what solutions become available to mitigate these weak points!
Sr. Global Enablement Manager
4 years ago: Great article Brian, appreciate your research on this. Looks like this method may be a way to protect against Azure AD outages https://aws.amazon.com/blogs/security/how-to-enable-your-users-to-access-office-365-with-aws-microsoft-active-directory-credentials/