Why "the Cloud" is better, even when it's broken
Edit: For specifics about the rebate or whether you may be entitled to it, please see the terms and conditions in the SLA documentation. The SLA may differ depending on the service affected. The documentation defines what counts as "downtime" for that service, and it provides a formula/calculator you can use to see whether the SLA was breached.
This morning (30th Jan 2019 Australian Eastern Time) Microsoft’s cloud services suffered from an outage. The outage affected Office 365, the Azure portal and specifically user authentication.
A portion of network infrastructure that facilitates authentication requests is degraded, affecting access to one or more Microsoft 365 services.
For Australia, this affected services at the start of the business day. I was personally notified of the outage at approximately 7:30am Brisbane time (which translates to 8:30am Sydney time because Queensland hasn't grasped the concept of daylight saving). This is not an ideal time to have an outage where people can’t authenticate to Office 365, in particular for e-mail services.
Of course, during this time social media lit up with the #office365 hashtag as people fired shots at Microsoft, and posted tweets like the world was ending. There were people (joking…or maybe not) saying everything should be run back on-premise. I can only imagine what the business case for that would look like, to bring back the “golden era” of client/server computing of the early 2000s.
All jokes and jibes about the cloud aside, here is why the Cloud is better, even when there’s an outage.
You see, Microsoft has a Service Level Agreement (SLA) of 99.9% per month for Office 365 services. I know Data#3 was impacted by this outage for approximately 1 hour this morning. If the outage was shorter than about 45 minutes (0.1% of a 31-day month), then Microsoft didn’t even breach the SLA. But just for fun, let’s say the outage was 1hr. And let’s say the month counts from Jan 1st - Jan 31st. Assuming no outage tomorrow (31st of Jan), this puts the monthly uptime at roughly 99.87%.
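As a rough check on that arithmetic, here is a minimal Python sketch. It assumes a simple wall-clock calculation over a 31-day month, which is a simplification: the SLA documentation defines its own formula and its own meaning of "downtime". The function names are made up for illustration.

```python
# Simplified assumption: uptime is measured in wall-clock minutes over the month.
# The real SLA defines its own formula; this is only a back-of-envelope sketch.

def monthly_uptime_percent(downtime_minutes: float, days_in_month: int = 31) -> float:
    """Return the uptime percentage for a month with the given downtime."""
    total_minutes = days_in_month * 24 * 60
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 31) -> float:
    """Return how many minutes of downtime a given SLA target tolerates."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


uptime = monthly_uptime_percent(60)      # the hypothetical 1-hour outage
budget = allowed_downtime_minutes(99.9)  # downtime tolerated at 99.9%

print(f"Uptime for the month: {uptime:.2f}%")
print(f"Downtime allowed at 99.9%: {budget:.1f} minutes")
```

Under these assumptions a 60-minute outage works out to about 99.87% (99.86% if you truncate rather than round), and a 99.9% target leaves roughly 44.6 minutes of allowable downtime in a 31-day month.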
According to Microsoft’s SLA documentation here, if monthly uptime falls below 99.9% you are entitled to a 25% service credit.
25%! That’s free money….
…and the amount of credit goes up from there. If uptime drops below 99% you get a 50% service credit. If it drops below 95%, you get a 100% credit.
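The credit tiers described above can be sketched as a small lookup. The thresholds are the ones quoted in this article; the function name is hypothetical, and the SLA documentation remains the authority on what actually qualifies.

```python
# Tiers as quoted in this article: (uptime threshold %, service credit %).
# Ordered from the worst uptime to the best so the first match wins.
CREDIT_TIERS = [
    (95.0, 100),
    (99.0, 50),
    (99.9, 25),
]


def service_credit_percent(uptime_percent: float) -> int:
    """Return the service credit (as a percentage) for a monthly uptime figure."""
    for threshold, credit in CREDIT_TIERS:
        if uptime_percent < threshold:
            return credit
    return 0  # SLA met, no credit


print(service_credit_percent(99.87))  # the 1-hour outage example
```

An uptime of 99.87% falls under the 99.9% threshold but above 99%, so it lands in the 25% tier.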
Another thing to think about....
When Office 365 or Azure Services go down – who do you think is working on the problem? Any issue with Microsoft’s crown jewels of cloud services and they will immediately have the smartest people in the organisation working on the issue. It is not in their interest to prioritise something else over an outage that is impacting subscribers.
Now let’s look at this same issue if it were an on-premise solution that an organisation had to deal with themselves. Firstly, it would probably take more than an hour to even work out you had an issue.
- Several users call the service desk to report email is down and/or logon issues
- The first 5 calls are put down to coincidence until someone identifies it as a wider issue
- Service desk escalates the issue
- Someone from Tier 2/3 support looks at the ticket, pokes around in logs, checks mail flow, performs their own tests etc.
- T2/3 support restarts services, maybe reboots a server, asks around if anybody deployed a patch this morning, checks monitoring software to see if any alerts show anything obvious, does a Google/Bing search and follows half a dozen dead ends etc.
- IT Manager makes the call to log a support ticket with Microsoft
- 35 mins is spent on the phone with Microsoft Premier Support and the call is logged
- Microsoft will then get back to you, depending on your support agreement and SLA. Maybe it’s within the hour, maybe it’s 4hrs, maybe it’s next business day.
- Microsoft finally fix the issue after phone call tennis is played because you are in completely different time zones
….and after all that, you don’t get back 25% of anything.
THIS is why the Cloud is still the better option even when it’s broken.
Your mileage may vary and the scope of this outage impacted different organisations in different ways.
Here in Brisbane at 7:30am for me it simply meant extra time for a coffee. Half the office wasn't even at work yet and didn't even know there was an outage. Elsewhere in the world this may have had a longer, more business critical impact. But regardless of whether the “cloud is down” or your “on-prem infrastructure is down” – the speed at which the service is restored and therefore your business continuity will always be quicker in the cloud.
The argument for keeping commodity infrastructure on-premise (such as email and file storage) is only valid if there is never, ever a single problem with the on-premise infrastructure. Which we all know is never the case.
Top 50 UC Expert. AI Show co-host. Leader BCStrategies. Analyst/Consultant for orgs and vendors.
6 years ago: Good article. Organizations wary of cloud outages should compare their on-premises track record; however, many organizations don't bother to track outages when they are running Exchange or other services on-prem.
Public Sector | Enterprise Account Manager
6 years ago: Well put.
Senior Data Scientist at Atlassian
6 years ago: Many good points. Awesome post.
FastTrack Architect @ Microsoft | Trusted Advisor for Microsoft 365
6 years ago: Wow! I was not expecting this to be picked up by IT News. For anybody that comes here to read the article, read the SLA and understand the process and requirements for making a claim if you are seeking compensation. The different cloud services have different levels of SLA, and there is also a formula for calculating how you've been affected, which is detailed in the documentation.
IT Consultant
6 years ago: Nice post, and you were quoted! https://www.itnews.com.au/news/microsoft-may-need-to-credit-customers-for-cloud-login-outage-518654