A 10-Step Program for Improving your Organization's Azure Incident Management
Image generated with Bing Image Creator

I'm by no means a paragon of successful New Year's resolutions. Whether it's "Work out every day before the March break beach vacation", "Drink no more than 2 coffees per day", or "Get better organized for this year's tax filing", I've attempted various resolutions to little avail. Nevertheless, there are certain areas of self-improvement where, once they're called out by a respected professional in your life (your doctor, for example), you finally embrace the necessary micro-habits and just get it done.

As we dive into 2025, I think this is an important concept to port over to the business world, specifically with a focus on improving Business health outcomes in the cloud-dependent era. With more and more of an organization's workloads, services and business processes depending on cloud offerings such as Azure, it's vital that such organizations be as prepared as possible for the 'Before', 'During' and 'After' of an incident with Azure services.

To that end, and to steal another overplayed tradition of sharing an annual "Top-10" list, here are my Top-10 recommendations for improving your organization's Azure incident management maturity and, in turn, the health of the business.

Note that a number of items on this list are specifically curated for organizations with Microsoft Unified Support partnerships, not only because that's the world I live within, but also because Unified is the most important partnership that your organization should have in place for run-state success of Production workloads on Azure.

  1. Ensure your Azure incident/support stakeholders complete the "Introduction to Azure Incident Readiness" training module on Microsoft Learn: It takes less than 45 minutes, covers the foundational concepts of Azure incident management, outlines the Support ticket process, and introduces the audience to Azure service health alerting.
  2. Ensure your Azure incident/support stakeholders have proper RBAC roles to log Azure Support cases (https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request): Follow through on the guidance from the aforementioned training, before an incident actually occurs!
  3. Ensure your Azure incident/support stakeholders have their Unified Access ID in case they ever need to log an urgent Support case by phone (https://learn.microsoft.com/en-us/services-hub/sfbus/support-for-business/access-id-prof-support): In the rare event that the Azure portal itself isn't accessible, it's still possible to log Support cases by phone; however, this requires the caller to have their unique "Access ID". Work with your Microsoft Customer Success Account Manager (CSAM) ahead of time on this.
  4. Be intentional (not passive) with Azure service health alerts (see "Receive Service health alerts on Azure service notifications using Azure portal" in the Azure Service Health documentation on Microsoft Learn): If your alerts just land in a shared mailbox that the Ops team monitors as it sees fit, that's not enough. Configure push notifications to designated on-shift Ops personnel, and integrate the alerts directly into your organization's ITSM platform.
  5. Enable Unified Support for the appropriate Azure subscriptions: Validate with your Microsoft CSAM, on a recurring basis, which of your Azure subscriptions require Unified-level Support.
  6. Establish your Azure Major Incident Response Plan ("MIRP") and integrate it into your organization's BCP/DR/EM processes: The MIRP is in fact a required element of the Unified Support partnership if you leverage Microsoft cloud services such as Azure or M365, so if you don't yet have one established, contact your Microsoft CSAM right away and get the activity scheduled. Your CSAM can run a foundational version of the activity, or you can run through a more robust MIRP design engagement with a Microsoft Cloud Solution Architect.
  7. Refresh your Azure MIRP every 6 months, running it through a simulation exercise if needed: The MIRP is only as useful as the accuracy and currency of the information it contains. Work with your Microsoft team to refresh it every 6 months, and consider running through the MIRP Simulation engagement offered through your Unified partnership to ensure your incident stakeholders are comfortable with the process before an actual incident occurs.
  8. Run your highest-business-impact Azure-dependent workloads through a reliability review: First and foremost, ensure your organization (with input from Business stakeholders) is actually identifying which workloads are of highest importance in terms of reliability and resiliency (https://learn.microsoft.com/en-us/training/modules/azure-well-architected-reliability). From there, it's strongly recommended to run those workloads through the 'Well-Architected Reliability Assessment' (WARA). For less critical workloads, Microsoft offers self-service online assessments (https://learn.microsoft.com/en-us/assessments/browse). Your most critical workloads, though, should go through the WARA engagement that is available through the Unified partnership and led by a Microsoft Cloud Solution Architect.
  9. Determine with your Business leads whether your Azure environment, dependent workloads, and/or dependent events warrant Mission Critical Services (https://www.microsoft.com/en-us/microsoft-unified/mission-critical-services): For your most critical dependencies on Azure, consider whether to take advantage of Microsoft's most advanced program alignment.
  10. Don't forget about Security!: While the practices of "Service Incident Management" and "Cyber Security Incident Management" do share certain elements, they also differ in many ways, including tasks, stakeholders, and possibly contractual/regulatory obligations. Ensure your organization works with Microsoft to establish and maintain a Cyber Security Major Incident Response Plan. Such a plan may end up merged with or otherwise referenced from your Azure & M365 MIRPs, though at a minimum it should be incorporated into your SecOps framework. Reach out to your Microsoft CSAM for info on the many areas where the Unified team can assist with Cyber Security Incident Management planning & response.
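
To make item 4 a little more concrete, here is a minimal Python sketch of what "integrating alerts into your ITSM platform" could look like on the receiving end of a webhook. Everything in it is an assumption for illustration: the `Ticket` shape, the priority mapping, the roster format and the sample payload are hypothetical, and the payload field names (`monitoringService`, `incidentType`, `trackingId`) are assumed to follow the Azure Monitor common alert schema, so verify them against a real alert from your own environment before relying on them.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Ticket:
    """Hypothetical ITSM ticket shape; adapt to your own platform."""
    title: str
    priority: str
    assignee: str
    tracking_id: str


# Hypothetical mapping from Service Health event type to ITSM priority.
PRIORITY_BY_INCIDENT_TYPE = {
    "Incident": "P1",
    "Security": "P1",
    "ActionRequired": "P2",
    "Maintenance": "P3",
    "Informational": "P4",
}


def on_shift_engineer(roster, now):
    """Return the name whose [start, end) UTC-hour window contains `now`."""
    for start_hour, end_hour, name in roster:
        if start_hour <= now.hour < end_hour:
            return name
    raise LookupError("no one is on shift; fix the roster")


def build_ticket(payload, roster, now):
    """Turn a Service Health alert payload into a ticket for the on-shift engineer.

    Returns None for alerts that are not Service Health alerts.
    """
    essentials = payload["data"]["essentials"]
    if essentials.get("monitoringService") != "ServiceHealth":
        return None
    props = payload["data"].get("alertContext", {}).get("properties", {})
    incident_type = props.get("incidentType", "Informational")
    return Ticket(
        title=props.get("title", essentials.get("alertRule", "Azure Service Health alert")),
        priority=PRIORITY_BY_INCIDENT_TYPE.get(incident_type, "P3"),
        assignee=on_shift_engineer(roster, now),
        tracking_id=props.get("trackingId", "unknown"),
    )


# Made-up sample payload, trimmed to the fields this sketch reads.
sample_alert = {
    "data": {
        "essentials": {"monitoringService": "ServiceHealth", "alertRule": "sh-prod"},
        "alertContext": {
            "properties": {
                "title": "Example: storage connectivity issue",
                "incidentType": "Incident",
                "trackingId": "TEST-123",
            }
        },
    }
}

# Hypothetical follow-the-sun roster, expressed in UTC hours.
roster = [(0, 8, "ops-apac"), (8, 16, "ops-emea"), (16, 24, "ops-amer")]

ticket = build_ticket(sample_alert, roster, datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc))
print(ticket)
```

The point of the sketch is the routing decision rather than the plumbing: every Service Health alert ends up with a named on-shift owner and a priority, instead of sitting in a shared mailbox.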


I'm no doctor, though I'll claim that this Top-10 list is as close as you can get to a "professional opinion", and I hope you find it helpful if you or your team is involved in the run-state success of Azure-dependent workloads. And forget about New Year's resolutions; just tackle each item on its own as a micro-habit, and your organization will improve its Business health outcomes. Happy start of 2025!!
