On-Call Life: Issue 01


Takeaways from the 2024 CrowdStrike Incident

Welcome to the first issue of On-Call Life, your go-to source for all things IT. In this newsletter, I will discuss recent events in general IT, cybersecurity, software development, and everything in between. I’ll kick it off with the key takeaways from last month’s talk of the town: the CrowdStrike incident.

Whether you were facing the Blue Screen of Death (BSoD), waiting for a delayed flight, or just scrolling on your phone, there was no escaping the situation on July 19th. But what really happened?

In February, a feature was added to the CrowdStrike Falcon sensor to help it identify potential new attacks that could misuse Windows capabilities. After testing this feature, CrowdStrike released two updates, one in March and one in April, and both worked as expected. However, on July 19th a third update to the sensor was released to some Windows systems. The sensor expected 20 inputs but received 21, and that mismatch caused the system crash that led to the outage.
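To make the mismatch concrete, here is a minimal Python sketch of the general safeguard: checking the number of inputs before processing an update. This is not CrowdStrike's code. The names `EXPECTED_INPUTS`, `validate_inputs`, and `apply_update` are hypothetical, and the count of 20 is simply the figure mentioned above.

```python
# Hypothetical illustration only; not CrowdStrike's implementation.
EXPECTED_INPUTS = 20  # assumed from the description of the incident above


def validate_inputs(inputs):
    """Return True only when the update supplies exactly the expected number of inputs."""
    return len(inputs) == EXPECTED_INPUTS


def apply_update(inputs):
    """Reject a mismatched update instead of processing it and crashing."""
    if not validate_inputs(inputs):
        return f"rejected: expected {EXPECTED_INPUTS} inputs, got {len(inputs)}"
    return "applied"


print(apply_update(["value"] * 20))  # matches the expected count
print(apply_update(["value"] * 21))  # one input too many, so it is rejected
```

The point of the sketch is that a mismatch gets caught and reported rather than being allowed to take the whole system down.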

What Was the Impact of the CrowdStrike Incident 2024?

This incident had a global impact and is being called one of the largest outages in history. There were over 8.5 million computers affected worldwide, and over 2,500 canceled flights in the U.S. alone.

The incident put organizations’ dependency on these systems front and center, forcing teams to reevaluate their business continuity plans and reconsider how they handle crises.

Key Lessons from the CrowdStrike Incident 2024

While this incident was unpreventable for many organizations, it brought to light the need for teams to have effective plans in place to ensure the seamless delivery of their services. So, it is important to take a step back and consider which lessons we can apply in the future.

Ensure the swift identification of incidents

Identifying incidents quickly is essential to maintaining business continuity. Unfortunately, many professionals woke up on July 19th to the BSoD, unsure of how to move forward with their day, which left IT teams spread thin responding to questions and tickets. To avoid situations like this, teams must have a strong monitoring and alerting workflow that gives them visibility into the status of their systems. Alerting solutions also keep internal communications up to par: they deliver immediate alerts to the IT team as soon as an incident is identified, and the IT team can then send a statement to the rest of the staff about the status of internal systems. By proactively monitoring all endpoints and systems, teams can keep everyone on the same page, freeing up IT support to focus on understanding the problem and coming up with solutions.
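As a rough illustration of that workflow, the Python sketch below turns a batch of health-check results into a single staff-facing status message. It is a simplified sketch under stated assumptions, not a real monitoring integration: `HealthCheck`, `find_incidents`, and `build_alert` are hypothetical names, and the results are hard-coded where a real setup would pull them from a monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """One monitoring result for a single system (hypothetical structure)."""
    system: str
    healthy: bool


def find_incidents(checks):
    """Return the names of all systems that failed their health check."""
    return [check.system for check in checks if not check.healthy]


def build_alert(incidents):
    """Build one status message for staff, or None when everything is healthy."""
    if not incidents:
        return None
    return "ALERT: " + ", ".join(sorted(incidents)) + " unhealthy; IT is investigating."


# Hard-coded sample results standing in for a real monitoring feed.
checks = [
    HealthCheck("vpn", True),
    HealthCheck("laptops", False),
    HealthCheck("email", False),
]
print(build_alert(find_incidents(checks)))
```

Sending one consolidated message like this is what keeps the rest of the staff informed without flooding the IT team with individual tickets.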

Having a robust business continuity plan is crucial in the modern cyber landscape

24/7 service delivery is the new norm for clients everywhere, and prolonged downtime is no longer tolerated: clients will simply cancel your service and find another solution. With this in mind, teams must prioritize reevaluating their business continuity plan so that it reflects how they will proceed if a situation like the CrowdStrike incident comes up again. I will outline best practices for business continuity later in this issue to help teams re-examine their plans.

The need for reducing dependency on a single point of failure

Switching to a multi-cloud strategy wouldn’t have prevented the impact of the CrowdStrike incident on your business. However, it is a topic that was emphasized throughout the incident, which revealed just how many organizations depend on a single point of failure. By implementing multi-cloud, teams gain a redundancy cushion that minimizes the risks and impacts of an outage at one of their providers. It is also important to note that a multi-cloud strategy can be significantly more complex to monitor and maintain, so teams must ensure that they have the resources to navigate the intricacies of their dependencies.
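The redundancy idea can be sketched in a few lines of Python: try the primary provider first and fall back to the next one if it is unreachable. This is a toy example under stated assumptions. `call_with_failover`, `primary`, and `secondary` are hypothetical names, and the providers are stand-in functions rather than real cloud services.

```python
def call_with_failover(providers, request):
    """Try each (name, handler) pair in order and return the first success."""
    errors = []
    for name, handler in providers:
        try:
            return name, handler(request)
        except ConnectionError as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary(request):
    """Stand-in for a provider that is currently down."""
    raise ConnectionError("outage")


def secondary(request):
    """Stand-in for a healthy backup provider."""
    return f"handled {request}"


print(call_with_failover([("primary", primary), ("secondary", secondary)], "ping"))
```

Even in this toy version, the extra complexity is visible: every provider added is another dependency to monitor and maintain.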

Business Continuity Best Practices

To help teams optimize their business continuity plan, I have compiled a list of five best practices for reevaluating and planning for unexpected events:

Expand the Scope of Your Business Continuity Plan

Even if your team has conducted the most thorough risk assessment, this incident made it clear that new incidents and threats can surface at any time. So, after a new incident has been identified and dealt with, your focus on the issue should not stop there. Teams must perform a fresh risk assessment to expand the scope of the business continuity plan and include previously unlisted incidents.

Conduct Routine Reviews

As implied above, your business continuity plan cannot be a static document. Teams should routinely review the content and effectiveness of their plan. They can do this through incident simulations, where they recreate common incidents in a controlled environment and follow the plan from start to finish. The response team can then identify the challenges and bottlenecks they faced and improve the plan before an actual event occurs, resulting in shorter response times and more accurate resolutions.

Implement Strong Internal Communication Plans

Your business continuity plan is only as good as your communication plan. Teams must be able to quickly mobilize their response teams in times of crisis and update other stakeholders as the incident progresses. With solutions like OnPage, teams can accomplish both. When OnPage is integrated with monitoring tools, response teams receive immediate, distinguishable alerts as soon as an incident is detected. Plus, with its mass notification solution, BlastIT, teams can deliver notifications simultaneously via SMS, email, and phone, so all stakeholders are aware of the incident and can rest assured that it is being handled by the right team members.

Employ Monitoring Solutions

As mentioned before, monitoring is crucial to identifying and resolving critical incidents. With monitoring solutions, teams gain 24/7 visibility into the status and health of their systems, allowing them to proactively identify and eliminate vulnerabilities and to act quickly when an incident strikes without warning. This enables them to maintain business continuity for their clients and ensure their satisfaction.

Maintain Comprehensive Documentation

Documenting incidents is one of the most important parts of a business continuity plan, because past records are a useful reference when dealing with similar incidents in the future. Plus, by having everything saved in writing, teams can prevent miscommunication about the timeline or root cause of an event. Lastly, some industries are required to maintain these documents for compliance, so teams must research which documentation is required for their organization’s services.

Conclusion

At the end of the day, the CrowdStrike incident brought to light the need to understand the dangers of depending on a single point of failure and to establish strong plans for continuity. Hopefully, you were able to gain some insight into the incident itself and into how to improve your team’s ability to maintain seamless operations.

Thank you for reading this issue of On-Call Life, by Zoe Collins. Subscribe for more!
