72 hours of CrowdStrike Falcon Snafu

As everyone moves from response to recovery, people are looking to the lessons-learnt phase of the incident. What can be said about the situation at the moment? In an interconnected world, having no services can be very problematic.

What we know about the incident

CrowdStrike rolled out an update to Falcon that updated its detection content for new attacks. In this case, a threat actor group had modified its use of IPC pipes, and CrowdStrike engineers had formulated a new method to detect this changed usage.

The update caused Windows machines with Falcon installed to crash with a Blue Screen of Death (BSOD). The BSOD appears when a piece of code fatally crashes the operating system, and it provides operators with some details of the problem.

The update file itself was not a kernel driver, despite having the .sys extension, but when it was processed in kernel protected mode the code attempted to access memory that was unreadable: the access was via a C++ pointer in a loop reading values from a table in which some entries were invalid.

The operation involved a system driver with privileged access to the computer. To protect system integrity, the operating system had no choice but to crash immediately. Non-privileged code can usually recover from such a fault by terminating the program; privileged, kernel-mode execution cannot.

This could be avoided by checking for NULL or invalid pointers before dereferencing them, by adding an exception-handling mechanism, or by using tools that check for this type of error.
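To make the failure mode concrete, here is a minimal sketch of the reported pattern: a loop reads entries from a table through pointers and one entry is invalid. The structure and names (TableEntry, sum_guarded) are illustrative assumptions, not CrowdStrike's actual code; the point is that a simple pointer check lets the code skip or reject a bad entry instead of dereferencing it and, in kernel mode, bringing down the whole system.

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Illustrative table entry: a pointer to detection data plus a length.
struct TableEntry {
    const uint8_t* data;   // may be null/invalid in a corrupt content file
    uint32_t length;
};

// Unguarded version (commented out): dereferences every entry. In kernel
// mode a null or wild pointer here raises an unhandled page fault -> BSOD.
// uint32_t sum_unguarded(const TableEntry* table, std::size_t count) {
//     uint32_t total = 0;
//     for (std::size_t i = 0; i < count; ++i)
//         total += table[i].data[0];        // crashes if data is invalid
//     return total;
// }

// Guarded version: validate each pointer before reading through it.
uint32_t sum_guarded(const TableEntry* table, std::size_t count) {
    uint32_t total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (table[i].data == nullptr || table[i].length == 0) {
            continue;                         // skip or reject invalid entries
        }
        total += table[i].data[0];
    }
    return total;
}

int main() {
    const uint8_t good[] = {42};
    TableEntry table[] = {{good, 1}, {nullptr, 0}};  // second entry is corrupt
    std::printf("%u\n", static_cast<unsigned>(sum_guarded(table, 2)));  // prints 42, no crash
    return 0;
}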

C++ is an old language, well known for allowing unsafe memory access. More modern languages have mechanisms to protect against unsafe memory operations and fail more gracefully, so faulty code does not crash the operating system. Newer versions of C++ do have support for protection against unsafe memory operations.
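As a user-mode illustration of what failing more gracefully looks like, the sketch below uses std::vector::at(), which performs a bounds check and throws an exception on a bad index instead of reading arbitrary memory, so the program can catch the error and carry on. (Kernel-mode driver code generally cannot rely on C++ exceptions in this way, which is part of why defensive validation matters so much there.)

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> table = {10, 20, 30};

    // operator[] with an out-of-range index is undefined behaviour
    // (an unsafe memory access); .at() checks the bounds and throws
    // std::out_of_range instead.
    try {
        int value = table.at(7);   // index past the end of the table
        std::cout << value << '\n';
    } catch (const std::out_of_range& e) {
        // The program recovers gracefully instead of crashing.
        std::cerr << "invalid index rejected: " << e.what() << '\n';
    }
    return 0;
}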

As C++ has low-level capabilities close to machine code, developers can structure their code to make every detail of the OS fast and energy efficient. This makes C++ well suited to operating-system and driver development.

Impacted computers ended up in a BSOD loop: every time a machine started, it would read the corrupt data and crash again.

The fix is to deal with the corrupt update file, and the easiest way was to delete the old file and let Falcon retrieve a new update file without the issue. This is easier said than done with modern servers and large-scale deployments of workstations.

Booting a single workstation into safe mode and manually deleting the file is not a problem; however, with large-scale deployments of Falcon it requires manual intervention on every workstation, something that most IT departments or Managed Service Providers (MSPs) are not set up to do in a short period of time.
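For illustration only, this is roughly what the widely circulated manual workaround amounts to, expressed as a small C++ sketch. The directory and the C-00000291*.sys file-name pattern are the commonly reported ones and are assumptions here, not an official CrowdStrike tool; in practice the deletion was done by hand from Safe Mode or a recovery environment.

#include <filesystem>
#include <iostream>
#include <string>
#include <system_error>

int main() {
    namespace fs = std::filesystem;
    // Commonly reported location of Falcon channel files (assumption).
    const fs::path dir{"C:\\Windows\\System32\\drivers\\CrowdStrike"};

    std::error_code ec;
    for (const auto& entry : fs::directory_iterator(dir, ec)) {
        const std::string name = entry.path().filename().string();
        // Match the reported faulty channel file pattern: C-00000291*.sys
        if (name.rfind("C-00000291", 0) == 0 && entry.path().extension() == ".sys") {
            if (fs::remove(entry.path(), ec)) {
                std::cout << "Deleted " << entry.path() << '\n';
            }
        }
    }
    return 0;
}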

Servers in a data centre or server farm are often remote from the organisation and require an engineer to visit, restart the machine and manually remove the corrupt update file; again, this is difficult for most IT departments and MSPs to scale up in a short period of time.

This problem was exacerbated on devices with Windows' BitLocker disk encryption enabled, which corporations often require to increase security, because fixing the problem could need a recovery key stored on a server that had also crashed.

Gradually, tools are being developed to ease the task of recovering machines. Out-of-Band (OOB) management or Lights-Out Management (LOM), and booting into a recovery environment with network access, can allow a more scalable approach if such management tools are in place.

The impact

The impact of this has been considerable, far greater than the WannaCry attack that impacted a third of a million machines and brought the UK NHS to a halt.

In figures, around 8.5 million Windows machines, less than 1% of the total number of Windows machines worldwide, were affected by the CrowdStrike Falcon update.

At the time of the incident, CrowdStrike said it had more than 24,000 customers, including nearly 60% of Fortune 500 companies and more than half of the Fortune 1,000.

The outage impacted 674,620 direct customer relationships and over 49 million indirectly, according to Interos data. While the U.S. was the most affected country, with 41% of impacted entities, the disruption was also felt at major ports and air freight hubs in Europe and Asia. Ports from New York to Los Angeles and Rotterdam reported temporary shutdowns, while air freight suffered the hardest blow, with thousands of flights grounded or delayed.

This event turned into possibly the biggest single-event supply chain disruption so far. With the interdependencies between so many organisations, it could easily be exceeded in the future.

By 8pm on the same day, 5,078 flights had been grounded worldwide - which equates to 4.6% of all scheduled flights globally, according to aviation analytics company Cirium. Cirium added that 167 flights that should have left UK airports had been cancelled - 5.4% of departures - while 171 inbound arrivals were cancelled.

The NHS was hit by the outage as "the majority of GP practices" in England and two thirds of those in Northern Ireland were impacted, while Southwestern Railway said its ticket vending machines had failed.

Customers also reported issues with supermarket payments, online banking and communications systems including Microsoft Teams.

One of the country's largest broadcasters, Sky News, went off air on Friday morning, leaving viewers faced with a message apologising for an interruption in transmission. CBBC was also taken off air following the outage, with TV viewers informed 'something's gone wrong'.

In sport, Manchester United warned that fans would experience delays getting matchday tickets after the failure affected its systems, while the Mercedes F1 team said it was working to rectify issues ahead of the Hungarian Grand Prix.

On Friday, UK Government officials convened an emergency COBRA meeting amid the crisis, after major disruption was reported at several major travel hubs.

China and other countries, including Russia, North Korea and Iran, were largely unaffected, as they have been trying to move away from Microsoft to reduce dependency and would not install security software that sends data to a US company.

Secondary security problems

It is the very nature of threat actors to take advantage of situations and monetise them to their advantage.

Very shortly after the cause was discovered, articles appeared on LinkedIn and other forums with instructions to delete the corrupt channel file and reboot the machine. Some of these originated with CrowdStrike, while others copied them or wrote their own sets of instructions. This led to a situation where threat actors started posting malicious 'fixes' that allowed them to socially engineer those affected into taking actions that were potentially more harmful.

Within hours, domains were being registered to mimic CrowdStrike and pose as sites offering fixes for the problem. These sites were malicious, put up by threat actors with harmful payloads and instructions.

VirusTotal now has details of tools purporting to be fixes that are actually a mix of malicious tools, including Remote Access Tools (RATs).

With an impact as large as this and teams under pressure to fix it, it is a perfect storm for threat actors to take advantage of, and they are very agile in their approach to monetising such situations.

Was this a cyber security problem

This is a point being debated by a number of security professionals. CrowdStrike says it was not a cyber security incident, as computers were not left unprotected, but that definition is very defensive and narrow. A few security professionals say that the availability of machines, although part of the CIA triad, should not make this a security incident; again, their definition is quite narrow.

It was not a cyber attack or the result of one; however, in my humble opinion it was a security incident for all the companies impacted by the issue. Although the incident was accidental and not the result of malicious actions by a threat actor, it affected the availability of systems. The security triad is Confidentiality, Integrity and Availability; incidents do not need to be malicious, they can be accidental, and they can be man-made or a force of nature. Impacted organisations could not carry out their mission or deliver services to their clients, and a large number of people were affected, so it needs to be considered a security issue.

The ICO considers the unavailability of Personally Identifiable Information (PII) to data subjects when they require it to be a breach, so this incident could have put UK and EU institutions in breach of GDPR regulations.

To protect against the impact of an incident like this, organisations should have cyber resilience measures that allow them to continue their mission even if they or their supply chain are impacted. Incident response plans should cater for this type of incident and enable responders to scale up deployment of remediation actions at short notice.

What was the cause

It will be a while before a definitive root cause analysis is issued by CrowdStrike, but this looks like a compound failure of people, processes and technology. An unfit update was rolled out to a production environment and impacted virtually the whole world; this can only occur if failures occurred in all the pillars of security: people failed, processes failed, and technology failed.

The principles of a secure development lifecycle, something that CrowdStrike should be aware of and working to, cover the testing and sign-off of code as it moves from development, into test and then into production. Failure to do this would be a process problem, along with potentially a human problem if code was signed off without adequate testing. There are numerous tools that automatically test C++ code for this type of error, so potentially CrowdStrike was not using the appropriate development and testing tool suites.
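As an illustration of the kind of automated check that could catch this class of bug before release, the sketch below feeds a deliberately malformed table to a hypothetical validation routine (validate_channel_table is my invention, not CrowdStrike code) and expects it to reject the input rather than read through an invalid pointer. Run as a unit test, ideally with sanitisers such as AddressSanitizer enabled, an unguarded read here would fail the build long before the content reached production.

#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a content-file table entry.
struct Entry {
    const char* pattern;   // detection pattern; null models a corrupt entry
    uint32_t flags;
};

// Returns true only if every entry is safe to use. A real validator would
// also check counts, offsets and magic numbers in the content file.
bool validate_channel_table(const std::vector<Entry>& table) {
    for (const Entry& e : table) {
        if (e.pattern == nullptr) {
            return false;              // reject the file instead of dereferencing
        }
    }
    return true;
}

int main() {
    // Deliberately malformed input, mimicking "some values were invalid".
    std::vector<Entry> bad = {{"example-pattern", 1}, {nullptr, 2}};
    assert(!validate_channel_table(bad));   // must be rejected

    std::vector<Entry> good = {{"example-pattern", 1}};
    assert(validate_channel_table(good));   // valid content passes
    return 0;
}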

The questions that need to be answered are:

  • How did an unfit update make it through a company's testing and get signed off for release to production, and deployed to millions of machines, without being picked up?
  • How can similar incidents in the future be remediated without large-scale physical deployment of resources?
  • How should those impacted be compensated for the failure of an IT system? People lost money as flights were cancelled because boarding systems were unavailable, as the computers running those systems, or something the systems depended on, crashed and could not be restarted.

Legal background

From an insurance perspective, CrowdStrike is said to be only minimally liable for the damage or lost revenue caused. The terms for CrowdStrike's Falcon software limit liability to 'fees paid', so the maximum compensation an affected company could recover would be the fees that company has paid to CrowdStrike. There is likely to be a review of how liability can be limited when the consequences of an event are so dramatic for the general public.

The consequences of this event are a key focus area of the EU Digital Operational Resilience Act (“DORA”), which will apply to financial entities from January 2025. DORA will require financial services firms to assess their own internal IT concentration risk before entering into IT contracts. In particular, DORA focuses on the need for firms to understand how subcontracting affects concentration risk: a business may have two suppliers providing a similar service, but if they both rely on the same cloud provider then there may be a hidden single point of failure.

DORA will also allow the European Supervisory Authorities (“ESAs”) to designate certain IT service providers as “critical ICT third-party service providers” and establishes an oversight regime for them. A number of factors are to be taken into consideration, including the impact that an outage in the provider’s systems would have on the financial system given the number of financial entities relying on them. The legislation makes clear that it is targeted primarily at the major cloud vendors, but the CrowdStrike incident demonstrates that concentration on a small number of software vendors at the on premises and endpoint level can be just as risky as concentration on the cloud hyperscalers.

Security updates are likely to become even more common in the coming years, in part because of various legislative efforts encouraging or requiring hardware manufacturers and software developers to provide them (such as the EU Cyber Resilience Act and the UK Product Security and Telecommunications Infrastructure regime). Businesses need to ensure that their IT change management processes are robust and efficient enough to capture and process these changes in a safe and timely manner.

In terms of the Cyber Security and Resilience Bill newly announced in this year's King's Speech in July, MSPs and ISPs will need to be able to respond to and support clients when events like this occur. The bill will mandate a proactive approach and responses.
