Failure Is Predictable – So Why Aren’t We Better Prepared?

Last Friday’s Disruption

Last Friday, Barclays Bank experienced a significant service disruption affecting its digital banking platforms. The timing couldn’t have been worse: the outage overlapped with HMRC’s self-assessment tax deadline. Barclays may say it moved swiftly to communicate with customers and implement mitigations, but media reports suggest otherwise. The incident raises important questions about how financial institutions handle critical operations and maintain resilience under pressure.

Understanding the Impact

The outage began around 5am on Friday, 31 January, with peak disruption at 7am and 9:20am—prime times for both retail and business customers. While cards and ATMs remained operational, many users struggled with:

  • Online and mobile banking access
  • Payment processing
  • Customer service (including “Message Us” and telephone lines)
  • Tax payment submissions on a crucial filing day

For a major bank handling billions of transactions across tens of millions of accounts, even a brief outage can cause widespread disruption.

Critical Risk Management Concerns

  • Predictable Peak Timing: With the self-assessment deadline looming, additional transaction volumes were foreseeable. Operational teams, supported by the risk function, should anticipate such spikes, and institutions should adopt heightened vigilance around crucial dates.
  • Questionable Resilience: A multi-day digital outage is frankly unacceptable for a major bank. Robust ICT risk management frameworks and rapid service restoration must be the norm. Barclays’ extended downtime raises important questions about its resilience planning.
  • Inadequate Communication: Simply saying “technical issues” provides little transparency. Customers deserve clear information about the cause and likely resolution timeframe. Even now—several days later—no detailed explanation has surfaced.
  • Compensation Clarity: While Barclays eventually confirmed it would cover HMRC penalties, the initial delay left customers anxious at a critical moment. Moreover, the outage affected house purchases, business transactions, and routine banking activities. A promptly communicated compensation framework could have eased anxieties much sooner.

What's particularly concerning is that these incidents are entirely predictable: they are the definition of planning for "When, not if." While operational risk teams gather data and expert views on what could go wrong, they don't always take the critical step of turning this into actionable insight through robust quantitative methods. Without such analysis, how can management make informed decisions? Even a basic Monte Carlo simulation can yield meaningful results in hours, or a little longer if inputs need to be confirmed. The fact that we still see multi-day outages suggests this analysis either isn't being done or isn't reaching the right decision-makers.

Quantifying Operational Risk

By coincidence, I recently modelled a hypothetical authentication service failure for a mid-sized UK bank—actually, two scenarios:

  1. A basic model estimating staff costs for handling extra call volumes
  2. A more comprehensive model factoring in customer compensation plus operational costs


[Figure: basic model estimating staff costs for handling extra call volumes]

Both models were based on simple assumed parameters (e.g., call volumes, contact rates) and run as Monte Carlo simulations. The goal wasn’t to produce an exact figure but to show why quantification is an essential element of effective risk management: it demonstrates how quickly costs escalate when key drivers shift. These drivers include:

  • Outage duration
  • Authentication failure rates
  • Customer contact rates
  • Staff capacity to handle increased volume
  • Potential compensation requirements
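As a rough illustration of how such a model can be put together, the sketch below runs a small Monte Carlo simulation over the drivers listed above. It is not the model behind the figures in this article: every constant, distribution, and function name (`simulate_once`, `run_simulation`, `summarise`) is a hypothetical assumption chosen for readability, and a real exercise would calibrate them against the institution's own data and expert judgement.

```python
import random

# Every figure below is an illustrative assumption, not data from any bank.
CUSTOMERS = 1_000_000               # hypothetical digital customer base
LOGIN_SHARE_PER_HOUR = 0.04         # assumed share of customers attempting login each hour
COST_PER_CONTACT = 6.0              # assumed staff cost per contact handled in-house (GBP)
CONTACTS_PER_HOUR_CAPACITY = 1_500  # assumed contact-centre capacity
OVERFLOW_COST_MULTIPLIER = 2.5      # assumed premium for overtime or outsourced handling
COMPENSATION_PER_CLAIM = 50.0       # assumed average payout per compensated customer (GBP)


def simulate_once(rng):
    """Draw one outage scenario; return (staff_cost, staff_cost + compensation) in GBP."""
    # Uncertain drivers, each drawn from an assumed distribution.
    duration_h = rng.triangular(1.0, 48.0, 6.0)      # low, high, mode
    failure_rate = rng.uniform(0.2, 0.9)             # share of login attempts that fail
    contact_rate = rng.triangular(0.01, 0.10, 0.03)  # affected customers who contact the bank
    claim_rate = rng.triangular(0.001, 0.02, 0.005)  # affected customers who are compensated

    # Customers exposed to the outage grow with duration, capped at the whole base.
    exposed = CUSTOMERS * min(1.0, LOGIN_SHARE_PER_HOUR * duration_h)
    affected = exposed * failure_rate
    contacts = affected * contact_rate

    # Contacts beyond staff capacity are handled at a higher unit cost.
    capacity = CONTACTS_PER_HOUR_CAPACITY * duration_h
    overflow = max(contacts - capacity, 0.0)
    staff_cost = COST_PER_CONTACT * ((contacts - overflow) + OVERFLOW_COST_MULTIPLIER * overflow)

    compensation = affected * claim_rate * COMPENSATION_PER_CLAIM
    return staff_cost, staff_cost + compensation


def run_simulation(trials=10_000, seed=42):
    """Run independent trials with a seeded generator so results are repeatable."""
    rng = random.Random(seed)
    return [simulate_once(rng) for _ in range(trials)]


def percentile(values, q):
    """Simple rank-based percentile estimate (no interpolation)."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


def summarise(values):
    """Mean, median and 95th percentile of a list of simulated costs."""
    return {
        "mean": sum(values) / len(values),
        "p50": percentile(values, 0.50),
        "p95": percentile(values, 0.95),
    }


if __name__ == "__main__":
    results = run_simulation()
    for label, idx in (("Staff cost only", 0), ("Staff cost plus compensation", 1)):
        stats = summarise([r[idx] for r in results])
        print(label, {k: round(v) for k, v in stats.items()})
```

The two return values correspond loosely to the two scenarios: staff cost alone, and staff cost plus compensation. Reporting a median and a 95th percentile rather than a single number is the point of the exercise, since the gap between them shows how much of the exposure sits in the tail.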


[Figure: more comprehensive model factoring in customer compensation plus operational costs]

Looking Forward

Although these simulations weren’t specific to Barclays, they demonstrate the value of scenario analysis in:

  • Anticipating potential impact ranges
  • Pinpointing key risk drivers
  • Planning resource requirements
  • Informing compensation strategies
  • Testing operational resilience expectations
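One simple way to pinpoint key risk drivers is a one-at-a-time sensitivity check: hold every input at a base value, move a single driver between a low and a high case, and rank the drivers by how far the cost swings. The sketch below shows the idea with a deliberately simplified cost formula. The driver ranges, constants, and the `outage_cost` and `driver_swings` functions are all hypothetical assumptions for illustration, not values taken from the models described in this article.

```python
# Assumed (low, base, high) values for each driver; all figures are illustrative.
DRIVERS = {
    "outage_hours": (2.0, 6.0, 48.0),
    "failure_rate": (0.2, 0.5, 0.9),
    "contact_rate": (0.01, 0.03, 0.10),
    "cost_per_contact": (4.0, 6.0, 10.0),
    "claim_rate": (0.001, 0.005, 0.02),
}
CUSTOMERS = 1_000_000          # hypothetical digital customer base
LOGIN_SHARE_PER_HOUR = 0.04    # assumed share of customers attempting login each hour
COMPENSATION_PER_CLAIM = 50.0  # assumed average payout per compensated customer (GBP)


def outage_cost(outage_hours, failure_rate, contact_rate, cost_per_contact, claim_rate):
    """Simplified deterministic cost of one outage scenario, in GBP."""
    exposed = CUSTOMERS * min(1.0, LOGIN_SHARE_PER_HOUR * outage_hours)
    affected = exposed * failure_rate
    handling = contact_rate * cost_per_contact            # contact cost per affected customer
    compensation = claim_rate * COMPENSATION_PER_CLAIM    # payout per affected customer
    return affected * (handling + compensation)


def driver_swings():
    """Vary one driver at a time from low to high; return swings sorted largest first."""
    base = {name: values[1] for name, values in DRIVERS.items()}
    swings = {}
    for name, (low, _, high) in DRIVERS.items():
        low_cost = outage_cost(**{**base, name: low})
        high_cost = outage_cost(**{**base, name: high})
        swings[name] = high_cost - low_cost
    return sorted(swings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, swing in driver_swings():
        print(f"{name:<18} swing: GBP {swing:,.0f}")
```

A ranking like this depends entirely on the ranges chosen, so it is best read as a prompt for challenge ("is a 48-hour outage really our high case?") rather than a finding in its own right.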

Ultimately, scenario planning and clear communication help organisations respond faster and minimise damage during an incident. Financial institutions, regardless of size, must invest in robust infrastructure, appropriately tested contingency plans, and effective incident response, all of which mitigate risk and improve customer trust.

Explore The Models

Want to see these authentication failure simulations in action? You can explore and run these models yourself at the Risk Insights Explorer (riskspace.com). Click Select Scenario, then choose "Online Banking Authentication Service Failure" from the drop-down. The platform lets you adjust parameters, challenge assumptions, and develop your own scenarios, helping you build a better understanding of operational risk modelling and scenario requirements.


Note: The Monte Carlo simulations referenced are illustrative examples only and are not based on any specific data, Barclays or otherwise. They serve as examples of risk quantification approaches rather than predictions of actual impact.
