How SD-WAN improves Mean Time To Repair: WHILE outage { CASE detect(); diagnose(); resolve(); i++ }

The important consideration when working on improving Mean Time to Repair (MTTR) is to understand what happens in the time in between. Formally, MTTR is measured from an outage occurring at one time to the link coming back online at another, but a meaningful conversation about it requires more information than those two timestamps. In the context of software-defined wide area networks (SD-WAN), the comparison to make is how an SD-WAN deployment improves MTTR over a legacy wide area network (WAN) installation built on old-school routers.

Based on risk mitigations and industry norms, ISPs often contract SLAs based on these MTTRs. A poorly managed MTTR can result in heavy penalties, or in additional costs when excessive repair times are corrected by adding resources (either headcount or automated service tools), which might not be optimal. Another negative consequence is customer churn.

Incident life cycle

To understand the times involved in MTTR we need to fully understand all the steps that happen from outage to repair, which in ITIL terms is often referred to as the incident life cycle. Here are the steps at a high level:

  • Outage occurs;
  • The outage is detected either by human notification or automated systems such as Network Management Systems;
  • A process of diagnosis occurs whereby resources determine the outage causation and repair process. During this step, a number of tools can potentially assist. Causation can be immediate (visual), intermediate (underlying) or root (underpinning);
  • Typically, once the underlying causation is determined, a repair can be initiated;
  • If appropriate, a workaround might be available to temporarily return the link/connectivity to service as a short-term alternative, with normal operations restored at a later stage;
  • The link is ready for repair when diagnosis is complete, the repair process determined and any logistics such as delivery of spare parts/components completed;
  • The components that have caused the outage are then repaired and this includes restoring the required configuration for normal operations; and
  • The link starts operating normally again when traffic starts flowing again over the link in a manner similar to before the outage.

Programmatically this would be:
WHILE outage {
    CASE detect(); diagnose(); resolve();
    i++
}
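
To make "the time in between" concrete, the life cycle can be sketched as a timeline of milestones with the minutes spent between each pair. This is a minimal illustration only: the timestamps and milestone names are hypothetical, and the optional workaround step is left out for brevity.

```python
from datetime import datetime

# Hypothetical timestamps for a single incident; milestone names are illustrative.
incident = {
    "outage": datetime(2024, 1, 1, 8, 0),
    "detected": datetime(2024, 1, 1, 8, 20),
    "diagnosed": datetime(2024, 1, 1, 9, 30),
    "repair_ready": datetime(2024, 1, 1, 11, 0),
    "repaired": datetime(2024, 1, 1, 11, 45),
    "restored": datetime(2024, 1, 1, 12, 0),
}


def phase_durations(timeline):
    """Return the minutes spent between consecutive life-cycle milestones."""
    names = list(timeline)  # dicts preserve insertion order
    durations = {}
    for start, end in zip(names, names[1:]):
        delta = timeline[end] - timeline[start]
        durations[f"{start} -> {end}"] = delta.total_seconds() / 60
    return durations


def time_to_repair(timeline):
    """Total minutes from the outage to normal operation."""
    return (timeline["restored"] - timeline["outage"]).total_seconds() / 60


durations = phase_durations(incident)
for phase, minutes in durations.items():
    print(f"{phase}: {minutes:.0f} min")
print(f"total: {time_to_repair(incident):.0f} min")
```

Averaging the total over many incidents gives the MTTR, while the per-phase figures show where the time is actually going.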

A video accompanying the original article explains this in greater detail; the same life cycle can also be applied to security-related incidents.

The SD-WAN architecture inherently improves MTTR in a number of ways. The connectivity is controlled and managed from aggregators / concentrators located in data centres. Thus, unlike a legacy distributed wide area network, any link outage is immediately detected by the aggregators / concentrators without requiring a remote polling system.
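
As a rough sketch of why no remote polling is needed, consider an aggregator that simply notices when a link stops sending keepalives. The keepalive mechanism, the `Aggregator` class and the 3-second timeout are assumptions made for illustration, not a description of any particular SD-WAN product.

```python
class Aggregator:
    """Toy model of an aggregator watching tunnel keepalives (hypothetical)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # link name -> time of last keepalive, in seconds

    def keepalive(self, link, now_s):
        """Record that a keepalive arrived from this link."""
        self.last_seen[link] = now_s

    def down_links(self, now_s):
        """Return links whose last keepalive is older than the timeout."""
        return sorted(
            link
            for link, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


agg = Aggregator(timeout_s=3)
agg.keepalive("branch-a", now_s=0)
agg.keepalive("branch-b", now_s=0)
agg.keepalive("branch-a", now_s=2)  # branch-b has gone silent
agg.keepalive("branch-a", now_s=4)
print(agg.down_links(now_s=4))
```

In this model the detection delay is bounded by the keepalive timeout, rather than by the polling interval of a separate management system.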

Configuration

The setup and configuration of an SD-WAN is simple at an administrative level. There are no reams of text to copy and paste via telnet/ssh sessions. The diagnosis is immediately partitioned between the lower transport protocol levels and the higher connectivity protocol levels. SD-WAN makes this diagnosis immediately apparent, and there is no extended finger pointing between layer 2 and layer 3, which so often befalls legacy wide area network deployments.

Logistics

Logistics and spare parts are common to both SD-WAN and legacy wide area network deployments and are not necessarily better optimised in either scenario. However, since SD-WAN hardware is more likely to be built using white box instead of proprietary hardware, there is a potential improvement in overall parts availability. Another benefit of SD-WAN is that the diagnosis and management abilities of the product set are more up to date, which should result in a greater success rate of first-visit resolutions with rolling wheels. One of the biggest curses of current legacy WAN installations is the disproportionate number of second visits required by rolling wheels due to component mismatches. Some of these installations have been in the field for years, and the new stock often does not inter-operate with what is in the field.

Automation

The restore of the link is highly optimised and automated within SD-WAN. This is a result of the simple provisioning mechanism used to initially deploy SD-WAN, which is then leveraged to restore service. The device automatically connects to the aggregator / concentrator, downloads the configuration and service is restored. In a legacy environment there is often a process involving laptops with specialised cables, remote session consoles over 3G such as TeamViewer, and the cursed cut and paste required with legacy consoles. The skill level required of remote hands in SD-WAN is thus lower, and such skills are therefore more readily available.
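
The restore flow described above can be sketched as follows. Everything here is a stand-in: `CONFIG_STORE` is an in-memory substitute for the aggregator's configuration store, and keying the configuration by a site identifier is an assumption made for illustration.

```python
# Hypothetical in-memory stand-in for the aggregator's configuration store,
# keyed by site identifier (an assumption for this sketch).
CONFIG_STORE = {
    "branch-a": {"wan_links": ["fibre", "lte"], "lan_subnet": "10.1.0.0/24"},
}


class ReplacementDevice:
    """A freshly installed device that holds no local configuration."""

    def __init__(self, site_id):
        self.site_id = site_id
        self.config = None

    def restore(self, store):
        """Fetch this site's configuration from the store and apply it."""
        config = store.get(self.site_id)
        if config is None:
            raise LookupError(f"no configuration registered for {self.site_id}")
        self.config = dict(config)  # copy so the store is not mutated
        return self.config


device = ReplacementDevice("branch-a")
restored = device.restore(CONFIG_STORE)
print(restored["wan_links"])
```

The point of the sketch is that the person on site only needs to plug the device in; nothing is typed or pasted into a console.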

SD-WAN links are often deployed so that multiple paths and mediums are utilised. Given this inherent ability, a workaround is more readily available in SD-WAN deployments than in legacy WAN installations. In many situations, SD-WAN protects the overall availability because, when more than one last mile is in place, it is unlikely that they are all suffering outages simultaneously!
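
A quick worked example shows the effect of a second last mile. It assumes the links fail independently, which is optimistic if they share a duct or a power source, and the availability figures are made up for illustration.

```python
from math import prod


def combined_availability(link_availabilities):
    """Probability that at least one link is up, assuming independent failures."""
    return 1 - prod(1 - a for a in link_availabilities)


single = combined_availability([0.99])       # one last mile at 99%
dual = combined_availability([0.99, 0.98])   # add a second, weaker last mile
print(f"single link: {single:.4f}")
print(f"two links:   {dual:.4f}")
```

Under these assumptions, adding a 98% link to a 99% link lifts the combined availability to roughly 99.98%, since both must be down at the same time for the site to lose connectivity.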

At a basic and practical level, SD-WAN improves MTTR. Any contributions and comments are welcome.

This article was written by Ronald Bartels who works connecting Internet inhabiting things at Fusion Broadband.
