VNFs and Closed-Loop Automation

Like many of us, I have been involved in a number of network upgrade projects and suffered through maintenance windows that started late at night and brought plenty of challenges.

I love the idea of sidestepping a painful upgrade and integration process by embracing virtual network functions and their ability to scale in, out, up, and down on command. There are still things to think about and work to do around when to make that scaling decision and by how much. I hope to address some of them here.

What it’s supposed to do and how to tell if it’s doing it

The first challenge is to decide what the VNF has to do. With mobile core functions it’s pretty straightforward: each instance has to process so many requests in some unit of time. There are three measures here that need to be considered. The first is offered load: how many things is the VNF being asked to do at each moment in time? I expect to match the resources dedicated to some function to the rate at which that function needs to be performed. The next is transactions per unit time. If the offered load goes up and the number of transactions processed per unit time rises with it, the function is working. At some point, as the load keeps climbing, the function will take more time to process each transaction.

This leads us to the third measure, the time per transaction. If the load goes up, the number of transactions per unit time goes up, and the amount of time every transaction takes starts to climb, it may be time to add another resource instance, depending on how much time per transaction is acceptable. The time per transaction number usually comes from a synthetic tester: it performs some transaction, measures how long it takes from start to finish, and reports back. If offered load goes up, transactions per unit time stays flat, and the time per transaction spikes, it’s time to spin up a new instance.
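To make that concrete, here is a minimal Python sketch of the kind of synthetic probe described above. The `do_transaction` callable and the metric names are placeholders I’ve invented for illustration; the real transaction is whatever your particular VNF does.

```python
import time
import statistics

def run_synthetic_probe(do_transaction, samples=10):
    """Run a synthetic transaction a few times and report how long each takes.

    `do_transaction` is a placeholder callable that performs one end-to-end
    transaction against the VNF (e.g. an attach request against a mobile
    core instance) and returns when the transaction completes.
    """
    durations = []
    for _ in range(samples):
        start = time.monotonic()
        do_transaction()                          # the actual work is VNF-specific
        durations.append(time.monotonic() - start)
    return {
        "time_per_transaction_p50": statistics.median(durations),
        "time_per_transaction_max": max(durations),
    }

# The three measures discussed above, gathered per interval:
#   offered_load          - requests arriving per second (from the load balancer / ingress)
#   transactions_per_sec  - requests completed per second (from the VNF's own counters)
#   time_per_transaction  - reported by the synthetic probe above
```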

One of the larger challenges in managing performance is setting thresholds. If I can limit my threshold-setting efforts to things that directly affect business outcomes, and sufficiently model the relationships between the outcome I’m looking for and the parts that make it up, I won’t have to set thresholds for things like CPU and memory utilization. I’ll still know when I need more of either to support my business needs, but more on that below.

What about resiliency?

I have to monitor each resource instance and the aggregate performance of the function. For this NFV stuff I’m not as interested in availability: I stood up enough instances in enough places to tolerate the failure of any instance or set of instances. There are two corner cases for resiliency. If the function is big, takes up multiple cores, and/or needs to maintain a lot of state, the failure of any instance should be avoided, and sufficient protection against single points of failure should be built into the lower layers so the function isn’t lost. The other corner is that the less state the function must maintain and the fewer resources it requires, the less provision needs to be made for lower-layer resiliency; shared risks should still be avoided, but money spent on redundant power or any other hardware-layer resiliency is money wasted. I can’t avoid using shiny, expensive servers for monolithic apps that use a lot of resources and maintain a lot of state, but those apps are bad candidates for virtualization out of the gate.

Where’s the bottleneck?

Up above I said that if I had an adequate model of the resources that compose the instance, I wouldn’t have to bother with lower-layer threshold setting. In this section I’ll tell you how that can be done. Making models by hand is hard, and maintaining them is hard too. It’s not too hard to tell that an application depends on memory, storage, network, and CPU, but as the application changes, the dependencies will change. Something that was disk-bound last week could have been optimized to run in memory, so disk becomes much less important, but the new bottleneck could be compute or network.

There are statistical methods, Granger causality being one of them, that let you establish a correlation between two sets of time-series data. This lets variables like offered load, transactions per unit time, time per transaction, CPU, memory, disk, and network describe their own relationships. It makes it much easier to find the strongest correlation between an increase in time per transaction and, say, memory utilization. This also works in the other direction: if there is no correlation between a slowdown and any of the things you’re looking at, the problem is outside the system, and you know adding more resources won’t fix it. So you get two things out of this correlation effort: you know what you need more of, which is perfect justification for increasing resource counts, and you know when enough is enough, when adding more resources won’t make any difference to performance.
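As a rough illustration, here is a minimal sketch using the Granger causality test from the statsmodels Python library. The metric names, the lag depth, and the p-value cutoff are my own illustrative choices, not prescriptions, and a real deployment would need care around sampling intervals and stationarity of the series.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def strongest_drivers(time_per_txn, candidate_series, maxlag=4, alpha=0.05):
    """Rank candidate resource metrics by how strongly they Granger-cause
    the time-per-transaction series.

    `time_per_txn` and each value in `candidate_series` are equal-length
    1-D arrays sampled on the same interval (e.g. one point per minute).
    Returns (metric_name, p_value) pairs sorted best-first, or an empty
    list if nothing clears the significance cutoff - in which case the
    slowdown likely lives outside the system and more resources won't help.
    """
    scored = []
    for name, series in candidate_series.items():
        data = np.column_stack([time_per_txn, series])
        results = grangercausalitytests(data, maxlag=maxlag, verbose=False)
        # take the best (smallest) p-value across the tested lags
        p = min(results[lag][0]["ssr_ftest"][1] for lag in results)
        if p < alpha:
            scored.append((name, p))
    return sorted(scored, key=lambda item: item[1])

# Example usage with made-up metric arrays:
# drivers = strongest_drivers(latency, {"cpu": cpu_util, "mem": mem_util,
#                                       "disk": disk_io, "net": net_util})
# An empty result means no resource metric correlates with the slowdown.
```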

Scaling out

When offered load is increasing, the time each transaction takes is increasing, and you have a clear indication that the bottleneck exists at the resource level, you still have to wait until you cross a threshold you’ve set, one that balances time per transaction against the time it takes to spin up a resource and the actual impact the increased time per transaction has on the business. Now, instead of having to set lower-layer thresholds high enough to avoid false positives while risking false negatives, you can focus on a single threshold: still arbitrary, still needing to be maintained, but much closer to the desired business outcome.
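A rough sketch of what that single, business-facing decision might look like follows. The threshold value and the metric keys are illustrative assumptions; in practice the latency target comes from whatever time per transaction the business will actually tolerate.

```python
def should_scale_out(metrics, latency_slo_seconds=0.5):
    """Decide whether to add a resource instance.

    `metrics` is assumed to carry the current-interval values plus a simple
    trend indicator for each measure; `latency_slo_seconds` is a stand-in
    for the business-level time-per-transaction threshold, balanced against
    how long a new instance takes to spin up.
    """
    load_rising = metrics["offered_load_trend"] > 0
    latency_rising = metrics["time_per_transaction_trend"] > 0
    latency_breached = metrics["time_per_transaction"] > latency_slo_seconds
    # bottleneck_confirmed comes from the correlation step above: some
    # resource metric actually correlates with the slowdown.
    return (load_rising and latency_rising and latency_breached
            and metrics["bottleneck_confirmed"])
```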

There is another issue that needs to be addressed as well: offered load may spike and adding a single resource may produce no improvement in time per transaction; improvement may not appear until some number of resource instances have been added. After adding some number of resources without any noticeable improvement, the system needs to stop adding resources and notify an operator. It’s not hard to imagine what would happen if the system kept adding more resources until it ran out of resources to add.
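One way to express that backstop, as a minimal sketch: the cap and the three callables (`add_instance`, `latency_improved`, `notify_operator`) are hypothetical hooks into the orchestrator, the measurement pipeline, and the alerting system.

```python
MAX_CONSECUTIVE_ADDITIONS = 3   # illustrative cap, not a recommendation

def scale_out_with_backstop(add_instance, latency_improved, notify_operator):
    """Add instances one at a time, but stop and escalate if repeated
    additions produce no measurable improvement in time per transaction.
    Each check should allow enough time for the new instance to take load.
    """
    for _ in range(MAX_CONSECUTIVE_ADDITIONS):
        add_instance()
        if latency_improved():
            return True            # the bottleneck really was capacity
    # No improvement after several additions: stop burning resources
    # and hand the problem to a human.
    notify_operator("scale-out produced no improvement after "
                    f"{MAX_CONSECUTIVE_ADDITIONS} additions")
    return False
```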

Scaling in

One of the benefits to the owner of the resources is the ability to scale back in. When offered load decreases, transactions per unit time decrease, and the time per transaction holds steady or continues to fall, the resource count can be reduced. Some care needs to be taken around preserving sessions, so it may be best to first tell the load balancer not to set up any new connections to a particular instance, then wait some decent interval until the number of existing connections falls below a threshold where the potential loss of transactions or services is offset by the savings from freeing up resources to do something else.
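A minimal sketch of that drain-then-remove sequence, assuming a hypothetical load-balancer client with `stop_new_connections` and `connection_count` methods; the connection threshold and wait intervals are placeholders that would be tuned against what a lost transaction actually costs versus the value of the freed capacity.

```python
import time

def drain_and_remove(instance, lb, max_connections=5,
                     check_interval=30, max_wait=1800):
    """Scale in one instance without abruptly dropping sessions."""
    lb.stop_new_connections(instance)            # no new sessions land here
    waited = 0
    while lb.connection_count(instance) > max_connections and waited < max_wait:
        time.sleep(check_interval)
        waited += check_interval
    # Either the instance has drained, or we've waited long enough that the
    # few remaining sessions are worth less than the freed capacity.
    instance.terminate()
```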

Hunting errors or continual optimization?

One of the challenges with closed-loop automation is a genuine concern around flapping: the worry that the system will continually add and remove resources, wasting capacity in the process. Instead of viewing an ongoing shrinking and growing of the resource pool as flapping, it may serve us to think of it as the system trying to find the strongest correlation between added resources and improved performance. It can be a way of verifying the validity of the performance model and avoiding the problem of step functions, where there is no improvement in performance until some number of resources have been added.
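For cases where flapping really is unwanted, one common way to damp it is hysteresis: scale out and scale in at different latency levels, and never act twice within a cooldown window. The numbers below are illustrative placeholders, not recommendations.

```python
import time

class ScalingGovernor:
    """Simple hysteresis for the scaling loop: separate scale-out and
    scale-in thresholds plus a cooldown between actions."""

    def __init__(self, scale_out_at=0.5, scale_in_at=0.2, cooldown=600):
        self.scale_out_at = scale_out_at   # seconds per transaction
        self.scale_in_at = scale_in_at
        self.cooldown = cooldown           # seconds between actions
        self.last_action = 0.0

    def decide(self, time_per_transaction):
        if time.monotonic() - self.last_action < self.cooldown:
            return "hold"
        if time_per_transaction > self.scale_out_at:
            self.last_action = time.monotonic()
            return "scale_out"
        if time_per_transaction < self.scale_in_at:
            self.last_action = time.monotonic()
            return "scale_in"
        return "hold"
```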

