VNFs and Closed-Loop Automation

Like many of us, I have been involved in a number of network upgrade projects and suffered through maintenance windows that started late at night and brought plenty of challenges.

I love the idea of sidestepping a painful upgrade and integration process by embracing virtual network functions and their ability to scale in, out, up, and down on command. There are still things to think about and work to do around when to make that scaling decision and by how much. I hope to address some of them here.

What it’s supposed to do and how to tell if it’s doing it

The first challenge is to decide what the VNF has to do. With mobile core functions it’s pretty straightforward: each instance has to process so many requests in some unit of time. There are three measures here that need to be considered. The first is offered load: how many things is the VNF being asked to do at each moment in time? I expect to match the resources dedicated to some function to the rate at which that function needs to be performed. The next is transactions per unit time. If the offered load goes up and the number of transactions processed per unit time rises with it, the function is working. At some point, as the load keeps climbing, the function will take more time to process each transaction.

This leads us to the third measure, the time per transaction. If the load goes up, the number of transactions per unit time goes up, and the amount of time every transaction takes starts to climb, it may be time to add another resource instance, depending on how much time per transaction is acceptable. The time per transaction number usually comes from a synthetic tester: it performs some transaction, measures how long it takes from start to finish, and reports back. If offered load goes up, transactions per unit time stays flat, and the time per transaction spikes, it’s time to spin up a new instance.
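To make that concrete, here is a minimal Python sketch of the kind of synthetic probe described above. The `do_transaction` callable and the metric names are placeholders I’ve invented for illustration; the real transaction is whatever your particular VNF does.

```python
import time
import statistics

def run_synthetic_probe(do_transaction, samples=10):
    """Run a synthetic transaction a few times and report how long each takes.

    `do_transaction` is a placeholder callable that performs one end-to-end
    transaction against the VNF (e.g. an attach request against a mobile
    core instance) and returns when the transaction completes.
    """
    durations = []
    for _ in range(samples):
        start = time.monotonic()
        do_transaction()                          # the actual work is VNF-specific
        durations.append(time.monotonic() - start)
    return {
        "time_per_transaction_p50": statistics.median(durations),
        "time_per_transaction_max": max(durations),
    }

# The three measures discussed above, gathered per interval:
#   offered_load          - requests arriving per second (from the load balancer / ingress)
#   transactions_per_sec  - requests completed per second (from the VNF's own counters)
#   time_per_transaction  - reported by the synthetic probe above
```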

One of the larger challenges in managing performance is setting thresholds. If I can limit my threshold-setting efforts to things that directly affect business outcomes, and sufficiently model the relationships between the outcome I’m looking for and the parts that make it up, I won’t have to set thresholds for things like CPU and memory utilization. I’ll still know when I need more of either to support my business needs, but more on that below.

What about resiliency?

I have to monitor each resource instance and the aggregate performance of the function. For this NFV stuff I’m not as interested in availability: I stood up enough instances in enough places to tolerate the failure of any instance or set of instances. There are two corner cases for resiliency. If the function is big, takes up multiple cores, and/or needs to maintain a lot of state, the failure of any instance should be avoided, and sufficient protection against single points of failure should be built into the lower layers so the function isn’t lost. The other corner is that the less state the function must maintain and the fewer resources it requires, the less provision needs to be made for lower-layer resiliency; shared risks should still be avoided, but money spent on redundant power or any other hardware-layer resiliency is money wasted. I can’t avoid using shiny, expensive servers for monolithic apps that use a lot of resources and maintain a lot of state, but those apps are bad candidates for virtualization out of the gate.

Where’s the bottleneck?

Up above I said that if I had an adequate model of the resources that compose the instance, I wouldn’t have to bother with lower-layer threshold setting. In this section I’ll tell you how that can be done. Making models by hand is hard, and maintaining them is hard too. It’s not too hard to tell that an application depends on memory, storage, network, and CPU, but as the application changes, the dependencies will change. Something that was disk-bound last week could have been optimized to run in memory, so disk becomes much less important, but the new bottleneck could be compute or network.

There are statistical methods, Granger causality being one of them, that let you establish a correlation between two sets of time-series data. This lets variables like offered load, transactions per unit time, time per transaction, CPU, memory, disk, and network describe their own relationships. It makes it much easier to find the strongest correlation between an increase in time per transaction and, say, memory utilization. This also works in the other direction: if there is no correlation between a slowdown and any of the things you’re looking at, the problem is outside the system, and you know adding more resources won’t fix it. So you get two things out of this correlation effort: you know what you need more of, which is perfect justification for increasing resource counts, and you know when enough is enough, when adding more resources won’t make any difference to performance.
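As a rough illustration, here is a minimal sketch using the Granger causality test from the statsmodels Python library. The metric names, the lag depth, and the p-value cutoff are my own illustrative choices, not prescriptions, and a real deployment would need care around sampling intervals and stationarity of the series.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def strongest_drivers(time_per_txn, candidate_series, maxlag=4, alpha=0.05):
    """Rank candidate resource metrics by how strongly they Granger-cause
    the time-per-transaction series.

    `time_per_txn` and each value in `candidate_series` are equal-length
    1-D arrays sampled on the same interval (e.g. one point per minute).
    Returns (metric_name, p_value) pairs sorted best-first, or an empty
    list if nothing clears the significance cutoff - in which case the
    slowdown likely lives outside the system and more resources won't help.
    """
    scored = []
    for name, series in candidate_series.items():
        data = np.column_stack([time_per_txn, series])
        results = grangercausalitytests(data, maxlag=maxlag, verbose=False)
        # take the best (smallest) p-value across the tested lags
        p = min(results[lag][0]["ssr_ftest"][1] for lag in results)
        if p < alpha:
            scored.append((name, p))
    return sorted(scored, key=lambda item: item[1])

# Example usage with made-up metric arrays:
# drivers = strongest_drivers(latency, {"cpu": cpu_util, "mem": mem_util,
#                                       "disk": disk_io, "net": net_util})
# An empty result means no resource metric correlates with the slowdown.
```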

Scaling out

When offered load is increasing, the time each transaction takes is increasing, and you have a clear indication that the bottleneck exists at the resource level, you still have to wait until you cross a threshold you’ve set, one that balances time per transaction against the time it takes to spin up a resource and the actual impact the increased time per transaction has on the business. Now, instead of having to set lower-layer thresholds high enough to avoid false positives while risking false negatives, you can focus on a single threshold: still arbitrary, still needing to be maintained, but much closer to the desired business outcome.
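A rough sketch of what that single, business-facing decision might look like follows. The threshold value and the metric keys are illustrative assumptions; in practice the latency target comes from whatever time per transaction the business will actually tolerate.

```python
def should_scale_out(metrics, latency_slo_seconds=0.5):
    """Decide whether to add a resource instance.

    `metrics` is assumed to carry the current-interval values plus a simple
    trend indicator for each measure; `latency_slo_seconds` is a stand-in
    for the business-level time-per-transaction threshold, balanced against
    how long a new instance takes to spin up.
    """
    load_rising = metrics["offered_load_trend"] > 0
    latency_rising = metrics["time_per_transaction_trend"] > 0
    latency_breached = metrics["time_per_transaction"] > latency_slo_seconds
    # bottleneck_confirmed comes from the correlation step above: some
    # resource metric actually correlates with the slowdown.
    return (load_rising and latency_rising and latency_breached
            and metrics["bottleneck_confirmed"])
```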

There is another issue that needs to be addressed as well: offered load may spike and adding a single resource may produce no improvement in time per transaction; improvement may not appear until some number of resource instances have been added. After adding some number of resources without any noticeable improvement, the system needs to stop adding resources and notify an operator. It’s not hard to imagine what would happen if the system kept adding more resources until it ran out of resources to add.
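One way to express that backstop, as a minimal sketch: the cap and the three callables (`add_instance`, `latency_improved`, `notify_operator`) are hypothetical hooks into the orchestrator, the measurement pipeline, and the alerting system.

```python
MAX_CONSECUTIVE_ADDITIONS = 3   # illustrative cap, not a recommendation

def scale_out_with_backstop(add_instance, latency_improved, notify_operator):
    """Add instances one at a time, but stop and escalate if repeated
    additions produce no measurable improvement in time per transaction.
    Each check should allow enough time for the new instance to take load.
    """
    for _ in range(MAX_CONSECUTIVE_ADDITIONS):
        add_instance()
        if latency_improved():
            return True            # the bottleneck really was capacity
    # No improvement after several additions: stop burning resources
    # and hand the problem to a human.
    notify_operator("scale-out produced no improvement after "
                    f"{MAX_CONSECUTIVE_ADDITIONS} additions")
    return False
```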

Scaling in

One of the benefits to the owner of the resources is the ability to scale back in. When offered load decreases, transactions per unit time decrease, and the time per transaction holds steady or continues to fall, the resource count can be reduced. Some care needs to be taken around preserving sessions, so it may be best to first tell the load balancer not to set up any new connections to a particular instance, then wait some decent interval until the number of existing connections falls below a threshold where the potential loss of transactions or services is offset by the savings from freeing up resources to do something else.
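A minimal sketch of that drain-then-remove sequence, assuming a hypothetical load-balancer client with `stop_new_connections` and `connection_count` methods; the connection threshold and wait intervals are placeholders that would be tuned against what a lost transaction actually costs versus the value of the freed capacity.

```python
import time

def drain_and_remove(instance, lb, max_connections=5,
                     check_interval=30, max_wait=1800):
    """Scale in one instance without abruptly dropping sessions."""
    lb.stop_new_connections(instance)            # no new sessions land here
    waited = 0
    while lb.connection_count(instance) > max_connections and waited < max_wait:
        time.sleep(check_interval)
        waited += check_interval
    # Either the instance has drained, or we've waited long enough that the
    # few remaining sessions are worth less than the freed capacity.
    instance.terminate()
```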

Hunting errors or continual optimization?

One of the challenges with closed-loop automation is a genuine concern around flapping: the worry that the system will continually add and remove resources, wasting capacity in the process. Instead of viewing an ongoing shrinking and growing of the resource pool as flapping, it may serve us to think of it as the system trying to find the strongest correlation between added resources and improved performance. It can be a way of verifying the validity of the performance model and avoiding the problem of step functions, where there is no improvement in performance until some number of resources have been added.
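For cases where flapping really is unwanted, one common way to damp it is hysteresis: scale out and scale in at different latency levels, and never act twice within a cooldown window. The numbers below are illustrative placeholders, not recommendations.

```python
import time

class ScalingGovernor:
    """Simple hysteresis for the scaling loop: separate scale-out and
    scale-in thresholds plus a cooldown between actions."""

    def __init__(self, scale_out_at=0.5, scale_in_at=0.2, cooldown=600):
        self.scale_out_at = scale_out_at   # seconds per transaction
        self.scale_in_at = scale_in_at
        self.cooldown = cooldown           # seconds between actions
        self.last_action = 0.0

    def decide(self, time_per_transaction):
        if time.monotonic() - self.last_action < self.cooldown:
            return "hold"
        if time_per_transaction > self.scale_out_at:
            self.last_action = time.monotonic()
            return "scale_out"
        if time_per_transaction < self.scale_in_at:
            self.last_action = time.monotonic()
            return "scale_in"
        return "hold"
```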

