Improving VNF live migration through Machine Learning
Several ML techniques have recently been used to optimize and enhance the performance of VNF live migration. No single ML algorithm is considered to suit every type of VNF and payload content in a given cloud topology, but a consensus is building in the industry that ML offers a good means of investigating the origin, nature and attributes of the delay factors that add up to the final measurable net migration time. Recently I studied the Particle Swarm Optimization (PSO) algorithm in some depth; it is gaining momentum in the industry. PSO is a metaheuristic, and it can be used to optimize the hyperparameters of an Artificial Neural Network (ANN).
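To make the PSO idea concrete, here is a minimal sketch of a swarm searching a two-dimensional hyperparameter space. It is an illustration under stated assumptions, not the setup from any particular experiment: the function names, the coefficient values (`inertia`, `cognitive`, `social`) and the toy objective are all hypothetical, and `toy_validation_loss` merely stands in for "train an ANN with these hyperparameters and return its validation loss".

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=50,
                 inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Basic global-best PSO. bounds is a list of (low, high) per dimension."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Start particles at random positions with zero velocity.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull to own best, pull to swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + cognitive * r1 * (pbest[i][d] - pos[i][d])
                             + social * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move, then clamp to the allowed search range.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_validation_loss(params):
    # Hypothetical stand-in for training an ANN and measuring validation loss.
    # Its minimum is at log10(learning rate) = -3 and 64 hidden units.
    log_lr, hidden_units = params
    return (log_lr + 3.0) ** 2 + ((hidden_units - 64.0) / 32.0) ** 2


if __name__ == "__main__":
    best, loss = pso_minimize(toy_validation_loss, [(-6.0, 0.0), (8.0, 256.0)])
    print("best hyperparameters:", best, "loss:", loss)
```

In a real setting each call to the objective would be a full training run, so the number of particles and iterations directly sets the tuning cost.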
What Causes Delay in VNF Migration: In simple terms, the delay can be described by a probability density function of the delay differences across all computational paths from source to destination. In the real world we typically deal with VM cluster automation and container cluster automation, and we try to find the most efficient paths so that live migrations incur minimum delay. We might imagine that such efforts would take care of day-to-day migration delay headaches, but in reality they do not. What happens is that (a) we fail to visualize proactively how the run-time analysis of these flows will create scaling issues down the road as the number of migration requests increases, and (b) without a well-conceived ML algorithm one cannot build an accurate performance prediction model for run-time analysis, one that matches in real time the total delay that IP flows incur in the upper layers (layers 4-7).
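As a rough illustration of the density idea at the start of this section, the sketch below builds an empirical (histogram) density of delay differences from a handful of per-path delay measurements. Two things here are my assumptions rather than anything stated above: the "difference" is taken relative to the fastest path, and the function name, the bin width and the sample delays are hypothetical.

```python
from collections import Counter


def delay_difference_density(path_delays_ms, bin_width_ms=5.0):
    """Empirical density of (path delay - fastest path delay).

    Returns {bin_start_ms: density}, where density * bin_width_ms is the
    fraction of paths whose delay difference falls in that bin.
    """
    if not path_delays_ms:
        raise ValueError("need at least one path delay")
    fastest = min(path_delays_ms)
    # Assumption: differences are measured against the fastest path.
    diffs = [d - fastest for d in path_delays_ms]
    counts = Counter(int(diff // bin_width_ms) for diff in diffs)
    n = len(diffs)
    return {b * bin_width_ms: c / (n * bin_width_ms)
            for b, c in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical delays (ms) measured on five source-to-destination paths.
    sample = [100.0, 102.0, 107.0, 111.0, 125.0]
    for bin_start, density in delay_difference_density(sample).items():
        print(f"{bin_start:5.1f} ms: {density:.3f}")
```

A long tail in such a histogram would point to a few slow paths dominating the net migration time.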
Consequently, we find that as the number of migration requests increases, the time required by the optimization model increases proportionally. The problem is further aggravated as the number of instances and the overall VM/container cluster density increase. Net result: VNF migration is live but does not scale!
In a recent experiment I observed that overfitting during longer training sessions can be a performance limiter when implementing ML. The scaling problem can also be addressed somewhat better by making a few dry runs with a known set of instances on fixed server or serverless configurations, and then slowly increasing the number of migration requests and instances. Future work in this field will involve expanding the hyperparameter search space to include additional hyperparameters that have not been optimized so far. The good news is that Machine Learning has been a great help and a source of optimism for all of us in the cloud community. I am sure better days are yet to come, when AI, ML and automation will be used to their maximum capacity.