Machine Learning for IT Admins

A few years back, I took Andrew Ng's machine learning class on Coursera. The class got me enthused about the field, and I soon began to see applications of machine learning concepts in my own line of work.

I decided to write this note to encourage other system engineers to look at machine learning as a tool that can help them in their day-to-day jobs. But please remember that I have no formal training either as a statistician or as a data analyst, so take this with a grain of salt. That said, I would be delighted if you let me know, on this post, if you find any errors.

My very first attempt at applying machine learning was to a design problem. I was involved with introducing VMware-based virtualisation at a customer's site, and I tried to apply machine learning to optimise the placement of virtual machines across hosts. Very soon I ran into several issues and had to give up the attempt. Let me list some of the problems that I faced and could not overcome in a first project:

  1. The problem was too intricate for someone with no previous experience with machine learning.
  2. There were too many parameters that needed to be considered – too many explanatory variables.
  3. I could not decide on a proper response variable, so I did not know what exactly to optimise. In other words, there were again too many quantities to optimise, and I could not successfully combine them into a single variable. I also had no experience of working with a vector response variable.
  4. I had no previous programming experience – neither R nor Python. I was basically copy-pasting the MATLAB programs from Andrew Ng's class into Octave and trying to make things work.

Anyway, after this experience the first thing I did was to start learning to program. I admit that I got a little lost en route, but the detour stood me in good stead. I started learning Python and then shifted to R because I felt Python was getting too complicated. Then I thought, “why not C++? It's a strong language!”, and spent months learning the basics of C and C++. In hindsight, I was thankful for this, because my path through C/C++ taught me a lot about the “under-the-hood” workings of programming languages. In the meantime, I also took a class on how programs get executed on hardware and operating systems – again from Coursera. Anyhow, at the end of this circuitous route I fell back on Python for four main reasons:

  • Python is general purpose enough
  • Python has very strong support from multiple communities
  • Python is very easy to learn
  • Python is very easy to prototype with

Next, I picked up the basics of the other stuff that I needed:

  • NumPy – a numerical library in Python
  • pandas – a library to handle data in Python
  • scikit-learn – a machine learning and data munging library

I also learnt to work with SQL, and I am currently committed to learning MariaDB, MongoDB, Hadoop and Apache. However, these may or may not be essential for applying machine learning to pure IT administration jobs.

Once I became somewhat comfortable with my ability to program basic-to-intermediate machine learning models, I started looking at problems in IT administration that interested me and thinking about how I could apply these principles to help me in my job.

At this time, I was working with a bank supporting their VMware and Citrix environments. VMware is a server virtualisation solution, while Citrix is targeted at desktop and application virtualisation. In the bank's infrastructure, Citrix machines were deployed as VMware virtual machines, presenting several applications to the bank's employees. It was a nightmare to troubleshoot performance issues on the Citrix farm (that is what a group of Citrix machines is called), and we were getting several complaints about performance.

The difficulties were:

  • There was no performance benchmark
  • There was no defined SLA
  • It was very difficult to isolate whether the issues originated at the VMware layer or the Citrix layer.

Over and above this, it is nearly impossible to analytically determine the performance that a Citrix-based application running on top of VMware virtualisation should be providing.

My logic, though, was simple – hypervisor-level performance metrics should somehow be connected to the performance the end user was getting from the application. So I set out to gather this data. I ran esxtop (a variant of the Linux top) in batch mode on one of the servers and, during that time, randomly collected performance data from the Outlook application (I admit this might not be the ideal data collection method). However, I did restrict the scope of the problem in the following manner (a sketch of loading such batch output into pandas follows this list):

  • I only collected how long it took to open the Outlook client from Citrix, and within a certain time frame these values are highly related.
  • And I collected esxtop data only from the host that ran nothing but Outlook machines. These values are also related, but I was in any case taking the information from esxtop in batch mode.
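As an illustration, here is a minimal Python sketch of how such esxtop batch output could be pulled into pandas as candidate explanatory variables. The file name, the esxtop flags shown in the comment and the column filters are assumptions for illustration – esxtop emits perfmon-style headers whose exact text depends on the host and version, so adjust the filters to your own output.

```python
# A minimal sketch of pulling esxtop batch output into pandas.
# The CSV is produced on the ESX host with something like:
#   esxtop -b -d 15 -n 240 > esxtop_outlook_host.csv
# (-b = batch mode, -d = sampling delay in seconds, -n = number of samples)
# Column names below are illustrative only.
import pandas as pd

df = pd.read_csv("esxtop_outlook_host.csv")

# The first column is the sample timestamp; parse it so samples can later be
# matched against the times at which Outlook launch durations were recorded.
timestamps = pd.to_datetime(df.iloc[:, 0])

# Keep only the hypervisor-level CPU and memory counters as candidate
# explanatory variables (substring filter on the perfmon-style headers).
candidate_cols = [c for c in df.columns
                  if "Physical Cpu" in c or "Memory" in c]
features = df[candidate_cols].apply(pd.to_numeric, errors="coerce")

print(features.describe())
```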

Note: for people who know Citrix, the bank was still running Citrix Presentation Server 4.5, which made this a lot easier – this was before Citrix changed its architecture.

So I had some natural explanatory variables and a simple response variable that I could work with. I normalised the explanatory variables and used a simple ridge regression to fit the data. The results I got were encouraging, and I am currently attempting to fine-tune the model.
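For illustration, here is a minimal scikit-learn sketch of that normalise-then-ridge step. The data here is randomly generated stand-in data in place of the real counters and launch times, and the variable names are hypothetical.

```python
# A minimal sketch of a normalise + ridge regression fit, assuming the esxtop
# counters form a feature matrix X and the Outlook launch times a vector y.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # stand-in for esxtop counters
y = X @ [3.0, 0.0, 1.5, 0.0, -2.0] + rng.normal(scale=0.5, size=200)  # stand-in launch times

# StandardScaler normalises each explanatory variable; Ridge shrinks the
# coefficients, which helps when the counters are correlated with each other.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)
```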

This might be easier to tackle if you convert it into a classification problem. For example, you could transform the response variable from a continuous time into an ordered categorical variable by binning the response time into predefined ranges (and one-hot encoding the bins if the model requires it). So you could simply try to determine whether the response came within 25 seconds, between 25 and 50 seconds, and so on.
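A small pandas sketch of that reframing, with the bin edges and values chosen purely for illustration:

```python
# Bin the continuous launch time (in seconds) into ordered ranges so that a
# classifier can be fitted on the same counters. Values and edges are made up.
import numpy as np
import pandas as pd

launch_times = pd.Series([12.0, 31.5, 48.2, 63.0, 22.1])   # stand-in response values
bins = [0, 25, 50, np.inf]
labels = ["<25s", "25-50s", ">50s"]
launch_class = pd.cut(launch_times, bins=bins, labels=labels)
print(launch_class)

# If a particular model needs dummy/one-hot columns instead of a single
# categorical label, pandas can expand the bins:
print(pd.get_dummies(launch_class))
```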

For the statistically oriented with system knowledge, you may be able to tune the models better by studying the correlation matrices of the variables you select as explanatory. Others could simply try out different sets of counters as explanatory variables and see what gives the best results, weeding out correlated variables by trial and error. Another option may be to use LASSO regression instead of ridge. LASSO gives sparse solutions that automatically drop some variables, which helps weed out correlated ones, whereas ridge regression only shrinks their effects. However, I have not tried LASSO yet, so I cannot comment further than this.
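For illustration, a short sketch of both ideas – inspecting the correlation matrix and fitting LASSO – on synthetic data with hypothetical counter names (note, again, that I have not used LASSO on the real data myself):

```python
# Inspect pairwise correlations between candidate counters, then fit LASSO,
# which drives some coefficients to exactly zero. Data and names are made up.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=["cpu_util", "cpu_ready", "mem_active", "mem_swap"])
X["cpu_ready"] += 0.9 * X["cpu_util"]          # deliberately correlated pair
y = 2.0 * X["cpu_util"] + 1.0 * X["mem_active"] + rng.normal(scale=0.3, size=200)

print(X.corr().round(2))                        # correlation matrix of explanatory variables

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso.fit(X, y)
print(dict(zip(X.columns, lasso.named_steps["lasso"].coef_.round(2))))
```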

It is also extremely important to remember that it is not sufficient to only fit a model to the training data. Typically, I keep aside two data sets (I do a 60-20-20 partitioning) for testing and validation. Testing is required because you can always fit the training data better by using ever more complex models. However, if you also keep evaluating the model against the test data set, you will notice that after some point the cost function on the test set starts to diverge – that is, the model stops doing a good job on a different data set. That is the point where you know you have to stop. The validation data set then comes in to verify whether the model will do well in a more generic setting. Finally, I also use model performance scores, though I have to admit that I use them more as a backstop: for regression I would typically use the adjusted r-squared score, and for classification the F1 score.
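A rough sketch of that 60-20-20 workflow with scikit-learn, on synthetic stand-in data. The adjusted r-squared helper here is a hypothetical convenience function (scikit-learn only provides the plain r2_score); for the classification variant, sklearn.metrics.f1_score would play the same role.

```python
# Split the data 60-20-20 into train/test/validation, fit on the training set,
# and score on the held-out sets. X and y are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ [2.0, 0.5, 0.0, -1.0, 1.5] + rng.normal(scale=0.5, size=300)

# First split off 20% for validation, then split the remainder 75/25 so the
# final proportions are 60% train, 20% test, 20% validation.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)

def adjusted_r2(y_true, y_pred, n_features):
    # Adjusted r-squared penalises adding explanatory variables that do not help.
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print("test adj. r2:", round(adjusted_r2(y_test, model.predict(X_test), X.shape[1]), 3))
print("val  adj. r2:", round(adjusted_r2(y_val, model.predict(X_val), X.shape[1]), 3))
```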

For the academically oriented, I did check out some academic papers before undertaking this project and what I understood from them is the following:

  • Academic papers make the data collection complicated because they are typically looking at some representative workload. However, as system engineers we are mostly looking at specific scenarios so our data collection is simpler.
  • Our response variable selection is also simple because again we are not looking at representative workloads.
  • Some papers are interested in developing general systems. We, on the other hand, are not interested in developing a product but rather a solution that is of immediate use. You may be interested in developing your own product, but that is not the focus here.

And I am sure there are many problems in major IT systems that give engineers a headache, and I really believe that machine learning can make the job easier. However, do note that it involves a steep learning curve and requires an inquisitive and imaginative mind. Finally, always remember that in machine learning, as in any other field, the most important factor in the system is the human – because you can make data say whatever you want it to say. On an ending note, if you are interested in discussing the role and/or applications of machine learning to IT jobs further, please feel free to contact me (though also note that I am no subject matter expert :)).
