Machine Learning for IT Admins

A few years back, I took Andrew Ng's machine learning class on Coursera. The class got me enthused about the field, and I soon began to see applications of machine learning concepts in my own line of work.

I decided to write this note to encourage other system engineers to look at machine learning as a tool that can help them in their day-to-day jobs. But please remember that I have no formal training either as a statistician or as a data analyst, so take this with a grain of salt. That said, I would be delighted if you let me know, on this post, if you find any errors.

My very first attempt at applying machine learning was to a design problem. I was involved with introducing VMware-based virtualisation at a customer's site, and I tried to apply machine learning to optimise the placement of virtual machines across hosts. Very soon I ran into several issues and had to give up the attempt. Let me list some of the problems that I faced and could not overcome in a first project:

  1. The problem was too intricate for someone with no previous experience with machine learning.
  2. There were too many parameters that needed to be considered – too many explanatory variables.
  3. I could not decide on a proper response variable, so I did not know what exactly to optimise. In other words, there were again too many quantities to optimise, and I could not successfully combine them into a single variable. I also had no experience of working with a vector response variable.
  4. I had no previous programming experience – neither R nor Python. I was basically copy-pasting the MATLAB programs from Andrew Ng's class into Octave and trying to make things work.

Anyway, after this experience the first thing I did was to start learning to program. I admit that I got a little lost en route, but the detour stood me in good stead. I started learning Python and then shifted to R because I felt Python was getting too complicated. Then I thought, “why not C++? It's a strong language!”, and spent months learning the basics of C and C++. In hindsight, I was thankful for this, because my path through C/C++ taught me a lot about the “under-the-hood” workings of programming languages. In the meantime, I also took a class on how programs get executed on hardware and operating systems – again from Coursera. Anyhow, at the end of this circuitous route I fell back on Python for four main reasons:

  • Python is general purpose enough
  • Python has very strong support from multiple communities
  • Python is very easy to learn
  • Python is very easy to prototype with

Next, I picked up the basics of the other stuff that I needed:

  • NumPy – a numerical library in Python
  • pandas – a library to handle data in Python
  • scikit-learn – a machine learning and data munging library

I also learnt to work with SQL, and I am currently committed to learning MariaDB, MongoDB, Hadoop and Apache. However, these may or may not be essential for applying machine learning to pure IT administration jobs.

Once I became somewhat comfortable with my ability to program basic-to-intermediate machine learning models, I started looking at problems in IT administration that interested me and thinking about how I could apply these principles to help me in my job.

At this time, I was working with a bank supporting their VMware and Citrix environments. VMware is a server virtualisation solution, while Citrix is targeted at desktop and application virtualisation. In the bank's infrastructure, Citrix machines were deployed as VMware virtual machines, presenting several applications to the bank's employees. It was a nightmare to troubleshoot performance issues on the Citrix farm (that is what a group of Citrix machines is called), and we were getting several complaints about performance.

The difficulties were:

  • There was no performance benchmark
  • There was no defined SLA
  • It was very difficult to isolate whether the issues originated at the VMware layer or the Citrix layer.

Over and above this, it is nearly impossible to analytically determine the performance that a Citrix-based application running on top of VMware virtualisation should be providing.

My logic, though, was simple – hypervisor-level performance metrics should somehow be connected to the performance the end user was getting from the application. So I set out to gather this data. I ran esxtop (a variant of the Linux top) in batch mode on one of the servers and, during that time, randomly collected performance data from the Outlook application (I admit this might not be the ideal data collection method). However, I did restrict the scope of the problem in the following manner (a sketch of loading such batch output into pandas follows this list):

  • I only collected how long it took to open the Outlook client from Citrix, and within a certain time frame these values are highly related.
  • And I collected esxtop data only from the host that ran nothing but Outlook machines. These values are also related, but I was in any case taking the information from esxtop in batch mode.
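As an illustration, here is a minimal Python sketch of how such esxtop batch output could be pulled into pandas as candidate explanatory variables. The file name, the esxtop flags shown in the comment and the column filters are assumptions for illustration – esxtop emits perfmon-style headers whose exact text depends on the host and version, so adjust the filters to your own output.

```python
# A minimal sketch of pulling esxtop batch output into pandas.
# The CSV is produced on the ESX host with something like:
#   esxtop -b -d 15 -n 240 > esxtop_outlook_host.csv
# (-b = batch mode, -d = sampling delay in seconds, -n = number of samples)
# Column names below are illustrative only.
import pandas as pd

df = pd.read_csv("esxtop_outlook_host.csv")

# The first column is the sample timestamp; parse it so samples can later be
# matched against the times at which Outlook launch durations were recorded.
timestamps = pd.to_datetime(df.iloc[:, 0])

# Keep only the hypervisor-level CPU and memory counters as candidate
# explanatory variables (substring filter on the perfmon-style headers).
candidate_cols = [c for c in df.columns
                  if "Physical Cpu" in c or "Memory" in c]
features = df[candidate_cols].apply(pd.to_numeric, errors="coerce")

print(features.describe())
```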

Note: for people who know Citrix, the bank was still running Citrix Presentation Server 4.5, which made this a lot easier – this was before Citrix changed its architecture.

So I had some natural explanatory variables and a simple response variable that I could work with. I normalised the explanatory variables and used a simple ridge regression to fit the data. The results I got were encouraging, and I am currently attempting to fine-tune the model.
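For illustration, here is a minimal scikit-learn sketch of that normalise-then-ridge step. The data here is randomly generated stand-in data in place of the real counters and launch times, and the variable names are hypothetical.

```python
# A minimal sketch of a normalise + ridge regression fit, assuming the esxtop
# counters form a feature matrix X and the Outlook launch times a vector y.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # stand-in for esxtop counters
y = X @ [3.0, 0.0, 1.5, 0.0, -2.0] + rng.normal(scale=0.5, size=200)  # stand-in launch times

# StandardScaler normalises each explanatory variable; Ridge shrinks the
# coefficients, which helps when the counters are correlated with each other.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)
```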

This might be easier to tackle if you convert it into a classification problem. For example, you could transform the response variable from a continuous time into an ordered categorical variable by binning the response time into predefined ranges (and one-hot encoding the bins if the model requires it). So you could simply try to determine whether the response came within 25 seconds, between 25 and 50 seconds, and so on.
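A small pandas sketch of that reframing, with the bin edges and values chosen purely for illustration:

```python
# Bin the continuous launch time (in seconds) into ordered ranges so that a
# classifier can be fitted on the same counters. Values and edges are made up.
import numpy as np
import pandas as pd

launch_times = pd.Series([12.0, 31.5, 48.2, 63.0, 22.1])   # stand-in response values
bins = [0, 25, 50, np.inf]
labels = ["<25s", "25-50s", ">50s"]
launch_class = pd.cut(launch_times, bins=bins, labels=labels)
print(launch_class)

# If a particular model needs dummy/one-hot columns instead of a single
# categorical label, pandas can expand the bins:
print(pd.get_dummies(launch_class))
```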

For the statistically oriented with system knowledge, you may be able to tune the models better by studying the correlation matrices of the variables you select as explanatory. Others could simply try out different sets of counters as explanatory variables and see what gives the best results, weeding out correlated variables by trial and error. Another option may be to use LASSO regression instead of ridge. LASSO gives sparse solutions that automatically drop some variables, which helps weed out correlated ones, whereas ridge regression only shrinks their effects. However, I have not tried LASSO yet, so I cannot comment further than this.
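For illustration, a short sketch of both ideas – inspecting the correlation matrix and fitting LASSO – on synthetic data with hypothetical counter names (note, again, that I have not used LASSO on the real data myself):

```python
# Inspect pairwise correlations between candidate counters, then fit LASSO,
# which drives some coefficients to exactly zero. Data and names are made up.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=["cpu_util", "cpu_ready", "mem_active", "mem_swap"])
X["cpu_ready"] += 0.9 * X["cpu_util"]          # deliberately correlated pair
y = 2.0 * X["cpu_util"] + 1.0 * X["mem_active"] + rng.normal(scale=0.3, size=200)

print(X.corr().round(2))                        # correlation matrix of explanatory variables

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso.fit(X, y)
print(dict(zip(X.columns, lasso.named_steps["lasso"].coef_.round(2))))
```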

It is also extremely important to remember that it is not sufficient to only fit a model to the training data. Typically, I keep aside two data sets (I do a 60-20-20 partitioning) for testing and validation. Testing is required because you can always fit the training data better by using ever more complex models. However, if you also keep evaluating the model against the test data set, you will notice that after some point the cost function on the test set starts to diverge – that is, the model stops doing a good job on a different data set. That is the point where you know you have to stop. The validation data set then comes in to verify whether the model will do well in a more generic setting. Finally, I also use model performance scores, though I have to admit that I use them more as a backstop: for regression I would typically use the adjusted r-squared score, and for classification the F1 score.
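A rough sketch of that 60-20-20 workflow with scikit-learn, on synthetic stand-in data. The adjusted r-squared helper here is a hypothetical convenience function (scikit-learn only provides the plain r2_score); for the classification variant, sklearn.metrics.f1_score would play the same role.

```python
# Split the data 60-20-20 into train/test/validation, fit on the training set,
# and score on the held-out sets. X and y are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ [2.0, 0.5, 0.0, -1.0, 1.5] + rng.normal(scale=0.5, size=300)

# First split off 20% for validation, then split the remainder 75/25 so the
# final proportions are 60% train, 20% test, 20% validation.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)

def adjusted_r2(y_true, y_pred, n_features):
    # Adjusted r-squared penalises adding explanatory variables that do not help.
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print("test adj. r2:", round(adjusted_r2(y_test, model.predict(X_test), X.shape[1]), 3))
print("val  adj. r2:", round(adjusted_r2(y_val, model.predict(X_val), X.shape[1]), 3))
```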

For the academically oriented, I did check out some academic papers before undertaking this project and what I understood from them is the following:

  • Academic papers make the data collection complicated because they are typically looking at some representative workload. However, as system engineers we are mostly looking at specific scenarios so our data collection is simpler.
  • Our response variable selection is also simple because again we are not looking at representative workloads.
  • Some papers are interested in developing general systems. We, on the other hand, are not interested in developing a product but rather a solution that is of immediate use. You may be interested in developing your own product, but that is not the focus here.

And I am sure there are many problems in major IT systems that give engineers a headache, and I really believe that machine learning can make the job easier. However, do note that it involves a steep learning curve and requires an inquisitive and imaginative mind. Finally, always remember that in machine learning, as in any other field, the most important factor in the system is the human – because you can make data say whatever you want it to say. On an ending note, if you are interested in discussing the role and/or applications of machine learning to IT jobs further, please feel free to contact me (though also note that I am no subject matter expert :)).
