Machine Learning basics for CSPs

What better way to close off the first week of 2016 than to take a shot at demystifying Machine Learning (ML)?

Lately ML has been getting considerable coverage, largely because it is the next buzzword, alongside Big Data, that BI and analytics vendors use to differentiate themselves. Rightly or wrongly, Deep Learning and Predictive Analytics tend to be thrown around as related buzzwords.

ML is a long-established approach to addressing tough business strategy challenges using data. It overlaps with what academia covers under Operations Research and statistics, and with what data technology practitioners call data mining. Most large and medium organizations have used learning algorithms to address business challenges in areas such as customer acquisition and retention. The specialized techniques and theories behind them, backed by several decades of research, are taught in bachelor's and master's courses in marketing, finance, economics, behavioral sciences, computer science and so on.

In this article I'll be talking about ML from a general technology practice and data analytics point of view, rather than as an expert statistician (which I'm not). I'll be glossing over some of the painful details, so please don't feel offended, but do feel free to leave angry comments below.

At its simplest, an ML algorithm computes a predicted output (often a probability) for tomorrow based on yesterday's data, then repeats the computation tomorrow based on today's and yesterday's data, and so on. The "learning" comes from continually feeding current outputs, and how they compared with what actually happened, back into the input pool for future computation. As the algorithm learns from past data, its predictions of future outputs can improve. In very simple words, if you can draw a straight line through points a, b and c (assuming all three lie on the same straight line), you can predict a point d that lies further ahead on that line - simple, isn't it? What if a, b and c don't lie on a line, and perhaps sit on a logarithmic curve instead? Well, that's the trick: in order to predict point d, you need a good idea of the type of curve that points a, b and c lie upon.
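
To make the a, b, c, d idea concrete, here is a minimal Python sketch. The points, the `fit_line` helper and the choice of curves are all made up for illustration; it simply fits a straight line by least squares, predicts point d, and then shows what happens when the points really sit on a logarithmic curve.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points a, b and c, assumed to lie on the straight line y = 2x + 1.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
d_linear = slope * 4.0 + intercept  # prediction for point d at x = 4

# The same x values, but now assumed to lie on y = 3 * ln(x) + 2.
log_ys = [3.0 * math.log(x) + 2.0 for x in xs]

# Wrong guess about the curve: fit a straight line to logarithmic data.
bad_slope, bad_intercept = fit_line(xs, log_ys)
d_wrong = bad_slope * 4.0 + bad_intercept

# Right guess about the curve: fit a straight line against ln(x) instead of x.
log_slope, log_intercept = fit_line([math.log(x) for x in xs], log_ys)
d_log = log_slope * math.log(4.0) + log_intercept

print("straight-line data, predicted d:", d_linear)
print("log data, straight-line guess:  ", d_wrong)
print("log data, logarithmic guess:    ", d_log)
```

The second prediction lands noticeably off the true curve while the third lands on it, which is the whole point: the fitting step is easy, and knowing which kind of curve to fit is the hard part.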

There are generally two main categories of machine learning models: supervised and unsupervised. Supervised models apply where there's plenty of "yesterday's" data and, in addition, known outcome data is available. The "supervised" part is that known outcome data, which indicates which of "yesterday's" data points actually passed or failed based on a chosen measure. What you typically want out of a supervised model is a prediction about tomorrow, based on yesterday's data and known outcomes. As the arrow of time moves on, your ML algorithm will use the predictions it made and the known outcomes (i.e., whether each prediction turned out to be true or not) to continually improve its accuracy (its estimate of the curve through points a, b, c and d).

Regression and classification are two types of predictive/supervised machine learning approaches (there are many more, but these are probably the best suited to the business problems CSPs often face). Regression, which predicts a numeric value, includes algorithms like linear regression and support vector regression; classification, which predicts a category, includes popular algorithms like decision trees and Naive Bayes. Each of these approaches and algorithms functions differently, and some have proven more effective than others at predicting certain types of business problems and outcomes - so choosing the right model for your business problem is key.
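
For a feel of the classification side, here is a rough Python sketch of the simplest possible decision tree: a single split, sometimes called a decision stump. The data (dropped calls per month and whether the customer churned) and the `fit_stump` helper are invented for illustration; real decision tree libraries handle many features and many splits.

```python
def fit_stump(values, labels):
    """Pick the threshold on one feature that misclassifies the fewest rows.

    The resulting rule predicts True whenever value > threshold.
    """
    candidates = sorted(set(values))
    best_threshold = None
    best_errors = len(values) + 1
    # Try the midpoint between each pair of neighbouring feature values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        errors = sum((v > threshold) != label for v, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

# Hypothetical "yesterday's" data with known outcomes (the supervised part).
dropped_calls = [0, 1, 2, 3, 7, 8, 9, 12]
churned = [False, False, False, False, True, True, True, True]

threshold = fit_stump(dropped_calls, churned)

def predict_churn(calls):
    """Classify a new customer using the learned threshold."""
    return calls > threshold

print("learned threshold:", threshold)
print("customer with 10 dropped calls churns?", predict_churn(10))
print("customer with 1 dropped call churns?", predict_churn(1))
```

The known outcomes in `churned` are what make this supervised: the algorithm is told which past customers left and searches for the rule that best separates them.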

Unsupervised models are generally not used to predict a future outcome the way supervised models are; instead they are used to make sense of data by clustering related elements together. Clustering algorithms like k-means and hierarchical clustering find groups of entities which show similar characteristics based on the dimensions available in the data. There usually isn't a predefined set of clusters to map to, so clustering attempts to find patterns in the data which may or may not be obvious or clearly visible. Clustering helps businesses define tailored strategies and approaches based on the characteristics of each cluster, which theoretically leads to a better outcome for the business with regard to each cluster.
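
As a sketch of how clustering works, here is a bare-bones k-means in Python. The customer data (monthly spend and data usage), the `kmeans` helper and the choice of starting centroids are all assumptions made for illustration; production libraries pick starting points more carefully and handle far more dimensions.

```python
def kmeans(points, centroids, iterations=10):
    """Very small k-means: assign each point to its nearest centroid, then recompute."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers as (monthly spend, GB of data used); no outcome labels.
customers = [(10, 1), (12, 2), (11, 1.5), (50, 20), (52, 22), (48, 19)]

# Simple deterministic start: use the first and last customers as centroids.
centroids, clusters = kmeans(customers, [customers[0], customers[-1]])

for centroid, cluster in zip(centroids, clusters):
    print("centroid", centroid, "members", cluster)
```

Nobody told the algorithm that there are light and heavy users; it found the two groups from the data alone, and each group could then get its own tailored offer.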

There is no shortage of tools available to perform ML, from free of cost to expensive enterprise licenses, on a small or a large scale, on your laptop or on big data in the cloud. Based on your needs and appetite, you can dabble with ML at any scale. For example, the lowest-barrier way of getting started with ML is to use R - if you are the programming kind, that is. With R installed on your personal computer, you can write fairly straightforward code which gets interpreted (not compiled) by the R shell. There are countless libraries available in R to apply all sorts of ML algorithms and perform complex analytical tasks, often with only a few lines of code. R also has commercial options like RStudio, R Server and Shiny, which can help you scale your R operations and build advanced visualizations on top of your ML analysis. Like R, other programming languages such as Scala and Python are also very well suited to ML.

Traditionally, large and medium enterprises have been using ML analytics in old-school systems like SPSS (IBM), SAS Enterprise Miner, and STATISTICA (Dell). Employees trained in these systems write complex models to evaluate data available from core operational systems (like Billing and CRM) combined with market research (like competitive footprint, product availability, demographics, etc.). The most common use cases for these models have been in marketing and sales, as well as revenue forecasting. These systems are typically clunky but reliable and, as you can imagine, cost a fair amount to procure and operate.

In the last few years, disruptive DIY BI and analytics providers like Alteryx and GoodData have been adding ML libraries to their products. KNIME is an open source desktop analytics tool (with commercial options available) that provides very strong ML functionality. Common RDBMSs have a good number of ML libraries available as well, so SQL has been a staple for ML enthusiasts from a DB background.

One of the main reasons ML has picked up so much attention in the last few years is the strong focus it is getting from top-level Apache and other open source projects. Big Data systems and ML are a natural combination because ML models generally need large data sets to be reasonably accurate in their predictions. The Apache Spark big data platform includes an extensive machine learning library called MLlib. Commercial big data vendors like Cloudera have ML libraries (MADlib) integrated into products like Impala. And almost all the big cloud storage and compute providers (MS Azure, Google, AWS) have prediction libraries available to use in their environments. There are also some niche cloud players like Databricks, H2O and Prediction.io which specialize in providing ML functionality in the cloud.

As a business owner, you really have two options: either continue running ML projects to gain tactical advantage, for which you can keep using open source tools or the specialized investments you've already made in things like SPSS; or start integrating ML into your Big Data and cloud strategy.

As more and more businesses focus on long-term customer experience and customer lifetime value, as well as cost reduction via consolidation and business transformation, it is becoming a no-brainer to have a solid, all-encompassing Data Strategy that takes into account machine learning and predictive analytics, BAU business intelligence, big data storage, and the cloud model. Addressing each of these in bits and pieces will inevitably create the same technology and information silos that businesses are already struggling to cope with in today's competitive landscape.

ML and Predictive Analytics are giving countless businesses a competitive advantage by providing better visibility into customer behavior, predicting and identifying patterns, and helping leaders make informed business decisions. The barrier to entry is now so low that you just can't afford not to take the leap!

Happy Machine Learning!
