Machine learning maths we all use all the time

Does machine learning (ML) add to business complexity, or provide solutions to it?  References to AI, black-box algorithms, neural networks and the like don’t necessarily help the cause of ML.  However, the reality is that we already use the statistics and maths underpinning various ML methods all the time.  We just may not think of it that way.  Advanced statistical methods and ML simply expand on core methods that we already take for granted in decision making.

Consider the following three mathematical/statistical tools.    

1.     Probability 

We all know intuitively that, if pitched a plan to grow the business by 10%, the immediate next question (aside from cost) should be ‘how confident are you about the outcome?’  And it’s clear that there is a big difference between ‘+/-2%’ and ‘it could be up as much as 15%, but if it doesn’t work revenue could fall 20%’.

In business we tend to spend a lot of time thinking about the distribution of outcomes, or in statistical terms, the data generating process¹.  We know that if we understand the shape of possible outcomes, we will make better decisions.  At a minimum, thinking about the true distribution of data helps us to avoid focussing too much on a narrow set of sample statistics such as an average, which can mislead.

Bayes’ Theorem:  P(A|B) = P(B|A) × P(A) / P(B)

Another way to approach this is via Bayes’ Theorem.  The intuition behind this rule (above) should be familiar, particularly where information is fragmented or harder to come by.  It relies on two core ingredients: conditional probability and a prior.  To use a current example, we adjust our probability of someone having COVID based on whether or not they have a dry cough.  But this depends on knowing the probability of a dry cough given COVID (the likelihood) and the probability of someone having COVID in the first place (the prior).
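To make this concrete, here is a minimal sketch in Python.  The figures are purely hypothetical, chosen only to show the mechanics of the update rather than to reflect real COVID statistics:

# Bayes' Theorem: P(COVID | cough) = P(cough | COVID) x P(COVID) / P(cough)
# All numbers below are hypothetical, for illustration only.
prior = 0.01              # P(COVID): assumed background rate in the population
likelihood = 0.60         # P(cough | COVID): assumed chance of a dry cough given COVID
p_cough_no_covid = 0.10   # dry coughs also occur without COVID

# total probability of observing a dry cough (the evidence)
p_cough = likelihood * prior + p_cough_no_covid * (1 - prior)

# updated (posterior) probability of COVID given a dry cough
posterior = likelihood * prior / p_cough
print(round(posterior, 3))   # ~0.057: the new information lifts a 1% prior to roughly 6%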

Bayes’ doesn’t deal with causality.  It addresses the way we update our assessment of situations based on a prior expectation and new information.  We apply this frequently in business.  We seek an informed view (prior) and then update as we collect new data (or sometimes not, but that’s a different discussion).  

As an aside, ML relies heavily on priors, in many different guises, across many different algorithms. It’s one of the reasons that ML is so powerful when applied in conjunction with business expertise (and often limited when applied without).  

2.     Algebra

The majority of traditional business information is received or assessed in matrix form.  For example, think of a revenue table with various products along the columns and clients down the rows.  Matrix algebra is a very common way in which we use and transform this information, more often than not through simple linear regression, such as relating volume to revenue.  To explain this further, a little matrix algebra is required.

First, though, a terminology recap: covariance measures the extent to which two variables move together (or apart).  Variance is a measure of the spread of a variable.  Together these two help us understand the interplay between information in a comparable way².

Revenue (y) = m × Product Volumes (X) + u

Next, let’s use the example above: y is revenue per month; X is a matrix with each column containing individual product volumes per month; and u is a residual (unexplained) term (the bit we want to minimise).  We want to predict revenue as accurately as possible given product volumes.  To do this we need to arrive at weights (m) that minimise the residual (u), which in this case is the difference between our predicted revenue and actual revenue.  

I’ve included the derivation at the end for those with interest³, but in matrix algebra terms we transform the relationship above and end up with the following...

m = (X’X)⁻¹X’y       [’ designates the transpose of a matrix and ⁻¹ its inverse]

This innocuous little formula provides the solution for many of our common business forecasting tasks.  In our example the problem is solved – loosely speaking – by comparing the way product volumes and revenue change together (i.e. X’y or covariance) with respect to the underlying variability of product volumes (X’X or variance/covariance).  Which makes sense: we want to understand how variation in revenues is associated with variation in product volumes.  
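As a rough sketch of how this plays out in practice, the snippet below (Python with NumPy, using made-up volume and revenue figures purely for illustration) computes the weights m directly from the formula above, and then via NumPy’s least-squares routine:

import numpy as np

# hypothetical monthly volumes for two products (columns) over six months (rows)
X = np.array([[10, 4],
              [12, 5],
              [ 9, 7],
              [14, 6],
              [11, 8],
              [13, 5]], dtype=float)

# hypothetical monthly revenue for the same six months
y = np.array([52.0, 61.0, 55.0, 68.0, 63.0, 64.0])

# m = (X'X)^-1 X'y, the formula derived in the endnotes (no intercept, no normalisation)
m = np.linalg.inv(X.T @ X) @ X.T @ y
print(m.round(3))   # estimated revenue contribution per unit of each product

# in practice a least-squares solver is preferred over an explicit inverse
m_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(m_lstsq.round(3))   # same weights, computed more stably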

Machine learning simply extends the use of matrix algebra to explore more complex relationships between information⁴ for both discovery and prediction.

3.     Chain Rule

Chain Rule:  dz/dx = (dz/dy) × (dy/dx)

Finally, the chain rule (above).  The more complex the business and the larger the amount of information we need to deal with, the more significant this rule becomes.  The chain rule states that if z depends on y and y in turn depends on x, then we can relate z to x.  This (relatively) simple piece of maths is one of the most critical components of neural networks.  Neural networks are constructed in layers, with each layer containing a number of nodes (see diagram), and are good at capturing complex interactions between pieces of information.

The chain rule is used to adjust weights across nodes and layers within the network to optimise our objective (through a process called backpropagation).  
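As a minimal sketch of that idea (a single node with one weight rather than a full network), the Python below uses the chain rule to adjust a weight step by step; the input, target and learning rate are arbitrary illustrative values:

# a single 'node': prediction = w * x, and we minimise loss = (prediction - target)^2
x, target = 2.0, 10.0    # hypothetical input and desired output
w = 1.0                  # initial weight
learning_rate = 0.05

for step in range(20):
    pred = w * x                    # the prediction depends on w ...
    loss = (pred - target) ** 2     # ... and the loss depends on the prediction
    # chain rule: dloss/dw = dloss/dpred * dpred/dw
    dloss_dpred = 2 * (pred - target)
    dpred_dw = x
    dloss_dw = dloss_dpred * dpred_dw
    w -= learning_rate * dloss_dw   # nudge the weight to reduce the loss

print(round(w, 3), round(loss, 6))  # w converges towards 5.0 (since 5.0 * 2.0 = 10.0) and the loss towards 0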

[Diagram: a neural network, with layers of connected nodes]

We can readily relate the chain rule to business.  One way to think of nodes is as processes, systems, teams or other ways in which we turn raw information into outcomes.  Whether it’s product assessment, pricing, risk management, call centre management or marketing engagement, the relative attention and effort we afford each ‘node’ ultimately dictates the quality of the output.  As regulation, process complexity and information (data) input increase, understanding interactions across a business and attaching the right ‘weight’ to each becomes harder (but also more important).  More advanced tools than linear regression are required.

So how does Machine Learning differ?  

It doesn’t – in most cases it is simply an extension of these approaches, combined with others that we also use intuitively every day.  ML is typically engaged in four primary ways:

1)    To automate simple processes at scale.

2)    To generate scalability, consistency and efficiency in semi-complex decision making.

3)    To challenge, validate and/or support more complex decision-making processes.

4)    To support discovery (through data), underpinning innovation and transformation.

The more complex the decision space, the more that ML supports rather than supplants⁵.  In this way, appropriately applied, machine learning and statistical methods don’t add complexity; they help mitigate it.  And in most cases, ML does it using tools and approaches that we already intuitively or actively use today.

Views and opinions expressed are solely my own and do not express those of my employer. 

-----------------------------------------------------------------------------------------------------------

Endnotes:

1  We also shouldn’t lose sight of the fact that distributions aren’t always normal (bell-shaped).  To get a feel for alternative distributions, read some of the examples listed for Poisson and Exponential distributions.   
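A quick illustration of why this matters, in Python (the wait-time scenario and parameters are hypothetical): for a skewed distribution such as the exponential, the mean and the median tell quite different stories.

import numpy as np

# hypothetical customer wait times drawn from an exponential (skewed) distribution
rng = np.random.default_rng(42)
waits = rng.exponential(scale=10.0, size=100_000)   # long-run average of ~10 minutes

print(round(waits.mean(), 1))       # ~10.0, pulled up by a long tail of very slow cases
print(round(np.median(waits), 1))   # ~6.9, the 'typical' customer waits far less (10 x ln 2)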

2  By comparison, correlation is the covariance of two variables divided by the product of their standard deviations (the square roots of their respective variances), i.e. correlation is a standardised measure of covariance.

3  More specifically, we minimise the sum of squared residuals u (the Residual Sum of Squares; divided by the number of observations it becomes the Mean Squared Error).  The intuition is that otherwise large positive and negative errors could cancel each other out, giving the false impression that we have an accurate model when we don’t (the square also has some nice statistical properties).  Writing the squared error in terms of m, we then work through as follows:

i.    u’u = (y-Xm)’(y-Xm) = (y’- (Xm)’)(y-Xm)    # write the squared error term in terms of m

ii.   u’u = y’y – y’(Xm) – (Xm)’y + m’X’Xm    # expand

iii.  u’u = y’y – 2(Xm)’y + m’X’Xm    # y’(Xm) and (Xm)’y are the same scalar (y’z = z’y for vectors), so the middle terms combine

iv.   d(u’u)/dm = 2(X’Xm – X’y) = 0    # take the first derivative with respect to m and set it to zero, noting that (Xm)’ = m’X’ and d(m’) = (dm)’

v.    X’Xm = X’y    # rearrange, ready to solve for m

vi.   (X’X)⁻¹X’Xm = (X’X)⁻¹X’y    # pre-multiply both sides by (X’X)⁻¹; on the left-hand side (X’X)⁻¹X’X is the identity matrix

vii.  m = (X’X)⁻¹X’y    # the solution for m

Note: For expediency I have left off notations associated with expected and sample values and ignored dealing with normalisation of the underlying data. 

4  An important extension of this (relating to conditions of non-invertibility/singularity) is Singular Value Decomposition (SVD).  I only mention it here because, from a business perspective, what we really want to understand – particularly when we fold in more complex decision boundaries such as cost, conduct risk, technology and product development – is how we maximise demand for the minimum number of products at any point in time.  In other words, as well as understanding the product relationships, we also want to understand customer relationships (in statistics, row versus column attributes, or in SVD, U versus V).  More complex statistical methods can help us do this and are the basis of much of the ML work applied in data-driven firms today.
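As a rough sketch of the row-versus-column idea, the NumPy snippet below applies SVD to a small made-up customer-by-product revenue matrix; the numbers are purely illustrative:

import numpy as np

# hypothetical revenue matrix: rows are customers, columns are products
A = np.array([[120.,  80.,  10.],
              [115.,  85.,  12.],
              [ 20.,  25., 150.],
              [ 18.,  30., 145.]])

# SVD factorises A into customer structure (U), product structure (V) and
# singular values (S) ranking how much variation each component explains
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(S.round(1))        # a couple of dominant components capture most of the structure
print(Vt[0].round(2))    # the leading product pattern (column attributes)
print(U[:, 0].round(2))  # how strongly each customer loads on that pattern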

5  In his introductory lectures on ML, Andrew Ng provides a great overview of conditions that encourage effective ML.  This link provides a good summary of the main points.
