Machine learning maths we all use all the time

Does machine learning (ML) add to business complexity, or provide solutions to it?  References to AI, black-box algorithms, neural networks and the like don’t necessarily help the cause of ML.  However, the reality is that we already use the statistics and maths underpinning various ML methods all the time.  We just may not think of it that way.  Advanced statistical methods and ML simply expand on core methods that we already take for granted in decision making.

Consider the following three mathematical/statistical tools.    

1.     Probability 

We all know intuitively that, if pitched a plan to grow the business by 10%, the immediate next question (aside from cost) should be ‘how confident are you about the outcome?’  And it’s clear that there is a big difference between ‘+/-2%’ and ‘it could be up as much as 15%, but if it doesn’t work revenue could fall 20%’.

In business we tend to spend a lot of time thinking about the distribution of outcomes, or in statistical terms, the data generating process¹.  We know that if we understand the shape of possible outcomes, we will make better decisions.  At a minimum, thinking about the true distribution of data helps us to avoid focussing too much on a narrow set of sample statistics such as an average, which can mislead.

Bayes’ Theorem:  P(A|B) = P(B|A) × P(A) / P(B)

Another way to approach this is via Bayes’ Theorem.  The intuition behind this rule (above) should be familiar, particularly where information is fragmented or harder to come by.  It relies on two core ingredients: conditional probability and a prior.  To use a current example, we adjust our probability of someone having COVID based on whether or not they have a dry cough.  But this depends on knowing the probability of a dry cough given COVID (the likelihood) and the probability of someone having COVID in the first place (the prior).
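To make this concrete, here is a minimal sketch in Python.  The figures are purely hypothetical, chosen only to show the mechanics of the update rather than to reflect real COVID statistics:

# Bayes' Theorem: P(COVID | cough) = P(cough | COVID) x P(COVID) / P(cough)
# All numbers below are hypothetical, for illustration only.
prior = 0.01              # P(COVID): assumed background rate in the population
likelihood = 0.60         # P(cough | COVID): assumed chance of a dry cough given COVID
p_cough_no_covid = 0.10   # dry coughs also occur without COVID

# total probability of observing a dry cough (the evidence)
p_cough = likelihood * prior + p_cough_no_covid * (1 - prior)

# updated (posterior) probability of COVID given a dry cough
posterior = likelihood * prior / p_cough
print(round(posterior, 3))   # ~0.057: the new information lifts a 1% prior to roughly 6%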

Bayes’ doesn’t deal with causality.  It addresses the way we update our assessment of situations based on a prior expectation and new information.  We apply this frequently in business.  We seek an informed view (prior) and then update as we collect new data (or sometimes not, but that’s a different discussion).  

As an aside, ML relies heavily on priors, in many different guises, across many different algorithms. It’s one of the reasons that ML is so powerful when applied in conjunction with business expertise (and often limited when applied without).  

2.     Algebra

The majority of traditional business information is received or assessed in matrix form.  For example, think of a revenue table with various products along the columns and clients down the rows.  Matrix algebra is a very common way in which we use and transform this information, more often than not through simple linear regression, such as relating volume to revenue.  To explain this further, a little matrix algebra is required.

First, though, a terminology recap: covariance measures the extent to which two variables move together (or apart).  Variance is a measure of the spread of a variable.  Together these two help us understand the interplay between information in a comparable way².

Revenue (y) = m × Product Volumes (X) + u

Next, let’s use the example above: y is revenue per month; X is a matrix with each column containing individual product volumes per month; and u is a residual (unexplained) term (the bit we want to minimise).  We want to predict revenue as accurately as possible given product volumes.  To do this we need to arrive at weights (m) that minimise the residual (u), which in this case is the difference between our predicted revenue and actual revenue.  

I’ve included the derivation at the end for those with interest³, but in matrix algebra terms we transform the relationship above and end up with the following...

m = (X’X)⁻¹X’y       [’ designates the transpose of a matrix and ⁻¹ its inverse]

This innocuous little formula provides the solution for many of our common business forecasting tasks.  In our example the problem is solved – loosely speaking – by comparing the way product volumes and revenue change together (i.e. X’y or covariance) with respect to the underlying variability of product volumes (X’X or variance/covariance).  Which makes sense: we want to understand how variation in revenues is associated with variation in product volumes.  
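As a rough sketch of how this plays out in practice, the snippet below (Python with NumPy, using made-up volume and revenue figures purely for illustration) computes the weights m directly from the formula above, and then via NumPy’s least-squares routine:

import numpy as np

# hypothetical monthly volumes for two products (columns) over six months (rows)
X = np.array([[10, 4],
              [12, 5],
              [ 9, 7],
              [14, 6],
              [11, 8],
              [13, 5]], dtype=float)

# hypothetical monthly revenue for the same six months
y = np.array([52.0, 61.0, 55.0, 68.0, 63.0, 64.0])

# m = (X'X)^-1 X'y, the formula derived in the endnotes (no intercept, no normalisation)
m = np.linalg.inv(X.T @ X) @ X.T @ y
print(m.round(3))   # estimated revenue contribution per unit of each product

# in practice a least-squares solver is preferred over an explicit inverse
m_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(m_lstsq.round(3))   # same weights, computed more stably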

Machine learning simply extends the use of matrix algebra to explore more complex relationships between information⁴ for both discovery and prediction.

3.     Chain Rule

Chain Rule:  dz/dx = (dz/dy) × (dy/dx)

Finally, the chain rule (above).  The more complex the business and the larger the amount of information we need to deal with, the more significant this rule becomes.  The chain rule states that if z depends on y and y in turn depends on x, then we can relate z to x.  This (relatively) simple piece of maths is one of the most critical components of neural networks.  Neural networks are constructed in layers, with each layer containing a number of nodes (see diagram), and are good at capturing complex interactions between pieces of information.

The chain rule is used to adjust weights across nodes and layers within the network to optimise our objective (through a process called backpropagation).  
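As a minimal sketch of that idea (a single node with one weight rather than a full network), the Python below uses the chain rule to adjust a weight step by step; the input, target and learning rate are arbitrary illustrative values:

# a single 'node': prediction = w * x, and we minimise loss = (prediction - target)^2
x, target = 2.0, 10.0    # hypothetical input and desired output
w = 1.0                  # initial weight
learning_rate = 0.05

for step in range(20):
    pred = w * x                    # the prediction depends on w ...
    loss = (pred - target) ** 2     # ... and the loss depends on the prediction
    # chain rule: dloss/dw = dloss/dpred * dpred/dw
    dloss_dpred = 2 * (pred - target)
    dpred_dw = x
    dloss_dw = dloss_dpred * dpred_dw
    w -= learning_rate * dloss_dw   # nudge the weight to reduce the loss

print(round(w, 3), round(loss, 6))  # w converges towards 5.0 (since 5.0 * 2.0 = 10.0) and the loss towards 0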

[Diagram: a neural network, with layers of connected nodes]

We can readily relate the chain rule to business.  One way to think of nodes is as processes, systems, teams or other ways in which we turn raw information into outcomes.  Whether it’s product assessment, pricing, risk management, call centre management or marketing engagement, the relative attention and effort we afford each ‘node’ ultimately dictates the quality of the output.  As regulation, process complexity and information (data) input increase, understanding interactions across a business and attaching the right ‘weight’ to each becomes harder (but also more important).  More advanced tools than linear regression are required.

So how does Machine Learning differ?  

It doesn’t – in most cases it is simply an extension of these approaches, combined with others that we also use intuitively every day.  ML is typically engaged in four primary ways:

1)    To automate simple processes at scale.

2)    To generate scalability, consistency and efficiency in semi-complex decision making.

3)    To challenge, validate and/or support more complex decision-making processes.

4)    To support discovery (through data), underpinning innovation and transformation.

The more complex the decision space, the more that ML supports rather than supplants⁵.  In this way, appropriately applied, machine learning and statistical methods don’t add complexity; they help mitigate it.  And in most cases, ML does it using tools and approaches that we already intuitively or actively use today.

Views and opinions expressed are solely my own and do not express those of my employer. 

-----------------------------------------------------------------------------------------------------------

Endnotes:

1  We also shouldn’t lose sight of the fact that distributions aren’t always normal (bell-shaped).  To get a feel for alternative distributions, read some of the examples listed for Poisson and Exponential distributions.   
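A quick illustration of why this matters, in Python (the wait-time scenario and parameters are hypothetical): for a skewed distribution such as the exponential, the mean and the median tell quite different stories.

import numpy as np

# hypothetical customer wait times drawn from an exponential (skewed) distribution
rng = np.random.default_rng(42)
waits = rng.exponential(scale=10.0, size=100_000)   # long-run average of ~10 minutes

print(round(waits.mean(), 1))       # ~10.0, pulled up by a long tail of very slow cases
print(round(np.median(waits), 1))   # ~6.9, the 'typical' customer waits far less (10 x ln 2)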

2  By comparison, correlation is the covariance of two variables divided by the product of their standard deviations (the square roots of their respective variances), i.e. correlation is a standardised measure of covariance.

3  More specifically, we minimise the sum of squared residuals u (the Residual Sum of Squares; divided by the number of observations it becomes the Mean Squared Error).  The intuition is that otherwise large positive and negative errors could cancel each other out, giving the false impression that we have an accurate model when we don’t (the square also has some nice statistical properties).  Writing the squared error in terms of m, we then work through as follows:

i.    u’u = (y-Xm)’(y-Xm) = (y’- (Xm)’)(y-Xm)    # write the squared error term in terms of m

ii.   u’u = y’y – y’(Xm) – (Xm)’y + m’X’Xm    # expand

iii.  u’u = y’y – 2(Xm)’y + m’X’Xm    # y’(Xm) and (Xm)’y are the same scalar (y’z = z’y for vectors), so the middle terms combine

iv.   d(u’u)/dm = 2(X’Xm – X’y) = 0    # take the first derivative with respect to m and set it to zero, noting that (Xm)’ = m’X’ and d(m’) = (dm)’

v.    X’Xm = X’y    # rearrange, ready to solve for m

vi.   (X’X)⁻¹X’Xm = (X’X)⁻¹X’y    # pre-multiply both sides by (X’X)⁻¹; on the left-hand side (X’X)⁻¹X’X is the identity matrix

vii.  m = (X’X)⁻¹X’y    # the solution for m

Note: For expediency I have left off notations associated with expected and sample values and ignored dealing with normalisation of the underlying data. 

4  An important extension of this (relating to conditions of non-invertibility/singularity) is Singular Value Decomposition (SVD).  I only mention it here because, from a business perspective, what we really want to understand – particularly when we fold in more complex decision boundaries such as cost, conduct risk, technology and product development – is how we maximise demand for the minimum number of products at any point in time.  In other words, as well as understanding the product relationships, we also want to understand customer relationships (in statistics, row versus column attributes, or in SVD, U versus V).  More complex statistical methods can help us do this and are the basis of much of the ML work applied in data-driven firms today.
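As a rough sketch of the row-versus-column idea, the NumPy snippet below applies SVD to a small made-up customer-by-product revenue matrix; the numbers are purely illustrative:

import numpy as np

# hypothetical revenue matrix: rows are customers, columns are products
A = np.array([[120.,  80.,  10.],
              [115.,  85.,  12.],
              [ 20.,  25., 150.],
              [ 18.,  30., 145.]])

# SVD factorises A into customer structure (U), product structure (V) and
# singular values (S) ranking how much variation each component explains
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(S.round(1))        # a couple of dominant components capture most of the structure
print(Vt[0].round(2))    # the leading product pattern (column attributes)
print(U[:, 0].round(2))  # how strongly each customer loads on that pattern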

5  In his introductory lectures on ML, Andrew Ng provides a great overview of conditions that encourage effective ML.  This link provides a good summary of the main points.
