Three key fails in Machine Learning

I kind of had a backwards introduction to analytics. My first heavy involvement in any sort of dedicated analytics project was a Machine Learning project. I mean if you are going to do it, why not go the whole hog?

Not that I am complaining. It was an amazing learning experience. It taught me a great deal about technical approaches to advanced analytics. I learned about technology and data management. But most importantly I learned about how a myth has developed around the term Machine Learning when, in fact, there is nothing mythical about it at all. I remember having team meetings where uninitiated participants would describe what we were doing as 'the dark arts'.

The reality is that there is nothing dark, mysterious or mythical about Machine Learning. Most statistical methods employed in Machine Learning approaches have been known for decades, or even centuries in the case of Bayesian approaches. The explosion of the term Machine Learning is all about the technology and how it enables us to apply these approaches to large datasets in ways that computing resources previously rendered challenging or downright impossible.

Nevertheless the mythology persists. There are plenty out there who believe, or are led to believe, that a Machine Learning project can perform some sort of ultra-modern magic that will defy all human approaches to the same problem. This is dangerous, because it can mean that individuals or teams embark on efforts which take up a lot of resource and time based on a belief that some sort of magic will occur, and without appropriate critical thought and human judgment.

Before I go on I want to clarify: I am not criticising Machine Learning per se. There are countless use cases where it brings value and efficiency and our lives today would not be the same if it were not for breakthroughs facilitated by Machine Learning. No, my point is that we should not believe that Machine Learning works in all situations, and we should be more circumspect about how and when we invest in these techniques.

To illustrate my point, here are my top three key fails that I have witnessed in Machine Learning projects.

1. Poor objective setting

As I argued in another recent piece, it is essential to clarify the purpose of a Machine Learning project from the outset. Either you are building your model to explain something or you are building it to predict something. Most of the time, a model that explains a phenomenon well is not the best one for predicting it. Conversely, models that predict well often lean heavily on a few obvious features that carry most of the predictive power, which leaves them with little explanatory value.

It's critical that the purpose of a Machine Learning effort is clarified and agreed by all parties: we are building a model primarily to explain, or primarily to predict. It cannot be primarily both, and there should be no doubt about which it is.
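
To make the distinction concrete, here is a minimal sketch in Python of the two framings side by side, using made-up data and hypothetical column names (rep_tenure, discount, visits, deal_won). It assumes statsmodels and scikit-learn are available, and it is an illustration rather than a recipe: the explanatory framing fits an interpretable model whose coefficients you can discuss, while the predictive framing judges a flexible learner purely on out-of-sample performance.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data standing in for a real sales dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "rep_tenure": rng.normal(5, 2, n),
    "discount": rng.uniform(0, 0.3, n),
    "visits": rng.poisson(3, n),
})
df["deal_won"] = (rng.random(n) < 0.2 + 0.1 * (df["visits"] > 3)).astype(int)

X = df[["rep_tenure", "discount", "visits"]]
y = df["deal_won"]

# Explanatory framing: an interpretable model whose coefficients and
# p-values you can put in front of stakeholders.
explain_model = sm.Logit(y, sm.add_constant(X)).fit()
print(explain_model.summary())

# Predictive framing: a flexible learner judged purely on how well it
# predicts cases it has not seen.
predict_scores = cross_val_score(GradientBoostingClassifier(), X, y,
                                 cv=5, scoring="roc_auc")
print(predict_scores.mean())
```

The tool you reach for, and the yardstick you judge it by, both follow from the purpose you agreed up front.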

2. Poor experimental design

Imagine you are working for a sales company and you want to build a model to explain what drives successful rep sales. One of the things you already know and have known forever is that reps make more successful sales to existing customers than to new customers.

You gather all the data you can find, run your learning and then announce at a big meeting that the top three explanatory drivers of sales are:

  • Whether the customer has bought before
  • Whether this is a customer the rep has visited before
  • Whether the customer rates the rep highly in feedback surveys

It's patently obvious that all of these drivers relate to a factor we already knew was important, so the effort has added no value. Worse, by including this data the model is now mathematically dominated by that known factor. This could have been avoided if someone had considered how to design the effort up front: we could have removed this data, or restricted the sample to customers outside this group.
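
One way to build that thinking in up front, sketched below in Python with a toy dataset and hypothetical column names, is to either drop the factor you already understand from the feature set or restrict the sample so it cannot dominate the model; which option is right depends on the question you actually want answered.

```python
import pandas as pd

# Toy rep-sales data; the column names are illustrative only.
sales_df = pd.DataFrame({
    "sale_made":          [1, 0, 1, 1, 0, 1],
    "has_bought_before":  [1, 0, 1, 1, 0, 0],
    "rep_visited_before": [1, 0, 1, 0, 0, 1],
    "rep_rating":         [4.5, 3.1, 4.8, 4.0, 2.9, 3.7],
    "region":             ["N", "S", "N", "E", "S", "W"],
})

known_drivers = ["has_bought_before", "rep_visited_before"]

# Option 1: take the factor we already understand out of the feature set,
# so the learning has to look for new signal.
features = sales_df.drop(columns=known_drivers + ["sale_made"])

# Option 2: restrict the sample to new customers, removing the dominant
# existing-customer effect from the problem altogether.
new_customer_sales = sales_df[sales_df["has_bought_before"] == 0]
```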

3. Poor practical planning

Whether you are embarking on Machine Learning for explanatory or predictive purposes, few people stop to think about the consequences of success.

If you build a model that can help diagnose the reasons for absenteeism in the workforce, or one that can predict manufacturing problems, you need to be able to deploy it practically. This is when you find out that some of the data sources used in the model were extracted from files requiring massive manual manipulation, or that some of the inputs were imputed to fill in missing data and have no equivalent at prediction time.

The point is that if you are developing ML in the hope that it is deployed in the future to help diagnose or predict things more efficiently, you need to be sure that the input data can flow into the prediction engine easily. I have seen so many ML efforts that use data that has no chance of being easily engineered, and this creates a whole new headache which could have been better anticipated.
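
One practical safeguard, sketched below with scikit-learn and toy data (the features and settings are illustrative assumptions, not a prescription), is to bundle every preprocessing step, including any imputation of missing values, into the same pipeline as the model, so that whatever happened to the training data happens automatically to the data arriving at prediction time.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# The imputation used in training is carried into production as part of the
# model artefact, rather than living in someone's manually edited spreadsheet.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# Toy training data with missing values, as an automated feed might deliver it.
X_train = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [3.0, 5.0]])
y_train = np.array([0, 1, 0, 1])
pipeline.fit(X_train, y_train)

# At deployment, new records flow straight through the same steps.
print(pipeline.predict(np.array([[2.5, np.nan]])))
```

This doesn't solve the problem of sources that need heavy manual manipulation, but it does force you to confront, during development, exactly what the prediction engine will need to receive.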


While Machine Learning offers enormous potential for how we understand data, we are still nowhere near a point where successful learning is guaranteed for any dataset. In fact, without strong design and planning, and without a good instinct for the structure of the data, a Machine Learning project can end up a gigantic waste of time and effort. By checking the objectives, the experimental design and the practical planning, you'll get a good sense of whether it's worth it.

I lead McKinsey's internal People Analytics and Measurement function. Originally I was a Pure Mathematician, then I became a Psychometrician. I am passionate about applying the rigor of both those disciplines to complex people questions. I'm also a coding geek and a massive fan of Japanese RPGs. You can also follow me on Twitter at @dr_keithmcnulty.



Andrew Marritt

CEO of OrganizationView - AI powered Employee Voice & People Analytics

5y

Great article Keith. As you say, there is nothing magical about ML, though sometimes I'm amazed by its ability - I look at a prediction or classification either in awe at its sophistication or in frustration that it missed something that would have been obvious to a 4 year old. Often the same emotions with the same model in different instances. I do believe it's important that the analyst has a decent knowledge of what the chosen algorithm is doing under the covers. We regularly have conversations about how to adapt what we're doing that start from a discussion of what the algorithm is actually doing. This is a bit like a racing driver knowing enough about the mechanics of their vehicle to optimise its performance through their driving style. The second part is developing a love for your data. This goes much further than just looking at model performance; it means trying to understand why the model is performing the way it is by questioning the results. Are there any artefacts in the training data (perhaps caused by the process of collecting it) that the model is learning from? With our text work we've seen instances where punctuation, which we typically don't strip because it's useful for grammatical understanding in one part of the process, suddenly starts to produce misclassification errors in another.

Junaid Sahibzada

Associate Director CIO & Cloud Advisory, KPMG

5y

Very helpful and insightful. It echoes what an instructor from Amazon mentioned recently during a training session. He explained very eloquently when to use ML and when not to: if we know that our input is 2+2 and we also know that the output for such an input is 4, then we can solve this problem in two ways. i) The Machine Learning way: train and feed an ML model to calculate 2+2 across all the unlimited permutations and combinations in which 2+2 can appear, which would require gathering enough data from the entire universe to help the model identify every possibility. Or ii) simply write one line of code to calculate 2+2. In short, if the inputs and outputs are known and form a finite set, then using ML is a waste. In contrast, if the input is not known, e.g. human speech or free text, then it makes sense to use ML. Thanks for demystifying and de-hyping the concepts around AI/ML.
