Five common analysis fails

Over the years I've spent working professionally in analytics, I've seen some common howlers. I can't necessarily blame the people who committed them, because they were never taught that these were bad things to do.

Statistics classes at college are still overly theoretical. It's not as bad as it used to be, now that data science has encouraged more learning based on real data and case studies. But there are still too many formulas and not enough practical advice about the things that are likely to happen once you enter the real world. After all, isn't that what statistics is supposed to be about?

If I were teaching a statistics class in college, I would probably call it ‘Statistics for the Real World 101’. Here are a few things I’d fail people for.

1. Averaging averages

This is one I see so often. Someone has calculated an average metric for a whole bunch of subgroups, and then wants to give the average metric for the entire population. So they just average the averages. This is almost always the wrong thing to do.

Unless the data for every subgroup is commensurable, of roughly the same size and similarly representative (which is basically never the case), averaging the averages will artificially inflate or deflate the genuine metric for the entire population. Here's a simple example of what happens if you try it on World Bank data on women's representation in the workforce; it makes the figure look a lot higher than it really is:

[Chart: average of country-level averages vs. the true global figure for women's workforce participation]
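To make this concrete, here is a minimal sketch with made-up numbers (not the actual World Bank figures): a straight average of country-level rates versus the rate weighted by the size of each country's labour force.

```python
import pandas as pd

# Illustrative, made-up numbers: two small countries with high female
# participation and one very large country with low participation.
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "labour_force": [2_000_000, 3_000_000, 400_000_000],
    "female_share": [0.48, 0.47, 0.25],
})

# Averaging the averages treats every country as equally important
naive = df["female_share"].mean()

# The genuine population-level metric weights each country by its size
weighted = (df["female_share"] * df["labour_force"]).sum() / df["labour_force"].sum()

print(f"average of averages: {naive:.1%}")    # 40.0%
print(f"weighted average:    {weighted:.1%}")  # 25.3%
```

The weighted version is just the overall total divided by the overall count, which is what the metric actually means for the whole population.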

2. Ignoring range restriction

This one pops up a lot if you work in any environment where you have to analyze a process in which data points drop out over time, such as a selection process. A common situation is where people want to know if information from earlier in the process can predict something later in the process. Maybe, for example, you want to correlate interview ratings with subsequent job performance.

I often see people ignore the fact that the data points later in the process are a subset of the data points earlier in the process. Data points have dropped out because of the selection in between. Often they conclude that the correlation is low or zero and use that as a basis to denigrate the earlier stages of the process as being not predictive of later stages.

This can be a major problem, especially if the process is highly selective, with only a small fraction of those at the early stages making it to the later stages. Often the scores of the data points that made it through are high and compressed, because data points with lower scores didn't make the cut.

There are ways of correcting a correlation for range restriction. Here is a commonly used formula:

[Formula image not available]
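The original formula image hasn't survived. A commonly used correction for direct range restriction is Thorndike's Case II formula; assuming that is the one referenced here, this is a minimal sketch of it (the example numbers are hypothetical):

```python
import math

def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    # Thorndike Case II correction for direct range restriction.
    # r_restricted: correlation observed in the restricted (selected) sample.
    # sd_unrestricted, sd_restricted: standard deviations of the predictor in
    # the full pool and in the selected sample respectively.
    u = sd_unrestricted / sd_restricted  # > 1 when the range has been restricted
    return (r_restricted * u) / math.sqrt(1 - r_restricted ** 2 + (r_restricted * u) ** 2)

# Hypothetical example: an observed correlation of 0.20, with the predictor's
# spread halved by selection, corrects to roughly 0.38.
print(round(correct_for_range_restriction(0.20, 2.0, 1.0), 2))
```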

But I want to be clear: corrections like this become unreliable if the restriction is substantial. In those situations, if the correlation analysis does not reveal anything notable, I simply state that we cannot conclude anything because of range restriction.

3. Using linear regression on a binary outcome

I think when people do this, they either forget their statistics lessons completely or they just slept through them.

Linear regression is a pretty simple process for predicting a variable measured on a continuous numerical scale, like the price of a car. An intercept and coefficients are estimated and applied directly to new inputs to produce a predicted value. Model fit is easy to assess using sums of squares (an extension of Pythagoras' Theorem for calculating distance).

Trying to use this method on binary outcomes is a really bad idea. Most of the underlying assumptions in linear regression about variance and residual error are violated, and the output is not designed to predict a simple binary outcome. It’s madness, and an indicator that the person who is doing it is not particularly well-trained in statistics, or scared of logistic regression, or something!

Some people try a Linear Probability Model as a way of making linear regression methods work with binary outcome data. I am not at all convinced by this, and it doesn't get around the fact that predicted probabilities outside the [0, 1] range can still occur.
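To show the contrast, here is a minimal sketch using simulated data and scikit-learn: the linear model happily produces "probabilities" outside [0, 1], while logistic regression stays on the probability scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)

# Simulated data: one continuous predictor and a binary outcome
x = rng.normal(size=(500, 1))
y = (rng.random(500) < 1 / (1 + np.exp(-3 * x[:, 0]))).astype(int)

new_x = np.array([[-3.0], [0.0], [3.0]])

# Linear regression returns predicted values well below 0 and above 1
lin = LinearRegression().fit(x, y)
print(lin.predict(new_x))

# Logistic regression keeps predictions between 0 and 1
logit = LogisticRegression().fit(x, y)
print(logit.predict_proba(new_x)[:, 1])
```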

4. Putting all your eggs in the p-basket (or having no p-basket at all)

Awareness of p-hacking has grown considerably in the statistics and data science community in recent years. People are becoming less and less comfortable with the idea that a cold, hard significance line is the sole determinant of whether something is deemed worthy of communication as an analytical insight.

I often see two extremes of this problem. At one extreme, p-values are ignored completely, so a pattern with a p-value of 0.5 is raised as an insight. At the other, there is too much dependence on the p < 0.05 boundary.

This is where common sense goes out of the window. Intuitively, whether a pattern in data is notable depends on the size of the effect and whether that effect could be considered 'unusual'. This calls for some judgment from the statistician:

  • If the data is really big, even a minuscule effect can pass the p < 0.05 condition. The important thing is that the effect is minuscule (see the sketch after this list).
  • If the data is not so big, but the effect seems to be, then p < 0.05 should not be the sole consideration in whether or not the insight is notable.
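A quick illustration of the first point, as a minimal sketch with simulated data: a difference of 0.02 standard deviations between two groups is practically meaningless, yet with a million observations per group it sails past p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups whose true means differ by a trivial 0.02 standard deviations
n = 1_000_000
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.02, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"difference in means: {group_b.mean() - group_a.mean():.3f}")  # tiny effect
print(f"p-value: {p_value:.1e}")  # far below 0.05 purely because n is huge
```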

5. Using bad language

I don’t mean swearing (although I’ve done plenty of that with some of the datasets I’ve had to deal with). I mean not writing your insights and conclusions accurately.

Language is so important in helping others understand what they can conclude from the analysis. I often see poorly crafted language which can lead people to the wrong conclusions, for example suggesting that a causative relationship exists when there is no such evidence, or not appropriately qualifying conclusions.

For example, look at this correlation matrix of the James Bond movies:

[Image: correlation matrix for the James Bond movies]

It's so tempting to say that Bond's drinking and killing are what increase the movie budgets, but that assumes a causality which we haven't proved. More than likely, longer movies involve more drinking and killing and also cost more to make. But we can't say this conclusively either. The language I would use here is simple: 'Movie budgets correlate primarily with the number of Martinis drunk and the number of people Bond kills.' That's still a pretty entertaining conclusion!

I lead McKinsey's People Analytics and Measurement function. Originally I was a Pure Mathematician, then I became a Psychometrician. I am passionate about applying the rigor of both those disciplines to complex people questions. I'm also a coding geek and a massive fan of Japanese RPGs.

All opinions expressed are my own and not to be associated with my employer or any other organization I am connected with.

Grzegorz Rajca

Consultant | I-O Psych | Workplace Strategy | Change Management

Great post. Range restriction is a particularly nasty problem, as many surveys used in HR have short response scales (1-5) and it’s quite easy to get data which is very restricted. I have a question concerning averaging averages. Let’s assume that we conducted a survey in an organization and then calculated averages for each department. Some departments consist of 20 employees, some consist of 150 employees. Let’s assume that we want to show the score for each of the departments, in comparison to other departments. We could treat each department as a single entity, normalize the average scores and then see how each of the departments deviates from the average. This seems like a good method of correcting for the size of the department, although it does require averaging averages. I’d be very happy to hear your thoughts on this.

Malcolm Earp

Group Chief Commercial & Operating Officer at The Ultimate Battery Company

A really good post, well worth reading.
