Lies. Lies. Damned lies and statistics

When numbers tell you the opposite of the truth (which happens more than you might think)...

Here is a quick illustration of 3 all too common mistakes in data analysis, courtesy of my weekend running problem

…here are 3 classic mistakes to make sure you avoid in one simple illustration

So, like all good weekend warriors, I spent most of yesterday crawling through my Strava data from this Sunday’s Napa Marathon (that, and shuffling like an old man and dropping to the floor with unexpected bouts of comedically painful cramp).

Taking a look at the overall field stats gives a really important lesson for all of us who use data professionally.

The Napa Marathon hosted 3 events: a full marathon (26.2 miles), a half marathon (13.1 miles) and a 5k (3.1 miles). In total across the 3 events there were 4,179 finishers.

Let’s take a look at the average field pace for each race:

  • Marathon: 9:45 / mile
  • Half Marathon: 10:18 / mile
  • 5k: 12:26 / mile

At first glance that seems counter-intuitive. The longer the race, the faster the runners went.

Now, if like me you’ve run at all, you can probably work this one out. Marathons tend to attract a highly fit field; once you enter one you train extensively. By contrast a 5k at an event like this is a casual fun-run.

If I’d concluded that longer distances make you run faster I would of course have been wrong. If I’d gone further and extrapolated that over 50 miles the average runner would travel at 6:46 / mile I’d be foolish (even though that’s what a linear regression of the results would tell me… with an 85% R² to boot).
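For anyone who wants to see how easily three data points produce a confident-looking answer, here is a minimal sketch of that regression in plain Python. The distances and paces are the field averages quoted above; the helper names (`fit_line`, `format_pace`) are mine and purely illustrative. Rounding choices mean the result may land a couple of seconds away from the figure quoted above, but the nonsense conclusion is the same.

```python
# Field-average paces from the article, as decimal minutes per mile.
distances = [26.2, 13.1, 3.1]                         # miles
paces = [9 + 45 / 60, 10 + 18 / 60, 12 + 26 / 60]     # min/mile


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


def format_pace(minutes):
    """Turn decimal minutes into an m:ss string."""
    total_seconds = round(minutes * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


slope, intercept, r_squared = fit_line(distances, paces)

# Extrapolate far outside the observed range: a 50 mile race.
pace_50 = intercept + slope * 50

print(f"Slope: {slope:.3f} min/mile per extra mile of race distance")
print(f"R squared: {r_squared:.0%}")
print(f"'Predicted' 50 mile pace: {format_pace(pace_50)} / mile")
```

The fit looks respectable on paper, which is exactly the problem: nothing in the output warns you that the three points come from three very different groups of runners.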

So what is going on?

Selection Bias: We have a classic case here. The runners in the marathon are going faster… but not because of the extra distance. The event has attracted a different group of participants. If I try to generalize from these results I get into trouble.

Segment Mix: That gets compounded as I have ignored mix effects. Men comprised 58% of the marathon field but only 35% of the 5k field, and (generally!) men run faster than women. My mix effect is compounding the selection bias here.

Over-Fitting: Finally, I have over-fitted a correlation from 3 data points; while it is easy to make the stats look compelling… it’s a quick route to some nonsense real world conclusions.
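To make the mix effect concrete, here is a rough sketch. The shares of men (58% and 35%) are the ones quoted above; the paces are HYPOTHETICAL numbers I have made up, and I have deliberately set them to be identical in both events. I am also assuming just two segments for simplicity. Any gap in the blended average is therefore pure mix, with nobody actually running faster.

```python
# Share of men in each field, as quoted in the article.
SHARE_MEN = {"marathon": 0.58, "5k": 0.35}

# HYPOTHETICAL paces in decimal minutes per mile (not from the race results).
# They are the same in both events on purpose, to isolate the mix effect.
HYPOTHETICAL_PACE = {"men": 10.0, "women": 11.0}


def blended_pace(share_men):
    """Field-average pace as a share-weighted average of the two segments."""
    share_women = 1.0 - share_men  # simplifying assumption: two segments only
    return (share_men * HYPOTHETICAL_PACE["men"]
            + share_women * HYPOTHETICAL_PACE["women"])


blended = {event: blended_pace(share) for event, share in SHARE_MEN.items()}

# Gap between the two blended averages, in seconds per mile.
mix_gap_seconds = (blended["5k"] - blended["marathon"]) * 60

for event, pace in blended.items():
    print(f"{event}: blended average {pace:.2f} min/mile")
print(f"Gap explained purely by mix: {mix_gap_seconds:.1f} seconds per mile")
```

With these made-up inputs the 5k field looks slower on average even though every segment runs at exactly the same pace in both events. That is why it pays to compare like with like before reading anything into a difference between two averages.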

So what? What has all this got to do with selling wine?

Whilst the example above might sound silly, this is exactly the type of issue with data I’ve seen time and again in my career.

It is especially a risk in product development and design of user funnels and journeys.

To take a real example from Naked Wines: we nearly made a major design error because team members were looking at behavior from customers who had opted in to membership after buying something other than our standard “introductory offers”.

A great real world example of selection bias. Beware generalizing from the 5-10% of your consumers who avoid the standard “happy path”. Luckily some sharp guys in our data team spotted the issue in time, but we could easily have invested material time and energy in error.

How can you arm yourself to avoid these mistakes?

Here are 3 of my top tips to avoid these types of mistakes in your business.

  1. Be Hypothesis Led. Start your experiments with a clear view of what you expect to happen. Challenge unexpected results and beware developing new “explanations” for the results after the fact.
  2. Beware of “the average”. Break your data down into segments. How do they each behave? Beware of the average customer -> they seldom exist!
  3. Look for the “Outliers”. Who are the equivalent of the dedicated marathoners in your database? They likely behave differently to most of your consumers (“anyone fancy a 10 mile weekday evening easy run?”). Support quant analysis with qual insights and session tracking to understand different user paths within your data.


PS - Obviously what you really want to know: how did the marathon go? Somehow I managed to stay dry and ran a 3hr 34, for a 15 minute PR -> I think it was all down to the cheer squad!

Pierre Hyde

Founder at Northwest Strategy Associates: Consulting | Strategy | Business Planning | Transformation | Insight | Pricing | Proposition | Customer | Commercial | Marketing

12 months ago

Good post Nick, it's definitely not just wine! As customer data proliferates all across retail / D2C there is enormous temptation for data-semi-literate managers to assert that "the answer is in the data" and equip huge analytics teams with resources and power to execute substantial strategic changes on this basis. Often the decisions are wrong and it's commonly because they fall short on 2 or even 3 of the simple but powerful rules you have posted. And it's happening again and again, in bigger and better equipped companies than everybody thinks...

James Taylor

Co-Founder at Beer52.com & Wine52.com | B-Corp

12 months ago

Congrats Nick. Yes, bias is such a huge factor. In smaller teams it's very hard to wind back things once they've been implemented and don't work, especially with confirmation bias in the mix from those who designed it. Why it's important to test fast and cheap first, then build things after the data has been properly analysed. We've all been guilty of it.


Impressive marathon achievement, Nick! Your insights on data pitfalls are a great reminder of the importance of critical analysis in DtC strategies.

