
The Trouble (My Troubles) with Statistics

Okay. I admit it. That's a linkbait-y title. In my defense, though, the only audiences that would be successfully baited by it, I think, are digital analysts, statisticians, and data scientists. And, that's who I'm targeting, albeit for different reasons:

  • Digital analysts -- if you're reading this then, hopefully, it may help you get over an initial hump on the topic that I've been struggling mightily to clear myself.
  • Statisticians and data scientists -- if you're reading this, then, hopefully, it will help you understand why you often run into blank stares when trying to explain a t-test to a digital analyst.

If you are comfortably bridging both worlds, then you are a rare bird, and I beg you to weigh in in the comments as to whether what I describe rings true.

The Premise

I took a college-level class in statistics in 2001 and another one in 2010. Neither class was particularly difficult. They both covered similar ground. And, yet, I wasn't able to apply a lick of content from either one to my work as a web/digital analyst.

Since early last year, as I've been learning R, I've also been trying to "become more data science-y," and that's involved taking another run at the world of statistics. That. Has. Been. HARD!

From many, many discussions with others in the field -- on both the digital analytics side of things and the more data science and statistics side of things -- I think I've started to identify why and where it's easy to get tripped up. This post is an enumeration of those items!

(As an aside, my eldest child, when applying for college, was told that the fact that he "didn't take any math" his junior year in high school might raise a small red flag in the admissions department of the engineering school he'd applied to. He'd taken statistics that year (because the differential equations class he'd intended to take had fallen through). THAT was the first time I learned that, in most circles, statistics is not considered "math." See how little I knew?!)

Terminology: Dimensions and Metrics? Meet Variables!

Historically, web analysts have lived in a world of dimensions. We combine multiple dimensions (channel + device type, for instance) and then put one or more metrics against those dimensions (visits, page views, orders, revenue, etc.).

Statistical methods, on the other hand, work with "variables." What is a variable? I'm not being facetious. It turns out it can be a bit of a mind-bender if you come at it from a web analytics perspective:

  • Is device type a variable?
  • Or, is the number of visits by device type a variable?
  • OR, is the number of visits from mobile devices a variable?

The answer... is "Yes." Depending on what question you are asking and what statistical method is being applied, defining what your variable(s) are, well, varies. Statisticians think of variables as having different types of scales: nominal, ordinal, interval, or ratio. And, in a related way, they think of data as being either "metric data" or "nonmetric data." There's a good write-up on the different types -- with a digital analytics slant -- in this post on dartistics.com.
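To make that a bit more concrete, here's a minimal R sketch (entirely made-up data) showing how each of the three framings above can legitimately be "the variable":

```r
# Hypothetical data: one row per visit
visits <- data.frame(
  device = factor(c("mobile", "desktop", "mobile", "tablet", "desktop")),
  pages_viewed = c(3, 7, 1, 4, 6)  # a ratio-scale ("metric") variable
)

levels(visits$device)           # "device type" as a nominal variable
table(visits$device)            # "visits by device type" as aggregated counts
sum(visits$device == "mobile")  # "visits from mobile devices" as a single number
```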

It may seem like semantic navel-gazing, but it really isn't: different statistical methods work with specific types of variables, so data has to be transformed appropriately before statistical operations are performed. Some day, I'll write that magical post that provides a perfect link between these two fundamentally different lenses through which we think about our data... but today is not that day.

Atomic Data vs. Aggregated Counts

In R, when using ggplot to create a bar chart from underlying data that looks similar to how data would look in Excel (one row per bar, with the totals already computed), I have to include the argument stat="identity". As it turns out, that requirement is a symptom of the next mental jump required to move from the world of digital analytics to the world of statistics.

To illustrate, let's think about how we view traffic by channel:

  • In web analytics, we think: "this is how many (a count) visitors to the site came from each of: referring sites, paid search, organic search, etc."
  • In statistics, typically, the framing would be: "here is a row for each visitor to the site, and each visitor is identified as having come from referring sites, paid search, organic search, etc." (or, possibly, "each visitor is flagged yes/no for each of: referring sites, paid search, organic search, etc."... but that's back to the discussion of "variables" covered above).

So, in my bar chart example above, R defaults to thinking that it's making a bar chart out of a sea of data, where it's aggregating a bunch of atomic observations into a summarized set of bars. The stat="identity" argument has to be included to tell R, "No, no. Not this time. I've already counted up the totals for you. I'm telling you the height of each bar with the data I'm sending you!"
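A minimal sketch of both behaviors, with made-up data:

```r
library(ggplot2)

# Atomic data: one row per visit -- ggplot counts the rows itself
atomic <- data.frame(
  channel = c("organic", "organic", "paid", "referral", "paid", "organic")
)
ggplot(atomic, aes(x = channel)) + geom_bar()  # default stat: count the rows

# Aggregated data (the "Excel-like" shape): one row per channel,
# with the totals already computed
aggregated <- data.frame(
  channel = c("organic", "paid", "referral"),
  visits  = c(3, 2, 1)
)
ggplot(aggregated, aes(x = channel, y = visits)) + geom_bar(stat = "identity")
```

(Newer versions of ggplot2 also provide geom_col() as a shorthand for geom_bar(stat = "identity").)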

When researching statistical methods, this comes up time and time again: statistical techniques often expect a data set to be a collection of atomic observations. Web analysts typically work with aggregated counts. Two things to call out on this front:

  • There are statistical methods that work with aggregated counts -- a cross tabulation with a chi-square test for independence is one good example (see the sketch after this list). I realize that. But, there are many more that actually expect greater fidelity in the data.
  • Both Adobe Analytics (via data feeds and, to a clunkier extent, Data Warehouse) and Google Analytics (via the GA360 integration with Google BigQuery) offer much more atomic data than they historically provided through their primary interfaces; this is one reason data scientists are starting to dig into digital analytics data more!
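Here's the sketch promised in the first bullet: a chi-square test of independence running happily on nothing but aggregated counts (made-up numbers):

```r
# Hypothetical cross tabulation: visits by channel and device type
counts <- matrix(
  c(520, 480,   # organic:  mobile, desktop
    310, 390,   # paid:     mobile, desktop
    150, 250),  # referral: mobile, desktop
  nrow = 3, byrow = TRUE,
  dimnames = list(channel = c("organic", "paid", "referral"),
                  device  = c("mobile", "desktop"))
)

chisq.test(counts)  # is the device mix independent of the channel?
```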

The big "Aha!" for me in this area is that we often want to introduce pseudo-granularity into our data. For instance, if we look at orders by channel for the last quarter, we may have 8-10 rows of data. But, if we pull orders by day for the last quarter, we have a much larger set of data. And, by introducing granularity, we can start looking at the variability of orders within each channel. That is useful! When performing a one-way ANOVA, for instance, we need to compare the variability within channels to the variability across channels to draw conclusions about where the "real" differences are.
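A minimal sketch of that idea, with simulated daily orders for three channels over a 90-day quarter (made-up numbers, and glossing over ANOVA's assumptions):

```r
set.seed(42)

# Hypothetical orders by channel by day for a 90-day quarter
orders_by_day <- data.frame(
  channel = rep(c("organic", "paid", "referral"), each = 90),
  orders  = c(rpois(90, lambda = 40),   # organic: ~40 orders/day
              rpois(90, lambda = 42),   # paid: ~42 orders/day
              rpois(90, lambda = 25))   # referral: ~25 orders/day
)

# One-way ANOVA: is the variability across channels large relative
# to the day-to-day variability within each channel?
summary(aov(orders ~ channel, data = orders_by_day))
```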

This actually starts to get a bit messy. We can't just add dimensions to our data willy-nilly to artificially introduce granularity. That can be dangerous! But, in the absence of truly atomic data, some degree of added dimensionality is required to apply some types of statistical methods. <sigh>

Samples vs. Populations

The first definition for "statistics" I get from Google (emphasis added) is:

"the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample."

Web analysts often work with "the whole" -- unless, that is, we consider historical data to be the sample and "the whole" to include future traffic as well. But, if we view the world that way -- using time to determine our "sample" -- then we're not exactly getting a random (independent) sample!

We've also been conditioned to believe that sampling is bad! For years, Adobe/Omniture was able to beat up on Google Analytics because of GA's "sampled data" conditions. And, Google has made any number of changes and product offerings (GA Premium -> GA 360) to allow their customers to avoid sampling. So, Google, too, has conditioned us to treat the word "sampled" as having a negative connotation.

To be clear: GA's sampling is an issue. But, it turns out that working with "the entire population" with statistics can be an issue, too. If you've ever heard of the dangers of "overfitting the model," or if you've heard, "if you have enough traffic, you'll always find statistical significance," then you're at least vaguely aware of this!
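Here's a contrived R sketch of the "enough traffic" effect: the same tiny conversion-rate difference is nowhere near significant at one sample size and emphatically "significant" at another:

```r
# A 2.05% vs. a 2.00% conversion rate -- the same 0.05-point difference
prop.test(x = c(205, 200), n = c(10000, 10000))$p.value    # large p: no "signal"
prop.test(x = c(205000, 200000), n = c(1e7, 1e7))$p.value  # tiny p: "significant!"
```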

So, on the one hand, we tend to drool over how much data we have (thank you, digital!). But, as web analysts, we're conditioned to think, "always use all the data!" Statisticians, when presented with a sufficiently large data set, like to pull a sample of that data, build a model, and then test the model against another sample of the data. As far as I know, neither Adobe nor Google has an "Export a sample of the data" option available natively. And, frankly, I have yet to come across a data scientist working with digital analytics data who is doing this, either. But, several people have acknowledged it is something that should be done in some cases.
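To be clear about what that would even look like: once session-level data is exported (from BigQuery, a data feed, etc.), the split itself is trivial. A minimal sketch, where sessions is a hypothetical data frame with one row per session:

```r
set.seed(2017)

# Hypothetical session-level export: one row per session
sessions <- data.frame(id = 1:10000, converted = rbinom(10000, 1, 0.03))

# Hold 30% of the sessions out for testing
train_rows <- sample(nrow(sessions), size = 0.7 * nrow(sessions))
train <- sessions[train_rows, ]
test  <- sessions[-train_rows, ]

# Build the model on `train`; assess it against `test`
```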

I think this is going to have to get addressed at some point. Maybe it already has been, and I just haven't crossed paths with the folks who have done it!

Decision Under Uncertainty

I've saved the messiest (I think) for last. Everything on my list to this point has been, to some extent, mechanical. We should be able to just "figure it out" -- make a few cheat sheets, draw a few diagrams, reach a conclusion, and be done with it.

But, this one... is different. This is an issue of fundamental understanding -- a fundamental perspective on both data and the role of the analyst.

Several statistically-savvy analysts I have chatted with have said something along the lines of, "You know, really, to 'get' statistics, you have to start with probability theory." One published illustration of this stance can be found in The Cartoon Guide to Statistics, which devotes an early chapter to the subject. It actually goes all the way back to the 1600s and an exchange between Blaise Pascal and Pierre de Fermat and proceeds to walk through a dice-throwing example of probability theory. Alas! This is where the book lost me (although I still have it and may give it another go).

Possibly related -- although quite different -- is something that Matt Gershoff of Conductrics and I have chatted about on multiple occasions across multiple continents. Matt posits that one of the biggest challenges he sees traditional digital analysts facing when they try to dive into a more statistically-oriented mindset is understanding the scope (and limits!) of their role. As he put it to me once in a series of direct messages, it really boils down to:

  1. It's about decision-making under uncertainty
  2. It's about assessing how much uncertainty is reduced with additional data (see the sketch after this list)
  3. It must consider, "What is the value in that reduction of uncertainty?"
  4. And it must consider, "Is that value greater than the cost of the data/time/opportunity costs?"
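As a crude, made-up illustration of the second point: the width of a 95% confidence interval around a ~2% conversion rate shrinks as data accumulates, but with diminishing returns -- which is exactly why the third and fourth points matter:

```r
# Approximate width of a 95% CI for a 2% conversion rate at various sample sizes
ci_width <- function(n, p = 0.02) 2 * 1.96 * sqrt(p * (1 - p) / n)

sapply(c(1000, 10000, 100000), ci_width)
# Each 10x increase in data narrows the interval by only ~sqrt(10) (~3.2x)
```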

The list looks pretty simple, but I think there is a deeper mindset/mentality-shift that it points to. And, it gets to a related challenge: even if the digital analyst views her role through this lens, do her stakeholders think this way? Methinks...almost certainly not! So, it opens up a whole new world of communication/education/relationship-management between the analyst and stakeholders!

For this area, I'll just leave it at, "There are some deeper fundamentals that are either critical to understand up front or can be kicked down the road a bit." I don't know which it is!

What Do You Think?

It's taken me over a year to slowly recognize that this list exists. Hopefully, whether you're a digital analyst dipping your toe more deeply into statistics or a data scientist who is wondering why you garner blank stares from your digital analytics colleagues, there is a point or two in this post that made you think, "Ohhhhh! Yeah. THAT's where the confusion is."

If you've been trying to bridge this divide in some way yourself, I'd love to hear what of this post resonates, what doesn't, and, perhaps, what's missing!

Steve Jackson

CDO and founder | Building a travel experience platform

7y

I think that narrowing down why a digital analyst should bother to use statistics is an important point. When we do analytics, it's usually to try to find out something; statistics will either help with that or it won't. We've used statistics to determine optimal ranges of traffic (standard deviations) for different sources, to understand why and, most importantly, when a real trend is occurring -- the point being that when we see a positive or negative trend, we can do something about it. We've also applied statistical analysis to paid media spend over a year to determine optimal limits across PPC and CPM, because (shock/horror!) vendors like Google may be a little self-serving when it comes to how their algorithms spend your media budgets. And we've used statistics to predict outcomes and probabilities around conversion: if we spend x, what is the probability of y, using techniques like Monte Carlo simulation? I think learning statistics is a deep rabbit hole, as your article brilliantly outlines, but once you start identifying what we can use statistics for in practical ways, the depth of learning required for each discipline becomes less gargantuan in nature.

Robert Petković

Analytics consultant | Translating charts into sentences

7y

Actually, I think the fact that I learned statistics prior to web analytics helps me understand big data, machine learning, R (which I connect with "r" for correlation, a statistic I wrote a calculation program for on my C-64 as a freshman), and similar concepts much quicker than technicians do, or at least find the proper usage for their outputs fast and easily. In fact, a psychometrics professor of mine (who's now a big shot in the U.S. -- you should invite him onto your show) told me recently that I've found a Holy Grail of psychology: I have thousands of subjects who leave loads of data with me every day, and they are willing to participate in my research :) The point is, you SHOULD invest your time in learning some statistical methods more complex than the chi-square or t-test (MANOVA is a good start), but not necessarily through learning R (because then you're just one of the developers) -- rather, through learning the concepts of those methods and why/when to use them. With your background, you might become a stunning analyst/"insightder." I am looking forward to our future statistical discussions in Hungary ;)

Robert Petković

Analytics consultant | Translating charts into sentences

7y

Huh, you didn't figure out the p-value yet, did you? :P Having struggled in the world of web analytics for almost two decades with developers, and lately with economists, marketeers, and "data scientists" (whatever that means), I am very glad that I studied psychology before I started developing websites. Why? Because I studied the somewhat difficult "Statistics for Social Sciences," followed by "Psychometrics," where we learned a lot of statistical methods and when to use them. Most of all, we learned that these are not just figures and charts but actual human behaviour represented by numbers, and that if we choose the wrong calculation/method, we might screw up somebody's future or job, or send them to the hospital by mistake. When I entered the world of web analytics, it was odd for me that no one wanted to know whether some assumptions were statistically significant (hence the p-value); they just wanted to know if there was a trend that could earn them a few extra bucks. In terms of Google Analytics, that's like using the whole product and your knowledge for counting users and engagement. Still, I managed to find my way through those dimensions and metrics, and I think I'm pretty good at it now. Actually, I think... (to be continued)

Gene McKenna

VP Product Marketing and Growth

7y

Great article, Tim. I think you have a book to write here. I'll buy a copy when it's ready.

Yogita W.

Helping Businesses Maximize ROI from MarTech | CDP & Data Strategy | Personalization | Marketing Transformation Leader | Data Storytelling | Design Thinking

7y

The timing of your post couldn't have been better. Much like you, I learned some statistics while in school in the early 1990s, then in a course in the late 1990s, followed by a week-long quant course in 2010. These days, I am learning statistics again, as I have worked with data scientists in the past and now want to be more data science-y myself. It's early days, and it is hard, especially trying to think of use cases for how I could apply a certain concept in the digital analytics world. The hard work of confidence levels, etc., seems to be done by the optimization tools, leaving digital analysts relieved -- which is not entirely good. Any books you can recommend? It's a pity I am not in the same geo as you, else I would have taken your R class. Meanwhile, next on my list is your site on R and stats -- still a few days away, though, as I am studying a couple of other topics.
