Means Lie more than Medians
There is a fine line between the numerator and the denominator.

HEADLINE: The average number of testicles for the population of the Grand Duchy of Luxembourg is 1. OMG!!! Cue C.P.O.! (Concern -> panic -> outrage.) But even if this disturbing stat didn’t seem hilarious about .0001 seconds after reading it, it surely would as soon as the average were broken down into its component groups. Unfortunately, many averages are used just because they make for screaming headlines and not because of their utility or even logic. I had always loved the quick mean calc until, in late 2002, one bit me in the bottom. In late 2019, I had a relapse. This is my story. (I can just now, finally, write about it. Sorry...)

For the second time in my career of analyzing traffic, I have fallen victim to things not working as I had expected because the distribution of said traffic was not normal but rather hugely skewed. In a “normal” distribution, roughly 68% of the population falls within a single standard deviation of the mean, 95% within 2 and 99.7% within 3.
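One quick way to put the 68/95/99.7 rule to work is to measure those three shares on your own data and see how far off they are. The sketch below does that on two synthetic samples; the parameters (a bell curve centered on 170 with a spread of 7, and a Pareto draw standing in for "hugely skewed") are assumptions picked for illustration, not real measurements.

```python
import random
import statistics

def share_within(data, k):
    """Fraction of values within k standard deviations of the mean."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

random.seed(42)
# Synthetic samples with assumed parameters, not real measurements.
normal_sample = [random.gauss(170, 7) for _ in range(50_000)]
skewed_sample = [random.paretovariate(1.2) for _ in range(50_000)]

for name, sample in (("normal", normal_sample), ("skewed", skewed_sample)):
    shares = [round(share_within(sample, k), 3) for k in (1, 2, 3)]
    print(name, shares)
```

On the bell-shaped sample the three shares should land near 68%, 95% and 99.7%. On the heavy-tailed one, nearly everything sits inside one standard deviation, because a handful of extreme values inflate the standard deviation itself.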

Things in nature tend to be normally distributed. Heights. Weights. Bench-press abilities. This is why, given an average American male height of ~5’8”, you’ll find a number of 5’4”s, perhaps an equal number of 6-footers and even a tiny number of 4’8”s and 6’8”s. But you will absolutely NEVER find someone who is 26 feet tall, nor will you ever find anyone who weighs 3,000 lbs., as those measures would sit so many standard deviations from the mean that the probability of their existence is as close to zero as makes no difference. It’s also why a bunch of super tall or super heavy people make hardly any difference to a large enough population. If a whole building of Yao Ming-sized people moves into a city with NYC's population, the height of the average resident won’t budge one millimeter, but a single basketballer in your book club most definitely will cause a massive skew.

In stark contrast, artificial things like wealth are not at all normally distributed, and a single person of Bill Gates’ means (pun intended) moving next to you will immediately raise the average net worth of your entire zip code, if not (small) city, by several zeros. Will such an “average” be at all representative of YOUR or your other neighbors’ wealth? Not really. And this is why you almost never hear about average income but rather the median. In a normal distribution, the mean, median and mode are all equal. In a distribution that is skewed by massive outliers many standard deviations from the average, as in wealth, the mean shifts violently toward the outlier(s) but the median does not. This is also why the census reports “median” household income in your zip code, knowing it’s immune to a small number of super-wealthy (or super-impoverished) people. The 50th percentile remains relatively firm so long as the population does.
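The billionaire-next-door effect is easy to reproduce. Here is a minimal sketch with an invented neighborhood: the 999 household net worths and the newcomer's $100 billion are made-up round numbers, chosen only to show the mechanics.

```python
import statistics

# Hypothetical neighborhood: 999 households with made-up net worths
# from $50,000 up to $549,000, in $500 steps.
neighbours = [50_000 + 500 * i for i in range(999)]
newcomer = 100_000_000_000  # assumed order-of-magnitude figure for a multi-billionaire

before_mean = statistics.fmean(neighbours)
before_median = statistics.median(neighbours)

after = neighbours + [newcomer]
after_mean = statistics.fmean(after)
after_median = statistics.median(after)

print(f"mean:   {before_mean:>15,.0f} -> {after_mean:>15,.0f}")
print(f"median: {before_median:>15,.0f} -> {after_median:>15,.0f}")
```

With these numbers the mean jumps from about $300 thousand to about $100 million, while the median moves by $250.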

As you may infer by now, on-line traffic is anything but normally distributed. And unlike in the blissful ivory tower of academia, we cannot just throw away the outliers and chalk them up to flukes. The outliers eat our bandwidth and take compute resources away from the available pool, so unless you succeed in blocking them, you really should account for them. Besides, even though “bot” now seems like a bad word, bots are often doing you a service. Google’s indexer is a bot, as is the scraper that puts previews of your content into Facebook and other social media.

Let’s say that your normal monthly traffic is reasonably normal (blue line) but on days that Google indexes you, it will look closer to the green. These days, such things pretty much handle themselves since all meaningful hosts sit behind load-balancers, content delivery networks and much else. But back in the day, think about some of the optimizations and their consequences. Do you remember having to take care of session stickiness and ensure that once a session was created for a user, that user would be sent to the same server for the duration of the session? Now imagine that user session is this Google bot. Ouch. -1 server for that duration.

But that was then. That particular problem has pretty much been fixed: all such bots now get handled automatically and (for the most part) do not abuse you by choking your pipe.

What happened to me recently was much more sinister, specifically BECAUSE it was no bot of any kind but a small group of dedicated users. I cannot get into specifics because this is an active client engagement, but I can discuss a remarkably parallel case that happened in 2002-3 during my time at Liquid Generation.

2002 was eons ago in tech-time. A ~2GHz Xeon server with ~16GB of RAM could cost over $20,000. Ours did. And the servers we leased at Rackspace Managed Hosting, along with the bandwidth they required, were a nice $30,000 expense PER MONTH. We routinely pushed 10-15 Terabytes of traffic so this was actually a deal. Pipes were just not as thick back then. Lots of folk were still dialing up and if they had 128k DSL, they were doing well. You can see, however, why we ignored potential spikes in traffic absolutely at our peril.

But the most interesting problem was when we started noticing our average time-on-site growing suddenly and massively. Whereas our competition (according to ComScore) was seeing 10, 20, maybe 30 second averages, we started seeing minutes. 2. Then 3. Then 5. Massive pageviews. We were total rock stars! Except…we weren’t seeing this reflected in our ad-views. So despite the fact that we had all of this glorious traffic, we weren’t able to monetize it. That kind of defeats the purpose, doesn’t it? We needed to understand WHY the ads weren’t being served and for that we (I) needed my raw datums (sp?) – not just some aggregate report about them.

For reasons stated above, whenever faced with outcomes that don’t seem to gel with mean statistics, it’s good to follow the data to the edge cases. First, let’s see if the distro is normal. So we calc the median. Always (relatively) simple in Excel, except that back then Excel was limited to 65,536 rows. One of our busy months could see views in the 500 Million range. SQL Server has never had a median function (still doesn’t) and PERCENTILE_DISC didn’t come around until 2012. I don’t specifically remember how I did it but am having PTSD memories of sorting arrays.
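For what it's worth, a median over hundreds of millions of rows does not strictly need one giant sorted array. If the thing being measured is a whole number of seconds, you can tally how often each value occurs and then walk the tally until you pass the halfway mark. The sketch below shows that idea; it is not a reconstruction of what I actually did back then, and the function names and sample durations are invented for illustration.

```python
from collections import Counter
from typing import Iterable, Mapping

def count_durations(rows: Iterable[int]) -> Counter:
    """Stream over the rows once, keeping only a tally per distinct duration."""
    return Counter(rows)

def median_from_counts(counts: Mapping[int, int]) -> float:
    """Median of integer values given {value: number of occurrences}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    # 0-based positions of the middle element(s) in sorted order.
    lo_pos, hi_pos = (total - 1) // 2, total // 2
    lo_val = hi_val = None
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if lo_val is None and seen > lo_pos:
            lo_val = value
        if seen > hi_pos:
            hi_val = value
            break
    return (lo_val + hi_val) / 2

# Made-up seconds-on-site for seven visits, one of them a two-hour binge.
durations = [5, 20, 20, 25, 30, 30, 7200]
print(median_from_counts(count_durations(durations)))
```

Only the distinct values get sorted, so memory scales with the number of different durations rather than with the number of rows.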

Anyway, the math was the relatively simple part. Where it took us was the bummer. As you may already figure, the mean and median were VASTLY different. Whereas the mean time-on-site was over 5 minutes, the median was still in the 20-30 second range, just like our competition. It would appear that a very small number of dedicated fans were responsible for HOURS and HOURS on the site, consuming huge bandwidth and watching all of our content many times over. These several thousand super-fans skewed the mean for 10 million unique visitors who mostly watched one thing and left. That is the sensitivity of means to large enough outliers. And ad servers of the day were (and still are) frequency-capped (all except the shady kind). So although these folk would see ads on the first few views of anything, the next 50,000 were free. Uh oh.
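The gap between pageviews and ad-views falls straight out of the arithmetic of a frequency cap. Here is a rough sketch with invented numbers: the audience mix, the cap of 10 and the one-ad-per-view simplification are all assumptions, not our actual figures.

```python
def billable_impressions(views_per_visitor, cap):
    """Ads served when each visitor is shown at most `cap` ads in the period.

    Simplification: every view would carry exactly one ad if uncapped.
    """
    return sum(min(views, cap) for views in views_per_visitor)

# Made-up audience: 1,000 casual visitors with 2 views each,
# plus 3 super-fans with 5,000 views each.
views = [2] * 1_000 + [5_000] * 3
cap = 10  # assumed cap; real caps vary by campaign

total_views = sum(views)
ads = billable_impressions(views, cap)
print(total_views, ads, f"{ads / total_views:.1%}")
```

In this toy audience three visitors generate 15,000 of the 17,000 views, yet only 2,030 ads get served in total, about 12% of the traffic.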

It took almost 18 years for this issue to manifest itself again for me, but this time I was ready much faster. As Ray Dalio writes: the benefit of experience and remembering your history is that you can recognize the symptoms and just tell yourself: “ahh…another one of those…. This is what we need to do.” Mr. Dalio is talking about economic conditions and how events impact market segments. I, of course, am talking about how traffic flukes can impact customer success and leave people scratching their heads asking what happened. The next time you find yourself asking that very question, check if your data are normally distributed and then chase down a few outliers. As an anonymous wit once quipped: “God lives in the details.” Immediately, another anon quipped back: “So does the devil.” You should meet them in the middle. And the median. Stat.
