How Big Data Killed Sampling
A frame from the Killing Eve TV series by BBC.


"Sampling" live digital data isn't just plain wrong; it is clownish at best. Tell your data provider to stop. Now.

Discussing Big Data sampling methodologies is not exactly the definition of sexy. Yes, we know what Villanelle, one of the funniest killers ever to appear on screen, would say about this article: "This is so boooooring!". We agree with Villanelle. But if you currently rely on digital data providers for your insights, this short (but boooooring) read is worth your attention.


Villanelle says that this article is boring.


Raise your hand if you have ever seen a Social Media data analysis where the presenter proudly declared, "This study is based on millions of tweets and posts," as proof of the analysis' solidity. Fewer and fewer marketers still buy this cheap trick: with more than 500 million tweets sent each day, a few million data points are not meaningful at all - on the contrary, they form a 100% unreliable sample. Since live digital data cannot be sampled, the millions of tweets and posts boasted in these analyses are the nails in the coffin of their reliability and credibility.

Sampling ≠ random extraction

As of 2020, 5.83 billion pages had been indexed on the visible part of the World Wide Web, almost double the 2012 figure. In terms of content size, it would take 10 trillion years to download the whole Web from your computer.

So when it comes to digital data - and e-commerce data and consumer reviews in particular - Big Data is quite an understatement. These data are growing at a staggering pace, continuously changing in both size and nature. Consequently, we cannot apply sampling to digital data analytics as we do in Market Research. Sampling is technically impossible here because we do not know the size or the stratification of the universe we want to sample in the first place. Any "sampling" in this field amounts to a meaningless random extraction, a.k.a. accidental sampling: the statistical equivalent of a coin toss. And yet this is what most digital data analytics providers have done over the last ten years, especially in Social Media Listening. Many DaaS companies have grown to the status of global data providers using extensive, random extractions of digital data, often shielding themselves behind the sheer size of the dataset they were pulling insights from.
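To make the point concrete, here is a toy simulation (our illustration, not the author's methodology): a stream of posts whose sentiment shifts over time, where an "accidental" extraction of whatever the crawler reaches first badly misestimates what a full census shows. All the numbers are invented.

```python
import random

random.seed(42)

# Toy universe: an ever-growing stream of posts. Later arrivals skew
# positive (think: new insurgent brands), which a fixed early
# "convenience" extraction never sees.
universe = [1 if random.random() < 0.40 else 0 for _ in range(50_000)]   # early posts: ~40% positive
universe += [1 if random.random() < 0.70 else 0 for _ in range(50_000)]  # later posts: ~70% positive

census_rate = sum(universe) / len(universe)

# "Accidental sampling": grab the first 5,000 posts the crawler happens to reach.
convenience = universe[:5_000]
convenience_rate = sum(convenience) / len(convenience)

print(f"census positive rate:      {census_rate:.2f}")       # ~0.55
print(f"convenience positive rate: {convenience_rate:.2f}")  # ~0.40, badly biased
```

No amount of extra convenience-sampled volume fixes this: grabbing 20,000 early posts instead of 5,000 shrinks the random noise but leaves the bias untouched.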

To marketers used to the average sample size of traditional market research, a data provider claiming a huge sample size might initially have sounded like an incredible endeavor ("Wow!!! Millions of data points!"). But a million data points have no significance if the size and quality of the universe are unknown and ever-changing. These analyses have proven to be shallow and largely useless. Frustration with many digital data platforms - particularly Social Media Listening tools - is widespread in the marketing community today.

Digital data are a vast, uncharted territory

With digital data analytics, we are in uncharted territory. Using sampling techniques here would be like inferring that we must have landed on a tropical island because of the white sand on the shore. But cartographers in the Middle Ages already knew there was only one way to map out an unknown territory: navigate the whole shoreline and chart the entire map.

Similarly, in digital data analytics, we have only one way to provide statistically valid datasets and insights: we must take a complete census of the universe we want to analyze. If we want to know the sentiment on Twitter in China in 2019, we need to include every tweet in the dataset. And if we want to know whether a 4.1 rating on Amazon is good, we need to analyze the performance of every single product sold on Amazon in that same category. There are no shortcuts, and no mathematical magic can spare us this simple fact.
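The "is 4.1 good?" question only has an answer relative to the full category distribution. A minimal sketch of that benchmarking step, with invented ratings standing in for a real category census:

```python
# Toy category census: the average rating of EVERY product in the
# category (invented values). Only against this full distribution can
# a single product's 4.1 be judged.
category_ratings = [3.2, 3.8, 4.0, 4.1, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8]

def percentile_of(value, population):
    """Share of the category strictly below the given rating, in percent."""
    below = sum(r < value for r in population)
    return below / len(population) * 100

print(percentile_of(4.1, category_ratings))  # 30.0 -> in this toy category, 4.1 is below average
```

The same 4.1 would land at a completely different percentile in a category whose census skews lower, which is exactly why the full census, not the raw rating, carries the insight.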

Because taking a complete data census is a daunting task, many limit the analysis to just a few Brands or key topics. But that cannot work either. Focusing only on the main competitors or market leaders is a recipe for disaster. One might think that minor Brands are marginal data with a negligible impact on market influence, but there is no such thing as marginal data in Big Data. Today, categories are flooded with new Brands and products daily. Many start-ups and co-packers are trying their luck by fast-prototyping new concepts. Only a few may succeed, but the collective impact of these small initiatives is very significant. Analyzing only a few big Brands means missing emerging trends completely, only to identify them when it is too late.
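The long-tail arithmetic behind this claim can be sketched in a few lines (all figures hypothetical, chosen only to show the mechanism):

```python
# Hypothetical category: 65 big incumbents plus a long tail of small
# insurgent brands. Each tail brand looks negligible on its own.
incumbent_volume = [10_000] * 65   # big brands, slightly negative sentiment
tail_volume      = [300] * 500     # tiny brands, clearly positive sentiment

incumbent_total = sum(incumbent_volume)
tail_total = sum(tail_volume)
tail_share = tail_total / (incumbent_total + tail_total)

# Assumed sentiment scores: -5 for incumbents, +30 for the tail.
blended_sentiment = (incumbent_total * -5 + tail_total * 30) / (incumbent_total + tail_total)

print(f"long-tail share of volume: {tail_share:.1%}")   # 18.8% invisible to an incumbents-only study
print(f"true market sentiment:     {blended_sentiment:+.1f}")  # positive, not the -5 incumbents show
```

An incumbents-only analysis of this toy market would report negative sentiment and miss nearly a fifth of the volume, while the census shows a market already turning positive.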

Let's look at what is happening in the Diapers market in China. Every Chinese co-packer is trying to grab a larger slice by launching several new Brand concepts every quarter. In Q2 2020 alone, the number of Brands sold in the category grew to 319, a 10% increase. In the Wipes market, the increase was a whopping +35%, up to 411 Brands.

Today new Brand Concepts are flooding the markets.


Many markets look like this today. Think of Beverages or Cosmetics; no industry is immune in the era of "Insurgent Brands". Cluster 1 (blue dots) in the image above represents China's 65 largest incumbent diaper Brands. They account for the majority of volume but show an overall negative sentiment (Net Opinion Score), as they all sit to the left of the vertical axis. Clusters 2 and 3 (orange and light green dots, respectively) count only 55 Brands combined, but their overall sentiment is very positive, as they make their consumers happier. These are all small Brands that did not exist a few years ago, collectively shifting the market in their favor and driving the trends. And their influence keeps growing: they have gained ten share points year over year, now reaching one-third of the review volume (image below: review share is a very close proxy of online sales share).
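The article does not define how its Net Opinion Score is computed; a common net-score construction (assumed here purely for illustration) is the percentage of positive reviews minus the percentage of negative ones:

```python
# Assumed (not the author's published formula): NOS = % positive reviews
# minus % negative reviews, on a -100..+100 scale. Ratings of 4-5 count
# as positive, 1-2 as negative, 3 as neutral.
def net_opinion_score(ratings):
    positive = sum(r >= 4 for r in ratings)
    negative = sum(r <= 2 for r in ratings)
    return (positive - negative) / len(ratings) * 100

incumbent_reviews = [5, 4, 3, 2, 1, 1, 2, 3, 4, 2]   # mixed-to-sour, like Cluster 1
insurgent_reviews = [5, 5, 4, 5, 4, 4, 5, 3, 4, 5]   # mostly happy, like Clusters 2-3

print(net_opinion_score(incumbent_reviews))  # -20.0, left of the vertical axis
print(net_opinion_score(insurgent_reviews))  # 90.0, right of the vertical axis
```

Whatever the exact formula, the chart's logic is the same: a signed score separates brands with net-negative word of mouth from those with net-positive word of mouth.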

Smaller but innovative Brands significantly drive change

The above analysis shows that we must analyze the entire market to understand competitive dynamics. The classic Market Research approach of covering only one's top competitors worked in the past, when there were only a few competitors to care about. Back then, selection bias was a necessary evil imposed by the limits of the available technologies. But in today's hyper-crowded markets, and with today's Artificial Intelligence capabilities, using the same approach is suicidal.

So if you're looking for a digital data provider, ask this simple question: are they sampling or taking a complete census?

Now you know the answer you want to hear.

Stefano Curotto

Master Data Manager

1 yr

Dear Gianluca, you keep posting articles that I find incredibly interesting. Albeit, you know, in this case this is not my branch of interest. But one never knows! I might be confronted at some point with a Convenience Sampling analysis, and then I will know what they're talking about and why I must be cautious with it.. :-) Meanwhile, I am already thinking that we witness, nearly every day, examples of that methodology being used in the field of Politics or Social Debate, where one counts the Likes obtained by a particular post to argue that the "overwhelming part of the population" is of the same opinion. This is not reliable evidence, as you have so well explained.

Ryan Kappedal

Data Science Tech Lead @ Google | Data Quality, LLMs

1 yr

Big Data is actually Big Sample Bias in disguise

Edward "Ted" Vandenberg, MBA, MCIS

Experienced insurance executive | Ex-Accenture, Ex-Farmers Insurance, Ex-Aon and JLT | Data Science | Claims | Process Improvement | Consulting

1 yr

Thanks, https://www.dhirubhai.net/in/ruggierogianluca/ for the insights. This reminds me of a saying I learned somewhere: "the plural of anecdote is not data". Perhaps even more to the point is the joke about the drunk looking for his car keys at night under a lamppost. Why? Because that's where the light is. There is no such thing as random sampling. But the universe as a sample is intractable, if even possible. I leave it to my clever data scientist friends to find a solution. But in the meantime, it seems you are pointing out that everyone's marketing information is...ah...sub-optimal. No doubt ChatGPT can figure this out.

Nico Sprotti

Copilot Studio & Power Platform @ Microsoft

1 yr

Aka use convenience sampling when you understand what you are solving lol

Dominic Ligot

Technologist, Social Impact, Data Ethics, AI

1 yr

Convenience is not sampling. But I know people who get away with it.
