
Statistical Pitfalls: Proper Sampling

Lennaert van den Brink

Cluster Manager & BI consultant at E-mergo

Published: March 29, 2024

There are many arguments for letting your decisions be guided by data. "Numbers don't lie" is an often-heard phrase. At first glance, that's absolutely true. Data is an impartial representation of reality and a good test for assumptions and gut feelings. Unfortunately, it's not always easy to interpret data correctly. That's why, almost without exception, schools and universities dedicate part of their curriculum to statistics, no matter what the main subject may be. Those who work with data run the risk of falling into one of the many pitfalls. In this series of blogs, I explain some common pitfalls and provide concrete tips to avoid them. In this fourth and final blog of the series, we discuss 'Proper Sampling'.

The Quest for a Proper Sample:

We all collect as much data as possible nowadays. After all, data is the "new gold" and should help us make better decisions. In the previous blogs, we showed that you have to be careful how you analyze data to avoid drawing the wrong conclusion. Today, we explain that the way you collect the data also has a significant impact on whether you get the right results.

In 1948, the American presidential election was a contest between Truman and Dewey. It was an exciting race, but The Chicago Daily Tribune was sure of the outcome. It had conducted a poll, and the victory would go to the Republican candidate Dewey. The paper was so sure of its case that on election night, November 2, it printed the next morning's edition with the headline "Dewey Defeats Truman" before the official results were known. However, the result was a clear victory for Truman, who triumphantly posed for photographers with the newspaper in his hand.

What had gone wrong?

The newspaper had conducted the polls by randomly selecting telephone numbers and calling to ask whom the person answering the phone would vote for. In 1948, however, the home telephone was a luxury item, so it was mainly wealthier, middle-class Americans who had a telephone number. Republican voters were therefore overrepresented in the newspaper's sample.
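
The mechanism can be sketched with a small simulation. All numbers here are invented for illustration and are not historical data: we assume 45% of voters favor the Republican candidate, and that phone ownership is 60% among those voters and 25% among everyone else. Polling only phone owners then overstates the Republican share considerably.

```python
import random

def simulate_poll(n_voters=100_000, sample_size=1_000, seed=42):
    """Compare the true vote share with the share seen in a phone-only poll."""
    rng = random.Random(seed)
    voters = []
    for _ in range(n_voters):
        # Assumption: 45% of the electorate votes Republican.
        votes_republican = rng.random() < 0.45
        # Assumption: phone ownership depends on the voter group.
        has_phone = rng.random() < (0.60 if votes_republican else 0.25)
        voters.append((votes_republican, has_phone))

    true_share = sum(rep for rep, _ in voters) / n_voters

    # The pollster draws a perfectly random sample, but only from phone owners.
    phone_owners = [v for v in voters if v[1]]
    sample = rng.sample(phone_owners, sample_size)
    polled_share = sum(rep for rep, _ in sample) / sample_size
    return true_share, polled_share

true_share, polled_share = simulate_poll()
print(f"true share: {true_share:.2f}, polled share: {polled_share:.2f}")
```

Note that the draw itself is random. The bias comes entirely from the list the sample is drawn from, which is why a larger sample would not have helped.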

Another pitfall in collecting data is the so-called "survivorship bias". During World War II, the US Army conducted research on their combat aircraft. They wanted to strengthen the aircraft with additional armor. To determine where to place the extra armor, they looked at the bullet holes in planes that had returned after being shot at.

Initially, they decided to place armor on the fuselage and wingtips. After all, that's where the planes were most often hit. However, statistician Abraham Wald advised strengthening the cockpit and engines instead. Why? Because the planes that were analyzed had all returned. The planes that had their cockpit or engines shot were all downed and therefore were not part of the set of planes in the analysis.
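
Wald's reasoning can be illustrated with a toy simulation. The loss probabilities below are assumptions chosen for the example, not wartime figures: every section is hit equally often, but a hit on the engine or cockpit is far more likely to bring the plane down.

```python
import random

SECTIONS = ["fuselage", "wings", "engine", "cockpit"]
# Assumed probability that a single hit in a section downs the plane.
LOSS_PROB = {"fuselage": 0.05, "wings": 0.05, "engine": 0.60, "cockpit": 0.60}

def count_hits(n_planes=10_000, hits_per_plane=3, seed=1):
    """Return hit counts per section for all planes and for returned planes only."""
    rng = random.Random(seed)
    all_hits = dict.fromkeys(SECTIONS, 0)
    returned_hits = dict.fromkeys(SECTIONS, 0)
    for _ in range(n_planes):
        # Hits land on every section with equal probability.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # A plane returns only if none of its hits brings it down.
        survived = all(rng.random() > LOSS_PROB[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                returned_hits[s] += 1
    return all_hits, returned_hits

all_hits, returned_hits = count_hits()
print("all planes:     ", all_hits)
print("returned planes:", returned_hits)
```

Among the returned planes, engine and cockpit hits are rare even though they were just as common overall. An analyst who only sees the returned planes sees exactly the misleading pattern described above.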

So it is important to consider carefully who or what you are measuring when collecting data. How you conduct the measurement can also have unintended effects. For example, suppose you ask a group of people in a survey which color they like best: red, green, or yellow. The results are:

Red 30%
Green 50%
Yellow 20%

You cannot then state that their favorite color is green, only that green is preferred over red and yellow. After all, you gave them a choice between those three options, so colors like blue or orange could not be given as an answer. Factors such as the wording of the question and even the order in which the options are presented can influence the answers.

Moreover, the very fact that you are trying to measure an effect can influence the values you measure. This sounds paradoxical, but it is a well-known phenomenon called the "Hawthorne Effect". Hawthorne Works was a factory in Illinois where, in the 1920s, researchers tried to increase efficiency by making small adjustments and then investigating their effect on productivity. Most adjustments seemed to have a positive effect during the study itself, but these effects disappeared almost immediately after the study ended. The analysis showed that the employees worked harder when the researchers were "watching them". A similar effect is often seen in exit polls during elections, where voters tell the researcher that they voted for the "socially desirable" party but in reality voted for a completely different party.

4 Tips for Good Samples:

A good analysis starts with collecting good data. You will often hear the mantra "garbage in, garbage out" when working with data. Here are a few tips to improve the quality of your sample:

#1 Random Samples

If you cannot measure all transactions, try to work with random samples as much as possible. The more random, the better, so vary, for example, in the moments when you measure.
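
As a minimal sketch of why this matters, the example below compares a random sample with a convenience sample that only covers the first hours of the day. The transaction data is hypothetical and constructed so that amounts rise during the day. The helper names are illustrative.

```python
import random

def draw_random_sample(transactions, k, seed=None):
    """Return k transactions chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(transactions, k)

def mean_amount(rows):
    """Average amount of a list of (hour, amount) rows."""
    return sum(amount for _, amount in rows) / len(rows)

# Hypothetical data: 50 transactions per hour, (hour_of_day, amount),
# with amounts that grow as the day progresses.
data_rng = random.Random(0)
transactions = [(hour, 10 + hour + data_rng.random())
                for hour in range(24) for _ in range(50)]

morning_only = transactions[:200]  # convenience sample: hours 0 to 3 only
random_sample = draw_random_sample(transactions, 200, seed=0)

print("all transactions:", round(mean_amount(transactions), 1))
print("morning only:    ", round(mean_amount(morning_only), 1))
print("random sample:   ", round(mean_amount(random_sample), 1))
```

Both samples contain 200 transactions, but only the random one covers all moments of the day and therefore lands close to the overall average.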

#2 Quotas

If you know in advance which factors have a major influence on your process, you can build quotas into your sample. For example, if gender is important and you know that the male-female distribution in your target group is 50%-50%, you survey 10 men and 10 women. When the respondents within each group are also selected at random, this is called "Stratified Sampling".
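
The quota arithmetic can be sketched as follows. The function name and the rounding rule (leftover places go to the groups with the largest fractional remainder) are choices made for this example, and the shares are assumed to sum to 1.

```python
def quota_allocation(population_shares, sample_size):
    """Split sample_size across groups in proportion to their population share."""
    raw = {group: share * sample_size for group, share in population_shares.items()}
    quotas = {group: int(value) for group, value in raw.items()}
    # Rounding down can leave a few places unassigned; give them to the
    # groups with the largest fractional remainder.
    leftover = sample_size - sum(quotas.values())
    by_remainder = sorted(raw, key=lambda g: raw[g] - quotas[g], reverse=True)
    for group in by_remainder[:leftover]:
        quotas[group] += 1
    return quotas

print(quota_allocation({"men": 0.5, "women": 0.5}, 20))
```

For the 50%-50% example with 20 respondents, this yields 10 men and 10 women.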

#3 Stratified Sampling

If you have collected enough data, you can also apply Stratified Sampling afterwards. You randomly select measurement points from your complete dataset until each category is filled to the required number, and you leave the rest of the data out of your research.
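
A minimal sketch of this after-the-fact selection is shown below. The function name, the record layout, and the group sizes are assumptions for the example. The dataset deliberately contains more men than women, and the sample restores the 50%-50% balance.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, quotas, seed=None):
    """Randomly pick quotas[group] records from each group; ignore the rest."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    sample = []
    for group, needed in quotas.items():
        if len(groups[group]) < needed:
            raise ValueError(f"not enough records in group {group!r}")
        sample.extend(rng.sample(groups[group], needed))
    return sample

# Hypothetical dataset in which men are overrepresented (70 versus 30).
records = [{"id": i, "gender": "male" if i < 70 else "female"} for i in range(100)]
balanced = stratified_sample(
    records,
    key=lambda r: r["gender"],
    quotas={"male": 10, "female": 10},
    seed=7,
)
print(len(balanced), "records selected")
```

The function raises an error when a group is too small to fill its quota, which is the practical meaning of "enough data" here.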

#4 Safe and Ethical Sampling

The Hawthorne effect occurs especially when the observed persons feel threatened. For a good sample, it is in your own interest to ensure a socially safe and open atmosphere. Sometimes it is possible to measure without the observed person knowing, but keep the ethical and moral objections in mind.
