Statistical Pitfalls: Proper Sampling
There are many arguments for letting your decisions be guided by data. "Numbers don't lie" is an often-heard phrase, and at first glance it is true: data is an impartial representation of reality and a good test for assumptions and gut feelings. Unfortunately, it is not always easy to interpret data correctly. That is why, almost without exception, schools and universities dedicate part of their curriculum to statistics, no matter what the main subject may be. Those who work with data run the risk of falling into one of the many pitfalls. In this series of blogs, I explain some common pitfalls and provide concrete tips to avoid them. This fourth and final blog of the series discusses 'Proper Sampling'.
The Quest for a Proper Sample:
We all collect as much data as possible nowadays. After all, data is the "new gold" and should help us make better decisions. In the previous blogs, we showed that you have to be careful how you analyze data to avoid drawing the wrong conclusion. Today, we explain that the way you collect the data can also have a significant impact on whether you obtain the right results.
In 1948, the American presidential election was between Truman and Dewey. It was an exciting contest, but The Chicago Daily Tribune was sure of the outcome. They had conducted a poll, and the victory would go to the Republican candidate Dewey. They were so sure of their case that they printed the newspaper on election night, November 2, with the headline "Dewey Defeats Truman" before the official results were known. However, the result was a clear victory for Truman, who triumphantly posed for photographers with the newspaper in his hand.
What had gone wrong?
According to the often-told account, the newspaper had conducted its poll by randomly selecting telephone numbers and asking whoever answered whom they would vote for. In 1948, however, a home telephone was still a luxury item, so it was mainly wealthier Americans who had one. As a result, Republican voters were overrepresented in the newspaper's sample.
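The mechanism behind this kind of selection bias can be illustrated with a small simulation. The sketch below is not a reconstruction of the 1948 poll: every percentage in it (share of wealthier households, phone ownership, voting preference) is an invented assumption, and the function name simulate_poll is hypothetical. It only shows that sampling from phone owners alone can shift the estimate, while a random sample from the whole population stays close to the true value.

```python
import random

def simulate_poll(population_size=100_000, sample_size=2_000, seed=1948):
    """Compare a random sample with a phone-owners-only sample.

    All probabilities below are made-up illustration values, not historical data.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        wealthy = rng.random() < 0.30                             # assumed share of wealthier households
        has_phone = rng.random() < (0.80 if wealthy else 0.20)    # assumed phone ownership
        votes_dewey = rng.random() < (0.65 if wealthy else 0.40)  # assumed voting preference
        population.append((has_phone, votes_dewey))

    def dewey_share(voters):
        # Fraction of the given voters who prefer Dewey.
        return sum(vote for _, vote in voters) / len(voters)

    phone_owners = [person for person in population if person[0]]
    return {
        "true_share": dewey_share(population),
        "random_sample": dewey_share(rng.sample(population, sample_size)),
        "phone_sample": dewey_share(rng.sample(phone_owners, sample_size)),
    }

if __name__ == "__main__":
    for name, share in simulate_poll().items():
        print(f"{name}: {share:.1%}")
```

Under these assumptions, the phone-only sample points to a Dewey majority even though he has less than half of the votes in the simulated population.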
Another pitfall in collecting data is the so-called "survivorship bias". During World War II, the US military studied its combat aircraft, which it wanted to reinforce with additional armor. To determine where to place the extra armor, they looked at the bullet holes in planes that had returned after being shot at.
Initially, the plan was to place armor on the fuselage and wingtips. After all, that is where the returning planes were hit most often. However, statistician Abraham Wald advised strengthening the cockpit and engines instead. Why? Because every plane in the analysis had made it back. Planes that were hit in the cockpit or engines had mostly been shot down and were therefore missing from the set of planes in the analysis.
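Wald's reasoning can be made concrete with a toy simulation. In this sketch, every plane is hit in exactly one randomly chosen section, and the chance that a hit downs the plane differs per section. The loss probabilities and the function name simulate_returning_planes are assumptions chosen purely for illustration; they are not taken from the wartime data.

```python
import random

def simulate_returning_planes(n_planes=10_000, seed=42):
    """Count hits per section for all planes and for returning planes only."""
    rng = random.Random(seed)
    # Assumed chance that one hit in this section downs the plane (illustrative only).
    loss_probability = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.1, "wingtips": 0.05}
    sections = list(loss_probability)
    hits_all = dict.fromkeys(sections, 0)
    hits_returned = dict.fromkeys(sections, 0)
    for _ in range(n_planes):
        section = rng.choice(sections)  # every section is equally likely to be hit
        hits_all[section] += 1
        if rng.random() >= loss_probability[section]:
            hits_returned[section] += 1  # the plane survives and ends up in the analysis
    return hits_all, hits_returned

if __name__ == "__main__":
    hits_all, hits_returned = simulate_returning_planes()
    for section in hits_all:
        print(f"{section}: hit {hits_all[section]} times, seen on returned planes {hits_returned[section]} times")
```

Although all sections are hit about equally often in this setup, the returned planes show far fewer engine and cockpit hits, simply because those planes rarely come back to be counted.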
So it is important to consider carefully who or what you are measuring when collecting data. How you conduct the measurement can also have unintended effects. For example, suppose you ask a group of people in a survey which color they like best: red, green, or yellow, and green receives the most votes.
You cannot then conclude that green is their favorite color, only that they prefer it to red and yellow. After all, you offered them just those three options; colors like blue or orange could not be given as an answer. Factors such as the wording of the questions and even the order in which they are asked can also influence the answers.
Moreover, the very fact that you are measuring an effect can influence the values you measure. This sounds paradoxical, but it is a well-known phenomenon called the "Hawthorne Effect". Hawthorne Works was a factory in Illinois where, in the 1920s, researchers tried to increase efficiency by making small adjustments and then investigating their effect on productivity. Most adjustments seemed to have a positive effect during the study itself, but these effects disappeared almost immediately after the study ended. The analysis showed that the employees worked harder when the researchers were "watching them". A related effect is often seen in exit polls during elections, where some voters tell the researcher that they voted for the "socially desirable" party when in reality they voted for a different one.
4 Tips for Good Samples:
A good analysis starts with collecting good data. You will often hear the mantra "garbage in, garbage out" when working with data. Here are a few tips to improve the quality of your sample:
#1 Random Samples
If you cannot measure all transactions, work with random samples as much as possible. Ideally, every transaction has the same chance of being selected, so vary, for example, the times at which you measure.
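As a minimal sketch of this tip, the snippet below draws a simple random sample with Python's standard library. The list of transaction IDs is a hypothetical stand-in for your own data, and the helper name simple_random_sample is an assumption for illustration.

```python
import random

def simple_random_sample(items, k, seed=None):
    """Return k distinct items, each item having the same chance of being picked."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(items, k)

# Hypothetical data: transaction IDs 1..1000 in the order they were recorded.
transactions = list(range(1, 1001))
sample = simple_random_sample(transactions, 50, seed=7)
print(sorted(sample))
```

Taking the first 50 transactions instead would only cover the earliest moments of measurement, which is exactly the kind of pattern a random draw is meant to avoid.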
#2 Quotas
If you know in advance which factors have a major influence on your process, you can build quotas into your sample. For example, if gender is important and you know that the male-female distribution in your target group is 50%-50%, you survey 10 men and 10 women (if you select the respondents within each group at random, this is also called "Stratified Sampling").
#3 Stratified Sampling
If you have collected enough data, you can also apply Stratified Sampling afterwards: randomly select measurement points from your complete dataset until each category is filled in the required proportion, and leave the rest of the data out of your research.
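Stratifying an existing dataset afterwards could look roughly like the sketch below. The function stratified_sample, the record layout, and the 50%-50% target shares are assumptions for illustration; in practice you would plug in your own categories and proportions.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, target_shares, total, seed=None):
    """Randomly pick records so that each category gets its target share of `total`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)  # group the records by category
    sample = []
    for group, share in target_shares.items():
        k = round(total * share)  # number of records wanted from this category
        if k > len(strata[group]):
            raise ValueError(f"not enough records in category {group!r}")
        sample.extend(rng.sample(strata[group], k))
    return sample

# Hypothetical dataset in which women are underrepresented (100 women, 300 men).
records = [{"id": i, "gender": "male" if i % 4 else "female"} for i in range(400)]
balanced = stratified_sample(
    records,
    key=lambda r: r["gender"],
    target_shares={"male": 0.5, "female": 0.5},
    total=20,
    seed=3,
)
print(sum(r["gender"] == "male" for r in balanced), "men and",
      sum(r["gender"] == "female" for r in balanced), "women")
```

The records that are not selected are simply left out, so this approach only works when every category contains enough data points to fill its share.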
#4 Safe and Ethical Sampling
The Hawthorne effect tends to be strongest when the people being observed feel threatened. For a good sample, it is therefore in your own interest to ensure a socially safe and open atmosphere. Sometimes it is possible to measure without the observed person knowing, but keep the ethical and moral objections in mind.