What is Statistical Inference?
We have already briefly addressed Descriptive Statistics, which describes how data are organized, and Probability Theory, which measures the variability of everyday phenomena according to their occurrence. From now on, we will study the third central area of statistics: Inferential Statistics.
Inference in a Nutshell
In Inferential Statistics, we have a population — a complete set of data. We apply a sampling technique to collect a sample of this population, analyze this sample and make inferences about the population.
Inferential statistics aims to extrapolate the results obtained with descriptive statistics to the population. For this, we must apply a sampling technique that extracts a representative sample; otherwise, we will reach inferences that do not represent reality.
From this sample, we calculate the mean, median, mode, and several other statistics, and then, by applying a series of inferential techniques, we make inferences about the population.
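As a sketch of this idea, the snippet below draws a random sample from a simulated, entirely hypothetical population and computes those descriptive statistics as point estimates of the population values:

```python
import random
import statistics

# Hypothetical population: 100,000 simulated measurements (values invented).
random.seed(42)
population = [random.gauss(1000, 120) for _ in range(100_000)]

# Draw a simple random sample and summarize it.
sample = random.sample(population, 500)
print(statistics.mean(sample))    # point estimate of the population mean
print(statistics.median(sample))
# Mode only makes sense after binning continuous values.
print(statistics.mode([round(x, -2) for x in sample]))
```

The sample mean lands close to the population mean of 1000, which is exactly the extrapolation inferential statistics formalizes.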
The concept of sampling is crucial in Data Science. In practice, we work with samples all the time: during the preprocessing phase, we prepare samples and then train the Machine Learning model, often using the very same sampling techniques that polling companies use in election surveys, now applied to Data Science.
Population and Sample
To start our discussion on inferential statistics, we have to distinguish between population and sample clearly.
Let’s assume a scenario: we have been asked to conduct a survey measuring the durability of the lamps produced by a particular factory. What approach would we use?
It is not hard to conclude that it is simply not feasible to test, every day, all the lamps the company manufactures. What we can do instead is collect a representative sample and, from it, make inferences about the population of lamps produced.
When we work with sampling, there is an expected error rate; we can calculate this error rate and attach confidence intervals to our estimates. After all, however representative the sample may be, it is not the population itself.
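As an illustration of attaching a confidence interval to a sample estimate, the sketch below uses made-up lifetime values and a normal approximation (the sample size and parameters are assumptions, not from the original text):

```python
import math
import random
import statistics

random.seed(0)
# Hypothetical sample of 60 lamp lifetimes in hours (values are simulated).
sample = [random.gauss(1000, 120) for _ in range(60)]

n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

z = 1.96  # ~95% confidence under a normal approximation
lower, upper = mean - z * se, mean + z * se
print(f"95% CI for mean lifetime: ({lower:.1f}, {upper:.1f}) hours")
```

The interval quantifies exactly the "expected error rate" mentioned above: the narrower it is, the more precisely the sample pins down the population mean.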
Principles of Sampling
Inferential statistics provides us with tools to make inferences and estimate characteristics of the entire population through the analysis of a sample. The concept of sampling is quite simple: to know whether the pie is delicious, eat a slice (an inference about the whole pie).
While a census (a costly undertaking) examines all elements of a given group, sampling studies only a representative part of those elements.
Sampling theory studies the relationships between a population and the samples extracted from it. Sampling is useful for estimating unknown population quantities, or for determining whether the differences observed between two samples are due to chance or are significant.
Sampling Types
Let us now look at the techniques and procedures for collecting representative samples of a population. There are two main families of sampling methods: random and non-random.
Probabilistic or Random Sampling: in this type of sampling, the samples are obtained randomly; that is, every population unit has the same probability of being chosen. All we want is to select the sample components purely at random. There are three common variants of random sampling.
Systematic Sampling: population elements are ordered and selected at regular intervals. An example is a production line where a piece is removed every fixed number of items for quality control.
Stratified Sampling: a heterogeneous population is divided into homogeneous subpopulations (strata), and a sample is taken from each stratum; that is, we first partition the population and then select units from each partition to compose the sample.
Cluster (Conglomerate) Sampling: this is almost the inverse of stratified sampling. In stratified sampling, we divide the population into groups and take units from each group; here, we divide it into groups and keep entire groups.
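The sampling schemes above can be sketched on a toy population; all labels, sizes, and group counts below are invented purely for illustration:

```python
import random

random.seed(1)
# Toy population: 1,000 items, each tagged with a stratum/cluster label.
population = [{"id": i, "group": i % 5} for i in range(1000)]

# 1. Simple random sampling: every unit has the same chance of selection.
simple = random.sample(population, 50)

# 2. Systematic sampling: take every k-th unit from the ordered population.
k = len(population) // 50
systematic = population[::k][:50]

# 3. Stratified sampling: sample a fixed number within each subgroup.
stratified = []
for g in range(5):
    stratum = [u for u in population if u["group"] == g]
    stratified.extend(random.sample(stratum, 10))

# 4. Cluster sampling: randomly pick whole groups and keep them entirely.
chosen_groups = random.sample(range(5), 2)
clusters = [u for u in population if u["group"] in chosen_groups]

print(len(simple), len(systematic), len(stratified), len(clusters))
```

Note the trade-off the last two lines make visible: stratified sampling controls the sample size per group, while cluster sampling keeps whole groups and so inherits their sizes.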
Hypothesis Tests
We will now address one of the main tools of Inferential Statistics: hypothesis tests. One of the main problems statistical inference solves is testing hypotheses. A statistical hypothesis is a claim, which may be true or false, about a given population parameter, such as the mean, standard deviation, or correlation coefficient.
For a statistical hypothesis to be validated or rejected with certainty, it would be necessary to examine the entire population, which in practice is unfeasible. Instead, we extract a random sample from the population of interest and decide based on it; because of this, some errors may occur.
We want to use a hypothesis test to validate a population parameter through a random sample — from the hypothesis test, we infer the population parameter.
Decisions
Therefore, a statistical hypothesis test is a procedure that allows us to decide between H₀ (the null hypothesis) and Ha (the alternative hypothesis) based on the information contained in the sample.
H₀
The null hypothesis states that a population parameter (such as mean, standard deviation, and so on) is equal to a hypothetical value. Thus, the null hypothesis is often an initial claim based on previous analyses.
Ha
The alternative hypothesis states that a population parameter is smaller than, greater than, or different from the hypothetical value in the null hypothesis. The alternative hypothesis is the claim we suspect may be true and seek evidence for.
Because we are analyzing sample data and not population data, errors can occur.
One of the secrets behind the hypothesis test is the correct definition of what H₀ is and what Ha is; the Data Scientist is responsible for defining the null and alternative hypotheses. We must understand the business problem and derive both hypotheses from it. An incorrect definition can compromise the entire process.
Example: A researcher has exam results for a sample of students who took a preparatory course for a national exam. The researcher wants to know whether these students scored above the national average of 78.
Based on this business need, the researcher examines the test results of the sampled students and compares them against the national average to verify whether the two differ.
An alternative hypothesis of the "greater than" form is appropriate here because the researcher specifically hypothesizes that the scores of students who took the course are higher than the national average.
We want to establish the alternative hypothesis that students who took the course score, on average, above 78.
Hypothesis Testing Pipeline
Here is a sequence of steps used in a hypothesis test, commonly applied in Digital Marketing and A/B Testing: state the hypotheses, choose a significance level, compute the test statistic from the sample, and decide.
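A minimal sketch of such a pipeline, implemented here as a one-sample z-test under a normal approximation (the function name, data, and defaults are assumptions for illustration, not from the original):

```python
import math
import statistics

def one_sample_z_test(sample, mu0, alpha=0.05, tail="two"):
    """Hypothesis-testing pipeline sketch using a one-sample z-test.

    1. State H0 (mean == mu0) and Ha (depends on `tail`).
    2. Choose the significance level alpha.
    3. Compute the test statistic from the sample.
    4. Convert it to a p-value and decide.
    """
    n = len(sample)
    z = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    # Standard normal CDF via the error function (normal approximation).
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    if tail == "two":
        p = 2 * min(cdf, 1 - cdf)
    elif tail == "greater":
        p = 1 - cdf
    else:  # "less"
        p = cdf
    return z, p, ("reject H0" if p < alpha else "fail to reject H0")

# Hypothetical scores for the student example, tested against mu0 = 78.
print(one_sample_z_test([82, 85, 79, 88, 90, 84, 86, 81, 87, 83],
                        78, tail="greater"))
```

With real data, a t-test would usually replace the normal approximation for small samples; the pipeline structure is the same.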
Unilateral Hypothesis Test
The unilateral, or one-tailed, test is used when the alternative hypothesis is expressed with < or >. Given the null hypothesis, we define the alternative hypothesis with either a greater-than or a less-than sign.
We have the null hypothesis that the average is 1.8; in the box on the left, Ha states the average is > 1.8, and in the box on the right, Ha states it is < 1.8.
Graphically, this is a normal distribution curve on which we locate the sample mean. If the mean falls in the white area, we do not reject H₀; if it falls in the yellow (rejection) region, we reject H₀.
Note that we will be on one side or the other of the tail of the normal distribution. Because of this, the test is called one-sided or one-tailed.
If the mean falls within the white region of the chart, we do not reject the null hypothesis. Otherwise, we reject it.
Example: A school has a group of students (the population) considered obese. The probability distribution of the weight of students between 12 and 17 years old is normal, with a mean of 80 kg and a standard deviation of 10 kg. The school principal proposes a treatment campaign to combat obesity. The doctor states that the results of the treatment will be presented after a given number of months, and that the students will have lost weight in this period.
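Under the stated numbers (mean 80 kg, standard deviation 10 kg), a lower-tailed z-test could look like the sketch below; the sample size and post-treatment sample mean are assumptions chosen for illustration:

```python
import math

# One-tailed z-test for the obesity example.
# H0: mu = 80 (treatment had no effect)   Ha: mu < 80 (weights decreased)
mu0, sigma, n = 80, 10, 36        # known population sd; sample size assumed
sample_mean = 76.5                # assumed post-treatment sample mean
alpha = 0.05

z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # ≈ -2.1
z_crit = -1.645                   # lower-tail critical value at alpha = 0.05
decision = "reject H0" if z < z_crit else "fail to reject H0"
print(z, decision)
```

Because z falls below the critical value, the sample mean lands in the yellow rejection tail and we reject H₀, concluding the treatment reduced weights.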
Bilateral Hypothesis Test
Just as we can apply a one-sided hypothesis test, where the alternative hypothesis states that the parameter is greater or smaller than some value, we also have the option of performing a bilateral hypothesis test. The bilateral test is used whenever the alternative hypothesis is expressed with ≠.
Now we don’t want to know if it’s bigger or smaller; we want to know if the alternative hypothesis is simply different.
The curve above represents the sampling distribution of the average broadband utilization. The population mean is assumed to be 1.8 GB, according to the null hypothesis H₀: μ = 1.8. Because there are two yellow rejection regions in the graph, this is called a bilateral or two-tailed hypothesis test, since the alternative hypothesis is expressed with ≠.
Example 2: A cookie factory packs boxes weighing 500 grams. The weight is monitored periodically, and the quality department has established that it must stay at 500 grams. Under what condition should the quality department stop production?
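A sketch of the corresponding two-tailed decision rule; the process standard deviation, sample size, and sample mean below are hypothetical:

```python
import math

# Two-tailed z-test for the cookie-box example.
# H0: mu = 500 g        Ha: mu != 500 g  -> stop the line if H0 is rejected
mu0, sigma, n = 500, 8, 64        # assumed process sd and sample size
sample_mean = 497.0               # assumed mean weight of the sampled boxes
alpha = 0.05

z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # ≈ -3.0
z_crit = 1.96                                      # two-tailed cutoff at 5%
stop_production = abs(z) > z_crit
print(z, stop_production)
```

The condition is symmetric: boxes that are too heavy trigger a stop just as boxes that are too light do, which is precisely why the test is two-tailed.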
Error Type I and Type II
The purpose of the Hypothesis Test is to verify the validity of a statement about a population parameter based on sampling. However, as we take the sample as a basis, we are exposed to the risk of wrong conclusions about the population due to sampling errors.
Significance Level α and Confidence Level 1-α
To test H₀, it is necessary to define a decision rule establishing a rejection zone for the hypothesis; that is, we must determine a significance level α, the most common values being 0.10, 0.05, and 0.01.
This is what we define as our margin of error, set through the level α.
Suppose the value computed from the sample falls in the rejection zone. In that case, the value defended by the null hypothesis is unlikely to be the actual population value, and we reject the null hypothesis in favor of the alternative hypothesis.
Type I error
It may happen that, although rejected based on sample data, the null hypothesis is actually true. In that case, we have made a wrong decision. This error is called a Type I Error, and its probability of occurring is the α we choose.
Type II Error
When the value defended by the null hypothesis falls outside the rejection zone, i.e., in the white region of the distribution, we consider that there is no evidence to reject the null hypothesis in favor of the alternative. However, we may also be mistaken here: the alternative hypothesis, although discarded, may in fact be true. This is a Type II Error.
The Type I error is associated with the significance level. Therefore, by increasing or decreasing α, we raise or lower the probability of a Type I error within the hypothesis-testing process.
Example 3: The effectiveness of a particular vaccine after one year is 25% (i.e., the immune effect extends beyond one year in only 25% of the people taking it). A new, more expensive vaccine is developed, and one wishes to know whether it is, in fact, better.
The probability of making a Type I Error is the significance level α. We say that α is the maximum probability with which we are willing to run the risk of a Type I Error; typically α = 5%. The probability of making a Type II Error is called β.
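A small simulation makes the meaning of α concrete: when H₀ is true, a test at α = 0.05 falsely rejects it in roughly 5% of repeated samples. All parameters below are assumed for illustration:

```python
import math
import random
import statistics

# Simulate the Type I error rate of a two-tailed z-test at alpha = 0.05.
random.seed(7)
alpha, rejections, trials = 0.05, 0, 2000
mu0, sigma, n = 78, 10, 30

for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]  # H0 is true here
    z = (statistics.mean(sample) - mu0) / (sigma / math.sqrt(n))
    if abs(z) > 1.96:             # two-tailed rejection at the 5% level
        rejections += 1

print(rejections / trials)        # close to alpha over many trials
```

Lowering α makes false rejections rarer but, all else equal, raises β, the chance of missing a real effect; choosing α is choosing which error is more costly.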
And there we have it. I hope you found this helpful. Thank you for reading.
Leonardo Anello