What is Statistical Inference?

We have already briefly addressed Descriptive Statistics, which describes how data are organized, and Probability Theory, which measures the variability of everyday phenomena according to their occurrence. From now on, we will study the third central area of statistics: Inferential Statistics.

Inference in a Nutshell

In Inferential Statistics, we have a population — a complete set of data. We apply a sampling technique to collect a sample of this population, analyze this sample and make inferences about the population.

Inferential statistics aims to extrapolate the results obtained with descriptive statistics to the population. For this, we must apply a sampling technique that extracts a representative sample; otherwise, we will reach inferences that do not represent reality.

From this sample, we calculate the mean, median, mode, and several other statistics, and then, applying a series of inferential techniques, we make inferences about the population.

The concept of sampling is crucial in Data Science. In practice, we work with samples all the time. During the preprocessing phase, we prepare samples and then train the Machine Learning model — using even the same sampling techniques that some companies use in election surveys, but now applied in Data Science.
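As a minimal sketch of that idea, the split below prepares training and test sets with simple random sampling; the tiny labeled dataset is hypothetical, made up only for illustration:

```python
import random

# Hypothetical dataset: 10 labeled observations (feature, label)
data = [(i, i % 2) for i in range(10)]

random.seed(42)       # reproducibility
random.shuffle(data)  # simple random sampling: every ordering is equally likely

# 80/20 split into training and test samples
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

print(len(train), len(test))  # 8 2
```

The shuffle before slicing is what makes both subsets representative samples of the same population.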

Population and Sample

To start our discussion on inferential statistics, we have to distinguish between population and sample clearly.

  • Population: the set of all elements or results under investigation.
  • Sample: any subset of the population. In the vast majority of cases, we work only with samples.

Let’s assume a scenario: we were invited to conduct a survey to measure the durability of the lamps produced by a particular factory. What approach would we use?

  1. Test all the lamps produced by that factory
  2. Obtain a representative sample of the lamp population and infer the durability of all lamps produced.

It is not hard to conclude that it is not feasible to test, daily, every lamp the company manufactures. Therefore, what can be done is to collect a representative sample and, from this sample, make inferences about the population of lamps produced.

When we work with sampling, we have an expected error rate; we can even calculate this error rate and build confidence intervals around our estimates. After all, as representative as the sample may be, it is not the population itself.
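A minimal sketch of that calculation, using hypothetical lamp lifetimes and the normal approximation for a 95% confidence interval:

```python
import math
import statistics

# Hypothetical sample of lamp lifetimes (hours)
sample = [1180, 1220, 1195, 1210, 1205, 1190, 1215, 1185]

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # sample standard deviation
se = sd / math.sqrt(n)          # standard error of the mean

# 95% CI using the normal approximation (z = 1.96);
# with a sample this small, a t critical value would be more precise
z = 1.96
lower, upper = mean - z * se, mean + z * se
print(f"mean={mean:.1f} h, 95% CI=({lower:.1f}, {upper:.1f})")
```

The interval quantifies exactly the "expected error rate" mentioned above: we do not claim the population mean is the sample mean, only that it plausibly lies in this range.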

Principles of Sampling

Inferential statistics provides us with tools to make inferences and estimate characteristics of the entire population through the analysis of a sample. The concept of sampling is quite simple: to know whether the pie is delicious, you taste a slice (an inference about the whole pie).

While a census (a costly undertaking) involves examining all elements of a given group, sampling consists of studying only a representative part of those elements.

The sampling theory studies the relationships between a population and the samples extracted from that population. Sampling is beneficial for evaluating unknown population quantities:

  • Voting intention poll for elections
  • Audience calculation of television programs

Or to determine whether the differences observed between two samples are due to chance or if they are significant.

Sampling Types

We will now see the techniques and procedures used to collect representative samples of a population. There are two main families of sampling methods: random and non-random.

  • Non-random methods: purposive sampling, snowball, quota, and convenience sampling. These methods are little used; ideally, we use random methods, which do not interfere with the study results, especially in Machine Learning.
  • Random methods: simple, systematic, stratified, cluster, and multi-stage random sampling.

Probabilistic or Random Sampling: in this type of sampling, the samples are obtained randomly; that is, every population unit has the same probability of being chosen. All we want is to select the sample components purely at random. We have three variants of simple random sampling.

  1. Simple random sampling: the most commonly used method, for example when separating training and test data.
  2. Simple random sampling without replacement: the population elements are numbered from 1 to n. Then, with equal probability, we draw one of the n observations; the drawn observation does not return to the “pot” before the next draw, so no element can appear twice in the sample.
  3. Simple random sampling with replacement: the opposite. When the first element is drawn, it goes to the sample and then returns to the “pot” before the second draw, so the same element may be selected more than once. Bootstrapping, for instance, samples with replacement.
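The difference between the two variants can be sketched with Python's standard library, where `random.sample` draws without replacement and `random.choices` draws with replacement:

```python
import random

population = list(range(1, 11))  # elements numbered 1..10
random.seed(0)

# Without replacement: a drawn element does NOT return to the "pot",
# so the sample can never contain duplicates
without = random.sample(population, 5)

# With replacement: each draw returns the element to the "pot",
# so the same element may be drawn more than once (bootstrap-style)
with_repl = random.choices(population, k=5)

print(without, with_repl)
```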

Systematic Sampling: used when population elements are ordered and removed periodically. An example would be a production line where a piece is removed at a fixed interval of items, as in quality control.
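A minimal sketch of that production-line idea: pick a random starting point, then take every k-th piece. The helper name and the 100-piece line are hypothetical, chosen only for illustration:

```python
import random

def systematic_sample(population, n):
    """Take every k-th element after a random start (k = N // n)."""
    k = len(population) // n         # sampling interval
    start = random.randrange(k)      # random starting point in [0, k)
    return [population[start + i * k] for i in range(n)]

production_line = list(range(1, 101))  # 100 pieces, in production order
random.seed(1)
print(systematic_sample(production_line, 10))
```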

Stratified Sampling: a heterogeneous population is divided into homogeneous subpopulations (strata), and a sample is taken from each stratum; that is, we first partition the population and then draw from each partition to make up the sample.
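A sketch of stratified sampling with proportional allocation; the two-stratum population below is hypothetical:

```python
import random
from collections import defaultdict

# Hypothetical population: (person_id, stratum) pairs, 60% in "A", 40% in "B"
population = [(i, "A" if i < 60 else "B") for i in range(100)]

# Group the heterogeneous population into homogeneous strata
strata = defaultdict(list)
for unit in population:
    strata[unit[1]].append(unit)

# Proportional allocation: sample each stratum at the same 10% rate
random.seed(2)
sample = []
for name, units in strata.items():
    sample.extend(random.sample(units, k=len(units) // 10))

print(len(sample))  # 6 from A + 4 from B = 10
```

Because each stratum is sampled at the same rate, the sample preserves the population's 60/40 composition.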

Cluster (Conglomerate) Sampling: a concept almost inverse to stratified sampling. In stratified sampling, we divide the population into several groups and take units from each group; here, we divide the population into groups and keep entire groups.
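The contrast with stratified sampling shows up clearly in code: instead of drawing units from every group, we draw whole groups. The 10-cluster population is hypothetical:

```python
import random

# Hypothetical population divided into 10 natural clusters (e.g., schools),
# each with 20 members
clusters = {c: [f"{c}-{i}" for i in range(20)] for c in range(10)}

# Cluster sampling: randomly pick whole clusters and keep ALL their members
random.seed(3)
chosen = random.sample(list(clusters), k=3)
sample = [member for c in chosen for member in clusters[c]]

print(len(sample))  # 3 clusters x 20 members = 60
```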

Hypothesis Tests

We will now address one of the main tools of Inferential Statistics: hypothesis tests. One of the central problems statistical inference must solve is testing hypotheses. A statistical hypothesis is a claim, which may be true or false, about a given population parameter, such as the mean, standard deviation, or correlation coefficient.

For a statistical hypothesis to be validated or rejected, it would be necessary to examine the entire population, which in practice is unfeasible. Alternatively, we extract a random sample of the population of interest and decide based on this sample, so some errors may occur:

  • rejecting a hypothesis when it is true
  • failing to reject a hypothesis when it is false

We want to use a hypothesis test to validate a population parameter through a random sample — from the hypothesis test, we infer the population parameter.

Decisions

Therefore, a statistical hypothesis test is a procedure that allows us to decide between H° (the null hypothesis) and Ha (the alternative hypothesis) based on the information contained in the sample.

H°

The null hypothesis states that a population parameter (such as mean, standard deviation, and so on) is equal to a hypothetical value. Thus, the null hypothesis is often an initial claim based on previous analyses.

Ha

The alternative hypothesis states that a population parameter is smaller, larger, or simply different from the hypothetical value in the null hypothesis. The alternative hypothesis is what we suspect may be true and want to prove.

Because we are analyzing sample data and not population data, errors can occur:

  • Type I error: the probability of rejecting the null hypothesis when it is actually true.
  • Type II error: the probability of failing to reject the null hypothesis when it is actually false (that is, when the alternative hypothesis is true).

One of the secrets behind the hypothesis test is the correct definition of what H° is and what Ha is; the Data Scientist is responsible for defining the null and alternative hypotheses. We have to understand the business problem and identify both hypotheses from it. An incorrect definition can compromise the entire process.

Example: A researcher has exam results for a sample of students who took a preparatory course for a national exam. The researcher wants to know if these students scored above the national average of 78.

Based on this business need, the researcher wants to check the test results of the sample of students who took the course and verify whether or not their average is above the national average.

An alternative hypothesis with a greater-than sign is used because the researcher specifically hypothesizes that the scores of students who took the course are higher than the national average.

  • H°: population mean is equal to 78 — a statement we have today
  • Ha: population average is greater than 78 — what we want to prove

We want to establish, as the alternative hypothesis, that students who completed the course score, on average, higher than 78.
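That one-tailed test can be sketched as follows. The scores are hypothetical, and a normal approximation (z statistic) is used for simplicity; with a sample this small, a t-test would be more rigorous:

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical scores for students who took the prep course
scores = [82, 79, 85, 88, 76, 81, 84, 90, 78, 83, 86, 80]

mu0 = 78                      # H0: population mean equals the national average
n = len(scores)
xbar = mean(scores)
se = stdev(scores) / math.sqrt(n)

z = (xbar - mu0) / se         # test statistic (normal approximation)
p = 1 - NormalDist().cdf(z)   # one-tailed: Ha says the mean is GREATER than 78

alpha = 0.05
print(f"z={z:.2f}, p={p:.5f}, reject H0: {p < alpha}")
```

A small p-value here would mean the sample mean falls deep in the upper tail, so we would reject H° in favor of Ha.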

Hypothesis Testing Pipeline

Here we have a sequence of steps used in a hypothesis test, commonly used in Digital Marketing and A/B Testing:

  1. Formulate the null and alternative hypotheses (a problem of interpretation);
  2. Collect a sample of size n and calculate the sample mean;
  3. Plot the sample mean on the x-axis of the sampling distribution;
  4. Set a significance level α based on the severity of a Type I error;
  5. Calculate the test statistic, critical values, and critical region;
  6. If the sample mean falls in the white area of the chart, we do NOT reject the null hypothesis;
  7. If the sample mean falls into one of the tails, we REJECT the null hypothesis.
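Steps 4 through 7 of the pipeline can be sketched as a small z-test helper (the function name and the A/B-style numbers at the end are hypothetical; a known population standard deviation is assumed):

```python
import math
from statistics import NormalDist

def z_test(sample_mean, pop_mean, pop_sd, n, alpha=0.05, two_tailed=True):
    """Steps 4-7 of the pipeline: critical region and decision (z-test sketch)."""
    se = pop_sd / math.sqrt(n)
    z = (sample_mean - pop_mean) / se              # step 5: test statistic
    if two_tailed:
        crit = NormalDist().inv_cdf(1 - alpha / 2) # two rejection tails
        reject = abs(z) > crit
    else:
        crit = NormalDist().inv_cdf(1 - alpha)     # one rejection tail
        reject = z > crit
    return z, crit, reject

# Hypothetical A/B-test style check: sample of 50 with observed mean 1.95,
# against H0 mu = 1.8 and a known population sd of 0.5
print(z_test(1.95, 1.8, 0.5, 50))
```

If `reject` comes back `True`, the sample mean landed in one of the tails (step 7); otherwise it stayed in the white area (step 6).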

Unilateral Hypothesis Test

The unilateral or one-tailed test is used when the alternative hypothesis is expressed with < or >. We keep the null hypothesis as an equality and define the alternative hypothesis with a greater-than or a less-than sign.


We have the null hypothesis that the average is 1.8; in the box on the left, we have Ha with an average value > 1.8, and in the box on the right, Ha with an average value < 1.8.

Bringing this into a graphical translation, it is like having a chart with a normal distribution centered on the hypothesized mean. If the sample mean falls in the white area, we do not reject H°; if it falls in the yellow region, we reject H°.

Note that we will be on one side or the other of the tail of the normal distribution. Because of this, the test is called one-sided or one-tailed.

If the mean is within the white region of the chart, we do not reject the null hypothesis we already have. Otherwise, we reject it.

Example: A school has a group of students (the population) considered obese. The probability distribution of the weight of students aged 12 to 17 is normal, with a mean of 80 kg and a standard deviation of 10 kg. The school principal proposes a treatment campaign to combat obesity. The doctor states that the result of the treatment will be presented after a set period and that the students' weights will have decreased by then.

  • H°: μ = 80, the status quo, i.e., the current reality
  • Ha: μ < 80, the different reality we want to prove

Bilateral Hypothesis Test

Just as we can apply a one-sided hypothesis test, when the alternative hypothesis claims the parameter is higher or lower than the hypothesized value, we also have the option of performing a bilateral test. The bilateral test is used whenever the alternative hypothesis is expressed with ≠.

Now we don’t want to know if it’s bigger or smaller; we want to know if the alternative hypothesis is simply different.


The curve above represents the sampling distribution of the average broadband utilization. The population mean is assumed to be 1.8 GB, according to the null hypothesis H°: μ = 1.8. Because there are two yellow rejection regions in the graph, this is called a bilateral or two-tailed hypothesis test: the alternative hypothesis is expressed with ≠.

Example 2: A cookie factory packs boxes weighing 500 grams, and the weight is monitored periodically. The quality department has established that the weight must be kept at 500 grams. What is the condition for the quality department to stop production?

  • H°: μ = 500 g, the status quo, i.e., the current reality
  • Ha: μ ≠ 500 g, the different reality we want to prove
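A sketch of that two-tailed check on a hypothetical batch of box weights, again using the normal approximation for simplicity:

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical weights (grams) from a periodic quality check
boxes = [498, 502, 497, 503, 499, 501, 496, 505, 500, 498]

mu0 = 500                                 # H0: the process is on target
n = len(boxes)
se = stdev(boxes) / math.sqrt(n)
z = (mean(boxes) - mu0) / se

# Two-tailed: production stops if the mean weight differs from 500 g
# in EITHER direction, so we double the one-tail probability
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z={z:.2f}, p={p:.3f}, stop production: {p < 0.05}")
```

Here the sample mean sits very close to 500 g, so the test gives no reason to stop the line.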

Error Type I and Type II

The purpose of the Hypothesis Test is to verify the validity of a statement about a population parameter based on sampling. However, as we take the sample as a basis, we are exposed to the risk of wrong conclusions about the population due to sampling errors.

Significance Level α and Confidence Level 1 − α

To test H°, it is necessary to define a decision rule that establishes a rejection zone for the hypothesis, that is, to determine a significance level α, the most common values being 0.10, 0.05, and 0.01:


That is what we are defining as our margin of error. Again, we define this through the level of α.

Suppose the population parameter value, defended by the null hypothesis, falls in the rejection zone. In that case, this value is doubtful to be the actual value of the population, and we will reject the null hypothesis to the detriment of the alternative hypothesis.


Type I error

It may happen that, although rejected based on sample data, the null hypothesis is actually true. In that case, we would be making a wrong decision. This error is called a Type I Error, and its probability of occurring is the significance level α we choose.

Type II Error

When the value defended by the null hypothesis falls outside the rejection zone, i.e., in the white region of the distribution, we consider that there is no evidence to reject the null hypothesis. However, we can also be mistaken here: the alternative hypothesis, although discarded, may in fact be true. This is the Type II Error.


The Type I error is associated with the significance level. Therefore, by increasing or decreasing the significance level, we control the probability of a Type I error within the hypothesis testing process.

Example 3: The effectiveness of a particular vaccine after one year is 25% (i.e., the immune effect extends beyond one year in only 25% of the people who take it). A new, more expensive vaccine is developed, and one wishes to know whether it is, in fact, better.

  • H°: p = 0.25, the status quo, i.e., the current reality
  • Ha: p > 0.25, the different reality we want to prove
  • Type I error: approve the vaccine when, in reality, it has no effect more significant than that of the vaccine in use.
  • Type II error: reject the new vaccine when it is, in fact, better than the vaccine in use

We say that the significance level α is the maximum probability with which we are willing to run the risk of a Type I Error, typically α = 5%. The probability of making a Type II Error is called β, and it depends on the true value of the population parameter.
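The meaning of α can be sketched with a simulation: if we repeat the vaccine-style test many times in a world where H° is true, we should wrongly reject it about α of the time. This is a simplified simulation that treats the measured effectiveness as approximately normal, with made-up numbers:

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(4)
alpha = 0.05
crit = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value

# Simulate 2000 experiments where H0 is TRUE (the true mean really is mu0)
mu0, sd, n = 0.25, 0.05, 40
trials = 2000
rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sd) for _ in range(n)]
    z = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
    if z > crit:            # wrongly "approve the new vaccine": a Type I error
        rejections += 1

print(f"Type I error rate ~ {rejections / trials:.3f} (alpha = {alpha})")
```

The observed rejection rate hovers near 0.05, illustrating that α really is the long-run probability of a Type I Error.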

And there we have it. I hope you have found this helpful. Thank you for reading.

Leonardo Anello

