Important Statistics for Data Science
Suravi Mahanta
Senior Consultant at EY GDS | Ex-Accenture | Microsoft Modern Data Platform Expert | Big Data Specialist | 4X Microsoft Certified | 3X Databricks Certified | AI Developer | Data Architecture
In this blog I'll try to cover most of the statistical measures used in almost all data science projects. So let's start with the very basic statistical measures: measures of central tendency, dispersion, shape of distribution, and dependence. After that I'll explain some important distribution curves, and then we'll move on to different test statistics such as the Z statistic, t statistic, F statistic, and chi-square statistic.
Before we move forward with different statistical tests it is imperative to understand the difference between a sample and a population.
In statistics “population” refers to the total set of observations that can be made. For example, if we want to calculate average height of humans present on the earth, “population” will be the “total number of people actually present on the earth”.
A sample, on the other hand, is a set of data collected/selected from a pre-defined procedure. For our example above, it will be a small group of people selected randomly from some parts of the earth.
For instance, if we select people randomly from all regions (Asia, America, Europe, Africa, etc.) to calculate the average height, our estimate will be close to the actual value and can be taken as a good sample mean, whereas if we make the selection, say, only from the United States, then our average height estimate will not be accurate but would only represent the data of a particular region. Such a sample is called a biased sample and is not representative of the population.
Another important aspect of statistics is the “distribution”. When a population is infinitely large, it is impractical to validate any hypothesis by calculating the mean value or test parameters on the entire population. In such cases, a population is assumed to follow some type of distribution. The most common distributions are the normal, binomial, and Poisson distributions.
So let's start with our first statistical measure:
1. Measure of central tendency (mean, median, and mode)
a. Mean:
The mean is the arithmetic average, and it is probably the measure of central tendency that you are most familiar with. Calculating the mean is very simple: you just add up all of the values and divide by the number of observations.
b. Median:
The median is the middle value. It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal amount of values above it and below it. The method for locating the median varies slightly depending on whether your dataset has an even or odd number of values.
c. Mode:
The mode is the value that occurs the most frequently in your data set. On a bar chart, the mode is the highest bar. If the data have multiple values that are tied for occurring the most frequently, you have a multimodal distribution. If no value repeats, the data do not have a mode.
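The three measures above can be computed with Python's built-in statistics module; the data set here is a made-up example:

```python
import statistics

# Hypothetical data set
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle value of the sorted data (average of the middle two here, since n is even)
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # mean 5, median 4, mode 3
```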
2. Measure of statistical dispersion:
Standard Deviation:
In statistics the standard deviation (SD, also represented by the lower case Greek letter sigma σ for the population standard deviation or the Latin letter s for the sample standard deviation) is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
In addition to expressing the variability of a population, the standard deviation is commonly used to measure confidence in statistical conclusions. This derivation of a standard deviation is often called the "standard error" of the estimate or "standard error of the mean" when referring to a mean.
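A quick sketch of these quantities using Python's standard library and a made-up sample, showing the sample vs. population standard deviation and the standard error of the mean:

```python
import math
import statistics

# Hypothetical sample
sample = [10, 12, 23, 23, 16, 23, 21, 16]

s = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
sigma = statistics.pstdev(sample)   # population standard deviation (divides by n)
sem = s / math.sqrt(len(sample))    # standard error of the mean: s / sqrt(n)

print(round(s, 3), round(sigma, 3), round(sem, 3))
```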
3. Measure of the shape of distribution:
a. Skewness:
Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. Skewness can be quantified to define the extent to which a distribution differs from a normal distribution.
b. Kurtosis:
The degree of tailedness of a distribution is measured by kurtosis. It tells us the extent to which the distribution is more or less outlier-prone (heavier- or lighter-tailed) than the normal distribution. Kurtosis is the fourth standardized moment: kurtosis = E[(X − μ)⁴] / σ⁴ (a normal distribution has kurtosis 3; "excess kurtosis" subtracts 3 from this value).
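Since the standard library has no skewness or kurtosis function, here is a minimal sketch computing both as standardized central moments; the data and function names are my own illustrations:

```python
import statistics

def moment(data, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = statistics.fmean(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    # Third standardized moment: m3 / m2 ** 1.5 (0 for symmetric data)
    return moment(data, 3) / moment(data, 2) ** 1.5

def kurtosis(data):
    # Fourth standardized moment: m4 / m2 ** 2 (3 for a normal distribution)
    return moment(data, 4) / moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric), kurtosis(symmetric))  # 0.0 1.7 — symmetric data has zero skew
```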
4. Measure of statistical dependence:
When more than one statistical variable is involved, we measure their relationship with the correlation coefficient, most commonly the Pearson correlation coefficient.
According to Wikipedia, it is the measure of the linear correlation between two variables X and Y. By the Cauchy–Schwarz inequality it has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences.
There are several types of correlation coefficient formulas.
One of the most commonly used formulas in stats is Pearson's correlation coefficient formula. If you're taking a basic stats class, this is the one you'll probably use: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ).
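Here is a small, self-contained sketch of Pearson's formula in plain Python; the input lists are made up to show the two extremes:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance of x and y over the product of their spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0: perfect positive correlation
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0: perfect negative correlation
```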
So far we have discussed the main statistical measures; now, using these concepts, we will try to understand hypothesis testing, the chi-square test, the Z-test, the t-test, and one-tailed and two-tailed tests.
Test statistics:
A test statistic is a random variable that is calculated from sample data and used in a hypothesis test. You can use test statistics to determine whether to reject the null hypothesis. The test statistic compares your data with what is expected under the null hypothesis. The test statistic is used to calculate the p-value.
For example, the test statistic for a Z-test is the Z-statistic, which has the standard normal distribution under the null hypothesis. Suppose you perform a two-tailed Z-test with an α of 0.05, and obtain a Z-statistic (also called a Z-value) based on your data of 2.5. This Z-value corresponds to a p-value of 0.0124. Because this p-value is less than α, you declare statistical significance and reject the null hypothesis.
P Value:
A test statistic is used in hypothesis testing when you are deciding whether to support or reject the null hypothesis. The p-value is the probability of getting results at least as extreme as yours if the null hypothesis (H0) were true; you compare it with the significance level α, the probability you are willing to allow for a Type I error.
To find the p-value for your test statistic:
Step 1: Look up your test statistic on the appropriate distribution — in this case, on the standard normal (Z-) distribution (see the following Z-table).
Step 2: Find the probability that Z is beyond (more extreme than) your test statistic:
2.1: If Ha contains a less-than alternative, find the probability that Z is less than your test statistic (that is, look up your test statistic on the Z-table and find its corresponding probability). This is the p-value. (Note: In this case, your test statistic is usually negative.)
2.2: If Ha contains a greater-than alternative, find the probability that Z is greater than your test statistic (look up your test statistic on the Z-table, find its corresponding probability, and subtract it from one). The result is your p-value. (Note: In this case, your test statistic is usually positive.)
2.3: If Ha contains a not-equal-to alternative, find the probability that Z is beyond your test statistic and double it. There are two cases:
--If your test statistic is negative, first find the probability that Z is less than your test statistic (look up your test statistic on the Z-table and find its corresponding probability). Then double this probability to get the p-value.
--If your test statistic is positive, first find the probability that Z is greater than your test statistic (look up your test statistic on the Z-table, find its corresponding probability, and subtract it from one). Then double this result to get the p-value.
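The three cases above can be sketched in Python using the standard normal CDF (available via math.erf); the function names here are my own:

```python
import math

def normal_cdf(z):
    """P(Z <= z) for the standard normal distribution, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_value(z, tail):
    """tail is 'less', 'greater', or 'two-sided', matching Ha of <, >, or !=."""
    if tail == "less":
        return normal_cdf(z)              # step 2.1: area to the left
    if tail == "greater":
        return 1 - normal_cdf(z)          # step 2.2: area to the right
    return 2 * (1 - normal_cdf(abs(z)))   # step 2.3: double the outer tail

print(round(p_value(2.5, "two-sided"), 4))  # 0.0124, matching the Z-test example earlier
```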
Let's try to understand test statistics and the p-value by walking through hypothesis testing.
Hypothesis Testing:
Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter.
According to Wikipedia, a statistical hypothesis is a hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables. A statistical hypothesis test (sometimes called confirmatory data analysis) is a method of statistical inference.
Steps to do hypothesis testing:
- There is an initial research hypothesis of which the truth is unknown.
- The first step is to state the relevant null and alternative hypotheses. This is important, as mis-stating the hypotheses will muddy the rest of the process.
- The second step is to consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is equally important as invalid assumptions will mean that the results of the test are invalid.
- Decide which test is appropriate, and state the relevant test statistic (T).
- Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a Student's t distribution or a normal distribution.
- Select a significance level (α), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%.
- The distribution of the test statistic under the null hypothesis partitions the possible values of T into those for which the null hypothesis is rejected—the so-called critical region—and those for which it is not. The probability of the critical region is α.
- Compute from the observations the observed value t(obs) of the test statistic T.
- Decide either to reject the null hypothesis in favor of the alternative or not to reject it. The decision rule is to reject the null hypothesis H0 if the observed value t(obs) is in the critical region, and to accept or "fail to reject" the hypothesis otherwise.
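The steps above can be condensed into a small sketch for the simplest case, a two-tailed one-sample Z-test with known σ; all the numbers in the usage line are made up:

```python
import math

def one_sample_z_test(sample_mean, pop_mean, sigma, n, alpha=0.05):
    """Two-tailed one-sample Z-test with known population sigma."""
    z = (sample_mean - pop_mean) / (sigma / math.sqrt(n))   # observed value of T
    # Two-tailed p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    decision = "reject H0" if p < alpha else "fail to reject H0"
    return z, p, decision

z, p, decision = one_sample_z_test(sample_mean=52, pop_mean=50, sigma=6, n=64)
print(round(z, 2), decision)  # 2.67 reject H0
```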
Alternate Hypothesis Vs Null Hypothesis:
Example 1: It’s an accepted fact that ethanol boils at 173.1°F; you have a theory that ethanol actually has a different boiling point, of over 174°F. The accepted fact (“ethanol boils at 173.1°F”) is the null hypothesis; your theory (“ethanol boils at over 174°F”) is the alternate hypothesis.
Alpha Value/Significance Level:
The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
In hypothesis tests, two errors are possible: Type I and Type II.
Type I error: Supporting the alternate hypothesis when the null hypothesis is true.
Type II error: Not supporting the alternate hypothesis when the alternate hypothesis is true.
Example: Let’s say that the null hypothesis is that a man is innocent and the alternate hypothesis is that he is guilty. If you convict an innocent man (Type I error), you support the alternate hypothesis (that he is guilty). A Type II error would be letting a guilty man go free.
An alpha level is the probability of a Type I error: rejecting the null hypothesis when it is true. A related term, beta, is the probability of a Type II error: failing to reject the null hypothesis when the alternative is true.
How to calculate the Alpha value?
- There is no direct way to calculate the alpha value; it mostly depends on the business problem and the confidence level required for that business problem.
- To get alpha, subtract your confidence level from 1.
For example, if you want to be 95 percent confident that your analysis is correct, the alpha level (significance level) would be 1 − 0.95 = 5 percent, assuming you have a one-tailed test. For two-tailed tests, divide the alpha level by 2; in this example, each tail would get 0.05/2 = 2.5 percent.
- Again, the choice between a one-tailed and a two-tailed test depends on the business problem.
One tailed Vs Two tailed Hypothesis testing:
Example question 1: A government official claims that the dropout rate for local schools is 25%. Last year, 190 out of 603 students dropped out. Is there enough evidence to reject the government official’s claim?
Example question 2: A government official claims that the dropout rate for local schools is less than 25%. Last year, 190 out of 603 students dropped out. Is there enough evidence to reject the government official’s claim?
Example question 3: A government official claims that the dropout rate for local schools is greater than 25%. Last year, 190 out of 603 students dropped out. Is there enough evidence to reject the government official’s claim?
Steps to perform One-tailed and Two-tailed test:
Step 1: Read the question.
Step 2: Rephrase the claim in the question with an equation.
- In example question #1, Drop out rate = 25%
- In example question #2, Drop out rate < 25%
- In example question #3, Drop out rate > 25%.
Step 3: If step 2 has an equals sign in it, this is a two-tailed test. If it has > or < it is a one-tailed test.
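As a sketch, example question #1 can be worked as a two-tailed one-proportion Z-test (a common choice for a claim about a rate; the specific test is not named in the text):

```python
import math

# Example question #1: the claim is "dropout rate = 25%", so Ha is "not equal to 25%" (two-tailed)
p0 = 0.25
n = 603
p_hat = 190 / n   # observed dropout rate, about 0.315

# One-proportion Z statistic: (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
two_tailed_p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), two_tailed_p < 0.05)  # 3.69 True: enough evidence to reject the claim
```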
So far we have covered the null and alternate hypotheses, the alpha value (significance level), and how to calculate the rejection region. Our next task is to find the sample Z-score.
Z Test:
A z-test is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large.
Running a Z test on your data requires five steps:
- Define the null and alternate hypothesis.
- Choose an alpha value(significance value).
- Find the criteria value of z in a z table.
- Calculate the z test statistic (see below).
- Compare the test statistic to the critical z value and decide whether to support or to reject the null hypothesis.
Ex-
- Null hypothesis: μ = 100
- Alternate hypothesis: μ > 100
- Population mean = 100
- Sample mean = 112
- Population standard deviation = 15
- Sample size (n) = 30
- Alpha (significance level) = 0.05, which means we are 95% confident about our hypothesis

For a one-tailed test at α = 0.05, we need the Z value that leaves 45% of the area between the mean and itself; from the Z-score table this critical value is 1.645, so any Z statistic beyond 1.645 falls in the rejection region.
Now let's find the Z statistic using the formula Z = (sample mean − population mean) / (σ/√n):
Z = (112 − 100) / (15/√30) ≈ 4.38
Since 4.38 > 1.645, the Z statistic falls in the rejection region, so we reject the null hypothesis and can say that the population mean > 100.
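The same calculation in Python, as a sanity check on the worked example:

```python
import math

pop_mean = 100
sample_mean = 112
sigma = 15
n = 30
critical_z = 1.645  # upper-tail critical value for alpha = 0.05, from the Z-table

z = (sample_mean - pop_mean) / (sigma / math.sqrt(n))
print(round(z, 2), z > critical_z)  # 4.38 True: reject the null hypothesis
```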
T-Test:
The T distribution (also called Student’s t distribution) is a family of distributions that look almost identical to the normal distribution curve, only a bit shorter and fatter. The t distribution is used instead of the normal distribution when you have small samples (for more on this, see: t-score vs. z-score). The larger the sample size, the more the t distribution looks like the normal distribution; as the degrees of freedom grow (sample sizes of roughly 30 or more), the two are nearly indistinguishable.
How to find T value?
Step 1: Subtract one from your sample size. This will be your degrees of freedom.
Step 2: Look up the df(Degree of freedom) in the left hand side of the t-distribution table. Locate the column under your alpha level (the alpha level is usually given to you in the question).
Like z-scores, t-scores are also a conversion of individual scores into a standard form. However, t-scores are used when you don’t know the population standard deviation.
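A minimal sketch of computing a one-sample t-statistic and its degrees of freedom; the sample values are made up, and the resulting t would then be compared against the t-table:

```python
import math
import statistics

# Hypothetical sample; the hypothesized population mean is 100
sample = [99, 101, 104, 98, 103, 100, 102, 105]
mu0 = 100

n = len(sample)
df = n - 1  # degrees of freedom: step 1 above
# t uses the sample standard deviation where Z would use the population sigma
t = (statistics.fmean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
print(df, round(t, 2))  # 7 1.73; compare against the t-table row for df = 7
```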
Chi Square Statistics:
Chi-square test in hypothesis testing is used to test the hypothesis about the distribution of observations/frequencies in different categories.
Let’s learn the use of chi-square with an intuitive example.
A research scholar is interested in the relationship between the placement of students in the statistics department of a reputed University and their C.G.P.A (their final assessment score).
He obtains the placement records of the past five years from the placement cell database (at random). He records how many students who got placed fell into each of the following C.G.P.A. categories – 9-10, 8-9, 7-8, 6-7, and below 6.
If there is no relationship between the placement rate and the C.G.P.A., then the placed students should be equally spread across the different C.G.P.A. categories (i.e. there should be similar numbers of placed students in each category).
However, if students having C.G.P.A more than 8 are more likely to get placed, then there would be a large number of placed students in the higher C.G.P.A. categories as compared to the lower C.G.P.A. categories. In this case, the data collected would make up the observed frequencies.
So the question is, are these frequencies being observed by chance or do they follow some pattern?
Here enters the chi-square test! The chi-square test helps us answer the above question by comparing the observed frequencies to the frequencies that we might expect to obtain purely by chance.
Assumptions:
- The χ² test assumes that the data for the study are obtained through random selection, i.e. they are randomly picked from the population
- The categories are mutually exclusive, i.e. each subject fits in only one category. For example, in the placement example above, a student counted in the 8-9 C.G.P.A. category can't also be counted in the 7-8 category
- The data should be in the form of frequencies or counts of a particular category and not in percentages
- The data should not consist of paired samples or groups or we can say the observations should be independent of each other
- When more than 20% of the expected frequencies have a value of less than 5 then Chi-square cannot be used. To tackle this problem: Either one should combine the categories only if it is relevant or obtain more data
Mathematically, the chi-square statistic is χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed frequency and Eᵢ is the expected frequency for category i.
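A sketch of the chi-square computation for the placement example, with hypothetical observed counts:

```python
# Hypothetical observed placements per C.G.P.A. category: 9-10, 8-9, 7-8, 6-7, below 6
observed = [30, 25, 20, 15, 10]
expected = [20, 20, 20, 20, 20]   # equal spread under "no relationship"

# chi2 = sum over categories of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
critical = 9.488  # chi-square table value for df = 4, alpha = 0.05
print(chi2, chi2 > critical)  # 12.5 True: reject "no relationship"
```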
Conclusion
In this article I tried to explain the different statistical measures which I feel are very important for a data scientist. If I left out any important statistic, please comment below.
I hope you like this article.
Through this article, I would also like to thank each and everyone who read, liked, clapped, commented on my articles. This is the sole motivation which encourages me to write articles.
Keep reading and I’ll keep writing.
Reference:
https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/chi-square/
https://towardsdatascience.com/statistical-tests-when-to-use-which-704557554740