Important Statistics for Data Science
Suravi Mahanta
Senior Consultant at EY GDS | Ex-Accenture | Microsoft Modern Data Platform Expert | Big Data Specialist | 4X Microsoft Certified | 3X Databricks Certified | AI Developer | Data Architecture
In this blog I'll try to cover most of the statistical measures used in almost all data science projects. So let's start with the very basic statistical measures: measures of central tendency, dispersion, shape of distribution, and dependence. After that I'll explain some important distribution curves, and then we'll move on to different test statistics such as the Z statistic, t statistic, F statistic, and chi-square statistic.
Before we move forward with different statistical tests it is imperative to understand the difference between a sample and a population.
In statistics “population” refers to the total set of observations that can be made. For example, if we want to calculate average height of humans present on the earth, “population” will be the “total number of people actually present on the earth”.
A sample, on the other hand, is a set of data collected/selected from a pre-defined procedure. For our example above, it will be a small group of people selected randomly from some parts of the earth.
For instance, if we select people randomly from all regions (Asia, America, Europe, Africa, etc.) to calculate the average height, our estimate will be close to the actual value and can be taken as a good sample mean, whereas if we make the selection, say, only from the United States, then our average height estimate will not be accurate but would only represent the data of a particular region. Such a sample is called a biased sample and is not representative of the population.
Another important aspect of statistics is the “distribution”. When a population is infinitely large, it is impractical to validate any hypothesis by calculating the mean value or test parameters on the entire population. In such cases, a population is assumed to follow some type of distribution. The most common distributions are the normal, binomial, and Poisson distributions.
So let's start with our first statistical measure:
1. Measure of central tendency (mean, median, and mode)
a. Mean:
The mean is the arithmetic average, and it is probably the measure of central tendency that you are most familiar with. Calculating the mean is very simple: you just add up all of the values and divide by the number of observations.
b. Median:
The median is the middle value. It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal amount of values above it and below it. The method for locating the median varies slightly depending on whether your dataset has an even or odd number of values.
c. Mode:
The mode is the value that occurs the most frequently in your data set. On a bar chart, the mode is the highest bar. If the data have multiple values that are tied for occurring the most frequently, you have a multimodal distribution. If no value repeats, the data do not have a mode.
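The three measures above can be computed with Python's built-in statistics module; the data set here is a made-up example:

```python
import statistics

# Hypothetical data set
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle value of the sorted data (average of the middle two here, since n is even)
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # mean 5, median 4, mode 3
```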
2. Measure of statistical dispersion:
Standard Deviation:
In statistics the standard deviation (SD, also represented by the lower case Greek letter sigma σ for the population standard deviation or the Latin letter s for the sample standard deviation) is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
In addition to expressing the variability of a population, the standard deviation is commonly used to measure confidence in statistical conclusions. This derivation of a standard deviation is often called the "standard error" of the estimate or "standard error of the mean" when referring to a mean.
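A quick sketch of these quantities using Python's standard library and a made-up sample, showing the sample vs. population standard deviation and the standard error of the mean:

```python
import math
import statistics

# Hypothetical sample
sample = [10, 12, 23, 23, 16, 23, 21, 16]

s = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
sigma = statistics.pstdev(sample)   # population standard deviation (divides by n)
sem = s / math.sqrt(len(sample))    # standard error of the mean: s / sqrt(n)

print(round(s, 3), round(sigma, 3), round(sem, 3))
```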
3. Measure of the shape of distribution:
a. Skewness:
Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. Skewness can be quantified to define the extent to which a distribution differs from a normal distribution.
b. Kurtosis:
The degree of tailedness of a distribution is measured by kurtosis. It tells us the extent to which the distribution is more or less outlier-prone (heavier- or lighter-tailed) than the normal distribution. Kurtosis is the fourth standardized moment: kurtosis = E[(X − μ)⁴] / σ⁴ (a normal distribution has kurtosis 3; "excess kurtosis" subtracts 3 from this value).
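Since the standard library has no skewness or kurtosis function, here is a minimal sketch computing both as standardized central moments; the data and function names are my own illustrations:

```python
import statistics

def moment(data, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = statistics.fmean(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    # Third standardized moment: m3 / m2 ** 1.5 (0 for symmetric data)
    return moment(data, 3) / moment(data, 2) ** 1.5

def kurtosis(data):
    # Fourth standardized moment: m4 / m2 ** 2 (3 for a normal distribution)
    return moment(data, 4) / moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric), kurtosis(symmetric))  # 0.0 1.7 — symmetric data has zero skew
```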
4. Measure of statistical dependence:
When more than one statistical variable is involved, we measure their relationship with the correlation coefficient, most commonly the Pearson correlation coefficient.
According to Wikipedia, it is the measure of the linear correlation between two variables X and Y. By the Cauchy–Schwarz inequality it has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences.
There are several types of correlation coefficient formulas.
One of the most commonly used formulas in stats is Pearson's correlation coefficient formula. If you're taking a basic stats class, this is the one you'll probably use: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ).
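Here is a small, self-contained sketch of Pearson's formula in plain Python; the input lists are made up to show the two extremes:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance of x and y over the product of their spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0: perfect positive correlation
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0: perfect negative correlation
```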
So far we have discussed the main statistical measures; now, using these concepts, we will try to understand hypothesis testing, the chi-square test, the Z-test, the t-test, and one-tailed and two-tailed tests.
Test statistics:
A test statistic is a random variable that is calculated from sample data and used in a hypothesis test. You can use test statistics to determine whether to reject the null hypothesis. The test statistic compares your data with what is expected under the null hypothesis. The test statistic is used to calculate the p-value.
For example, the test statistic for a Z-test is the Z-statistic, which has the standard normal distribution under the null hypothesis. Suppose you perform a two-tailed Z-test with an α of 0.05, and obtain a Z-statistic (also called a Z-value) based on your data of 2.5. This Z-value corresponds to a p-value of 0.0124. Because this p-value is less than α, you declare statistical significance and reject the null hypothesis.
P Value:
A test statistic is used in hypothesis testing when you are deciding whether to support or reject the null hypothesis. The p-value is the probability of getting results at least as extreme as yours if the null hypothesis (H0) were true; you compare it with the significance level α, the probability you are willing to allow for a Type I error.
To find the p-value for your test statistic:
Step 1: Look up your test statistic on the appropriate distribution — in this case, on the standard normal (Z-) distribution (see the following Z-table).
Step 2: Find the probability that Z is beyond (more extreme than) your test statistic:
2.1: If Ha contains a less-than alternative, find the probability that Z is less than your test statistic (that is, look up your test statistic on the Z-table and find its corresponding probability). This is the p-value. (Note: In this case, your test statistic is usually negative.)
2.2: If Ha contains a greater-than alternative, find the probability that Z is greater than your test statistic (look up your test statistic on the Z-table, find its corresponding probability, and subtract it from one). The result is your p-value. (Note: In this case, your test statistic is usually positive.)
2.3: If Ha contains a not-equal-to alternative, find the probability that Z is beyond your test statistic and double it. There are two cases:
--If your test statistic is negative, first find the probability that Z is less than your test statistic (look up your test statistic on the Z-table and find its corresponding probability). Then double this probability to get the p-value.
--If your test statistic is positive, first find the probability that Z is greater than your test statistic (look up your test statistic on the Z-table, find its corresponding probability, and subtract it from one). Then double this result to get the p-value.
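The three cases above can be sketched in Python using the standard normal CDF (available via math.erf); the function names here are my own:

```python
import math

def normal_cdf(z):
    """P(Z <= z) for the standard normal distribution, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_value(z, tail):
    """tail is 'less', 'greater', or 'two-sided', matching Ha of <, >, or !=."""
    if tail == "less":
        return normal_cdf(z)              # step 2.1: area to the left
    if tail == "greater":
        return 1 - normal_cdf(z)          # step 2.2: area to the right
    return 2 * (1 - normal_cdf(abs(z)))   # step 2.3: double the outer tail

print(round(p_value(2.5, "two-sided"), 4))  # 0.0124, matching the Z-test example earlier
```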
Let's try to understand test statistics and the p-value by walking through hypothesis testing.
Hypothesis Testing:
Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter.
According to Wikipedia, a statistical hypothesis is a hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables. A statistical hypothesis test (sometimes called confirmatory data analysis) is a method of statistical inference.
Steps to do hypothesis testing:
- There is an initial research hypothesis of which the truth is unknown.
- The first step is to state the relevant null and alternative hypotheses. This is important, as mis-stating the hypotheses will muddy the rest of the process.
- The second step is to consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is equally important as invalid assumptions will mean that the results of the test are invalid.
- Decide which test is appropriate, and state the relevant test statistic (T).
- Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a Student's t distribution or a normal distribution.
- Select a significance level (α), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%.
- The distribution of the test statistic under the null hypothesis partitions the possible values of T into those for which the null hypothesis is rejected—the so-called critical region—and those for which it is not. The probability of the critical region is α.
- Compute from the observations the observed value t(obs) of the test statistic T.
- Decide either to reject the null hypothesis in favor of the alternative or not to reject it. The decision rule is to reject the null hypothesis H0 if the observed value t(obs) is in the critical region, and to accept or "fail to reject" the hypothesis otherwise.
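The steps above can be condensed into a small sketch for the simplest case, a two-tailed one-sample Z-test with known σ; all the numbers in the usage line are made up:

```python
import math

def one_sample_z_test(sample_mean, pop_mean, sigma, n, alpha=0.05):
    """Two-tailed one-sample Z-test with known population sigma."""
    z = (sample_mean - pop_mean) / (sigma / math.sqrt(n))   # observed value of T
    # Two-tailed p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    decision = "reject H0" if p < alpha else "fail to reject H0"
    return z, p, decision

z, p, decision = one_sample_z_test(sample_mean=52, pop_mean=50, sigma=6, n=64)
print(round(z, 2), decision)  # 2.67 reject H0
```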
Alternate Hypothesis Vs Null Hypothesis:
Example 1: It’s an accepted fact that ethanol boils at 173.1°F; you have a theory that ethanol actually has a different boiling point, of over 174°F. The accepted fact (“ethanol boils at 173.1°F”) is the null hypothesis; your theory (“ethanol boils at over 174°F”) is the alternate hypothesis.
Alpha Value/Significance Level:
The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
In hypothesis tests, two errors are possible: Type I and Type II.
Type I error: Supporting the alternate hypothesis when the null hypothesis is true.
Type II error: Not supporting the alternate hypothesis when the alternate hypothesis is true.
Example: Let’s say that the null hypothesis is that a man is innocent and the alternate hypothesis is that he is guilty. If you convict an innocent man (Type I error), you support the alternate hypothesis (that he is guilty). A Type II error would be letting a guilty man go free.
An alpha level is the probability of a Type I error: rejecting the null hypothesis when it is true. A related term, beta, is the probability of a Type II error: failing to reject the null hypothesis when the alternative is true.
How to calculate the Alpha value?
- There is no direct way to calculate the alpha value; it mostly depends on the business problem and the confidence level required for that business problem.
- To get alpha, subtract your confidence level from 1.
For example, if you want to be 95 percent confident that your analysis is correct, the alpha level (significance level) would be 1 − 0.95 = 5 percent, assuming you have a one-tailed test. For two-tailed tests, divide the alpha level by 2; in this example, each tail would get 0.05/2 = 2.5 percent.
- Again, the choice between a one-tailed and a two-tailed test depends on the business problem.
One tailed Vs Two tailed Hypothesis testing:
Example question 1: A government official claims that the dropout rate for local schools is 25%. Last year, 190 out of 603 students dropped out. Is there enough evidence to reject the government official’s claim?
Example question 2: A government official claims that the dropout rate for local schools is less than 25%. Last year, 190 out of 603 students dropped out. Is there enough evidence to reject the government official’s claim?
Example question 3: A government official claims that the dropout rate for local schools is greater than 25%. Last year, 190 out of 603 students dropped out. Is there enough evidence to reject the government official’s claim?
Steps to perform One-tailed and Two-tailed test:
Step 1: Read the question.
Step 2: Rephrase the claim in the question with an equation.
- In example question #1, Drop out rate = 25%
- In example question #2, Drop out rate < 25%
- In example question #3, Drop out rate > 25%.
Step 3: If step 2 has an equals sign in it, this is a two-tailed test. If it has > or < it is a one-tailed test.
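As a sketch, example question #1 can be worked as a two-tailed one-proportion Z-test (a common choice for a claim about a rate; the specific test is not named in the text):

```python
import math

# Example question #1: the claim is "dropout rate = 25%", so Ha is "not equal to 25%" (two-tailed)
p0 = 0.25
n = 603
p_hat = 190 / n   # observed dropout rate, about 0.315

# One-proportion Z statistic: (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
two_tailed_p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), two_tailed_p < 0.05)  # 3.69 True: enough evidence to reject the claim
```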
So far we have covered the null and alternate hypotheses, the alpha value (significance level), and how to calculate the rejection region. Our next task is to find the sample Z-score.
Z Test:
A z-test is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large.
Running a Z test on your data requires five steps:
- Define the null and alternate hypothesis.
- Choose an alpha value(significance value).
- Find the criteria value of z in a z table.
- Calculate the z test statistic (see below).
- Compare the test statistic to the critical z value and decide whether to support or to reject the null hypothesis.
Ex-
- Null hypothesis: μ = 100
- Alternate hypothesis: μ > 100
- Population mean = 100
- Sample mean = 112
- Population standard deviation = 15
- Sample size (n) = 30
- Alpha (significance level) = 0.05, which means we are 95% confident about our hypothesis

For a one-tailed test at α = 0.05, we need the Z value that leaves 45% of the area between the mean and itself; from the Z-score table this critical value is 1.645, so any Z statistic beyond 1.645 falls in the rejection region.
Now let's find the Z statistic using the formula Z = (sample mean − population mean) / (σ/√n):
Z = (112 − 100) / (15/√30) ≈ 4.38
Since 4.38 > 1.645, the Z statistic falls in the rejection region, so we reject the null hypothesis and can say that the population mean > 100.
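The same calculation in Python, as a sanity check on the worked example:

```python
import math

pop_mean = 100
sample_mean = 112
sigma = 15
n = 30
critical_z = 1.645  # upper-tail critical value for alpha = 0.05, from the Z-table

z = (sample_mean - pop_mean) / (sigma / math.sqrt(n))
print(round(z, 2), z > critical_z)  # 4.38 True: reject the null hypothesis
```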
T-Test:
The T distribution (also called Student’s t distribution) is a family of distributions that look almost identical to the normal distribution curve, only a bit shorter and fatter. The t distribution is used instead of the normal distribution when you have small samples (for more on this, see: t-score vs. z-score). The larger the sample size, the more the t distribution looks like the normal distribution; as the degrees of freedom grow (sample sizes of roughly 30 or more), the two are nearly indistinguishable.
How to find T value?
Step 1: Subtract one from your sample size. This will be your degrees of freedom.
Step 2: Look up the df(Degree of freedom) in the left hand side of the t-distribution table. Locate the column under your alpha level (the alpha level is usually given to you in the question).
Like z-scores, t-scores are also a conversion of individual scores into a standard form. However, t-scores are used when you don’t know the population standard deviation.
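A minimal sketch of computing a one-sample t-statistic and its degrees of freedom; the sample values are made up, and the resulting t would then be compared against the t-table:

```python
import math
import statistics

# Hypothetical sample; the hypothesized population mean is 100
sample = [99, 101, 104, 98, 103, 100, 102, 105]
mu0 = 100

n = len(sample)
df = n - 1  # degrees of freedom: step 1 above
# t uses the sample standard deviation where Z would use the population sigma
t = (statistics.fmean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
print(df, round(t, 2))  # 7 1.73; compare against the t-table row for df = 7
```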
Chi Square Statistics:
Chi-square test in hypothesis testing is used to test the hypothesis about the distribution of observations/frequencies in different categories.
Let’s learn the use of chi-square with an intuitive example.
A research scholar is interested in the relationship between the placement of students in the statistics department of a reputed University and their C.G.P.A (their final assessment score).
He obtains the placement records of the past five years from the placement cell database (at random). He records how many students who got placed fell into each of the following C.G.P.A. categories – 9-10, 8-9, 7-8, 6-7, and below 6.
If there is no relationship between the placement rate and the C.G.P.A., then the placed students should be equally spread across the different C.G.P.A. categories (i.e. there should be similar numbers of placed students in each category).
However, if students having C.G.P.A more than 8 are more likely to get placed, then there would be a large number of placed students in the higher C.G.P.A. categories as compared to the lower C.G.P.A. categories. In this case, the data collected would make up the observed frequencies.
So the question is, are these frequencies being observed by chance or do they follow some pattern?
Here enters the chi-square test! The chi-square test helps us answer the above question by comparing the observed frequencies to the frequencies that we might expect to obtain purely by chance.
Assumptions:
- The χ² test assumes that the data for the study are obtained through random selection, i.e. they are randomly picked from the population
- The categories are mutually exclusive, i.e. each subject fits in only one category. For example, in the placement example above, a student counted in the 8-9 C.G.P.A. category can't also be counted in the 7-8 category
- The data should be in the form of frequencies or counts of a particular category and not in percentages
- The data should not consist of paired samples or groups or we can say the observations should be independent of each other
- When more than 20% of the expected frequencies have a value of less than 5 then Chi-square cannot be used. To tackle this problem: Either one should combine the categories only if it is relevant or obtain more data
Mathematically, the chi-square statistic is χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed frequency and Eᵢ is the expected frequency for category i.
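A sketch of the chi-square computation for the placement example, with hypothetical observed counts:

```python
# Hypothetical observed placements per C.G.P.A. category: 9-10, 8-9, 7-8, 6-7, below 6
observed = [30, 25, 20, 15, 10]
expected = [20, 20, 20, 20, 20]   # equal spread under "no relationship"

# chi2 = sum over categories of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
critical = 9.488  # chi-square table value for df = 4, alpha = 0.05
print(chi2, chi2 > critical)  # 12.5 True: reject "no relationship"
```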
Conclusion
In this article I tried to explain the different statistical measures which I feel are very important for a data scientist. If I left out any important statistic, please comment below.
I hope you like this article.
Through this article, I would also like to thank each and everyone who read, liked, clapped, commented on my articles. This is the sole motivation which encourages me to write articles.
Keep reading and I’ll keep writing.
Reference:
https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/chi-square/
https://towardsdatascience.com/statistical-tests-when-to-use-which-704557554740