Data Analytics is all about Statistics

Abhi Sharma

Cloud Engineer @ Google | Ex-Founder | Building on Entrepreneurial Foundations

Published: June 6, 2023

Data Analytics is a term which is being used widely these days. Almost everyone is either doing data analytics or using results obtained from analytics in their day-to-day life. For example, you check the weather forecast for the next day, use Google Maps to check the time it takes to travel from point A to point B, check the projected run rate during a cricket match or even forecast the next month's sales for your business. In this article, we will touch on some very fundamental concepts and then work through a few advanced ones, starting with probability distributions. Confidence intervals and hypothesis testing follow in the next blog.

Population and Sample

If the entire human race is a population, the people living in one US state can be treated as a sample. A population is the set of all the data points available, whereas a sample is a subset of the population.

Random Sample - A random sample is a subset of the population which is selected at random, without any bias. There are two things to consider when drawing a sample.

Randomness - Every member of the population should have an equal chance of being selected
Representativeness - The sample should reflect the characteristics of the entire population
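To make the idea concrete, here is a minimal Python sketch of drawing a simple random sample. The population is synthetic (10,000 made-up heights in cm) and the seed is fixed only so that the run is repeatable; both are assumptions for illustration, not data from this article.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the example is repeatable

# Hypothetical population: 10,000 synthetic heights in cm (made-up data).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Simple random sample without replacement: every member is equally likely to be picked.
sample = rng.sample(population, k=200)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population size: {len(population)}, mean: {population_mean:.2f}")
print(f"sample size:     {len(sample)}, mean: {sample_mean:.2f}")
```

With an unbiased random draw, the sample mean should land close to the population mean even though the sample holds only 2% of the data.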

Measures of Central Tendency

Measures of Central Tendency are values which summarise the centre of the distribution of all the data points, namely the Mean, Median and Mode.

Most of us have already studied these concepts in school. The mean is the average of all the values in the dataset, the median is the middle value when the values are arranged in order, and the mode is the most frequent value.

These concepts were essential for passing a primary-school mathematics exam, but on their own they are not very useful in real life. The mean alone does not establish much in statistics, so I discourage quoting averages without giving more context. For example, suppose the average temperature of a city over the year is 32 °C. Can we assume that the city is hot all year round, or that it has no winter or snow? Of course not!

Mean is just a summarisation of the entire set of values but it does not tell us about the distribution of data until we introduce standard deviation and other metrics. We will see how mean can be made a useful measure once we cover more advanced topics.
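A small sketch of the temperature example: two hypothetical cities whose monthly averages (made-up numbers, in °C) share the same yearly mean of 32 but describe very different climates. The range (maximum minus minimum) is used here only as the simplest possible measure of spread.

```python
import statistics

# Hypothetical monthly average temperatures in °C (made-up numbers for illustration).
city_a = [30, 31, 32, 33, 34, 33, 32, 32, 32, 32, 31, 32]
city_b = [5, 12, 25, 38, 45, 46, 44, 42, 40, 35, 30, 22]

mean_a = statistics.fmean(city_a)
mean_b = statistics.fmean(city_b)

# The range is a crude but simple view of how far the values spread.
range_a = max(city_a) - min(city_a)
range_b = max(city_b) - min(city_b)

print(f"City A: mean {mean_a:.1f}, min {min(city_a)}, max {max(city_a)}, range {range_a}")
print(f"City B: mean {mean_b:.1f}, min {min(city_b)}, max {max(city_b)}, range {range_b}")
```

Both cities report the same average, yet only the second one has a cold season, which is exactly the information the mean hides.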

Variance and Standard Deviation

In simple terms, variance is a measure of the spread of the data points around the mean. The formula for variance, as described in various textbooks, is as follows

Population variance: σ² = Σ (x_i - μ)² / N, where μ is the population mean and N is the number of data points

Variance is calculated by squaring the difference between each data point and the mean, adding the squares up and dividing by the number of data points. Because variance is expressed in squared units, which are hard to interpret, we have another measure called standard deviation. Standard deviation is just the square root of variance.

In real life we rarely have data for the whole population; we almost always work with sample data. Therefore, we need the variance and standard deviation of the sample.

Note - It is advisable to always treat your dataset as a sample. Even for complex machine learning models, we assume that we are dealing with sample data only.

Notice that the denominator for sample variance and standard deviation is N-1 rather than N. This corrects a bias that arises when variance is calculated from a finite sample: the deviations are measured from the sample mean rather than the true population mean, so dividing by N tends to underestimate the population variance. To compensate for this bias, Bessel's correction of N-1 is used, which on average gives a value closer to the actual population variance.
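The sketch below computes both versions of variance by hand on a few illustrative values and then runs a small simulation to show why dividing by N-1 helps. The data, the sample size of 5, the number of trials and the simulated population (normal with standard deviation 2, so true variance 4) are all assumptions chosen for illustration.

```python
import random
import statistics

data = [4, 8, 6, 5, 3, 7]  # illustrative values
n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_deviations) / n    # divide by N
sample_variance = sum(squared_deviations) / (n - 1)  # divide by N-1 (Bessel's correction)
sample_std_dev = sample_variance ** 0.5

print(f"by hand:    population {population_variance:.4f}, sample {sample_variance:.4f}")
print(f"statistics: population {statistics.pvariance(data):.4f}, sample {statistics.variance(data):.4f}")

# Simulation: many small samples from a population with known variance 4.
rng = random.Random(0)
sample_size, trials = 5, 20_000
total_divide_by_n = 0.0
total_divide_by_n_minus_1 = 0.0
for _ in range(trials):
    draw = [rng.gauss(0, 2) for _ in range(sample_size)]
    draw_mean = sum(draw) / sample_size
    sum_sq = sum((v - draw_mean) ** 2 for v in draw)
    total_divide_by_n += sum_sq / sample_size
    total_divide_by_n_minus_1 += sum_sq / (sample_size - 1)

avg_biased = total_divide_by_n / trials
avg_corrected = total_divide_by_n_minus_1 / trials
print(f"average estimate dividing by N:   {avg_biased:.3f}")
print(f"average estimate dividing by N-1: {avg_corrected:.3f}")
```

Averaged over many samples, the N version settles noticeably below the true variance of 4, while the N-1 version settles close to it.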

Distribution of Data

When we talk about a distribution, we usually mean the probability distribution of the data. A probability distribution is described by a PMF (probability mass function) or a PDF (probability density function). A PMF is for discrete random variables whereas a PDF is for continuous random variables. Now, we are going to go through some very important distributions that we generally find in data analytics problems.

Binomial Distribution

Binomial Distribution is one of the most important distributions because of its applications in real life. A random variable has a binomial distribution if it counts the successes in a fixed number of independent trials with the following properties.

Each trial can have only two outcomes - success or failure
Probability of success is p and probability of failure is 1-p
Probability of success remains the same in all the trials

PMF = f(r) = nCr * p^r * (1-p)^(n-r), where n is the number of trials and r is the number of successes

If the number of trials in a binomial distribution is large, it can be approximated by a normal distribution with mean n*p and variance n*p*(1-p).
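The PMF above translates almost directly into Python. The sketch below is a minimal version using only the standard library; the coin-toss numbers are illustrative, and the last part compares the exact probability with the normal approximation for 100 trials as a rough sanity check.

```python
import math
from statistics import NormalDist

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

# Example: probability of exactly 3 heads in 10 tosses of a fair coin.
p_three_heads = binomial_pmf(3, 10, 0.5)

# The probabilities over every possible number of successes add up to 1.
total_probability = sum(binomial_pmf(r, 10, 0.5) for r in range(11))

# Normal approximation for many trials: mean n*p, standard deviation sqrt(n*p*(1-p)).
n, p = 100, 0.5
approximation = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
exact_at_50 = binomial_pmf(50, n, p)
approx_at_50 = approximation.pdf(50)

print(f"P(3 heads in 10 tosses) = {p_three_heads:.4f}")
print(f"sum of all probabilities = {total_probability:.4f}")
print(f"P(50 successes in 100): exact {exact_at_50:.4f}, normal approximation {approx_at_50:.4f}")
```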

Poisson Distribution

Poisson distribution helps us find the probability of a given number of occurrences of an event over a period of time (or any other fixed interval), given the average rate of occurrence. For example, if a machine produces 5 defects per hour on average, we can calculate the probability of the machine producing 17 defects in 3 hours.

PMF = f(x) = (λ^x * e^(-λ)) / x!

λ = average number of occurrences over the time interval

x = value of random variable (number of occurrences)

e = Euler's number ≈ 2.718

A binomial distribution can be approximated by a Poisson distribution with λ = n*p when n (number of trials) is large and p (probability of success) is low.
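The defect example can be worked out in a few lines. This is a minimal sketch using the standard Poisson PMF; the binomial comparison at the end uses n = 1000 and p = 0.015, values picked only so that n*p matches the same average of 15.

```python
import math

def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the average is lam."""
    return lam**x * math.exp(-lam) / math.factorial(x)

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

rate_per_hour = 5
hours = 3
lam = rate_per_hour * hours  # expected defects in 3 hours = 15

p_17_defects = poisson_pmf(17, lam)

# A binomial with many trials and a small success probability, same average n*p = 15.
p_17_binomial = binomial_pmf(17, 1000, 0.015)

print(f"Poisson  P(17 defects in 3 hours) = {p_17_defects:.4f}")
print(f"Binomial P(17 successes), n=1000, p=0.015 = {p_17_binomial:.4f}")
```

Note that the rate is scaled to the interval first: 5 defects per hour over 3 hours gives an average of 15, and that is the value that goes into the formula.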

Uniform Distribution

A uniform distribution is a special type of distribution where all the values of the random variable X have the same probability. For example, rolling a fair die or tossing a fair coin.
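A quick simulation shows what "same probability" looks like in practice. The sketch assumes a fair six-sided die, simulated with Python's random module; the number of rolls and the seed are arbitrary choices.

```python
import random
from collections import Counter

rng = random.Random(7)  # fixed seed so the example is repeatable
rolls = [rng.randint(1, 6) for _ in range(60_000)]

counts = Counter(rolls)
frequencies = {face: counts[face] / len(rolls) for face in range(1, 7)}

for face, frequency in frequencies.items():
    print(f"face {face}: observed {frequency:.4f}, expected {1 / 6:.4f}")
```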

Normal Distribution

Normal Distribution or Gaussian Distribution is the most common distribution in real-life scenarios. Height, weight, IQ etc. approximately follow a normal distribution. It is a continuous probability distribution, so we calculate probabilities for a range of values rather than for a single value.

A normal distribution typically looks like a bell shaped curve.

If μ is the mean and σ is the standard deviation, μ lies at the centre of the bell-shaped curve, and the distribution follows the 68-95-99.7 rule (a.k.a. the 1-2-3 rule). This rule states that about 68% of the values lie within [μ-σ, μ+σ], about 95% within [μ-2σ, μ+2σ] and about 99.7% within [μ-3σ, μ+3σ].
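The three percentages can be checked with the cumulative distribution function of a normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library with mean 0 and standard deviation 1; the same shares hold for any other mean and standard deviation.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of values lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4%}")
```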

The sampling distribution of the sample mean is approximately normal when the sample size is large, regardless of the shape of the population distribution (as long as its variance is finite). This is called the CLT or Central Limit Theorem. It is one of the most important results in statistics and it makes a lot of data analytics easier. Please read more about the CLT on the internet; it is also a favourite topic of interviewers.
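A small simulation makes the theorem easier to believe. The sketch below assumes a strongly skewed population (exponential with mean 1 and standard deviation 1), takes many samples of size 50 and looks at how the sample means behave; the population, sample size and number of samples are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the example is repeatable
sample_size, n_samples = 50, 5_000

# Each entry is the mean of one sample drawn from a skewed (exponential) population.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(n_samples)
]

centre = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
expected_spread = 1 / math.sqrt(sample_size)  # population std dev / sqrt(sample size)

# For a bell-shaped curve, roughly 68% of values sit within one standard deviation.
share_within_one = sum(abs(m - centre) <= spread for m in sample_means) / n_samples

print(f"mean of sample means: {centre:.3f} (population mean is 1)")
print(f"std dev of sample means: {spread:.3f} (theory suggests {expected_spread:.3f})")
print(f"share within one std dev: {share_within_one:.1%}")
```

Even though individual values from this population are far from bell-shaped, the means of the samples cluster symmetrically around the population mean.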

Standard Normal Distribution

A standard normal distribution is a normal distribution with μ as 0 and σ as 1. Any normal distribution can be converted to a standard normal distribution by calculating the Z values as

Z = (X-μ)/σ

The probability density plot of the Z values is a standard normal distribution with mean 0 and standard deviation 1.
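The conversion is a one-liner in code. The sketch below standardises a handful of hypothetical exam scores (made-up numbers) and treats them as the whole population, so it uses the population standard deviation.

```python
import statistics

scores = [52, 61, 70, 70, 75, 84, 92]  # hypothetical exam scores

mu = statistics.fmean(scores)
sigma = statistics.pstdev(scores)  # population standard deviation

# Z = (X - mu) / sigma for every value.
z_scores = [(x - mu) / sigma for x in scores]

for x, z in zip(scores, z_scores):
    print(f"score {x}: z = {z:+.2f}")

print(f"mean of z values: {statistics.fmean(z_scores):.2f}")
print(f"std dev of z values: {statistics.pstdev(z_scores):.2f}")
```

A Z value tells us how many standard deviations a score sits above or below the mean, which makes values from different scales comparable.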

Chi-Squared Distribution

A Chi-squared distribution with n degrees of freedom is the distribution of the sum of the squares of n independent random variables, each of which follows a standard normal distribution. Chi-squared is very important for analytics as it is used in hypothesis testing and in various machine learning algorithms.

Consider a standard normal random variable X1; the square of X1 follows a Chi-squared distribution with 1 degree of freedom. Similarly, if X1 and X2 are two independent standard normal random variables, then square(X1)+square(X2) follows a Chi-squared distribution with 2 degrees of freedom.

As the degrees of freedom increase, a Chi-squared distribution approaches a normal distribution.
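The construction can be simulated directly: square some standard normal draws and add them up. The sketch below assumes 3 degrees of freedom and 20,000 draws (arbitrary choices) and compares the simulated mean and variance with the known theoretical values for a Chi-squared distribution, which are n and 2n for n degrees of freedom.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the example is repeatable
degrees_of_freedom = 3
n_draws = 20_000

def chi_squared_draw(k):
    """Sum of the squares of k independent standard normal values."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(k))

draws = [chi_squared_draw(degrees_of_freedom) for _ in range(n_draws)]

simulated_mean = statistics.fmean(draws)
simulated_variance = statistics.variance(draws)

print(f"simulated mean: {simulated_mean:.3f} (theory: {degrees_of_freedom})")
print(f"simulated variance: {simulated_variance:.3f} (theory: {2 * degrees_of_freedom})")
```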

Student's T Distribution

Student's T distribution is used to estimate the mean of a normally distributed population when the sample is small and the population standard deviation is not known. It is widely used in calculating confidence intervals and in hypothesis testing. As a rule of thumb, it is best suited when the sample has fewer than 30 values. As the sample size increases, it approaches the normal distribution.

The graph of a Student's T distribution looks like a normal distribution but with wider tails.
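The difference in the tails can be seen numerically. The sketch below implements the standard density formula of the T distribution with the Python standard library (no plotting) and compares it with the standard normal density at the centre and at 3; the degrees of freedom shown are arbitrary examples.

```python
import math
from statistics import NormalDist

def t_pdf(t, dof):
    """Density of Student's T distribution with dof degrees of freedom."""
    # Work with logarithms of the gamma function so large dof values do not overflow.
    log_coefficient = (
        math.lgamma((dof + 1) / 2)
        - math.lgamma(dof / 2)
        - 0.5 * math.log(dof * math.pi)
    )
    return math.exp(log_coefficient - (dof + 1) / 2 * math.log1p(t * t / dof))

normal = NormalDist(mu=0, sigma=1)

print(f"normal:        density at 0 = {normal.pdf(0):.4f}, at 3 = {normal.pdf(3):.4f}")
for dof in (2, 5, 30, 1000):
    print(f"T (dof={dof:>4}): density at 0 = {t_pdf(0, dof):.4f}, at 3 = {t_pdf(3, dof):.4f}")
```

With few degrees of freedom the T density is lower at the centre and higher far from it, and the gap to the normal curve closes as the degrees of freedom grow.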

We will see practical applications of Student's T distribution when we cover confidence intervals and hypothesis testing.

Conclusion

In this article, we tried to understand why it is important to know statistics before we actually do any data analysis or machine learning. Most machine learning algorithms use statistical techniques which can be understood only if we know the basics of statistics. The ask is not to be a statistician, but foundational knowledge is definitely required.

In the coming blogs, I will cover confidence intervals and hypothesis testing. See you soon! Have a nice week ahead!

Author: Abhi Sharma - LinkedIn
