Data Analytics is all about Statistics
Abhi Sharma
Cloud Engineer @ Google | Ex-Founder | Building on Entrepreneurial Foundations
Data analytics is a term that is used widely these days. Almost everyone is either doing data analytics or using the results of analytics in their day-to-day life. For example - you check the weather forecast for the next day, use Google Maps to check the time it takes to travel from point A to point B, check the estimated run rate during a cricket match or even forecast next month's sales for your business. In this article, we will touch on some very fundamental concepts and then try to understand a few advanced concepts ranging from probability distributions to confidence intervals and hypothesis testing (in the next blog).
Population and Sample
If the entire human race is a population, the people living in one state of the US can be called a sample. A population is the complete set of all the data points available, whereas a sample is a subset of the population.
Random Sample - A random sample is a subset of the population which is selected at random without any bias. There are two things to consider for a sample to be random.
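To make the idea concrete, here is a minimal sketch of drawing a random sample in Python. The population of ID numbers is made up for illustration, and random.sample is just one convenient way to pick members so that each one is equally likely to be chosen.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ID numbers 1..10000 standing in for people.
population = list(range(1, 10001))

# random.sample draws without replacement, so no member is picked twice,
# and every member has the same chance of being selected.
sample = random.sample(population, k=100)

print(len(sample), len(set(sample)))
```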
Measures of Central Tendency
Measures of central tendency are values which summarise the centre of the distribution of the data points. They are namely - Mean, Median and Mode.
Most of us may have already studied these concepts in school. Mean is the average of all the values in the dataset, Median is the middle value when all the values are arranged in order and Mode is the most frequent value.
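As a quick refresher, the three measures can be computed with Python's built-in statistics module. The numbers below are made up purely for illustration.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # made-up data

mean = statistics.mean(values)      # sum of values / number of values
median = statistics.median(values)  # average of the two middle values here
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```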
These were the most important concepts for passing a primary school mathematics exam, but individually they are not so useful in real life. The mean alone does not tell us much in statistics, so I really discourage using averages without giving more context about them. For example - suppose we say that the average temperature of a city during the year is 32° C. Can we assume that this city is hot all year round, or that it never has winters or snow? Of course not!
The mean is just a summary of the entire set of values; it does not tell us about the distribution of the data until we introduce standard deviation and other metrics. We will see how the mean can be made a useful measure once we cover more advanced topics.
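The temperature example can be sketched in a few lines. Both lists of monthly temperatures below are invented for illustration; they are constructed to share the same mean of 32° C while describing very different climates.

```python
import statistics

# Made-up monthly average temperatures (° C) for two hypothetical cities.
city_a = [31, 32, 33, 32, 31, 33, 32, 32, 31, 33, 32, 32]
city_b = [12, 16, 24, 32, 40, 46, 48, 46, 40, 34, 26, 20]

mean_a = statistics.mean(city_a)
mean_b = statistics.mean(city_b)

# Standard deviation (covered in the next section) exposes the difference.
spread_a = statistics.pstdev(city_a)
spread_b = statistics.pstdev(city_b)

print(mean_a, mean_b)
print(round(spread_a, 2), round(spread_b, 2))
```

Both cities report the same average, yet only the second one has a cold season, which is exactly why the mean needs more context.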
Variance and Standard Deviation
In simple terms, variance is a measure of how far the data points are spread around the mean. For a population of N values with mean μ, the textbook formula for variance is as follows
σ² = Σ(xᵢ - μ)² / N
Here, we can see that variance is calculated by squaring the difference between each data point and the mean, adding the squares up and dividing by the number of data points. As variance is a squared metric, its units are the square of the data's units, which is not very convenient, so we have another measure called standard deviation. Standard deviation is just the square root of the variance.
In real life we rarely deal with population data; we almost always assume that we are working with sample data. Therefore, we need the variance and standard deviation of the sample.
Note - It is advisable to always consider your dataset as a sample. Even for complex machine learning models, we assume that we are dealing with sample data only.
For a sample of N values with sample mean x̄, the sample variance is s² = Σ(xᵢ - x̄)² / (N-1), and the sample standard deviation s is its square root. Notice that the denominator is N-1 rather than N. This corrects a bias that appears when variance is calculated from a finite sample: the deviations are measured from the sample mean, which is itself computed from the same data and so sits closer to the data points than the true population mean does. As a result, dividing by N underestimates the variance of the population on average. To correct this bias, Bessel's correction of N-1 is used, which on average gives a value closer to the actual population variance.
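The effect of Bessel's correction can be checked with a small simulation. This is only a sketch under assumed settings: the population is synthetic (normal with mean 50 and standard deviation 10), and the sample size of 5 is chosen to make the bias easy to see. In Python's statistics module, pvariance divides by N and variance divides by N-1.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 synthetic values.
population = [random.gauss(50, 10) for _ in range(10_000)]
pop_var = statistics.pvariance(population)

n = 5          # small sample size makes the bias visible
trials = 5_000

biased, corrected = [], []
for _ in range(trials):
    sample = random.sample(population, n)
    biased.append(statistics.pvariance(sample))    # divides by N
    corrected.append(statistics.variance(sample))  # divides by N-1

avg_biased = statistics.fmean(biased)
avg_corrected = statistics.fmean(corrected)

print(round(pop_var, 1), round(avg_biased, 1), round(avg_corrected, 1))
```

On average the N version should land near (N-1)/N of the population variance (about 80% for N = 5), while the N-1 version should land close to the population variance itself.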
Distribution of Data
When we talk about distribution, we usually mean the probability distribution of the data. A probability distribution is described by a PMF (probability mass function) or a PDF (probability density function). A PMF is for discrete random variables whereas a PDF is for continuous random variables. Now, we are going to go through some very important distributions that we generally find in data analytics problems.
Binomial Distribution
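Binomial Distribution is one of the most important distributions because of its applications in real life. A random variable has a binomial distribution if it satisfies the following properties: the number of trials n is fixed, each trial has only two possible outcomes (success or failure), the probability of success p is the same in every trial, and the trials are independent of each other.
If the number of trials in a binomial distribution is large (and the probability of success is not too close to 0 or 1), it can be approximated by a normal distribution.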
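Here is a minimal sketch of the binomial probability and its normal approximation. It uses the standard binomial formula P(X = k) = C(n, k) · p^k · (1-p)^(n-k); the coin-toss numbers are made up, and the helper name binomial_pmf is only illustrative.

```python
from math import comb
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Made-up example: 10 tosses of a fair coin, exactly 3 heads.
p_three_heads = binomial_pmf(3, 10, 0.5)

# Large-n case: 100 tosses, probability of at most 55 heads.
n, p = 100, 0.5
exact = sum(binomial_pmf(k, n, p) for k in range(56))

# Normal approximation with mean n*p and standard deviation sqrt(n*p*(1-p)).
# The extra 0.5 is the usual continuity correction for a discrete variable.
approx = NormalDist(mu=n * p, sigma=(n * p * (1 - p)) ** 0.5).cdf(55.5)

print(p_three_heads, round(exact, 4), round(approx, 4))
```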
Poisson Distribution
Poisson distribution helps us find the probability of a given number of occurrences of an event over a period of time (or over any other fixed interval), given the average rate at which the event occurs. For example, if a machine produces 5 defects per hour on average, we can calculate the probability of the machine producing 17 defects in 3 hours.
P(X = x) = (λ^x · e^(-λ)) / x!
λ = average number of occurrences in the interval (here 5 × 3 = 15)
x = value of the random variable (here 17)
e = Euler's number ≈ 2.718
A binomial distribution can be approximated by a Poisson distribution with λ = np when n (the number of trials) is large and p (the probability of success) is low.
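The machine-defect example can be worked out with the standard Poisson formula P(X = x) = λ^x · e^(-λ) / x!. The sketch below also compares it with a binomial distribution that has a large n and a small p; the values n = 1000 and p = 0.015 are assumptions chosen only so that n·p equals 15.

```python
from math import comb, exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the average is lam."""
    return lam**x * exp(-lam) / factorial(x)


rate_per_hour = 5
hours = 3
lam = rate_per_hour * hours  # 15 defects expected in 3 hours

p_17_defects = poisson_pmf(17, lam)

# Binomial with many trials and a small success probability, n * p = 15.
n, p = 1000, 0.015
binomial_17 = comb(n, 17) * p**17 * (1 - p) ** (n - 17)

print(round(p_17_defects, 4), round(binomial_17, 4))
```

The probability of exactly 17 defects in 3 hours comes out at roughly 8.5%, and the binomial value should be very close to it.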
Uniform Distribution
A uniform distribution is a special type of distribution where all the values of the random variable X have the same probability. For example - rolling a fair die, tossing a fair coin etc.
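A small sketch for the die example, assuming a fair six-sided die: every face gets probability 1/6, and a seeded simulation of 60,000 rolls should give observed frequencies close to that value.

```python
import random
from collections import Counter
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Theoretical PMF: every face has the same probability.
pmf = {face: Fraction(1, len(faces)) for face in faces}

# Simulated rolls of a fair die.
random.seed(1)
rolls = 60_000
counts = Counter(random.choice(faces) for _ in range(rolls))
frequencies = {face: counts[face] / rolls for face in faces}

for face in faces:
    print(face, pmf[face], round(frequencies[face], 4))
```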
Normal Distribution
Normal Distribution or Gaussian Distribution is the most common distribution in real life scenarios. Height, weight, IQ etc. approximately follow a normal distribution. It is a continuous probability distribution, so instead of the probability of a single exact value we calculate the probability that a value falls within a range.
A normal distribution typically looks like a bell-shaped curve.
If μ is the mean and σ is the standard deviation, then μ lies at the centre of the bell-shaped curve and the distribution follows the 1-2-3 rule, a.k.a. the 68-95-99.7 rule. This rule states that about 68% of the values lie between [μ-σ , μ+σ], about 95% fall between [μ-2σ , μ+2σ] and about 99.7% lie between [μ-3σ , μ+3σ].
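The 68-95-99.7 rule can be verified with Python's statistics.NormalDist. The mean of 170 and standard deviation of 10 below are made-up values (think of heights in cm); the percentages are the same for any normal distribution.

```python
from statistics import NormalDist

# Hypothetical normal distribution, e.g. heights in cm.
heights = NormalDist(mu=170, sigma=10)


def share_within(k):
    """Probability of falling within k standard deviations of the mean."""
    lower = heights.mean - k * heights.stdev
    upper = heights.mean + k * heights.stdev
    return heights.cdf(upper) - heights.cdf(lower)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sigma: {share:.4f}")
```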
The sampling distribution of the sample mean also approaches a normal distribution as the sample size becomes large, whatever the shape of the population's distribution (as long as it has a finite variance). This is called the CLT or Central Limit Theorem, which is one of the most important concepts in statistics and has made a lot of data analytics easier. Please read more about the CLT on the internet; it is also one of the favourite topics of interviewers.
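A rough way to see the Central Limit Theorem in action is to take many samples from a clearly non-normal population and look at the distribution of their means. The sketch below assumes an exponential population with mean 1 and standard deviation 1; the sample size of 50 and the 5,000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

sample_size = 50
num_samples = 5_000

# Each sample comes from a skewed exponential population (mean 1, sd 1).
sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(sample_size)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# For a normal distribution about 68% of values lie within one sd of the mean.
share_within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / num_samples

print(round(mean_of_means, 3), round(sd_of_means, 3), round(share_within_one_sd, 3))
```

The sample means should centre near 1 with a standard deviation near 1/√50 ≈ 0.141, and roughly 68% of them should fall within one standard deviation of the centre, even though the population itself is far from bell shaped.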
Standard Normal Distribution
A standard normal distribution is a normal distribution with μ as 0 and σ as 1. Any normal distribution can be converted to a standard normal distribution by calculating the Z values as
Z = (X-μ)/σ
The probability density plot of the Z values will come out to be a standard normal distribution with mean 0 and standard deviation 1.
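A short sketch of the Z transformation on made-up values. After applying Z = (X-μ)/σ, the transformed values have mean 0 and standard deviation 1, and a probability for the original distribution can be read off the standard normal distribution. The figures 170, 10 and 185 in the second part are assumptions for illustration.

```python
import statistics
from statistics import NormalDist

values = [150, 160, 170, 180, 190]  # made-up data

mu = statistics.fmean(values)
sigma = statistics.pstdev(values)

# Z = (X - mu) / sigma for every value.
z_values = [(x - mu) / sigma for x in values]

z_mean = statistics.fmean(z_values)
z_sd = statistics.pstdev(z_values)

# P(X <= 185) for a hypothetical normal X with mean 170 and sd 10,
# computed through the standard normal distribution.
z = (185 - 170) / 10
prob = NormalDist().cdf(z)

print(round(z_mean, 6), round(z_sd, 6), round(prob, 4))
```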
Chi-Squared Distribution
A Chi-squared distribution with n degrees of freedom is the distribution of the sum of the squares of n independent random variables, each of which follows a standard normal distribution. Chi-squared is very important for analytics as it is used in hypothesis testing and in various machine learning algorithms.
Consider a standard normal random variable X1; then the square of X1 follows a Chi-squared distribution with 1 degree of freedom. Similarly, if X1 and X2 are two independent standard normal random variables, then square(X1)+square(X2) follows a Chi-squared distribution with 2 degrees of freedom.
As the degrees of freedom increase, a Chi-squared distribution approaches a normal distribution.
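The definition can be simulated directly by summing the squares of independent standard normal draws. This is a sketch with 3 degrees of freedom and 20,000 draws, both arbitrary choices. A Chi-squared distribution with k degrees of freedom has mean k and variance 2k, so the simulated values should come out near 3 and 6.

```python
import random
import statistics

random.seed(11)  # fixed seed so the sketch is reproducible

k = 3           # degrees of freedom
draws = 20_000


def chi_squared_draw(k):
    """Sum of squares of k independent standard normal values."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(k))


samples = [chi_squared_draw(k) for _ in range(draws)]

sample_mean = statistics.fmean(samples)
sample_var = statistics.variance(samples)

print(round(sample_mean, 2), round(sample_var, 2))
```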
Student's T Distribution
Student's T distribution is used to estimate the mean of a normally distributed population when the sample is small and the population standard deviation is not known. It is widely used in calculating confidence intervals and in hypothesis testing. As a rule of thumb, it is best suited when the number of values in the sample is less than 30. As the sample size increases, it approaches the normal distribution.
The graph of a Student's T distribution looks like a normal distribution but with wider tails.
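The wider tails can be seen in a simulation. Python's standard library has no built-in T distribution, so this sketch builds the t statistic by hand: for many small samples of size 5 from a standard normal population, it computes t = (sample mean - μ) / (s / √n) and counts how often |t| exceeds 1.96. The sample size, number of trials and cutoff are assumptions chosen for illustration.

```python
import random
import statistics
from math import sqrt

random.seed(3)  # fixed seed so the sketch is reproducible

n = 5            # small sample, so 4 degrees of freedom
trials = 10_000
cutoff = 1.96    # leaves about 5% in the tails of a normal distribution

t_values = []
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    sample_mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (N-1)
    t_values.append((sample_mean - 0) / (s / sqrt(n)))

t_tail = sum(abs(t) > cutoff for t in t_values) / trials
normal_tail = 2 * (1 - statistics.NormalDist().cdf(cutoff))

print(round(t_tail, 3), round(normal_tail, 3))
```

With only 5 values per sample, noticeably more than 5% of the t statistics should fall beyond ±1.96, which is the extra tail weight that a normal distribution would miss.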
We will see practical applications of Student's T distribution when we cover confidence intervals and hypothesis testing.
Conclusion
In this article, we tried to understand why it is important to know statistics before we actually do any data analysis or machine learning. Most machine learning algorithms use various statistical techniques which can be understood only if we know the basics of statistics. The ask is not to become a statistician, but foundational knowledge is definitely required.
In the coming blogs, I will cover confidence intervals and hypothesis testing. See you soon! Have a nice week ahead!
Author: Abhi Sharma - LinkedIn