Data & Business Analytics Series: Basic Statistics (1/n)


In this series of posts, I'll cover the tools and techniques required for the analysis, visualization, and presentation of data & business insights, from basic to advanced level.

"Data analysis is a combination of statistical mathematics and the art of organizing, analyzing, and interpreting data to generate meaningful insights for taking the right action at the right time."

Basic statistical terminology and its usage:

1. Mean, Median & Mode

  • Mean: The mean, often referred to as the average, is calculated by adding up all the numbers in a data set, and then dividing by the count of numbers in that set.
  • Weighted Mean: The weighted mean, on the other hand, takes into account the importance (or weight) of each value in the data set. In a weighted mean, each data point contributes to the final average proportionally to its assigned weight.
  • Median: The median is the middle number in a sorted, ascending or descending, list of numbers. If the data set has an odd number of observations, the number in the middle is the median. If there’s an even number of observations, the median is the average of the two middle numbers.
  • Mode: The mode is the number that appears most frequently in a data set. A data set may have one mode, more than one mode, or no mode at all.

Each of these measures gives us a different insight, and which one to use depends on the nature and distribution of the data.

Tip #1: The mean is a good option when the data distribution is symmetrical, while the median is often a better choice when the distribution is skewed, because it isn't affected by extreme values. The mode, on the other hand, can be useful for categorical data.
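A minimal sketch of Tip #1 using Python's standard-library `statistics` module (the sales figures are invented for illustration):

```python
import statistics

# Hypothetical monthly sales (in $k) with one extreme outlier (95)
sales = [10, 12, 11, 13, 12, 11, 12, 95]

mean_val = statistics.mean(sales)      # 22.0 — pulled upward by the outlier
median_val = statistics.median(sales)  # 12.0 — robust to the outlier
mode_val = statistics.mode(sales)      # 12   — the most frequent value

print(mean_val, median_val, mode_val)
```

Note how the single outlier drags the mean well above the median, which is exactly why the median is preferred for skewed data.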

Tip #2: In a simple mean, all numbers are treated equally, while in a weighted mean, some numbers contribute more to the final average than others. The weighted mean is particularly useful when the data elements in the set are not equally important, or when we want an average that is not dominated by less relevant values. We generally give more weight to recent or more relevant data points without ignoring older / earlier data trends.
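The weighted mean can be computed by hand in a few lines; here is a quick sketch (the quarterly figures are made up for illustration), weighting recent quarters more heavily as described in Tip #2:

```python
# Hypothetical quarterly revenue (in $k), oldest to newest
values = [100, 110, 120, 150]
# More weight on recent quarters, without ignoring earlier ones
weights = [1, 2, 3, 4]

simple_mean = sum(values) / len(values)  # 480 / 4 = 120.0
weighted_mean = (
    sum(v * w for v, w in zip(values, weights)) / sum(weights)
)  # (100*1 + 110*2 + 120*3 + 150*4) / 10 = 128.0
```

The weighted mean (128.0) sits closer to the recent, higher-revenue quarters than the simple mean (120.0).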


2. Variance & Standard Deviation:

Variance and Standard Deviation are both statistical measurements that describe the spread of data points in a data set.

  • Variance is the average of the squared differences from the mean (μ) of a data set. It gives a rough idea of the spread of the data: whether the values are clustered around the mean or spread out over a wider range.
  • Standard Deviation is the square root of the variance. It measures how spread out the values are from the mean and is expressed in the same units as the original data, which makes it more intuitive to interpret. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
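A short sketch with the `statistics` module (note the population vs. sample distinction: `pvariance`/`pstdev` divide by n, while `variance`/`stdev` divide by n − 1):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]        # mean = 5

variance = statistics.pvariance(data)  # average squared deviation: 4.0
std_dev = statistics.pstdev(data)      # square root of the variance: 2.0

# std_dev is in the same units as data, while variance is in squared units,
# which is why the standard deviation is usually easier to interpret.
print(variance, std_dev)
```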


3. Central Limit Theorem, the Normal Distribution & the 68–95–99.7 rule:

  • The Central Limit Theorem states that the average of many independent samples (observations) of a random variable with finite mean and variance is itself a random variable, whose distribution converges to a normal distribution as the sample size grows, regardless of the shape of the original distribution.

The normal distribution is popularly known as the 'Bell Curve' due to its shape (which has an infamous use during the annual appraisal cycle, when People Managers force-fit employee ratings onto it).
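The theorem above can be illustrated with a quick simulation sketch: sample means drawn from a decidedly non-normal (uniform) population still cluster symmetrically around the population mean.

```python
import random
import statistics

random.seed(42)  # reproducible illustration

# 10,000 means of samples (size 50) drawn from a Uniform(0, 1) population
sample_means = [
    statistics.mean(random.uniform(0, 1) for _ in range(50))
    for _ in range(10_000)
]

# The population mean is 0.5; the sample means concentrate around it,
# and a histogram of sample_means would look like a bell curve.
print(statistics.mean(sample_means))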

  • The 68–95–99.7 rule, also known as the empirical rule, is generally used to remember the percentage of values that lie within certain intervals in a normal distribution: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively. Hence its other name, the "3 sigma rule of thumb".

68–95–99.7 rule in Normal Distribution (Photo Credit: Wikipedia)
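The rule can be verified by simulation; here is a sketch using random draws from a standard normal distribution:

```python
import random

random.seed(0)
mu, sigma = 0.0, 1.0
xs = [random.gauss(mu, sigma) for _ in range(100_000)]

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in xs) / len(xs)

# Approximately 0.68, 0.95, and 0.997, matching the empirical rule
print(share_within(1), share_within(2), share_within(3))
```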



Stay tuned for the coming parts of this series. Do let me know your thoughts through comments / DM, and share if you found it useful.


