Look what I got in my data - Part 1
I put down my pen for a while. It was not because I got too lazy to jot something down, but because I was trying my hand at something that had been enticing me for a long time. And that thing is nothing but the science of data. I am still a kid in this area, but just like a kid, I am also learning fast. As usual, when I learn something I love to share it, as that helps me cement my learning. So, today I will talk about Exploratory Data Analysis (a.k.a. EDA). I will start with the basics of EDA and then move on to more complex areas in subsequent articles.
EDA refers to a set of tools that helps analyze data in order to describe and explore its characteristics before building models. It is not a means to extract intelligence from data by itself, but a set of techniques that supports model building. EDA is not a new term or approach; it has been around for a long time, championed by John W. Tukey starting in the 1960s. From my experience with EDA so far, it looks to me like the work of a detective. Just as a detective tries to find clues in everything related to a crime, EDA tries to identify important clues in the data that help discover the rules behind why a certain event occurred. And those rules, when applied to similar data, tell us what event is going to happen. Isn't that cool?
I want to run this as a series of articles, and in each article I will explain a specific concept. In this article, I will talk about two basic aspects that are core to EDA:
- Measure of central tendency
- Measure of dispersion
Both of them are descriptive measures. Central tendency means that observations tend to group around a middle value. Such a value is called a measure of central tendency, a measure of location, or a statistical average. The mean, the median and the mode fall under this category.
The mean (or arithmetic mean) is the simplest way to calculate the middle value, and we are all familiar with it from our school days. The arithmetic mean is, however, susceptible to extreme values and data fluctuations.
The median is the middle value: 50% of the observations lie above it and the remaining 50% lie below it. It has better resistance to extreme values and data fluctuations.
The mode is the typical value: the value that occurs most often.
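As a quick illustration, here is a minimal sketch using Python's built-in statistics module; the sample values are made up purely for illustration:

```python
import statistics

# A small made-up sample, reused in the snippets below
data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

print(statistics.mean(data))    # arithmetic mean -> 5.1
print(statistics.median(data))  # middle value    -> 5
print(statistics.mode(data))    # most frequent   -> 3
```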
While central tendency describes the central point of a dataset, dispersion describes the spread of the data. Variance, standard deviation, range and interquartile range are measures of dispersion; skewness and kurtosis, which describe the shape of the distribution, are often discussed alongside them.
The variance is the average of the squared differences from the mean. The standard deviation, calculated as the square root of the variance, measures how far observations typically lie from the mean, expressed in the same units as the data.
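Continuing the sketch above (same made-up sample), note that Python's statistics module distinguishes the population versions (pvariance, pstdev, dividing by n) from the sample versions (variance, stdev, dividing by n - 1):

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

var = statistics.pvariance(data)  # population variance, approx. 5.09
std = statistics.pstdev(data)     # population std dev, approx. 2.26 (== var ** 0.5)
print(var, std)
```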
The range is also a measure of dispersion, calculated as the difference between the maximum and the minimum observation. The range is highly susceptible to outliers.
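The calculation itself is a one-liner (same made-up sample as before):

```python
data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]
print(max(data) - min(data))  # range: 9 - 2 = 7
```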
Quartiles divide the observations into four equal parts: the lowest 25%, the second lowest 25%, the second highest 25% and the highest 25%. The specific quartiles give us an idea of both central tendency and dispersion. The 2nd quartile is the median, which divides the observations into two halves.
The difference between the 3rd and the 1st quartile, in other words the range of the middle 50% of the data, is called the IQR (interquartile range). This is a measure of dispersion that ignores the bottom and the top quarter of the data, thus reducing the influence of outliers.
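Here is one way to compute the quartiles and the IQR with NumPy's percentile function (same made-up sample; NumPy interpolates between data points by default, so other quartile conventions may give slightly different values):

```python
import numpy as np

data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)   # q2 is the median
print(q3 - q1)      # IQR: range of the middle 50%
```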
OK, now we know what the measures of dispersion are, but how do we know how large or how small the dispersion is? That is where the coefficient of variation comes into the picture. It gives an idea of the variation irrespective of the scale. It is the standard deviation relative to the mean (or the standard deviation expressed as a percentage of the mean).
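A minimal sketch, again on the made-up sample:

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

cv = statistics.pstdev(data) / statistics.mean(data)
print(f"CV = {cv:.1%}")  # standard deviation as a percentage of the mean
```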
The z-score is another measure of variation irrespective of the scale. It is calculated by subtracting the mean of the observations from an observation and dividing the difference by the standard deviation. In other words, it expresses how many standard deviations an observation lies from the mean, calibrating the measure against its own distribution.
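And the z-scores for our made-up sample, computed directly from the definition:

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

mean = statistics.mean(data)
std = statistics.pstdev(data)
z_scores = [(x - mean) / std for x in data]
print(z_scores)  # values near 0 sit close to the mean
```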
These are all very basic measures, but they are the core set used in EDA. As we go along, we will see that other techniques for describing data revolve around these measures. So it is very important to understand these metrics before delving deeper into EDA.