Statistics vs. Visualization (#Data Science)
Raja Saurabh Tiwari
Vice President @ Citi | Java , Cloud, ML Solutions | Gen AI enthusiast | Wildlife Photography
Understanding the statistical properties of the data is one of the key aspects of data science and machine learning.
When working with different data sets, we rely heavily on those statistical properties. Whether you want feature importance (p-values), collinearity, or the significance of a model, everything is driven by the statistical properties of the data we are working on.
But do statistics always give you correct insight into the data? Are they good enough for us to make decisions?
Today we are going to talk about a different aspect of 'numbers' and 'stats', one that can really mislead you if you don't pay close attention.
Look at the four data sets below. Each has different underlying X and y values, ranging roughly from 3 to 19.
To get basic statistical information about these data sets, we compute the mean and variance of each.
Now if you look closely at these values, the statistical properties of the data sets are either identical or very similar.
- Every X has a mean of 9.
- Every y has a mean of about 7.5.
- Every X has a variance of 11.
- Every y has a variance between 4.12 and 4.13.
Statistically they look identical.
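The summary figures above can be reproduced with a few lines of Python. This is only a minimal sketch: the spreadsheet used in this article is not reproduced here, so the block assumes the commonly published values of Anscombe's quartet, and the names `quartet` and `summarize` are illustrative. It uses the sample variance (dividing by n - 1) from the standard library.

```python
from statistics import mean, variance

# Assumption: the commonly published Anscombe's quartet values, taken to
# match the article's spreadsheet (which is not shown in the text).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by data sets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (mean of X, mean of y, sample variance of X, sample variance of y)."""
    return mean(xs), mean(ys), variance(xs), variance(ys)


# One summary tuple per data set, keyed by its name.
summary = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}

for name, (mx, my, vx, vy) in summary.items():
    print(f"{name:>3}: mean X={mx:.2f}  mean y={my:.2f}  "
          f"var X={vx:.2f}  var y={vy:.3f}")
```

Printing the variance of y to three decimals makes the small differences between the data sets visible, while the other three columns come out the same for all four.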
But are these data sets the same? When building a model, can the same model be a good fit for all of them?
Let's answer these questions by plotting the data in Python (or Excel):
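# Import the required libraries
import pandas as pd
import seaborn as sns

# Load one data set from the Excel file (expects columns named 'X' and 'y').
# Note: read_excel takes no 'encoding' argument in current pandas versions.
data = pd.read_excel('E:/Book1.xlsx')

# Draw the scatterplot with a fitted regression (trend) line
sns.regplot(x='X', y='y', data=data)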
Statistically these data sets look similar, but they are very different from each other. A linear model may be a good fit for the first and third data sets, but certainly not for the second and fourth, which is evident from the trend lines.
So statistics alone might not tell you the whole story. It is always better to understand the relationship by drawing graphs or charts and visualizing the data.
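The same point can be checked numerically: an ordinary least-squares line fitted to each data set comes out almost the same (slope about 0.5, intercept about 3, correlation about 0.82), so the fitted numbers alone do not reveal which data sets are actually linear. The sketch below assumes the commonly published Anscombe values, since the article's spreadsheet is not shown, and `fit_line` is a hypothetical helper written for this illustration, not part of pandas or seaborn.

```python
# Assumption: the commonly published Anscombe's quartet values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by data sets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, Pearson correlation r).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# One (slope, intercept, r) tuple per data set.
fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}

for name, (slope, intercept, r) in fits.items():
    print(f"{name:>3}: y = {intercept:.2f} + {slope:.3f} * x   (r = {r:.3f})")
```

Because the four fitted lines are nearly indistinguishable, only the scatterplots show the curved pattern in the second data set and the single influential point in the fourth.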
(From Wikipedia) This is called Anscombe's quartet. It was constructed by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties.