Statistics vs. Visualization (#Data Science)

Understanding the statistical properties of the data is one of the key aspects of data science and machine learning.

When working with different data sets, we rely heavily on the statistical properties of the data. Whether you want feature importance (p-values), collinearity, or a measure of how well a model fits, everything is driven by the statistical properties of the data we are working on.

But do statistics always give you the correct insight into the data? Are they good enough for us to make decisions?

Today we are going to talk about a different aspect of 'numbers'/'stats' that can really mislead you if you don't pay close attention.

Look at the four data sets below. Each has different underlying values of X and y, with the X values lying between 4 and 19.

To get basic statistical information about these data sets, we'll compute the mean and variance of each variable.


[Image: the four data sets with their summary statistics]

Now if you look closely at these values, the statistical properties of every data set are either identical or nearly so.

  • Each X has a mean of 9.
  • Each y has a mean of 7.5.
  • Each X has a variance of 11.
  • Each y has a variance between 4.12 and 4.13.

Statistically they look identical.

But are these data sets the same? When building a model, can the same model be a good fit for all of them?

Let's answer these questions by plotting the data in Python (or Excel):
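The numbers above can be reproduced with a short sketch. The article's Excel file isn't available here, so this assumes the data sets are the published values of Anscombe's quartet, hard-coded below; the names `quartet` and `summary` are just illustrative. It uses the sample variance (dividing by n - 1), which is what gives 11 and roughly 4.12 to 4.13.

```python
from statistics import mean, variance

# Published values of Anscombe's quartet (assumed to match the article's data).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I, II and III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean and sample variance of X and y for each data set.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": mean(xs),
        "var_x": variance(xs),
        "mean_y": mean(ys),
        "var_y": variance(ys),
    }

for name, s in summary.items():
    print(
        f"Set {name}: mean X = {s['mean_x']:.2f}, var X = {s['var_x']:.2f}, "
        f"mean y = {s['mean_y']:.2f}, var y = {s['var_y']:.2f}"
    )
```

All four rows come out practically the same, which is exactly why the summary table on its own tells you nothing about how different the data sets are.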

# Import the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the Excel file (pd.read_excel does not accept an encoding argument)
data = pd.read_excel('E:/Book1.xlsx')

# Draw the scatterplot with a fitted regression line
sns.regplot(x='X', y='y', data=data)
plt.show()


[Images: scatterplots with fitted trend lines for the four data sets]

Statistically these data sets look alike, but they are very different from each other. A linear model may be a good fit for the first and third data sets, but certainly not for the second and fourth, as the trend lines make evident.

So statistics alone might not tell you the whole story. It's always better to understand the relationship by drawing graphs/charts and visualizing the data.

(From Wikipedia) This is called Anscombe's quartet. It was constructed by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties.
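The same point shows up if you fit the model numerically instead of looking at the plot. Here is a minimal sketch, again assuming the published Anscombe values rather than the article's Excel file, that fits an ordinary least-squares line to each data set by hand; `fit_line` is a hypothetical helper written for this illustration.

```python
from math import sqrt

# Published values of Anscombe's quartet (assumed to match the article's data).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I, II and III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Return (slope, intercept, r) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept, r) in fits.items():
    print(f"Set {name}: y = {intercept:.2f} + {slope:.2f} * x, r = {r:.3f}")
```

Every data set gives practically the same line and the same correlation, so the fitted coefficients cannot tell you that the second set is curved or that the fourth hangs on a single point. Only the plot does.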

