
Statistics vs. Visualization (#Data Science)

Raja Saurabh Tiwari

Vice President @ Citi | Java , Cloud, ML Solutions | Gen AI enthusiast | Wildlife Photography

Published: October 24, 2020

Understanding the statistical properties of the data is one of the key aspects of data science and machine learning.

When working with different data sets, we rely heavily on the statistical properties of the data. Whether you want feature importance (p-values), collinearity, or the significance of a model, everything is driven by the statistical properties of the data we are working with.

But do statistics always give you the correct insight into the data? Are they good enough to base decisions on?

Today we are going to talk about a different aspect of numbers and stats, one that can really mislead you if you don't pay close attention.

Look at the four data sets below. Each has different underlying values of X and y, with X ranging from 4 to 19.

To get basic statistical information about these data sets, we compute the mean and variance of each one.

If you look at these values, the statistical properties of each data set are either identical or very similar.

The mean of each X is 9.
The mean of each y is 7.5.
The variance of each X is 11.
The variance of each y is between 4.12 and 4.13.
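The figures above can be checked with a few lines of Python. This is a minimal sketch, assuming the four data sets are the standard published values of Anscombe's quartet (the article's own spreadsheet is not reproduced here). The names `QUARTET` and `summarize` are illustrative only. The variance is the sample variance, with n - 1 in the denominator.

```python
from statistics import mean, variance

# Standard published values of Anscombe's quartet (assumed to match the article's data).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # X is shared by the first three sets
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the mean and sample variance of X and y for one data set."""
    return {
        "mean_X": mean(x),
        "mean_y": mean(y),
        "var_X": variance(x),  # sample variance (divides by n - 1)
        "var_y": variance(y),
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(
        f"Dataset {name}: mean X = {stats['mean_X']:.2f}, mean y = {stats['mean_y']:.2f}, "
        f"var X = {stats['var_X']:.2f}, var y = {stats['var_y']:.2f}"
    )
```

With a spreadsheet loaded into pandas, `DataFrame.mean()` and `DataFrame.var()` give the same quantities, since `var()` also defaults to the sample variance.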

Statistically they look identical.

But are these data sets the same? When building a model, can the same model be a good fit for all of them?

Let's answer these questions by plotting the data in Python (or Excel):

# Import the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Excel file (read_excel does not take an encoding argument)
data = pd.read_excel('E:/Book1.xlsx')

# Draw the scatter plot with a fitted regression line
sns.regplot(x='X', y='y', data=data)
plt.show()

Statistically these datasets look similar, but they are very different from each other. A linear model may be a good fit for the first and third datasets, but certainly not for the second and fourth, which is evident from the trend line.
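To see all four datasets side by side, one option is a small grid of scatter plots, each with its own least-squares line. The sketch below is only an illustration. It assumes the standard published values of Anscombe's quartet rather than the article's spreadsheet (whose layout is not shown), and it uses NumPy and matplotlib instead of seaborn. The names `QUARTET`, `fit_line`, `plot_quartet` and the output file name are made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Standard published values of Anscombe's quartet (assumed to match the article's data).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # X is shared by the first three sets
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


def plot_quartet():
    """Draw one scatter panel per dataset, each with its own fitted line."""
    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
    grid = np.array([2.0, 20.0])  # x-range over which each fitted line is drawn
    for ax, (name, (x, y)) in zip(axes.flat, QUARTET.items()):
        slope, intercept = fit_line(x, y)
        ax.scatter(x, y)
        ax.plot(grid, slope * grid + intercept, color="red")
        ax.set_title(f"Dataset {name}: y = {intercept:.2f} + {slope:.2f}x")
    fig.tight_layout()
    return fig


fig = plot_quartet()
fig.savefig("anscombe_quartet.png")  # or call plt.show() in an interactive session
```

All four fitted lines come out close to y = 3 + 0.5x, so the fitted coefficients alone cannot tell the datasets apart. Only the scatter points show the curve in the second dataset and the single influential point in the fourth.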

So statistics alone might not tell you the whole story. It's always better to understand the relationship by drawing graphs or charts and visualizing the data.

(From Wikipedia) This is called Anscombe's quartet. It was constructed by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties.
