Look what I got in my data - Part 1
I put down my pen for a while. It was not because I got too lazy to jot something down, but because I was trying my hand at something that had been enticing me for a long time. And that thing is nothing but the science of data. I am still a kid in this area, but just like a kid, I am also learning fast. As usual, when I learn something I love to share it, as that helps me cement my learning. So, today I will talk about Exploratory Data Analysis (a.k.a. EDA). I will start with the basics of EDA and then move on to more complex areas in subsequent articles.
EDA refers to a set of tools that helps analyze data in order to describe and explore its characteristics before building models. It is not a means to extract intelligence from data by itself, but a set of techniques that supports model building. EDA is not a new term or approach; it has been around for a long time, championed by John W. Tukey starting in the 1960s. From my experience with EDA so far, it looks to me like the work of a detective. Just as a detective tries to find clues in everything related to a crime, EDA tries to identify important clues in the data that help discover the rules behind why a certain event occurred. And those rules, when applied to similar data, tell us what event is going to happen. Isn't that cool?
I want to run this as a series of articles, and in each article I will explain a specific concept. In this article, I will talk about two basic aspects that are core to EDA:
- Measure of central tendency
- Measure of dispersion
Both of them are descriptive measures. Central tendency means that observations tend to group around a middle value. Such a value is called a measure of central tendency, a measure of location, or a statistical average. The mean, the median and the mode fall under this category.
The mean (or arithmetic mean) is the simplest way to calculate the middle value, and we are all familiar with it from our school days. The arithmetic mean is, however, susceptible to extreme values and data fluctuations.
The median is the middle value: 50% of the observations lie above it and the remaining 50% lie below it. It has better resistance to extreme values and data fluctuations.
The mode is the typical value: the value that occurs most often.
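As a quick illustration, here is a minimal sketch using Python's built-in statistics module; the sample values are made up purely for illustration:

```python
import statistics

# A small made-up sample, reused in the snippets below
data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

print(statistics.mean(data))    # arithmetic mean -> 5.1
print(statistics.median(data))  # middle value    -> 5
print(statistics.mode(data))    # most frequent   -> 3
```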
While central tendency describes the central point of a dataset, dispersion describes the spread of the data. Variance, standard deviation, range and interquartile range are measures of dispersion; skewness and kurtosis, which describe the shape of the distribution, are often discussed alongside them.
The variance is the average of the squared differences from the mean. The standard deviation, calculated as the square root of the variance, measures how far observations typically lie from the mean, expressed in the same units as the data.
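Continuing the sketch above (same made-up sample), note that Python's statistics module distinguishes the population versions (pvariance, pstdev, dividing by n) from the sample versions (variance, stdev, dividing by n - 1):

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

var = statistics.pvariance(data)  # population variance, approx. 5.09
std = statistics.pstdev(data)     # population std dev, approx. 2.26 (== var ** 0.5)
print(var, std)
```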
The range is also a measure of dispersion, calculated as the difference between the maximum and the minimum observation. The range is highly susceptible to outliers.
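The calculation itself is a one-liner (same made-up sample as before):

```python
data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]
print(max(data) - min(data))  # range: 9 - 2 = 7
```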
Quartiles divide the observations into four equal parts: the lowest 25%, the second lowest 25%, the second highest 25% and the highest 25%. The specific quartiles give us an idea of both central tendency and dispersion. The 2nd quartile is the median, which divides the observations into two halves.
The difference between the 3rd and the 1st quartile, in other words the range of the middle 50% of the data, is called the IQR (interquartile range). This is a measure of dispersion that ignores the bottom and the top quarter of the data, thus reducing the influence of outliers.
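Here is one way to compute the quartiles and the IQR with NumPy's percentile function (same made-up sample; NumPy interpolates between data points by default, so other quartile conventions may give slightly different values):

```python
import numpy as np

data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)   # q2 is the median
print(q3 - q1)      # IQR: range of the middle 50%
```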
OK, now we know what the measures of dispersion are, but how do we know how large or how small the dispersion is? That is where the coefficient of variation comes into the picture. It gives an idea of the variation irrespective of the scale. It is the standard deviation relative to the mean (or the standard deviation expressed as a percentage of the mean).
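A minimal sketch, again on the made-up sample:

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

cv = statistics.pstdev(data) / statistics.mean(data)
print(f"CV = {cv:.1%}")  # standard deviation as a percentage of the mean
```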
The z-score is another measure of variation irrespective of the scale. It is calculated by subtracting the mean of the observations from an observation and dividing the difference by the standard deviation. In other words, it expresses how many standard deviations an observation lies from the mean, calibrating the measure against its own distribution.
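And the z-scores for our made-up sample, computed directly from the definition:

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9, 3, 5, 6]

mean = statistics.mean(data)
std = statistics.pstdev(data)
z_scores = [(x - mean) / std for x in data]
print(z_scores)  # values near 0 sit close to the mean
```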
These are all very basic measures, but they are the core set used in EDA. As we go along, we will see that other techniques for describing data revolve around these measures. So it is very important to understand these metrics before delving deeper into EDA.