Exploratory Data Analysis (EDA) elucidated using the Golden Circle – Why?, How?, & What?

Amit T

Product Thinketh | Product Leader | Product Manager

Published: February 15, 2017

Data is a buzzword in today’s IT industry. Anything and everything in today’s world is data. A commonly cited figure for the rate at which data is generated and collected is 2.5 exabytes (2.5e+9 gigabytes) per day. The question that arises is what to do with this much data. The first answer that comes to mind is to generate analytics that help meet the ever-increasing demands of the business. However, “generate analytics” is not as easy as it sounds in conversation. To generate analytics that address business demands and needs, the data must first be examined across all the variables it contains. This process is called Data Exploration or Exploratory Data Analysis (EDA).

Following Simon Sinek’s famous TED talk on the Golden Circle, we will start with “WHY = The Purpose”.

This is the first and most important step in data analysis. Data intended for analytics is collected from various sources, so its quality is always in question. Quality checks are therefore required to:

find mistakes in the data
spot patterns in how the data was collected
detect any violations of statistical assumptions
generate hypotheses

The earlier a data anomaly is identified, the better for data processing and analytics generation.
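A minimal sketch of such quality checks, using only the Python standard library. The records, column names, and the valid age range are all invented for illustration; real checks depend on the data at hand.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "Mumbai"},
    {"id": 3, "age": 29, "city": None},
    {"id": 3, "age": 29, "city": None},     # exact duplicate of the previous row
    {"id": 4, "age": -5, "city": "Delhi"},  # impossible value
]

def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)

def duplicate_rows(rows):
    """Return every row that appears more than once, listed once each."""
    seen = Counter(
        tuple(sorted(row.items(), key=lambda item: item[0])) for row in rows
    )
    return [dict(key) for key, n in seen.items() if n > 1]

def out_of_range(rows, column, low, high):
    """Return rows whose value in `column` lies outside [low, high]."""
    return [
        row for row in rows
        if row[column] is not None and not (low <= row[column] <= high)
    ]

print(missing_counts(records))
print(duplicate_rows(records))
print(out_of_range(records, "age", 0, 120))
```

Each function flags a different kind of anomaly (gaps, repeats, implausible values) so that problems surface before any analytics are built on top of the data.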

Before moving to the second part of the Golden Circle, “How?”, we need to understand how data is defined and what its associated dimensions are.

There are mainly two types of data – categorical & quantitative. Categorical variables are further sub-divided into three categories – binary, nominal, & ordinal. The nature of the values of these variables is:

Binary has at most two categories, e.g. Yes/No
Nominal has more than two categories with no inherent order
Ordinal can also have more than two categories, but the order of the values matters

Quantitative data variables are sub-divided into two categories – discrete & continuous

Discrete takes countable numerical values, e.g. 0, 1, 2, …
Continuous takes any value in an uninterrupted range, e.g. 1.2, 2.75, 3.14
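The practical difference between nominal and ordinal variables can be sketched in a few lines. The survey answers and the `rating_order` list below are hypothetical, chosen only to show which operations make sense for each type.

```python
from collections import Counter

# Hypothetical survey answers, invented for illustration.
colors = ["red", "blue", "red", "green"]            # nominal: no order
ratings = ["high", "low", "medium", "low", "high"]  # ordinal: low < medium < high

# Nominal data supports counting and the mode, but no meaningful sorting.
color_counts = Counter(colors)
most_common_color = color_counts.most_common(1)[0][0]

# Ordinal data needs an explicit order; alphabetical sorting would be wrong.
rating_order = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(rating_order)}
sorted_ratings = sorted(ratings, key=lambda level: rank[level])

# With an order defined, a median category becomes meaningful.
median_rating = sorted_ratings[len(sorted_ratings) // 2]

print(most_common_color, sorted_ratings, median_rating)
```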

There are primarily three dimensions for any given data set – univariate, bivariate, & multivariate. The dimension is determined by the number of variables measured per subject.

“HOW = The Process” covers the ways to summarize the data:

Central location measures are summary measures that identify the MEAN, MODE, & MEDIAN of the data population. They help describe the nature of the whole data set available to you. Each of them describes a central value of the data distribution.
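As a small illustration, the three measures can be computed with Python’s standard `statistics` module. The numbers are made up; the second list adds one extreme value to show how differently the mean and median react.

```python
import statistics

# Hypothetical observations, invented for illustration.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum / count = 30 / 6
median = statistics.median(values)  # average of the two middle values, 3 and 5
mode = statistics.mode(values)      # most frequent value

# One extreme observation pulls the mean far more than the median.
with_outlier = values + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print(mean, median, mode)
print(round(mean_with_outlier, 2), median_with_outlier)
```

Comparing the mean and the median is a quick first check for skew or outliers: when they sit far apart, the distribution is unlikely to be symmetric.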

Variability measures describe the spread of the data set. The terms variability, spread, & dispersion are synonyms and refer to how spread out the distribution is. These measures include the range, interquartile range, variance, & standard deviation.
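A sketch of these four measures with the standard `statistics` module, on made-up data. Note two assumptions: the population formulas (`pvariance`, `pstdev`) are used, and quartiles come from `statistics.quantiles` with its default method; other tools may use slightly different quartile conventions.

```python
import statistics

# Hypothetical observations, invented for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: distance between the largest and smallest observation.
data_range = max(values) - min(values)

# Interquartile range: width of the middle half of the data.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Population variance and standard deviation (divide by n).
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

# Sample versions divide by n - 1 instead, so they are slightly larger.
sample_variance = statistics.variance(values)

print(data_range, iqr, variance, std_dev, sample_variance)
```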

Relative standing measures are the next step after variability measures. They show where a value in the data set stands relative to the entire population distribution. With an idea of relative standing, the quality of the data can be predicted, along with how well the data will make sense in analytics. The statistical methods used to find relative standing are the z-score, quartiles & percentiles, and box plots.
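The sketch below shows one way to compute a z-score, a percentile rank, and the outlier fences that a box plot draws. The data is invented, the helper names `z_score` and `percentile_rank` are hypothetical, and the 1.5 × IQR fence is the usual box-plot convention rather than anything specific to this article.

```python
import statistics

# Hypothetical observations, invented for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.mean(values)
sigma = statistics.pstdev(values)

def z_score(x):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mu) / sigma

def percentile_rank(x, data):
    """Percentage of observations that are less than or equal to x."""
    return 100 * sum(1 for v in data if v <= x) / len(data)

# Box-plot style fences: points beyond 1.5 * IQR from the quartiles are flagged.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(z_score(9), percentile_rank(5, values), outliers)
```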

Moving to the “WHAT = The Result” part of the Golden Circle: as the saying goes, “A good picture is equivalent to more than 1,000 words”, and the inferences drawn from the analysis above lead to a more pictorial representation of the data. As a statistical rule of thumb for approximately normally distributed data, the data explored is considered good based on these checks:

Nearly 65-70% of the observations lie within one standard deviation of the mean
More than 90% of the observations lie within two standard deviations of the mean
Quartiles & the interquartile range
Percentiles, read the same way as quartiles
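The first two checks can be tried out on simulated data. This is only a sketch: the sample is drawn from a normal distribution with an arbitrary mean of 50 and standard deviation of 10, and the helper `share_within` is a hypothetical name. For real data that is skewed or heavy-tailed, the shares can differ noticeably.

```python
import random
import statistics

# Simulated, approximately normal data; the parameters are arbitrary.
random.seed(0)
sample = [random.gauss(50, 10) for _ in range(10_000)]

mu = statistics.mean(sample)
sigma = statistics.pstdev(sample)

def share_within(data, center, spread, k):
    """Fraction of observations within k standard deviations of the center."""
    inside = sum(1 for x in data if abs(x - center) <= k * spread)
    return inside / len(data)

within_1 = share_within(sample, mu, sigma, 1)  # expected to be near 0.68
within_2 = share_within(sample, mu, sigma, 2)  # expected to be near 0.95

print(f"within 1 SD: {within_1:.1%}, within 2 SD: {within_2:.1%}")
```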

For univariate data, bar plots and histograms are best suited; which one to use depends on the kind of variable (categorical or continuous) being plotted.

For bivariate data, different types of plots are used depending on the combination of variables:

Categorical & Categorical – Crosstabs or Stacked Bar Plot
Categorical & Continuous – Box Plot
Continuous & Continuous – Scatter Plot
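Before plotting, the numbers behind two of these views can be computed directly. The sketch below builds a crosstab for two categorical variables and a Pearson correlation for two continuous ones; the data, the variable names, and the `pearson` helper are all invented for illustration.

```python
import math
from collections import Counter

# Categorical & categorical: a crosstab is a count per pair of categories.
# Hypothetical (subscribed, region) pairs.
pairs = [("yes", "north"), ("no", "north"), ("yes", "south"), ("yes", "north")]
crosstab = Counter(pairs)

# Continuous & continuous: Pearson correlation summarizes what a scatter
# plot shows, i.e. the strength of the linear relationship.
def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

hours = [1, 2, 3, 4, 5]    # hypothetical study hours
scores = [2, 4, 6, 8, 10]  # hypothetical scores, exactly 2 * hours
r = pearson(hours, scores)

print(crosstab[("yes", "north")], round(r, 6))
```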

For multivariate data, the important tasks are clustering, i.e. organizing units into clusters, and data reduction using principal component analysis (PCA).
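A minimal sketch of the data reduction step, assuming NumPy is available. It implements PCA through the singular value decomposition of the centered data; the `pca` function name and the synthetic three-column data set are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    return scores, components, explained_ratio[:n_components]

# Synthetic data: the second column nearly duplicates the first, and the
# third is small noise, so one component should carry most of the variance.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.1, size=200),
])

scores, components, ratio = pca(X, n_components=1)
print(scores.shape, ratio)
```

Here three correlated columns collapse into a single score per row while keeping most of the variance, which is the sense in which PCA reduces data.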

To conclude, exploratory data analysis is a very important and time-consuming step. However, with languages and tools available in the market such as SAS, R, and Python, and with the steps understood correctly, it becomes a less time-consuming and very interesting step in the entire data science process.
