Exploratory Data Analysis (EDA) Elucidated Using the Golden Circle – Why?, How?, & What?

Data is a buzzword in today’s IT industry. Anything and everything in today’s world is data. An often-quoted estimate puts the rate at which data is generated and collected at 2.5 exabytes (2.5e+9 gigabytes) per day. The question then arises of what to do with this much data. The first answer that comes to mind is to generate analytics that help meet the ever-increasing demands of the business. However, “generating analytics” is not as easy as it sounds in conversation. To generate analytics that address business demands and needs, the data must be examined across all the variables it contains, and this process is called Data Exploration or Exploratory Data Analysis (EDA).

Following Simon Sinek’s famous TED talk on the Golden Circle, we will start with “WHY = The Purpose”.

This is the first and most important thing in data analysis. Data that is generated and intended for analytics is collected from various sources, so its quality is always in question. Quality checks are therefore required to identify

  • mistakes in the data
  • patterns in the data that has been collected
  • any violations of statistical assumptions
  • and to generate hypotheses

The earlier a data anomaly is identified, the better it is for data processing and analytics generation.
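As a rough illustration of such quality checks, the sketch below counts missing values, out-of-range values, and duplicate rows using only the Python standard library. The records, the column names, and the valid age range of 0 to 120 are all made up for this example; real checks depend on the data at hand.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "Delhi"},
    {"id": 3, "age": 290, "city": "Pune"},  # implausible age
    {"id": 3, "age": 290, "city": "Pune"},  # exact duplicate of the row above
]


def quality_report(rows, valid_ranges):
    """Count missing values, out-of-range values, and duplicate rows."""
    missing = Counter()
    out_of_range = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                missing[column] += 1
            elif column in valid_ranges:
                low, high = valid_ranges[column]
                if not low <= value <= high:
                    out_of_range[column] += 1
    # A row counts as a duplicate when an identical row appeared earlier.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {
        "missing": dict(missing),
        "out_of_range": dict(out_of_range),
        "duplicates": duplicates,
    }


# The valid range for "age" is an assumption made for this example.
report = quality_report(records, valid_ranges={"age": (0, 120)})
print(report)
```

Running checks like these before any modelling keeps obvious anomalies from leaking into later steps.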

Before moving to the second part of the Golden Circle, “How?”, we need to understand how data is defined and what its associated dimensions are.

There are two main types of data – categorical and quantitative. Categorical variables are further subdivided into three categories – binary, nominal, and ordinal. The nature of the values of these variables is as follows:

  • Binary has at most two different categories, e.g. Yes/No
  • Nominal has more than two categories with no inherent order
  • Ordinal can have more than two categories, and the order of the values matters, e.g. Low/Medium/High

Quantitative data variables are subdivided into two categories – discrete and continuous:

  • Discrete takes countable numerical values, e.g. 0, 1, 2, …
  • Continuous takes uninterrupted values anywhere within a range, e.g. 1.25, 2.5, 3.75, …
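The variable types described above can be told apart with a simple heuristic, sketched below. The function `classify_variable` and its `ordered_levels` parameter are hypothetical names introduced for this example. Note that order cannot be inferred from the values alone, so an ordinal variable has to be declared by supplying its levels, and a binary variable coded as 0/1 would be reported as discrete.

```python
def classify_variable(values, ordered_levels=None):
    """Guess the type of a variable from its values (a heuristic, not a rule).

    ordered_levels is an optional list giving the order of the categories.
    """
    distinct = set(values)
    if all(isinstance(v, (bool, str)) for v in values):
        if len(distinct) <= 2:
            return "binary"
        if ordered_levels is not None and distinct <= set(ordered_levels):
            return "ordinal"
        return "nominal"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "discrete"
    return "continuous"


print(classify_variable(["Yes", "No", "Yes"]))
print(classify_variable(["red", "green", "blue"]))
print(classify_variable(["low", "high", "medium"],
                        ordered_levels=["low", "medium", "high"]))
print(classify_variable([0, 1, 3, 7]))
print(classify_variable([1.25, 2.5, 3.75]))
```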

Primarily, there are three dimensions for any given data set – univariate, bivariate, and multivariate. The dimension is decided by the number of variables measured per subject: one, two, or more than two, respectively.

“HOW = The Process”: the measures used to summarize the data are:

Central location measures: these are summary measures that identify the MEAN, MEDIAN, and MODE of the data population. They help describe the nature of the whole data set available to you. Each of these measures describes a central value in the data distribution.
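A minimal sketch of the three central location measures, using Python's built-in `statistics` module. The `scores` list is invented for illustration and includes one large value to show how the mean and the median can differ.

```python
import statistics

# Made-up sample; the value 20 is a deliberate outlier.
scores = [2, 3, 3, 4, 5, 5, 5, 7, 20]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
```

Here the single outlier pulls the mean above the median, which is one reason to look at all three measures rather than just one.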

Variability measures: these describe the spread of the data set. The terms variability, spread, and dispersion are synonyms and refer to how spread out a distribution is. These measures include the range, interquartile range, variance, and standard deviation.
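The four variability measures can be computed as in the sketch below, again with the standard `statistics` module. The `values` list is invented, and the sample (n - 1) versions of variance and standard deviation are used here as an assumption; `statistics.pvariance` and `statistics.pstdev` would give the population versions instead.

```python
import statistics

values = [4, 8, 15, 16, 23, 42]  # made-up sample

value_range = max(values) - min(values)

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
# (using the module's default "exclusive" method).
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

variance = statistics.variance(values)  # sample variance
std_dev = statistics.stdev(values)      # sample standard deviation

print("range:", value_range)
print("interquartile range:", iqr)
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
```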

Relative standing measures: this is the next step after the variability measures. These measures show where a value in the data set stands relative to the entire population distribution. With the idea of relative standing, the quality of the data can be predicted, along with how well the data will make sense for analytics. The statistical methods used to find relative standing are the z-score, quartiles and percentiles, and box plots.
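The sketch below shows one way to compute these relative standing measures with the standard library. The data is invented, and the 1.5 × IQR fences are the usual box plot convention for flagging outliers, assumed here rather than taken from the article.

```python
import statistics

data = [12, 15, 17, 19, 21, 22, 24, 26, 30, 75]  # made-up sample

mean = statistics.mean(data)
std_dev = statistics.stdev(data)


def z_score(x):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


# Quartiles, and the fences a box plot would draw its whiskers against.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Percentiles are quantiles with n=100, which gives 99 cut points.
percentiles = statistics.quantiles(data, n=100)
p90 = percentiles[89]  # 90th percentile

print("z-score of 75:", round(z_score(75), 2))
print("quartiles:", q1, q2, q3)
print("box plot outliers:", outliers)
print("90th percentile:", p90)
```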

Moving to the “WHAT = The Result” part of the Golden Circle: as the saying goes, “A good picture is worth more than 1,000 words.” The inferences drawn from the analysis above lead to a more pictorial representation of the data. As per statistical guidelines, the exploratory data analysis performed on the data is considered good when:

  • Nearly 65-70% of the observations lie within one standard deviation of the mean
  • More than 90% of the observations lie within two standard deviations of the mean
  • Quartiles and the interquartile range have been examined
  • Percentiles have been examined in the same way as quartiles
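The first two checks in the list above can be sketched as follows. The sample is simulated from a normal distribution with an arbitrary mean of 50 and standard deviation of 10, purely for illustration; for normally distributed data the shares come out near 68% and 95%, in line with the thresholds above. Real data that is skewed or heavy-tailed may not meet them.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable
sample = [random.gauss(50, 10) for _ in range(10_000)]  # simulated data

center = statistics.mean(sample)
spread = statistics.pstdev(sample)


def share_within(values, center, spread, k):
    """Fraction of values within k standard deviations of the center."""
    inside = sum(1 for v in values if abs(v - center) <= k * spread)
    return inside / len(values)


within_1 = share_within(sample, center, spread, 1)
within_2 = share_within(sample, center, spread, 2)

print(f"within 1 standard deviation: {within_1:.1%}")
print(f"within 2 standard deviations: {within_2:.1%}")
```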

For univariate data, bar plots and histograms are best suited; which one to use depends on the kind of variable (categorical or continuous) being plotted.
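To show the difference between the two plot types without a plotting library, the sketch below builds the counts behind each one and prints them as text bars. The helper names `bar_counts` and `histogram_counts`, the sample data, and the choice of four equal-width bins are assumptions made for this example.

```python
from collections import Counter


def bar_counts(categories):
    """Bar plot data: one count per category."""
    return Counter(categories)


def histogram_counts(values, bin_count):
    """Histogram data: counts per equal-width bin over a continuous range."""
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    if width == 0:  # all values identical: everything lands in one bin
        return [len(values)] + [0] * (bin_count - 1)
    counts = [0] * bin_count
    for v in values:
        # The maximum value is placed in the last bin.
        index = min(int((v - low) / width), bin_count - 1)
        counts[index] += 1
    return counts


cities = ["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"]  # categorical
ages = [21, 22, 25, 31, 34, 38, 45, 52, 58, 60]                # continuous

city_counts = bar_counts(cities)
for city, count in city_counts.most_common():
    print(f"{city:<8}{'#' * count}")

age_counts = histogram_counts(ages, bin_count=4)
for index, count in enumerate(age_counts):
    print(f"bin {index}  {'#' * count}")
```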

For bivariate data, different types of plots are used depending on the combination of variables:

  • Categorical & Categorical – Crosstabs or Stacked Bar Plot
  • Categorical & Continuous – Box Plot
  • Continuous & Continuous – Scatter Plot
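Each combination above has a numeric summary that sits behind the plot. The sketch below computes a crosstab for two categorical variables, per-group medians for a categorical and a continuous variable, and the Pearson correlation for two continuous variables. The Pearson coefficient is a common companion to a scatter plot and is my choice here, not something the article prescribes. All data and function names are invented for illustration.

```python
import math
import statistics
from collections import Counter, defaultdict


def crosstab(xs, ys):
    """Categorical & categorical: count each pair of categories."""
    return Counter(zip(xs, ys))


def grouped_medians(categories, values):
    """Categorical & continuous: median per group (the box plot's middle line)."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    return {c: statistics.median(vs) for c, vs in groups.items()}


def pearson_r(xs, ys):
    """Continuous & continuous: linear correlation between -1 and 1."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up observations for five subjects.
gender = ["F", "M", "F", "M", "F"]
smoker = ["yes", "no", "no", "no", "yes"]
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

table = crosstab(gender, smoker)
medians = grouped_medians(gender, score)
r = pearson_r(hours, score)

print(dict(table))
print(medians)
print("correlation:", round(r, 3))
```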

For multivariate data, the important tasks are clustering, i.e. organizing units into clusters, and data reduction using Principal Component Analysis (PCA).
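As a rough sketch of the data reduction idea only (clustering is not covered here), the code below finds the first principal component of a small two-variable data set using power iteration on the covariance matrix. Power iteration is a simple stand-in chosen to keep the example dependency-free; libraries normally use an eigendecomposition or SVD. The data points and function names are illustrative assumptions.

```python
import math


def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector and eigenvalue by power iteration.

    Assumes the start vector is not orthogonal to the dominant direction.
    """
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
    cov_vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v * cv for v, cv in zip(vec, cov_vec))
    return vec, eigenvalue


# Made-up observations of two correlated variables.
rows = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

cov, centered = covariance_matrix(rows)
component, eigenvalue = first_principal_component(cov)

# Share of the total variance captured by the first component.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_share = eigenvalue / total_variance

# Reduced data: each two-variable row becomes a single score.
reduced = [sum(c * w for c, w in zip(row, component)) for row in centered]

print("first component:", [round(w, 3) for w in component])
print(f"explained variance share: {explained_share:.1%}")
```

With these points, one component captures most of the variance, so the two variables could be replaced by the single `reduced` score with little loss.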

To conclude, exploratory data analysis is a very important and time-consuming step. However, with the languages available on the market, such as SAS, R, and Python, and with the steps correctly understood, it becomes a less time-consuming and very interesting step in the entire data science process.
