Anscombe's quartet - Can statistical properties describe realistic datasets?

Statistics is a crucial discipline behind how we draw inferences from data and make decisions.

Two of the best-known examples are:

1. The p-value in hypothesis testing, and

2. The mean and standard deviation of data distributions.

The primary objective of this article is to illustrate the importance of looking at a dataset graphically before analyzing it in terms of a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets.

Let's take Anscombe's quartet, which comprises four datasets that have nearly identical simple descriptive statistics yet very different distributions, and which appear very different when graphed. Each dataset consists of eleven (x, y) points.

Four Datasets
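For reference, here is a minimal Python sketch that builds the quartet from the values published in Anscombe (1973). The variable names (`quartet`, `x_123`, `x_4`) are my own choices for illustration; if seaborn is available, the same data can also be loaded with `seaborn.load_dataset("anscombe")`.

```python
import pandas as pd

# Datasets I-III share the same x values; dataset IV has its own x column.
x_123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
x_4   = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]

quartet = {
    "I":   pd.DataFrame({"x": x_123,
                         "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                               7.24, 4.26, 10.84, 4.82, 5.68]}),
    "II":  pd.DataFrame({"x": x_123,
                         "y": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                               6.13, 3.10, 9.13, 7.26, 4.74]}),
    "III": pd.DataFrame({"x": x_123,
                         "y": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                               6.08, 5.39, 8.15, 6.42, 5.73]}),
    "IV":  pd.DataFrame({"x": x_4,
                         "y": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                               5.25, 12.50, 5.56, 7.91, 6.89]}),
}
```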

Anscombe described his article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."

The quartet is used to demonstrate both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties.

Let's take a look at the statistical properties of all four datasets to understand how summary statistics can sometimes be deceiving, and how graphical illustration during analysis helps counter the impression stated above.


All four sets are identical when examined using simple summary statistics, but vary considerably when graphed.
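A short sketch, continuing from the `quartet` dictionary above and assuming numpy and scipy are available, computes the classic summary statistics. For every dataset they come out nearly identical: mean of x ≈ 9, sample variance of x ≈ 11, mean of y ≈ 7.50, sample variance of y ≈ 4.12, correlation ≈ 0.816, and fitted line y ≈ 3.00 + 0.500x.

```python
import numpy as np
from scipy import stats

for name, df in quartet.items():
    x, y = df["x"].to_numpy(), df["y"].to_numpy()
    # Ordinary least-squares fit and Pearson correlation for this dataset.
    slope, intercept, r, p, stderr = stats.linregress(x, y)
    print(f"Dataset {name}: "
          f"mean(x)={x.mean():.2f}, var(x)={x.var(ddof=1):.2f}, "
          f"mean(y)={y.mean():.2f}, var(y)={y.var(ddof=1):.2f}, "
          f"r={r:.3f}, fit: y={intercept:.2f} + {slope:.3f}x")
```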

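To see the divergence, a matplotlib sketch along the following lines (again assuming the `quartet` dictionary defined earlier) reproduces the familiar four-panel figure, each panel showing the scatter of points and its least-squares line. The fitted line is nearly identical in every panel, while the point clouds are not.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (name, df) in zip(axes.flat, quartet.items()):
    x, y = df["x"].to_numpy(), df["y"].to_numpy()
    slope, intercept, *_ = stats.linregress(x, y)
    xs = np.linspace(2, 20, 100)
    ax.scatter(x, y)
    ax.plot(xs, intercept + slope * xs, color="red")  # near-identical fit in each panel
    ax.set_title(f"Dataset {name}")
plt.tight_layout()
plt.show()
```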
  • The first scatter plot (top left) appears to be a simple linear relationship, corresponding to two correlated variables where y could be modelled as Gaussian with mean linearly dependent on x.
  • The second graph (top right) is not distributed normally; while a relationship between the two variables is obvious, it is not linear, and the Pearson correlation coefficient is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate.
  • In the third graph (bottom left), the relationship is linear, but the data should have a different regression line (a robust regression, such as the sketch after this list, would have been called for). The calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
  • Finally, the fourth graph (bottom right) shows an example where one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.
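As an illustration of the robust fit mentioned for the third dataset, here is a hedged sketch using scikit-learn's HuberRegressor, one of several reasonable robust estimators (it is not the specific method Anscombe discussed). It down-weights the single outlier and recovers a line much closer to the trend followed by the other ten points, whereas the ordinary least-squares line is pulled toward the outlier.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

df = quartet["III"]                      # the dataset with one large outlier
X = df["x"].to_numpy().reshape(-1, 1)
y = df["y"].to_numpy()

ols = LinearRegression().fit(X, y)       # pulled toward the outlier
huber = HuberRegressor().fit(X, y)       # largely ignores the outlier

print(f"OLS:   y = {ols.intercept_:.2f} + {ols.coef_[0]:.3f}x")
print(f"Huber: y = {huber.intercept_:.2f} + {huber.coef_[0]:.3f}x")
```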

Source: Anscombe's quartet


