Exploring Quantitative Data

Exploring data is a kind of health check that shows whether outliers or extreme values pull the distribution away from normal. This matters because parametric inferential statistics assume normally distributed data. Several summary functions and graphical displays are used to explore data. I illustrate these measures with the daily data on COVID-19 incidence worldwide as of April 29, 2020, which I downloaded from the European Centre for Disease Prevention and Control website. The dataset has 109 daily records. This article should be read along with my previous article, ‘Describing Quantitative Data’.

Five Number Summary

The minimum value, the first quartile (Q1) or 25th percentile, the median or 50th percentile, the third quartile (Q3) or 75th percentile, and the maximum value are the five summary numbers used to explore data.
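
As a minimal sketch of how these five numbers can be computed, the Python snippet below uses only the standard library. The sample values are made up for illustration and are not the ECDC data. The quartile method ("inclusive") is an assumption; statistical packages use different quartile conventions, so results may differ slightly from the figures reported in this article.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    ordered = sorted(values)
    # method="inclusive" interpolates between the observed minimum and maximum.
    # This is an assumed convention; other software may use a different one.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical daily counts, made up for illustration only.
sample = [1, 3, 4, 8, 15, 22, 40, 95, 310]
summary = five_number_summary(sample)
print(summary)
```

In this made-up sample the distance from the median to the maximum is much larger than the distance from the minimum to the median, which is the pattern the following paragraphs associate with positive skew.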

In a normal distribution, the median lies at the center. The distance between the minimum value and the median equals the distance between the median and the maximum value. Likewise, the distance between the minimum value and Q1 equals the distance between Q3 and the maximum value, as shown by the black and red arrows in figure 1.

Figure 1: Five Number Summary in Normal Distribution

In a positively skewed distribution, the distance between the minimum value and the median is smaller than the distance between the median and the maximum value, as shown by the arrows in figure 2. Likewise, the distance between the minimum value and Q1 is smaller than the distance between Q3 and the maximum value.

Figure 2: Five Number Summary in Positively Skewed Distribution

In a negatively skewed distribution, the distance between the minimum value and the median is greater than the distance between the median and the maximum value, as shown by the black and red arrows in figure 3. Likewise, the distance between the minimum value and Q1 is greater than the distance between Q3 and the maximum value.

Figure 3: Five Number Summary in Negatively Skewed Distribution

In the given dataset, the minimum value is 1 and the maximum value is 101,728. Q1, the median, and Q3 are 1,761.5, 3,907, and 66,403.5, respectively. The distance between the minimum value and the median is 3,906, which is far smaller than the distance of 97,821 between the median and the maximum value. This is consistent with figure 2 and indicates that the data are positively skewed.
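
The comparison above can be checked with a few lines of Python. This is only a sketch: the five numbers are those reported in this article, while the helper name `skew_direction` is hypothetical and simply names the direction that the two distances suggest.

```python
# Five-number summary as reported in this article for the daily incidence data.
minimum, q1, median, q3, maximum = 1, 1761.5, 3907, 66403.5, 101728

def skew_direction(lower_distance, upper_distance):
    """Name the skew suggested by comparing a lower and an upper distance."""
    if upper_distance > lower_distance:
        return "positive"
    if upper_distance < lower_distance:
        return "negative"
    return "symmetric"

# Distances on either side of the median.
median_check = skew_direction(median - minimum, maximum - median)
# Distances in the two tails, outside the quartiles.
quartile_check = skew_direction(q1 - minimum, maximum - q3)

print(median - minimum, maximum - median)
print(q1 - minimum, maximum - q3)
print(median_check, quartile_check)
```

Both comparisons point in the same direction: the upper distances are larger than the lower ones.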

Normality Checks

A box plot displays the five-number summary, as shown in figure 4. The thick black line inside the box marks the median. The lower and upper hinges, or boundaries, of the box mark Q1 and Q3, respectively. The whiskers below and above the hinges extend to the smallest and largest values that are not outliers, which in this dataset are the minimum and maximum values. A box plot also flags outliers and extreme values. Outliers are values that lie more than one and a half times the interquartile range (the difference between Q3 and Q1) below Q1 or above Q3. The given dataset has no statistical outliers, as none are marked on the box plot. However, there are five extreme values above Q3, from 86,046 up to the maximum number of persons infected in a day. Likewise, there are five extreme values below Q1, from 17 down to the minimum number of infected cases. Those values have led to the skewed distribution.

Figure 4: Box plot
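
The outlier rule can be sketched in Python using the quartiles reported in this article. The 1.5 multiplier follows the common box plot convention; I am assuming that the software behind figure 4 uses the same rule.

```python
# Quartiles and range as reported in this article for the daily incidence data.
q1, q3 = 1761.5, 66403.5
minimum, maximum = 1, 101728

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this would be outliers
upper_fence = q3 + 1.5 * iqr    # values above this would be outliers

# If even the minimum and maximum sit inside the fences, no value is an outlier.
has_outliers = minimum < lower_fence or maximum > upper_fence
print(iqr, lower_fence, upper_fence, has_outliers)
```

Under these assumptions both the minimum and the maximum fall inside the fences, which agrees with the box plot showing no outliers.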

The stem-and-leaf plot shows the shape of the distribution. Most values cluster below 10,000 cases per day, a range that holds 62 data points. Values from 50,000 to just under 90,000 cases form a second cluster, as shown in figure 5. This shape shows that the distribution is positively skewed.

Figure 5: Stem and leaf plot
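
To show how such a display is built, here is a minimal Python sketch. The function name `stem_and_leaf`, the `stem_unit` parameter, and the sample values are all hypothetical and made up for illustration; they are not the ECDC data or the settings used for figure 5.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Build the text rows of a stem-and-leaf display for non-negative integers.

    stem_unit is assumed to be a power of ten (10, 100, 1000, ...).
    Each leaf is the first digit after the stem; later digits are dropped.
    """
    leaf_unit = stem_unit // 10
    rows = defaultdict(list)
    for value in sorted(values):
        stem, remainder = divmod(value, stem_unit)
        rows[stem].append(remainder // leaf_unit)
    lines = []
    # Include empty stems so that gaps in the distribution stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

# Hypothetical values, made up for illustration only.
sample = [3, 5, 8, 12, 14, 17, 21, 26, 35, 62]
lines = stem_and_leaf(sample)
print("\n".join(lines))
```

Rows with many leaves mark where the values cluster, and a long run of sparse rows toward the higher stems suggests a tail on the right.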

A normal Q-Q plot displays a straight line that represents the values expected under a normal distribution. The observed incidence values deviate distinctly from that line, indicating a positively skewed distribution, as shown in figure 6.

Figure 6: Normal Q-Q Plot
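
The points of a normal Q-Q plot can be sketched in Python as pairs of expected and observed values. The sample below is made up for illustration and is not the ECDC data. The plotting position (i - 0.5) / n is an assumed convention, and the software behind figure 6 may use a different one.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (expected, observed) pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    # Normal distribution with the same mean and standard deviation as the data.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    points = []
    for i, observed in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # assumed plotting position
        points.append((fitted.inv_cdf(p), observed))
    return points

# Hypothetical positively skewed values, made up for illustration only.
sample = [1, 3, 4, 8, 15, 22, 40, 95, 310]
points = normal_qq_points(sample)
for expected, observed in points:
    print(round(expected, 1), observed)
```

If the data were normal, each observed value would sit close to its expected value. In this made-up sample the smallest and largest observed values both lie above their expected values, which gives the curved pattern typical of positive skew.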

In a nutshell, exploring data summarizes distribution, identifies outliers, and checks normality. These procedures confirm that the distribution of the given dataset is positively skewed.
