
Exploring Quantitative Data

Basan Shrestha

Monitoring and Evaluation Specialist at Prime Minister Employment Programme

Published: May 9, 2020

Exploring data is a kind of data health check: it shows whether outliers pull the distribution away from normal. This matters because many parametric inferential statistics assume normally distributed data. Several summary functions and graphical displays are used to explore data. I illustrate these measures with the daily database on COVID-19 incidence worldwide as of April 29, 2020, which I downloaded from the European Centre for Disease Prevention and Control website. The dataset has 109 daily records. This article should be read along with my previous article entitled ‘Describing Quantitative Data’.

Five Number Summary

The minimum value, the first quartile (Q1) or 25th percentile, the median or 50th percentile, the third quartile (Q3) or 75th percentile, and the maximum value are the five summary numbers used to explore data.
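As a minimal sketch of how these five numbers can be computed, the Python snippet below uses only the standard library. The function name `five_number_summary` and the sample values are made up for illustration and are not the ECDC data. Quartile conventions differ between statistical packages, so the `inclusive` method assumed here may give slightly different Q1 and Q3 values than other software.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles; 'inclusive' interpolates
    # linearly between the observed minimum and maximum.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Made-up illustrative counts (an assumption, not the article's dataset)
sample = [2, 4, 4, 5, 7, 9, 10, 12, 40]
summary = five_number_summary(sample)
print(summary)
```

In this made-up sample the maximum lies much farther from the median than the minimum does, which is the pattern discussed for positively skewed data.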

In a normal distribution, the median lies at the center. Because the distribution is symmetric, the distance between the minimum value and the median is roughly equal to the distance between the median and the maximum value. Likewise, the distance between the minimum value and Q1 is roughly equal to the distance between Q3 and the maximum value, as shown by the black and red arrows in figure 1.

Figure 1: Five Number Summary in Normal Distribution

In a positively skewed distribution, the distance between the minimum value and the median is smaller than the distance between the median and the maximum value, as shown by the arrows in figure 2. Likewise, the distance between the minimum value and Q1 is smaller than the distance between Q3 and the maximum value.

Figure 2: Five Number Summary in Positively Skewed Distribution

In a negatively skewed distribution, the distance between the minimum value and the median is greater than the distance between the median and the maximum value, as shown by the black and red arrows in figure 3. Likewise, the distance between the minimum value and Q1 is greater than the distance between Q3 and the maximum value.

Figure 3: Five Number Summary in Negatively Skewed Distribution

In the given dataset, the minimum value is 1 and the maximum value is 101,728. Q1, the median, and Q3 are 1,761.5, 3,907, and 66,403.5, respectively. The distance between the minimum value and the median is 3,906, which is far smaller than the distance of 97,821 between the median and the maximum value. This is consistent with figure 2, indicating that the data are positively skewed.
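The comparison can be reproduced with a few lines of Python. This is only a sketch: the five values are the ones reported above, while the helper name `skew_direction` is a hypothetical label introduced for illustration.

```python
# Five-number summary as reported in the article
minimum, q1, median, q3, maximum = 1, 1761.5, 3907, 66403.5, 101728

def skew_direction(lower_distance, upper_distance):
    """Compare the lower and upper distances around the center."""
    if upper_distance > lower_distance:
        return "positive"
    if upper_distance < lower_distance:
        return "negative"
    return "symmetric"

# Distances from the median to each end of the data
lower_half = median - minimum   # 3906
upper_half = maximum - median   # 97821

# Distances from each quartile to the nearest end of the data
lower_tail = q1 - minimum       # 1760.5
upper_tail = maximum - q3       # 35324.5

print(skew_direction(lower_half, upper_half))  # positive
print(skew_direction(lower_tail, upper_tail))  # positive
```

Both comparisons point the same way, so the quartile distances agree with the median distances that the data are positively skewed.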

Normality Checks

A box plot displays the five-number summary, as shown in figure 4. The thick black line inside the box marks the median. The lower and upper hinges, or boundaries, of the box mark Q1 and Q3, respectively. The whiskers below and above the hinges extend to the minimum and maximum values when, as here, no outliers are flagged. A box plot also marks outliers and extreme values. Outliers are values that lie more than one and a half times the interquartile range (the difference between Q3 and Q1) below Q1 or above Q3. The given dataset has no statistical outliers, as none are marked on the box plot. However, there are five extreme values above Q3, from 86,046 to the maximum number of persons infected in a day. Likewise, there are five extreme values below Q1, from 17 down to the minimum number of infected cases. Those values have led to the skewed distribution.

Figure 4: Box plot
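A common convention, often called Tukey's rule, flags a value as an outlier when it lies more than 1.5 times the interquartile range (Q3 minus Q1) below Q1 or above Q3. The sketch below applies that rule to the quartiles reported in this article. The function name `tukey_fences` is a hypothetical helper, and the 1.5 multiplier is the conventional choice rather than something read from the figure.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and range as reported in the article
q1, q3 = 1761.5, 66403.5
minimum, maximum = 1, 101728

lower_fence, upper_fence = tukey_fences(q1, q3)

# A value is an outlier only if it falls outside the fences
has_outliers = minimum < lower_fence or maximum > upper_fence
print(lower_fence, upper_fence, has_outliers)
```

With an interquartile range of 64,642, the fences fall at -95,201.5 and 163,366.5. Both the minimum and the maximum lie inside them, which agrees with the box plot showing no outliers even though the distribution is skewed.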

The stem and leaf plot shows the shape of the distribution. The values cluster below 10,000 cases per day, where 62 of the data points lie. Values from 50,000 to below 90,000 cases form a second cluster, as shown in figure 5. This shows that the distribution is positively skewed.

Figure 5: Stem and leaf plot
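To show how such a display is built, here is a minimal text-based sketch in Python. The function name `stem_and_leaf`, the `stem_unit` parameter, and the sample values are assumptions made for illustration; the sample is not the ECDC data, and statistical packages may format their plots differently.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Build stem-and-leaf rows for non-negative integers.

    stem_unit must be a power of ten of at least 10. The stem is
    value // stem_unit and the leaf is the next digit to the right.
    """
    leaf_unit = stem_unit // 10
    rows = defaultdict(list)
    for value in sorted(values):
        stem, rest = divmod(value, stem_unit)
        rows[stem].append(rest // leaf_unit)
    lines = []
    # Include empty stems so that gaps in the data stay visible
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

# Made-up illustrative values (an assumption, not the article's dataset)
sample = [3, 5, 8, 11, 12, 12, 17, 21, 24, 35, 58, 62]
plot = stem_and_leaf(sample)
print("\n".join(plot))
```

Most leaves in this made-up sample pile up on the lowest stems while a few values trail off on the higher stems, which is how a positive skew appears in this kind of plot.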

A normal Q-Q plot displays a straight line that represents the values expected under a normal distribution. The observed incidence values deviate distinctly from that line, indicating a positively skewed distribution, as shown in figure 6.

Figure 6: Normal Q-Q Plot
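The points behind a normal Q-Q plot can be sketched with Python's standard library. The function name `normal_qq_points`, the plotting positions (i - 0.5) / n, and the sample values are assumptions for illustration. Statistical packages use several different plotting-position formulas, so exact coordinates may differ from those in figure 6.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (expected, observed) pairs for a normal Q-Q plot."""
    data = sorted(values)
    n = len(data)
    # Normal distribution with the sample's own mean and standard deviation
    fitted = NormalDist(mean(data), stdev(data))
    # Plotting positions (i - 0.5) / n are one common convention
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))

# Made-up illustrative counts (an assumption, not the article's dataset)
sample = [2, 4, 4, 5, 7, 9, 10, 12, 40]
points = normal_qq_points(sample)
for expected_value, observed_value in points:
    print(f"{expected_value:8.2f} {observed_value:6d}")
```

If the data were normal, observed and expected values would be close in every row. In this made-up sample the smallest and largest observations sit above their expected values while the middle one sits below, giving the curved pattern typical of positive skew.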

In a nutshell, exploring data summarizes distribution, identifies outliers, and checks normality. These procedures confirm that the distribution of the given dataset is positively skewed.
