How to describe a statistical population using R - Part 2: Distribution
Generated by Jesper Martinsson using r

How to describe a statistical population using R - Part 2: Distribution

Besides Location and variability you can also use the distribution as a way to describe your data.

Frequencies and proportions, along with graphical representations like bar plots and histograms, provide a visual overview of your entire dataset. This is helpful in understanding the distribution of the observed values.

Throughout this article we use an example where heights have been randomly measured of 100 oak trees in a area:

Ingen alternativ text angiven f?r den h?r bilden


Frequencies

The array of oak tree heights above really does not give any feeling of how the sample is distributed.

One way of getting a more visual understanding of the distribution of the values is to make a frequency table.

You can create a frequency table in two ways; tallies and counts.

Tallies

A tally represents an observation of a specific height:

Ingen alternativ text angiven f?r den h?r bilden

The use of tallies to create a frequency table provides an overview of the data, offering insights into its distribution. However, this approach is time-consuming and thus not particularly practical. Interestingly, it resemblances a bar plot, which we will create shortly. Before that, we will proceed to illustrate the frequency table using numerical values instead of tallies.

Counts

The number of each height is presented:


Ingen alternativ text angiven f?r den h?r bilden

If you go on and sum these values you’ll find that it adds up to 100. That means all the observations are accounted for. Now we can go on making a plot describing the sample:

Ingen alternativ text angiven f?r den h?r bilden

Do it in r


install.packages("ggplot2")
library(ggplot2)


# Frequencies
? y = c(15,	16,	17,	16,	19,	19,	17,	17,	17,	10,
? ? ? ? 17,	17,	12,	17,	12,	12,	17,	17,	17,	13,
? ? ? ? 17,	17,	13,	17,	14,	14,	17,	17,	17,	15,
? ? ? ? 17,	17,	15,	17,	15,	16,	17,	17,	17,	16,
? ? ? ? 17,	17,	16,	17,	16,	16,	17,	17,	17,	18,
? ? ? ? 17,	17,	18,	17,	18,	18,	17,	17,	17,	19,
? ? ? ? 17,	17,	10,	18,	10,	10,	18,	18,	18,	11,
? ? ? ? 18,	18,	11,	18,	12,	12,	18,	18,	18,	12,
? ? ? ? 18,	18,	13,	18,	14,	14,	18,	18,	18,	15,
? ? ? ? 18,	18,	16,	18,	17,	18,	18,	18,	18,	10
? )
? my_counts = table(y)
? count_df = as.data.frame(my_counts)
??
? # Plot the distribution using bar chart
? font_family = 'sans'
??
? p<-ggplot(data=count_df, aes(x=y, y=Freq)) +
? ? geom_bar(stat="identity",width = 0.5,fill = 'coral') +
? ? scale_y_continuous(name = 'Frequency', limits=c(0,max(count_df$Freq)+0.5),expand = c(0,0)) +
? ? theme(axis.title.x=element_blank(),
? ? ? ? ? axis.text.x=element_text(size=12,family=font_family),
? ? ? ? ? axis.text.y = element_text(size=12),
? ? ? ? ? axis.title.y=element_text(margin=margin(0,20,0,0),size=16, family=font_family, face='bold'),
? ? ? ? ? plot.title = element_text(size = 16, face = "bold", family=font_family),
? ? ? ? ? plot.margin = unit(c(1, 5, 1, 1), "lines")
? ? )
  p        


Proportions

The data can also be represented using proportions. When you divide the frequency for each height by the total number of units in your sample or population, you obtain the proportion corresponding to that height. In other words, this proportion reflects the portion of space occupied by that particular height in the dataset. Additionally, this proportion also signifies the probability of encountering a tree with that specific height in the given area.

Subsequently, we proceed to calculate the proportion for each height by dividing the count for that height by 100, which is the total number (n) of units in the sample:

Ingen alternativ text angiven f?r den h?r bilden

?Be sure everything is correct by summing all proportions; they should add up to 1 exactly.

The plot for proportions looks as follows:

Ingen alternativ text angiven f?r den h?r bilden


Do it in r

library(ggplot2)
# Proportions
? y = c(15,	16,	17,	16,	19,	19,	17,	17,	17,	10,
? ? ? ? 17,	17,	12,	17,	12,	12,	17,	17,	17,	13,
? ? ? ? 17,	17,	13,	17,	14,	14,	17,	17,	17,	15,
? ? ? ? 17,	17,	15,	17,	15,	16,	17,	17,	17,	16,
? ? ? ? 17,	17,	16,	17,	16,	16,	17,	17,	17,	18,
? ? ? ? 17,	17,	18,	17,	18,	18,	17,	17,	17,	19,
? ? ? ? 17,	17,	10,	18,	10,	10,	18,	18,	18,	11,
? ? ? ? 18,	18,	11,	18,	12,	12,	18,	18,	18,	12,
? ? ? ? 18,	18,	13,	18,	14,	14,	18,	18,	18,	15,
? ? ? ? 18,	18,	16,	18,	17,	18,	18,	18,	18,	10
? )
? my_counts = table(y)? 
  proportions = my_counts/sum(my_counts)
? prop_df = as.data.frame(proportions)
??
? font_family = 'sans'
? p<-ggplot(data=prop_df, aes(x=y, y=Freq)) +
? ? geom_bar(stat="identity",width = 0.5,fill = 'coral') +
? ? scale_y_continuous(name = 'Proportion', limits=c(0,max(prop_df$Freq)+0.1),expand = c(0,0)) +
? ? theme(axis.title.x=element_blank(),
? ? ? ? ? axis.text.x=element_text(size=12,family=font_family),
? ? ? ? ? axis.text.y = element_text(size=12),
? ? ? ? ? axis.title.y=element_text(margin=margin(0,20,0,0),size=16, family=font_family, face='bold'),
? ? ? ? ? plot.title = element_text(size = 16, face = "bold", family=font_family),
? ? ? ? ? plot.margin = unit(c(1, 5, 1, 1), "lines")
? ? )?        


The class interval

When plotting observation values on the x-axis, it's not always practical to show each individual value. There might be numerous unique values, and in many cases, certain values might have only one observation.

To address this issue, we can group the values into intervals. For instance, we can group heights into intervals of three units each. For example, a height of 11 meters would fall into the interval 10-12. This range is referred to as the 'class interval,' and the specific height value (in this case, 11) is known as the 'class mark.'

We then proceed to aggregate the frequencies of all heights within that interval and present this information beside the corresponding class interval:

Ingen alternativ text angiven f?r den h?r bilden

Smoother, hey? This table does not take up as much space as the former and present a smoother distribution? of the values. How does this looks like in the plot? Wait a minute. A barplot made on these class intervals is actually called a histogram:?

Ingen alternativ text angiven f?r den h?r bilden

Do it in r

library(ggplot2)
# Class intervals? 
  y = c(15,	16,	17,	16,	19,	19,	17,	17,	17,	10,
? ? ? ? 17,	17,	12,	17,	12,	12,	17,	17,	17,	13,
? ? ? ? 17,	17,	13,	17,	14,	14,	17,	17,	17,	15,
? ? ? ? 17,	17,	15,	17,	15,	16,	17,	17,	17,	16,
? ? ? ? 17,	17,	16,	17,	16,	16,	17,	17,	17,	18,
? ? ? ? 17,	17,	18,	17,	18,	18,	17,	17,	17,	19,
? ? ? ? 17,	17,	10,	18,	10,	10,	18,	18,	18,	11,
? ? ? ? 18,	18,	11,	18,	12,	12,	18,	18,	18,	12,
? ? ? ? 18,	18,	13,	18,	14,	14,	18,	18,	18,	15,
? ? ? ? 18,	18,	16,	18,	17,	18,	18,	18,	18,	10
? )?
  class_intervals = as.data.frame(table(cut(y,seq(min(y),max(y),2))))
? add_last_interval = rbind(class_intervals,data.frame(Var1='(19,19]',Freq=3))
??
? # Remove unwanted characters
??
? class_interval_remove_parantheses = gsub('[()]','',add_last_interval$Var1)
? class_interval_remove_brackets = gsub('\\[|\\]','',class_interval_remove_parantheses)
? add_last_interval$Var1<-class_interval_remove_brackets

? # Create plot
  font_family = 'sans'? 
??
?   ggplot(data=add_last_interval, aes(x=Var1, y=Freq)) +
? ? geom_bar(stat="identity",width = 0.5,fill = 'coral') +
? ? scale_y_continuous(name = 'Frequency', limits=c(0,max(add_last_interval$Freq)+1),expand = c(0,0)) +
? ? scale_x_discrete("Class interval") +
? ? theme(axis.title.x=element_text(margin=margin(20,0,0,0),size=16, family=font_family, face='bold'),
? ? ? ? ? axis.text.x=element_text(size=12,family=font_family),
? ? ? ? ? axis.text.y = element_text(size=12),
? ? ? ? ? axis.title.y=element_text(margin=margin(0,20,0,0),size=16, family=font_family, face='bold'),
? ? ? ? ? plot.title = element_text(size = 16, face = "bold", family=font_family),
? ? ? ? ? plot.margin = unit(c(1, 5, 1, 1), "lines")
? ? )        


Additional methods to create histograms in r

geom_histogram

Ingen alternativ text angiven f?r den h?r bilden


?# Histogram with geom_histogram

  y = c(15,	16,	17,	16,	19,	19,	17,	17,	17,	10,
? ? ? ? 17,	17,	12,	17,	12,	12,	17,	17,	17,	13,
? ? ? ? 17,	17,	13,	17,	14,	14,	17,	17,	17,	15,
? ? ? ? 17,	17,	15,	17,	15,	16,	17,	17,	17,	16,
? ? ? ? 17,	17,	16,	17,	16,	16,	17,	17,	17,	18,
? ? ? ? 17,	17,	18,	17,	18,	18,	17,	17,	17,	19,
? ? ? ? 17,	17,	10,	18,	10,	10,	18,	18,	18,	11,
? ? ? ? 18,	18,	11,	18,	12,	12,	18,	18,	18,	12,
? ? ? ? 18,	18,	13,	18,	14,	14,	18,	18,	18,	15,
? ? ? ? 18,	18,	16,	18,	17,	18,	18,	18,	18,	10
? )?
??
? hist_df=data.frame(y=y)
??
? font_family = 'sans'
? ggplot(data=hist_df, aes(x=y)) +
? ? geom_histogram(bins=3,fill = 'coral')+
? ? scale_y_continuous(name = 'Count',expand = c(0,0)) +
? ? xlab("Height(m)") +
? ? theme(axis.title.x=element_text(margin=margin(20,0,0,0),size=16, family=font_family, face='bold'),
? ? ? ? ? axis.text.x=element_text(size=12,family=font_family),
? ? ? ? ? axis.text.y = element_text(size=12),
? ? ? ? ? axis.title.y=element_text(margin=margin(0,20,0,0),size=16, family=font_family, face='bold'),
? ? ? ? ? plot.title = element_text(size = 16, face = "bold", family=font_family),
? ? ? ? ? plot.margin = unit(c(1, 5, 1, 1), "lines")
? ? )? ??
        

hist() - function

Ingen alternativ text angiven f?r den h?r bilden


# Histogram with hist function
  y = c(15,	16,	17,	16,	19,	19,	17,	17,	17,	10,
? ? ? ? 17,	17,	12,	17,	12,	12,	17,	17,	17,	13,
? ? ? ? 17,	17,	13,	17,	14,	14,	17,	17,	17,	15,
? ? ? ? 17,	17,	15,	17,	15,	16,	17,	17,	17,	16,
? ? ? ? 17,	17,	16,	17,	16,	16,	17,	17,	17,	18,
? ? ? ? 17,	17,	18,	17,	18,	18,	17,	17,	17,	19,
? ? ? ? 17,	17,	10,	18,	10,	10,	18,	18,	18,	11,
? ? ? ? 18,	18,	11,	18,	12,	12,	18,	18,	18,	12,
? ? ? ? 18,	18,	13,	18,	14,	14,	18,	18,	18,	15,
? ? ? ? 18,	18,	16,	18,	17,	18,	18,	18,	18,	10
? )?
? xrange = c(seq(min(y),max(y),2))
? hist(y,?
? ? ? ?breaks = "Sturges",
? ? ? ?freq = FALSE,
? ? ? ?col = "coral",?
? ? ? ?main = NULL,
? ? ? ?ylim = c(0,1),
? ? ? ?xlim= c(min(y),max(y)),?
? ? ? ?xlab = "Oak tree heights (m)",?
? ? ? ?ylab="Probability",
? ? ? ?bty="l",?
? ? ? ?las=1,?
? ? ? ?xaxt="n",
? ? ? ?cex.lab=1.2)
??
? axis(side=1,
? ? ? ?at=xrange,
? ? ? ?labels = xrange,
? ? ? ?pos=0,
? ? ? ?las=1,
? ? ? ?tick=T)n        

Cover photo: Histogram using 10 000

Histogram using 10 000 normally distributed values using the mean and standard deviation of the examples above

Do it in r

# Use rnorm() and geom_hist(
??
? y = round(rnorm(10000,mean(y),sd(y)),0)
? my_counts = table(y)
? count_df = as.data.frame(my_counts)
? proportions = my_counts/sum(my_counts)
? prop_df = as.data.frame(proportions)
??
? font_family = 'sans'
? p<-ggplot(data=prop_df, aes(x=y, y=Freq)) +
? ? geom_bar(stat="identity",width = 0.5,fill = 'coral') +
? ? scale_y_continuous(name = 'Proportion', limits=c(0,max(prop_df$Freq)),expand = c(0,0)) +
? ? theme(axis.title.x=element_blank(),
? ? ? ? ? axis.text.x=element_blank(),
? ? ? ? ? axis.text.y = element_text(size=12),
? ? ? ? ? axis.ticks.x=element_blank(),
? ? ? ? ? axis.title.y=element_text(margin=margin(0,20,0,0),size=16, family=font_family, face='bold'),
? ? ? ? ? plot.title = element_text(size = 16, face = "bold", family=font_family),
? ? ? ? ? plot.margin = unit(c(1, 5, 1, 1), "lines")
? ? ))
 p        


Key takeaways:

  • Frequency tables and histograms are important tools to describe the distribution of a sample or population
  • The frequencies can easily be transformed to probabilities
  • A histogram is a graphical presentation of the distribution of your data

要查看或添加评论,请登录

Jesper Martinsson的更多文章

  • Standard error

    Standard error

    I believe the standard error is one of the most confusing concepts for those that are new in statistics. That is my…

  • The normal distribution

    The normal distribution

    The normal distribution has distinct characteristics that form the foundation for parametric statistical tests…

  • How to describe a statistical population using R - Part 1: Location and variability

    How to describe a statistical population using R - Part 1: Location and variability

    Measures of location and variability play a fundamental role in describing a statistical population. They are equally…

  • Hypothesis testing

    Hypothesis testing

    Hypothesis testing is, according to my opinion, analogous to the scientific method. It follows a logical structure that…

  • Variables and scale

    Variables and scale

    Data used in research and statistical tests can be obtained by measuring stuff directly (such as height), collecting…

  • Population vs sample

    Population vs sample

    Population and sample are two fundamental concepts of statistical theory. In every statistical test, you deal with at…

    1 条评论

社区洞察

其他会员也浏览了