Data Visualization of Nobel Prizes by Age using R in Kaggle
Jagadish K.
Experienced SAS/R/Shiny developer | 12+ Years in SDTM, ADaM, TFLs, Oncology, Infectious Diseases & Therapeutic Areas | Advancing Skills in Python
In my previous article we have seen the data of Nobel Prizes by Gender, Countries and Categories. However we did not check the average age of the people awarded the Nobel prize. In the present article we will focus on the analyzing the same data by age.
In order to analyze the data by age we need to get the Nobel prize data and derive the age. Before that we need to call all the required libraries and then read the Nobel prize data from csv file.
library(readr) library(sqldf) library(dplyr) library(lubridate) install.packages("ggstatsplot") library(ggstatsplot) nobel <- read_csv('../input/nobel-prize/complete.csv') nobel
Here in the nobel prize data we have the following variables like date when the prize was awarded (dateAwarded) and date when the recipient was born (birth_date). Then the age is derived by derivation (dateAwarded-birth_date)/365.25, however in the data we see that for some of the subjects the dateAwarded is missing but the year when award was given is available, so I derived the dateAwarded assuming that every year in the month of October the award is given, so imputed the missing dateAwarded using 1st of October. The conclusion of taking the October as month of award is decided by checking the data over the years and months as displayed in the Graph 1.
nobel2 <- nobel %>% select(awardYear,dateAwarded,category,gender,birth_continent,birth_country,birth_date,givenName) nobel3 <- nobel2 %>% mutate(newdate=ifelse(is.na(dateAwarded),paste0(awardYear,'-10-01'),as.character(dateAwarded)),newdate2=as.Date(newdate), age2=ifelse(!is.na(birth_date),(newdate2-birth_date)/365.25,birth_date),age=round(age2,digits=1)) maxage2 <- nobel3 %>% left_join(.,maxage,by=c('category','age')) %>% mutate(outlier=ifelse(flag==1,paste0(givenName,' ','Age=',age),flag),month=month(newdate2)) ggplot(maxage2 , aes(x=awardYear,y=month)) + geom_bar(stat="identity", width=.5, position = "dodge") + labs(title = "Month of every year when the Nobel Prizes are given out") + xlab("Year") + ylab("Month")
Graph 1
Once the age is derived as per the code above, we can use the age variable to generate the graphs. The best way to identify the average age of the recipients when they actually received the award by category is boxplot. Please find below the boxplot with average age by Categories. Also we can see I annotated the outliers using geom_label(), as a input dataset for the same, I used the source dataset and flagged the outliers in outlier variable which I used to label. The overall average age is ranging between 45 - 75.
maxage <- nobel3 %>% filter(!is.na(age)) %>% group_by(category) %>% arrange(category,age) %>% slice(c(1,n())) %>% mutate(flag=1) %>% select(flag,category,age) maxage maxage2 <- nobel3 %>% left_join(.,maxage,by=c('category','age')) %>% mutate(outlier=ifelse(flag==1,paste0(givenName,' ','Age=',age),flag),month=month(newdate2)) maxage2 ggplot(maxage2,aes(x=category,y=age,fill=category)) + geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4) + labs(title = "Average Age of Nobel Prize Recipients by Categories") + xlab("Category") + ylab("Age") + theme(legend.position='none',axis.text.x = element_text(angle = 45,hjust = 1)) + geom_label(data=maxage2 %>% filter(flag==1 & category %in% c('Economic Sciences', 'Chemistry', 'Peace','Physics') & age>90 | age <20), aes(label=outlier),size=2)
Graph 2
Now lets see the average age of recipients by gender. Both male and female average age seems to be same, but for females it is 40-75 and males it is 45-75.
ggplot(maxage2 %>% filter(!is.na(gender)),aes(x=category,y=age,fill=category)) + geom_boxplot(outlier.colour="red", outlier.shape=7,outlier.size=2) + labs(title = "Age wise Nobel Prizes by Category") + xlab("Category") + ylab("Age") + theme(legend.position = "none",axis.text.x = element_text(angle = 45,hjust = 1)) + geom_label(data=maxage2 %>% filter(flag==1 & category %in% c('Economic Sciences', 'Chemistry', 'Peace','Physics') & age>90 | age <20), aes(label=outlier),size=2) + facet_wrap(~gender)
Graph 3
Note that I am not checking if the Recipient was alive when they received the award.
We can also confirm the average age of Nobel prize recipients by scatter plot as well as below, with scatter plot when compared with boxplot the average age is much clear it appears as 35-85
ggplot(maxage2, aes(x = age, y = category,col=category)) + geom_point(position='jitter') + xlab('Age') + ylab('Categories') + guides(alpha=FALSE) + scale_x_continuous(breaks=seq(15,100,by=5)) + theme(legend.position='bottom',legend.box = "vertical",axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) + scale_color_discrete(name="Country") + labs(title = "Scatter Plot of Nobel Prize Categories by Age")
Graph 4
We will also check by gender, the average age is same in both. But for males the data is more dense due to more awards compared to females.
ggplot(maxage2 %>% filter(!is.na(gender)), aes(x = age, y = category,col=category)) + geom_point(position='jitter') + xlab('Age') + ylab('Categories') + guides(alpha=FALSE) + scale_x_continuous(breaks=seq(15,100,by=10)) + theme(legend.position='bottom',legend.box = "vertical",axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) + scale_color_discrete(name="Country") + labs(title = "Scatter Plot of Nobel Prize Categories by Age and Gender") + facet_wrap(~gender)
Graph 5
While writing this article I learned few new things in R, of how we can convert the numeric date to character date by as.character(), convert the character date to numeric date by as.Date(), use of ifelse() in mutate(), paste0() to concatenate the character variables, rounding of numeric variables to required digits while deriving the age variable round(x,digits=.), also the annotation in graphs to display the outliers.
P.S. The opinions and views expressed here are mine and not of anyone else's.