Fun with R for everyone -- updated with the U.S. state-specific dataset!
June Weintraub
Deputy Director for Environmental Health and State Environmental Health Director, CDPH
UPDATE: April 29, 2020. All I care about these days is what's happening in the various states, so I've taken to running the following using a different library that is just for US data. This code is pretty easily edited to just put whatever state(s) you want to look at. You can look at cases or deaths, and you can even divide by number of tests.
library(covid19us)
library(dplyr)    # for %>% and group_by()
library(ggplot2)  # for the plots

US <- get_states_daily()
US_df <- US %>% group_by(state)

#subset of States -- add or subtract as you wish
#to make a plot of cases or deaths
states <- c("GA", "FL", "MI")
ss_us <- subset(US_df, state %in% states)

d2_us <- ggplot(ss_us, aes(x = date, y = positive_increase)) +
  geom_line(aes(color = state), size = 0.2) +
  theme_minimal()
d2_us

d2_us <- ggplot(ss_us, aes(x = date, y = death_increase)) +
  geom_line(aes(color = state), size = 0.2) +
  theme_minimal()
d2_us

#Deaths each day in some different states
Calif <- subset(US_df, state == "CA")
d3 <- ggplot(data = Calif, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#00AFBB")
d3 + labs(y="Deaths", x = "Date") + ggtitle("California Covid-19 Deaths")

Georgia <- subset(US_df, state == "GA")
d3 <- ggplot(data = Georgia, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#009900")
d3 + labs(y="Deaths", x = "Date") + ggtitle("Georgia Covid-19 Deaths")

Michigan <- subset(US_df, state == "MI")
d3 <- ggplot(data = Michigan, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#FC4E07")
d3 + labs(y="Deaths", x = "Date") + ggtitle("Michigan Covid-19 Deaths")

NewYork <- subset(US_df, state == "NY")
d3 <- ggplot(data = NewYork, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#000066")
d3 + labs(y="Deaths", x = "Date") + ggtitle("New York Covid-19 Deaths")

Florida <- subset(US_df, state == "FL")
d3 <- ggplot(data = Florida, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#FF40A0")
d3 + labs(y="Deaths", x = "Date") + ggtitle("Florida Covid-19 Deaths")
Is it lazy or efficient to just name everything the same thing and overwrite the objects each time? I know not...
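One alternative to overwriting the same object for every state is to wrap the repeated steps in a small function. This is only a sketch: plot_state_deaths is a name I made up, toy_us is a few invented rows standing in for the downloaded data, and the function assumes the data frame has the state, date and death_increase columns used in the state plots.

```r
library(ggplot2)

# Hypothetical helper (not part of any package): one bar chart of daily
# deaths for a single state. Assumes columns state, date and death_increase.
plot_state_deaths <- function(df, abbrev, label, fill_color) {
  one_state <- df[df$state == abbrev, ]
  ggplot(data = one_state, aes(x = date, y = death_increase)) +
    geom_bar(stat = "identity", width = 0.5, fill = fill_color) +
    labs(y = "Deaths", x = "Date") +
    ggtitle(paste(label, "Covid-19 Deaths"))
}

# Invented rows so the sketch runs without downloading anything
toy_us <- data.frame(
  state = rep(c("CA", "GA"), each = 3),
  date = rep(as.Date("2020-04-01") + 0:2, times = 2),
  death_increase = c(10, 12, 9, 4, 6, 5)
)

plot_state_deaths(toy_us, "CA", "California", "#00AFBB")
plot_state_deaths(toy_us, "GA", "Georgia", "#009900")
```

With the real data you would pass US_df instead of toy_us, and adding a state becomes one more line rather than one more copied block.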
Original piece from March 26 2020:
Instead of roaming around the web looking at the same statistics over and over in hopes of finding some revelation about the Covid-19 pandemic, I've decided to pass the time improving my rudimentary R skills so I can look at the data myself. If you’ve installed R, you should be able to copy the code below and just paste it into your R console to look at some data and experiment with different shapes, colors, and subsets. Have fun!
If you don't have the following packages installed, you'll need to install.packages first, and then call the libraries that we'll use:
#for example if I didn't have ggplot2 package already I'd need to type:
install.packages("ggplot2")

library("ggplot2")
library("dplyr")
library("tidyr")
library("RCurl")
library("lubridate")
This person Rami Krispin is maintaining a dataset that is updated daily, and you can download the latest data from GitHub and name a data frame covid19. (He also has a bunch of way more sophisticated tools including a Coronavirus Dashboard that does all the things I'm trying to do here, only much more elegantly and comprehensively.)
download <- getURL("https://raw.githubusercontent.com/RamiKrispin/coronavirus-csv/master/coronavirus_dataset.csv?accessType=DOWNLOAD")
covid19 <- read.csv(text = download)
List the variables in the data set, make the date variable a date and check the dates covered. Since the dataset is updated daily the max date will change each day.
names(covid19)
covid19$date <- as.Date(covid19$date)
summary(covid19$date)
Now I'm going to make a dummy-style variable for died: it holds the row's case count if type is a death, and 0 otherwise (so it's really a death count rather than a strict 0/1 indicator). Up until March 22, these data had three levels for the type variable: confirmed, death and recovered. Starting on March 23 this just became a two-level variable, but I'm leaving this code here so I can remember how to make this kind of variable.
covid19$died <- ifelse(covid19$type == "death", covid19$cases, 0)
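To see the difference between a strict 0/1 dummy variable and the death count created above, here is a tiny sketch on invented numbers. The data frame toy and the column is_death are made up for illustration; only type, cases and died mirror the real dataset.

```r
# Toy data frame with invented numbers, mimicking the type and cases columns
toy <- data.frame(
  type  = c("confirmed", "death", "recovered", "death"),
  cases = c(10, 2, 5, 3)
)

# Strict 0/1 dummy variable: 1 if the row is a death, 0 otherwise
toy$is_death <- ifelse(toy$type == "death", 1, 0)

# Death count, as in the code above: the case count on death rows, 0 elsewhere
toy$died <- ifelse(toy$type == "death", toy$cases, 0)

toy
sum(toy$is_death)  # number of rows that record deaths: 2
sum(toy$died)      # number of deaths: 5
```

Summing the 0/1 version counts rows, while summing died counts people, which is what the deaths-by-country table needs.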
#####################
##Ready to see some data!
#####################
Start with just total cases by country in the whole dataset to date (which is 3/24/2020 for all the results below):
summary_df <- covid19 %>%
  filter(type == "confirmed") %>%  # confirmed cases only, so deaths and recoveries aren't added in
  group_by(Country.Region) %>%
  summarize(total_cases = sum(cases)) %>%
  arrange(-total_cases)
summary_df
Now let's just look at how many deaths there are by country in the whole dataset through 3/24/2020:
deaths <- covid19 %>%
  group_by(Country.Region) %>%
  summarize(total_deaths = sum(died)) %>%
  arrange(-total_deaths)
deaths
What about cases on just the latest day of the dataset?
covid19 %>%
  filter(date == max(date)) %>%
  select(country = Country.Region, type, cases) %>%
  group_by(country, type) %>%
  summarize(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(-confirmed)
You can substitute any particular day instead of date == max(date). Here's what happened on March 1:
covid19 %>%
  filter(date == "2020-03-01") %>%
  select(country = Country.Region, type, cases) %>%
  group_by(country, type) %>%
  summarize(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(-confirmed)
This was the code to show cases in the US to-date, by State, but it won’t work as of March 23 when the source data stopped having state-by-state data. Maybe this will change!
covid19 %>%
  filter(Country.Region == "US") %>%
  select(Province.State, type, cases) %>%
  group_by(Province.State, type) %>%
  summarize(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(-confirmed)
#################
#Fun with plots below
#################
Here's a website with some hex colors: https://www.colorhexa.com/web-safe-colors. In the code below you can change colors or size of the bars, limit to deaths only, or just pick one or two countries. Use guides(fill = "none") (or guides(fill=FALSE) in older versions of ggplot2) to remove the legend if it’s just one series.
Plot all the cases to date, by date
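# keep confirmed cases only, so deaths and recoveries aren't stacked into the bars
ggplot(data = subset(covid19, type == "confirmed"), aes(x = date, y = cases)) +
  geom_bar(stat="identity", width=0.5, fill = "#000066") +
  labs(y="Count of Cases", x = "Date") +
  ggtitle("Covid-19 Cases per Day to 3/24/2020") +
  guides(fill = "none")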
Now I'll make some new dataframes with different subsets (1) deaths to date, by date; (2) cases in China to date, by date; (3) cases in Italy to date, by date. I played with this code for a while to experiment with different colors and widths. Notice the "fill=" command is inside geom_bar for the case counts, but inside ggplot for the deaths. Why does it matter? I do not know. But the cases colors just defaulted to a salmon color for the China and Italy plots until I moved the fill command.
ss_deaths <- subset(covid19, type == "death")
ggplot(data = ss_deaths, aes(x = date, y = cases)) +
  geom_bar(stat="identity", width=0.3, fill = "#000066") +
  labs(y="Deaths", x = "Date") +
  ggtitle("All Countries Covid-19 Deaths to 3-24") +
  guides(fill = "none")

# type == "confirmed" keeps deaths and recoveries out of the case counts
ssC <- subset(covid19, Country.Region == "China" & type == "confirmed")
ggplot(data = ssC, aes(x = date, y = cases)) +
  geom_bar(stat="identity", width=0.5, fill ="#00cc99") +
  labs(y="Cases", x = "Date") +
  ggtitle("China Covid-19 Cases to 3-24") +
  guides(fill = "none")

ssI <- subset(covid19, Country.Region == "Italy" & type == "confirmed")
ggplot(data = ssI, aes(x = date, y = cases)) +
  geom_bar(stat="identity", width=0.8, fill = "#00AFBB") +
  labs(y="Cases", x = "Date") +
  ggtitle("Italy Covid-19 Cases to 3-24") +
  guides(fill = "none")
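A likely explanation for the salmon bars, assuming the earlier version of the code had the fill inside aes(): anything inside aes() is treated as data to be mapped to a color scale, while a fill set outside aes() is taken as a literal color. The sketch below uses invented numbers, and the names toy, p_literal and p_mapped are made up for illustration.

```r
library(ggplot2)

toy <- data.frame(day = 1:3, cases = c(5, 8, 13))  # invented numbers

# fill outside aes(): a literal color, so the bars are dark blue
p_literal <- ggplot(toy, aes(x = day, y = cases)) +
  geom_bar(stat = "identity", fill = "#000066")

# fill inside aes(): "#000066" is treated as a data value (a single category),
# so ggplot2 picks a color from its default palette and adds a legend
p_mapped <- ggplot(toy, aes(x = day, y = cases, fill = "#000066")) +
  geom_bar(stat = "identity")

p_literal
p_mapped
```

The second plot ignores the hex code as a color and shows it as a legend label instead, which would also explain why a legend needed removing in the first place.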
Now I'll plot a bunch of countries on the same plot. First make a vector with some countries I'm interested in, (add or subtract whatever countries you want), then use ggplot in various ways:
countries <- c("China", "Italy", "US", "Iran", "Spain", "Korea, South", "Germany", "France", "Switzerland", "United Kingdom")

ss <- subset(covid19, Country.Region %in% countries & type == "death")
d1 <- ggplot(ss, aes(x = date, y = cases, fill = Country.Region)) +
  geom_bar(stat="identity", position=position_dodge(), width=1.5)
d1 + labs(y="Count of Deaths", x = "Date") + ggtitle("Covid-19 deaths per Day in 10 Countries")

# type == "confirmed" keeps deaths and recoveries out of the case counts
ss <- subset(covid19, Country.Region %in% countries & type == "confirmed")
d2 <- ggplot(ss, aes(x = date, y = cases, fill = Country.Region)) +
  geom_bar(stat="identity", position=position_dodge(), width=1.5)
d2 + labs(y="Count of Cases", x = "Date") + ggtitle("Covid-19 Cases per Day in 10 Countries")
Looking at the second plot (cases), you can see that the cases jumped in China around mid-February. There's a reasonable explanation for this, which is that the case definition was expanded and a bunch of people who hadn't previously been considered cases were lumped into the case counts on a single day. That said, it's interesting that my usual trusted sources seem to have been silent about this artifact. In any event, let's see what it looks like if we set the x-axis on that graph to just start after 2/14, or set the y-axis to just truncate at 7500. Different pictures! I think this is a nice example of how easy it is to change the story just by truncating a few axes.
# named start_date and end_date so they don't mask the built-in min() and max()
start_date <- as.Date("2020-02-15")
end_date <- max(ss$date)
d2 + scale_x_date(limits = c(start_date, end_date)) + ggtitle("Covid-19 Cases after 2/14")
d2 + ylim(0, 7500) + labs(y="Count of Cases", x = "Date") + ggtitle("Covid-19 Cases with y-axis truncated")
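One thing worth knowing about truncating axes in ggplot2: ylim() and the limits argument of scale_x_date() set scale limits, and by default values outside the limits are dropped (with a warning) rather than clipped, so a bar taller than the limit disappears entirely. coord_cartesian() zooms in without dropping anything. Here is a small sketch with invented numbers; toy, p, p_dropped and p_zoomed are names made up for illustration.

```r
library(ggplot2)

# Invented daily counts with one spike, standing in for the real case data
toy <- data.frame(
  date  = as.Date("2020-02-10") + 0:4,
  cases = c(2000, 2500, 15000, 3000, 2600)
)

p <- ggplot(toy, aes(x = date, y = cases)) +
  geom_col(fill = "#000066")

# ylim() is a scale limit: the 15000 bar falls outside it and is dropped
p_dropped <- p + ylim(0, 7500)

# coord_cartesian() zooms the view: the 15000 bar stays and is cut off at the top
p_zoomed <- p + coord_cartesian(ylim = c(0, 7500))

p_dropped
p_zoomed
```

So the two ways of truncating tell different stories again: one plot shows a gap on the spike day, the other shows a bar running off the top of the chart.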
And for the finale, let's just look at a few countries in a finite time frame so the plot is actually a little bit readable.
countries <- c("Italy", "US", "Iran", "Spain")
ss <- subset(covid19, Country.Region %in% countries & type == "death")

# named start_date and end_date so they don't mask the built-in min() and max()
start_date <- as.Date("2020-03-01")
end_date <- max(ss$date)

d3 <- ggplot(ss, aes(x = date, y = cases, fill = Country.Region)) +
  geom_bar(stat="identity", position=position_dodge(), width=1.5)
d3 + scale_x_date(limits = c(start_date, end_date)) +
  labs(y="Count of Deaths", x = "Date") +
  ggtitle("Covid-19 deaths/day after 3/01 in 4 countries")
So that'll do for now, but luckily there’s still more for me to learn. Like how can I animate these? Can I get my own US state-by-state data from somewhere else? What about denominators? Is this where Standardized Mortality Ratios would come in handy? Can I even remember how to do those?
Even though I won't be able to figure out how to use these skills to predict how this pandemic will unfold, I still feel good predicting that we humans will come out on the other side having learned more than just how to make plots in R.