Fun with R for everyone -- updated with the U.S. state-specific dataset!
I find myself wandering around the internet looking for someone to have done The Definitive Analysis of What is Going On. Here's my own attempt to understand things, or at least to practice my R skills! Image/Collage is copyright June Weintraub 2020

UPDATE: April 29, 2020. All I care about these days is what's happening in the various states, so I've taken to running the following code, which uses a different library that covers only US data. It's easy to edit: just put in whatever state(s) you want to look at. You can look at cases or deaths, and you can even divide by the number of tests.

library(covid19us)
library(dplyr)    # for %>% and group_by()
library(ggplot2)  # for the plots below
US <- get_states_daily()
US_df <- US %>% group_by(state)

#subset of States -- add or subtract as you wish
#to make a plot of cases or deaths 
states <- c("GA", "FL", "MI")

ss_us <- subset(US_df, state %in% states)
# Daily new cases
d2_us <- ggplot(ss_us, aes(x = date, y = positive_increase)) +
  geom_line(aes(color = state), size = 0.2) +
  theme_minimal()
d2_us

# Daily new deaths (print the object, or no plot shows up)
d2_us <- ggplot(ss_us, aes(x = date, y = death_increase)) +
  geom_line(aes(color = state), size = 0.2) +
  theme_minimal()
d2_us
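
The "divide by number of tests" idea might look something like the sketch below. It runs on a small made-up data frame so it works on its own. The column name total_test_results_increase is my assumption about what get_states_daily() returns, so check names(US) before swapping in the real data.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the covid19us data: these numbers are invented for illustration
toy <- data.frame(
  state = rep(c("GA", "FL"), each = 3),
  date = rep(as.Date("2020-04-27") + 0:2, times = 2),
  positive_increase = c(50, 80, 60, 100, 120, 0),
  total_test_results_increase = c(500, 1000, 600, 2000, 1500, 0)  # assumed column name
)

# Percent of each day's tests that came back positive;
# days with no reported tests become NA instead of dividing by zero
toy <- toy %>%
  mutate(pct_positive = ifelse(total_test_results_increase > 0,
                               100 * positive_increase / total_test_results_increase,
                               NA_real_))

p_pos <- ggplot(toy, aes(x = date, y = pct_positive, color = state)) +
  geom_line() +
  labs(y = "Percent of tests positive", x = "Date") +
  theme_minimal()
p_pos
```

To try it on the real thing, replace toy with ss_us once you've confirmed the column names.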


#Deaths each day in some different states

Calif <- subset(US_df, state == "CA")
d3 <- ggplot(data = Calif, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#00AFBB")
d3 + labs(y="Deaths", x = "Date") + ggtitle("California Covid-19 Deaths")


Georgia <- subset(US_df, state == "GA")
d3 <- ggplot(data = Georgia, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#009900")
d3 + labs(y="Deaths", x = "Date") + ggtitle("Georgia Covid-19 Deaths")


Michigan <- subset(US_df, state == "MI")
d3 <- ggplot(data = Michigan, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#FC4E07")
d3 + labs(y="Deaths", x = "Date") + ggtitle("Michigan Covid-19 Deaths")


NewYork <- subset(US_df, state == "NY")
d3 <- ggplot(data = NewYork, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#000066")
d3 + labs(y="Deaths", x = "Date") + ggtitle("New York Covid-19 Deaths")


Florida <- subset(US_df, state == "FL")
d3 <- ggplot(data = Florida, aes(x = date, y = death_increase)) +
  geom_bar(stat="identity", width=0.5, fill = "#FF40A0")
d3 + labs(y="Deaths", x = "Date") + ggtitle("Florida Covid-19 Deaths")

Is it lazy or efficient to just name everything the same thing and overwrite the objects each time? I know not...
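
One way to sidestep the question is to wrap the repeated steps in a small function and call it once per state. This is only a sketch: plot_state_deaths is a name I made up, and the toy data frame stands in for US_df (same column names, invented numbers) so the example runs by itself.

```r
library(ggplot2)

# Hypothetical helper: one bar chart of daily deaths for a single state
plot_state_deaths <- function(df, state_code, state_name, fill_color) {
  one_state <- df[df$state == state_code, ]
  ggplot(data = one_state, aes(x = date, y = death_increase)) +
    geom_bar(stat = "identity", width = 0.5, fill = fill_color) +
    labs(y = "Deaths", x = "Date") +
    ggtitle(paste(state_name, "Covid-19 Deaths"))
}

# Toy stand-in for US_df: these numbers are invented for illustration
toy <- data.frame(
  state = rep(c("CA", "GA"), each = 3),
  date = rep(as.Date("2020-04-27") + 0:2, times = 2),
  death_increase = c(40, 55, 48, 12, 20, 17)
)

p_ca <- plot_state_deaths(toy, "CA", "California", "#00AFBB")
p_ga <- plot_state_deaths(toy, "GA", "Georgia", "#009900")
p_ca
p_ga
```

With the real data you would pass US_df instead of toy, and nothing gets overwritten unless you want it to.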

Original piece from March 26 2020:

Instead of roaming around the web looking at the same statistics over and over in hopes of finding some revelation about the Covid-19 pandemic, I've decided to pass the time improving my rudimentary R skills so I can look at the data myself. If you’ve installed R, you should be able to copy the code below and just paste it into your R console to look at some data and experiment with different shapes, colors, and subsets. Have fun!

If you don't have the following packages installed, you'll need to run install.packages() for each one first, and then call the libraries that we'll use:

#for example if I didn't have ggplot2 package already I'd need to type:
install.packages("ggplot2")

library("ggplot2")
library("dplyr")
library("tidyr")
library("RCurl")
library("lubridate")
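
If you're not sure which of these you already have, a quick check like the one below can tell you. It only reports what's missing and doesn't install anything.

```r
# Packages this walkthrough loads
needed <- c("ggplot2", "dplyr", "tidyr", "RCurl", "lubridate")

# Which of them are not in your library yet?
missing_pkgs <- setdiff(needed, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  # Prints the names only; run install.packages(missing_pkgs) yourself to install them
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are already installed.")
}
```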

Rami Krispin maintains a dataset that is updated daily; you can download the latest data from GitHub and store it in a data frame named covid19. (He also has a bunch of way more sophisticated tools, including a Coronavirus Dashboard that does all the things I'm trying to do here, only much more elegantly and comprehensively.)

download <- getURL("https://raw.githubusercontent.com/RamiKrispin/coronavirus-csv/master/coronavirus_dataset.csv?accessType=DOWNLOAD")

covid19 <- read.csv(text = download)

List the variables in the data set, make the date variable a date and check the dates covered. Since the dataset is updated daily the max date will change each day.

names(covid19)
covid19$date <- as.Date(covid19$date)
summary(covid19$date)

Now I'm going to make a variable for died. Strictly speaking it isn't a 0/1 dummy: it holds the number of deaths when type is a death, and 0 otherwise, which is what lets me add it up later. Up until March 22, these data had three levels for the type variable: confirmed, death and recovered. Starting on March 23 this became a two-level variable, but I'm leaving this code here so I can remember how to make a conditional variable like this.

covid19$died <- ifelse(covid19$type == "death", covid19$cases, 0)
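
A toy comparison makes the difference between a true 0/1 dummy and the count-style variable created by the ifelse() line easier to see. The rows below are invented and only mimic the long layout of the real data.

```r
# Toy rows in the same long layout as the real data (numbers invented)
toy <- data.frame(
  type  = c("confirmed", "death", "confirmed", "death"),
  cases = c(120, 3, 95, 0)
)

# A true 0/1 dummy: flags death rows, ignoring how many deaths they hold
toy$is_death <- ifelse(toy$type == "death", 1, 0)

# The count-style version: the death count itself, 0 on every other row
toy$died <- ifelse(toy$type == "death", toy$cases, 0)

toy
sum(toy$is_death)  # counts death rows
sum(toy$died)      # counts deaths
```

Summing the dummy counts rows (2 here), while summing died counts deaths (3 here), which is the number the by-country totals need.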

#####################

##Ready to see some data!

#####################

Start with just total cases by country in the whole dataset to date (which is 3/24/2020 for all the results below):

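# Keep only confirmed rows so deaths and recoveries aren't added into the case total
summary_df <- covid19 %>%
  filter(type == "confirmed") %>%
  group_by(Country.Region) %>%
  summarize(total_cases = sum(cases)) %>%
  arrange(-total_cases)
summary_df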

Now let's just look at how many deaths there are by country in the whole dataset through 3/24/2020:

deaths <- covid19 %>%
  group_by(Country.Region) %>%
  summarize(total_deaths = sum(died)) %>%
  arrange(-total_deaths)
deaths

What about cases on just the latest day of the dataset?

covid19 %>%
 filter(date == max(date)) %>%
 select(country = Country.Region, type, cases) %>%
 group_by(country, type) %>%
 summarize(total_cases = sum(cases)) %>% 
 pivot_wider(names_from = type, values_from = total_cases) %>%
 arrange(-confirmed)
Cases and Deaths on a single day -- 3/24/2020, by country
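
If pivot_wider() is new to you, here's what it does on a tiny made-up table shaped like the summarized data: each value of type becomes its own column, so every country ends up on a single row. The numbers are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Long layout: one row per country and type (numbers invented)
toy <- data.frame(
  country = c("Italy", "Italy", "Spain", "Spain"),
  type = c("confirmed", "death", "confirmed", "death"),
  total_cases = c(500, 60, 650, 50)
)

# Wide layout: one row per country, one column per type
wide <- toy %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(-confirmed)
wide
```

The result has the columns country, confirmed and death, with Spain listed first because it has the larger confirmed count.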

You can substitute any particular day instead of date == max(date). Here's what happened on March 1:

covid19 %>%
 filter(date == "2020-03-01") %>%
 select(country = Country.Region, type, cases) %>%
 group_by(country, type) %>%
 summarize(total_cases = sum(cases)) %>%
 pivot_wider(names_from = type, values_from = total_cases) %>%
 arrange(-confirmed)
Cases on March 1, 2020 by country

This was the code to show cases in the US to-date, by State, but it won’t work as of March 23 when the source data stopped having state-by-state data. Maybe this will change!

covid19 %>% 
 filter(Country.Region == "US") %>%
 select(Province.State, type, cases) %>%
 group_by(Province.State, type) %>% 
 summarize(total_cases = sum(cases)) %>%
 pivot_wider(names_from = type, values_from = total_cases) %>%
 arrange(-confirmed)

#################

#Fun with plots below

#################

Here's a website with some hex colors: https://www.colorhexa.com/web-safe-colors. In the code below you can change the colors or the width of the bars, limit the data to deaths only, or just pick one or two countries. Use guides(fill = "none") to remove the legend if it's just one series (newer versions of ggplot2 warn about the older guides(fill=FALSE) spelling).

Plot all the cases to date, by date

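# Confirmed rows only; without the subset, death rows get stacked into the bars too
ggplot(data = subset(covid19, type == "confirmed"), aes(x = date, y = cases)) +
  geom_bar(stat = "identity", width = 0.5, fill = "#000066") +
  labs(y = "Count of Cases", x = "Date") +
  ggtitle("Covid-19 Cases per Day to 3/24/2020") + guides(fill = "none")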

Now I'll make some new data frames with different subsets: (1) deaths to date, by date; (2) cases in China to date, by date; (3) cases in Italy to date, by date. I played with this code for a while to experiment with different colors and widths. Notice that the fill= argument sits inside geom_bar() in each plot. Why does it matter? I do not know. But the colors just defaulted to a salmon color for the China and Italy plots until I moved the fill argument there.

ss_deaths <- subset(covid19, type == "death")
ggplot(data = ss_deaths, aes(x = date, y = cases)) +
  geom_bar(stat = "identity", width = 0.3, fill = "#000066") +
  labs(y = "Deaths", x = "Date") + ggtitle("All Countries Covid-19 Deaths to 3-24") +
  guides(fill = "none")

# Confirmed rows only, so deaths aren't counted as cases
ssC <- subset(covid19, Country.Region == "China" & type == "confirmed")
ggplot(data = ssC, aes(x = date, y = cases)) +
  geom_bar(stat = "identity", width = 0.5, fill = "#00cc99") +
  labs(y = "Cases", x = "Date") + ggtitle("China Covid-19 Cases to 3-24") +
  guides(fill = "none")

ssI <- subset(covid19, Country.Region == "Italy" & type == "confirmed")
ggplot(data = ssI, aes(x = date, y = cases)) +
  geom_bar(stat = "identity", width = 0.8, fill = "#00AFBB") +
  labs(y = "Cases", x = "Date") + ggtitle("Italy Covid-19 Cases to 3-24") +
  guides(fill = "none")
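
My best guess at the salmon mystery is the difference between setting a color and mapping one. Outside aes(), fill is taken as a literal color. Inside aes(), the same string is treated as a data value, so ggplot2 assigns its own default color to that one-level "category" and adds a legend. The sketch below uses a made-up data frame to show both.

```r
library(ggplot2)

# Toy data, invented for illustration
toy <- data.frame(date = as.Date("2020-03-20") + 0:3,
                  cases = c(5, 9, 14, 20))

# fill set outside aes(): a literal color, so the bars come out navy
p_set <- ggplot(toy, aes(x = date, y = cases)) +
  geom_bar(stat = "identity", fill = "#000066")

# fill placed inside aes(): treated as data, so ggplot2 chooses
# its default color for the single level and draws a legend
p_mapped <- ggplot(toy, aes(x = date, y = cases, fill = "#000066")) +
  geom_bar(stat = "identity")

p_set
p_mapped
```

That would also explain why a legend shows up in one version and not the other.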

Now I'll plot a bunch of countries on the same plot. First make a vector with some countries I'm interested in, (add or subtract whatever countries you want), then use ggplot in various ways:

countries <- c("China", "Italy", "US", "Iran", "Spain", "Korea, South", "Germany", "France", "Switzerland", "United Kingdom")

ss <- subset(covid19, Country.Region %in% countries & type == "death")
d1 <- ggplot(ss, aes(x = date, y = cases, fill = Country.Region)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 1.5)
d1 + labs(y = "Count of Deaths", x = "Date") + ggtitle("Covid-19 deaths per Day in 10 Countries")


# Confirmed rows only, so death rows don't show up as extra bars in the case plot
ss <- subset(covid19, Country.Region %in% countries & type == "confirmed")
d2 <- ggplot(ss, aes(x = date, y = cases, fill = Country.Region)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 1.5)
d2 + labs(y = "Count of Cases", x = "Date") + ggtitle("Covid-19 Cases per Day in 10 Countries")


Looking at the plot on the right, you can see that the cases jumped in China around mid-February. There's a reasonable explanation for this: the criteria for a case were expanded, and a bunch of people who hadn't previously been considered cases were lumped into the case counts on a single day. That said, it's interesting that my usual trusted sources seem to have been silent about this artifact. In any event, let's see what it looks like if we set the x-axis on that graph to start after 2/14, or cap the y-axis at 7,500. Different pictures! I think this is a nice example of how easy it is to change the story just by truncating a few axes.

# Named start_date/end_date instead of min/max so the base R functions aren't masked
start_date <- as.Date("2020-02-15")
end_date <- max(ss$date)
d2 + scale_x_date(limits = c(start_date, end_date)) + ggtitle("Covid-19 Cases after 2/14")


 d2 + ylim(0, 7500) + labs(y="Count of Cases", x = "Date") + ggtitle("Covid-19 Cases with y-axis truncated") 
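
One wrinkle worth knowing: ylim() sets limits on the scale, so any bar taller than the limit is dropped from the plot (with a warning) rather than cut off. If you want a true "zoom" where tall bars stay and simply run off the top, coord_cartesian() does that. Here's a sketch on made-up numbers:

```r
library(ggplot2)

# Toy data with one big spike (numbers invented)
toy <- data.frame(date = as.Date("2020-02-11") + 0:3,
                  cases = c(2000, 15000, 5000, 2500))

base_plot <- ggplot(toy, aes(x = date, y = cases)) +
  geom_bar(stat = "identity", fill = "#000066")

# ylim() limits the scale: the 15000 bar is outside the limits and disappears
p_dropped <- base_plot + ylim(0, 7500)

# coord_cartesian() zooms the view: the 15000 bar stays and is cut off at the top
p_zoomed <- base_plot + coord_cartesian(ylim = c(0, 7500))

p_dropped
p_zoomed
```

Two more different pictures from the same data, which rather proves the point about axes.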


And for the finale, let's just look at a few countries in a finite time frame so the plot is actually a little bit readable.

countries <- c("Italy", "US", "Iran", "Spain")

ss <- subset(covid19, Country.Region %in% countries & type == "death")

# Date window for the x-axis, taken from the subset just created
start_date <- as.Date("2020-03-01")
end_date <- max(ss$date)

d3 <- ggplot(ss, aes(x = date, y = cases, fill = Country.Region)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 1.5)
d3 + scale_x_date(limits = c(start_date, end_date)) + labs(y = "Count of Deaths", x = "Date") + ggtitle("Covid-19 deaths/day after 3/01 in 4 countries")

So that'll do for now, but luckily there’s still more for me to learn. Like how can I animate these? Can I get my own US state-by-state data from somewhere else? What about denominators? Is this where Standardized Mortality Ratios would come in handy? Can I even remember how to do those?
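
As a memory jogger on that last question, the arithmetic of a Standardized Mortality Ratio (indirect standardization) goes roughly like this: apply reference death rates to your own population in each age group to get expected deaths, then divide observed deaths by expected deaths. Every number below is hypothetical and exists only to show the calculation.

```r
# Hypothetical numbers, invented purely to show the arithmetic
strata <- data.frame(
  age_group      = c("0-49", "50-69", "70+"),
  population     = c(60000, 30000, 10000),  # people in the area of interest
  reference_rate = c(0.0001, 0.001, 0.01),  # deaths per person in a reference population
  observed       = c(8, 35, 110)            # deaths actually seen in the area
)

# Expected deaths if each age group had experienced the reference rate
strata$expected <- strata$population * strata$reference_rate

# SMR = total observed deaths / total expected deaths
smr <- sum(strata$observed) / sum(strata$expected)
smr
```

Here 153 observed deaths against 136 expected gives an SMR of 1.125, meaning about 12.5% more deaths than the reference rates would predict for a population with this age mix.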

Even though I won't be able to figure out how to use these skills to predict how this pandemic will unfold, I still feel good predicting that we humans will come out on the other side having learned more than just how to make plots in R.
