Build your own COVID-19 projections model with R language

Build your own COVID-19 projections model with R language

I'm not R programmer but I always wanted to see what that language is capable of. Until I was grounded at home due to covid19 outbreak, I had never had a chance to look at it closer. Anyway, being a bit bored and, like everyone, wondering when this will end, I thought, what if I could have a crystal ball that would give me the answer to the question when that crazy time is over. I know, there're plenty of things you can do while staying home but it seems like R powered crystal ball could be one of them. DIY!

Ok, let's try to predict the unpredictable. I hope you will enjoy it as much as I did and will learn some data analytics by completing this short exercise with me.

Installation

I didn't cover the R installation process but don't worry, if you don't have R on your PC, have a look at this tutorial: https://www.datacamp.com/community/tutorials/installing-R-windows-mac-ubunt

Feed the model with real data

Data frame is a two-dimensional data structure in R that stores vectors of equal length. I'll initiate a new data frame called covid19 with the raw data about the new daily case in Poland.

# New Cases in Poland
covid19 <- data.frame(
Dates = as.Date(c("2020-03-04","2020-03-05","2020-03-06","2020-03-07","2020-03-08","2020-03-09","2020-03-10","2020-03-11","2020-03-12","2020-03-13","2020-03-14","2020-03-15","2020-03-16","2020-03-17","2020-03-18","2020-03-19","2020-03-20","2020-03-21","2020-03-22","2020-03-23","2020-03-24","2020-03-25","2020-03-26","2020-03-27","2020-03-28","2020-03-29","2020-03-30","2020-03-31","2020-04-01","2020-04-02","2020-04-03","2020-04-04","2020-04-05","2020-04-06","2020-04-07","2020-04-08","2020-04-09","2020-04-10","2020-04-11","2020-04-12")),
NewCases = c(1,0,4,1,5,6,5,9,20,17,36,21,52,61,49,68,70,111,98,115,152,150,170,168,249,224,193,256,243,392,437,244,475,311,435,357,370,380,401,318)
)

As a warm-up, let's also calculate the Total Cases for each day. Since that data is already available, we could easily load it in the previous step, but I think it's a cool exercise to see how for loop works as well as to conduct a common data manipulation task to merge two data frames with Column Bind (cbind). Feel free to cross-check it with the public data. Both data sets should be the same.

# let's calculate Total Cases for each day

Total <- c()
count <- 0
for (val in covid19$NewCases){
    count = count+val
    Total <- c(Total, count)
}

# and then add to two our data
covid19 <- cbind(covid19,TotalCases=Total)

Experimenting

I was trying to find any clue that can help me with finding the trend. After a few experiments, I've discovered that the ratio of New Cases to Total Cases for a given day might be a good indicator. So let's see and add it to the model as NewInTotal.

NewInTotal <- c()

for (val in nrow(covid19)){
    count = covid19$NewCases / covid19$Total
    NewInTotal <- c(NewInTotal, count)
}

covid19 <- cbind(covid19,NewInTotal=NewInTotal)

Would you like to see how our data structure looks like? Try the head command to return only the first 20 rows of our new data structure:

> head(covid19,n=20)
        Dates NewCases TotalCases NewInTotal
1  2020-03-04        1          1  1.0000000
2  2020-03-05        0          1  0.0000000
3  2020-03-06        4          5  0.8000000
4  2020-03-07        1          6  0.1666667
5  2020-03-08        5         11  0.4545455
6  2020-03-09        6         17  0.3529412
7  2020-03-10        5         22  0.2272727
8  2020-03-11        9         31  0.2903226
9  2020-03-12       20         51  0.3921569
10 2020-03-13       17         68  0.2500000
11 2020-03-14       36        104  0.3461538
12 2020-03-15       21        125  0.1680000
13 2020-03-16       52        177  0.2937853
14 2020-03-17       61        238  0.2563025
15 2020-03-18       49        287  0.1707317
16 2020-03-19       68        355  0.1915493
17 2020-03-20       70        425  0.1647059
18 2020-03-21      111        536  0.2070896
19 2020-03-22       98        634  0.1545741
20 2020-03-23      115        749  0.1535381

Building a model

Let me state the hypothesis: since the TotalCases reached 11 on day 5 in Poland, the ratio of New Cases to Total Cases for a given day might be described with an exponential trendline. If so, let's try to model it. lm is used to fit linear models. Hence we need to use a small trick. Do you remember that exp(log(x))=x?

# model will be built on the data since TotalCases reached 11 (day 5)
model <- covid19[covid19$Total > 10,]

lp <- seq(1, nrow(model),1)
model <- cbind(model,nr=lp)

# fit exp model
exponential.model <- lm(log(model$NewInTotal)~ model$nr)

# we will use data for first 4 days later
covid19.firstdays <- covid19[covid19$Total < 10,]
lp <- seq(1, nrow(covid19.firstdays),1)
covid19.firstdays <- cbind(covid19.firstdays,nr=lp)

And now, let's assess how good the mode is by running summary command:

> summary(exponential.model)

Call:
lm(formula = log(model$NewInTotal) ~ model$nr)

Residuals:
     Min       1Q   Median       3Q      Max
-0.39616 -0.10746 -0.02575  0.13879  0.32368

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.973888   0.065378  -14.90   <2e-16 ***
model$nr    -0.051718   0.003081  -16.78   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1921 on 34 degrees of freedom
Multiple R-squared:  0.8923,    Adjusted R-squared:  0.8891
F-statistic: 281.7 on 1 and 34 DF,  p-value: < 2.2e-16

R-squared is 0.8923. It indicates that ~89% of the variation in the output variables are explained by the input variables so I think it's worth to explore this model further. Let's visualize a trend with plot command and add a trend line with lines command. Meanwhile, let's also define two one-line functions that will help us with further calculations:

# It returns the date for given day since 10 Total Cases
model.date <- function (i) return(as.Date("2020-03-07")+i);

#It calculates predicted NewInTotal for given day since 10 Total Cases
model.predictedNewinTotal <- function(i) return(exp(-0.973888 + (-0.051718*i)))
# Let's visualize it
model.NewInTotal <- exp(predict(exponential.model,list(model$nr)))

plot(model$nr, model$NewInTotal,
type="b",
col = "red",
lwd = 2,
xlab = "Days since 10 Total Cases",
ylab = "Daily New Cases In Total Cases",
main = "Daily New Cases In Total Cases \nsince more than 10 Total Cases in Poland"
)
lines(model.NewInTotal, col = "blue", lty="dashed", lwd = 3,)

legend("topright", legend=c("Real", "Model"),
       col=c("red", "blue"), lty = 1:2, cex=1)

And this how it looks like:

No alt text provided for this image

For further calculations, we need to create new data frames. expmodel will store values as per model, and expmodeladjusted will store values as per model calibrated to the TotalCases for yesterday (Apr 12, 2020). Calculations will be performed in an infinite while loop that will be broken if daily NewCases has reached 0.

#Daily NewCases as per model since the day with more than 10 Total Cases
expmodel <- data.frame(Dates = c(model.date(1)), NewCases =c(model$NewCases[1]), TotalCases = c(model$TotalCases[1]), NewInTotal = c(model$NewInTotal[1]),  nr = c(1))

i=2
M.NewCases <- NA
while (1) {
    M.PrevTotalCases = expmodel$TotalCases[i-1]
    M.Dates = model.date(i)
    M.NewInTotal = model.predictedNewinTotal(i)
    M.TotalCases = round(M.PrevTotalCases/(1-M.NewInTotal))
    M.NewCases = M.TotalCases - M.PrevTotalCases
    M.nr = i
    expmodel = rbind(expmodel,list(M.Dates ,M.NewCases,M.TotalCases,M.NewInTotal,M.nr))
    i = i+1
    
    if (M.NewCases == 0) break
}

i = nrow(model)

#Daily NewCases as per model since the day with more than 10 Total Cases, calibrated as per Easter Sunday's value.
expmodeladjusted <- data.frame(Dates = c(model.date(i)), NewCases = c(model$NewCases[i]), TotalCases = c(model$TotalCases[i]), NewInTotal = c(model$NewInTotal[i]),  nr = c(i))

while (1) {
    i = i+1
    M.PrevTotalCases = expmodeladjusted$TotalCases[i-nrow(model)]
    M.Dates = model.date(i)
    M.NewInTotal = model.predictedNewinTotal(i)
    M.TotalCases = round(M.PrevTotalCases/(1-M.NewInTotal))
    M.NewCases = M.TotalCases - M.PrevTotalCases
    M.nr = i
    expmodeladjusted = rbind(expmodeladjusted,list(M.Dates ,M.NewCases,M.TotalCases,M.NewInTotal,M.nr))
    
    if (M.NewCases == 0) break
}

Let's visualize Total Case and finally find out when this madness will end:

model = unique(rbind(covid19.firstdays, model))
model = unique(rbind(model, expmodeladjusted))
barplot(model$TotalCases,
main = "Total Cases in Poland",
xlab = "Date",
ylab = "Total Cases",
col=c("darkblue","white"),
beside = TRUE,
ylim=c(0,25000),
names.arg = model$Dates)

Hmm, Looks like we need to be patient...

No alt text provided for this image

But perhaps Daily New Cases will give us some sense of hope, let's have a closer look:

plot(model$Dates, model$NewCases,type="b",ylim=c(0,500),xlab="Date", ylab = "New Cases", main = "Daily New Cases in Poland", pch=15)

expmodel = unique(rbind(covid19.firstdays , expmodel))

# Red dotted line shows Easter Sunday
points(x=as.Date(c('2020-04-12','2020-04-12')), y=c(0,500), col='red', type='l', lty="dotted",lwd=2)

# Blue dashed line shows the model data
points(x=model$Dates[1:length(expmodel$NewCases)], y=expmodel$NewCases, col='blue', type='l', lty="dashed",lwd=2)


legend("topright", legend=c("Real/Calibrated Model", "Model", "Easter Sunday"), col=c("black","blue", "red"), lty = 1:3, cex=1)

That's cool! If this model is correct, Poland is almost on its peak. In the second half of April, Daily New Cases should start to go down:

No alt text provided for this image

Key takeaways

  • I appreciate it's a simple mathematical model that works under the assumption of ceteris paribus, i.e. the weather is not a major factor in virus spreading, social distancing and travel restrictions will remain the same as now, the presidential election that is scheduled for May in Poland won't trigger the second wave, etc.
  • Experts say COVID-19 confirmed cases are a fraction of the real total number. I think that's a fair statement. Unfortunately, there is no better data available, so I reckon we need to be more focused on trend analysis rather than actual numbers.
  • But if this model is correct, Poland is almost on its peak. Shortly after 15 Apr 2020, Daily New Cases should start slowly to go down. Even though, it is not even the beginning of the end. But it is, perhaps, the end of the beginning because the coronavirus will likely stay with us until Summer. So stay safe! ??
  • Programming in R is fun! Feel free to reuse my source code and build a model for another country. But to be frank, you don't need to use R to create a similar model. Perhaps you'll build your model by excelling your skills in MS Excel.??

要查看或添加评论,请登录

Marek Zielinski的更多文章

社区洞察

其他会员也浏览了