登录查看更多内容

Build your own COVID-19 projections model with R language

Marek Zielinski

Director of Software Engineering at GSK | Exec MBA

发布日期: 2020年4月13日

I'm not R programmer but I always wanted to see what that language is capable of. Until I was grounded at home due to covid19 outbreak, I had never had a chance to look at it closer. Anyway, being a bit bored and, like everyone, wondering when this will end, I thought, what if I could have a crystal ball that would give me the answer to the question when that crazy time is over. I know, there're plenty of things you can do while staying home but it seems like R powered crystal ball could be one of them. DIY!

Ok, let's try to predict the unpredictable. I hope you will enjoy it as much as I did and will learn some data analytics by completing this short exercise with me.

Installation

I didn't cover the R installation process but don't worry, if you don't have R on your PC, have a look at this tutorial: https://www.datacamp.com/community/tutorials/installing-R-windows-mac-ubunt

Feed the model with real data

Data frame is a two-dimensional data structure in R that stores vectors of equal length. I'll initiate a new data frame called covid19 with the raw data about the new daily case in Poland.

# New Cases in Poland
covid19 <- data.frame(
Dates = as.Date(c("2020-03-04","2020-03-05","2020-03-06","2020-03-07","2020-03-08","2020-03-09","2020-03-10","2020-03-11","2020-03-12","2020-03-13","2020-03-14","2020-03-15","2020-03-16","2020-03-17","2020-03-18","2020-03-19","2020-03-20","2020-03-21","2020-03-22","2020-03-23","2020-03-24","2020-03-25","2020-03-26","2020-03-27","2020-03-28","2020-03-29","2020-03-30","2020-03-31","2020-04-01","2020-04-02","2020-04-03","2020-04-04","2020-04-05","2020-04-06","2020-04-07","2020-04-08","2020-04-09","2020-04-10","2020-04-11","2020-04-12")),
NewCases = c(1,0,4,1,5,6,5,9,20,17,36,21,52,61,49,68,70,111,98,115,152,150,170,168,249,224,193,256,243,392,437,244,475,311,435,357,370,380,401,318)
)

As a warm-up, let's also calculate the Total Cases for each day. Since that data is already available, we could easily load it in the previous step, but I think it's a cool exercise to see how for loop works as well as to conduct a common data manipulation task to merge two data frames with Column Bind (cbind). Feel free to cross-check it with the public data. Both data sets should be the same.

# let's calculate Total Cases for each day

Total <- c()
count <- 0
for (val in covid19$NewCases){
    count = count+val
    Total <- c(Total, count)
}

# and then add to two our data
covid19 <- cbind(covid19,TotalCases=Total)

Experimenting

I was trying to find any clue that can help me with finding the trend. After a few experiments, I've discovered that the ratio of New Cases to Total Cases for a given day might be a good indicator. So let's see and add it to the model as NewInTotal.

NewInTotal <- c()

for (val in nrow(covid19)){
    count = covid19$NewCases / covid19$Total
    NewInTotal <- c(NewInTotal, count)
}

covid19 <- cbind(covid19,NewInTotal=NewInTotal)

Would you like to see how our data structure looks like? Try the head command to return only the first 20 rows of our new data structure:

> head(covid19,n=20)
        Dates NewCases TotalCases NewInTotal
1  2020-03-04        1          1  1.0000000
2  2020-03-05        0          1  0.0000000
3  2020-03-06        4          5  0.8000000
4  2020-03-07        1          6  0.1666667
5  2020-03-08        5         11  0.4545455
6  2020-03-09        6         17  0.3529412
7  2020-03-10        5         22  0.2272727
8  2020-03-11        9         31  0.2903226
9  2020-03-12       20         51  0.3921569
10 2020-03-13       17         68  0.2500000
11 2020-03-14       36        104  0.3461538
12 2020-03-15       21        125  0.1680000
13 2020-03-16       52        177  0.2937853
14 2020-03-17       61        238  0.2563025
15 2020-03-18       49        287  0.1707317
16 2020-03-19       68        355  0.1915493
17 2020-03-20       70        425  0.1647059
18 2020-03-21      111        536  0.2070896
19 2020-03-22       98        634  0.1545741
20 2020-03-23      115        749  0.1535381

Building a model

Let me state the hypothesis: since the TotalCases reached 11 on day 5 in Poland, the ratio of New Cases to Total Cases for a given day might be described with an exponential trendline. If so, let's try to model it. lm is used to fit linear models. Hence we need to use a small trick. Do you remember that exp(log(x))=x?

# model will be built on the data since TotalCases reached 11 (day 5)
model <- covid19[covid19$Total > 10,]

lp <- seq(1, nrow(model),1)
model <- cbind(model,nr=lp)

# fit exp model
exponential.model <- lm(log(model$NewInTotal)~ model$nr)

# we will use data for first 4 days later
covid19.firstdays <- covid19[covid19$Total < 10,]
lp <- seq(1, nrow(covid19.firstdays),1)
covid19.firstdays <- cbind(covid19.firstdays,nr=lp)

And now, let's assess how good the mode is by running summary command:

> summary(exponential.model)

Call:
lm(formula = log(model$NewInTotal) ~ model$nr)

Residuals:
     Min       1Q   Median       3Q      Max
-0.39616 -0.10746 -0.02575  0.13879  0.32368

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.973888   0.065378  -14.90   <2e-16 ***
model$nr    -0.051718   0.003081  -16.78   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1921 on 34 degrees of freedom
Multiple R-squared:  0.8923,    Adjusted R-squared:  0.8891
F-statistic: 281.7 on 1 and 34 DF,  p-value: < 2.2e-16

R-squared is 0.8923. It indicates that ~89% of the variation in the output variables are explained by the input variables so I think it's worth to explore this model further. Let's visualize a trend with plot command and add a trend line with lines command. Meanwhile, let's also define two one-line functions that will help us with further calculations:

# It returns the date for given day since 10 Total Cases
model.date <- function (i) return(as.Date("2020-03-07")+i);

#It calculates predicted NewInTotal for given day since 10 Total Cases
model.predictedNewinTotal <- function(i) return(exp(-0.973888 + (-0.051718*i)))
# Let's visualize it
model.NewInTotal <- exp(predict(exponential.model,list(model$nr)))

plot(model$nr, model$NewInTotal,
type="b",
col = "red",
lwd = 2,
xlab = "Days since 10 Total Cases",
ylab = "Daily New Cases In Total Cases",
main = "Daily New Cases In Total Cases \nsince more than 10 Total Cases in Poland"
)
lines(model.NewInTotal, col = "blue", lty="dashed", lwd = 3,)

legend("topright", legend=c("Real", "Model"),
       col=c("red", "blue"), lty = 1:2, cex=1)

And this how it looks like:

For further calculations, we need to create new data frames. expmodel will store values as per model, and expmodeladjusted will store values as per model calibrated to the TotalCases for yesterday (Apr 12, 2020). Calculations will be performed in an infinite while loop that will be broken if daily NewCases has reached 0.

#Daily NewCases as per model since the day with more than 10 Total Cases
expmodel <- data.frame(Dates = c(model.date(1)), NewCases =c(model$NewCases[1]), TotalCases = c(model$TotalCases[1]), NewInTotal = c(model$NewInTotal[1]),  nr = c(1))

i=2
M.NewCases <- NA
while (1) {
    M.PrevTotalCases = expmodel$TotalCases[i-1]
    M.Dates = model.date(i)
    M.NewInTotal = model.predictedNewinTotal(i)
    M.TotalCases = round(M.PrevTotalCases/(1-M.NewInTotal))
    M.NewCases = M.TotalCases - M.PrevTotalCases
    M.nr = i
    expmodel = rbind(expmodel,list(M.Dates ,M.NewCases,M.TotalCases,M.NewInTotal,M.nr))
    i = i+1
    
    if (M.NewCases == 0) break
}

i = nrow(model)

#Daily NewCases as per model since the day with more than 10 Total Cases, calibrated as per Easter Sunday's value.
expmodeladjusted <- data.frame(Dates = c(model.date(i)), NewCases = c(model$NewCases[i]), TotalCases = c(model$TotalCases[i]), NewInTotal = c(model$NewInTotal[i]),  nr = c(i))

while (1) {
    i = i+1
    M.PrevTotalCases = expmodeladjusted$TotalCases[i-nrow(model)]
    M.Dates = model.date(i)
    M.NewInTotal = model.predictedNewinTotal(i)
    M.TotalCases = round(M.PrevTotalCases/(1-M.NewInTotal))
    M.NewCases = M.TotalCases - M.PrevTotalCases
    M.nr = i
    expmodeladjusted = rbind(expmodeladjusted,list(M.Dates ,M.NewCases,M.TotalCases,M.NewInTotal,M.nr))
    
    if (M.NewCases == 0) break
}

Let's visualize Total Case and finally find out when this madness will end:

model = unique(rbind(covid19.firstdays, model))
model = unique(rbind(model, expmodeladjusted))
barplot(model$TotalCases,
main = "Total Cases in Poland",
xlab = "Date",
ylab = "Total Cases",
col=c("darkblue","white"),
beside = TRUE,
ylim=c(0,25000),
names.arg = model$Dates)

Hmm, Looks like we need to be patient...

But perhaps Daily New Cases will give us some sense of hope, let's have a closer look:

plot(model$Dates, model$NewCases,type="b",ylim=c(0,500),xlab="Date", ylab = "New Cases", main = "Daily New Cases in Poland", pch=15)

expmodel = unique(rbind(covid19.firstdays , expmodel))

# Red dotted line shows Easter Sunday
points(x=as.Date(c('2020-04-12','2020-04-12')), y=c(0,500), col='red', type='l', lty="dotted",lwd=2)

# Blue dashed line shows the model data
points(x=model$Dates[1:length(expmodel$NewCases)], y=expmodel$NewCases, col='blue', type='l', lty="dashed",lwd=2)


legend("topright", legend=c("Real/Calibrated Model", "Model", "Easter Sunday"), col=c("black","blue", "red"), lty = 1:3, cex=1)

That's cool! If this model is correct, Poland is almost on its peak. In the second half of April, Daily New Cases should start to go down:

Key takeaways

I appreciate it's a simple mathematical model that works under the assumption of ceteris paribus, i.e. the weather is not a major factor in virus spreading, social distancing and travel restrictions will remain the same as now, the presidential election that is scheduled for May in Poland won't trigger the second wave, etc.
Experts say COVID-19 confirmed cases are a fraction of the real total number. I think that's a fair statement. Unfortunately, there is no better data available, so I reckon we need to be more focused on trend analysis rather than actual numbers.
But if this model is correct, Poland is almost on its peak. Shortly after 15 Apr 2020, Daily New Cases should start slowly to go down. Even though, it is not even the beginning of the end. But it is, perhaps, the end of the beginning because the coronavirus will likely stay with us until Summer. So stay safe! ??
Programming in R is fun! Feel free to reuse my source code and build a model for another country. But to be frank, you don't need to use R to create a similar model. Perhaps you'll build your model by excelling your skills in MS Excel.??

要查看或添加评论，请登录

Marek Zielinski的更多文章

How to predict who wins the FIFA World Cup 2022 in Qatar? ??

2022年12月3日

How to predict who wins the FIFA World Cup 2022 in Qatar? ??

So we are about to start the playoffs and I guess many experts choose Brazil ???? as an anticipated champion of the…

3 条评论
How can R help you with planning international travel in ‘new normal’?

2020年7月8日

How can R help you with planning international travel in ‘new normal’?

So, we're getting into the summer months, but the virus is still around. That triggers many questions including: is it…

3 条评论
Are you keen to meet up with us?

2019年8月14日

Are you keen to meet up with us?

Who are we? Executive MBA Poznan Atlanta Student Alumni Association is a non-profit society which performs research and…

1 条评论

Build your own COVID-19 projections model with R language

Marek Zielinski

Director of Software Engineering at GSK | Exec MBA

Installation

Feed the model with real data

Experimenting

Building a model

Key takeaways

Marek Zielinski的更多文章

社区洞察

其他会员也浏览了

Is it going to rain ? A Pyomo-SVM Model

Check Regional Information via Coarsen

Grind 75 - 7 - Valid Anagram

Grind 75 - 21 - Diameter of Binary Tree

The Raw Output of Llama3.1 on Last Topic

Shiny App for Mapping Quandl Data

Exploring Univariate Combo Charts

C++ Core Guidelines: More Rules to Performance

AI/ML : Customers Behavior Analysis To Increase Subscriptions for Financial App...From Scratch [Part 1]

Numeric and Character Functions in R

Installation

Feed the model with real data

Experimenting

Building a model

Key takeaways

Marek Zielinski的更多文章

How to predict who wins the FIFA World Cup 2022 in Qatar? ??

How can R help you with planning international travel in ‘new normal’?

Are you keen to meet up with us?

社区洞察

其他会员也浏览了

Is it going to rain ? A Pyomo-SVM Model

Check Regional Information via Coarsen

Grind 75 - 7 - Valid Anagram

Grind 75 - 21 - Diameter of Binary Tree

The Raw Output of Llama3.1 on Last Topic

Shiny App for Mapping Quandl Data

Exploring Univariate Combo Charts

C++ Core Guidelines: More Rules to Performance

AI/ML : Customers Behavior Analysis To Increase Subscriptions for Financial App...From Scratch [Part 1]

Numeric and Character Functions in R