Spearman correlations between the UV Index and COVID-19 per region using IBM PAIRS & Jupyter Notebooks.

Authors (alphabetical): @Marc Fiammante, @Merijn Weiss, @Wiktor Mazin, PhD, MMT

Introduction

In the last article we introduced 1) IBM PAIRS, which provides highly localized geospatial and temporal data, and 2) data sources that give access to COVID-19 data on a regional level. In this week’s article we start to explore our first analysis.

In data science, there are many analyses and algorithms to choose from. The choice depends on the purpose of the analysis and the available data. To structure the data science process, it is common to follow a series of steps, for example those of the CRISP-DM framework.

In this article we take a look at the CRISP-DM framework and explore Spearman’s correlation coefficient. Using Spearman, we explore correlations between the UV Index from IBM PAIRS and regional COVID-19 data.

Inspired by the CRISP-DM framework, our exploration is structured into three agile phases:

1.     Challenge understanding

2.     Data understanding and data preparation

3.     Correlation modeling

The results of these phases are presented in this article.

Please note: our exploration is only to identify & create assets for data scientists to explore geospatial-temporal data. The examples should not be taken as any interpretation of the results. We are not trained epidemiologists and therefore leave all interpretations to those that have the professional expertise.

Phase 1: Challenge understanding

Why Spearman correlations

Spearman correlations have been used in various articles, such as Asyary and Veruswati, Bashir et al. and Sahin. These articles present analyses of weather factors / geospatial-temporal features and their relation to the spread and impact of COVID-19.

In the articles, correlations are explored between COVID-19 statuses like positive cases, deaths and recovered and weather factors like sunlight duration, temperature, dew point, humidity, rainfall, wind speed and air quality on a city level. Spearman correlations were assessed at the 1%, 5% or 10% significance level, or without considering the significance level at all. Time lags of 1, 3, 7 and 14 days have also been investigated.


What is the Spearman correlation

Spearman’s rank correlation coefficient is the nonparametric counterpart of the Pearson correlation coefficient. It measures the strength and direction of the monotonic association between two variables by correlating their ranks, and takes values between -1 and 1. A value of 1 denotes a perfect positive correlation between ranks, -1 a perfect negative correlation between ranks, and 0 no correlation between ranks.

Spearman's coefficient is appropriate for both continuous and discrete ordinal variables and is used if data does not follow a normal distribution. If your data does follow a normal distribution, use the Pearson correlation coefficient.
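To illustrate the definition above, here is a minimal sketch in Python, assuming NumPy and SciPy are available. The data is made up for illustration and is not taken from our analysis. It shows that Spearman's coefficient equals the Pearson coefficient computed on the ranks, and that a monotonic but non-linear relationship still yields a perfect Spearman correlation.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Toy data (made up): y increases monotonically, but not linearly, with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)

# Spearman's coefficient as reported by SciPy (two-sided p-value by default).
rho, p_value = spearmanr(x, y)

# The same coefficient obtained as the Pearson correlation of the ranks.
rho_from_ranks, _ = pearsonr(rankdata(x), rankdata(y))

# Pearson on the raw values only captures the linear part of the relationship.
r_linear, _ = pearsonr(x, y)

print(f"Spearman rho:          {rho:.3f}")
print(f"Pearson on ranks:      {rho_from_ranks:.3f}")
print(f"Pearson on raw values: {r_linear:.3f}")
```

Because the ranks of x and y are identical here, the Spearman coefficient is 1, while the Pearson coefficient on the raw values is lower.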

Purpose of our exploration

Our first objective is to examine the correlation between one weather factor, the UV Index, and COVID-19 statuses on a regional level. This simple correlation is done to get a better understanding of IBM PAIRS, and build an analysis and visualization pipeline in a Jupyter Notebook. We were, of course, also curious to see if regional differences could be observed.

Because IBM PAIRS also contains the other weather factors described above, additional correlation analyses with COVID-19 statuses can be carried out by adapting and expanding the pipeline.

Phase 2: Data understanding and data preparation

To explore the regional impact of weather on COVID-19, we require three sources of data:

1.     Definition of the region. For this we used the vector data of Natural Earth, which provides a geospatial definition of the country and its regions.

2.     COVID-19 data on confirmed, hospitalized and deceased individuals per day. These datasets are country specific and provide the outcome we use in the Spearman correlation.

3.     Weather data per region per day. This dataset is obtained from IBM PAIRS and provides the predictor we use in the Spearman correlation.

We use the weather factor UV Index as our first predictor. The UV Index indicates the strength of sunburn-producing UV radiation; in the UV-B range it is also a proxy for Vitamin D and DNA/RNA damage. The UV Index has been part of several COVID-19 studies, see e.g. here and here. The last reference states that a higher UV Index assists in slowing the growth rate of new cases, but that the overall impact on COVID-19 spread remains modest.

We obtained daily UV numbers from the UV Index layer in IBM PAIRS and calculated the average UV Index for each region. Then we applied a 7-day rolling sum, since we hypothesized that a single day of exposure has a limited impact.
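The two steps described above (regional average, then 7-day rolling sum) can be sketched with pandas as follows. This is an illustration only: the column names and the values are hypothetical and do not reflect the actual IBM PAIRS output or our Notebook.

```python
import pandas as pd

dates = pd.date_range("2020-03-01", periods=10, freq="D")

# Hypothetical UV Index values: two grid cells per day for a single region.
cells = pd.DataFrame({
    "date": dates.repeat(2),
    "region": "South-Holland",
    "uv_index": [1.0, 3.0] * 10,
})

# Step 1: average all cells of a region to get one value per region per day.
daily = (
    cells.groupby(["region", "date"], as_index=False)["uv_index"]
    .mean()
    .sort_values(["region", "date"])
    .reset_index(drop=True)
)

# Step 2: 7-day rolling sum per region. Days without a complete
# 7-day window (the first six days here) are left as NaN.
daily["uv_7d_sum"] = daily.groupby("region")["uv_index"].transform(
    lambda s: s.rolling(window=7, min_periods=7).sum()
)

print(daily)
```

Requiring a full window (min_periods=7) is a design choice in this sketch; it avoids comparing partial sums at the start of the series with full 7-day sums later on.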

We have looked into COVID-19 regional data for a few countries. In the example, we use the COVID-19 data from The Netherlands. The Dutch COVID-19 open data includes people confirmed, hospitalized or deceased.

After collecting, cleansing and merging the UV data with the COVID-19 data on a regional level, it is important to evaluate the data before moving to the next phase. Visualization of the data can provide a quick overview.
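As a rough sketch of the merge and a first data check, assuming both tables have already been reduced to one row per region per day, the step could look like this in pandas. The table layout, column names and numbers are hypothetical.

```python
import pandas as pd

dates = pd.date_range("2020-03-07", periods=5, freq="D")

# Hypothetical UV table: one row per region per day.
uv = pd.DataFrame({
    "region": "South-Holland",
    "date": dates,
    "uv_7d_sum": [10.5, 11.0, 12.2, 13.1, 14.0],
})

# Hypothetical COVID-19 table; one day (2020-03-09) is left out on purpose.
covid = pd.DataFrame({
    "region": "South-Holland",
    "date": dates.delete(2),
    "confirmed": [3, 5, 9, 12],
    "hospitalized": [1, 1, 2, 4],
    "deceased": [0, 0, 1, 1],
})

# A left merge keeps every UV day; validate guards against duplicate keys.
merged = uv.merge(covid, on=["region", "date"], how="left", validate="one_to_one")

# Quick evaluation: how many values are missing per column after the merge?
missing_per_column = merged.isna().sum()
print(missing_per_column)
```

Counting missing values after the merge is one simple way to spot days that exist in one source but not in the other before moving on to the correlation.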

The following plot shows the UV Index (our predictor) and the various COVID-19 metrics (our outcome) for the Dutch region South-Holland.

[Figure: UV Index and COVID-19 metrics for the Dutch region South-Holland]


Phase 3: Correlation modeling

As explained, the objective of a Spearman correlation is to understand whether two variables have a relationship. Scatter plot matrices are a great way to plot bivariate relationships between combinations of variables.

A scatter plot shows, for each combination of variables, the relationship between the values. If the scatter is random, the relationship is weak (or non-existent). If (curved) lines are visible, this is an indication of a relationship.

The following Scatter Plot Matrix looks again at the Dutch region South-Holland.

[Figure: scatter plot matrix for the Dutch region South-Holland]

The visuals suggest some correlations, but the coefficient and its associated p-value need to be calculated to confirm this impression and to determine the significance of the correlation.

Spearman correlations

Spearman correlations between the COVID-19 statuses and the 7-day rolling sum of the UV Index were calculated together with the associated two-sided p-value. The p-value is for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated.
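A minimal sketch of such a calculation, assuming SciPy and a merged table with one row per region per day, is shown below. The helper name spearman_by_region, the column names and the synthetic data are our own illustration, not the actual Notebook code.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def spearman_by_region(df, predictor, outcomes):
    """Return one row per (region, outcome) with rho and the two-sided p-value."""
    rows = []
    for region, group in df.groupby("region"):
        for outcome in outcomes:
            pair = group[[predictor, outcome]].dropna()
            rho, p_value = spearmanr(pair[predictor], pair[outcome])
            rows.append({"region": region, "outcome": outcome,
                         "rho": rho, "p_value": p_value, "n": len(pair)})
    return pd.DataFrame(rows)

# Synthetic data (made up): region A falls with UV, region B is pure noise.
rng = np.random.default_rng(0)
n = 60
uv = np.linspace(5.0, 30.0, n)
df = pd.DataFrame({
    "region": ["A"] * n + ["B"] * n,
    "uv_7d_sum": np.tile(uv, 2),
    "confirmed": np.concatenate([
        100.0 - 3.0 * uv + rng.normal(0.0, 1.0, n),
        rng.normal(50.0, 5.0, n),
    ]),
})

result = spearman_by_region(df, "uv_7d_sum", ["confirmed"])
print(result)
```

In this toy example, region A shows a strongly negative coefficient with a very small p-value, while region B has no built-in relationship.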

In the case of The Netherlands, we are carrying out multiple hypothesis tests, as we are correlating the 7-day rolling sum with the three variables (confirmed, hospitalized and deceased) for each of the regions. This corresponds to ~40 hypothesis tests.

As we want to limit the number of spurious correlations, we apply a conservative method to account for multiple comparisons, the Bonferroni correction. In our calculations, we assess correlation significance at the 0.1% level.
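The arithmetic behind a Bonferroni correction can be sketched as follows. Note that the family-wise level of 5% used here is an assumption for illustration and the p-values are made up.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return family_alpha / n_tests

n_tests = 40          # approximate number of tests mentioned above
family_alpha = 0.05   # ASSUMPTION: family-wise level, not stated in the article

per_test_alpha = bonferroni_threshold(family_alpha, n_tests)
print(f"per-test threshold: {per_test_alpha:.5f}")

# Made-up p-values, only to show how the threshold is applied.
p_values = [0.0004, 0.003, 0.04]
significant = [p < per_test_alpha for p in p_values]
print(significant)
```

Under this assumption the per-test threshold is 0.05 / 40 = 0.00125, which is close to the 0.1% level used in our calculations.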

Choropleths

Visualizations play an important part in conveying the results, so we decided to map the correlation results onto a regional map of The Netherlands, using a choropleth map. In a choropleth map, areas are shaded or patterned in proportion to a statistical variable that applies to that area.

[Figure: choropleth map of The Netherlands showing the regional correlation results]

Shown in green are the regions in The Netherlands where the COVID-19 cases are negatively correlated with the 7-day rolling sum of the UV Index at the 0.1% significance level. Negative correlation means that when the UV Index goes up, the cases go down and vice versa.

Correlation was calculated taking different time shifts into account. This is done to explore the impact of different time lags and can, for example, be used to take incubation times into account in an analysis. We see how different time shifts impact the regional correlation results.
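One way to explore such time shifts is to shift the predictor by a number of days before correlating it with the outcome. The sketch below assumes a single region with one row per day in date order. The helper name lagged_spearman and the synthetic series, in which the outcome reacts to UV with a 7-day delay, are made up for illustration.

```python
import pandas as pd
from scipy.stats import spearmanr

def lagged_spearman(df, predictor, outcome, lags):
    """Correlate the outcome with the predictor observed `lag` days earlier."""
    results = {}
    for lag in lags:
        shifted = df[predictor].shift(lag)  # row t now holds the value of day t - lag
        pair = pd.DataFrame({"x": shifted, "y": df[outcome]}).dropna()
        rho, p_value = spearmanr(pair["x"], pair["y"])
        results[lag] = (rho, p_value)
    return results

# Synthetic daily series (made up): cases depend on the UV value of 7 days before.
uv = [float((i * 7) % 11) for i in range(40)]
cases = [float("nan")] * 7 + [100.0 - uv[t - 7] for t in range(7, 40)]
df = pd.DataFrame({"uv_7d_sum": uv, "confirmed": cases})

results = lagged_spearman(df, "uv_7d_sum", "confirmed", lags=[0, 7, 14])
for lag, (rho, p_value) in results.items():
    print(f"lag {lag:2d} days: rho = {rho:+.3f}, p = {p_value:.4f}")
```

In this constructed example the correlation is perfectly negative at a lag of 7 days, because that is the delay built into the synthetic data. With real data, each additional lag also adds to the number of hypothesis tests to correct for.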

An impact on the results is clear when we change the time window. Using the same correlation, but now from April onwards, changes the outcome quite a bit:

[Figure: choropleth map of The Netherlands for the time window from April onwards]

And with our first analysis completed, we are back at Phase 1 again. We need to reevaluate our purpose / assess the hypothesis, evaluate the data and reconsider the analysis used. However, we do have our initial data sets, a data pipeline and visualizations: assets we can re-use for other countries and for further exploration.

Coming next

In the next article we will describe some of the assets used to build the data pipeline, run the analysis and create the visualizations. We will describe the data, the Python libraries and the structure of the Notebook with the intent to make these available for others to re-use.

