Accelerate analyzing the influence of regional factors on COVID-19 using IBM PAIRS & Jupyter Notebooks.
Marc Fiammante
Inventor at Paris Brain Institute / AP-HP - leader of Newborn Neurodigital AI Convergence project. Retired IBM Fellow.
Authors (alphabetical): @Marc Fiammante, @Merijn Weiss, @Wiktor Mazin, PhD, MMT
Introduction
Recent studies have looked globally at the influence of various climatic factors on occurrences of COVID-19 ([uk uvindex], [indonesia sunlight], [global temp, humidity, latitude]). Even though these studies show some correlation, consistently interpreting, or even reproducing, the results is a challenge.
The available COVID-19 data, the measures to contain the spread and the behaviour of people change over time. They differ between countries, and even between regions within a country. In addition, the granularity of the available data differs greatly.
However, because the spread and impact of COVID-19 depend on local factors, there is a need for multi-factor, fine-grained (regional) data: data that can be consolidated and analyzed in an iterative and agile fashion, so that influences can be detected more precisely.
My colleagues Merijn and Wiktor and I decided to start exploratory research into assets that can help data scientists analyze the influence of geospatial-temporal data on the current pandemic.
In a series of articles, we will share our progress as we continue our exploration. Re-usable assets will be made available where applicable. We will discuss data access, review some of the recent studies and explain the analytical approaches they use (e.g. Spearman, GAM). These studies and correlations will feed into practical examples of how to gather, analyze and visualize results.
Our exploration aims only to identify and create assets that data scientists can use to explore geospatial-temporal data. The examples should not be taken as an interpretation of the results. We are not trained epidemiologists and therefore leave all interpretation to those who have the professional expertise.
IBM PAIRS
Climatic data can be found online from diverse sources, but getting access to all possible influencing factors, including non-climatic ones, is often a lengthy process. Many sources need to be integrated, coordinates aligned, and licensing taken care of.
However, the availability of such a consistent, fine-grained dataset is a prerequisite for any geospatial-temporal analysis, and one that IBM fulfills with the IBM PAIRS Geoscope platform.
IBM PAIRS Geoscope is a platform specifically designed for massive geospatial-temporal data. Data is ingested from a wide variety of sources and prepared for search-friendly access. The platform provides access to a rich, diverse, and growing catalog of continually updated, geospatial-temporal aligned information.
The current catalog has over 4 petabytes of data collected, curated, and ready to use, available in various categories, aligned geographically with consolidated resolution and coordinates.
- Application: Agriculture, Rapid response, Wildfire
- Domain: Atmosphere, Land surface, Oceans/lakes/rivers, Urban
- Sector: Animals/livestock, Economic, Energy, Geologic/soil, Political, Social, Transportation/infrastructure, Vegetation/crops, Weather/climate
- Source: (IoT) sensor, Aerial/drone, Radar, Satellite, Survey
- Type: Analytics product, Data product, Forecast, Measurement/survey
The platform is available on IBM Cloud and accessible via a GUI and API. A free edition with a subset of data is available to everyone via the GUI on https://ibmpairs.mybluemix.net/. For API access a Python SDK is available on https://github.com/IBM/ibmpairs
In our exploration we will use IBM PAIRS for climatic factors such as UV Index, Temperature, Humidity and Wind Speed.
Accessing COVID-19 pandemic country data
IBM PAIRS also includes data from Johns Hopkins University on the global spread of COVID-19. This data includes confirmed cases and deaths, but it is tracked only at the country level.
For our approach we wanted data at the regional level and, where available, daily hospitalization and intensive care figures. We found that countries differ widely in which metrics they track, and in the granularity, definition and quality of those metrics. Nevertheless, we attempt to harmonize the data where possible.
Ideally, our exploration obtains the following metrics at the regional level:
- confirmed: individual tested positive for COVID-19
- hospitalized: individual admitted to a general hospital and tested positive for COVID-19
- hospitalized_icu: individual admitted to an ICU in the hospital and tested positive for COVID-19
- recovered: individual confirmed to have recovered from COVID-19
- deceased: individual confirmed to have passed away with COVID-19 infection
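One possible way to hold these harmonized metrics is sketched below, using only the Python standard library. This is an illustration rather than the code used in our exploration: the `RegionalRecord` class, the `harmonize` function, the `EXAMPLE_MAPPING` dictionary and the source column names in the sample CSV are all hypothetical, and each real source would need its own mapping. Metrics that a source does not publish are left as `None` instead of being guessed.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RegionalRecord:
    """One harmonized observation for one region on one day (hypothetical schema)."""
    date: str
    region: str
    confirmed: Optional[int] = None
    hospitalized: Optional[int] = None
    hospitalized_icu: Optional[int] = None
    recovered: Optional[int] = None
    deceased: Optional[int] = None


# Hypothetical mapping: harmonized field name -> column name in one source file.
# The mapping must contain at least "date" and "region".
EXAMPLE_MAPPING: Dict[str, str] = {
    "date": "date",
    "region": "region_name",
    "confirmed": "cases",
    "hospitalized": "hosp",
    "hospitalized_icu": "icu",
    "deceased": "deaths",
}


def _to_int(value: Optional[str]) -> Optional[int]:
    """Convert a CSV cell to int, returning None for empty or unparsable cells."""
    text = (value or "").strip()
    if not text:
        return None
    try:
        return int(float(text))
    except (ValueError, OverflowError):
        return None


def harmonize(csv_text: str, mapping: Dict[str, str]) -> List[RegionalRecord]:
    """Read source-specific CSV text and return records in the harmonized schema."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields = {}
        for target, source in mapping.items():
            raw = row.get(source)
            # Date and region stay as text; every other field is a count.
            fields[target] = raw if target in ("date", "region") else _to_int(raw)
        records.append(RegionalRecord(**fields))
    return records


# Made-up sample rows, for illustration only (not real figures).
SAMPLE = """date,region_name,cases,hosp,icu,deaths
2020-04-01,Region A,120,30,5,2
2020-04-01,Region B,80,,3,1
"""

records = harmonize(SAMPLE, EXAMPLE_MAPPING)
for record in records:
    print(record)
```

Keeping missing values as `None` makes it visible downstream which metrics a given country or region does not report.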
The current sources we are using are:
France
- Official Open Data: https://www.data.gouv.fr/fr/datasets/chiffres-cles-concernant-lepidemie-de-covid19-en-france/
- Data path: https://raw.githubusercontent.com/opencovid19-fr/data/master/dist/chiffres-cles.csv
Netherlands
- Open Source Data Initiative: https://github.com/J535D165/CoronaWatchNL
- Data Path: https://raw.githubusercontent.com/J535D165/CoronaWatchNL/master/data-json/data-provincial/RIVM_NL_provincial_latest.json
Denmark
- Official data (zip file) from Statens Serum Institut: https://www.ssi.dk/sygdomme-beredskab-og-forskning/sygdomsovervaagning/c/covid19-overvaagning/arkiv-med-overvaagningsdata-for-covid19
Sweden
- Official data (zip file) from the European Data Portal: https://www.europeandataportal.eu/data/datasets/https-free-entryscape-com-store-360-resource-12 ("Number of cases of coV-19 in Sweden per day and region")
We are in the process of adding more countries and have colleagues looking at regions in other continents.
Coming next…
The first analysis we will look at in the next article is a Spearman correlation. We will explore the correlation between a single predictor, UV Index, and various outcomes, such as the incidence of hospitalized COVID-19 patients.
We will make use of public data sources, IBM PAIRS, Jupyter Notebooks and various Python libraries to ingest the data, calculate the Spearman correlation coefficient, assess the significance and visualize the outcomes such as in this example:
In the next article we look at Spearman's correlation coefficient for different countries, applying different time slices, rolling windows and time shifts.
A following article dives into the code and points to the public GitHub repository with sample data for testing and details on how to get a 30-day free trial of PAIRS.
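As a small preview of the kind of computation involved, here is a minimal sketch of Spearman's rank correlation with a time shift and a rolling window, written in plain Python so that each step is visible. It is not the code from the upcoming articles: the function names (`average_ranks`, `spearman`, `shifted_spearman`, `rolling_mean`) are hypothetical, and the two series are made-up numbers chosen only to exercise the functions.

```python
from math import sqrt
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation; None if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Spearman's rho: the Pearson correlation of the rank-transformed series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


def shifted_spearman(predictor: Sequence[float], outcome: Sequence[float],
                     shift_days: int) -> Optional[float]:
    """Correlate predictor[t] with outcome[t + shift_days] (shift_days >= 0)."""
    if shift_days < 0:
        raise ValueError("shift_days must be non-negative")
    if shift_days == 0:
        return spearman(predictor, outcome)
    return spearman(predictor[:-shift_days], outcome[shift_days:])


def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Trailing rolling mean; the result has len(values) - window + 1 entries."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


# Made-up daily series, for illustration only (not real measurements).
uv_index = [1, 2, 3, 4, 5, 6, 7, 8]
hospitalized = [50, 50, 40, 35, 30, 25, 20, 15]

rho = shifted_spearman(uv_index, hospitalized, shift_days=2)
smoothed = rolling_mean(hospitalized, window=3)
print(rho, smoothed)
```

This sketch computes only the coefficient. For the significance, a library routine such as `scipy.stats.spearmanr`, which returns both the coefficient and a p-value, is probably the more practical choice.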
#ibm, #ibmpairs, #datascience, #resuableassets