A new dataset for the characterization of the COVID-19 outbreak in Europe: a correlation study among different geo-statistical areas
1. Summary
In this article, we present a new study for the characterization of the COVID-19 outbreak in Europe. The goal of this analysis is to provide a first investigation of the correlation between different environmental and demographic factors and the impact of COVID-19 in different European areas. It is a first attempt to answer some significant questions, such as: why have some areas been particularly hit by COVID-19 whereas others have not? What are the main factors that have influenced the spread of the virus and the death rate? These are very complex questions whose answers are beyond the goals of this study and have to be investigated in future research. Our contribution is mainly focused on the collection of a new European dataset, constructed by merging different databases provided by public authorities and by the European Union. In this dataset, every single geo-statistical area (also called NUTS3) is described by three parameters (“targets”) related to the COVID-19 diffusion in that particular area: y1 = “number of positives per 100,000 inhabitants per day”, y2 = “number of deaths per 100,000 inhabitants per day” and y3 = “death rate” (better explained in the next paragraphs). In addition, every single NUTS3 is described by a set of variables called “features”, which can be viewed as numerical descriptions of different factors that could have influenced the spread of the virus and its effects. Such variables are, for example, the distribution of the population by age group, the female/male ratio, the impact of different diseases on the population, the pollution level, some climate descriptors and the type of measures adopted by the authorities. Once we had collected this dataset, we performed a first analysis of the data by studying the correlation of every single feature with the targets using the Pearson and Kendall coefficients.
Lastly, we tried to interpret these coefficients, providing some possible explanations of the results. For a quick reading of this article, skip sections 3, 4.b and 4.c. All the results are summarized in the conclusions.
2. Motivations and limits
Correlation studies constitute a first step in understanding the main factors influencing the diffusion and the mortality of the COVID-19 epidemic. In this perspective, such studies are essential in the development of robust forecasting models able to consider environmental conditions, population characteristics and government responses. Nonetheless, this goal is quite challenging for four main reasons. The first one is the current availability of open-source datasets of decent quality. Although a considerable number of databases have been published so far, most of them are unusable for correlation study purposes. On the one hand, some of them lack scalability because the data are collected only in some very specific regions. On the other hand, others consider larger areas but show a very low granularity, providing data only at national level, as for example the well-known Johns Hopkins University dataset (US excepted). The second main difficulty concerns the way in which data are collected. Each nation has adopted different methods to count COVID-19 cases and deaths; moreover, the number of tests per inhabitant is far from uniform, and this has a direct impact on the count of the positives. The third challenge is an intrinsic limit of every correlation study: correlation does not imply causality. Two variables can be statistically correlated with one another even though no cause-effect relationship exists between them. Lastly, given the structure of our dataset, another limit is the impossibility of applying the results obtained by analyzing groups to any single person belonging to those groups; doing otherwise would constitute an ecological fallacy. For these reasons, this study is mainly focused on highlighting high-level correlations between the features (descriptors) and the targets, in the hope that the results can inspire and direct more specific studies in the near future.
In one sentence we could say: “the study is more useful for asking the right questions than for providing definitive answers”.
3. Related works
Multiple COVID-19 data repositories have been published so far by different institutions: the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [16], the World Health Organization, the Chinese Center for Disease Control and Prevention, the National Health Commissions of different states, and the European Centre for Disease Prevention and Control (ECDC) [17]. In an effort to readily provide the information necessary to understand the pandemic diffusion, Samrat K Dey et al. [1] compiled and analyzed epidemiological outbreak information on COVID-19 based on the data from the above sources. In [19], Mariotti et al. investigated the spread of the virus in Italy using the well-known SIR model; we invite you to read that article for a better understanding of the mathematical techniques used to describe and predict the time evolution of the epidemic. A correlation study has been produced by E. Bontempi et al. [2], who analyzed air pollution and its possible effect on the virus diffusion, examining the PM10 situation in Lombardy before the outbreak of the health emergency in Italy. We took this investigation as a reference to compare part of our results related to the environmental features. The possible correlations between temperature and COVID-19 spread were instead compared with those of Robert L. Johnson et al. [3]; the impact of humidity was checked against the findings of Jingyuan Wang et al. [4]. In both cases we found coherent results. Also, some outcomes related to diabetes were confirmed by the results of Swati Sharma et al. and S. Knapp et al. [5] [6]. Lastly, it is fair to mention the data science COVID-19 challenge hosted by Kaggle at this link. This work was partially inspired by it.
4. Methods
4.a. Overview
From a high-level point of view, the study can be divided into two main phases: dataset building and data analysis. From this point on, a single geo-area, technically called NUTS3, will be denoted by l for brevity. As pointed out by Eurostat, NUTS3 areas are defined as geographical areas with a population between 150,000 and 800,000 people (provinces, departments and/or small regions).
Dataset building consists of three sub-phases: 1) Features Computation, intended as the extraction of the features table containing the different descriptors for a given l. We consider around 100 features, which can be grouped into four classes: demographic, authority measures, environmental and causes of death. 2) Target Computation, intended as the collection and processing of the COVID-19 targets [y1, y2, y3] for each l. As described in the summary, these data have been taken from multiple sources related to the national health authorities of each state. 3) Cross-reference, intended as the joining of the two tables generated in the previous two phases to construct the final dataset. As better explained in section 4.b, this is done to cope with the non-uniform nomenclature of each NUTS3 among the different sources of our data. Once the dataset is completed, each l is described by a record as represented in Fig. 1 below. Such a record includes the list of features [x1 = 'Pollution Level', x2, ..., xN] and the list of targets [y1, y2, y3]. The features are better described in section 4.b, while the COVID-19 targets are respectively: y1 = "number of positives per 100,000 inhabitants per day", y2 = "number of deaths per 100,000 inhabitants per day" and y3 = "death rate", as pointed out in the summary. These indicators have been chosen for their capacity to explain different aspects of the COVID-19 outbreak over a certain region l. The parameter y1 explains how hard a certain geo-area has been hit by COVID-19 and how rapidly the number of positives has grown over time. The parameter y2 explains instead the effect of COVID-19 on the population in terms of number of deaths, again considering its development over time.
The target y3 is the ratio between the absolute number of deaths and the absolute number of positives; it can be useful for highlighting which factors could positively or negatively affect the survival of a population, holding the number of positives constant. As stated above, none of these indexes takes into account the number of tests, as should be done in an ideal scenario where test data are available at NUTS3 level. In this respect, y2 is the parameter least affected by this problem, since it can be assumed that the majority of people who died were tested if they presented COVID-19 symptoms. The European nations considered are: Italy, UK, Germany, Sweden, Greece, Denmark, Finland, Austria, Croatia and Poland.
Fig. 1. An example of how a NUTS3 (l) is identified inside the records.
Data Analysis: Once the dataset has been collected and properly structured, the data are analyzed by calculating the correlation between each single feature x_n and each single target y_m. In this study, this is done simply by calculating two different coefficients, the Pearson coefficient and the Kendall coefficient (with the related p-values). Based on the values of these coefficients, we then assess the quality of the correlation. This procedure has been applied to different subsets of the European dataset, specifically: the Whole dataset (all the states mentioned above), the Italy dataset, the Germany dataset, the Sweden dataset, the Italy + Germany + Sweden dataset (from now on indicated as Subset) and, finally, the Italian Regions dataset (better explained in 4.c).
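As a minimal sketch of the coefficient computation described above, the following Python snippet computes the Pearson and Kendall coefficients for one feature and one target. The function names and the sample values are hypothetical, and the tau-b variant of the Kendall coefficient is an assumption, since the text does not say which variant was used. P-values are left out for brevity; a statistics library would normally supply them.

```python
from itertools import combinations


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def kendall_tau_b(x, y):
    """Kendall rank correlation, tau-b variant (corrects for ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / ((untied + ties_x) * (untied + ties_y)) ** 0.5


# Hypothetical values of one feature and one target over five areas l.
feature = [1, 2, 3, 4, 5]
target = [2, 1, 4, 3, 5]
print(pearson_r(feature, target), kendall_tau_b(feature, target))
```

In practice, SciPy's `scipy.stats.pearsonr` and `scipy.stats.kendalltau` return both the coefficient and its p-value, which is the pair of quantities the analysis relies on.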
4.b. Dataset Building (detailed)
Features Computation: Most of the features are taken from databases provided by Eurostat at the following link; where the data source is different, it is specified below. The demographic features class includes "age group distributions", intended as the portion of the population in the following age groups: 0 - 14 (1), 15 - 29 (2), 30 - 49 (3), 50 - 64 (4), 65 - 84 (5), 85+ (6), "gender distribution" (7) (% male and % female), "number of family kernels with n members", where n is 1 (8), 2 (9), 3 (10), 4 (11), 4+ (12), "total population" (13), "population density" (14) and "number of deaths in 2018" (15). The authorities responses features class includes a set of parameters related to the government responses adopted over the period 31/01/2020 - 31/03/2020. To compute these parameters we use the Coronavirus Government Response Tracker produced by Oxford University, where a "response-score" is assigned to each nation on a daily basis, taking into account the different restrictions each government has adopted (here). The first element of this class is the "lock-time-rate" (16), intended as the ratio between the "lock down time" and the "time of COVID-19 presence" in l. Specifically, "lock down time" is the time elapsed between the date on which the "response-score" first became greater than 0 (start of the restrictions) and the date on which the COVID-19 target data were measured in l. Similarly, "time of COVID-19 presence" is the time elapsed between the date of the first COVID-19 case and the date on which the COVID-19 data in l were measured. In this sense, "lock-time-rate" represents the fraction of time in which that specific l has been subjected to COVID-19 in a controlled environment; it can be greater than 1 when the restrictions started before the first case. The other two features are "lock-score-sum-normalized" (17) and "lock-score-mean-normalized" (18). The former is obtained by multiplying the "lock-time-rate" by the sum of the "response-scores" over the period cited above.
Similarly, "lock-score-mean-normalized" is obtained by multiplying the "lock-time-rate" by the mean of the "response-scores". The authorities response class also includes the number of tests performed on the population. Specifically, we take into account the "number of tests done per 100,000 inhabitants" (19) and the "number of people tested per 100,000 inhabitants" (20). These features are taken from the database here and are available only at national level; they nevertheless remain valuable features to consider. Finally, in this authorities response class, we decided to include the "number of intensive care beds per 100,000 inhabitants" (21), taken as an indicator of the development of the healthcare system in that specific l. The environmental features class comprises a set of features describing the environmental conditions in l. They are: the "average year temperature" (22), intended as the mean temperature measured over 2020 (here), the "average year humidity" (23), intended as the mean humidity measured over 2020 (here), and the "air pollution" (24), intended as the concentration of the pollutant PM10 per cubic metre of air measured in l (here). The “Causes of death for pathology X” features class is taken from the EUROSTAT dataset (COD, here). Each single feature represents the number of people who died from a given pathology X in 2018; as can be seen at the link above, around 80 pathologies have been considered. The data refer to the underlying cause of death reported on the medical certificate of cause of death, in accordance with the ICD-10 definition, i.e. "the disease or injury which initiated the train of morbid events leading directly to death, or the circumstances of the accident or violence which produced the fatal injury". These data have been considered valid indicators of the health state of the population, whose specific genetics can influence the probability of dying from a specific pathology.
We used these data under the assumption that, were it not for the COVID-19 pandemic, there would be no statistical difference in the causes of death in 2020, since the population would have been largely the same.
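The three restriction-score features defined above ("lock-time-rate", "lock-score-sum-normalized" and "lock-score-mean-normalized") can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dates and the response-score values are hypothetical, and time is measured in whole days.

```python
from datetime import date


def lock_time_rate(measured_on, first_restriction_on, first_case_on):
    """Ratio between "lock down time" and "time of COVID-19 presence" (in days)."""
    lock_down_time = (measured_on - first_restriction_on).days
    covid_presence_time = (measured_on - first_case_on).days
    return lock_down_time / covid_presence_time


def lock_scores(rate, response_scores):
    """Return (lock-score-sum-normalized, lock-score-mean-normalized)."""
    total = sum(response_scores)
    mean = total / len(response_scores)
    return rate * total, rate * mean


# Hypothetical area: data measured on 31/03/2020, first case on 11/03/2020.
measured = date(2020, 3, 31)
first_case = date(2020, 3, 11)

# Restrictions started after the first case: rate below 1.
late = lock_time_rate(measured, date(2020, 3, 21), first_case)
# Restrictions started before the first case: rate above 1.
early = lock_time_rate(measured, date(2020, 3, 1), first_case)

# Hypothetical daily response-scores for the nation the area belongs to.
print(late, early, lock_scores(late, [0, 20, 40]))
```

The second call shows why the rate can exceed 1: the lock down time is then longer than the time of COVID-19 presence.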
Targets Computation: The table of targets has been built by merging different datasets provided by the individual national authorities. Here you can find the list of the official web sites we used to get the data. As can be seen, in almost all cases the sites report the total (cumulative) absolute number of positives and the total number of deaths. Such raw data cannot be used directly for this correlation study because they do not take into account the number of people living in the specific geo-area and the amount of time that COVID-19 has been present. For this reason we normalize the targets: we scale each count to 100,000 inhabitants and then divide it by the time COVID-19 has been present in that specific area (t_COVID19), i.e. y = (count / population) × 100,000 / t_COVID19.
This time is obtained by subtracting the date on which the first case of COVID-19 was recorded in that specific area (t_start) from the date on which the data (number of positives and/or number of deaths) were measured. As for y3 (death rate), it is obtained simply by dividing the absolute number of deaths in l by the absolute number of positives in l.
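A minimal sketch of the target normalization follows. The function names and the sample figures are hypothetical, t_COVID19 is assumed to be expressed in days, and y3 is taken here as deaths divided by positives, as the name "death rate" suggests.

```python
from datetime import date


def normalized_target(cumulative_count, population, measured_on, t_start):
    """Cumulative count scaled to 100,000 inhabitants and divided by t_COVID19 (days)."""
    t_covid19 = (measured_on - t_start).days
    return cumulative_count / population * 100_000 / t_covid19


def death_rate(deaths, positives):
    """y3: assumed here to be absolute deaths over absolute positives."""
    return deaths / positives


# Hypothetical area l: 400,000 inhabitants, first case recorded on 01/03/2020,
# data measured on 31/03/2020 (so t_COVID19 is 30 days).
population = 400_000
t_start = date(2020, 3, 1)
measured_on = date(2020, 3, 31)
positives, deaths = 1200, 60

y1 = normalized_target(positives, population, measured_on, t_start)
y2 = normalized_target(deaths, population, measured_on, t_start)
y3 = death_rate(deaths, positives)
print(y1, y2, y3)
```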
Cross-reference: One of the main problems is that the data contained in the two tables are taken from different sources, without uniform names and therefore without uniform keys identifying each l. This non-uniformity implies that the same NUTS3/region (l) can have different names in the features table and in the target table. This problem has been solved by creating an association map in which each record in the target table is connected to one and only one record of the features table. This has been done through the Google Maps API, which returns the longitude and latitude of a specific place (the coordinates of its center) given the name of a location. Once this calculation has been done for each l of the two tables, each record in the target table is connected to the nearest record in the features table (nearest neighbor approach).
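The nearest neighbor association can be sketched as below. The geocoding step is not reproduced: the coordinates are assumed to have been obtained already, and the ones shown are approximate, illustrative values. The use of the haversine great-circle distance and all function names are assumptions, since the text does not state which distance was used.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def build_association_map(target_coords, feature_coords):
    """Connect each target record to the nearest record of the features table."""
    return {
        target_name: min(
            feature_coords,
            key=lambda name: haversine_km(target_pos, feature_coords[name]),
        )
        for target_name, target_pos in target_coords.items()
    }


# Illustrative coordinates only: the same areas appear under different names.
target_coords = {"Milan": (45.4642, 9.1900), "Rome": (41.9028, 12.4964)}
feature_coords = {
    "Milano": (45.4668, 9.1905),
    "Roma": (41.8931, 12.4828),
    "Torino": (45.0703, 7.6869),
}
print(build_association_map(target_coords, feature_coords))
```

Each target record ends up with exactly one feature record, although nothing in this sketch prevents two target records from mapping to the same feature record; that case would need a manual check.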
4.c. Data Analysis (detailed)
To find possible correlations between features and targets, we based our analysis on two methods: the Kendall coefficient and the Pearson coefficient. Given two variables X and Y, the Kendall rank correlation coefficient τ will be high when the observations of the two variables have a similar rank, while it will be low when the observations have a dissimilar rank. The Pearson correlation coefficient, instead, measures the linear correlation between X and Y and takes values in the interval [-1,1], with 1 meaning total positive linear correlation, -1 meaning total negative linear correlation and 0 meaning no linear correlation. In both cases, the p-value, i.e. the probability of obtaining results at least as extreme as those actually observed if there were no true correlation, is the factor that validates the coefficient it is related to. The potentially relevant results were those in which the p-value was under a standard threshold (0.05) and the absolute value of the coefficient was over another threshold (0.25). As stated above, we conducted the same study on different datasets. The first analysis is called "European Analysis" and is performed on the Whole dataset (created by merging the data related to the nations cited above) and on some sub-parts of it: the Italian dataset, the German dataset and the Sweden dataset. In this analysis the COVID-19 data are sampled at a single date within the temporal range between 31/03/2020 and 16/04/2020. The second analysis is called "Italian Region Analysis", where we applied the same procedure to the data related to the Italian regions. This dataset has the same characteristics described above, although in this case we have a historical record of values for the COVID-19 targets, covering the whole period between 25/02/2020 and 13/06/2020. By observing whether or not the correlations between the features and the targets are maintained over time, we reinforce or dismiss the results found in the "European Analysis".
For completeness, on most of the datasets we conducted the analysis both with and without scaling the features, where scaling brings all the descriptors to the same scale. The analysis with scaling is collected in folders with the “_normalized” suffix, while the analysis without scaling is collected in the other ones (those without that suffix). Scaling has been performed consistently with the type of feature, as better explained in the readme.
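The selection rule for potentially relevant results (p-value under 0.05 and absolute coefficient over 0.25) can be written as a small filter. This is a sketch: the function name, the placeholder feature names and the numbers are hypothetical and are not results of the study.

```python
def is_potentially_relevant(coefficient, p_value, p_max=0.05, coef_min=0.25):
    """Keep a result only if it is both statistically significant and strong enough."""
    return p_value < p_max and abs(coefficient) > coef_min


# Placeholder (coefficient, p-value) pairs for three hypothetical features.
results = {
    "feature_a": (-0.40, 0.001),  # significant and strong: kept
    "feature_b": (0.12, 0.010),   # significant but weak: discarded
    "feature_c": (0.31, 0.200),   # strong but not significant: discarded
}

relevant = {
    name: values
    for name, values in results.items()
    if is_potentially_relevant(*values)
}
print(relevant)
```

Taking the absolute value of the coefficient keeps strong negative correlations as well as strong positive ones.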
5. Results
In the following we present the main outcomes of the data analysis, together with some interpretations. At this phase of the research, we limit our investigation to the correlation between a single feature and a single target, although a complete analysis should include at least two further steps: an analysis of the correlation among the features themselves, and an analysis of the multiple correlation between the features and the targets (multiple regression). We leave this as future work. The subject of the study is the geo-statistical area (NUTS2 or NUTS3). For this reason the correlation between a single feature x and a single target y has to be interpreted correctly: a high correlation between x (for example, a percentage of the population) and y (for example, the death rate) cannot be read as “x is a risk factor for y”, but as “at regional level, a higher/lower x seems to be statistically correlated with a higher/lower regional y”. This statement is the only consideration that can be made at this stage of the analysis using our dataset. Further considerations and speculations can be made on the basis of domain knowledge (how the correlation can reasonably be explained), but they can only be confirmed through future targeted investigations. All the plots and graphs can be found at the following link in the subfolder “Results”. In each subfolder (corresponding to the different parts of the dataset considered) it is possible to find different types of plots and graphs, such as the tables of the different coefficients calculated for each feature (and for each target), the geographical distribution of the values of the different features and targets (Fig. 2) and 3D scatter plots**. Further details are given in the ReadMe document in the folder. Below we summarize the most important outcomes; the features not presented were discarded because they did not show any statistically meaningful data or were just subclasses.
We preferred to focus attention on major class features.
Target names:
- y1: Total Positive per 100’000 inhabitants per day (also called positive_Normalized_Time)
- y2: Total Deaths per 100’000 inhabitants per day (also called died_Normalized_Time)
- y3: Death Rate (died positive rate)
**Together with the code, all such plots are open and available to the scientific community; whether you are a scientist or a freelance researcher, we would appreciate it if you let us know that you are using these data.**
Demographical features
Age distributions: With respect to the age groups, the following results can be highlighted: the 0-14, 50-64 and 85+ groups do not seem to have statistical significance in any of the datasets considered. The 15-29 group shows a negative statistical correlation in the Italian dataset for both y1 and y2, whereas in the Germany dataset the correlation is the opposite. Given such a high variance, we cannot attribute statistical meaning to this feature. The 30-49 group shows a positive statistical correlation for both y1 and y2 in all the datasets. A possible explanation of this outcome is that this age group is the most active of all, even during the lockdown period: a higher percentage of the population in this group means more people moving, traveling and/or commuting. In this case, the positive correlation with y2 can be a direct consequence of the large number of people infected, from which a certain number of deaths could be expected. The 65-84 group shows a negative correlation with y1 in the Germany and Subset datasets, and a positive one in the Italy dataset; a possible explanation of this discrepancy is the higher share of elderly people and the higher median age in Italy compared to the other European countries. The Subset dataset gave unexpected, counterintuitive results regarding age groups: the 0-14 and 30-49 groups showed a positive correlation with y2, while the 50-64 and 65-84 groups showed a negative correlation with y2. Our data did not allow us to produce any likely explanation for this result.
Population Density: Contrary to common sense, density does not seem to have played a key role in the diffusion of the virus. The correlation coefficient is always lower than 0.15 for both y1 and y2, with a high p-value.
Family Kernels: The family kernels features show quite interesting trends, especially in the Italy dataset. The features "number of family kernels with 1 member", "number of family kernels with 2 members" and "number of family kernels with 3 members" seem to be positively correlated with both y1 and y2. On the other hand, the "number of family kernels with 4 members" and the "number of family kernels with 5 members" show a negative statistical correlation. Fig. 4 below shows the map of the geographical distribution of the "number of family kernels with 3 members" and the "number of family kernels with 4 members" in Italy. The negative correlation for family kernels with 4 and 5 members can be ascribed to the higher prevalence of this kind of family in the regions that were least affected by the virus diffusion. In fact, the two maps are quite complementary to one another and, moreover, the distribution of the "number of family kernels with 3 members" is very similar to that of y1 or y2. This fact is quite striking, and further investigations should be carried out in this direction. Some hypotheses could be formulated in this regard: 1) There is indeed a cause-effect relationship between these variables and the targets, motivated by the fact that family kernels composed of many people are more efficient social structures in a lockdown scenario, both from the point of view of family care and for minimizing the number of movements. 2) There is no direct cause-effect relationship between the features and the targets, but there may be some hidden variables correlated with both of them, as for example the latitude. This has to be verified.
Authorities responses features
Restriction scores: This subclass is represented by the features "lock-time-rate", "lock-score-sum-normalized" and "lock-score-mean-normalized". Looking at the coefficients over the different datasets, "lock-time-rate" is clearly the most important variable of the three (Fig. 5). Concretely, "lock-time-rate" has a negative correlation with all the targets in the Italian dataset, with both the Kendall and Pearson coefficients lying between -0.37 and -0.28. In the Swedish dataset, a negative correlation of -0.54 has been found between the “lock-time-rate” and the “died positive rate”. A similar but weaker correlation between the same indexes has also been found in the European dataset (-0.19). These data could be explained by the fact that the restrictive measures adopted by the governments have been more useful in containing the effects of the virus than in containing its spread. Moreover, it seems that promptness in adopting containment measures at an early stage of the virus spread ("lock-time-rate") is considerably more important than the adoption of very restrictive measures.
People tested per 100’000 inhabitants: This feature was considered in order to understand how testing the population may have affected the virus diffusion as a containment policy. As already noted in this article, such data are unfortunately available only at national level, making it impossible to integrate them into the target computation. Regarding the target y1, we found a discrepancy between the Whole and the Subset datasets. The Whole dataset confirms the trivial correlation between the number of tests and the positive cases: a higher number of tests reveals a higher number of infected people. The Subset dataset, instead, does not seem to show the same trend, but we did not consider it because of its high p-value. The data for the y2 target show a negative correlation, thought to be a direct consequence of the higher number of tests: more tests lead to a better screening of the population and, thus, finding positive people in the early stages of the disease may have increased their chances of surviving.
Number of intensive care beds per 100’000 inhabitants: Data on the number of available beds show a positive correlation in Italy and Germany, and a negative one in the Subset and the Whole datasets, for both y1 and y2. The positive correlation seems to contradict what common sense would suggest: more beds should mean a reduced likelihood of succumbing to a disease, especially in countries with an advanced health care system like Italy and Germany. The likely reason for these results is the wide spread COVID-19 had in Italy and Germany compared to smaller European countries, where the results are in accordance with our expectations. So, this finding is probably more a consequence of the virus diffusion, which seems to be confirmed by adding y3 to our considerations (strong positive correlation in Italy, and a negative one in the Whole and the Subset datasets).
Environmental features
The features analyzed were: mean temperature (°C), mean humidity (% rel.) and air pollution (μg/m3). These features are particularly interesting because they can be assumed to be quite uniform over a specific l and are therefore less subject to the statistical fallacies pointed out in section 2.
Air Pollution: Looking at the data from the analysis on y1, air pollution showed an impact only in the Italian dataset. To better understand this, it is helpful to look at the map of the air pollution recorded in Italy and the diffusion of the virus in the peninsula (Fig. 6). High air pollution levels are found in the Po plain, and the COVID-19 diffusion map can easily be superimposed onto the pollution one (Fig. 6: on the left, the geographical distribution of the feature air_pollution; on the right, the diffusion of the virus in Italy as positives per 100’000 inhabitants on 21/06/2020 [7]). As for the mortality, a strong positive correlation persists in the Italian data and also emerges in the other datasets. As stated by Robert L. Johnson et al. [3], air pollution may contribute to increased mortality when lung diseases are also present. Our data suggest that mortality is higher where air pollution is higher, as in northern Italy, thus indicating that poor air quality may affect the ability of people to recover from COVID-19.
Humidity: The most important correlation values were found for the diffusion. All three datasets gave us negative correlation values, thus suggesting a lower degree of virus diffusion where the humidity levels are higher. Indeed, as found by Jingyuan Wang et al. [4], even a small increase in humidity levels can markedly reduce the infective index of the virus, with a reduction of the R value of about 0.008 per percent of humidity increase.
Temperature: The temperature feature showed a correlation that is neither strong nor well defined but, interestingly, all the correlations we found are negative. This trend suggests that there is a higher diffusion when the temperature is lower. Two main reasons may explain this: the temperature sensitivity of the virus diffusion and human social behavior. The first reason is justified by the fact that temperature plays a key role in the infective index of the virus. As found by Jingyuan Wang et al. [4], COVID-19, like many other viruses, is susceptible to temperature increases, leading to a diminished diffusion at higher temperatures. As for the second reason, in colder periods we tend to stay in closed environments rather than outside, increasing the chances of being in contact with many other people. Social distancing becomes much more difficult, thus giving the virus a higher chance to spread.
Causes of Death features
In our study, we investigated possible correlations between the already existing Causes of Death in the population and COVID-19 targets. As a general consideration, it is possible to see that in all the different datasets, data on pneumonia, HIV, dementia and neoplasms showed positive correlation both for y1 and y2, with significant p-values. As it was stated above, these findings are not concerned with the single person constituent of the groups we analyzed. Pneumonia: This was an expected result, due to pneumonia being a common respiratory disease, known to promote other pathological conditions (especially viral pneumonia, due to the resultant immunodepression)[8]. As a consideration, a population in which pneumonia is a high impacting cause of death, is more likely to get infected and die because of COVID-19. It would be interesting to be able to determine if there is a significant difference between different kinds of pneumonia (e.g. bacterial or viral) in the spread and/or mortality from COVID-19, but the data we evaluated didn’t mention any pneumonia subcategory. HIV: HIV targets mostly a person’s CD4+ lymphocytes[9], causing a progressive diminished activity of the immune system, leading to opportunistic infections and, ultimately, AIDS[10]. Thus, for the same considerations done above, the positive correlation was expected. Though, in the Whole dataset, we didn’t find confirmation regarding a correlation between HIV and y2. Dementia: With our study, we also found a positive correlation, even though less strong than other features of this section, between dementia and contagions and death from COVID-19 (y1,y2) . After a search in the literature, we didn’t find any likely explanation to this finding, nor our data allowed us to provide one. 
Neoplasms: This result can be explained by the increased susceptibility to opportunistic infections, and the general frailty, of people affected by any neoplasm, which depend on numerous factors such as the malignancy itself, the treatment used, age and other concurrent conditions [11]. This finding is particularly important given the high prevalence of oncologic pathologies in populations.
Diabetes mellitus: Our study showed strong negative correlations between diabetes mellitus and COVID-19, with very low p-values, for both the spread and the mortality. This was an unexpected result, since diabetes mellitus and its associated conditions are known to promote infections [12]. A possible explanation was found in the literature. As stated in the findings of Swati Sharma et al. [13], metformin is a first-line treatment for type 2 diabetes and appears to activate AMPK. SARS-CoV-2 uses angiotensin-converting enzyme 2 (ACE2) as its receptor to enter human cells. AMPK has been shown to increase the expression of ACE2 as well as to phosphorylate it. The addition of a phosphate group could cause conformational changes in the ACE2 receptor which, in turn, could decrease its binding with SARS-CoV-2 through steric hindrance. Moreover, once the virus is inside, ACE2 receptors are down-regulated, which can give rise to lethal cardio-pulmonary complications; upregulating ACE2 could avert them. Hence, metformin would not only hinder the entry of SARS-CoV-2 as described above, but also prevent these harmful events, by activating ACE2 through AMPK signalling.
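The coefficients discussed in this section compare one feature with one target across areas. As a minimal sketch of the two measures named in the summary (Pearson and Kendall), the snippet below computes both in plain Python. The numbers are hypothetical and are not taken from the dataset; the Kendall variant shown is tau-a, which is an assumption, since the article does not say which variant was used. No p-value is computed here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical values for five areas (illustration only).
diabetes_share = [1.0, 2.0, 3.0, 4.0, 5.0]   # feature
y1_positives = [9.0, 7.0, 8.0, 4.0, 2.0]     # target y1

r = pearson(diabetes_share, y1_positives)
tau = kendall_tau(diabetes_share, y1_positives)
print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
```

Pearson measures linear association, while Kendall only looks at the ordering of the pairs, so the two can disagree when the relationship is monotonic but not linear.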
Italian Regions Study
As previously anticipated, we also carried out a study focused only on the Italian regions dataset, for which the COVID-19 data (targets) were available with a daily cadence. Specifically, this dataset covers the period from 25th February 2020 to 13th June 2020. The main purpose was to analyze possible changes over time, which cannot be tested through the European Dataset. On the other hand, this dataset includes quite a small number of samples compared to the other one (there are only 21 Italian regions, counting Trentino Alto Adige as two different “regions”, Bolzano and Trento), which makes the correlations less statistically relevant. For these reasons, in the table below we report only the results that can be considered statistically significant (coefficients represent the average over the whole period).
Among the age groups, only the 15 - 29 years old group gave significant results: we found a negative correlation with both the positives and the deaths. This outcome agrees with what we found previously in the general study, and it could suggest that this part of the population is less affected by the virus. The other age groups did not show any statistically relevant result, as in the general study.
In agreement with the analysis done on the European Dataset, the “Number of intensive care beds per 100’000 inhabitants” feature shows a positive correlation with both y1 and y2, in contrast with what common sense would suggest: more beds usually means more medical care. The Italian scenario is particularly interesting because the virus spread more in the Northern regions, where the number of available beds per 100’000 inhabitants is on average greater. This finding is therefore probably a consequence of where the virus spread, as shown by the positive correlation with y1.
Temperature and number of positives are negatively correlated, meaning that higher temperatures seem to imply a slower diffusion of the virus, as stated above. It is important to note that this influence is more evident in the positives than in the number of deaths, highlighting a potential correlation with the diffusion of the virus rather than with its mortality (as expected).
Pneumonia and HIV both show a positive correlation, of essentially the same strength for the positives and for the deaths targets. As for diabetes mellitus, the Italian scenario confirms the trend observed in the general study, with strong negative correlations.
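The phrase "coefficients represent the average over the whole period" can be read as follows: for each day, one coefficient is computed across the regions, and the daily coefficients are then averaged. The sketch below illustrates that reading; it is an assumption about the procedure, not the authors' code. The function names and all values are hypothetical, and the feature is treated as constant over time for simplicity. Days on which the target is identical in every region (for example, all zeros early in the outbreak) have no defined correlation and are skipped.

```python
from math import sqrt


def pearson(x, y):
    """Return Pearson's r, or None when either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def mean_daily_correlation(feature, daily_target):
    """Average the per-day, across-region correlation over the period.

    feature: one value per region.
    daily_target: one list per day, each with one value per region.
    """
    coefficients = []
    for day_values in daily_target:
        r = pearson(feature, day_values)
        if r is not None:  # skip days with an undefined coefficient
            coefficients.append(r)
    if not coefficients:
        raise ValueError("no day with a defined correlation")
    return sum(coefficients) / len(coefficients)


# Hypothetical data: 4 regions observed on 3 days (illustration only).
mean_temperature = [8.0, 11.0, 14.0, 17.0]
y1_by_day = [
    [5.0, 4.0, 3.0, 1.0],
    [6.0, 6.5, 3.0, 2.0],
    [7.0, 5.0, 4.0, 2.5],
]

avg_r = mean_daily_correlation(mean_temperature, y1_by_day)
print(f"average Pearson r over the period: {avg_r:.3f}")
```

Keeping the list of daily coefficients, instead of only their mean, would also allow the changes over time mentioned above to be plotted.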
7. Conclusions
In this study a new open-source dataset for the study of COVID-19 has been published. In this dataset, each geo-statistical area (NUTS3) is described by a set of features (divided into different classes) and by a list of three targets used to characterize the COVID-19 outbreak in the region. The correlation between each feature and the three targets has been analyzed considering the Whole dataset (Italy, UK, Germany*, Sweden, Greece, Denmark, Finland, Austria, Croatia, Poland) and different subsets of it. The results have been collected and are now available to the scientific community at this link.
Different age groups are often affected to different degrees by a given disease. In the COVID-19 case, only the 15-29 age group showed some negative correlation, presumably because this part of the population seems to be less hit by the virus, at least in the period under analysis. The number of family units with 1, 2 and 3 members appears to be positively correlated in both datasets. On the other hand, the family units with 4 and 5 members show a negative correlation. In this respect, there is probably no direct cause-effect relationship between these variables and the targets; nonetheless, there may be a hidden variable correlated with both of them. The results for population density did not show any indication of correlation, a result that is difficult to interpret because it goes against what one would expect in a pandemic scenario like this.
As noted in the results, the data on the authorities’ responses seem to underline the importance of adopting restrictions immediately, rather than showing a correlation with how restrictive and intensive those restrictions are. Nonetheless, it is worth noting that this study only compares the different types of measures that governments actually adopted. It is therefore not possible to state anything about what the effects would have been had such measures not been taken.
With respect to the environmental class, several past studies have demonstrated a positive association between air pollution and pulmonary diseases. The Italian dataset seems to confirm this assumption and, even though the correlation in the other datasets is less evident, a certain positive trend is confirmed. Higher humidity levels can markedly reduce the infective index. Thus, in those countries and in the periods of the year with higher humidity values, the diffusion of the virus should indeed be lower. Colder temperatures make the diffusion of the virus easier, both because the virus is temperature sensitive and because of human behavior. Our findings confirm that trend, even though the correlation coefficients are not very high.
Regarding the Causes of Death features class, many high-impact diseases correlate positively with infection by and mortality from COVID-19. This holds for the majority of the Causes of Death features, with the notable exception of diabetes mellitus. In this respect, it is fair to mention that, among the drugs used to treat this condition, metformin seems to show a protective effect against infection and against the deleterious effects of COVID-19.
As pointed out at the beginning, the analysis presented here can be seen as a first step towards answering the questions raised in the summary. Each of the hypotheses made in the results is just a possible interpretation of high-level correlations, whose reliability has to be confirmed through more specific and advanced investigations. Moreover, we are well aware that we have exploited only a small part of the potential of our dataset. Future work should first investigate the correlations among the features themselves, in order to spot dependencies among them and thereby reduce the set of features to the principal ones. Secondly, it would be possible to calculate the multiple correlations between the features and the targets using Multiple Regression Analysis, combined with a significance test such as ANOVA. Other “constraint techniques” (regularized regressions) such as Lasso or Ridge Regression could be used as well. In any case, we hope that providing the dataset as open source will stimulate future investigations and studies by other researchers.
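The first future-work step mentioned above, spotting dependencies among the features and reducing their number, could be sketched as a greedy filter on pairwise Pearson correlations. This is only one possible approach under stated assumptions, not a method used in this study: the function name, the 0.9 threshold and the feature values are all hypothetical, and the filter is a simpler stand-in for a proper principal component analysis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def prune_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature
    is below the threshold (features are visited in insertion order)."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns for five areas (illustration only).
features = {
    "pm10": [20.0, 25.0, 30.0, 35.0, 40.0],
    "pm2_5": [12.0, 15.5, 18.0, 21.5, 24.0],    # almost proportional to pm10
    "humidity": [70.0, 55.0, 80.0, 60.0, 65.0],
}

selected = prune_correlated(features)
print(selected)
```

The result depends on the order in which features are visited, so in practice one would decide beforehand which feature of a correlated pair is preferable to keep. The reduced feature set could then be passed to a multiple or regularized regression.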
DISCLAIMER: All the considerations and the data reported in this article have not been reviewed. This research has been done in our free time so we cannot exclude the presence of some errors.
8. Literature
[1] Samrat K Dey - Analyzing the epidemiological outbreak of COVID-19: A visual exploratory data analysis approach (https://doi.org/10.1002/jmv.25743)
[2] E. Bontempi - First data analysis about possible COVID-19 virus airborne diffusion due to air particulate matter (PM): The case of Lombardy (Italy) (https://dx.doi.org/10.1016%2Fj.envres.2020.109639)
[3] Robert L. Johnson - Relative Effects of Air Pollution on Lungs and Heart (https://doi.org/10.1161/01.CIR.0000110643.19575.79)
[4] Jingyuan Wang - High Temperature and High Humidity Reduce the Transmission of COVID-19 (https://dx.doi.org/10.2139/ssrn.3551767)
[5] Swati Sharma - Metformin in COVID-19: A possible role beyond diabetes (https://dx.doi.org/10.1016%2Fj.diabres.2020.108183)
[6] S. Knapp - Diabetes and Infection: Is there a link? A mini-review (https://doi.org/10.1159/000345107)
[7] https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Italy
[8] Li Xiang - Immunodepression induced by influenza A virus (H1N1) in lymphoid organs functions as a pathogenic mechanism (https://doi.org/10.1111/1440-1681.13358)
[9] David C. Chan - HIV Entry and Its Inhibition (https://doi.org/10.1016/S0092-8674(00)81430-0)
[10] E. N. Vergis - Natural history of HIV-1 infection (https://doi.org/10.1016/S0891-5520(05)70135-5)
[11] L. Cooley - Consensus guidelines for diagnosis, prophylaxis and management of Pneumocystis pneumonia in patients with hematological and solid malignancies (https://doi.org/10.1111/imj.12599)
[12] M. P. Moutschen - Impaired immune responses in diabetes mellitus: analysis of the factors and mechanisms involved. Relevance to the increased susceptibility of diabetic patients to specific infections
[13] Swati Sharma - Metformin in COVID-19: A possible role beyond diabetes (https://doi.org/10.1016/j.diabres.2020.108183)
[15] Eurostat data - https://ec.europa.eu/eurostat/web/main/home
[16] Center for Systems Science and Engineering (CSSE) at Johns Hopkins University - https://coronavirus.jhu.edu
[17] ECDC Covid data - https://www.ecdc.europa.eu/en/covid-19/data
[18] Modelling the CoVID-19 outbreak in Italy, https://www.dhirubhai.net/pulse/modelling-covid-19-outbreak-italy-ettore-mariotti/
Who we are:
- Dr. Matteo Ciprian (@cip_mat). Linkedin: https://www.dhirubhai.net/in/matteo-ciprian-ba30ab122/
- Dr. Davide Rigon (@davi_rigon). Linkedin: https://www.dhirubhai.net/in/davide-rigon-08072719b/
- Dr. Jacopo Marullo Muscianisi. Linkedin: https://www.dhirubhai.net/in/jacopo-marullo-muscianisi-2322bb178/
- Dr. Enrico Virgilio (@e_virgi). Linkedin: https://www.dhirubhai.net/in/enrico-virgilio-9462901b6/
- Send us an e-mail: [email protected]