COVID-19: Important lessons in Data Management and Data Science
Paul Jones
Strategic Data & AI Leader @ Baringa | CDO | Consultant | Author | Speaker
What Data Management and Analytics lessons can we learn from the way information is being reported about the Covid-19 pandemic?
The Covid-19 pandemic has turned the world upside down. As the grim reality of this terrible disease started to become apparent, the media jumped into action and started reporting “the numbers”: numbers of cases, numbers of deaths, numbers recovered, numbers of PPE items delivered, numbers of care homes affected… the list goes on.
Along with these numbers came the comparisons. Comparisons between countries, between states and regions, between demographic groups and so on. The increases in deaths were terrifying and the comparisons with other countries did nothing but add to the fear, especially when it started to look like the country “we’re” living in is experiencing more deaths, or when you realise that you or a loved one is in one of the “higher risk” groups.
Yet, along with these reports, there were also the voices of challenge and doubt. The ways in which different countries performed testing and the ways in which they counted cases and deaths as being Covid-19 related or otherwise, were varied and inconsistent, resulting in messages that were confused at best, or at worst, in some cases, potentially downright misleading.
There is no doubt that this pandemic has been nothing short of horrific and the loss of human life is devastating. However, from out of this tragedy, there’s also opportunity for us to learn for the future. How do we track and respond to global events like this, in a way that is clear and enables proportionate and targeted action?
I’ve seen various posts and articles from ex-colleagues on LinkedIn and many of them make really important points about how we manage and interpret the results of large-scale data analytics, especially when it’s used for something that’s so critical for the protection of human lives and for educating and informing the public on risks and associated actions.
In this post, I’ve summarised some of my observations on this. There are lessons that can be learned for epidemiologic reporting purposes, plus many of these lessons are equally applicable to other scenarios including various business contexts. I’ve deliberately used or referred to data management terminology rather than dumbing things down too much, in order to help illustrate why “textbook” approaches, which can be easy to dismiss as theoretical, are in fact absolutely crucial in real-world settings, if applied in the right way.
From data capture through to analytics
Before I start, I’m going to introduce the basic data management steps and concepts that I’ll be using to structure my observations. These steps are common to any data management activity. Sometimes there are slight variations in terminology to describe the same things, but no matter what terminology you use, these core concepts are the same in any activity that falls under umbrella terms such as “data management”, “data science”, “statistical analysis” and the like:
- Data capture approach (data sourcing)
- Data definition standards (metadata)
- Data classification (master data / reference data / categorisation)
- Data limitations reporting (data quality / data validation)
- Reporting (Management Information (MI) / volumes)
- Insights and analysis (data analysis / statistics)
- Making decisions and taking action (including the ability to explain the output of the models)
- Data Governance – making all of the above work!
For the purposes of this post, I’m not commenting on the data privacy considerations related to this process, because that could be an article in its own right! Instead, I’m focusing on the approach related to large-scale, anonymised/aggregated reporting and analytics and am assuming that the privacy concerns have already been taken into account.
1. Data capture approach (data sourcing)
In March 2020, the Director General of the World Health Organization (WHO) made a very clear recommendation about testing people for Covid-19: “We have a simple message for all countries: test, test, test.”
If you’re going to perform any kind of data analysis, you need data that’s consistent and reliable. One really important factor in determining whether or not this is the case, is to look at the way the data is captured.
In later parts of this post I’ll cover more about what data you’re capturing, which is a key aspect of data sourcing, but given the issues with the varying approaches different countries followed to test for Covid-19 cases, let’s start with the steps taken in the acquisition of a “sample population” for analysis.
If two datasets consist of totally different sample populations, it can be difficult to compare them in a meaningful way. This is an important consideration in the choice of a dataset for any kind of statistical analysis. However, given the fast-moving nature of the Covid-19 crisis, numbers very quickly started appearing in the media, with very little commentary about why comparisons between countries were in some cases virtually meaningless, due to the differences in the way that people were being tested.
Scaling this concept down to something closer to home can make it easier to understand without needing to have a background in statistical analysis. Let’s say there are three schools, which have each had a single case of Covid-19 identified within them. To keep it simple, these schools all have exactly the same number of pupils and the distribution of ages and demographics are identical.
The first school decides to test all pupils on the day that the first case is identified. The second tests 50% of pupils, per class, over a period of a week. The third starts a phased testing approach, moving from one class to the next, getting through 10% of pupils per week until they’ve tested everyone.
Would you be surprised if the numbers of identified cases appear to be different, in each school?
In the first school, it may initially look like they have far more cases, because they identify all cases at once; but what about when the infection spreads? In a matter of days there could be many more cases, so will they perform more tests to identify these? It’s great that this school was so fast to create a holistic picture of all cases, but the accurate picture that was established so early on will quickly become less and less useful, unless it’s refreshed with new, up-to-date testing data.
Even if the second school had exactly the same number of cases on day one as the first school, it is likely to initially identify fewer cases because it tested fewer pupils. However, over the week, if the infection spreads quickly, it may identify more cases, to the point where it looks like it’s in a similar position to the first school; when in fact, it could be in a far worse position and won’t know, because it hasn’t tested all of its pupils yet.
Finally, if the third school reports its numbers regularly throughout the weeks, it could initially appear to be in a far better position than the other two schools, when this view is unfortunately just due to the lack of testing, meaning that the sample dataset being used doesn’t provide a realistic picture of the real number of cases.
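The three-school example can be sketched in a few lines of code. All the numbers below are invented for illustration (school size, growth rate), and the model makes the simplifying assumption that the number of cases a school detects scales with the fraction of pupils it has tested so far:

```python
# Hypothetical parameters: identical schools, identical true outbreaks.
PUPILS = 1000        # school size (invented)
DAILY_GROWTH = 1.3   # assumed daily growth factor in true infections


def true_cases(day: int) -> int:
    """True (unobserved) number of infected pupils on a given day."""
    return min(PUPILS, round(DAILY_GROWTH ** day))


def school_1(day: int) -> int:
    """Tests 100% of pupils on day 0, then never retests:
    its reported number is a frozen snapshot."""
    return true_cases(0)


def school_2(day: int) -> int:
    """Tests 50% of pupils, spread evenly over the first week."""
    fraction_tested = 0.5 * min(day, 7) / 7
    return round(true_cases(day) * fraction_tested)


def school_3(day: int) -> int:
    """Tests 10% of pupils per week until everyone is tested."""
    fraction_tested = min(1.0, 0.1 * day / 7)
    return round(true_cases(day) * fraction_tested)


for day in (0, 7, 14):
    print(f"day {day}: true={true_cases(day)}, "
          f"school1={school_1(day)}, school2={school_2(day)}, "
          f"school3={school_3(day)}")
```

Even though the underlying outbreak is identical, the three reported numbers diverge immediately: school 1’s count never moves after day 0, while school 3 looks deceptively healthy simply because it has tested so few pupils.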
So, the way in which data is being sourced, when it’s being captured and how frequently it’s being updated, all have a significant impact on the results of any analysis performed on them, and the greater the difference between data sources, the less useful any comparisons are likely to be.
This is why many comparisons made between Covid-19 infection rates in different countries, given the very different approaches used by each country, were not nearly as meaningful as they have been portrayed. If a country has been performing massively more tests than another country, it’s not surprising if they’ve identified more cases. If another country has performed barely any tests and is seeing similar numbers, that’s likely to be a far more worrying situation, and it’s this kind of basic assessment that’s really important to enable any kind of meaningful conclusions to be drawn from analytical results.
Now we’ve covered the basic approach to acquiring a dataset, we can look at how to make sure the right data is captured as part of that dataset…
2. Data definition standards (metadata)
Let’s consider for a moment how Covid-19 deaths are being reported.
Headline numbers, broken down by countries, have been splashed across mainstream and social media in dramatic fashion: “this country’s now had the most deaths in the world”, “that country’s death rate is increasing faster than any others”, and so on.
If we assume that the challenges with inconsistent testing regimes have been addressed, the next challenge with these kinds of headlines is: how are Covid-19 deaths defined?
Here are some of the possible definitions:
- Location-specific definitions:
- Deaths in hospital
- Deaths “in the community” (what does “in the community” mean?)
- Deaths in care homes
- “Cause of death”-specific definitions:
- Deaths where Covid-19 is the main cause on the death certificate
- Deaths where someone had been tested and confirmed as having Covid-19 when they died (whether it’s the main cause or not)
- Deaths where the deceased had Covid-19 symptoms (whether or not they’ve been tested)
Each country has been using one or more of these definitions, at different times, in some cases switching between them as the weeks progress.
Once again, this makes it totally impossible to compare the numbers meaningfully. A country that counts deaths in all locations (hospital, the community, care homes, other) as well as all causes of deaths related to Covid-19, will be counting a lot of deaths that other countries, which only count deaths in hospitals where patients have tested positive, will totally miss.
This is also very confusing for the public. If a government is publishing numbers, which only include a small subset of the actual total number of deaths, it is misleading and could give people a false sense of security. Likewise, if all deaths with symptoms are reported as the headline number, it could present an unrealistically pessimistic picture, which could result in disproportionate and unnecessary fear amongst the population.
The key to resolving this is very simple: clearly define the data and metrics, before the data is captured and when they’re reported. This “metadata” needs to be agreed and communicated everywhere.
Instead of using the label of “total deaths”, terms like “Confirmed hospital deaths” or “Care home deaths with unconfirmed Covid-19 symptoms” should be used to be clearer about what is being reported. This will also start to make comparisons more meaningful, because it will mean that you know that you’re comparing numbers related to the same thing. This doesn’t overcome the challenges associated with testing, but it does mean that you’re more likely to be comparing apples with apples, as the saying goes.
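One lightweight way to enforce this is to treat the metric definitions themselves as data. The sketch below (metric names and definitions are my own, purely illustrative) refuses to publish any number whose definition hasn’t been agreed first:

```python
# Hypothetical metric registry: every reported number carries an agreed
# definition (metadata) instead of an ambiguous label like "total deaths".
METRIC_DEFINITIONS = {
    "confirmed_hospital_deaths":
        "Deaths in hospital where the deceased tested positive for Covid-19",
    "care_home_deaths_unconfirmed":
        "Deaths in care homes where Covid-19 symptoms were recorded "
        "but no test result is available",
}


def publish(metric: str, value: int) -> str:
    """Refuse to publish a number whose definition hasn't been agreed."""
    if metric not in METRIC_DEFINITIONS:
        raise ValueError(f"No agreed definition for metric '{metric}'")
    return f"{metric} = {value} ({METRIC_DEFINITIONS[metric]})"


print(publish("confirmed_hospital_deaths", 123))
```

The design point is simply that the definition travels with the number: a reader of the published figure can always see exactly what was counted.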
With clear definitions it’ll be possible to start capturing data in a more precise and consistent way. However, how do we know that we’re capturing all of the data we need?...
3. Data classification (master data / reference data / categorisation)
So far we’ve only talked about headline numbers, such as the number of Covid-19 cases and the total number of deaths. However, this on its own doesn’t provide much insight into what’s actually going on, to enable identification of specific risks and opportunities that need addressing.
For example, in order to perform more meaningful analysis, it could be useful to know, for each person tested, their:
- Age
- Sex
- Ethnicity
- Underlying health conditions such as diabetes, cancer or heart conditions
- Possibly other health features such as height, weight, BMI, smoking status, etc
- Where they are at the time of testing
- Date that this information about the individuals was captured
Also, it may be useful to correlate analysis of the cases and their status, with information such as:
- Total population of the area the people are in at the time of testing
- Population density of that area
- Diet of people in that area (especially if you don’t collect this data on the individuals above)
- Average health and fitness of people in that area
- The interventions that have been implemented in the area such as social distancing, wearing masks etc
It doesn’t take long to come up with quite a long list of data classifications that would be useful to derive further meaning from the data. This is where reference data becomes absolutely critical.
For every one of the categories above, a standardised list of allowable values needs to be defined, agreed, and then used consistently for every dataset, for it to be possible to reliably perform analysis across datasets.
This needs to go beyond just the allowable lists of values: the categories need to be accompanied by clear and consistent definitions (metadata), so that they can be interpreted and implemented correctly. For example, what units of measurement will be used to capture weight, if that’s one of the dimensions being measured? Mixing metric and imperial scales is an easy mistake to make, and one which can result in significant inaccuracies in results and conclusions.
The granularity or detail of the lists also needs to be agreed up-front. For example, will location be captured at a country, state/county, town, or street level? The lower the level of detail, the more precise and insightful the analysis that can be performed; but if one dataset is captured at one level and another at a different level, you’ll only be able to perform reliable cross-dataset analysis at the coarsest level common to both. This can frustrate efforts to generate the levels of insight needed.
It can also be frustrating for individuals who want to know information about the area they live in. For example, where I live in the UK, I wasn’t able to obtain information about the number of cases in my town, despite the fact that some cases were reported at a town level in other parts of the country. This is because the standards for capturing and reporting information were not consistently set and enforced across all parts of the health service.
The solution to all of this is the use of consistent reference data, across all datasets.
What does this mean?
In simple terms, the lists of allowable values, for each reporting dimension or category, need to be agreed centrally. These lists must be established as the “master” lists of allowable values and published globally for everyone, across all countries, to use. Then, there needs to be tracking and enforcement of the use of these standards, so that when the results of Covid-19 testing and Covid-19 deaths are recorded, they are recorded against this totally standardised set of reference data values. This will result in datasets that are categorised consistently, so that the categories can be compared easily and meaningfully when it comes to analysis.
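In code, a master reference-data list is just a set of allowable values per dimension, checked before a record is accepted. The field names and values below are invented for illustration, not an actual health-service standard:

```python
# Hypothetical "master" lists of allowable values per reporting dimension.
REFERENCE_DATA = {
    "sex": {"female", "male", "unknown"},
    "location_level": {"country", "state_county", "town"},
    "test_result": {"positive", "negative", "inconclusive"},
}


def validate_record(record: dict) -> list:
    """Return a list of reference-data violations (empty means valid)."""
    errors = []
    for field, allowed in REFERENCE_DATA.items():
        value = record.get(field)
        if value not in allowed:
            errors.append(f"{field}: '{value}' is not an allowed value")
    return errors


record = {"sex": "F", "location_level": "town", "test_result": "positive"}
print(validate_record(record))  # "F" isn't in the master list, so it's flagged
```

Rejecting (or at least flagging) non-standard values like "F" at capture time is far cheaper than the downstream wrangling needed to reconcile "F", "Female", and "female" across dozens of datasets later.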
If this isn’t done, a load of extra data wrangling, cleansing and re-classification will need to be performed before analysis can be conducted, which will be less efficient and open to errors. Hence the situation we’ve had with Covid-19!
Recognising the potential shortcomings in data leads to the next step in our data management process…
4. Data limitations reporting (data quality / data validation)
No matter what’s done to ensure that data is complete, clean and correct, there will always be some anomalies and quality issues, especially when the data is gathered in a distributed and manual way at large scale. This is where data quality management comes in.
During the process of gathering and preparing data to be analysed, the data will need to be assessed against the defined metadata and reference data standards to ensure they have been conformed to and actions will need to be taken to ensure the data is validated and in a format that’s suitable for analysis. This also includes identifying gaps and other errors in the data (data quality analysis and cleansing).
Whilst several of the points which could lead to poor data quality and impact the results of the analysis have already been covered (inconsistent or ineffective data capture, data definition and data classification), acknowledging known limitations with the data at this stage is an important part of the analysis process. It’s unlikely that you’ll ever have a perfect dataset with 100% quality levels, so understanding the shortcomings in the data is important when performing the analysis, because it can influence the analytical techniques that you use, and it’s also important when the results are explained, so that any potential weaknesses are known.
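A simple, practical form of this is a completeness report that travels alongside the dataset, so that known gaps are stated rather than hidden. This is a minimal sketch with invented field names, not a full data-quality framework:

```python
def quality_report(records: list, required_fields: list) -> dict:
    """Summarise completeness per field, so known gaps can be reported
    alongside the analysis rather than silently ignored."""
    total = len(records)
    report = {}
    for field in required_fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        report[field] = {
            "missing": missing,
            "complete_pct": round(100 * (total - missing) / total, 1),
        }
    return report


records = [
    {"age": 50, "sex": "male"},
    {"age": None, "sex": "female"},
    {"sex": ""},
]
print(quality_report(records, ["age", "sex"]))
```

Publishing something like "age is only 33% complete" next to an age-based breakdown is exactly the kind of margin-for-error statement that was missing from most Covid-19 reporting.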
Where Covid-19 numbers have been presented in the media, I haven’t heard or seen any mention of the margin for error in them, but understanding the limitations of the metrics is critical to being able to effectively assess their meaning and to help evaluate the proportionality of actions taken.
Whether you support the lockdown measures or not, the shortcomings in the data and metrics make it difficult to assess whether more or less action should be taken, sooner or later. Whilst those presenting the numbers clearly want to project a sense of confidence and don’t want to confuse the broad audience that’s watching, if the shortcomings in the data are more clearly explained, up front, alongside the metrics themselves, it could help people to calibrate the insights and conclusions that are being made.
Now we’ve covered the basic steps for identifying and gathering the data needed to derive insights, we can move onto the analysis itself…
5. Reporting (basic Management Information (MI) / volumes)
Once the data has been sourced and prepared for analysis, it’s possible to start analysing it to generate some insights.
The first step here is always to start with some basic statistics and the kinds of high level metrics that are often presented as top-level “Management Information” (MI). By this I’m talking about overall volumes such as numbers of cases and deaths, plus things like averages, ranges and trends.
Despite everything I’ve said about the shortcomings of these numbers, especially in the context of the numbers reported for Covid-19, they’re still a necessary starting point and do provide an initial sense of the scale of the problem. They also establish some of the headline numbers against which other, more detailed analysis can be performed.
This is the level at which some of the charts, trends and forecasts have been reported, and where governments started talking about concepts such as “flattening the curve”. Of course, the actual analysis performed to produce the high-level diagrams presented in the news would have been far more complex than the final results displayed, but the headline numbers and high-level visualisations are an extremely important top-level view to help get a grasp on the overall situation.
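The most common trend-line treatment behind these charts is a simple rolling average, which smooths out day-of-week reporting noise. A minimal sketch (the daily counts below are invented):

```python
def seven_day_average(daily_counts: list) -> list:
    """Smooth noisy daily counts into a 7-day rolling mean: the kind of
    trend line behind 'flattening the curve' charts."""
    averages = []
    for i in range(len(daily_counts)):
        window = daily_counts[max(0, i - 6): i + 1]
        averages.append(sum(window) / len(window))
    return averages


# Invented daily case counts with a weekend reporting dip.
daily = [100, 110, 120, 115, 125, 60, 55, 130, 140, 150, 145, 155, 70, 65]
print(seven_day_average(daily))
```

The raw series appears to crash every weekend; the smoothed series shows the actual direction of travel, which is why headline MI is usually reported this way.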
However, if the steps already outlined in this post were applied: consistent data capture, consistent definitions, consistent reference data and a clear explanation of the limitations of the data; it would result in far more meaningful and comparable results.
Once the high level statistics have been established, it’s the next level of analysis that can start to inform decisions about what actions need to be taken. This next level of analysis comes when a broader set of more detailed data points are brought together…
6. Insights and analysis (data analysis / statistics)
We’ve finally reached the point that most people see: the point where people have performed more sophisticated analysis on the Covid-19 data and drawn conclusions from their analysis, which they’ve published as insights, to either support or challenge the actions that governments have taken. These conclusions can start to become either really useful, or really dangerous…
One simple example of analysis, which has been all over the news and social media, is the comparisons made across different datasets, such as comparisons across countries. As I’ve outlined above, unless some basic data management structures are in place over the data, these comparisons can be virtually meaningless and can lead to some totally false conclusions, such as the idea that one country has reacted “better” than another, when in fact the differences in cases and deaths in the countries may be due to other factors, such as how tests are being performed and how the results are being reported, with different definitions and different categorisations.
However, it’s when the next levels of analysis are performed that a range of ideas can start to surface that are potentially incorrect.
For example, let’s look at the number of deaths in a country in relation to whether or not its population habitually uses face masks.
It’s absolutely possible that the use of face masks could have a material impact on the spread of Covid-19 and I am not challenging the idea, plus I will absolutely support the use of them if it helps.
However, taking a really high level idea such as the use of face masks, plotting it in a chart and then announcing that it’s a data-driven insight, without actually basing the idea on any real data analysis, is not an example of good data management practice. It risks both communicating ideas that are incorrect and undermining the integrity of other analysis, even where appropriate procedures have been followed.
There are several reasons for this, but two of the most obvious ones are:
- This kind of conclusion is based on a general idea, not on data. You’d need to capture actual data on mask usage, both across the populations and amongst those who have been tested positive, to be able to perform any real analysis on the idea (i.e. you need to apply some proper scientific approaches to the collection and analysis of data, not just stick the idea into a chart and say “tah-dah”);
- You’re only looking at a single factor, which could have some kind of impact, but you could be inferring that it’s the one cause, when it may not be. In isolation the correlation may appear to be statistically significant, but without looking at other factors, you’re making a conclusion based on an incomplete understanding of what else could be at play.
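The single-factor trap is easy to demonstrate. In the entirely invented dataset below, population density drives both mask usage and deaths; a naive mask-vs-deaths correlation then comes out strongly positive, which, read in isolation, would "show" that masks increase deaths:

```python
import statistics


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Invented data for five hypothetical countries: denser countries both
# wear masks more AND have more deaths, so density confounds the
# mask-vs-deaths relationship.
density = [100, 500, 1000, 5000, 10000]   # people per km^2
mask_usage = [0.1, 0.3, 0.5, 0.8, 0.9]    # rises with density
deaths = [5, 20, 45, 200, 400]            # also rises with density

r = pearson(mask_usage, deaths)
print(f"mask usage vs deaths: r = {r:.2f}")  # strongly positive!
```

The correlation is real but the single-factor conclusion is nonsense: you’d need to control for density (and everything else at play) before claiming any effect, which is exactly the "proper scientific approach" point above.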
This is where data analysis has the potential to generate real insights and value, if an appropriate range of factors are considered and the analysis is performed properly.
What I’d be really interested to see in this space is a greater breakdown of other factors that could influence the spread of the disease and the impact on the affected population.
For example, what impact does population density have on the spread of the disease? I bet it has quite a big impact, but it’s not widely considered in the media: the lower the physical concentration of people across a country, for example where there are fewer densely populated urban areas and more open spaces and countryside, the less likely the disease is to spread through that population, regardless of the interventions made by the government. Could this be used as a factor in assessing the effectiveness of interventions, and of the different types of interventions used, even allowing for adjustments within different parts of countries where people are more or less spread out?
What other demographics impact the spread of the disease? Where people from particular ethnic backgrounds seem to be more affected than others, does this suggest a genetic vulnerability to the disease, or are there other factors within that population that we’re not considering, such as diet and lifestyle? Without a scientific approach to the analysis of these hypotheses, coupled with real-world assessment of the effectiveness of interventions, it would be very easy to jump to unfounded conclusions, which could be really dangerous and costly. When the results of misdirected actions really are life or death, it’s more important than ever to be disciplined in the application of professional approaches to these problems.
Of course, all of this analysis will only be useful if it’s used to take some kind of action, whether that be governments deciding what interventions to implement, or individuals using the information to help guide their own day-to-day decisions about how they live their lives…
7. Making decisions and taking action (including the ability to explain the output of the models)
Where decisions need to be made and actions need to be taken, both the way in which the data is interpreted and the way in which it’s explained become extremely important.
The charts and graphs shared in the media have been helpful in enabling the public to build some level of understanding of the data, but I wonder how much better the reports could have been if the basic data management disciplines outlined in this post were applied, consistently, across all countries. Also, if a bit more time were spent explaining the data, what it includes and doesn’t include, and what this means for the conclusions being drawn, would the increased data literacy of the public help drive better behaviours?
For example, in the UK, maybe some of the unhelpful comparisons with other countries could have been avoided or made with more context; and maybe the government could have avoided the deterioration in trust that they experienced when people realised that the numbers of deaths that were being reported only included hospital deaths, not those deaths that were occurring in the community and in care homes.
Also, given the clear shortcomings in the data and associated analysis, it does raise questions about whether the right interventions were made by our governments, at the right times. I have no doubt that our nations’ leaders have attempted to make the best decisions and take the best actions they could, based on the information they did have, but when it’s been clear that the data used has had such quality and consistency issues, it’s quite likely that some mistakes were made, some of which could, tragically, have cost lives.
The key question is: are lessons being learned and are they being applied to avoid future mistakes?
The importance of Data Governance for all of the above!
Given that this is a post about data management lessons learned, it would be impossible to finish without mentioning Data Governance, which is absolutely critical for every part of the above to work.
Data Governance establishes accountability and responsibility, for the ways that data is managed overall and for each step in the data management and analytics process. It answers questions such as: who’s accountable if data is of poor quality and if incorrect insights and conclusions are drawn from it?
Data Governance encompasses all of the actions required to make sure the topics covered in this post happen, both initially and on an ongoing basis, including:
- Deciding what data to collect and how to collect it (data sourcing strategy) – then making sure it’s captured that way;
- Deciding how to describe the data consistently: defining and agreeing definitions (metadata management) – then making sure it’s captured, processed and analysed in line with these definitions;
- Deciding how to categorise and group data consistently: defining and agreeing reporting dimensions (reference data management) – then making sure everyone knows what the latest allowed values are, and that they are applying them consistently;
- Making sure data quality is measured and monitored and that appropriate action is taken to correct data quality issues (including adjusting rules and governance at data capture), and including reporting known data quality issues where they’ve not yet been corrected, so they can be factored into any analysis or insights;
- Making sure good practices are applied in the analysis and reporting processes, including only using data in line with its metadata definitions and purpose (and without violating anyone’s data privacy).
Of course, implementing all of these things, globally, across all countries and jurisdictions, is undoubtedly challenging and exactly how to do it is beyond the scope of this post… but data governance is a well-established discipline, which would be absolutely possible to implement at this scale for Covid-19 and other epidemiological purposes, if there were sufficient political will and sponsorship to do so.
Conclusion
So there you have it… a few Data Management lessons that we can all take from the Covid-19 pandemic. I hope you’ve found my observations interesting, even if you’re already familiar with the concepts I covered. I always find it useful to consider these disciplines from different angles, because it can only help us better understand what we’re doing and help identify new ways to improve, whatever data management initiative we’re working on.
Let me know your thoughts… do you agree with the points I’ve made? Anything I’ve missed? Anything you want to correct me on?
Thanks for reading!
https://www.pauldanieljones.com/2020/05/covid-19-important-lessons-in-data.html