A Plea for Open Government Big Data
Imagine it is 2022. Covid-19 is gone, but Covid-22 is on the verge of spreading. Like the current coronavirus, it has a country of origin, and predictions tell us that if it spreads across the world, it will bring a considerable death toll.
But assume that in 2022, we have the full power of Big Data at our disposal. In our fictitious example, Patient 0 lands in Milan, and two days later, he is diagnosed.
In 2019, it was still impossible to track Patient 0, but in 2022, we have Patient 0's movement data. Thanks to his mobile phone, we know where Patient 0 was at every minute, down to his visits to the restroom. And thanks to the geolocation data of all mobile phones stored in a data lake, we can track down every person who might have been infected by Patient 0. Instead of thousands of people dying in hospitals and a complete social and economic lockdown, we prevent the scenario we face today.
Public Health vs. Big Brother
If you ask the average person for their opinion on Big Data and government, the majority will quote "Big Brother." Some also refer to horror stories from countries where AI is already used to track people. We know the discussions that follow: "Did you know that via facial recognition, AI can identify a single individual in a stadium full of people? What if this technology falls into the wrong hands?"
However, if you ask people whether a country has the right to use technology to protect its citizens, no one disagrees. It is annoying to receive a speeding ticket, but at the same time, everyone understands the benefits of the police regulating traffic. If no one watched drivers on the road, the death toll would be unacceptably high. So, even if we have to pay a speeding ticket, we are happy that the police stop drunk drivers and arrest them.
What makes Big Data scary is the question of what else an analyst can do with unrestricted access to all personal data. We all know the scenarios from science fiction movies.
Open Government for Big Data to the Rescue
What if we made fully transparent how we collect data from citizens, so that in case of a pandemic outbreak, we can react fast? Furthermore, what if this full transparency and governance of data processing made misuse impossible? Would we accept it if it allowed us to prevent pandemics like Covid-19? If the answer is yes, what do we need to put this into practice?
(1) A transparent list of all data sources used for Big Data Analytics for Government
We can collect, for example:
- geolocation data from mobile providers,
- CCTV footage,
- medical records,
- and many more.
The number of potential sources is high, and the above list is far from complete. Each interface to each concrete data provider has to be listed, conforming to data protection standards.
In concrete terms, there is documentation for every data source that states something like:
We collect RAN data provided by <fill in the name of the telecommunication provider>, with the following fields that contain personalized data: <fill in the PII data>. Federal officers will only access this data in case of a pandemic emergency.
This document has to be publicly accessible.
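Such an entry could additionally be published in machine-readable form so that anyone can audit the registry programmatically. A minimal sketch in Python; the schema, field names, and all values are illustrative assumptions, not a proposed standard:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataSourceRecord:
    """One public registry entry per data source."""
    provider: str           # who delivers the data
    dataset: str            # what is delivered, e.g. RAN events
    pii_fields: List[str]   # which fields contain personalized data
    access_condition: str   # under which condition officials may access it
    legal_basis: str        # data protection standard it conforms to

# A hypothetical entry; every value is a placeholder.
example = DataSourceRecord(
    provider="<telecommunication provider>",
    dataset="RAN data",
    pii_fields=["subscriber_id", "cell_id", "timestamp"],
    access_condition="federal officers, only during a declared pandemic emergency",
    legal_basis="<applicable data protection regulation>",
)
```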
(2) A public code repository hosting all source code on how to collect data
Once we have a list of data sources, we can also publish the source code that collects the data. An essential part of this is data filtering and obfuscation: we should not collect data that is not needed for pandemic scenarios, and, most importantly, we need algorithms that ensure personalized data is fully protected.
There are many data pipelines to be implemented. Let us take a reference example: we collect data from a telco's radio network stations in batch or streaming processes. All collected data is encrypted in transit and at rest. While loading the data, we also remove every attribute unrelated to the use case.
With an open-source codebase, we can review every workflow. In addition to the algorithms that fetch data from the data sources and store it in a data lake, we also need a supervisory board of data protection experts that approves the selected encryption algorithms based on their ability to protect data.
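To make the filtering and obfuscation step concrete, here is a minimal sketch in Python. The allowed fields and the keyed-hash pseudonymization are assumptions for illustration; a real pipeline would implement whatever the supervisory board approves:

```python
import hashlib
import hmac

# Only the attributes needed for the pandemic use case survive ingestion;
# everything else is dropped before the event reaches the data lake.
ALLOWED_FIELDS = {"subscriber_id", "cell_id", "timestamp"}

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    Without the secret key, the original identifier cannot be
    recovered from the stored value.
    """
    return hmac.new(secret_key, value.encode(), hashlib.sha256).hexdigest()

def ingest_event(raw_event: dict, secret_key: bytes) -> dict:
    """Filter and obfuscate one raw RAN event."""
    event = {k: v for k, v in raw_event.items() if k in ALLOWED_FIELDS}
    event["subscriber_id"] = pseudonymize(event["subscriber_id"], secret_key)
    return event
```

Because code like this would be public, anyone could verify that no attribute outside the approved list is ever stored.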
(3) Storage and Compute Infrastructure
The most expensive part is the infrastructure. The data needs to be stored, and wherever it is stored, access to the servers has to be limited and governed.
Engineers who have already worked with public cloud providers or on-premise data centers have a good understanding of the requirements. The storage and compute costs exceed the budget of an average enterprise.
One option is to store all EU-related data in data centers provided by a political union such as the European Union. As taxes finance this infrastructure, it is at least indirectly owned by all citizens.
(4) Governance
So, we collect data, and we analyze it on demand. We still need a governance approach around it. Who approves the analysis of PII data? What data retention periods apply? How do we react if there is a suspicion that the platform hosting all the data has become insecure? How can we ensure that retention algorithms delete data permanently once storing PII data no longer serves a purpose? (In other words, for the pandemic use case, we do not need geolocation data that is several months old, and therefore it must be deleted.)
There have to be policies and regulations that clarify ownership, responsibilities, and processes. The documentation of these processes has to be public as well.
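A retention policy is one of the few governance rules that translates directly into code. A minimal sketch, assuming a 30-day retention period (the actual period would be fixed by the public policy documents):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed retention period for the pandemic use case; geolocation data
# older than this no longer helps contact tracing and must be deleted.
RETENTION = timedelta(days=30)

def expired(event_timestamp: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if an event has outlived the retention period.

    event_timestamp must be timezone-aware (UTC).
    """
    now = now or datetime.now(timezone.utc)
    return now - event_timestamp > RETENTION

# A scheduled retention job would scan the data lake and permanently
# delete (not merely flag) every record for which expired(...) is True.
```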
Thoughts on Data Protection
One key issue is that we need to store PII data. If we want to track potentially infected people, someone needs to be able to access the personal data of those potentially infected people. How do we guarantee maximum data protection and, at the same time, the ability to track people based on PII data if needed?
The solution is to encrypt all the collected data. For each individual's data, there is a key to decrypt the PII. As everyone who works with encryption knows, only people who hold the key can read the data; those without the key cannot.
It would be possible to store two copies of the key:
- The individual owns one copy.
- The other is locked away in a vault.
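A minimal sketch of this per-individual encryption in Python, using the symmetric Fernet scheme from the `cryptography` package; the choice of scheme is an assumption made for brevity, and a real system might use asymmetric or envelope encryption instead:

```python
from cryptography.fernet import Fernet  # pip install cryptography

def new_personal_key() -> bytes:
    """Generate one key per individual."""
    return Fernet.generate_key()

def encrypt_pii(plaintext: str, key: bytes) -> bytes:
    """Encrypt one individual's PII record with their personal key."""
    return Fernet(key).encrypt(plaintext.encode())

def decrypt_pii(ciphertext: bytes, key: bytes) -> str:
    """Only a key holder (the individual or the vault) can do this."""
    return Fernet(key).decrypt(ciphertext).decode()

# One copy of `key` goes to the individual, the other into the vault.
key = new_personal_key()
token = encrypt_pii("subscriber 1234, cell 42, 2022-03-01T10:15", key)
assert decrypt_pii(token, key).startswith("subscriber 1234")
```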
The critical question, and most likely the deciding factor for this initiative, is how to unlock the vault to access the keys that decrypt PII data. Who approves this? How is it tracked, and how are people informed about this usage? The documentation of this workflow and the decision criteria needs to be publicly accessible.
Possible Next Steps
Only a large organization like a country or political union can implement such an initiative; only they have the resources. One aspect of putting this into practice is the division of powers: different, independent NGOs might be in charge of different dimensions, and the realization should, of course, also involve academic institutions.
The most crucial aspect, however, lies with everyone who reads this. We are still far away from decision-makers seriously considering making Big Data for governments open source. The most vital step is, therefore, to spread the word and to demand this transparent form of Big Data from our governments. Because if no one expects governments to take this open approach, they will, in the name of security, set up a closed procedure. And we will never know whether, in the end, Big Brother is watching us.