Realizing the Promise of Big Data Requires New Forms of Data Sharing
Thoughts about digital transformation and AI for enterprise leaders and their legal & compliance advisors
These posts represent my personal views on enterprise governance, regulatory compliance, and legal or ethical issues that arise in digital transformation projects powered by the cloud and artificial intelligence. Unless otherwise indicated, they do not represent the official views of Microsoft.
We all know that the present age of big data is full of promise not only for the economy but for society as a whole. The opportunity to create human and social value from this electronic manna is unprecedented—but seizing it requires breaking down many barriers, both technical and institutional. Big data, like the oil to which it is often compared, is not always found where it is most useful but must be laboriously extracted, refined, and transported to where it can be of most value. Legal and compliance professionals in organizations that produce or consume large quantities of data must learn how to craft agreements for data protection and sharing that make this process of data valorization feasible.
Certainly we are awash in data. One website reports that more data was created in the year 2017 than in the previous 5,000 years of human history. Given the rate of data growth, it’s a sure bet that 2019 will surpass the total for all of history (unless that already happened in 2018). Another website tells us that in 2020 the average inhabitant of the planet will create 1.7 megabytes of data per second, although frankly, I’m not sure I believe that number. But even if it’s too high, the 2.5 billion Facebook users who as of late 2019 upload 136,000 photos and post 800,000 comments or status updates per minute are certainly doing their part to make it come true. All in all, IDC conservatively estimates that we humans will collectively own some 40 trillion gigabytes by 2020.
It’s easy to see where all this digital data is coming from. In addition to 1.5 billion PCs, 75 million Internet-connected servers, and some 500 hyperscale cloud data centers, there are 5 billion cell phones in the world, or about one for every person over the age of 15. More than half of these phones are smartphones, and they are not confined to the wealthy West or booming Asia. MIT’s Andrew McAfee, in his important new book More from Less, reports that more people in Africa have cell phones today than have access to electricity. And all this doesn’t count the tens of billions of Internet of Things devices that are busy creating data without the supervision of a human user.
But sheer quantity alone does not make data into useful information. According to the same IDC study, only 3% of the world’s data is tagged, that is, marked with the right kind of metadata to make it readily exploitable for analysis. Today, the vast majority of the data we create is still unstructured, dumped as virtual waste bits into the cybersphere in the same way that our material economies pump waste heat into the atmosphere. It’s true that even 3% of 40 trillion gigabytes is still a lot of data—more than a trillion gigabytes, in fact. Yet even the data that is already in a usable format is all too often locked away in institutional silos that prevent the people who could make useful discoveries with it from getting access to it.
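To make the notion of tagging concrete, here is a minimal sketch in Python contrasting an untagged reading with the same reading carrying descriptive metadata. The field names and values are invented for illustration; real metadata standards vary widely by domain.

```python
# An untagged reading: meaningful only to whoever produced it.
raw = "23.4;2019-11-02T14:03:22Z;A7"

# The same reading tagged with metadata -- hypothetical field names,
# but the kind of structure that makes bulk analysis possible.
tagged = {
    "value": 23.4,
    "unit": "degrees_celsius",
    "measured_at": "2019-11-02T14:03:22Z",
    "sensor_id": "A7",
    "source": "building-hvac-network",
    "schema_version": "1.0",
}

# With tags in place, standard tools can filter and aggregate at scale.
readings = [tagged]
warm = [r for r in readings
        if r["unit"] == "degrees_celsius" and r["value"] > 20]
print(len(warm))
```

The untagged string can only be interpreted by the system that wrote it; the tagged record can be queried, joined, and aggregated by anyone, which is precisely what makes data exploitable for analysis.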
We don’t want all data to be accessible to everyone. We don’t want random data miners poking through the digital sludge of our old emails or Facebook posts, even if (or precisely because) they might detect interesting patterns. Companies don’t want to expose the data that circulates internally among their employees, just as financial institutions don’t want to expose the accounts of their customers. And governments and healthcare systems that store vast amounts of the most sensitive personal data are obliged by law and common decency to protect it.
But it remains indisputable that there is tremendous value for humanity buried in our vast and growing datasphere that we as a society are failing to exploit. The best example is medical data. Around the world, millions of patients suffer from and are treated for thousands of conditions, trivial or life-threatening. Electronic health information systems are capturing detailed descriptions of these treatments and their outcomes. We have the technical means to make this data widely available to researchers while still protecting patient privacy; a key enabler is differential privacy, a technique that adds carefully calibrated statistical noise to query results so that aggregate patterns remain visible while no individual patient can be singled out. But we are only just beginning to build the digital archives and research portals that will make this possibility a reality.
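For readers curious how differential privacy works mechanically, here is a minimal sketch in Python of its simplest form, the Laplace mechanism: a count query answered with calibrated random noise. The patient records and the epsilon privacy budget are purely hypothetical illustrations, not any particular portal’s implementation.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Differentially private count of records matching `predicate`.

    A count query has sensitivity 1 (adding or removing one person
    changes the answer by at most 1), so Laplace noise with scale
    1/epsilon masks any single individual's presence in the data.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical patient records -- illustrative only.
patients = [
    {"age": 62, "condition": "ALS"},
    {"age": 57, "condition": "MS"},
    {"age": 70, "condition": "ALS"},
]

print(dp_count(patients, lambda r: r["condition"] == "ALS", epsilon=0.5))
```

The smaller the epsilon, the stronger the privacy guarantee and the noisier the published answer; real systems must also track how much of the total privacy budget successive queries consume.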
Here are two examples of medical data portals that point to a promising future:
- Microsoft is working with researchers from Johns Hopkins and MIT to build a repository containing exhaustive medical data from 1,000 patients with ALS (amyotrophic lateral sclerosis), a devastating disease of progressive paralysis. The data is voluntarily shared by patients being treated in hospitals all over the United States. The repository, which has no equivalent elsewhere, is being pored over by researchers looking for ways to treat a disease that until now has proven invariably fatal.
- France is launching a national repository called the Health Data Hub that is extraordinarily ambitious and may be unique in the world. The project will combine 18 separate databases covering nearly every patient, condition, and treatment in every healthcare facility in France, all on an integrated cloud platform with rigorous privacy protections in place. The hope is that researchers from universities and industry will apply AI and other analytical tools to improve the effectiveness of the nation’s healthcare system and discover new treatments.
Such positive examples of data sharing are still too rare. But data scientists and policymakers are well aware of the data silo problem. That’s why many in the tech industry, in the research world, and in government are working hard to develop standards and platforms for effective data sharing while also preserving privacy. This effort extends far beyond just the healthcare sector. The idea is that ultimately all kinds of data sets should be made open or partially open, with appropriate privacy protections where necessary. Last year the US Congress unanimously passed a new law called the OPEN Government Data Act mandating that, whenever possible, data sets belonging to the Federal government “shall be published as machine-readable data…in an open format, and…under open licenses.”
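To illustrate what “machine-readable data…in an open format, and…under open licenses” can look like in practice, here is a minimal hypothetical sketch in Python: a small table written as plain CSV alongside a metadata file declaring an open license. The file names, fields, and license choice are illustrative assumptions, not requirements of the Act.

```python
import csv
import json

# Hypothetical dataset contents -- illustrative only.
rows = [
    {"county": "Example County", "year": 2019, "inspections": 412},
    {"county": "Sample County", "year": 2019, "inspections": 318},
]

# Machine-readable data in an open format: plain CSV.
with open("inspections.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["county", "year", "inspections"])
    writer.writeheader()
    writer.writerows(rows)

# A machine-readable sidecar declaring an open license (CC0 here,
# chosen purely as an example of a license anyone may reuse).
with open("inspections.metadata.json", "w") as f:
    json.dump({
        "title": "Food safety inspections (example)",
        "format": "text/csv",
        "license": "CC0-1.0",
    }, f, indent=2)
```

The point is that both the data and its terms of use can be read by software, so a researcher’s tools can discover, validate, and combine such datasets without a human decoding a proprietary format or a bespoke legal agreement first.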
Building such systems will require a carefully designed infrastructure of legal and privacy protections. These protections will need to be embedded both in the technical protocols for exchanging data and in the contracts governing its allowable uses. At Microsoft, our legal department is developing open source contracts for exactly this kind of data sharing. Our top intellectual property lawyer, Erich Andersen, has even called for the appointment of a Federal Chief Data Officer to spearhead the goals of the new open data law.
Thanks to these legal innovations, more and more public and private sector organizations are going to be sharing their data for AI-powered research that benefits everyone while respecting privacy and commercial property rights. This is the payoff of living in the era of big data.
Microsoft has published a book about how to manage the thorny cybersecurity, privacy, and regulatory compliance issues that can arise in cloud-based digital transformation, including a section on AI and big data. The book explains key topics in clear language and is full of actionable advice for enterprise leaders. Click here to download a copy; a Kindle version is available here.