Big Data or Big Brother?
According to the latest “IDC Digital Universe Study”, sponsored by EMC, “in 2011, the amount of information created and replicated will surpass 1.8 zettabytes (1.8 trillion gigabytes) - growing by a factor of 9 in just five years”[1]. As The Economist nicely put it, we now live in a “Data Deluge”[2].
One might ask: why such a deluge? Paradoxically, most of this “data deluge” is not produced by us as human beings: the amount of information we create ourselves, by authoring documents, downloading music or recording a movie, is far less than the amount of information created about us in the digital ocean. Let’s take a very simple example: when we use our smartphone, we produce a lot of data without even realizing it, through a whole set of sensors: geographic location through GPS or via multilateration[3] of radio signals between towers of the cellular network, motion and direction through GPS and accelerometers, browsing history,… And this doesn’t even take into account all the information we (voluntarily or not) publish to social networks like Facebook, Foursquare or Twitter.
So, this has been a rather quiet revolution, but the new digital world in which we live is now composed of vast oceans of information, which keep getting more and more “water” through the proliferation of connected devices. And there are more and more of those connected devices: Cisco IBSG predicts there will be 25 billion devices connected to the Internet by 2015 and 50 billion by 2020.[4] Obviously, all those connected devices, which make up what is also referred to as the “Internet of Things” or “Internet of Objects”, generate a continuous flood of new raw data which can be turned, through sophisticated computing and statistical tools, into new information and then new knowledge, in order to gain more insight and/or to take action. Through this chain of continuous transformation, which gains value at each and every step (from signal to data, from data to information, from information to knowledge), a new knowledge era is emerging: “The Big Data era”. The implications of this new era for businesses, governments, democracy and culture are incommensurable. This is why I strongly believe it is vital that every citizen starts to understand what the potential consequences of such a trend are.
But let’s start at the beginning and try to define what Big Data is.
Big Data
People usually relate Big Data to big volume. According to IDC, “Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.” So, Big Data is really about volume, variety and velocity (the “3 Vs” of Big Data).
Volume
As implied by the term “Big Data,” organizations are facing massive volumes of data. The conversation about data volumes has changed from terabytes to petabytes with an inevitable shift to zettabytes, and all this data cannot be stored in traditional systems. Twitter alone generates more than 7 terabytes of data every day, Facebook 10 terabytes, and some enterprises generate terabytes of data every hour of every day of the year. Often, organizations that don’t know how to manage this data are overwhelmed by it. But the opportunity exists, with the right technology platform, to analyze almost all of the data (or at least more of it by identifying the data that’s useful) to gain a better understanding of your business, your customers, and the marketplace.
Variety
Variety represents all types of data: a fundamental shift in analysis requirements, from traditional structured data to raw, semi-structured and unstructured data as part of the decision-making and insight process. The truth of the matter is that 80 percent of the world’s data is unstructured (like video or audio files) or, at best, semi-structured (like Word documents). To capitalize on the Big Data opportunity, organizations must be able to analyze all types of data, both relational and non-relational: text, sensor data, audio, video, transactional, and more.
Velocity
Velocity refers to how quickly data comes at you; it brings into the scope of Big Data the notion of capturing a stream of data. The proliferation of GPS sensors and other information streams has led to a constant flow of data at a pace that traditional systems cannot handle. Dealing effectively with Big Data requires that you perform analytics against the volume and variety of data while it is still in motion, not just when it is at rest.
But what might Big Data be useful for? As mentioned in a very interesting report from the Aspen Institute[5]:
“Using advanced correlation techniques, data analysts (both human and machine) can sift through massive swaths of data to predict conditions, behaviors and events in ways unimagined only years earlier. As the following report describes it:
- Google now studies the timing and location of search-engine queries to predict flu outbreaks and unemployment trends before official government statistics come out. Credit card companies routinely pore over vast quantities of census, financial and personal information to try to detect fraud and identify consumer purchasing trends.
- Medical researchers sift through the health records of thousands of people to try to identify useful correlations between medical treatments and health outcomes.
- Companies running social-networking websites conduct “data mining” studies on huge stores of personal information in attempts to identify subtle consumer preferences and craft better marketing strategies.
A new class of “geo-location” data is emerging that lets companies analyze mobile device data to make intriguing inferences about people’s lives and the economy. It turns out, for example, that the length of time that consumers are willing to travel to shopping malls—data gathered from tracking the location of people’s cell phones—is an excellent proxy for measuring consumer demand in the economy”.
So, as you may imagine, a lot of new business scenarios might be enabled by this emerging Big Data approach. Spurred by the continuous improvement of computation and storage costs driven by Moore’s law, and by the emergence of affordable solutions for manipulating large data sets using cloud computing and commodity hardware, enterprises and organizations are seeking new ways to get more insight from each and every bit of data they have access to. Such insights can provide competitive advantages to businesses, as well as improve tactical decisions and help control costs.
According to IDC, “this period of ‘space exploration’ of the digital universe will not be without its challenges. But for the ‘astronauts’ involved, CIOs and their staff, it represents a unique, perhaps once-in-a-career opportunity to drive growth for their enterprises. They will need to lead the enterprise in the adoption of new information-taming technologies, best practices for leveraging and extracting value from data, and the creation of new roles and organizational design. Each step will require organizational change, not just a few new computers or more software. The success of many enterprises in the coming years will be determined by how successful CIOs are in driving the required enterprise-wide adjustment to the new realities of the digital universe.”[6]
Big Data in the real world
As mentioned in a previous post, Big Data is already playing a key role in e-Science, and data-intensive science promises breakthroughs across a broad spectrum. Big Data is also starting to play a key role in some very diverse business environments:
- Financial services: Modeling True Risk, Threat Analysis, Fraud Detection, Trade Surveillance, Credit Scoring and Analysis,…
- Online retail services: Recommendation Engines, Ad Targeting, Search Quality, Abuse and click fraud detection,…
- Retail store services: Point of Sales Transaction Analysis, Customer Churn Analysis, Sentiment Analysis,…
- Telecom operators: Customer Churn Prevention, Network Performance optimization, Call Detail Record (CDR) Analysis, Analyzing Network to Predict Failure,…
At the end of the day, Big Data allows you to get new business insight from data you already have (and are likely throwing away).
In itself, the volume of data is a societal phenomenon that nobody can ignore. And, obviously, many citizens around the world might regard this huge mass of information with a lot of suspicion, seeing the data flood as an intrusion on their privacy. Let’s try to understand why.
“Six Provocations for Big Data”
On September 21, 2011, Danah Boyd from Microsoft Research and Kate Crawford from the University of New South Wales presented a paper entitled “Six Provocations for Big Data”[7] at a symposium organized by the Oxford Internet Institute: “A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society”.
In this very interesting essay, Danah and Kate offer six provocations that may help us better understand some of the potential issues raised by Big Data:
- Automating Research Changes the Definition of Knowledge: “it is a profound change at the levels of epistemology and ethics. It reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and the categorization of reality.[8]”
- Claims to Objectivity and Accuracy are Misleading: “Interpretation is at the center of data analysis. Regardless of the size of a data set, it is subject to limitation and bias. Without those biases and limitations being understood and outlined, misinterpretation is the result. Big Data is at its most effective when researchers take account of the complex methodological processes that underlie the analysis of social data.[9]”
- Bigger Data are Not Always Better Data: “Twitter has become a popular source for mining Big Data, but working with Twitter data has serious methodological challenges that are rarely addressed by those who embrace it. When researchers approach a dataset, they need to understand – and publicly account for – not only the limits of the dataset, but also the limits of which questions they can ask of a dataset and what interpretations are appropriate.[10]”
- Not All Data Are Equivalent: “Some researchers assume that analyses done with small data can be done better with Big Data. This argument also presumes that data is interchangeable. Yet, taken out of context, data lose meaning and value. Context matters. When two datasets can be modeled in a similar way, this does not mean that they are equivalent or can be analyzed in the same way.[11]”
- Just Because it is Accessible Doesn’t Make it Ethical: “With Big Data emerging as a research field, little is understood about the ethical implications of the research being done. Should someone be included as a part of a large aggregate of data? What if someone’s ‘public’ blog post is taken out of context and analyzed in a way that the author never imagined? What does it mean for someone to be spotlighted or to be analyzed without knowing it? Who is responsible for making certain that individuals and communities are not hurt by the research process? What does consent look like?[12]”
- Limited Access to Big Data Creates New Digital Divides: “The current ecosystem around Big Data creates a new kind of digital divide: the Big Data rich and the Big Data poor”.
In this post, I am not going to comment on all of these very interesting topics; I’d rather focus on one of them, the fifth: “Just because it is accessible, doesn’t make it ethical”. Indeed, this provocation focuses especially on privacy and, more generally, on accountability. As mentioned in the paper: “what is the status of so-called ‘public’ data on social media sites? Can it simply be used, without requesting permission? What constitutes best ethical practice for researchers? Privacy campaigners already see this as a key battleground where better privacy protections are needed. The difficulty is that privacy breaches are hard to make specific – is there damage done at the time? What about twenty years hence? ‘Any data on human subjects inevitably raise privacy issues, and the real risks of abuse of such data are difficult to quantify’ (Nature, cited in Berry 2010).”
In fact, most criticism of Big Data focuses on the risk that it might be misused and abused. One example of such abuse is that some large corporations might use their new Big Data capabilities to manipulate consumers or to distort competition. More generally, the data privacy community fears potential threats to personal privacy, civil liberties and consumer freedom. Let’s try to clarify this point.
The question of anonymization
One of the first authors to raise the problem of how to preserve anonymity[13] was Latanya Sweeney, who in 2000 published a very interesting article: “Uniqueness of Simple Demographics in the U.S. Population”[14]. According to this article, “It was found that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP code, gender, date of birth}. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides. And even at the county level, {county, gender, date of birth} are likely to uniquely identify 18% of the U.S. population. In general, few characteristics are needed to uniquely identify a person.”
This article illustrates a very important concept in privacy preservation called “re-identification”. Re-identification is the process by which it is possible to link anonymous data to the actual identity of an individual; in other words, to turn anonymous data into non-anonymous data, with all the potentially serious consequences that entails. In her article, Latanya Sweeney demonstrated that anonymous data sets can often be re-identified. In one experiment described in the article, Sweeney used 1990 Census data to show that individuals often have combinations of demographic values that occur infrequently; because these combinations are so rare, they allow the re-identification of individuals in supposedly anonymous datasets.
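To make the mechanics of re-identification more concrete, here is a minimal sketch in Python using pandas. Everything in it is hypothetical: the records and column names are invented for illustration. But it shows how joining a “de-identified” dataset with a public auxiliary dataset on shared quasi-identifiers such as ZIP code, gender and date of birth is essentially a one-line operation.

```python
# Minimal, hypothetical sketch of a linkage (re-identification) attack.
# All records and column names below are invented for illustration.
import pandas as pd

# "Anonymized" dataset: direct identifiers removed, quasi-identifiers kept.
medical = pd.DataFrame({
    "zip":       ["02138", "02139", "02141"],
    "gender":    ["F", "M", "F"],
    "birthdate": ["1945-07-31", "1951-02-14", "1960-10-02"],
    "diagnosis": ["hypertension", "diabetes", "asthma"],
})

# Public auxiliary dataset, e.g. a purchased voter registration list.
voters = pd.DataFrame({
    "name":      ["A. Martin", "B. Chen", "C. Dubois"],
    "zip":       ["02138", "02139", "02142"],
    "gender":    ["F", "M", "F"],
    "birthdate": ["1945-07-31", "1951-02-14", "1972-05-05"],
})

# Joining on the quasi-identifiers re-attaches names to "anonymous" records.
linked = medical.merge(voters, on=["zip", "gender", "birthdate"])
print(linked[["name", "diagnosis"]])
```

Given Sweeney’s figures, where {5-digit ZIP code, gender, date of birth} is unique for roughly 87% of the U.S. population, such a join will resolve to a single named individual most of the time.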
Since then, a lot of other examples have been found:
- Latanya Sweeney was also able to link the supposedly de-identified health records released by the Massachusetts Group Insurance Commission to the voter registration list for Cambridge, Massachusetts (bought for $20) in order to re-identify the personal medical records of Governor William Weld[15].
- In 2006, AOL released the search records of 500,000 of its users. “Within days of the database's release, journalists from the New York Times had revealed the identity of user number 4417749 to be Thelma Arnold, a 62-year-old widow from Lilburn, Ga. Over 300 of the woman's searches were traced back to her, ranging from "60 single men" to "dog that urinates on everything."[16]”
- In 2007, two researchers from the Department of Computer Sciences at the University of Texas at Austin, Arvind Narayanan and Vitaly Shmatikov, were able to identify two people out of the nearly half-million anonymized users whose movie ratings were released by the online rental company Netflix in 2006[17].
According to Biomedical Computation Review, ”Sweeney’s successful re-identification attack helped prompt the adoption of the HIPAA Privacy Rule in 2000. The Privacy Rule imposes restrictions on the release of “individually identifiable health information.” These federally legislated constraints on disclosure are waived, however, if the data has been de-identified by applying the so-called “safe harbor” method, which involves removing 18 identifiers, including names, dates, and Social Security numbers. Since data that has been de-identified under the safe harbor method is no longer considered to be individually identifiable, it is no longer covered by the Privacy Rule, and can be freely shared.[18]”
These few emblematic cases exemplify how difficult data anonymization is. Indeed, some seemingly unrelated attributes can, when put together, increase the risk of data re-identification. Therefore, a lot of new techniques spanning multiple disciplines, including security, statistics, databases, cryptography and theoretical computer science, have been developed to mitigate this kind of privacy risk, such as k-anonymity, l-diversity, t-closeness, differential privacy,… But despite all this work, privacy breaches are still common, even when security and data integrity have not been compromised.
Intuitively, the goal of all these techniques is to ensure that, for every possible output of the system, the probability of that output is almost unchanged by the addition or removal of any single individual within the data set. A good summary of results on this front can be found here and here.
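That intuition is essentially what differential privacy formalizes. As a rough, non-authoritative sketch (the epsilon value, the data and the query below are all made up for illustration), the classic Laplace mechanism answers a counting query with noise calibrated to how much any single individual can change the true answer:

```python
# Rough sketch of the Laplace mechanism used in differential privacy.
# The epsilon value and the example data are arbitrary; intuition only.
import numpy as np

def private_count(values, predicate, epsilon=0.5):
    """Return a noisy count of the records matching `predicate`.

    A counting query changes by at most 1 when a single individual is
    added to or removed from the data set (sensitivity = 1), so adding
    Laplace noise with scale 1/epsilon makes the output distribution
    almost insensitive to the presence or absence of any one person.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical query: how many people in the data set are over 60?
ages = [34, 67, 45, 71, 29, 63]
print(private_count(ages, lambda age: age > 60))
```

Smaller values of epsilon mean more noise and stronger privacy guarantees, at the cost of less accurate answers.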
Big Data or Big Brother?
In his very thorough article entitled “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization”[19], Paul Ohm from the University of Colorado Law School gives the following conclusion: “Easy re-identification represents a sea change not only in technology but in our understanding of privacy. It undermines decades of assumptions about robust anonymization, assumptions that have charted the course for business relationships, individual choices, and government regulations. Regulators must respond rapidly and forcefully to this disruptive technological shift, to restore balance to the law and protect all of us from imminent, significant harm. They must do this without leaning on the easy-to-apply, appealingly nondisruptive, but hopelessly flawed crutch of personally identifiable information.”
Although Big Data already looks very promising, opening new avenues for science and helping many different industries reduce the time to insight and the time to action, it clearly appears that the Big Data approach challenges traditional efforts to protect privacy. And the question of re-identification represents only a small part of the global picture.
Indeed, for many years, information privacy has been protected by the adoption of, and adherence to, Fair Information Principles (FIPs). Even if there are many differences between regions of the world, these principles have either been codified into laws and regulations or been used as part of self-regulatory schemes. They have generally focused on some basic common practices like collection limitation, notice, user consent, and security. But, as mentioned by Scott Charney in his “Trustworthy Computing Next” paper[20], “While perhaps conceptually satisfying, this model – with its heavy emphasis on notice and choice (consent) at time of collection – is under considerable strain and the burden it has put on individuals is untenable. Indeed, some have argued that it has virtually collapsed and is no longer serving society in the way that it was intended. The cloud enabled world is already marked by a proliferation of devices, an abundance of data (e.g., user created content, network generated data such as geolocation and transactional data, analytical data derived from user actions), increased storage, and better search and decision enhancing algorithms.”
In the same paper, Scott Charney gives three reasons why this traditional approach to privacy, which focuses on the collection of data and the notices provided at the time collection occurs, is no longer appropriate:
- “The burdens of understanding and managing choices regarding data collection and use create an enormous strain for individuals. Leaving aside that even the entity collecting the data may have a reasonable large set of understandable uses, increasingly complex business relationships invalidate the notion that a single entity will be collecting and using the data provided and that an individual and data collector will be in a strictly bilateral relationship.[21]”
- “The existing model assumes an interactive relationship between the individual and the entity collecting and using the data, a relationship that may not actually exist. For example, as recently reported, insurance agencies may look at Facebook photos to see if individuals claiming disability benefits are engaging in activities that suggest the insurance claim is fraudulent[22]. Similarly, facial recognition technology may involve matching a photo of a person in a public space with other photos of people in other public places. The actual identity of the person being “identified” may, in fact, be unknown at the time of the data collection or usage, thus making discussions about user consent impractical. In such a case, an organization collecting and using the data may be making decisions about an individual with whom they have no relationship or even knowledge of identity.[23]”
- “The true value of data may not be understood at the time of collection and future uses that have significant individual and societal benefit may be lost. This is not to suggest, of course, that unneeded data should be collected (which, in any event, leads to higher storage and security costs) or that data collected for one purpose should be used cavalierly whenever some new, beneficial purpose is later discovered. Rather, the point is that this new data-centric world will provide society with a wide range of new possibilities, some of which may be unforeseen at the time data is collected (e.g., those collecting blood samples forty years ago did not mention DNA testing as a potential use, but it has since served to exculpate the innocent). At the same time, it is certainly true that data can be abused; redlining (where loans are denied to neighborhoods comprised heavily of minorities) presents one such example. But whether data usage is “good” or “bad” is just that – a judgment about usage. While a limitation on data collection does serve an important prophylactic purpose and remains relevant, it is not without cost; the real debate should be over use and how to strike a right balance between the societal benefit and personal privacy. That has been, is, and will likely remain an issue requiring deep thought.[24]”
So, what kind of solution could be proposed in order to avoid the kind of Big Brother world I was alluding to in my title? First and foremost, the user, the “internet citizen”, should be aware of this Big Data and privacy question. Not to be afraid of it, but to be fully aware of it. Because “the real challenge, of course, is that individuals and society must reaffirm, redefine and/or fundamentally alter their expectations of privacy in this new world, particularly since the interests of data subjects and data users are not always well aligned.[25]”
The question that must then be answered concerns the future of the Fair Information Principles in a Big Data world. Given the three reasons listed previously, it seems that the emphasis should shift away from the collection of data, and from the explicit consent the user gave at the time it was collected, toward the use of that data. This doesn’t imply that the principle of user notice at collection time should be abandoned, but rather that the focus should be on the use of data, now and in the future.
In such a model, every party involved should be fully transparent in order to (1) systematically offer and honor the user’s choices, and (2) “ensure that risks to individuals related to data use are assessed and managed.[26]” Obviously, this kind of model should be embedded within a business governance approach that could be audited by regulators, if that is what a given society chooses to implement.
Beyond this new emphasis on data usage, it seems clear that the Fair Information Principles should also evolve to include the “accountability principle” that Danah Boyd and Kate Crawford advocate in their paper: “Accountability to the field and to human subjects required rigorous thinking about the ramifications of Big Data, rather than assuming that ethics boards will necessarily do the work of ensuring people are protected. Accountability here is used as a broader concept than privacy, as Troshynski et al. (2008) have outlined, where the concept of accountability can apply even when conventional expectations of privacy aren’t in question. Instead, accountability is a multi-directional relationship: there may be accountability to superiors, to colleagues, to participants and to the public (Dourish & Bell 2011).”
As also mentioned by Scott Charney in his paper, the “accountability principle” means that “an entity receiving data (directly or indirectly) is responsible for ensuring that such data is collected lawfully, and used and protected in ways consistent with individual and societal expectations. To be “accountable” means that the organization has taken steps to develop and implement privacy risk assessments, policies, processes and procedures that help enforce data usage rules that honor societal norms, respect user control, and ensure data is reasonably secure. Significantly, upon request of a regulator, an organization must be able to demonstrate how they have fulfilled their responsibility under this principle.”
Again, as mentioned by Scott Charney, “those collecting data from individuals for a business purpose must provide notice indicating:
- the purposes for which that data will be used;
- with whom the data will be shared, if anyone;
- how such use or sharing benefits the individual;
- whether the user will remain non-identifiable through either business process or technological means;
- the general type and nature of the information collected;
- what control, if any, the individual can exercise to affect the use and sharing of this data; and
- how the data is protected.
All collection and data use must, of course, meet legal requirements and be reasonably consistent with the notice provided.”
As mentioned in this post and in the previous one, Big Data looks to be very promising, paving the way for a new science and an abundance of new rich business insights. At the same time, Big Data may be misused and abused, and it could imperil personal privacy, civil liberties and consumer freedoms.
As quoted by Danah Boyd and Kate Crawford: “Technology is neither good nor bad; nor is it neutral... technology’s interaction with the social ecology is such that technical developments frequently have environmental, social, and human consequences that go far beyond the immediate purposes of the technical devices and practices themselves.” If we want to live in a rich data world where Big Data fulfills all its promises, we need to focus on data use, meaningful user control, and transparency. This last word is perhaps the most important one of this post, because trust is built on transparency. And without trust, it is highly probable, even if one Cassandra has already announced that privacy is dead[27], that this rich, data-enabled world will never materialize.
[6] IDC, iView, “Extracting Value from Chaos”, June 2011, https://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
[8] Idem.
[9] Idem.
[10] Idem.
[11] Idem.
[12] Idem.
[13] Anonymity can be defined as the state in which an individual's personal identity, or personally identifiable information, remains publicly unknown.
[14] Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA: 2000. https://dataprivacylab.org/projects/identifiability/paper1.pdf
[18] https://biomedicalcomputationreview.org/content/privacy-and-biomedical-research-building-trust-infrastructure
[19] UCLA Law Review, Vol. 57, p. 1701, 2010.
[21] Scott Charney, idem.
[23] Scott Charney, idem.
[24] Scott Charney, idem.
[25] Scott Charney, idem.
[26] Scott Charney, idem.