Big Data: A Historical Perspective in Present Times
Dr Kumar Kaushish
Holistic Mental Health, Wellness, Lifestyle, Psychotherapist, Coach helping navigate Personal, Professional, Family Challenges and Enjoy Meaningful - Purposeful - Successful Business, Life, Career, Personality, Relations
Before the advent of writing and during the long hunter-gatherer period, humans lived in fairly small groups of 20 to 100 people who knew each other well. Gathering data on those around them, particularly their emotional state, was necessary and normal, as one can glean from humanity’s empathic abilities and the varied ways in which faces and bodies communicate internal lives to others. It might not have been recorded on USB drives, but the most intimate details would have been the subject of gossip and observation within the whole group with which humans lived. It would have been vital to know about others’ abilities, health, likes and dislikes, and kinship relations. All that shared data would now have to be called something like “distributed Big Data”.
Then came large agricultural hierarchies and their need to control populations, leading to systems of recording. The Sumerian script is the oldest known system of writing, going back more than 5,000 years, and one of its key uses was to keep track of the trades and taxes of those early kingdoms: the business of gathering taxes needed records of who had paid how much and who still owed how much. One might see the hundreds of thousands of early clay tablets of the Sumerian accountants as the first instance of “Big Data”: systematic information gathered to control and manipulate a population.
Some 4,000 years ago, in both Egypt and China, the first population censuses were held, recording who lived where and how much wealth they had, with the express purpose of supporting the tax ambitions of the courts of those days. A census was the way to know how much individuals, households, villages, and whole regions could be taxed, both in terms of produce and labor time. The key initial use of Big Data was simply to force individuals into paying taxes. The use of a census to measure and tax a population has stayed with humanity ever since, including the regular censuses of the Romans, the Domesday Book compiled in England in 1086 on the orders of William the Conqueror, and the censuses still held in many countries today. Modern countries that do not hold a census usually have permanent population records, an even more sophisticated form of Big Data.
The Bible illustrates these early and still dominant uses of Big Data: the book of Genesis lists the genealogy of the tribe, important for matters of intermarriage and kinship claims; and the book of Exodus mentions the use of a population census to support the tabernacle (the portable sanctuary).
Courts and governments were not the only gatherers of Big Data with an interest in recording and controlling the population. Organized religion and many secular organizations collected their own data. Medieval churches in Europe collected information on births, christenings, marriages, wills, and deaths. Partly this was in order to keep track of the daily business of a church, but it also served the purposes of taxation: the accounts were a means of counting the faithful and their wealth. Medieval universities also kept records, for instance of who had earned what qualification, because that is what they sold and they needed to keep track of their sales. Like churches, universities also had internal administrations where they kept track of their possessions, loans, debts, “the academic community”, teaching material, etc.
With the advent of large corporations came totally different data, connected to the need to manage long-run relations with many employees: records on the entitlements and behavior of employees, alongside identifying information (where they could be found, next of kin, etc.). These records were kept to allow the smooth operation of organizations and were subsequently used as the basis of income taxation by governments, a good example of how the Big Data gathered by one entity (firms) comes to be used by another (a tax authority) for totally different purposes.
What has been said above can be repeated for many other large organizations throughout the ages: they kept track of the key information they needed to function. Traders needed to keep track of their clients and suppliers. Hospitals and doctors needed to keep track of ailments and individual prescriptions. Inns needed to keep track of their guests. Towns needed to keep track of their rights versus other authorities. Ideologies needed to keep track of actual and potential supporters, etc. There is hence nothing unusual about keeping records on individuals and their inner lives, without their consent, for the purposes of manipulation. One might even say that nothing on the Internet is as invasive as the monitoring that is likely to have been around before the advent of writing, nor is anything on the Internet more manipulative than the monitoring by large empires that pressed their populations into taxes, wars, and large projects (like building the pyramids). Big Data is thus ancient. What is new is that far more of it is now run and owned by parties other than governments, and that the capacity to collect, classify, analyze, and store it is incomparably stronger, owing to the recent rise in computing power and the rapid development of computer science.
In the present day, governments are still large producers and consumers of Big Data, usually without the consent of the population. The individual records are kept in different parts of the government, but in Western countries they usually include births, marriages, addresses, email addresses, fingerprints, criminal records, military service records, religion, ethnicity, kinship relations, incomes, key possessions (land, housing, companies), and of course tax records. What is gathered and which institution gathers the data varies by country: for example, in France the data is centrally gathered in a permanent population record and it is illegal to gather data on religion and ethnicity, whereas in the USA the various bits of data are gathered by different entities and there is no bar on measuring either religion or ethnicity.
Governments are also in the business of analyzing, monitoring, and manipulating our inner lives. This is a well-understood part of the social contract and of the socialization role of education, state media, military service, national festivities or national ceremonies: successful countries manage to pass on their history, values and loyalties to the next generation. Big Data combined with specific institutions surrounding education, information, taxation or the legal system is then used to mold inner lives and individuals’ identities. Consent in that process is ex post: once individuals are “responsible citizens” they can have some say about this in some countries, but even then only to a limited degree because opting out is often not an option.
In the Internet age, the types and volume of data are truly staggering, with data gathered and analyzed for many purposes, usually profit-motivated. The general objective is to get a consumer to click on a website, buy a service, sign some document, glance in some direction, vote some way, spend time on something, etc. The examples below illustrate the benefits and dangers.
Supermarket chains now gather regular scanner and card data on the sales to their customers. Partly in order to improve the accuracy of their data, they have loyalty programs where customers get discounts in exchange for private information that allows the supermarkets to link names and addresses to bank cards and other forms of payment. As a result, these companies have information on years of purchases by hundreds of millions of households. One use of that data has been to support “just-in-time” delivery to individual stores, reducing the need for each store to have large and expensive storerooms where stock is held, and so making products cheaper. That system requires supermarkets to predict fairly accurately what the level of sales will be for thousands of products in stock. This requires not merely good accounting of what is still in stock, but also good forecasting of future demand, which in turn requires sophisticated analysis of previous sales. Hence supermarkets can predict with high accuracy how much extra ice cream they will sell in which neighborhood if the weather gets warmer, and how many units of each product (SKU) they will sell at what discounted price. One might see this use of Big Data as positive and efficiency-improving.
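As a rough illustration of the kind of forecasting described above, the sketch below fits a straight line to past ice-cream sales against temperature and uses it to predict sales on a warmer day. It is a minimal example only: the figures are made up for illustration, the function name fit_line is hypothetical, and real retail forecasting would use far richer models and data (seasonality, promotions, store location, and so on).

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up illustrative history for one store: daily high temperature
# (degrees Celsius) and tubs of ice cream sold that day.
temps_c = [18, 20, 22, 24, 26, 28]
units_sold = [110, 130, 150, 170, 190, 210]

intercept, slope = fit_line(temps_c, units_sold)

# Predicted sales if tomorrow's high is 30 degrees Celsius.
forecast_30c = intercept + slope * 30
print(f"Extra tubs per extra degree: {slope:.1f}")
print(f"Forecast at 30 C: {forecast_30c:.0f} tubs")
```

The slope is the quantity of interest here: it estimates how many extra units are sold per extra degree, which is what a store would need to know in order to stock up ahead of a warm spell.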
Then there is the market for personalized advertising, also called behavioral targeting. On the basis of individuals’ Internet-observable history, which will often include their social communication on the Internet (including via their mobile phones), advertisers predict what advertising is most likely to work on them. Personalized advertising is then sold on a spot market, leading to personalized recommendations (based on one’s previous purchases), social recommendations (what similar people bought), and item recommendations (what the person just searched for). This advertising market is enormous and has grown fast. In 2017, the paid media market was reportedly worth over US$500 billion, and digital advertising some US$230 billion, according to industry estimates. The business model of many internet firms is to offer services for free to anyone in the world, funded by the ads attracted to the traffic on that site. The grand bargain of the Internet is thus free services in exchange for advertising. This is both well-understood and well-known, so one could say that this bargain is made under conditions of considered consent: users of free services (like Facebook) should know that the price of those services is that their personal information is sold for advertising purposes.
There is also a market for more invasive information, where access to goods and services is decided on the basis of that information. An old example from before the Internet was credit-worthiness information, which could be bought from banks and other brokers. This was of course important when it came to large purchases, such as buying a house or setting up a new business. A good modern example is personalized information on the use of online health apps. Individuals using free online health apps, which give feedback on, for instance, how far someone has run and where, are usually asked to consent to the sale of their information. That information is very useful to, for instance, health insurance companies interested in knowing how healthy someone’s behavior is. Those health insurance companies will look more favorably on someone who is known to have a fit body, who does not buy large volumes of cigarettes and alcohol online, and who has a generally considered and healthy lifestyle. It is thus commercially important for health insurance companies to buy such data, and not really an option to ignore it.
This example also shows the ambiguity involved in both consent and the option of staying “off the grid”: it is unlikely that everyone using health apps realizes the potential uses of the data they are handing over, and it is not realistic to expect them to wade through hundreds of pages of detailed consent forms in which all the potential uses would be spelled out. Moreover, someone who purposefully stays “off the grid”, and either actively hides their online behavior via specialized software or is truly not online at all, will still be affected by health profiling activities for the very reason that there is then no profile of them. To a health insurance company, the lack of information is also informative and is likely to be read as a sign that the person has something to hide. Hence even someone who is actively concerned with leaving no digital footprints, and about whom very little data exists online, will be affected without their consent by the activities of others.
Privacy is very difficult to maintain on the Internet because nearly all large internet-site providers use a variety of ways to identify who accesses their websites and what their likely interests are. Websites use cookies, JavaScript, browser fingerprinting, behavioral tracking, and other means to know, the moment a person clicks on a website, who that person is and what they might want. What helps these websites is the near-uniqueness of the information that a website receives when it is accessed: the IP address, the browser settings, the recent search history, the versions of the programs used, and the presence of a variety of added software and hardware (Flash, JavaScript, cameras, etc.). From that information, internet sites can usually tell who has accessed them, and this can then be matched to previous information kept on that IP address, which is bought and sold in a market. Only very Internet-literate individuals can hope to remain anonymous.
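To make the idea of “near-uniqueness” concrete, the sketch below shows one simple way a set of browser attributes could be combined into a single identifier. It is a minimal illustration under stated assumptions, not a description of how any particular website or tracking product works: the function name fingerprint, the attribute names, and the sample values are all hypothetical.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Combine browser attributes into a short, stable identifier.

    The attributes are serialized with sorted keys so that the same
    set of values always yields the same identifier, regardless of
    the order in which they were collected.
    """
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical attributes a site might observe on two visits by the
# same browser (same values, collected in a different order).
visit_1 = {
    "user_agent": "ExampleBrowser/1.0",
    "language": "en-GB",
    "screen": "1920x1080",
    "timezone": "Europe/London",
    "plugins": ["pdf-viewer"],
}
visit_2 = {
    "timezone": "Europe/London",
    "plugins": ["pdf-viewer"],
    "screen": "1920x1080",
    "language": "en-GB",
    "user_agent": "ExampleBrowser/1.0",
}

# A different browser that differs in a single setting.
other_visitor = dict(visit_1, language="fr-FR")

print(fingerprint(visit_1) == fingerprint(visit_2))        # same browser
print(fingerprint(visit_1) == fingerprint(other_visitor))  # different browser
```

The point of the example is that no single attribute identifies a person, but the combination of many ordinary settings can be distinctive enough to recognize a returning visitor even without cookies.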
The fact that the main use of Big Data on the Internet is to aid advertising should also be somewhat reassuring for those who fear the worst about Big Data: because the advertising market is worth so much, large internet companies are careful not to sell their data for purposes that the population would strongly disapprove of, whether those purposes are legal or not. It is, for instance, not in the interest of eBay, Apple, Google, or Samsung to sell information about the porn-viewing habits of their customers to potential employers and romantic partners. These uses are certainly worth something, and on the “Dark Web” (parts of the Internet reachable only with special software) such information can (reportedly) indeed be bought and sold, but for the “legitimate” part of the market, there is just too much to lose.
Mood analysis is very old, with consumer and producer sentiment recorded in many countries since the 1950s because it predicts economic cycles well. However, the analysis of individual and aggregate well-being is starting to take off as more modern forms of mood analysis develop. These include counting the positive and negative affect of words used in books or other written documents (e.g. Linguistic Inquiry and Word Count); analysis of words used in Twitter feeds, Facebook posts, and other social media data through more or less sophisticated models of sentiment analysis; and outright opinion and election polling using a variety of tools (mobile phones, websites, apps). New technologies include Artificial Intelligence analysis of visual, olfactory, sensory, and auditory information. They also include trackable devices and cards that follow individuals around for large parts of the day and sometimes even 24/7, such as Fitbits, mobile phones, or credit cards.
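The word-counting approach mentioned above can be sketched in a few lines. The example below is a minimal illustration only: the two word lists are a tiny hypothetical lexicon invented for this sketch (they are not the Linguistic Inquiry and Word Count dictionaries), and the function name affect_score is likewise an assumption.

```python
import re

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"good", "happy", "great", "love", "wonderful"}
NEGATIVE = {"bad", "sad", "terrible", "hate", "awful"}


def affect_score(text: str) -> float:
    """Return a score between -1.0 (all negative) and 1.0 (all positive).

    The score is (positive - negative) / (positive + negative), counting
    only words that appear in the lexicon. Text with no lexicon words
    scores 0.0.
    """
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total


for post in (
    "I love this wonderful day",
    "What a terrible, awful commute",
    "The train left at noon",
):
    print(f"{affect_score(post):+.1f}  {post}")
```

Simple counts like this ignore negation, irony, and context ("not good" would count as positive), which is one reason the more sophisticated sentiment models mentioned above exist.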
One may wonder whether “Big Data” can improve well-being predictions and help solve economic problems.