Smart Shift: The information fire hydrant
Come, let us build ourselves a city, and a tower whose top is in the heavens. Genesis 11:4, The Tower of Babel
There's certainly a certain degree of uncertainty about, of that we can be quite sure. Rowan Atkinson, Sir Marcus Browning MP
As well as being a mathematician, Lewis Fry Richardson was a Quaker and a pacifist. He chose to be a conscientious objector during the First World War: while this meant that he could not work directly in academia, he nonetheless continued to study at its fringes. As well as creating models of the weather and of the causes of armed conflict, he studied the lengths of borders and coastlines, and noticed something awkward: the length you measure depends on the length of the ruler you use. The shorter the ruler, the more wiggles it picks up, and the longer the coastline appears to be.
The discomfiting nature of the phenomenon, which became known as the coastline paradox, was picked up by fractal pioneer Benoit Mandelbrot in 1967. In his paper ‘How Long Is the Coast of Britain?’ he wrote, “Geographical curves can be considered as superpositions of features of widely scattered characteristic size; as ever finer features are taken account of, the measured total length increases, and there is usually no clearcut gap between the realm of geography and details with which geography need not be concerned.” In other words, it was not only the measured distance that was in question: the phenomenon also cast into doubt what the geological features themselves actually meant. Was a rocky outcrop part of the coastline or not? How about a large boulder? Or a grain of sand?
This same phenomenon is fundamental to our understanding of what we have come to call data, in all of its complexity. Data (the plural of datum) can be created by anything that can generate computer bits, which these days means even the lowliest of computer chips. Anything can be converted into a digital representation: capture some key information, digitise it into data points, and transport it from one place to another using a generally accepted binary format. Whenever we write a message or make use of a sensor, we are feeding the mother of all analogue-to-digital converters. Digital cameras, voice recorders, computer keyboards, home sensors, sports watches and, well, you name it: all can and do generate data.
As a consequence, we are creating data far faster than we know what to do with it. Consider: at the turn of the millennium, 75% of all the information in the world was still in analogue format, stored as books, videotapes and images. According to a study conducted in 2007, however, 94% of all information in the world was digital, and the total amount of stored information was measured at 295 Exabytes (billions of Gigabytes). This enormous growth shows no sign of abating. By 2010 the figure had crossed the Zettabyte (thousand Exabyte) barrier, and by 2020, it is estimated, it will have increased fifty-fold.
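To put these units in perspective, here is a back-of-the-envelope sketch in Python, taking the figures quoted above at face value; it shows both the scale involved and the annual growth rate that a fifty-fold rise over a decade implies:

```python
# Rough scale of the figures quoted above (taken at face value).
EXABYTE = 10**18    # bytes
ZETTABYTE = 10**21  # bytes, i.e. a thousand Exabytes

stored_2007 = 295 * EXABYTE      # total stored information measured by the 2007 study
digital_share_2007 = 0.94        # 94% of it already digital

print(f"2007 total: {stored_2007 / ZETTABYTE:.2f} ZB")
print(f"...of which digital: {stored_2007 * digital_share_2007 / ZETTABYTE:.2f} ZB")

# A fifty-fold increase over the decade from 2010 to 2020 implies
# a compound annual growth rate of roughly 48% per year.
growth = 50 ** (1 / 10) - 1
print(f"Implied annual growth: {growth:.0%}")
```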
As so often, the simplest concepts have the broadest impact: no restriction has been placed on what data can be about, within the bounds of philosophical reason. The information pile keeps growing because we can (and we do) broadcast our every movement, every purchase and every interaction with our mobile devices and on social networks. Every search, every ‘like’, every journey, photo and video is logged, stored and rendered immediately accessible using computational techniques that would have been infeasible just a few years ago. Today, YouTube users upload an hour of video every second and watch over 3 billion hours of video a month; over 140 million tweets are sent every day on average, or a billion per week.
It’s not just us: retailers and other industries are generating staggering amounts of data as well. Supermarket giant Wal-Mart handles over a million customer transactions every hour. Banks are little more than transaction processors, with each chip card payment we make leaving a trail of zeroes and ones, all of which are processed. Internet service providers and, indeed, governments are capturing every packet we send and receive, copying it for posterity and, rightly or wrongly, for future analysis. Companies of all shapes and sizes are accumulating unprecedented quantities of information about their customers, products and markets. And science is one of the worst culprits: the ALICE experiment at CERN’s Large Hadron Collider generates data at a rate of 1.2 Gigabytes per second. Per second!
Our ability to create data is increasing in direct relation to our ability to create ever more sensitive digitisation mechanisms. The first commercially available digital cameras, for example, could capture images of up to a million pixels, whereas today 20 or even 40 ‘megapixels’ is not uncommon as standard. In a parallel to Richardson’s coastline paradox, it seems that the better we get at collecting data, the more data we get. Marketers have the notion of a ‘customer profile’, for example: at a high level, this could consist of your name and address, your age, perhaps whether you are married, and so on. More detail can always be added, in principle improving the understanding of who you are. The trouble is, nobody knows where to stop: is your blood type relevant, or whether you have siblings? Such questions are a challenge not only for companies who would love to know more about you, but also (as we shall see) because of the privacy concerns they raise.
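To make the point concrete, here is a minimal sketch of such a profile as a data structure; the field names are purely illustrative, not drawn from any real marketing system, but they show how easily ‘just one more detail’ can be bolted on:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CustomerProfile:
    # The high-level profile a marketer starts with
    name: str
    address: str
    age: Optional[int] = None
    married: Optional[bool] = None
    # More detail can always be bolted on -- but is it relevant,
    # and should it be collected at all?
    extra: dict = field(default_factory=dict)

profile = CustomerProfile(name="Jane Doe", address="1 High Street", age=42)
profile.extra["blood_type"] = "O+"   # relevant? a privacy concern?
profile.extra["siblings"] = 2        # where does it stop?
print(profile)
```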
Industry pundits have, in characteristic style, labelled the challenges caused by creating so much data as ‘Big Data’ (as in, “We have a data problem. And it’s big.”). It’s not just data volumes that are the problem, they say, but also the rate at which new data is created (its ‘velocity’) and the speed at which data changes (its ‘variance’). Data is also sensitive to quality issues (‘validity’) — indeed, it’s a running joke that the customer data held by utilities companies is so poor that the organisations are effectively self-regulating — and it has a sell-by date, a point beyond which it is no longer useful except historically. When we create information from data, it often comes with a best-before time limit, beyond which it no longer makes sense to be informed. This is as true for the screen taps that make up a WhatsApp message as for a complex medical diagnosis.
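The best-before idea can be made concrete too: store each data point with a timestamp, and simply stop treating it as current information once a validity window has passed. A rough sketch, with an entirely arbitrary 24-hour window chosen for illustration:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

VALIDITY_WINDOW = timedelta(hours=24)   # arbitrary 'best-before' period, purely for illustration

def is_still_informative(recorded_at: datetime, now: Optional[datetime] = None) -> bool:
    """True while the data point is within its best-before window."""
    now = now or datetime.now(timezone.utc)
    return now - recorded_at <= VALIDITY_WINDOW

reading_time = datetime.now(timezone.utc) - timedelta(hours=30)   # a 30-hour-old sensor reading
print(is_still_informative(reading_time))   # False: now useful only as history
```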
All of these criteria make it incredibly difficult to keep up with the data we are generating. Indeed, our ability to process data will, mathematically, always lag behind our ability to create it. And it’s not just the raw data we need to worry about. Computers don’t help matters: they have a habit of creating duplicates, or whole new versions, of data sets. Efforts have been made to reduce this duplication, but it often exists for architectural reasons — you need to create a snapshot of live data so that you can analyse it. It’s a good job we have enough space to store it all. Or do we? To dip back into history, data storage devices have, until recently, remained one of the most archaic parts of the computer architecture, reliant as they have been upon spinning disks of magnetic material. IBM shipped the first disk drives in 1956: these RAMAC drives could store a then-impressive four million bytes of information across their fifty disk platters, but had to be used in clean environments so that dust didn’t interfere with their operation. It wasn’t until 1973 that IBM released a drive, codenamed Winchester, that incorporated the read/write heads in a sealed, removable enclosure.
Despite their smaller size, modern hard disks have not changed a great deal since this original sealed design was first proposed. Hard drive capacity increased 50 million times between 1956 and 2013, but even this is significantly behind the curve when compared to processor speeds, leading pundits such as analyst firm IDC to go to the surprising length of suggesting that the world would “run out of storage” (funnily enough, it hasn’t). In principle, the gap could close with the advent of solid state storage — the same stuff that is a familiar element of the SD cards we use in digital cameras and USB sticks. Solid State Drives (SSDs) are currently more expensive, byte for byte, than spinning disks, but (thanks to Moore’s Law) the gap is closing. What has taken solid state storage so long? It’s all to do with transistor counts: processing a bit of information requires a single transistor, whereas storing the same bit of information for any length of time requires six. But as SSDs become more widely available, their prices fall, and some kind of parity with processors starts to appear. SSDs may eventually replace spinning disks, but even if they do, the challenge of coping with the data we create will remain. The issue is writ large in the Internet of Things — as we have seen, the propensity of Moore’s Law to spawn smaller, cheaper, lower-power devices that can generate even more data. Should we add sensors to our garage doors and vacuum cleaners, hospital beds and vehicles, we will inevitably increase the amount of information we create. Networking company Cisco estimated that the ‘Internet of Everything’ would cause a fourfold increase in the five years from 2013, to reach over 400 Zettabytes (a Zettabyte being 10^21 bytes).
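Those growth figures are easier to grasp as annual rates. Taking the numbers quoted above at face value, a quick calculation gives the implied compound growth per year:

```python
def annual_growth(factor: float, years: int) -> float:
    """Compound annual growth rate implied by an overall growth factor."""
    return factor ** (1 / years) - 1

# Hard drive capacity: 50 million times larger between 1956 and 2013
print(f"Disk capacity: {annual_growth(50_000_000, 2013 - 1956):.1%} per year")  # roughly 36% a year

# Cisco's 'Internet of Everything' estimate: fourfold in the five years from 2013
print(f"IoT data: {annual_growth(4, 5):.1%} per year")                          # roughly 32% a year
```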
In technology’s defence, data management has been a priority almost since the industry began.
The notion of creating data stores and using them to generate reports became a mainstay of commercial computing. As the processing capabilities of computers became more powerful, the reports could in turn become more complicated. IBM’s own language for writing reports was the aptly named Report Program Generator, or RPG. Originally launched in 1961, RPG is still in use today, making it one of the most resilient programming languages of the information age. IBM wasn’t the only game in town: while it took the lion’s share of the hardware market, it wasn’t long before a variety of technology companies, commercial businesses (notably American Airlines with its SABRE booking system) and smaller computer services companies started to write programs of their own. Notable were the efforts of Charles Bachman, who developed what he termed the Integrated Data Store (IDS) while working at General Electric in 1963. IDS was the primary input to the Conference on Data Systems Languages’ (CODASYL) efforts to standardise how data stores should be accessed; by 1968 the term database had been adopted.
And then, in 1969, dogged by the US government over antitrust, IBM chose to break the link between hardware and software sales, opening the door to competition from a still-nascent software industry. All the same, it was another IBM luminary, this time the Englishman Edgar Codd, who proposed another model for databases, based on tables and the relationships between data items. By the 1980s this relational database model, and the Structured Query Language (SQL) used to access it, had become the mechanism of choice, and it remained so for several decades afterwards for all but mainframe software, where (despite a number of competitors appearing over the years) IBM’s earlier models still dominated.
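Codd’s idea is easy to demonstrate with any SQL database. The sketch below uses Python’s built-in sqlite3 module and an invented two-table schema, purely for illustration, to show tables, a relationship between them, and a declarative query that joins the two:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- the relationship between the tables
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Edgar');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 30.00), (3, 2, 12.50);
""")

# A relational query: total spend per customer, expressed declaratively in SQL
for name, total in conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""):
    print(name, total)
```

The join is the relational part: customers and orders live in separate tables, and the relationship between them is expressed in the query rather than baked into the storage format.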
Of course it’s more complicated than that: database types proliferated across computers of every shape and size. But even as data management technologies evolved, technology’s propensity to generate still more data refused to abate. As volumes of data started to get out of hand once again in the 1990s, attention turned to the idea of data warehouses: data stores that could take a snapshot of data and hold it somewhere else, so that it could be interrogated, analysed, and used to generate ever more complex reports. For a while it looked like the analytical challenge had been addressed. But then, with the arrival of the Web, quickly followed by e-commerce, social networks, online video and the rest, new mechanisms were required yet again, as even SQL-based databases proved unable to keep up with the resulting explosion of data. Not least, the issue of how to search the ever-increasing volume of web pages was becoming more pressing. In response, Doug Cutting and Mike Cafarella developed an open source web search tool called Nutch, built around an indexing mechanism published by Google, called MapReduce, itself “a framework for processing embarrassingly parallel problems across huge datasets.” The pair quickly realised that the mechanism could be used to analyse the kinds of data more traditionally associated with relational databases, and created a specific tool for the job. Doug named it Hadoop, after his son’s toy elephant.
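At its heart, MapReduce asks the programmer for just two functions: a map step that turns each input record into key/value pairs, and a reduce step that combines all the values sharing a key, with the framework handling distribution across machines. The single-process word-count sketch below illustrates the programming model only; none of Hadoop’s actual machinery is involved:

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit (word, 1) for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(key, values):
    """Reduce: combine all the counts emitted for one word."""
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate pairs by key (the framework does this in Hadoop)
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, counts) for w, counts in grouped.items())
print(word_counts)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

Hadoop’s contribution was to run exactly this pattern reliably across thousands of machines, shuffling the intermediate pairs over the network between the map and reduce phases.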
Hadoop marked a breakthrough in how large volumes of data could be stored and queried. In 2009 the software managed to sort and index a petabyte of data in 16 hours, and 2015 was to be the year of ‘Hadooponomics’ (allegedly). The project inspired many others to create non-relational data management platforms: MongoDB, Redis, Apache Spark and Amazon Redshift are all clever and innovative variations on the same general trend, which is to create vast data stores that can be interrogated and analysed at incredible speed.
Even with such breakthroughs, our ability to store and manage data remains behind the curve of our capability to create it. Indeed, the original strategists behind the ill-fated Tower of Babel might not have felt completely out of place among present-day, large-scale attempts to deal with information. And so it will continue: it makes logical sense that we will carry on generating as much information as we can, and then insist on storing it. Medicine, business, advertising, farming, manufacturing… all of these domains and more are accumulating increasingly large quantities of data. But even if we can’t deal with it all, we can do increasingly clever things with the data we do have. Each day, the Law of Diminishing Thresholds ensures that a new set of problems, both very old and very new, moves from insoluble to solvable.
Doing so requires not just data processing, storage, management and reporting, but programs that push the processing capabilities of computers to their absolute limits. Enter: the algorithm.