A brief, brief history of Data Analytics
The first part of a series introducing the future of data analytics
Data has been in the news a lot recently, prompting a level of interest not seen before. In reality, though, data has been part of our culture for a very long time and has simply gone through a series of innovations and transformations during its long life.
These innovations are interesting because each revolution builds upon, but doesn’t make obsolete, what was there before, and so extends and expands the capabilities set by the earlier pioneers.
But what is new is that data has moved from the province of a small elite to being a mass-market concept, a process of democratisation that confers both benefit and risk. Unfortunately, this widening of awareness hasn’t gone hand in hand with a general understanding of what these innovations are, leading to distrust, naivety and cynicism.
The purpose of this paper is therefore to highlight this journey to a broader audience and to show what we can all learn from it.
2044 BCE: The invention of the database
We have been analysing data since the birth of civilisation. The earliest preserved records of writing are not ethereal things like poems, great speeches, love letters or novels, but rather examples of data analysis.
In Sumeria in ancient Iraq, scribes produced lists of ploughmen employed by the state and preserved this data on clay tablets making the first database in the process. These inscriptions also calculated their wages directly from this raw data, so the discipline of data analytics was born too.
The first database / analytic tool. Also looks a bit like an iPad that has been kicked around a bit
These scribes must have recognised the value of gathering and analysing this information and so preserved and treasured it, in what was likely then a very expensive process of storage.
Over the intervening years, technology allowed first papyrus, then paper to be used instead of clay, making the process of recording and storing easier and cheaper. The formulation of algebra and the decimal system in the 9th century CE (again in Iraq) also made both the calculations and the structure of data more efficient. But further innovations had to wait another millennium.
1960s CE: The invention of structured data and canned reports
This level of analytics remained largely unchanged until the computers of the 1950s, 60s and 70s could process larger amounts of data. The first databases evolved from what was previously stored in paper journals, allowing the breadth and depth of what was reported to be extended too.
Mainframe systems developed that allowed programmers to design reports that could be useful to the business. Thus the “canned report” was born. It was larger than what the Sumerians developed some 4,000 years previously, but to an extent still as inflexible as before. If you did not like the report, too bad.
A canned AS/400 report: like a tin of beans, but with data instead
These static reports then developed basic search and customisation, but really they were still tied to what the programmer thought was useful to present.
In the early days, the data behind the report was often tied to the canned report itself, so the breadth of data was not really much bigger than what the analyst could see on the screen. So the data behind your country report, which listed some information about countries, contained only country information and nothing else. This was fine with a small number of reports, but as these proliferated the number of data sets did too, which introduced a problem of maintenance (for many organisations this is still a problem 50 years later!).
Instead of one single list, data was stored in an efficient network of tables
To address this, these pockets of data were brought together into larger structures, and overlaps, duplicates and inconsistencies were ironed out in a process called “normalisation”. This immediately introduced tension between individualisation and standardisation (again still present today) but generally the need for nicely managed and ordered databases that allowed people to store more and more prevailed.
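To make the idea of normalisation a little more concrete, here is a minimal sketch in Python. The two report data sets, the field names and the `normalise` helper are all hypothetical, invented purely for illustration. Each report starts with its own copy of the country details; the sketch pulls those copies into one shared table and flags where they disagree.

```python
# Hypothetical report data sets that each keep their own copy of country details.
sales_report = [
    {"country": "France", "currency": "EUR", "sales": 120},
    {"country": "Japan", "currency": "JPY", "sales": 95},
]
staff_report = [
    {"country": "France", "currency": "EUR", "headcount": 40},
    {"country": "Japan", "currency": "YEN", "headcount": 25},  # inconsistent copy
]


def normalise(*reports):
    """Split repeated country details into one shared table plus slim fact tables."""
    countries = {}   # one row per country: the single shared copy
    facts = []       # per-report rows that now only reference the country by name
    conflicts = []   # inconsistencies found while merging the copies
    for report in reports:
        slim_rows = []
        for row in report:
            name, currency = row["country"], row["currency"]
            stored = countries.setdefault(name, {"currency": currency})
            if stored["currency"] != currency:
                conflicts.append((name, stored["currency"], currency))
            slim_rows.append({k: v for k, v in row.items() if k != "currency"})
        facts.append(slim_rows)
    return countries, facts, conflicts


countries, (sales, staff), conflicts = normalise(sales_report, staff_report)
print(countries)
print(conflicts)
```

In this toy case the shared table ends up with one currency per country, and the mismatch between “JPY” and “YEN” is reported rather than silently kept. That mismatch is exactly the kind of inconsistency normalisation is meant to iron out.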
As storage increased, interrogating it became harder, and with a growing number of database vendors a consistent way of doing so was needed. Thus a logical language called SQL (Structured Query Language, often pronounced “SEE-KWEL”) was designed in the 1970s. Like any language, proficiency was required to get the right results, and thus the formal discipline of the data analyst was formed.
SQL: The language of the gods
Unfortunately, as databases became larger and more complex, querying them became ever more prone to error, harder to verify and slower to perform.
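For readers who have never seen SQL, the following is a small sketch of what such a query looks like, run from Python against an in-memory SQLite database. The table names, column names and figures are assumptions made up for this example; only the general shape of the query matters.

```python
import sqlite3

# An in-memory database with two small, normalised tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (country_id INTEGER REFERENCES country(id), amount REAL);
    INSERT INTO country VALUES (1, 'France'), (2, 'Japan');
    INSERT INTO sale VALUES (1, 100.0), (1, 20.0), (2, 95.0);
""")

# Join the two tables back together and total the sales per country.
query = """
    SELECT country.name, SUM(sale.amount) AS total
    FROM sale
    JOIN country ON country.id = sale.country_id
    GROUP BY country.name
    ORDER BY country.name
"""
totals = conn.execute(query).fetchall()
print(totals)
conn.close()
```

Even in this tiny case the analyst has to know which tables exist and how they join. With hundreds of tables, it is easy to see how mistakes creep in.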
1990s: The invention of Business Intelligence
Stacking cubes of data
As personal computers replaced mainframes, the main drawback of the canned report became apparent. Business users wanted to look at data how they wanted to see it, not how the programmer defined it. The visual capabilities of MS-Windows also raised expectations of how data could be presented.
They also wanted the bigger picture, and aggregating data across divisions of a company to the group level became a challenge not just in presentation but in physically joining the data together.
And at the same time, rather than always looking at a flat, one-dimensional view of data (say, defined per country), they wanted to see data at a more fine-grained level (say, per region or town). This “drill up and down” was the first main innovation of Business Intelligence.
But the requirements kept on coming:
- Wouldn’t it be better if we could see how something changed over time?
- Or, even better, how something changed over time for a specific country?
- Let’s now take that to individual cities…
- But scrub that, let’s slice instead across all cities but for the same period, and compare how people who are called John do things over time.
- And so on, and so on…
The trouble with this sort of flexibility is that the nicely normalised data for the canned reports couldn’t keep up, so a whole new process of “denormalisation”, building the data around how someone wanted to see it, was born. Data became structured not in tables, but in “cubes” of data, with different dimensions covering time, product type, geography and so on.
Slice and dice: Data became 3D – now how do you store that on a clay tablet?
This really extended the data analytics discipline, as it required both the technical skills to run SQL and some art and insight to choose the dimensions and measures that would be useful in the cube.
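The drill-down and slice-and-dice ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fact rows, the dimension names and the `aggregate` helper are hypothetical, and a real cube would typically pre-compute its totals rather than scanning every row on each request as this toy version does.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, country, city, product, sales).
facts = [
    (2019, "UK", "London", "Widgets", 10),
    (2019, "UK", "Leeds", "Widgets", 4),
    (2020, "UK", "London", "Widgets", 12),
    (2020, "UK", "London", "Gadgets", 7),
    (2020, "France", "Paris", "Widgets", 9),
]
DIMENSIONS = ("year", "country", "city", "product")


def aggregate(rows, group_by, where=None):
    """Sum the sales measure over the chosen dimensions, optionally slicing first."""
    where = where or {}
    totals = defaultdict(int)
    for row in rows:
        record = dict(zip(DIMENSIONS, row[:4]))
        if any(record[dim] != value for dim, value in where.items()):
            continue  # row falls outside the requested slice
        key = tuple(record[dim] for dim in group_by)
        totals[key] += row[4]
    return dict(totals)


by_country = aggregate(facts, ["country"])                          # rolled up
uk_by_city = aggregate(facts, ["city"], where={"country": "UK"})    # drill down
uk_over_time = aggregate(facts, ["year"], where={"country": "UK"})  # slice by time
print(by_country, uk_by_city, uk_over_time)
```

The “art” mentioned above lies in the first few lines: whoever chooses the dimensions and the measure decides which questions the cube can answer at all.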
A data dashboard to fly to the corporate moon
And the outcome of this analysis could give historical insight into the data that wasn’t possible by looking at static, canned reports. Anomalies in trends could be explained by drilling into the detail of the data dynamically to pinpoint a problem or opportunity. Insight into patterns of behaviour, say profitability per region in a previous year, might give CEOs an advantage in planning the coming year. The data analyst helped guide this process by adding a commentary or narrative to the report to help executives interpret it.
Historical trend analysis: I see a pattern here…..
By joining these many views together, executives could now, instead of seeing just one canned report at a time, have a sophisticated dashboard of related (and unrelated) information, much of which could be interrogated dynamically. Like a pilot flying a plane, executives could use this dashboard to choose the right direction, take corrective action and see warning lights for any approaching (virtual) mountains.
Cognos: An early pioneer of the “Dashboard”
But the limitations of business intelligence soon became apparent: insights were limited by the dimensions chosen, as you could only slice and dice what was in the cube. To slice and dice more we needed either a bigger cube (which was very unwieldy) or lots of joined-up cubes (which was also unwieldy). But to remain competitive, executives began to demand more and more data in these dashboards.
Modern BI vendors (such as Tableau) offer integrated tools that address this somewhat. They allow easier data preparation: selecting where the data comes from, aligning and cleansing it where necessary, and then building reports dynamically from it.
Tableau: Self-service report builder
But irrespective of how easy it was to build the reports, the analysis was always retrospective, based on things in the past. Cubes of data could only present what was there; they couldn’t provide insight into what was yet to occur. Whilst the future could be inferred from where you had been, the data couldn’t explicitly state what wasn’t yet there.
But the biggest shock to business intelligence came suddenly in the 1990s with the arrival of an endless source of new data: the Internet. Data in clay tablets, data in structured tables, data in cubes… none of this was enough to handle it all. The concept of big data was born.
2010s: The invention of data visualisation and the expansion of data science
Big data – discovering new doors
Big data is an architecture that allows the storage and retrieval of extremely large amounts of unstructured data. The “unstructured” part probably allowed the biggest innovation, as in the past large data stores tended to require structure, which limited their scalability because the complexity of managing this standardisation increased in direct proportion to their size.
Open source tools such as Hadoop provided the engine to allow this to happen, by providing dedicated load, transformation and query capabilities with the principles of volume and unstructured data built in from the ground up.
Previously the cornerstone of data architecture was the acronym “ETL”, which stood for Extract, Transform, Load, in that order. The transform bit was always the most complex (as we needed the data in a structure first) and limited how much we could load. Hadoop reversed this. It was designed to allow stuff (and it could literally be anything) to be loaded first. And once it was ingested, well, we could worry about it later. Because the tools were open source, there was much more flexibility in the technology, which further helped with the scaling.
ETL vs ELT: What a difference a single letter makes
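The difference between the two orderings can be sketched in Python. This is a toy illustration, not how Hadoop itself is implemented: the raw inputs, the `transform` rule and the `etl` and `elt` functions are all hypothetical names chosen for this example.

```python
import json

# Hypothetical raw inputs: one well-formed record and one free-text line.
raw_inputs = ['{"customer": "A1", "amount": "42.50"}', "sensor 7 rebooted at 03:12"]


def transform(text):
    """Parse a raw line into a structured row; raises if the line does not fit."""
    record = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    return {"customer": record["customer"], "amount": float(record["amount"])}


def etl(inputs):
    """Extract, Transform, Load: only rows that fit the structure reach the store."""
    store = []
    for text in inputs:
        try:
            store.append(transform(text))
        except (ValueError, KeyError):
            pass  # anything that does not fit the structure is lost at load time
    return store


def elt(inputs):
    """Extract, Load, Transform: keep everything raw and structure it when queried."""
    store = list(inputs)  # load first, untouched

    def query():
        rows = []
        for text in store:
            try:
                rows.append(transform(text))
            except (ValueError, KeyError):
                continue  # still sitting in the store for some later use
        return rows

    return store, query


warehouse = etl(raw_inputs)
lake, query_lake = elt(raw_inputs)
print(len(warehouse), len(lake), query_lake())
```

Both approaches answer today’s question with the same single row, but only the load-first store still holds the free-text line, ready for a question nobody has thought to ask yet.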
In fact the unstructured nature of this data provided another boon, as its very raw, chaotic nature meant that, with the right processing, new correlations about the past (and the future) could be made. But more on that later.
In a similar timeframe the concept of allowing others to store your data, with services such as Microsoft Azure and Google’s cloud storage (“cloud” is a bit misleading as the data is still physically stored; it’s just that you don’t have to worry about it), worked to support this architecture, as all parts of this big (and virtual) data store could be accessed by any part of your organisation. These platforms could also run the open source tools, allowing a great amount of scalability.
As hosting data became cheaper and cheaper, the need to hold it efficiently became less important. We could hold more and more of it together in one place, and with the advent of cloud hosting we didn’t even need to store it ourselves. And unlike the drill down we had previously, which was limited to small bits of data, the scope of our analytics was (to human comprehension at least) infinite.
Data visualisation – opening new doors
The three-dimensional drill-down and slice-and-dice tools we had before couldn’t cope with this unbounded information. We needed a new way to comprehend and spot patterns in the endless sea. Data visualisation was born. This led to a proliferation of vendors, with many niche players running alongside the more traditional ones. The following are just a small sample of what is out there:
Cognos kept up, joining the data visualisation bandwagon
Visualisation showing several economic indices within recessionary cycles (this might be useful to know!)
Visualisation overlaying transport statistics over geographical areas
Visualisation showing links between BMI, obesity and communities within a city
Data Science – stepping beyond the door
Whilst the unstructured nature of big data provided an opportunity to support this visualisation, the cost was that these patterns were not always easy to spot and the work became increasingly mathematical. The role of the data analyst gave way to that of the data scientist.
Data science could start to do more with this data. By building a scenario or hypothesis that predicted a certain outcome from seemingly unrelated behaviour in the past, and then checking how it held up in reality using a small test set, a model could in theory be used to predict the future.
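As a rough sketch of that “build a hypothesis, then check it on a test set” loop, the Python below fits a straight line to some invented history and measures how well it predicts the months that were held back. The figures are hypothetical and deliberately tidy; real data would be far noisier, and real models far more elaborate than a straight line.

```python
# Hypothetical history: (advertising spend, units sold) per month.
history = [(1, 12), (2, 14), (3, 16), (4, 18), (5, 20), (6, 22), (7, 24), (8, 26)]
train, test = history[:6], history[6:]  # hold back the latest months as a test set


def fit_line(points):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)


def predict(x):
    """Apply the fitted hypothesis to a new input."""
    return slope * x + intercept


# How far off is the hypothesis on data it has never seen?
test_error = max(abs(predict(x) - y) for x, y in test)
print(slope, intercept, test_error)
```

If the error on the held-back months is small, the hypothesis earns some trust for forecasting. If it is large, it is back to the drawing board.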
From a commercial, social and political perspective this was like alchemy. But the trouble is which scenarios to build? With big data there is simply too much information to manage and too many combinations. Instead the data scientists programmed the parameters of learning algorithms and let the AI do the rest.
This process became very complex and opaque, and the models required a good understanding of both complex statistics and mathematical modelling. The previously generalist skills of a data analyst, who could see both the data and the business, became less desirable than those of someone who specialised only in advanced mathematics (we will find that this departure from the real world introduced problems of its own, which we will cover in our next paper).
A great example of how data science transformed a discipline is in financial crime. Big data allowed the unifying of previously separate data sets containing customer details, customer transactions, accounts and geography for the first time. Data scientists would then build models predicting patterns of behaviour that could be attributed to individuals, and by looking at the “shapes” of the analysis, machine learning could spot both patterns and deviations, highlighting potential criminal activity, all in real time.
Data visualisation then provided a good way for a skilled operative to check these potential alerts and take action as required. Quantexa is one of the leading vendors within this space and developed the concept of network link analysis, where patterns of linked behaviours could be visualised easily.
Network Link Analysis – fighting crime with big data
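To give a flavour of what linking behaviours means in practice, here is a minimal Python sketch that groups accounts sharing a phone number or address, directly or through a chain of other accounts. It is a generic connected-groups search over hypothetical records, not a description of any vendor’s actual method.

```python
from collections import defaultdict

# Hypothetical account records; shared phones or addresses become links.
accounts = {
    "acc1": {"phone": "555-0101", "address": "1 High St"},
    "acc2": {"phone": "555-0101", "address": "9 Low Rd"},
    "acc3": {"phone": "555-0199", "address": "9 Low Rd"},
    "acc4": {"phone": "555-0142", "address": "3 Mill Ln"},
}


def linked_groups(records):
    """Group accounts connected, directly or indirectly, by a shared attribute value."""
    by_value = defaultdict(set)
    for account, attributes in records.items():
        for field, value in attributes.items():
            by_value[(field, value)].add(account)

    # Two accounts are neighbours if they share any attribute value.
    neighbours = defaultdict(set)
    for members in by_value.values():
        for account in members:
            neighbours[account] |= members - {account}

    # Walk the links outwards from each unvisited account to find its group.
    groups, seen = [], set()
    for start in records:
        if start in seen:
            continue
        group, frontier = set(), [start]
        while frontier:
            account = frontier.pop()
            if account in group:
                continue
            group.add(account)
            frontier.extend(neighbours[account] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


groups = linked_groups(accounts)
print(groups)
```

Here the first and third accounts have nothing in common with each other, yet they land in the same group because the second account links them. That indirect connection is the sort of thing a person scanning separate lists would easily miss.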
2020s: Infinity and Beyond
So here we are in the current day. Even the internet isn’t enough data, and so the “Internet of Things” is emerging, adding to the big data mix every single bit of data from every single device connected to the web (which is pretty much everything).
This includes all transactions on everything we ever bought, every location that every person has ever visited, and the very genetic information that defines all of us. Pretty scary depending on your point of view.
So where next? Well, if I knew for sure what the next innovation was I certainly wouldn’t be writing it here, but taking it to the patent office and getting ready for my Zuckerberg lifestyle. But who knows: perhaps completely integrated reality/virtual reality, transference of organic life into wholly digital forms (and vice versa), non-deterministic outcomes for us as everything is predicted in real time. This is all very science fiction stuff, and why not?
But before we get to this weird new world, we need to fix the one we have. And unfortunately, despite some successes, data analytics is failing to live up to its hype in a big way, and to a greater extent than any of the innovations before it.
Why this is happening and what we can do about it will be the subject of future papers!
© Deryck Brailsford, September 2020.
Associate Partner at Carbon8 Consulting