Entity Relationships To Event Streams
Historically, most data was “operational data” sitting neatly in well-structured databases. Now most data is “user data” that lives in unstructured buckets, ponds and lakes. The focus of the data industry has moved from process management to product personalisation, and from business reports to real-time analytics. This has changed the data industry yet again: the challenge of capturing this ever-increasing unstructured data is high, but the opportunity to create great value out of it is even higher.
Data is the very base of everything we do nowadays, and it is evolving faster than we expect. GAFA (Google, Amazon, Facebook, Apple) and all the other top companies are essentially banking on data; their daily business is shaped by monitoring every event a user generates. Data is being produced and consumed at an unprecedented scale, and that scale is expected to grow as smart cities, connected digital lifestyles, IoT and many other new technologies reach the general consumer base. This has led us into a “polyglot data sphere”, where data has a very wide definition and lives and moves in many shapes and sizes. Traditionally, data was kept for storage and reporting purposes. Now it creates wealth in real time and changes societies, with sophisticated algorithms running on real-time data generated by events from everyone and everything.
But to understand this transition, let's go back to 1970, when E. F. Codd's ideas were first published in the seminal paper “A Relational Model of Data for Large Shared Data Banks”. His ideas changed the way people thought about data stores. In his model the database's schema, or logical organisation, is disconnected from physical information storage, and this became the standard principle for database systems. It gave birth to structured data and the stable “schema on write” approach, meaning that every piece of data to be stored must have a structure that complies with rules pre-defined in the data store. People embraced systems that abstracted data storage and processing into a single unit called a relational database, such as MySQL, Oracle, SQL Server and so on. These database systems were made primarily for one thing: managing relationships among the entities of a system.
But the early adopters of relational database systems were banks and financial institutions, and their main concern was transactions. So, as the need of the time dictated, most relational databases ended up guaranteeing data consistency and integrity throughout the system, properties referred to as ACID (Atomicity, Consistency, Isolation, Durability). These properties were defined for transactional systems and ledgers, and since most relational databases were developed to support such systems, the terms relational database and transactional database became interchangeable. The main focus in relational databases was on each transaction and its relation to other transactions executing within the same environment. This limited the horizontal scalability of systems running on isolated nodes.
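As a rough illustration of what that guarantee means in practice, here is a minimal sketch of an atomic transfer between two accounts, using Python's built-in sqlite3 module purely as a stand-in (the accounts table and balances are invented): either both updates commit or neither does.

```python
import sqlite3

# Illustrative only: a hypothetical in-memory ledger with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(amount, src, dst):
    """Move money atomically: both UPDATEs succeed or the whole thing rolls back."""
    with conn:  # sqlite3 wraps this block in a transaction
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

transfer(30, "alice", "bob")
print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 70), ('bob', 80)] -- never a state where money has left one
# account but not arrived in the other
```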
To overcome the issue of horizontal scalability, many techniques were proposed, such as sharding and partitioning, but they came with limitations compared to the good old-fashioned single-node relational database. In light of this, Eric Brewer proposed a theorem that appeared in 1998, famously called the CAP theorem. It states that it is impossible for a distributed data store to simultaneously provide more than two of the three guarantees of Consistency, Availability and Partition tolerance.
Within relational databases there was one set of commands to create data and another to query it, often abstracted into a high-level language such as Structured Query Language (SQL), which came in handy for managing data. At the same time it restricted the scope of what could and could not be done with data: a strict, pre-defined set of rules for data to exist in the first place, hence a very authoritative ecosystem. It was also challenging because, with schema-on-write, you must do an extensive data modelling job and develop an uber-schema that covers all the datasets you care about. Then you must think about whether your schema will handle the new datasets that you will inevitably want to add later. Practically speaking, this gets expensive and complicated over time, and to some extent unmanageable.
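A minimal sketch of that rigidity, again with sqlite3 and an invented customers table: the schema must be declared before any row can exist, and a record carrying an attribute the schema does not know about is rejected at write time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema-on-write: the structure is fixed before any data arrives.
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT NOT NULL
    )
""")

# A row that matches the schema is accepted.
conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)",
             ("Ada", "ada@example.com"))

# A row with an extra, un-modelled attribute cannot be stored as-is;
# the schema itself would first have to be changed (ALTER TABLE ...).
try:
    conn.execute("INSERT INTO customers (name, email, twitter) VALUES (?, ?, ?)",
                 ("Grace", "grace@example.com", "@grace"))
except sqlite3.OperationalError as err:
    print("Rejected at write time:", err)  # table customers has no column named twitter
```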
As databases grew in size they remained good for day-to-day data management, but processing vast amounts of data for analytical purposes became difficult and less efficient. This created the need for the data warehouse. A data warehouse is a system that pulls together data from many different sources within an organisation for reporting and analysis; the reports created from complex queries within a data warehouse are used to make business decisions. Having said that, data warehouses were yet another repository for structured, filtered data that had already been processed for a specific purpose, so the constraints that came with databases carried over to data warehouses.
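To make the "complex queries for reporting" idea concrete, here is a small, hypothetical sketch in pandas of the warehouse pattern: detail rows pulled from two operational sources are joined and aggregated into a summary a decision maker could act on. The table and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical extracts pulled from two operational systems.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "store": ["north", "north", "south", "south"],
    "amount": [120.0, 80.0, 200.0, 40.0],
})
stores = pd.DataFrame({
    "store": ["north", "south"],
    "region": ["EU", "EU"],
})

# Warehouse-style reporting: join the sources, then aggregate into a business view.
report = (orders.merge(stores, on="store")
                .groupby(["region", "store"], as_index=False)["amount"]
                .sum()
                .rename(columns={"amount": "total_sales"}))
print(report)
```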
The early 2000s saw the internet boom. Everyone was online and the whole world was in the Dot-Com era. Google allowed people to ask any question and MySpace allowed people to become a new class of netizens. Then everything started to become smart: smartphones, smart cars, smart watches, smart home appliances, smart cities and so on. As smartness evolved, so did data. As the internet and the digitally connected ecosystem gained extreme popularity, relational databases and warehouses simply could not keep up with the flow of information demanded by users, or with the larger variety of data types this evolution produced. This led to the development of non-relational databases, often referred to as NoSQL. The acronym NoSQL was first used in 1998 by Carlo Strozzi while naming his lightweight, open-source “relational” database that did not use SQL. NoSQL databases helped to handle unfamiliar data quickly and avoided the rigidity of SQL by replacing “organised” storage with more flexibility. We had entered the age of semi-structured, poly-structured and unstructured data, which is the vast majority by volume. This changing nature of data made it more and more difficult to maintain the age-old yet stable “schema on write” approach. Well-structured relational data was about to meet its millennial counterpart, “NoSQL”.
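A rough sketch of the schema-less, document-style storage many NoSQL stores adopted (a plain Python list stands in for the store, and the records are invented): each record carries its own structure, so new attributes can appear without altering any schema, and consumers decide at read time which fields they care about.

```python
import json

# Two "documents" describing user events; each carries its own shape.
events = [
    {"user": "u1", "type": "page_view", "url": "/home", "ts": "2020-01-01T10:00:00Z"},
    {"user": "u2", "type": "purchase", "items": ["sku-42"], "total": 19.99,
     "ts": "2020-01-01T10:05:00Z"},
]

# No ALTER TABLE needed: a new attribute simply appears on new documents.
events.append({"user": "u3", "type": "page_view", "url": "/offers",
               "referrer": "newsletter", "ts": "2020-01-01T10:07:00Z"})

# Consumers decide at read time which fields they care about.
for doc in events:
    print(json.dumps(doc), "-> referrer:", doc.get("referrer", "n/a"))
```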
In light of all the issues surrounding relational database systems, a potential solution emerged in the mid-2000s: a new way of exploring data that we know as Hadoop, which gave birth to what we now call “schema on read”. A year after Google published its white paper describing the MapReduce framework (2004), Doug Cutting and Mike Cafarella created Apache Hadoop. In contrast to the tightly controlled data storage and processing of the old relational databases, Hadoop brought a new paradigm with two main parts, a data processing framework and a distributed filesystem for data storage, offering the flexibility needed to handle high volumes as well as parallel processing of the data.
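The processing half of that paradigm is easiest to see through the classic word-count example. The sketch below imitates the map, shuffle and reduce phases in plain Python, with no Hadoop cluster involved, just to show the shape of the computation that Hadoop distributes across nodes.

```python
from collections import defaultdict

lines = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map phase: each input line is turned into (key, value) pairs independently,
# which is what lets the framework spread the work across many nodes.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values belonging to the same key together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: collapse each key's values into a single result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
# {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```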
HDFS was designed as a scalable distributed file system to support thousands of nodes within a single cluster. With enough hardware, scaling to over 100 petabytes of raw storage capacity in one cluster can be achieved easily and quickly. You dump in your data and it sits there, all nice and cosy, until you want to do something with it, whether that's running an analysis on it within Hadoop or exporting a set of data to another tool and performing the analysis there. Over the years, Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a huge variety of tasks that all share the common theme of volume, variety and velocity of structured and unstructured data. Hadoop soon became the de facto framework for large-scale, data-intensive systems.
Scalable data storage systems opened the gates for organisations to ingest whatever they needed in order to move beyond basic transactions, and a new wave of data started to arrive in every organisation under the name of “events”. Events were associated loosely with everything, and the way events were reported kept changing with digital evolution: event data could be generated automatically by electronic devices, but it could also be reported by someone maintaining an Excel file of factory workers' overtime or writing an email to customer care. Banks were no longer the only big consumers of data. Insurance companies were now interested in people's day-to-day activities, gauging risk on a daily basis. Stock markets were listening to everyone and everything possible to learn market trends. Data started flowing left, right and centre. With so much data in play, businesses expected to aggregate and summarise it, basically refine it, so that they could make business decisions based on the information extracted from it. Those business decisions had to either increase efficiency or generate money. This gave rise to a new phrase, coined in 2006 by Clive Humby, the UK mathematician and architect of Tesco's Clubcard: “Data is the new oil. It's valuable, but if unrefined it cannot really be used.”
The catch phrase for the new era of high-volume data was “Big Data”. Most of the available data has been created in the last few years, but the term Big Data has been around since 2005, when it was launched by O'Reilly Media. However, the need to understand all available data has been around much longer. But what is “Big” about Big Data? The answer was first summarised by Doug Laney in a 2001 report for the META Group named “3D Data Management: Controlling Data Volume, Velocity and Variety”. He proposed the 3Vs, which have since been extended to 5 Vs. These “Vs” are a reasonable test of whether a Big Data approach is the right one to adopt for a new area of analysis. The Vs are (I will limit myself to the original 3Vs):
1) Volume: The size of the data. It is often very limiting to talk about data volume in any absolute sense: as technology marches forward, the numbers quickly become outdated, so it is better to think about volume in a relative sense instead. If the volume of data you are looking at is an order of magnitude or more larger than anything previously encountered in your industry, then you are probably dealing with Big Data. For some companies this might be tens of terabytes, for others hundreds of petabytes.
2) Velocity: The rate at which data is received, and has to be acted upon, is becoming much more real-time. While it is unlikely that full analysis must be completed in the same time period, delays in execution will inevitably limit the effectiveness of campaigns, restrict interventions or lead to sub-optimal processes. For example, a discount offer to a customer based on their location is less likely to be successful if they have already walked some distance past the store.
3) Variety: There are two aspects of variety to consider: syntax and semantics. In the past these determined the extent to which data could be reliably structured into a relational database and its content exposed for analysis. While modern ETL tools are very capable of dealing with data arriving in virtually any syntax, in the past they were less able to deal with semantically rich data such as free text. As a result, many organisations restricted the data coverage of their information management (IM) systems to a narrow range of data. Deferring the point at which this kind of rich data, which is often not fully understood by the business, must be modelled also has significant appeal and avoids costly and frustrating modelling mistakes. It follows that by being more inclusive and allowing greater model flexibility, additional value may be created; this is perhaps one of the major appeals of the Big Data approach.
While Big Data approaches gave a stable footing to modern data needs, there was a crucial missing piece: value. An extended Oracle white paper noted that “The commercial value of any new data sources must also be considered. Or, perhaps more appropriately, we must consider the extent to which the commercial value of the data can be predicted ahead of time so that ROI can be calculated and project budget acquired. ‘Value’ offers a particular challenge to IT in the current harsh economic climate.”
But creating value needed something fast and reliable: a real-time understanding of high volumes of data, especially once data sources had moved from traditional devices to everyone and everything around us. It was the time when we were looking at the dawn of IoT and edge computing, which extended digital data from humans to every possible physical thing associated with humans.
In contrast to conventional databases, which were built for persistence first, the new era of the digital tsunami needed systems for small events occurring in real time, systems that shorten the decision-making cycle and deepen a company's or segment's understanding of its end users, a phenomenon now called digital data streaming. A Digital Data Stream (DDS) is a continuous digital encoding and transmission of data describing a related class of events. The transmission, or flow, of these digital representations of events is a DDS, which may be human-generated (e.g., a tweet or an Instagram post) or machine-generated (e.g., a CO2 reading or a GPS location). DDSs allow managers to dissect events in real time, to shorten the decision cycle and to deepen their understanding of customers at the same time. Cities like San Francisco are employing this information for services such as SFPark, which communicates real-time information about available parking spaces, significantly reducing greenhouse gas emissions (and the frustration of searching for a spot). With that, the generations-old status quo of batch-processed business intelligence reporting was soon replaced by the new kid on the block, “real-time analytics”.
Within the last few years, very rapidly, the DDS became everyone's favourite. A DDS can capture, and thus represent, up to six basic elements describing an event. These elements are “primitives”, meaning that they cannot be described in terms of other elements or inferred from them. The primitives derive from what are commonly known as the 5W+H of narrative (who, what, when, where, why and how), discovered and rediscovered several times throughout history and originating in rhetoric as far back as the 2nd century BCE.
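One rough way to picture those primitives is to model a single DDS event as a small record carrying the 5W+H fields. The class and the example values below are hypothetical, intended only to show the shape such an event might take; note how the “why” primitive is often the one left empty.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class StreamEvent:
    """One event in a digital data stream, described by the 5W+H primitives."""
    who: str              # the entity the event is about
    what: str             # the action or measurement
    when: str             # timestamp of the event
    where: Optional[str]  # location, if detectable
    why: Optional[str]    # intent, often the hardest primitive to capture
    how: Optional[str]    # channel or mechanism that produced the event

# A machine-generated reading and a human-generated post share the same primitives.
sensor = StreamEvent(who="sensor-17", what="co2_ppm=412", when="2020-03-01T08:00:00Z",
                     where="51.5074,-0.1278", why=None, how="iot-gateway")
tweet = StreamEvent(who="@some_user", what="posted about parking availability",
                    when="2020-03-01T08:02:11Z", where="San Francisco",
                    why=None, how="mobile app")

for event in (sensor, tweet):
    print(asdict(event))
```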
Within the DDS community, streamability is determined by three characteristics: detectability, measurability and interpretability. Detectability is as the name suggests: it asks whether the event exceeds a minimum threshold to be persistently detected, triggering a stream. Measurability is the ability of the elements of an event to be quantified with the necessary precision. Interpretability reflects whether a firm can understand the content of one or more elements in a stream; this can be particularly difficult for the “why” element, a subjective quality that is hard to determine unobtrusively. These aspects of streamability depend on the nature of the event itself and the technology utilised. A good way to visualise all three characteristics is Coca-Cola's intelligent dispensers in shops and malls. Coca-Cola's sensor-enabled Freestyle fountain drink dispenser serves over 100 different flavours and gathers and reports consumption data for market analysis. Customers can choose from among these many flavours and drinks, and create and share (through a dedicated app) custom mixes. These new fountains are thus also a platform for experimenting with flavours without a commitment to major bottling and marketing investments, giving Coca-Cola an opportunity to tap into changes in beverage consumption as they occur and respond quickly.
DDS initiatives show that organisations extract value from events in a DDS via either process-to-actuate or assimilate-to-analyse tactics.
1. Process-To-Actuate occurs when a firm creates value by initiating action based on real-time DDS processing. An insurance company monitoring a weather forecast data stream and sending text messages to its customers in the area where hail is expected in the next 30 minutes illustrates the immediacy of process-to-actuate. The firm combines events that are currently streaming in a DDS (i.e., real-time location-specific short-term weather forecasts) and the results of a static database query and other contextual data in order to alert its potentially affected customers in a timely manner. The result is superior customer service and fewer insurance claims because customers have been able to garage their vehicles at the right time.
2. Assimilate-To-Analyse occurs when a firm extracts value by merging multiple data streams and static databases and dissecting the composite data set. The focus is on the extraction of insights rather than immediate action. To avoid the financial risks associated with planning errors, some firms have integrated external DDSs into their demand forecasting systems. For instance, Tesco and other retailers merge and analyse data from multiple digital data streams to estimate demand. Predictions are based on information generated from store location, product characteristics, recent weather history and weather forecasts. Note how the result of the analysis is not immediate automatic action, as in process-to-actuate, but rather the presentation of superior insight that enables better decision making. A minimal sketch contrasting the two tactics follows this list.
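The contrast between the two tactics can be sketched very simply: one callback acts immediately on each incoming event, while the other only accumulates events so they can later be merged with static data and analysed. Everything here, the hail-alert rule, the customer table and the thresholds, is invented for illustration.

```python
from collections import defaultdict

# Hypothetical static data: which customers live in which area.
customers_by_area = {"area-7": ["alice", "bob"], "area-9": ["carol"]}

alerts = []                      # side effects of process-to-actuate
history = defaultdict(list)      # accumulated events for assimilate-to-analyse

def process_to_actuate(event):
    """Act immediately on a single streaming event (e.g. send hail warnings)."""
    if event["type"] == "forecast" and event["hail_in_minutes"] <= 30:
        for customer in customers_by_area.get(event["area"], []):
            alerts.append(f"SMS to {customer}: hail expected soon in {event['area']}")

def assimilate(event):
    """Just store the event; insight comes later from analysing the merged history."""
    history[event["area"]].append(event)

stream = [
    {"type": "forecast", "area": "area-7", "hail_in_minutes": 25},
    {"type": "forecast", "area": "area-9", "hail_in_minutes": 120},
]

for event in stream:
    process_to_actuate(event)   # immediate action
    assimilate(event)           # defer analysis

print(alerts)                                          # two warnings for area-7
print({area: len(evts) for area, evts in history.items()})  # events kept for later analysis
```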
All the systems mentioned above have their own pros and cons, and it is not easy to say that any one of them meets all of today's requirements. It often ends up being a polyglot data sphere, where companies choose the right tool for the right job.
All these systems came into existence to solve the problems of their time. But one thing that revolutionised all of the above was cloud computing. The systems mentioned above are complex and need someone to maintain them; it takes a lot of technical knowledge even to install them properly. Cloud computing made them easy by provisioning them as a service. Nowadays you can ask for anything, from a relational database to HDFS to data streams, and each of them is available as a service on all the major cloud platforms. You only have to know the basics of these systems, click a few buttons, and your data system is built and deployed for you. With a few more clicks you can check the performance of the system, and by paying a little more you can fix performance issues. This has put huge power in everyone's hands. With data stores becoming easy to use, the real rush now is to harvest value out of the data. That is where machine learning and artificial intelligence (AI) are shaping every business, from the oceans to space. The resurging interest in machine learning and AI is due to the same factors that have made data mining and Bayesian analysis more popular than ever: growing volumes and varieties of available data, computational processing that is cheaper and more powerful, and affordable data storage. All of this means it is possible to quickly and automatically produce models that can analyse bigger, more complex data and deliver faster, more accurate results, even at a very large scale. And by building precise models, an organisation has a better chance of identifying profitable opportunities or avoiding unknown risks.
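As a toy illustration of that last step, turning accumulated event data into a predictive model, the snippet below fits a scikit-learn classifier on a handful of made-up features (session length and pages viewed) to predict whether a session ends in a purchase. The data and features are entirely hypothetical; managed ML services on the major clouds automate this kind of workflow at scale.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical features derived from event streams: [session_minutes, pages_viewed]
X = [[1, 2], [3, 5], [10, 12], [15, 20], [2, 1], [12, 18]]
y = [0, 0, 1, 1, 0, 1]  # 1 = the session ended in a purchase

model = LogisticRegression()
model.fit(X, y)

# Score a new session as its events arrive.
print(model.predict([[8, 9]]))         # predicted class
print(model.predict_proba([[8, 9]]))   # class probabilities
```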
References:
1. E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
2. D. Laney, “3D Data Management: Controlling Data Volume, Velocity and Variety” (META Group, 2001): https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
3. Oracle big data reference architecture white paper: https://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf
4. https://pdfs.semanticscholar.org/8cb6/c2711afd3e504400ee12d3b582cc06348b08.pdf