After Big Data

When Distributed File Systems came on the scene in the late noughties, everyone realised that something big was happening in the world of data management. Until Hadoop and MapReduce went mainstream, distributed file systems had been the preserve of expensive proprietary vendors like Teradata. Now the technology would be available to everyone via open-source software running on generic hardware. The subsequent decade, in which 64-bit computing, solid state drives and cloud computing also became ubiquitous, was momentous for the data industry, with seemingly endless change and disruption. The era of 'big data' had arrived, but what did it mean, and what does the future hold?

Big Data

Big data vendors in the early days sold their wares with 'the 3 Vs' - the volume, velocity, and variety of data. The attendant marketing hype resulted in almost every large organisation attempting some form of proof of concept. Unfortunately, most of these initial endeavours were failures - not because big data products were immature, but because people expected (and were perhaps led to expect) the functionality of a database from what were, at that stage, more or less just data storage products.

Benchmark tests showed that while data writes were faster on Hadoop/DFS, data read performance was slower than that of existing MPP-style databases, such as DBMS-X or Vertica, running on similar generic hardware.

This is easily explained: loading data into a database requires the data to be processed into defined structures, with specific types of data portioned out into appropriate storage containers, and with supporting metadata such as statistics and indexes generated at the same time. Loading data onto a DFS requires only distributing the files to the available disks. Without the extra pre-processing overhead, 'ingesting' data is always going to be quicker on a raw DFS.
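
The difference can be sketched in a few lines of Python (using pandas and pyarrow; the file paths, column names and schema below are illustrative assumptions, not taken from the original benchmarks):

    import shutil
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # 'Ingesting' to a raw DFS: simply copy the file onto distributed storage.
    # No parsing, typing, statistics or indexing happens at this point.
    shutil.copy("sales_2023.csv", "/dfs/raw/sales_2023.csv")

    # Loading to a database-style store: parse the data, enforce a schema, and
    # write a typed, columnar file whose footer carries per-column statistics.
    df = pd.read_csv("sales_2023.csv")
    schema = pa.schema([
        ("order_id", pa.int64()),
        ("sales_period", pa.string()),
        ("region", pa.string()),
        ("amount", pa.float64()),
    ])
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "/dfs/curated/sales_2023.parquet")

The extra work on the second path is exactly the pre-processing overhead that makes ingestion slower - and exactly what pays off later at read time.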

On the other hand, reading data using a simple pattern search required a full scan of all files on a DFS, while a database could use the supporting metadata to read only the relevant portions of the relevant files. Because databases place data carefully into small, efficient, subject-focused files, the files to be searched are smaller too, meaning that any type of read other than a full scan is likely to be faster on a database with parallelised data storage.
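
A minimal sketch of the read-path difference, continuing the assumed file layout from the previous example (pyarrow's dataset API stands in here for a parallel database's metadata-driven reader):

    import glob
    import pyarrow.dataset as ds

    # DFS-style pattern search: open and scan every raw file in full.
    matches = []
    for path in glob.glob("/dfs/raw/*.csv"):
        with open(path) as f:
            matches.extend(line for line in f if "EMEA" in line)

    # Database-style read: per-file and per-row-group column statistics let the
    # engine skip data that cannot possibly contain matching rows.
    dataset = ds.dataset("/dfs/curated/", format="parquet")
    emea_sales = dataset.to_table(filter=ds.field("region") == "EMEA")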

In fact, the range of compelling 'use cases' for DFS, in which it clearly performs much better overall than parallel databases, is generally limited to unstructured and semi-structured data items that databases struggle to cater for at all - items such as web pages, images, audio and video, documents, and so on.

Making Big Data Work for all Data

It is not enough to store data quickly, or flexibly, or infinitely. You must make it accessible - and not just to experts, but to ordinary business users. This means adding structure - a schema layer like the relational model - so that data is conceptually understandable, and interface standards like SQL so that users can work with it from a wide variety of tools. The native DFS had neither.
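
As a sketch of what that schema-plus-SQL layer looks like in practice, here is a hypothetical PySpark example (the paths and column names carry over from the earlier sketches and are assumptions, not a specific product's layout):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-over-dfs").getOrCreate()

    # Expose the curated files as a named, typed, relational view...
    sales = spark.read.parquet("/dfs/curated/")
    sales.createOrReplaceTempView("sales")

    # ...so that ordinary tools and users can query it with standard SQL.
    spark.sql("""
        SELECT region, SUM(amount) AS total_amount
        FROM sales
        GROUP BY region
    """).show()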

Then there is the basic task of maintaining data integrity. To ensure integrity, database management software enforces 'ACID' constraints on data transactions. A DFS natively has no data management software. Updating data on a DFS was even slower than reading it, and could easily lead to data corruption if more than one process was doing so at the same time.

Moving traditional relational data workloads to a DFS was increasingly viewed as unlikely to deliver many benefits, and 'Big Data' entered what Gartner's 'hype cycle' calls the Trough of Disillusionment. 'Data Lake' became increasingly synonymous with 'Data Swamp'.

The Data Lakehouse

Gradually, the penny dropped, and 'big data' vendors began to look back to relational database software to learn how they could best deliver business benefit from the opportunity DFS data storage provided. They did this by creating data management software that would sit on top of DFS storage systems to provide the kinds of functionality that users of databases had been taking for granted for years.

Data lake vendors, pumped with venture capital, began to develop augmentations that would level the playing field (a short sketch of a couple of these follows the list). Some of the innovations included:

  • Indexing - rather than searching through all of the files in a lake to find required values, secondary files are created that store the identity of the files containing specific values. A proprietary software engine knows to scan this index and to open only the files it indicates as containing the required pattern.
  • Compression - as self-describing data file formats like XML and JSON are significantly enlarged by markup, various strategies are employed to reduce the size of stored data, either by converting raw files to more efficient formats or by recording values by field rather than by record in columnar file formats like Parquet.
  • Partitioning - large files are broken into sub-units according to a directory structure based on common query criteria. For example, sales data could be broken out by period. This means that only a subset of the overall data is stored in each directory, and only the relevant files are read for each query.
  • Statistics - information recorded about the values in each file can help avoid unnecessary searching of that file. For example, if the maximum value of a field in a given file is X, and the query searches for records with values greater than X, then no search of that file is necessary.
  • Aggregation - while data is created and may be stored at the atomic level, many queries look only for aggregate values, such as sums, counts or maximums. Data from many files can be 'aggregated' into a separate file, which can be orders of magnitude smaller and quicker to read.
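
To make a couple of these concrete, here is a hedged PySpark sketch of partitioning and aggregation (the paths, table layout and column names are illustrative assumptions; real lakehouse engines implement this through their own table formats):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()
    sales = spark.read.parquet("/dfs/curated/")

    # Partitioning: break the data into a directory per period, so that a query
    # restricted to one period only reads the files in that directory.
    (sales.write
          .mode("overwrite")
          .partitionBy("sales_period")
          .parquet("/dfs/lake/sales"))

    # Aggregation: a pre-computed summary can be orders of magnitude smaller
    # and quicker to read than the atomic data it summarises.
    summary = (spark.read.parquet("/dfs/lake/sales")
                    .groupBy("sales_period", "region")
                    .agg(F.sum("amount").alias("total_amount")))
    summary.write.mode("overwrite").parquet("/dfs/lake/sales_summary")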

If all of the above sound familiar, they should - similar innovations were developed by relational database vendors and modellers in the 1980s and 1990s. Why were Big Data vendors spending billions to re-invent the wheel? Because by putting the functionality of the database on top of the flexible storage of the data lake, they were creating a new concept - the 'data lakehouse'.

Separating Storage and Compute

ACID constraints to ensure data integrity require that data modifications (create, update, delete) be centrally coordinated, but they make no such demands on read access. For those willing to risk catching a small amount of data in a transitory state, data can be read at any time, without impacting 'eventual' data integrity.

Data lakes often use open, self-describing storage formats like XML and JSON, allowing other applications easy access to data without having to request it from a DBMS - a situation often described as 'separation of storage and compute'. This makes DFS-based data stores ideal for processing real-time data, like video streams, and for feeding AI models to generate instant outputs.
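
A brief sketch of what this looks like from a consumer's point of view (the bucket, path and column names are made up for illustration; pyarrow stands in for any engine able to read the open format directly):

    import pyarrow.dataset as ds

    # No DBMS in the read path: the compute is whatever the consumer brings.
    events = ds.dataset("s3://example-data-lake/events/", format="parquet")

    # Here a simple local count; the same files could simultaneously be read by
    # Spark jobs, a lakehouse engine, or a pipeline feeding an AI model.
    event_types = events.to_table(columns=["event_type"]).column("event_type")
    print(event_types.value_counts())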

When bulk exporting data, files like these can simply be copied, without having to process the data in the files record by record through a DBMS. This allows for much faster replication and sharing of data.

Updating data in some open formats is tricky and slow, a fact data lakehouse products may address by simply replacing the entire data file when any change is made, rather than updating specific details within an existing file. With super cheap storage capacity, and bearing in mind that large files are broken up and distributed as much smaller files, and that only the fragment being changed needs to be replaced, this is not as inefficient as it first sounds.

It means that all data is always in an intact, readable state, and that slow updates are minimised. In addition, by retaining all older versions of file fragments, it is possible to roll back to any previous state, delivering a 'replay' or 'rewind' capability that is much faster and simpler than working with traditional database transaction logs.
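
The pattern can be illustrated with a toy copy-on-write sketch in Python (a simplification for illustration only, not the mechanism of any particular lakehouse table format; the manifest layout, paths and names are assumptions):

    import json
    import shutil
    from pathlib import Path

    LAKE = Path("/dfs/lake/sales")
    MANIFEST = LAKE / "_manifest.json"
    # Assumed manifest structure: {"current_version": 1,
    #                              "versions": {"1": {"fragment_a": "fragment_a.v1.parquet"}}}

    def commit_fragment(fragment: str, new_file: str) -> None:
        """Write a new version of one fragment alongside the old one and record
        it in the manifest; older versions stay on disk untouched."""
        manifest = json.loads(MANIFEST.read_text())
        version = manifest["current_version"] + 1
        target = LAKE / f"{fragment}.v{version}.parquet"
        shutil.copy(new_file, target)                     # write the replacement
        snapshot = dict(manifest["versions"][str(manifest["current_version"])])
        snapshot[fragment] = target.name                  # swap just this fragment
        manifest["versions"][str(version)] = snapshot
        manifest["current_version"] = version
        MANIFEST.write_text(json.dumps(manifest, indent=2))

    def rewind(to_version: int) -> None:
        """'Replay' or roll back by pointing readers at an earlier snapshot; no
        data files are rewritten, because none were ever overwritten."""
        manifest = json.loads(MANIFEST.read_text())
        manifest["current_version"] = to_version
        MANIFEST.write_text(json.dumps(manifest, indent=2))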

The Modern Data Stack

A convergence has gradually been taking place. While DFS vendors have been developing DBMS-style software (lakehouses) to manage the use of data on DFS storage, legacy database vendors have been re-engineering their products to use DFS storage effectively. Data lakehouse and database products are heading in the same direction, are becoming increasingly similar, and ultimately will be hard to distinguish at all.

Gradually a new paradigm has emerged around products that combine the management and structure of databases, on top of the flexibility and scale of a DFS. These are not the monolithic individual products common in the database market of the past, but rather collections of related, often interchangeable, products that can be used together in different arrangements to suit a wide variety of use cases.

The modern data stack consists of separate data storage, data integration, data management and data visualisation components. Increasingly these are cloud-based SaaS products that leverage open-source foundational technologies like Hadoop and Spark. Databricks, for example, uses a common Spark architecture, available as a service from all the main cloud vendors - making the product essentially vendor-agnostic.

Pricing models are typically based on computing power, data throughput, and/or data egress, rather than capacity. Components are modular and can be powered up or down, on or off as needs require. Instead of committing long term to one broad product that can adapt to many circumstances, the trend is increasingly to take up specialised products for only as long as they are required, and to have multiple products running at the same time addressing different aspects of the workload.

Often a cluster of specialised products is offered by the same vendor, who in the past would have bundled this diverse functionality as features within a single large application. The advantage of clustering or bundling applications is that a single vendor is more likely to co-ordinate multiple applications effectively, reducing overlap, ensuring interoperability, providing holistic training, and so on.

How has the market changed?

Given all of the disruption and fragmentation of recent years it can be difficult to make sense of where things are at. According to the most recent figures from Gartner, in 2021 the database market was dominated by just 3 vendors - Microsoft, AWS and Oracle. These three collectively generated 68% of the $80bn revenue in the space, with Google and IBM accounting for a further 12%.

All of the other vendors combined - including the likes of Teradata and Snowflake - made up just 20% of the market. While some of the revenue generated by the big five will be for their non-relational, DFS-based products, most will be for their primary products - SQL Server, Redshift, Oracle DB, BigQuery and DB2 - which are all more or less traditional relational database products. Exclusively non-relational database vendors, including the likes of MongoDB, claimed less than 3% of the total market.

Notably, AWS nearly tripled its market share in the four years to 2021, while Oracle's nearly halved. The big difference between these two is not so much the type of products they offer as the delivery model - AWS is exclusively cloud-based, while Oracle struggled to offer cloud services at all, scrapping its initial attempt and only really getting going in late 2018. Microsoft, the market leader, has a strong presence in both camps, and has so far managed to bring many of its legacy customers along on its cloud journey.

Conclusion

Reports of the death of the database have been greatly exaggerated. In fact, while cloud-first is now clearly dominant, databases continue to be the solution of choice for the vast majority of data processors and consumers. This will, in all likelihood, remain true for the foreseeable future, despite all of the on-going hype about data lakes, data lake-houses, streaming, AI, and the separation of storage and compute.

A key observation must also be that many of the problems that people encounter working with data will not be resolved by changing technology, but rather by improving their approaches to data governance and data management. The more things change, the more they remain the same...

John Thompson is a Director with EY's Technology Consulting practice. His primary focus for many years has been the effective design and management of enterprise data systems.

