Bad Fashion: Open Data Lakehouses
A great number of companies I speak with are enthusiastic about adopting the open data lakehouse pattern. Some already have. The majority are making a mistake.
Open is always good, right?
Not always. Open is great for standardisation, interoperability and extensibility. But just because open 'feels' right doesn't mean it is right for every organisation. At Google, we support the open data lakehouse pattern via BigLake for organisations pursuing this approach, but it's important to adopt it because it fits your organisation, not because it's fashionable. I'll mention our capabilities throughout, but the principles apply universally.
For clarity, let's define the open data lakehouse, so we agree on the pattern under scrutiny. An open data lakehouse builds an organisation's data repository by storing data in open file formats (e.g. Apache Parquet) in object storage (typically cloud), possibly further abstracted by open table formats (e.g. Apache Iceberg), whereupon that data can be directly accessed by multiple, heterogeneous query engines.
My objection is that adopting an 'open' approach confers insufficient benefit - to be clear, I agree with the merits of a multi-engine data lakehouse (and with Google's approach to it).
To begin, a brief diversion. File-based storage (most commonly associated with HDFS) emerged to handle the demands of big data (a phrase now out of fashion). Data scales had grown such that traditional data warehouse systems were either unable to cope with those volumes (due to their coupled storage-compute architectures) or were cost-prohibitive (as storage needed to scale alongside compute). However, modern data platforms (particularly cloud-based ones, like Google's BigQuery) separate storage and compute, allowing each to scale independently of the other. This remediates both issues: it eliminates the scaling limits and aligns data platform storage costs with those of file-based approaches.
However, those data platforms usually have a proprietary data storage format (although they may provide options for handling data in open formats). So the question becomes: what benefits do those open formats confer, now that the answer is no longer scalability or price?
There are two primary arguments for open formats. The first is standardisation and interoperability. When data is delivered from other systems (whether via event streams, ETL or ELT), open file formats mean this data can easily be landed in the sink (in this case, the object storage of the open data lakehouse). Similarly, when this data is analysed or queried, its open format means a variety of different engines (from different vendors) can directly access that data - without needing intermediaries or APIs to perform their processing.
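To make that interoperability concrete, here is a minimal sketch (file name and columns are hypothetical) in which two unrelated engines read the same Parquet file directly, with no intermediary or API in between:

```python
# Minimal sketch: one Parquet file, read directly by two different engines.
# File name and columns are hypothetical; in a lakehouse this file would sit in object storage.
import pyarrow as pa
import pyarrow.parquet as pq
import duckdb

# Produce a small file in the open format.
events = pa.table({"user_id": [1, 2, 3], "amount": [9.99, 4.50, 12.00]})
pq.write_table(events, "events.parquet")

# Engine 1: read it with PyArrow.
print(pq.read_table("events.parquet").to_pandas())

# Engine 2: query the very same file with DuckDB - no export, copy or API required.
print(duckdb.sql(
    "SELECT user_id, SUM(amount) AS total FROM 'events.parquet' GROUP BY user_id"
).df())
```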
The second argument is avoiding vendor lock-in. By retaining all data in a non-proprietary format, if the organisation decides to change its object storage system or its analytic engine(s), it can do so with ease - and is not 'held hostage' by a vendor who has locked its data into a proprietary format with no obvious means to extract or export it.
Despite these benefits, there are drawbacks that I believe are more significant. They fall into the domains of capability, performance and security - listed here from least to most critical.
The first dimension concerns capability. Open source has a long and storied history of democratising innovation - which has undoubtedly been beneficial for the technology industry - but open source has rarely been the originator of that innovation. Perhaps the most famous example is Linux, started in 1991 but based on Unix, released in 1970. There are similar examples: Kubernetes hit 1.0 in 2015, whilst being based on Google's Borg from several years prior; Parquet arrived in 2013, based on Google's Dremel paper from 2010 (describing a system in production since 2006); and Apache Hadoop took over five years to hit 1.0 in 2011, yet was itself triggered by Google's paper from 2003.
So whilst open file formats provide fantastic standardisation and interoperability, that comes at the cost of lagging technical capabilities. For example, proprietary geospatial data types have been commonplace for decades, yet GeoParquet only hit 1.0 in September 2023. With the recent explosion of Generative AI, other functionality, such as vector embeddings, is becoming business-critical but currently lacks standardisation in open file formats.
Proponents of the open data lakehouse might claim this is a non-issue as those extra capabilities can be applied by your preferred engine with direct access to the open data files. An example might be Apache Spark superseding MapReduce (itself inspired by Google’s MapReduce paper). However, that capability then no longer forms part of the open data lakehouse but is siloed within that engine (or its file format), rendering it inaccessible from other engines pointed at the open data lakehouse and defeating its purpose.
Others may point to the benefit of open source meaning you are free to extend formats in whatever manner you like. This is true, and extensibility is a fantastic open source advantage. However, the purpose of an open data lakehouse is interoperability between storage format and analytic engine. Not only would an organisation need to fork its storage format, but also each and every analytic engine harnessing that format. Realistically, this is not something most organisations would even consider (and it undoes the standardisation expected of an open data lakehouse).
If you contend that data formats have ceased innovating and, therefore, that new encodings or data types needing retention are highly unlikely to emerge, then standardising on an open file format may not limit your organisation. However, if you believe that innovation continues apace in the data world, taking advantage of those innovations through proprietary formats may benefit your organisation well before they are standardised in open formats. The latter seems more plausible; at Google, for example, we added Text Embeddings in August 2023.
The second dimension is performance. It is almost always the case that tightly-coupled systems deliver greater performance than loosely-coupled systems. Components within tightly-coupled systems can make (correct) assumptions about other components and so bypass abstractions inherent in loosely-coupled systems. Our performance benchmarking for BigQuery bears this out. When extracting data from various sources, using open formats to send data makes complete sense; there may be thousands of sources, so tight-coupling is not practicable. However, when ingesting data into the data lakehouse, only a very small number of engines will analyse that data. These can be tightly-coupled with the data lakehouse and so converting data from an open to a proprietary format during ingestion is eminently sensible. Organisations should take advantage of open source file format interoperability when transmitting data but not when persisting it into a data lakehouse.
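As a sketch of what convert-on-ingestion looks like in practice (project, dataset, table and bucket names are hypothetical, and the BigQuery Python client is used purely as an example), open-format files are loaded once into the platform's native storage:

```python
# Sketch: ingest open-format (Parquet) files into a tightly-coupled, proprietary format.
# Project, dataset, table and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# The conversion happens once, at ingestion; every subsequent query hits native storage.
load_job = client.load_table_from_uri(
    "gs://my-bucket/landing/events/*.parquet",
    "my-project.analytics.events",
    job_config=job_config,
)
load_job.result()  # Wait for the load (and format conversion) to complete.
```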
It can be contended that the compute resources required for such format conversions are wasted, when conversion could instead be done upon querying or analysing the data (such schema-on-read approaches are often touted as a data lakehouse benefit). However, data is usually analysed far more often than it is ingested - it would be strange to analyse data only once in its entire lifecycle. So undertaking conversion upon ingestion is the preferred approach, rather than re-interpreting that data every time it is queried.
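A back-of-the-envelope comparison makes the asymmetry clear (all numbers below are purely hypothetical and only the ratio matters):

```python
# Back-of-the-envelope comparison with purely hypothetical numbers (arbitrary cost units).
queries_per_ingest = 100   # data is typically read far more often than it is written
convert_on_ingest = 1.0    # one-off cost to convert into the proprietary format
query_native = 1.0         # per-query cost against the optimised native format
query_open = 1.5           # per-query cost of interpreting the open format each time

convert_once_total = convert_on_ingest + queries_per_ingest * query_native  # 101.0
schema_on_read_total = queries_per_ingest * query_open                      # 150.0
print(convert_once_total, schema_on_read_total)
```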
A hybrid approach is to land the data (especially when using ELT patterns) into a ‘raw’ layer in open format and post-process the data into a ‘curated’ layer in proprietary format. If the raw data is never directly queried, this is viable - although it may be the case that a proprietary format is still beneficial from an ease-of-management perspective (especially when it comes to very large datasets, split into thousands of files).
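A sketch of that hybrid (names are hypothetical, using BigQuery's external table support purely as an illustration): the raw layer stays as Parquet in object storage, and a curated native table is materialised from it:

```python
# Sketch of the hybrid pattern: raw layer in open format, curated layer in native format.
# Project, dataset, table and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Raw layer: an external table over Parquet files sitting in object storage.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/raw/orders/*.parquet"]
raw_table = bigquery.Table("my-project.lake.raw_orders")
raw_table.external_data_configuration = external_config
client.create_table(raw_table, exists_ok=True)

# Curated layer: post-process into a native (proprietary-format) table for analytics.
client.query(
    """
    CREATE OR REPLACE TABLE `my-project.lake.curated_orders` AS
    SELECT order_id, customer_id, SUM(amount) AS total_amount
    FROM `my-project.lake.raw_orders`
    GROUP BY order_id, customer_id
    """
).result()
```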
Let us pause to make a very important concession. Vendor lock-in is a real risk. When recommending conversion of data into proprietary formats during ingestion, there remains the concern that once converted, it will be impossible (or very difficult) to extract that data again. Therefore, the argument goes, an organisation will be at its data lakehouse vendor's mercy and, hence, must retain its data in its original open format.
Fortunately, this risk is easy to mitigate. Ensure your data lakehouse can quickly, cheaply and easily export its data into open formats. The vast majority of vendors do so as standard. Any vendor that doesn't is implying a lack of confidence in their product; enforcing lock-in rather than exhibiting confidence in their ability to retain customers through product excellence. At Google, we not only support a variety of open export formats, we even provide free data transfer when exiting Google Cloud completely. And lest the complaint be made that converting data on ingestion only to revert it to its original format upon export is nonsensical, consider the frequency of each: ingestion is a continuous process (especially with modern event-based streams), whereas changing data lakehouse vendor is highly likely to be infrequent, on the order of once every few years. It is not appropriate to constrain your ongoing analytic performance for the sake of a very low-frequency event.
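Verifying that exit door is simple enough to script; a sketch (names hypothetical) that exports a native table back out to Parquet in object storage:

```python
# Sketch: export a native table back to an open format, proving there is no technical lock-in.
# Project, dataset, table and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET,
)
extract_job = client.extract_table(
    "my-project.analytics.events",
    "gs://my-bucket/export/events/*.parquet",
    job_config=extract_config,
)
extract_job.result()  # The data is now back in an open, portable format.
```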
Performance becomes an even more important consideration when assessing the suitability of the open data lakehouse, especially if operating at exabyte, or at least petabyte, scale. Open source query engines have improved significantly in recent years, as has the efficiency of open source file formats (in theory, one could build an open data lakehouse atop CSV files, but I have never seen anyone try). So for small-scale workloads (gigabytes or low terabytes), the additional performance overhead may not be noticeable. It does imply, though, that an open data lakehouse may be more expensive when considering the compute resources required for an equivalent degree of analytics (as additional processing and handling of abstractions is required). Licence costs may be 'saved' through open source engines but more 'spent' on greater compute requirements. However, as long as that compute falls within 'normal' machine scales, it will not generate notable performance challenges, and some organisations may decide the additional cost is justified.
So an open data lakehouse has performance drawbacks and capability constraints. That said, a valid contention might be that innovation is in analytic engines, not data formats. An open data lakehouse allows new engines to be added, with their associated innovations, without altering underlying data. This is where the most important dimension, security, comes to the fore.
Open data lakehouses are, almost by definition, file-based. For simplicity, let's treat a file as representing a table of data. File-based access controls are very coarse; you can read, change or delete the entire file, or you cannot touch it at all. This is woefully inadequate for data. Data access controls demand far more granularity. For example, there may be column-based access controls, which determine whether someone can access particular data (for example, salary) or even know that the column exists (restricting the metadata). Similarly, there may be row-based access controls, which determine whether someone can access data depending on its value (for example, only rows pertaining to a specific business unit).
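A tiny sketch of that granularity mismatch (file and columns are hypothetical): once a principal has read access to the file, nothing at the file layer can hide a column or filter rows:

```python
# Sketch of why file-level permissions are too coarse for data access control.
# File name and columns are hypothetical.
import pyarrow.parquet as pq

# If the object store grants read access to the file, it grants access to everything in it.
table = pq.read_table("employees.parquet")
print(table.column_names)  # the schema, 'salary' included, is fully visible

# Nothing at the file layer can express "this principal may not see the salary column"
# or "this principal may only see rows where business_unit = 'EMEA'".
print(pq.read_table("employees.parquet", columns=["salary"]).to_pandas())
```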
This is a challenge open data lakehouses have not adequately addressed. There are generally two approaches to doing so.
The first pushes the responsibility for access controls to the query engines themselves. Whilst the file system can approve or deny access to the file, the query engine must implement the restriction on which subsets of data within that file are authorised for any given analysis. This can be supported by a catalogue to inform the engine what data should be subject to controls but controls are not (and cannot be) enforced by that catalogue - it only tells the engine what it should do. This is the approach taken by Tabular (now acquired by Databricks) where data is still directly accessible by engines.
Organisations are then left with a choice of either keeping those data access controls synchronised across multiple engines or assuring themselves that those engines will adhere to the information they receive from the catalogue (requiring integration). Whichever approach organisations choose, this places an additional constraint on which engines are permitted to access data in the open data lakehouse. It is no longer the case that whichever engine someone wants to adopt can simply be harnessed. Instead, access controls must be thoroughly replicated, or catalogue integration and assurance completed, before any query runs. Hardly the flexibility proponents of the open data lakehouse envisage.
The second approach sees the catalogue become the enforcer of data access controls. Rather than engines directly interacting with the open format files, they request data via catalogue APIs for read/write requests and are only provided the actual data (not files) they are authorised for. Snowflake has open-sourced Polaris along these lines. With this approach, we have come full circle! When all engines request data via APIs, then whether the data is stored in open or proprietary formats becomes irrelevant. Indeed, for the performance and capability reasons covered earlier, it makes more sense to use a proprietary format behind the API. This applies even if a future open source standard for that API emerges. Furthermore, when the API is directly returning data (as opposed to pointing to a file) then it has itself become a query engine - yet again coupling the storage format of the data with its analysis, as has historically been the case (and with good reason).
An open data lakehouse pattern either allows any new query engine to access the data directly and immediately (which is the most commonly touted benefit) or it can enforce data access controls. It cannot do both.
Therefore, modern data lakehouses only allow API-based data querying. Once access is API-based, the advantages of open file formats disappear: the API becomes the channel for data access via the 'primary' query engine, with options to connect 'secondary' query engines to sustain innovation and experimentation. Alternatively, the data lakehouse itself is multi-engine (as it enforces data access controls internally) whilst providing APIs for alternative engines (and, of course, open file format export capabilities to avoid lock-in). Google has chosen this approach, with Dataplex securing data and BigLake Storage APIs governing data access for both Google-provided (e.g. Spark) and third-party engines. You could still retain your open data lakehouse pattern under the covers, but with more drawbacks than advantages.
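Purely as an illustration of what API-governed access looks like (and without claiming this is the exact BigLake mechanism; project, table, columns and filter below are hypothetical), the BigQuery Storage Read API has the caller request data through an API, with column selection and row restriction applied server-side before anything is returned:

```python
# Sketch of API-mediated data access: the engine asks an API for data and the server
# applies column selection and row restriction - the caller never touches the files.
# Project, dataset, table, columns and filter are hypothetical.
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

requested_session = types.ReadSession(
    table="projects/my-project/datasets/analytics/tables/employees",
    data_format=types.DataFormat.ARROW,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["employee_id", "business_unit"],  # no salary column requested
        row_restriction="business_unit = 'EMEA'",          # server-side row filter
    ),
)
session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=1,
)

# Rows arrive via the API as Arrow batches.
reader = client.read_rows(session.streams[0].name)
for page in reader.rows(session).pages:
    print(page.to_dataframe())
```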
Despite this, I concede there are scenarios where adopting an open data lakehouse pattern makes sense. It can be appropriate for organisations who do not handle massive data sets (so do not need finely-optimised performance), who do not require cutting-edge capabilities (so can wait for open source to catch up) and who do not need to secure the majority of their data (so can allow everyone to use the engine of their choice). My supposition is that the majority of organisations - particularly the ones I interact with - do not fit this characterisation. They should avoid the fashionable open data lakehouse pattern.
To conclude: open file formats are fantastic for standardising data ingestion and have accelerated our ability to drive interoperability between systems for data-in-transit. They are also important to demand as export formats (to eliminate technical vendor lock-in). But they do not provide a useful pattern for the retention of data (data-at-rest) for analytic purposes - especially when dealing with exabyte scale, novel capabilities or highly secure environments.