Don’t Let a Cloud Data Warehouse Bottleneck Your Machine Learning
This weekend I visited family in Sacramento. As I drove back to my city by the bay, I found myself passing through the nearly 20 lanes of toll booths in Oakland that merge into the 5 lanes of the Bay Bridge. Every time I make this journey, I think about how similar it is to the way a data warehouse becomes bottlenecked when retrieving data.
The most common way to connect to a data warehouse is via JDBC, ODBC, or some other driver. These drivers connect to a single endpoint, which ultimately becomes a bottleneck for how much data can pass through it.
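To make that bottleneck concrete, here is a minimal sketch of the driver pattern in Python using pyodbc; the DSN, credentials, and table name are hypothetical placeholders. However large the warehouse cluster is, every result row still streams back through this one connection.

```python
# Minimal sketch of the single-endpoint pattern; the DSN, credentials, and
# query are hypothetical placeholders.
import pyodbc

# One driver connection = one endpoint that all result rows must squeeze through.
conn = pyodbc.connect("DSN=my_cloud_dw;UID=ml_user;PWD=example")
cursor = conn.cursor()
cursor.execute("SELECT * FROM transactions")  # transaction-level grain for ML

rows = []
while True:
    batch = cursor.fetchmany(10_000)  # rows stream back through that single pipe
    if not batch:
        break
    rows.extend(batch)

conn.close()
```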
For most queries, the throughput is sufficient because the power of the database is being used to join, filter, and aggregate data before returning a query result that is much smaller. However, when a large amount of data needs to be retrieved from a data warehouse, this can quickly become a real bottleneck. This is a very common scenario when performing machine learning (ML).
Training an ML model benefits from data at the lowest level of granularity, down to the individual transaction. The early iterations of training a model may examine as many attributes as can be linked to those transactions. Through iteration, a data scientist whittles that set of features down to a much smaller list of the ones that have the greatest impact on generating an accurate model.
Cloud data warehouses (CDWs) have tried to eliminate this bottleneck by exporting directly to cloud object storage. This process uses many nodes of the CDW to copy data in parallel from its proprietary internal format to an open format, like Apache Parquet, on cloud storage. Once the export is complete, the data can then be ingested in parallel to perform model training. This multi-hop approach is how both the Redshift and Snowflake connectors for Apache Spark are architected. It is the equivalent of a line of cars driving over the eastern span of the Bay Bridge, parking on Treasure Island, and having the drivers switch cars before completing the journey to San Francisco.
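As a rough illustration, here is what that multi-hop read looks like from PySpark. The format string and option names follow the community spark-redshift connector's documented pattern, but treat the exact names as assumptions; the JDBC URL, bucket, and table name are placeholders.

```python
# Hedged sketch of the multi-hop pattern: the connector asks the warehouse to
# unload the table to object storage first, then Spark reads those exported files.
# Option names follow the community spark-redshift connector; all values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdw-multi-hop-read").getOrCreate()

df = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://example-cluster:5439/db?user=ml&password=example")
    .option("dbtable", "transactions")
    .option("tempdir", "s3a://example-bucket/redshift-unload/")  # hop 1: warehouse exports here
    .load()                                                      # hop 2: Spark reads the exported files
)
```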
There are a number of issues with this approach: it is slow, it is expensive, and it forces you to maintain multiple governance layers.
Each of these can be mitigated by choosing a data lakehouse architecture instead.
ML is Slow on a Cloud Data Warehouse
It takes time for data to complete the multiple hops: the data must be exported out of the warehouse's proprietary format, landed on cloud object storage in an open format, and then ingested from that storage for model training.
Depending on the size of your data and the size of your data warehouse cluster, this can easily run for hours.
By comparison, a data lakehouse writes data once to cloud storage in an open format, and then ingests it directly from there to perform model training.
Three steps go down to one.
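In PySpark terms, the lakehouse read path can be as simple as the sketch below; the bucket path and column names are hypothetical placeholders, and the only assumption is that the transactions already live on cloud storage in an open format.

```python
# Lakehouse sketch: the data already lives on cloud storage in an open format
# (Parquet here), so training reads it in place; there is no export step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-direct-read").getOrCreate()

transactions = spark.read.parquet("s3a://example-lakehouse/transactions/")
features = transactions.select("amount", "merchant_id", "event_ts")  # hypothetical columns
```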
ML is Expensive on a Cloud Data Warehouse
The compute necessary to export data from a CDW is not free. The user must pay the CDW vendor for this compute every time the same data is exported. It's the equivalent of paying a toll each time you retrieve your own data.
Additionally, the storage costs are duplicated: you pay to store the same data once in the proprietary CDW format and again in an open format on cloud storage.
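A back-of-envelope sketch makes the duplication visible; every number below is a hypothetical placeholder, not a vendor quote.

```python
# Hypothetical cost sketch: one copy in the CDW, a second exported copy on
# object storage, plus warehouse compute billed for every export run.
data_tb = 10
cdw_storage_per_tb_month = 23.0      # proprietary-format copy inside the CDW
object_storage_per_tb_month = 21.0   # open-format copy exported to cloud storage
export_compute_per_run = 15.0        # warehouse compute billed per export
exports_per_month = 20               # feature-iteration / retraining runs

cdw_path = (
    data_tb * (cdw_storage_per_tb_month + object_storage_per_tb_month)
    + exports_per_month * export_compute_per_run
)
lakehouse_path = data_tb * object_storage_per_tb_month  # one copy, read in place

print(f"CDW path:       ${cdw_path:,.2f} per month")
print(f"Lakehouse path: ${lakehouse_path:,.2f} per month")
```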
Because a data lakehouse always stores its data directly on cloud storage in an open format, the data is written once and can be read many times without the need to copy it.
ML Requires Multiple Governance Layers with a Cloud Data Warehouse
After data has been exported to cloud storage, it must be retained for some amount of time. Typically, a data scientist will want to retain the exact dataset used to train a model so that it can be reproduced at some point in the future.
The governance layer used by the CDW is different from the governance layer used by data lakes on cloud storage. These two governance layers need to be kept in sync, and the lineage of the two copies of the data needs to be captured and maintained in what are potentially two different systems.
A data lakehouse is architected so that all data is written once to cloud storage, and then every use case can run against it without the need to copy data. This includes ML as well as traditional data warehouse use cases. Because the data is persisted only once, a single governance layer can govern it and lineage can be tracked without bifurcation.
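As one concrete way to get that reproducibility from a single governed copy, the sketch below uses Delta Lake time travel, assuming the table is stored as Delta and the cluster has the Delta Lake libraries configured; the table path and version number are hypothetical.

```python
# Sketch of reproducing the exact training dataset from the single governed copy,
# using Delta Lake time travel; the table path and version number are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reproduce-training-set").getOrCreate()

training_df = (
    spark.read.format("delta")
    .option("versionAsOf", 42)  # the table version the model was trained on
    .load("s3a://example-lakehouse/transactions/")
)
```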
Avoid the bottlenecks and toll booths
Cloud data warehouses were not designed to perform machine learning. Doing so is slow and expensive, and it requires multiple governance layers. By comparison, a data lakehouse is architected so that all your data is written once to cloud storage in an open format and then read in place for all your use cases, including ML and data warehousing.
Choosing a data lakehouse helps you avoid the bottlenecks, and the tolls, that come with using a cloud data warehouse for ML.