Don’t Let a Cloud Data Warehouse Bottleneck your Machine Learning


This weekend I visited family in Sacramento. As I drove back to my city by the bay, I found myself passing through the nearly 20 lanes of toll booths in Oakland that merge into the 5 lanes of the Bay Bridge. Every time I make this journey I think about how similar this is to the way a data warehouse becomes bottlenecked when retrieving data.

The most common way to connect to a data warehouse is via JDBC, ODBC, or some other driver. These drivers connect to a single endpoint, which ultimately limits how much data can pass through it.

For most queries, the throughput is sufficient because the power of the database is being used to join, filter, and aggregate data before returning a query result that is much smaller. However, when a large amount of data needs to be retrieved from a data warehouse, this can quickly become a real bottleneck. This is a very common scenario when performing machine learning (ML).
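The contrast can be sketched with an in-memory SQLite table standing in for the warehouse (table and column names here are made up for illustration). A typical BI query aggregates inside the engine and sends back a handful of rows, while an ML training fetch must pull every raw row across the same connection:

```python
import sqlite3

# In-memory stand-in for a warehouse table of raw transactions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (store_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i % 10, float(i)) for i in range(100_000)],
)

# A typical BI query: the engine aggregates first, so only 10 rows
# ever cross the driver connection.
agg_rows = conn.execute(
    "SELECT store_id, SUM(amount) FROM transactions GROUP BY store_id"
).fetchall()

# An ML training fetch: every raw row must squeeze through the
# same single endpoint.
raw_rows = conn.execute("SELECT store_id, amount FROM transactions").fetchall()

print(len(agg_rows), len(raw_rows))  # 10 100000
```

The endpoint is the same in both cases; what changes is whether 10 rows or 100,000 rows have to fit through it.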

Training an ML model benefits from data at the lowest level of granularity, down to the transaction level. The early iterations of training may examine as many attributes as can be linked to those transactions. Through iteration, a data scientist will whittle that set of features down to a much smaller list of the ones that have the greatest impact on generating an accurate model.
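One simple version of that whittling is ranking candidate features by correlation with the target and keeping the strongest few. The sketch below is a toy illustration with made-up feature names and a hand-rolled Pearson correlation; real feature selection uses richer methods, but the shape is the same:

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 500
target = [random.gauss(0, 1) for _ in range(n)]

# Start wide: many candidate features, only a few actually informative.
features = {
    "signal_a": [t + random.gauss(0, 0.5) for t in target],
    "signal_b": [2 * t + random.gauss(0, 1.0) for t in target],
    "noise_1": [random.gauss(0, 1) for _ in range(n)],
    "noise_2": [random.gauss(0, 1) for _ in range(n)],
    "noise_3": [random.gauss(0, 1) for _ in range(n)],
}

# Rank by absolute correlation with the target, keep the top two.
ranked = sorted(
    features, key=lambda name: abs(pearson(features[name], target)), reverse=True
)
selected = ranked[:2]
print(sorted(selected))  # ['signal_a', 'signal_b']
```

The point for the warehouse discussion: those early, wide iterations are exactly when the full raw dataset has to be retrieved, over and over.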

Cloud data warehouses (CDW) have tried to eliminate this bottleneck by exporting directly to cloud object storage. This process leverages many nodes of a CDW to copy data in parallel from its proprietary internal format to an open format, like Apache Parquet, on cloud storage. Once the export is complete, data can then be ingested in parallel to perform model training. This multi-hop approach is how both the Redshift and Snowflake connectors for Apache Spark are architected. It is the equivalent of a bunch of cars driving over the eastern span of the Bay Bridge, parking on Treasure Island, and having the drivers switch cars before completing the journey to San Francisco.
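A toy model of the multi-hop pattern, using stdlib stand-ins (a list of dicts for the warehouse's proprietary format, CSV for Parquet, and a temp directory for object storage — none of this is the connectors' actual code):

```python
import csv
import os
import tempfile

# Stand-in for rows held in the warehouse's proprietary internal format.
warehouse = [{"id": str(i), "amount": str(i * 2)} for i in range(1000)]

staging = tempfile.mkdtemp()  # stand-in for a cloud storage bucket
export_path = os.path.join(staging, "transactions.csv")

# Hop 1: the CDW exports from its internal format to an open one.
with open(export_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "amount"])
    writer.writeheader()
    writer.writerows(warehouse)

# Hop 2: the training job ingests the exported copy.
with open(export_path, newline="") as f:
    training_rows = list(csv.DictReader(f))

# Lakehouse path: the open-format file on storage IS the source of
# truth, so training reads it in place -- one step, no export.
with open(export_path, newline="") as f:
    lakehouse_rows = list(csv.DictReader(f))

print(len(training_rows), len(lakehouse_rows))  # 1000 1000
```

Both paths end with the same rows in hand; the multi-hop path just paid for an extra full copy of the data to get there.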

There are a number of issues with this approach:

  1. It is slow
  2. It is expensive
  3. It forces multiple governance models

Each of these can be mitigated by choosing a data lakehouse architecture instead.

ML is Slow on a Cloud Data Warehouse

It takes time for data to complete the multiple hops: export from the warehouse, land on object storage, then ingest into model training.

Depending on the size of your data and the size of your data warehouse cluster, this can easily run for hours.

By comparison, a data lakehouse writes data once to cloud storage in an open format, and then ingests directly from there to perform model training.

Three steps go down to one.


ML is Expensive on a Cloud Data Warehouse

The compute necessary to export data from a CDW is not free. The user must pay the CDW vendor for this compute every time the same data is exported. It's the equivalent of paying a toll each time you retrieve your own data.

Additionally, storage costs are duplicated, because you pay to store the same data both in the proprietary CDW format and in an open format on cloud storage.
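Back-of-the-envelope arithmetic makes the toll concrete. All of the prices and volumes below are hypothetical placeholders — substitute your own vendor's rates:

```python
# Hypothetical, illustrative numbers -- not any vendor's actual pricing.
export_compute_cost = 8.0   # $ of CDW compute per export run
exports_per_month = 30      # daily retraining, re-exporting the same data
cdw_storage_rate = 23.0     # $ per TB-month, proprietary CDW format
object_storage_rate = 23.0  # $ per TB-month, open format on cloud storage
tb_stored = 10

# The toll: paid every time the same data leaves the warehouse.
monthly_export_toll = export_compute_cost * exports_per_month

# The duplication: the same bytes stored twice vs. once.
duplicated_storage = (cdw_storage_rate + object_storage_rate) * tb_stored
single_copy_storage = object_storage_rate * tb_stored

print(monthly_export_toll)                       # 240.0
print(duplicated_storage - single_copy_storage)  # 230.0
```

Even with modest placeholder rates, the export compute recurs every run and the storage duplication recurs every month, while a write-once open copy pays each cost only once.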


Because a data lakehouse always stores its data directly on cloud storage in an open format, the data is written once and can be read many times without the need to copy it.

ML Requires Multiple Governance Layers with a Cloud Data Warehouse

After data has been exported to cloud storage, it must be retained for some amount of time. Typically, a data scientist will want to retain the exact dataset used to train a model so that it can be reproduced at some point in the future.

The governance layer used by the CDW is different from the governance layer used by data lakes on cloud storage. These two governance layers would need to be kept in sync, and the lineage for the two copies of the dataset would need to be captured and maintained in potentially two different systems.

A data lakehouse is architected so that all data is written once to cloud storage, and then all use cases can be run against it without the need to copy data. This includes ML and traditional data warehouse use cases. Because the data is persisted once, a single governance layer can be used for it and lineage can be tracked without bifurcation.

Avoid the bottlenecks and toll booths

Cloud data warehouses were not designed to perform machine learning. To do so would be slow, expensive, and require multiple governance layers. By comparison, a data lakehouse has been architected so that all your data is written once to cloud storage in an open format, and then read in-place for all your use cases, including ML and data warehousing.

Choosing a data lakehouse can help you avoid the bottlenecks and paying the toll that comes with using a cloud data warehouse for ML.
