Don’t Let a Cloud Data Warehouse Bottleneck your Machine Learning


This weekend I visited family in Sacramento. As I drove back to my city by the bay, I found myself passing through the nearly 20 lanes of toll booths in Oakland that merge into the 5 lanes of the Bay Bridge. Every time I make this journey I think about how similar this is to the way a data warehouse becomes bottlenecked when retrieving data.

The most common way to connect to a data warehouse is via JDBC, ODBC, or some other driver. These drivers connect to a single endpoint, which ultimately limits how much data can pass through it.

For most queries, the throughput is sufficient because the power of the database is being used to join, filter, and aggregate data before returning a query result that is much smaller. However, when a large amount of data needs to be retrieved from a data warehouse, this can quickly become a real bottleneck. This is a very common scenario when performing machine learning (ML).
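The contrast can be sketched with an in-memory SQLite table standing in for the warehouse (table and column names here are made up for illustration). A typical BI query aggregates inside the engine and sends back a handful of rows, while an ML training fetch must pull every raw row across the same connection:

```python
import sqlite3

# In-memory stand-in for a warehouse table of raw transactions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (store_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i % 10, float(i)) for i in range(100_000)],
)

# A typical BI query: the engine aggregates first, so only 10 rows
# ever cross the driver connection.
agg_rows = conn.execute(
    "SELECT store_id, SUM(amount) FROM transactions GROUP BY store_id"
).fetchall()

# An ML training fetch: every raw row must squeeze through the
# same single endpoint.
raw_rows = conn.execute("SELECT store_id, amount FROM transactions").fetchall()

print(len(agg_rows), len(raw_rows))  # 10 100000
```

The endpoint is the same in both cases; what changes is whether 10 rows or 100,000 rows have to fit through it.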

Training an ML model benefits from data at the lowest level of granularity, down to the transaction level. The early iterations of training may examine as many attributes as can be linked to those transactions. Through iteration, a data scientist will whittle that set of features down to a much smaller list of the ones that have the greatest impact on generating an accurate model.
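One simple version of that whittling is ranking candidate features by correlation with the target and keeping the strongest few. The sketch below is a toy illustration with made-up feature names and a hand-rolled Pearson correlation; real feature selection uses richer methods, but the shape is the same:

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 500
target = [random.gauss(0, 1) for _ in range(n)]

# Start wide: many candidate features, only a few actually informative.
features = {
    "signal_a": [t + random.gauss(0, 0.5) for t in target],
    "signal_b": [2 * t + random.gauss(0, 1.0) for t in target],
    "noise_1": [random.gauss(0, 1) for _ in range(n)],
    "noise_2": [random.gauss(0, 1) for _ in range(n)],
    "noise_3": [random.gauss(0, 1) for _ in range(n)],
}

# Rank by absolute correlation with the target, keep the top two.
ranked = sorted(
    features, key=lambda name: abs(pearson(features[name], target)), reverse=True
)
selected = ranked[:2]
print(sorted(selected))  # ['signal_a', 'signal_b']
```

The point for the warehouse discussion: those early, wide iterations are exactly when the full raw dataset has to be retrieved, over and over.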

Cloud data warehouses (CDW) have tried to eliminate this bottleneck by exporting directly to cloud object storage. This process leverages many nodes of a CDW to copy data in parallel from its proprietary internal format to an open format, like Apache Parquet, on cloud storage. Once the export is complete, data can then be ingested in parallel to perform model training. This multi-hop approach is how both the Redshift and Snowflake connectors for Apache Spark are architected. It is the equivalent of a bunch of cars driving over the eastern span of the Bay Bridge, parking on Treasure Island, and having the drivers switch cars before completing the journey to San Francisco.
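A toy model of the multi-hop pattern, using stdlib stand-ins (a list of dicts for the warehouse's proprietary format, CSV for Parquet, and a temp directory for object storage — none of this is the connectors' actual code):

```python
import csv
import os
import tempfile

# Stand-in for rows held in the warehouse's proprietary internal format.
warehouse = [{"id": str(i), "amount": str(i * 2)} for i in range(1000)]

staging = tempfile.mkdtemp()  # stand-in for a cloud storage bucket
export_path = os.path.join(staging, "transactions.csv")

# Hop 1: the CDW exports from its internal format to an open one.
with open(export_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "amount"])
    writer.writeheader()
    writer.writerows(warehouse)

# Hop 2: the training job ingests the exported copy.
with open(export_path, newline="") as f:
    training_rows = list(csv.DictReader(f))

# Lakehouse path: the open-format file on storage IS the source of
# truth, so training reads it in place -- one step, no export.
with open(export_path, newline="") as f:
    lakehouse_rows = list(csv.DictReader(f))

print(len(training_rows), len(lakehouse_rows))  # 1000 1000
```

Both paths end with the same rows in hand; the multi-hop path just paid for an extra full copy of the data to get there.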

There are a number of issues with this approach:

  1. It is slow
  2. It is expensive
  3. It forces multiple governance models

Each of these can be mitigated by choosing a data lakehouse architecture instead.

ML is Slow on a Cloud Data Warehouse

It takes time for data to complete the multiple hops: export from the warehouse, land on object storage, then ingest into model training.

Depending on the size of your data and the size of your data warehouse cluster, this can easily run for hours.

By comparison, a data lakehouse writes data once to cloud storage in an open format, and then ingests directly from there to perform model training.

Three steps go down to one.


ML is Expensive on a Cloud Data Warehouse

The compute necessary to export data from a CDW is not free. The user must pay the CDW vendor for this compute every time the same data is exported. It's the equivalent of paying a toll each time you retrieve your own data.

Additionally, storage costs are duplicated, because you pay to store the same data both in the proprietary CDW format and in an open format on cloud storage.
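Back-of-the-envelope arithmetic makes the toll concrete. All of the prices and volumes below are hypothetical placeholders — substitute your own vendor's rates:

```python
# Hypothetical, illustrative numbers -- not any vendor's actual pricing.
export_compute_cost = 8.0   # $ of CDW compute per export run
exports_per_month = 30      # daily retraining, re-exporting the same data
cdw_storage_rate = 23.0     # $ per TB-month, proprietary CDW format
object_storage_rate = 23.0  # $ per TB-month, open format on cloud storage
tb_stored = 10

# The toll: paid every time the same data leaves the warehouse.
monthly_export_toll = export_compute_cost * exports_per_month

# The duplication: the same bytes stored twice vs. once.
duplicated_storage = (cdw_storage_rate + object_storage_rate) * tb_stored
single_copy_storage = object_storage_rate * tb_stored

print(monthly_export_toll)                       # 240.0
print(duplicated_storage - single_copy_storage)  # 230.0
```

Even with modest placeholder rates, the export compute recurs every run and the storage duplication recurs every month, while a write-once open copy pays each cost only once.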


Because a data lakehouse always stores its data directly on cloud storage in an open format, the data is written once and can be read many times without the need to copy it.

ML Requires Multiple Governance Layers with a Cloud Data Warehouse

After data has been exported to cloud storage, it must be retained for some amount of time. Typically, a data scientist will want to retain the exact dataset used to train a model so that it can be reproduced at some point in the future.

The governance layer used by the CDW is different from the governance layer used by data lakes on cloud storage. These two governance layers would need to be kept in sync, and the lineage for the two copies of the dataset would need to be captured and maintained in potentially two different systems.

A data lakehouse is architected so that all data is written once to cloud storage, and then all use cases can be run against it without the need to copy data. This includes ML and traditional data warehouse use cases. Because the data is persisted once, a single governance layer can be used for it and lineage can be tracked without bifurcation.

Avoid the bottlenecks and toll booths

Cloud data warehouses were not designed to perform machine learning. To do so would be slow, expensive, and require multiple governance layers. By comparison, a data lakehouse has been architected so that all your data is written once to cloud storage in an open format, and then read in-place for all your use cases, including ML and data warehousing.

Choosing a data lakehouse can help you avoid the bottlenecks and paying the toll that comes with using a cloud data warehouse for ML.
