Why a Data Lakehouse is the Best Option for Scalable Machine Learning

When I was a kid in school, my teacher would sometimes spring a pop quiz. Rather than collect all of the tests herself to grade, she would have each of us pass our test to the person behind us and then instruct us on how to grade them. Afterwards, she would collect the scores and calculate the class average. Although I didn't realize it at the time, I was a participant in distributed data processing. Massively Parallel Processing (MPP) databases behave in much the same way. Each student is a worker given instructions to perform independently on a portion of the data. The output of that processing, such as a test score, is then collected by a driver node, the teacher, which performs the final calculation.
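To make that pattern concrete, here is a minimal PySpark sketch of the pop quiz, assuming a hypothetical file of quiz scores: each worker aggregates its own partition of the data, and the driver combines the partial results into the class average.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pop-quiz").getOrCreate()

# Each worker reads and aggregates its own partition of the scores;
# the driver combines the partial sums and counts into the final average.
scores = spark.read.csv("/path/to/quiz_scores.csv", header=True, inferSchema=True)
scores.agg(F.avg("score").alias("class_average")).show()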

Apache Spark became the most active open source project in big data because it took this pattern of distributed data processing and generalized it so that it could be applied to any type of processing and any type of data. This includes MPP-style SQL processing, but also advanced machine learning (ML) through the MLlib library. The processing required for ML is unique because it requires multiple scans of the data to transform it into "features" and then apply various algorithms against them. Typically, a number of these feature transformations and algorithms are chained together to form a pipeline that is "fit" against the data to generate a "model".

The Data Lakehouse has been architected to be both distributed and open, which makes it the only option for scalable ML. There are two major advancements that have accelerated the adoption of ML:

  1. The Dataframe API
  2. Open Source Software

The Dataframe API gained broad adoption when the R programming language was open sourced in 2001. Later in 2009 the Pandas Dataframe API did the same for Python. These languages are used heavily for statistical analysis, and researchers have contributed to the community by continuously open sourcing new kinds of algorithms across both R and Python.

These Dataframes have a limitation though. They are single threaded and can only scale up to the resources of the single machine they run on. Once data grows to a meaningful size, a data scientist must make a decision: either sample the data to continue training on a single node, or distribute the ML across all the data.
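For contrast, a typical single-node workflow looks something like the following sketch, with a hypothetical file path and column names: the entire dataset has to fit in the memory of one machine, and training runs on that one machine.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# pandas loads the whole dataset into the memory of a single machine,
# and scikit-learn trains on that single node; nothing here is distributed.
df = pd.read_csv("/path/to/data.csv")
model = LogisticRegression(max_iter=100).fit(df[["feature1", "feature2"]], df["label"])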

Sampling data can be a project in itself to ensure the sample is representative of the problem space. For example, suppose you are training a model to detect fraud in a dataset where fraud rarely occurs. If the sample does not contain sufficient instances of fraud, it will be hard to fit a model. Training against the entire dataset requires a distributed ML framework. It's a tradeoff between spending more data scientist hours or more compute hours.
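If sampling is the route taken, a stratified sample is the usual safeguard so that the rare fraud cases are not diluted away. A minimal sketch using PySpark's sampleBy, with a hypothetical label column and fractions:

# Keep every fraud record (label 1) but only 1% of non-fraud records (label 0),
# so the rare class remains well represented in the training sample.
fractions = {0: 0.01, 1: 1.0}
sample_df = df.stat.sampleBy("label", fractions, seed=42)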

The breakthrough came in 2015 when the Dataframe API for Apache Spark was introduced. This allowed the processing of a Dataframe to be distributed across a cluster of worker nodes, just like the students grading tests. MPP databases have a known set of aggregate functions, like sum(), max(), and avg(), that can be applied on each worker. Distributing the work of ML is much more complex than distributing aggregate functions, because each algorithm is a bit different in its implementation. Both the Dataframe and the ML framework must be distributed in order to train models against large datasets. Consider the following example, which creates a simple text classification model using MLlib:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Split the raw text into words, hash the words into feature vectors,
# and train a logistic regression classifier on those features.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Both the Dataframe and the fit() call are distributed across the cluster.
df = spark.read.load("/path/to/data")
model = pipeline.fit(df)

If you examine the dependency injection in the above example, you'll notice that the Dataframe "df" is injected into the ML framework to generate a model. Because both the client and the service are distributed, a single model can be trained across all of the data.
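Once fit() returns, the resulting model is itself distributed: scoring new data is just another Dataframe operation that runs across the cluster. A short sketch, assuming a hypothetical path to new data:

# Apply the fitted pipeline to new records; predictions are computed
# in parallel across the workers, just like the training pass.
new_df = spark.read.load("/path/to/new_data")
predictions = model.transform(new_df)
predictions.select("text", "prediction").show()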

By comparison, Snowflake was not built for distributed ML. Rather, the only way to perform model training with Snowflake is to do the reverse: inject an ML framework, like scikit-learn, into its proprietary Snowpark Dataframe via a UDF call. There is no ML framework on the market, not even one from Snowflake, that can accept a Snowpark Dataframe as a dependency.

ML on Snowflake requires the following steps (sketched in code below):

  1. Group the data
  2. Pass the data into a UDF
  3. Convert the data to a pandas Dataframe
  4. Train a model, single-threaded, via scikit-learn
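The sketch below illustrates that per-group, single-node pattern. It is written with PySpark's applyInPandas and scikit-learn rather than Snowflake's actual Snowpark API, and the group, feature, and label columns are hypothetical, but the shape of the workflow is the same: each group is collected onto one worker, converted to pandas, and trained single-threaded.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Each group of rows is collected onto a single worker, handed over as a
# pandas Dataframe, and a separate model is trained single-threaded there.
def train_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    features = pdf[["feature1", "feature2"]]
    model = LogisticRegression(max_iter=100).fit(features, pdf["label"])
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "train_accuracy": [model.score(features, pdf["label"])]})

results = (df.groupBy("group")
             .applyInPandas(train_per_group, schema="group string, train_accuracy double"))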

Because a single model gets trained on a single worker node, the following limitations are introduced:

  • If one of the groups of data is more than a few thousand records, it fails
  • If one call of the UDF consumes too much memory, it fails
  • If one of the UDFs takes more than 60 seconds to execute, it fails

A Data Lakehouse is the most scalable way to perform ML. It was architected from the beginning for the storage and compute layers to be both distributed and open. This enabled a community to build tools, like MLlib, that perform ML at scale. When McGraw Hill Education needed to train against over 10 million student interactions, they chose the Databricks Lakehouse to do so. This allowed them to provide personalized recommendations that increased student pass rates by over 13%.

The next time your data team gets a pop quiz, make sure they are equipped with a Data Lakehouse so they can scale with the data and deliver a perfect score.

James Grogan

Data Architect, Data Engineer, Integration and BI Specialist (Data Warehouse and related ecosystems)

You may wish to review TeradataML, which has both a DF and SQL workflow. Python / R front-ends are available, along with a fully parallel-aware (where the algorithm permits) function set.

Rogier Werschkull

Head of Data @ Coolgradient | Data-analytics trainer | Rockstar & AI artist @ night ;) | Love calling bullshit

Nick Akincilar or Daan Bakboord: any comments on the ML limitations of Snowflake as explained in this piece?

Paul Johnson

Software, Cloud & Analytics | Data Engineering | Feed AI, BI & Data Science with Quality Data

“the only option for scalable ML” is a very bold claim indeed. What about MADlib running on an MPP Greenplum cluster? The data can reside either externally or be loaded to the DBMS should you choose.

Rogier Werschkull

Head of Data @ Coolgradient | Data-analytics trainer | Rockstar & AI artist @ night ;) | Love calling bullshit

AFAIK a 'lakehouse' is not a technology but a data architectural pattern: it is about re-introducing the 'old' data warehousing concerns of organising data in a subject-oriented manner & data integration to all the failed data lake implementations. As in:
  • The data lake adage of 'just make sense of all this raw data yourself' failed because it obviously does not scale.
  • Thus, data lakes became data swamps, AKA 'Hotel California'.
  • People at Databricks pitched 'data lakehouse' as a new term for doing data warehousing in a different tech stack.
  • Which mainly is 'old wine in new bottles', as in: it is 'just data warehousing' where you separate your time-variant / non-volatile concerns (= lake) from your subject-oriented / integrated ones (= house).
