Why a Data Lakehouse is the Best Option for Scalable Machine Learning

When I was a kid in school, my teacher would sometimes spring a pop quiz. Rather than collect all of the tests herself to grade, she would have each of us pass our test to the person behind us and then instruct us on how to grade them. Afterwards, she would collect the scores and calculate the class average. Although I didn't realize it at the time, I was a participant in distributed data processing. Massively Parallel Processing (MPP) databases behave in much the same way. Each student is a worker given instructions to perform independently on a portion of the data. The output of that processing, such as a test score, is then collected by a driver node, the teacher, which performs the final calculation.
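To make that pattern concrete, here is a minimal PySpark sketch of the pop quiz, assuming a hypothetical file of quiz scores: each worker aggregates its own partition of the data, and the driver combines the partial results into the class average.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pop-quiz").getOrCreate()

# Each worker reads and aggregates its own partition of the scores;
# the driver combines the partial sums and counts into the final average.
scores = spark.read.csv("/path/to/quiz_scores.csv", header=True, inferSchema=True)
scores.agg(F.avg("score").alias("class_average")).show()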

Apache Spark became the most active open source project in big data because it took this pattern of distributed data processing and generalized it so that it could be applied to any type of processing and any type of data. This includes MPP-style SQL processing, but also advanced machine learning (ML) through the MLlib library. The processing required for ML is unique because it requires multiple scans of the data to transform it into "features" and then apply various algorithms against them. Typically, a number of these feature transformations and algorithms are chained together to form a pipeline that is "fit" against the data to generate a "model".

The Data Lakehouse has been architected to be both distributed and open, which makes it the only option for scalable ML. There are two major advancements that have accelerated the adoption of ML:

  1. The Dataframe API
  2. Open Source Software

The Dataframe API gained broad adoption when the R programming language was open sourced in 2001. Later in 2009 the Pandas Dataframe API did the same for Python. These languages are used heavily for statistical analysis, and researchers have contributed to the community by continuously open sourcing new kinds of algorithms across both R and Python.

These Dataframes have a limitation though. They are single threaded and can only scale up to the resources of the single machine they run on. Once data grows to a meaningful size, a data scientist must make a decision: either sample the data to continue training on a single node, or distribute the ML across all the data.
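For contrast, a typical single-node workflow looks something like the following sketch, with a hypothetical file path and column names: the entire dataset has to fit in the memory of one machine, and training runs on that one machine.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# pandas loads the whole dataset into the memory of a single machine,
# and scikit-learn trains on that single node; nothing here is distributed.
df = pd.read_csv("/path/to/data.csv")
model = LogisticRegression(max_iter=100).fit(df[["feature1", "feature2"]], df["label"])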

Sampling data can be a project in itself to ensure the sample is representative of the problem space. For example, suppose you are training a model to detect fraud in a dataset where fraud rarely occurs. If the sample does not contain sufficient instances of fraud, it will be hard to fit a model. Training against the entire dataset requires a distributed ML framework. It's a tradeoff between spending more data scientist hours or more compute hours.
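If sampling is the route taken, a stratified sample is the usual safeguard so that the rare fraud cases are not diluted away. A minimal sketch using PySpark's sampleBy, with a hypothetical label column and fractions:

# Keep every fraud record (label 1) but only 1% of non-fraud records (label 0),
# so the rare class remains well represented in the training sample.
fractions = {0: 0.01, 1: 1.0}
sample_df = df.stat.sampleBy("label", fractions, seed=42)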

The breakthrough came in 2015 when the Dataframe API for Apache Spark was introduced. This allowed the processing of a Dataframe to be distributed across a cluster of worker nodes, just like the students grading tests. MPP databases have a known set of aggregate functions, like sum(), max(), and avg(), that can be applied on each worker. Distributing the work of ML is much more complex than distributing aggregate functions, because each algorithm is a bit different in its implementation. Both the Dataframe and the ML framework must be distributed in order to train models against large datasets. Consider the following example, which creates a simple text classification model using MLlib:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Split the raw text into words, hash the words into feature vectors,
# and train a logistic regression classifier on those features.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Both the Dataframe and the fit() call are distributed across the cluster.
df = spark.read.load("/path/to/data")
model = pipeline.fit(df)

If you examine the dependency injection in the above example, you'll notice that the Dataframe "df" is injected into the ML framework to generate a model. Because both the client and the service are distributed, a single model can be trained across all of the data.
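Once fit() returns, the resulting model is itself distributed: scoring new data is just another Dataframe operation that runs across the cluster. A short sketch, assuming a hypothetical path to new data:

# Apply the fitted pipeline to new records; predictions are computed
# in parallel across the workers, just like the training pass.
new_df = spark.read.load("/path/to/new_data")
predictions = model.transform(new_df)
predictions.select("text", "prediction").show()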

By comparison, Snowflake was not built for distributed ML. Rather, the only way to perform model training with Snowflake is to do the reverse: inject an ML framework, like scikit-learn, into its proprietary Snowpark Dataframe via a UDF call. There is no ML framework on the market, not even one from Snowflake, that can accept a Snowpark Dataframe as a dependency.

ML on Snowflake requires the following steps (sketched in code below):

  1. Group the data
  2. Pass the data into a UDF
  3. Convert the data to a pandas Dataframe
  4. Train a model, single-threaded, via scikit-learn
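The sketch below illustrates that per-group, single-node pattern. It is written with PySpark's applyInPandas and scikit-learn rather than Snowflake's actual Snowpark API, and the group, feature, and label columns are hypothetical, but the shape of the workflow is the same: each group is collected onto one worker, converted to pandas, and trained single-threaded.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Each group of rows is collected onto a single worker, handed over as a
# pandas Dataframe, and a separate model is trained single-threaded there.
def train_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    features = pdf[["feature1", "feature2"]]
    model = LogisticRegression(max_iter=100).fit(features, pdf["label"])
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "train_accuracy": [model.score(features, pdf["label"])]})

results = (df.groupBy("group")
             .applyInPandas(train_per_group, schema="group string, train_accuracy double"))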

Because a single model gets trained on a single worker node, the following limitations are introduced:

  • If one of the groups of data is more than a few thousand records, it fails
  • If one call of the UDF consumes too much memory, it fails
  • If one of the UDFs takes more than 60 seconds to execute, it fails

A Data Lakehouse is the most scalable way to perform ML. It was architected from the beginning for the storage and compute layers to be both distributed and open. This enabled a community to build tools, like MLlib, that perform ML at scale. When McGraw Hill Education needed to train against over 10 million student interactions, they chose the Databricks Lakehouse to do so. This allowed them to provide personalized recommendations that increased student pass rates by over 13%.

The next time your data team gets a pop quiz, make sure they are equipped with a Data Lakehouse so they can scale with the data and deliver a perfect score.

James Grogan

Data Architect, Data Engineer, Integration and BI Specialist (Data Warehouse and related ecosystems)

You may wish to review TeradataML, which has both a DF and SQL workflow. Python / R front-ends are available, along with a fully parallel-aware (where the algorithm permits) function set.

Rogier Werschkull

Head of Data @ Coolgradient | Data-analytics trainer | Rockstar & AI artist @ night ;) | Love calling bullshit

Nick Akincilar or Daan Bakboord: any comments on the ML limitations of Snowflake as explained in this piece?

Paul Johnson

Software, Cloud & Analytics | Data Engineering | Feed AI, BI & Data Science with Quality Data

“the only option for scalable ML” is a very bold claim indeed. What about MADlib running on an MPP Greenplum cluster? The data can reside either externally or be loaded to the DBMS should you choose.

Rogier Werschkull

Head of Data @ Coolgradient | Data-analytics trainer | Rockstar & AI artist @ night ;) | Love calling bullshit

AFAIK a 'lakehouse' is not a technology but a data architectural pattern: it is about re-introducing the 'old' data warehousing concerns of organising data in a subject-oriented manner & data integration to all the failed data lake implementations. As in:
  • The data lake adage of 'just make sense of all this raw data yourself' failed because it obviously does not scale.
  • Thus, data lakes became data swamps, AKA 'Hotel California'.
  • People at Databricks pitched 'data lakehouse' as a new term for doing data warehousing in a different tech stack.
  • Which mainly is 'old wine in new bottles', as in: it is 'just data warehousing' where you separate your time-variant / non-volatile concerns (= lake) from your subject-oriented / integrated ones (= house).
