Laktory SparkChain - Serializable Spark-based data transformations

In our previous article, we explored the pros and cons of using Spark versus SQL for data transformations within data pipelines. We concluded that while Spark excels in creating modular and scalable transformations, it falls short in the portability and declarative simplicity offered by SQL queries. Today, we will delve deeper into Laktory's SparkChain model, which aims to integrate the strengths of both technologies.

PySpark Code

Our code snippet will:

  • load some stock prices data
  • remove duplicates
  • create columns
  • aggregate into monthly metrics
  • join some metadata
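Here is a minimal sketch of what such a script might look like. The table and column names (brz_stock_prices, stock_metadata, created_at, symbol, open, close) are illustrative rather than taken from the original snippet.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load some stock prices data (illustrative table name)
df = spark.read.table("brz_stock_prices")

# Remove duplicates
df = df.dropDuplicates(["created_at", "symbol"])

# Create columns
df = df.select("created_at", "symbol", "open", "close")
df = df.withColumn("created_month", F.date_trunc("month", F.col("created_at")))

# Aggregate into monthly metrics
df = df.groupby("symbol", "created_month").agg(
    F.max("open").alias("max_open"),
    F.min("close").alias("min_close"),
)

# Join some metadata (illustrative table name)
df_meta = spark.read.table("stock_metadata").select("symbol", "currency", "first_traded")
df = df.join(df_meta, on="symbol", how="left")
```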

The code structure and comments clearly delineate the data flow, making it easy to understand. By leveraging Spark's chainable operations, we can effectively break down each transformation into its own distinct block. Running this script on a Spark-enabled cluster yields the desired DataFrame. However, what if we wish to abstract the transformations from the code? Or what if we need to apply the same transformations to different inputs? This is where Laktory becomes invaluable.

Spark Chain

The Laktory SparkChain model defines a series of serializable nodes, where each node represents a specific data transformation, articulated through a Spark function and its parameters. Each transformation is consistently applied to the output of the preceding node, ensuring a seamless and orderly data processing flow.

Basic Node

For example, the select method from the previous snippet may be rewritten as:
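(The sketch below is illustrative: the field names spark_func_name and spark_func_args reflect my reading of the SparkChain schema, and the Laktory documentation remains the authoritative reference.)

```python
from laktory import models

# A single-node chain; field names are assumptions about the SparkChain schema
sc = models.SparkChain(
    nodes=[
        {
            "spark_func_name": "select",
            "spark_func_args": ["created_at", "symbol", "open", "close"],
        },
    ]
)

# Executing the chain applies each node to the output of the previous one.
# df is the DataFrame loaded in the PySpark snippet above.
df_out = sc.execute(df)
```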

In this example, a single node is defined and specifies which Spark function to use (select) and which arguments to pass to this function ("created_at", "symbol", etc.). Upon executing the chain, the output mirrors that of directly applying the select method on the DataFrame. This straightforward example can be expanded to encompass all transformations outlined in the initial example, demonstrating the model's flexibility and scalability.

Each node in the chain follows the same fundamental principles as the select node defined earlier, but there are several nuances worth noting.

Column node


The third node introduces the "created_month" column and takes advantage of a special feature supported by a chain node. When the "column" dictionary is specified, the expected output for the Spark function is a column, rather than a DataFrame. This newly defined column is then added to the DataFrame output of the preceding node.
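As a sketch, such a column node might be declared like this (the column dictionary keys and the other field names are assumptions about the schema):

```python
# Hypothetical column node: the "column" dict signals that the Spark function
# returns a Column, which is appended to the DataFrame from the previous node.
node = {
    "column": {"name": "created_month", "type": "timestamp"},
    "spark_func_name": "date_trunc",
    "spark_func_args": ["month", "created_at"],
}
```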

Custom function node

Node four utilizes the "groupby_and_agg" function, which is not a standard Spark function but is instead provided by Laktory's library. Full details about this function are available in the documentation. While the same result could be achieved using separate groupby and agg nodes, this method simplifies the process by consolidating them into a single operation.
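As a sketch, such a node might look like this (the keyword names groupby_columns and agg_expressions are assumptions about the helper's signature; the documentation has the exact details):

```python
# Hypothetical node using Laktory's groupby_and_agg helper in a single step
node = {
    "spark_func_name": "groupby_and_agg",
    "spark_func_kwargs": {
        "groupby_columns": ["symbol", "created_month"],
        "agg_expressions": [
            {"name": "max_open", "expr": "F.max('open')"},
            {"name": "min_close", "expr": "F.min('close')"},
        ],
    },
}
```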

Node with Data Source model


The fifth and final node is a join operation. Here, the value provided for "other" is a table data source, one of Laktory's models capable of reading data from either a table or a pipeline, depending on the execution context. Notably, this table data source utilizes a Spark chain to perform some local transformations before passing the data to the join function. The use of this table data source model enables a fully serializable definition of the transformation, avoiding the need for complex objects like a DataFrame.
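As a sketch, such a node could look like this (the shape of the table data source and the join keyword arguments are assumptions made for illustration):

```python
# Hypothetical join node: "other" is a serializable table data source,
# itself carrying a small spark chain, instead of a DataFrame object.
node = {
    "spark_func_name": "join",
    "spark_func_kwargs": {
        "other": {
            "table_name": "stock_metadata",
            "spark_chain": {
                "nodes": [
                    {
                        "spark_func_name": "select",
                        "spark_func_args": ["symbol", "currency", "first_traded"],
                    },
                ],
            },
        },
        "on": ["symbol"],
        "how": "left",
    },
}
```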

At this point, you might be wondering, "This is interesting, but what's the added value? Why would I choose to build this extensive dictionary instead of directly applying methods to my DataFrame?"

Code Abstraction

In this specific format, I would tend to agree that the added value isn't immediately apparent. However, the method truly begins to shine when you decouple the transformations from the code by storing them in a configuration file, such as YAML:
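(The YAML below is illustrative; the file name and field names follow the same assumed schema as the earlier sketches.)

```yaml
# stock_prices_chain.yaml (illustrative)
nodes:
  - spark_func_name: drop_duplicates
    spark_func_kwargs:
      subset: ["created_at", "symbol"]
  - spark_func_name: select
    spark_func_args: ["created_at", "symbol", "open", "close"]
  - column:
      name: created_month
      type: timestamp
    spark_func_name: date_trunc
    spark_func_args: ["month", "created_at"]
  - spark_func_name: groupby_and_agg
    spark_func_kwargs:
      groupby_columns: ["symbol", "created_month"]
      agg_expressions:
        - name: max_open
          expr: F.max('open')
        - name: min_close
          expr: F.min('close')
```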

With the actual code reducing to:
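(A sketch based on the YAML above; Laktory may offer a more direct YAML loader, but plain pydantic validation is shown here.)

```python
import yaml
from laktory import models
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the serialized transformation definition
with open("stock_prices_chain.yaml") as fp:
    sc = models.SparkChain.model_validate(yaml.safe_load(fp))

# Apply it to any compatible input DataFrame
df = spark.read.table("brz_stock_prices")
df = sc.execute(df)
```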


This approach enables highly generic code that can be applied across various scenarios. The data engineer only needs to update the configuration file, which keeps the focus squarely on data transformation rather than on code cleanup or refactoring. This clarity enhances the visibility of the intent behind each change. Additionally, this method offers significant portability benefits. The serializable configuration, whether stored as a YAML or JSON file, contains all the transformation instructions. The sole requirement is that the Laktory package be installed on the cluster, simplifying deployment and execution across different environments.

Modularity

In my previous post, I extensively praised Spark's modularity. You might wonder if working with a YAML configuration means we lose the ability to create modular components. Not at all! Since each node can itself be a Spark chain, you can create reusable chains across multiple data flows. Consider a chain designed to remove duplicates and null values:
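(Same illustrative schema as before.)

```yaml
# clean.yaml: a reusable cleaning chain (illustrative)
nodes:
  - spark_func_name: drop_duplicates
  - spark_func_name: dropna
```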

Other chains may reuse this cleaning chain:
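(One way to express this, assuming a node may itself hold a nodes list, i.e. a nested chain.)

```yaml
# monthly_metrics_chain.yaml (illustrative)
nodes:
  # This node is itself a chain: the reusable cleaning steps
  - nodes:
      - spark_func_name: drop_duplicates
      - spark_func_name: dropna
  - spark_func_name: select
    spark_func_args: ["created_at", "symbol", "open", "close"]
```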

Expandability

Another significant advantage of using PySpark is the ability to leverage Python's extensive library ecosystem. You might wonder if we lose this capability when working with SparkChain. Absolutely not! Upon execution, SparkChain searches for PySpark built-in functions, but it also fully supports the integration of custom functions. For instance, if you have a dataset containing coordinates and you need to compute the distance between cities, you could configure the file as follows:
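(Column names and the way the UDF is referenced by its name are assumptions.)

```yaml
nodes:
  - column:
      name: distance_km
      type: double
    spark_func_name: haversine_udf
    spark_func_args: ["lat1", "lon1", "lat2", "lon2"]
```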

with the supporting code:
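(A sketch; the udfs keyword passed to execute reflects my understanding of how custom functions are handed to the chain and should be treated as an assumption.)

```python
import math

import pyspark.sql.functions as F
import pyspark.sql.types as T


@F.udf(returnType=T.DoubleType())
def haversine_udf(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Pass the UDF to the chain at execution time
df = sc.execute(df, udfs=[haversine_udf])
```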

The only requirement is to create the haversine_udf UDF (user-defined function) and pass it to the chain's execute command.

Closing remarks

In this post, we took a deep dive into the Laktory SparkChain model, which lets you declare a series of Spark transformations as a serializable configuration. This approach preserves the modularity and expandability of Spark while introducing a level of abstraction that shifts the focus from code implementation to data transformation, making it particularly well-suited for building scalable data pipelines. The SparkChain class can be used on its own, as demonstrated in this post, but its power is amplified when it is employed to define a data pipeline, which is how it is typically used when constructing Laktory stacks. Here's an example:
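(A heavily simplified, illustrative sketch: the keys surrounding the nested spark_chain are assumptions about the stack schema rather than the documented format.)

```yaml
# Illustrative only: a silver table whose builder applies a SparkChain
name: slv_stock_prices
builder:
  layer: SILVER
  table_source:
    name: brz_stock_prices
  spark_chain:
    nodes:
      - spark_func_name: drop_duplicates
      - spark_func_name: select
        spark_func_args: ["created_at", "symbol", "open", "close"]
```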

I hope the content of this post was useful and that it made you want to try SparkChain in your own project. Don't hesitate to reach out if you have any questions!
