Laktory SparkChain - Serializable Spark-based data transformations

In our previous article, we explored the pros and cons of using Spark versus SQL for data transformations within data pipelines. We concluded that while Spark excels in creating modular and scalable transformations, it falls short in the portability and declarative simplicity offered by SQL queries. Today, we will delve deeper into Laktory's SparkChain model, which aims to integrate the strengths of both technologies.

PySpark Code

Our code snippet will:

  • load some stock prices data
  • remove duplicates
  • create columns
  • aggregate into monthly metrics
  • join some metadata
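Here is a minimal sketch of what such a script might look like. The table and column names (brz_stock_prices, stock_metadata, created_at, symbol, open, close) are illustrative rather than taken from the original snippet.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load some stock prices data (illustrative table name)
df = spark.read.table("brz_stock_prices")

# Remove duplicates
df = df.dropDuplicates(["created_at", "symbol"])

# Create columns
df = df.select("created_at", "symbol", "open", "close")
df = df.withColumn("created_month", F.date_trunc("month", F.col("created_at")))

# Aggregate into monthly metrics
df = df.groupby("symbol", "created_month").agg(
    F.max("open").alias("max_open"),
    F.min("close").alias("min_close"),
)

# Join some metadata (illustrative table name)
df_meta = spark.read.table("stock_metadata").select("symbol", "currency", "first_traded")
df = df.join(df_meta, on="symbol", how="left")
```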

The code structure and comments clearly delineate the data flow, making it easy to understand. By leveraging Spark's chainable operations, we can effectively break down each transformation into its own distinct block. Running this script on a Spark-enabled cluster yields the desired DataFrame. However, what if we wish to abstract the transformations from the code? Or what if we need to apply the same transformations to different inputs? This is where Laktory becomes invaluable.

Spark Chain

The Laktory SparkChain model defines a series of serializable nodes, where each node represents a specific data transformation, articulated through a Spark function and its parameters. Each transformation is consistently applied to the output of the preceding node, ensuring a seamless and orderly data processing flow.

Basic Node

For example, the select method from the previous snippet may be rewritten as:
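(The sketch below is illustrative: the field names spark_func_name and spark_func_args reflect my reading of the SparkChain schema, and the Laktory documentation remains the authoritative reference.)

```python
from laktory import models

# A single-node chain; field names are assumptions about the SparkChain schema
sc = models.SparkChain(
    nodes=[
        {
            "spark_func_name": "select",
            "spark_func_args": ["created_at", "symbol", "open", "close"],
        },
    ]
)

# Executing the chain applies each node to the output of the previous one.
# df is the DataFrame loaded in the PySpark snippet above.
df_out = sc.execute(df)
```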

In this example, a single node is defined and specifies which Spark function to use (select) and which arguments to pass to this function ("created_at", "symbol", etc.). Upon executing the chain, the output mirrors that of directly applying the select method on the DataFrame. This straightforward example can be expanded to encompass all transformations outlined in the initial example, demonstrating the model's flexibility and scalability.

Each node in the chain follows the same fundamental principles as the select node defined earlier, but there are several nuances worth noting.

Column node


The third node introduces the "created_month" column and takes advantage of a special feature supported by a chain node. When the "column" dictionary is specified, the expected output for the Spark function is a column, rather than a DataFrame. This newly defined column is then added to the DataFrame output of the preceding node.
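As a sketch, such a column node might be declared like this (the column dictionary keys and the other field names are assumptions about the schema):

```python
# Hypothetical column node: the "column" dict signals that the Spark function
# returns a Column, which is appended to the DataFrame from the previous node.
node = {
    "column": {"name": "created_month", "type": "timestamp"},
    "spark_func_name": "date_trunc",
    "spark_func_args": ["month", "created_at"],
}
```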

Custom function node

Node four utilizes the "groupby_and_agg" function, which is not a standard Spark function but is instead provided by Laktory's library. Full details about this function are available in the documentation. While the same result could be achieved using separate groupby and agg nodes, this method simplifies the process by consolidating them into a single operation.
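As a sketch, such a node might look like this (the keyword names groupby_columns and agg_expressions are assumptions about the helper's signature; the documentation has the exact details):

```python
# Hypothetical node using Laktory's groupby_and_agg helper in a single step
node = {
    "spark_func_name": "groupby_and_agg",
    "spark_func_kwargs": {
        "groupby_columns": ["symbol", "created_month"],
        "agg_expressions": [
            {"name": "max_open", "expr": "F.max('open')"},
            {"name": "min_close", "expr": "F.min('close')"},
        ],
    },
}
```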

Node with Data Source model


The fifth and final node is a join operation. Here, the value provided for "other" is a table data source, one of Laktory's models capable of reading data from either a table or a pipeline, depending on the execution context. Notably, this table data source utilizes a Spark chain to perform some local transformations before passing the data to the join function. The use of this table data source model enables a fully serializable definition of the transformation, avoiding the need for complex objects like a DataFrame.
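As a sketch, such a node could look like this (the shape of the table data source and the join keyword arguments are assumptions made for illustration):

```python
# Hypothetical join node: "other" is a serializable table data source,
# itself carrying a small spark chain, instead of a DataFrame object.
node = {
    "spark_func_name": "join",
    "spark_func_kwargs": {
        "other": {
            "table_name": "stock_metadata",
            "spark_chain": {
                "nodes": [
                    {
                        "spark_func_name": "select",
                        "spark_func_args": ["symbol", "currency", "first_traded"],
                    },
                ],
            },
        },
        "on": ["symbol"],
        "how": "left",
    },
}
```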

At this point, you might be wondering, "This is interesting, but what's the added value? Why would I choose to build this extensive dictionary instead of directly applying methods to my DataFrame?"

Code Abstraction

In this specific format, I would tend to agree that the added value isn't immediately apparent. However, the method truly begins to shine when you decouple the transformations from the code by storing them in a configuration file, such as YAML:
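(The YAML below is illustrative; the file name and field names follow the same assumed schema as the earlier sketches.)

```yaml
# stock_prices_chain.yaml (illustrative)
nodes:
  - spark_func_name: drop_duplicates
    spark_func_kwargs:
      subset: ["created_at", "symbol"]
  - spark_func_name: select
    spark_func_args: ["created_at", "symbol", "open", "close"]
  - column:
      name: created_month
      type: timestamp
    spark_func_name: date_trunc
    spark_func_args: ["month", "created_at"]
  - spark_func_name: groupby_and_agg
    spark_func_kwargs:
      groupby_columns: ["symbol", "created_month"]
      agg_expressions:
        - name: max_open
          expr: F.max('open')
        - name: min_close
          expr: F.min('close')
```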

With the actual code reducing to:
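(A sketch based on the YAML above; Laktory may offer a more direct YAML loader, but plain pydantic validation is shown here.)

```python
import yaml
from laktory import models
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the serialized transformation definition
with open("stock_prices_chain.yaml") as fp:
    sc = models.SparkChain.model_validate(yaml.safe_load(fp))

# Apply it to any compatible input DataFrame
df = spark.read.table("brz_stock_prices")
df = sc.execute(df)
```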


This approach enables highly generic code that can be applied across various scenarios. The data engineer only needs to update the configuration file, which keeps the focus squarely on data transformation rather than on code cleanup or refactoring. This clarity enhances the visibility of the intent behind each change. Additionally, this method offers significant portability benefits. The serializable configuration, whether stored as a YAML or JSON file, contains all the transformation instructions. The sole requirement is that the Laktory package be installed on the cluster, simplifying deployment and execution across different environments.

Modularity

In my previous post, I extensively praised Spark's modularity. You might wonder if working with a YAML configuration means we lose the ability to create modular components. Not at all! Since each node can itself be a Spark chain, you can create reusable chains across multiple data flows. Consider a chain designed to remove duplicates and null values:
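(Same illustrative schema as before.)

```yaml
# clean.yaml: a reusable cleaning chain (illustrative)
nodes:
  - spark_func_name: drop_duplicates
  - spark_func_name: dropna
```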

Other chains may reuse this cleaning chain:
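(One way to express this, assuming a node may itself hold a nodes list, i.e. a nested chain.)

```yaml
# monthly_metrics_chain.yaml (illustrative)
nodes:
  # This node is itself a chain: the reusable cleaning steps
  - nodes:
      - spark_func_name: drop_duplicates
      - spark_func_name: dropna
  - spark_func_name: select
    spark_func_args: ["created_at", "symbol", "open", "close"]
```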

Expandability

Another significant advantage of using PySpark is the ability to leverage Python's extensive library ecosystem. You might wonder if we lose this capability when working with SparkChain. Absolutely not! Upon execution, SparkChain searches for PySpark built-in functions, but it also fully supports the integration of custom functions. For instance, if you have a dataset containing coordinates and you need to compute the distance between cities, you could configure the file as follows:
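(Column names and the way the UDF is referenced by its name are assumptions.)

```yaml
nodes:
  - column:
      name: distance_km
      type: double
    spark_func_name: haversine_udf
    spark_func_args: ["lat1", "lon1", "lat2", "lon2"]
```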

with the supporting code:
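(A sketch; the udfs keyword passed to execute reflects my understanding of how custom functions are handed to the chain and should be treated as an assumption.)

```python
import math

import pyspark.sql.functions as F
import pyspark.sql.types as T


@F.udf(returnType=T.DoubleType())
def haversine_udf(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Pass the UDF to the chain at execution time
df = sc.execute(df, udfs=[haversine_udf])
```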

The only requirement is to create the haversine_udf UDF (user-defined function) and pass it to the chain's execute command.

Closing remarks

In this post, we took a deep dive into the Laktory SparkChain model, which lets you declare a series of Spark transformations as a serializable configuration. This approach preserves the modularity and expandability of Spark while introducing a level of abstraction that shifts the focus from code implementation to data transformation, making it particularly well-suited for building scalable data pipelines. The SparkChain class can be used on its own, as demonstrated in this post, but its power is amplified when it is employed to define a data pipeline, which is how it is typically used when constructing Laktory stacks. Here's an example:
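(A heavily simplified, illustrative sketch: the keys surrounding the nested spark_chain are assumptions about the stack schema rather than the documented format.)

```yaml
# Illustrative only: a silver table whose builder applies a SparkChain
name: slv_stock_prices
builder:
  layer: SILVER
  table_source:
    name: brz_stock_prices
  spark_chain:
    nodes:
      - spark_func_name: drop_duplicates
      - spark_func_name: select
        spark_func_args: ["created_at", "symbol", "open", "close"]
```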

I hope the content of this post was useful and that it made you want to try SparkChain in your own project. Don't hesitate to reach out if you have any questions!
