Databricks Photon and its relation to Apache Spark

Understanding what Databricks Photon is can be surprisingly difficult. In particular, Photon's relation to Apache Spark is a common source of confusion. In this article I aim to shed some light on both.

What Photon is not

Here are three statements that help us understand what Photon is not:

  • Photon is not a replacement for Apache Spark
  • Photon and Spark do not have the same scope
  • Photon operates within the Spark execution framework

What Photon is

Photon is a query engine. It partially replaces Apache Spark's native execution engine when running your workload on a Photon-enabled Databricks cluster. The diagram below is taken from Databricks' original Photon paper; it shows Photon's place within a cluster.

Photon's place within a Databricks cluster

A Photon task runs within the Spark execution framework (grey block in the middle), and the execution framework is part of the Databricks Runtime. Thus, a strict hierarchy applies:

Databricks Runtime > Spark execution framework > Photon task

Databricks Runtime

A Databricks cluster does not run open-source Apache Spark. Databricks clusters are powered by the Databricks Runtime, a highly customized and optimized version of Spark. Perhaps the most impactful optimization is Photon.

Spark execution framework

The Spark execution framework consists of a single driver node and one or more executor nodes. The driver node is responsible for workload orchestration and takes care of centralized tasks such as query planning and task scheduling. The driver assigns data processing tasks to the executors. Multiple tasks run concurrently and the execution framework combines the results of individual tasks. A data processing task that runs on an executor can be either a Photon task or a native Spark task.
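To make this division of labour concrete, here is a minimal PySpark sketch (standard DataFrame API; the row count and bucket column are made up for illustration). The driver builds and plans the query; once an action is triggered, each partition of the input is processed as a task on an executor:

    from pyspark.sql import SparkSession, functions as F

    # On Databricks a SparkSession already exists as `spark`;
    # getOrCreate() keeps this sketch self-contained elsewhere.
    spark = SparkSession.builder.getOrCreate()

    # Driver side: this only builds a logical plan – no data moves yet.
    df = (spark.range(0, 100_000_000)          # 100M synthetic rows
              .withColumn("bucket", F.col("id") % 8)
              .groupBy("bucket")
              .count())

    # The action makes the driver schedule tasks: every input partition
    # becomes a task running on an executor, and the framework combines
    # the per-task results back on the driver.
    df.collect()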

Photon task

Photon is positioned at the lowest level of the Databricks Runtime and executes tasks against partitions of data, each on a single thread. Not all Spark operations are currently supported by Photon; for those that are not, the Spark execution framework falls back to Spark's native engine. Tasks are assigned to Photon where possible and to the native engine otherwise – Photon and Spark's native engine co-exist.
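On a Photon-enabled cluster you can observe this co-existence in the physical query plan. Here is a sketch of how to inspect it (Photon operators typically carry a Photon prefix in the plan, though exact operator names may vary across Databricks Runtime versions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # predefined as `spark` on Databricks

    df = spark.range(1_000).withColumn("doubled", F.col("id") * 2)
    # Operators handled by Photon typically appear with a "Photon"
    # prefix (e.g. PhotonProject) in the physical plan.
    df.explain()

    # A Python UDF is a typical example of an operation Photon does not
    # execute; that part of the plan falls back to the native engine.
    add_one = F.udf(lambda x: x + 1, "long")
    df.withColumn("plus_one", add_one(F.col("id"))).explain()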

Comparing Spark's native engine with Photon

Differences

Here are the key differences between Spark's native execution engine and Photon.

Open versus closed

The native Spark execution engine is part of open-source Apache Spark. Photon is a closed-source, proprietary engine embedded in the Databricks Runtime and available exclusively to Databricks customers. Spark was originally built for big data workloads on unstructured data lakes – it does not provide best-in-class performance for more traditional data warehousing workloads. Photon was built to address this shortcoming. It specifically aims to improve performance on structured data, which is essential for Databricks to compete with other cloud data warehouses and to make the Lakehouse architecture a zero-compromise solution.

JVM versus C++

Spark's native engine runs in the Java Virtual Machine (JVM), while Photon is implemented in C++. The authors of the Photon paper cite performance limitations in the JVM as the main driver for the switch; an example is degraded garbage collection performance on heaps larger than 64 GB. Benefits of C++ include explicit control over memory management and access to Single Instruction, Multiple Data (SIMD) instructions, which are used extensively to optimize code execution.

Row-oriented versus columnar memory

Spark's native engine uses a row-oriented format for in-memory data. Photon did not adopt this format and instead uses a columnar representation. Again, performance was the driver for this design decision. One benefit of a columnar format is that it is more amenable to SIMD instructions, which enable parallel, vectorized execution. Another is that Parquet files – Photon's primary interface – also have a columnar layout: where Spark's native engine requires an expensive column-to-row pivot after reading them, Photon does not.
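As a conceptual illustration only – this is NumPy, not Photon's C++ implementation – the sketch below shows why a columnar layout is amenable to vectorized execution: summing a contiguous column is a single bulk operation, while the row-oriented variant touches one record at a time:

    import numpy as np

    n = 1_000_000

    # Row-oriented: a list of (id, amount) records processed one by one.
    rows = [(i, float(i % 100)) for i in range(n)]
    total_rows = sum(amount * 1.1 for _, amount in rows)

    # Columnar: the amount field is one contiguous array; a single
    # vectorized expression processes the whole column, which maps
    # naturally onto SIMD hardware.
    amounts = np.fromiter((float(i % 100) for i in range(n)), dtype=np.float64)
    total_cols = float((amounts * 1.1).sum())

    assert np.isclose(total_rows, total_cols)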

Code generation versus interpreted vectorization

Spark's native engine uses a code-generation design: a compiler produces specialized, query-specific code at runtime, and multiple operators are typically collapsed into a single pipeline function to optimize performance, at the cost of observability. Photon uses an interpreted-vectorization design: data is processed in batches, and a pre-compiled code path is dynamically assigned to each batch at runtime. The latter approach can exploit batch-specific statistics to choose the optimal code path, a process called batch-level adaptivity.
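A toy sketch of batch-level adaptivity (hypothetical Python, not Photon's actual kernels): the engine keeps several pre-compiled variants of one operator and, per batch, dispatches to the variant whose fast-path assumption holds – here, the absence of NULLs:

    import numpy as np

    def add_one_no_nulls(values, _validity):
        # Fast path: every value is valid, so no per-element branching.
        return values + 1

    def add_one_nullable(values, validity):
        # General path: only update values whose validity bit is set.
        out = values.copy()
        out[validity] += 1
        return out

    def add_one(values, validity):
        # Batch-level adaptivity: inspect cheap per-batch statistics at
        # runtime and pick the most specialized pre-compiled code path.
        if validity.all():
            return add_one_no_nulls(values, validity)
        return add_one_nullable(values, validity)

    batch = np.array([1, 2, 3, 4], dtype=np.int64)
    valid = np.array([True, True, False, True])
    print(add_one(batch, valid))  # [2 3 3 5] – the "null" slot is untouched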

Overlap

Spark's native engine and Photon serve the same purpose: executing data processing tasks within the Spark execution framework. They also share the same user-facing APIs and have consistent semantics – application code is therefore portable between the two engines and guaranteed to produce the same results.
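For example, this ordinary DataFrame query (the path and column names are placeholders) runs unchanged on a Photon-enabled and a non-Photon cluster and returns the same result; only the physical execution underneath differs:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    result = (spark.read.parquet("/path/to/sales")   # placeholder path
                  .where(F.col("amount") > 100)
                  .groupBy("customer_id")
                  .agg(F.sum("amount").alias("total")))
    result.show()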

A comparison of Spark's native execution engine and Photon

Conclusion

Photon is a modern execution engine optimized for the Lakehouse. It does not replace Apache Spark, and it does not affect the Spark execution framework – it partially replaces Spark's native execution engine.


Harry V.

8 months ago

Good article. What I hoped to read was the answer to 'how does Photon consume DBUs?'. It seems to be different, given the note on the Azure Databricks pricing page that "Enabling Photon will increase DBU count". Does it consume 100% on top of regular compute?
