Databricks Photon and its relation to Apache Spark
Understanding what Databricks Photon is can be surprisingly difficult. In particular, Photon's relation to Apache Spark is a common source of confusion. In this article I aim to shed some light on that relation.
What Photon is not
Here are three statements that help us understand what Photon is not:

- Photon is not a replacement for Apache Spark.
- Photon is not a modification of the Spark execution framework.
- Photon is not a complete replacement of Spark's native execution engine; it replaces it only partially.
What Photon is
Photon is a query engine. It partially replaces Apache Spark's native execution engine when running your workload on a Photon-enabled Databricks cluster. The diagram below is taken from Databricks' original Photon paper; it shows Photon's place within a cluster.
A Photon task runs within the Spark execution framework (grey block in the middle), and the execution framework is part of the Databricks Runtime. Thus, a strict hierarchy applies:
Databricks Runtime > Spark execution framework > Photon task
Databricks Runtime
A Databricks cluster does not run open-source Apache Spark. Databricks clusters are powered by the Databricks Runtime, a highly customized (optimized) version of Spark. Perhaps the most impactful optimization is Photon.
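If you want to verify which runtime a cluster is running, the version string is exposed through the Spark configuration. A minimal sketch in PySpark, assuming the Databricks-specific key spark.databricks.clusterUsageTags.sparkVersion is available (it is a Databricks cluster tag, not part of open-source Spark):

```python
# Sketch: inspect the runtime version on a Databricks cluster.
# Assumes a live SparkSession named `spark` (provided in Databricks notebooks)
# and the Databricks-specific config key below; neither exists in open-source Spark.
runtime_version = spark.conf.get(
    "spark.databricks.clusterUsageTags.sparkVersion", "unknown"
)
print(runtime_version)  # e.g. "14.3.x-photon-scala2.12" on a Photon-enabled cluster

# A Photon-enabled Databricks Runtime typically embeds "photon" in the version string.
if "photon" in runtime_version:
    print("This cluster runs a Photon-enabled Databricks Runtime")
```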
Spark execution framework
The Spark execution framework consists of a single driver node and one or more executor nodes. The driver node is responsible for workload orchestration and takes care of centralized tasks such as query planning and task scheduling. The driver assigns data processing tasks to the executors. Multiple tasks run concurrently and the execution framework combines the results of individual tasks. A data processing task that runs on an executor can be either a Photon task or a native Spark task.
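To make the division of labour concrete, here is a minimal PySpark sketch (names and numbers are illustrative): the driver plans the query, each partition of the data becomes one task that runs on an executor, and the framework merges the partial results.

```python
from pyspark.sql import SparkSession

# The SparkSession lives on the driver; it plans queries and schedules tasks.
spark = SparkSession.builder.appName("task-demo").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)

# Each of the 8 partitions becomes one data processing task; the execution
# framework runs these tasks on executors and combines their partial results.
print(df.rdd.getNumPartitions())           # 8
print(df.selectExpr("sum(id)").collect())  # driver combines the per-task results
```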
Photon task
Photon is positioned at the lowest level of the Databricks Runtime, and executes tasks on partitions of data on a single thread. Not all Spark operations are currently supported by Photon. For those operations that are not, the Spark execution framework will fall back to Spark's native engine. Tasks are assigned to Photon if possible, and to the native engine if not – Photon and Spark's native engine co-exist.
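One way to observe this co-existence is to inspect the physical plan: on a Photon-enabled cluster, operators executed by Photon appear with Photon-prefixed names, while unsupported operations (a Python UDF is a classic example) fall back to the native engine. A sketch; the exact operator names in the output depend on the runtime version:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000)

# Supported operations: on a Photon cluster the physical plan typically shows
# Photon operators such as PhotonProject or PhotonGroupingAgg.
df.groupBy((col("id") % 10).alias("bucket")).count().explain()

# A Python UDF is not executed by Photon, so this part of the plan
# falls back to Spark's native engine.
double_it = udf(lambda x: x * 2, LongType())
df.select(double_it("id")).explain()
```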
Comparing Spark's native engine with Photon
Differences
Here are the key differences between Spark's native execution engine and Photon.
Open versus closed
The native Spark execution engine is part of open-source Apache Spark. Photon is a closed-source, proprietary engine embedded in the Databricks Runtime, available exclusively to Databricks customers. Spark was originally built for big data workloads on unstructured data lakes – it does not provide best-in-class performance for more traditional data warehousing workloads. Photon was built to address this shortcoming. It specifically aims to improve performance on structured data, which is essential for Databricks to compete with other cloud data warehouses and to make the Lakehouse architecture a zero-compromise solution.
JVM versus C++
Spark's native engine runs in the Java Virtual Machine (JVM), while Photon is implemented in C++. The authors of the Photon paper cite performance limitations in the JVM as the main driver for the switch. An example is degraded garbage collection performance on heaps larger than 64 GB. Benefits of C++ include explicit control over memory management and access to Single Instruction, Multiple Data (SIMD) instructions, which are used extensively to optimize code execution.
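Photon's C++ internals are closed-source, but the effect of SIMD-style vectorization is easy to demonstrate. As a rough analogy in Python: NumPy dispatches to precompiled, vectorizable C loops, while a plain Python loop processes one value at a time. This is only an illustration of the principle, not Photon's implementation.

```python
import time
import numpy as np

values = np.random.rand(5_000_000)

# Scalar path: one value per iteration, analogous to tuple-at-a-time processing.
start = time.perf_counter()
total = 0.0
for v in values:
    total += v * 2.0
scalar_s = time.perf_counter() - start

# Vectorized path: a precompiled loop over the whole array, which the
# compiler can auto-vectorize with SIMD instructions.
start = time.perf_counter()
total_vec = (values * 2.0).sum()
vector_s = time.perf_counter() - start

print(f"scalar: {scalar_s:.2f}s, vectorized: {vector_s:.2f}s")
```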
Row-oriented versus columnar memory
Spark's native engine uses a row-oriented format for in-memory data. Photon did not adopt this format, and instead uses a columnar representation. Again, performance was the driver for this design decision. A benefit of a columnar format is that it is more amenable to SIMD instructions, which enable parallel, vectorized execution. Another benefit is that Parquet files—Photon's primary interface—also have a columnar layout. Where Spark's native engine requires an expensive column-to-row pivot, Photon does not.
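To illustrate the two layouts, here is a small sketch using PyArrow (which, like Parquet and Photon, is columnar): the same records stored row-by-row versus column-by-column. In the columnar form, all values of one column sit in a contiguous buffer, which is what makes vectorized, SIMD-friendly processing possible.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Row-oriented: each record is stored together, as in Spark's native engine.
rows = [
    {"id": 1, "amount": 9.99},
    {"id": 2, "amount": 4.50},
    {"id": 3, "amount": 7.25},
]

# Columnar: all values of one column are stored contiguously, as in Photon
# (and in Parquet files, so no column-to-row pivot is needed when reading).
table = pa.table({
    "id": pa.array([1, 2, 3]),
    "amount": pa.array([9.99, 4.50, 7.25]),
})

# Summing a column touches one contiguous buffer instead of every record.
print(sum(r["amount"] for r in rows))           # row-oriented access
print(pc.sum(table["amount"]).as_py())          # columnar access
```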
Code generation versus interpreted vectorization
Spark's native engine uses a code-generation design: a compiler produces specialized, query-specific code at runtime. Multiple operators are typically collapsed into a single pipeline function to optimize performance, at the cost of observability. Photon uses an interpreted vectorization design: data is processed in batches, and a pre-compiled code path is dynamically assigned to each batch at runtime. The latter approach can leverage batch-specific statistics at runtime to choose the optimal code path, a process called batch-level adaptivity.
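The following toy sketch (entirely illustrative, not Photon's actual code) shows the idea behind interpreted vectorization: the engine ships with precompiled kernels and picks one per batch at runtime, for instance a faster kernel when a batch happens to contain no NULLs.

```python
from typing import List, Optional

Batch = List[Optional[float]]

# Two precompiled code paths for the same operator ("multiply by 2").
def kernel_no_nulls(batch: Batch) -> Batch:
    # Fast path: a tight loop with no per-value NULL checks.
    return [v * 2.0 for v in batch]

def kernel_with_nulls(batch: Batch) -> Batch:
    # Slow path: every value needs a NULL check.
    return [None if v is None else v * 2.0 for v in batch]

def execute(batches: List[Batch]) -> List[Batch]:
    out = []
    for batch in batches:
        # Batch-level adaptivity: inspect per-batch statistics at runtime
        # and dispatch to the optimal precompiled kernel.
        has_nulls = any(v is None for v in batch)
        kernel = kernel_with_nulls if has_nulls else kernel_no_nulls
        out.append(kernel(batch))
    return out

print(execute([[1.0, 2.0, 3.0], [4.0, None, 6.0]]))
```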
Overlap
Spark's native engine and Photon serve the same purpose: executing data processing tasks within the Spark execution framework. They also share the same user-facing APIs and have consistent semantics – this means that application code is portable between the two engines and guaranteed to produce the same outcome.
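In practice this means there is no Photon-specific API to learn. A sketch, assuming a live SparkSession named `spark`: nothing in the code below references Photon, yet on a Photon-enabled cluster its operators run in Photon where supported, and in the native engine everywhere else.

```python
# Ordinary DataFrame code; nothing here is Photon-specific.
result = (
    spark.range(1_000_000)
         .withColumnRenamed("id", "order_id")
         .selectExpr("order_id % 100 AS customer", "order_id AS amount")
         .groupBy("customer")
         .sum("amount")
)
# Which engine executes each operator is an internal decision of the
# Databricks Runtime; the result is identical either way.
result.show(5)
```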
Conclusion
Photon is a modern execution engine optimized for the Lakehouse. It does not replace Apache Spark, and it does not affect the Spark execution framework – it partially replaces Spark's native execution engine.