Databricks Photon and its relation to Apache Spark

Understanding what Databricks Photon is can be surprisingly difficult. In particular, Photon's relation to Apache Spark is a common source of confusion. In this article I aim to shed some light on both.

What Photon is not

Here are three statements that help us understand what Photon is not:

  • Photon is not a replacement for Apache Spark
  • Photon and Spark do not have the same scope
  • Photon operates within the Spark execution framework

What Photon is

Photon is a query engine. It partially replaces Apache Spark's native execution engine when running your workload on a Photon-enabled Databricks cluster. The diagram below is taken from Databricks' original Photon paper; it shows Photon's place within a cluster.

Photon's place within a Databricks cluster

A Photon task runs within the Spark execution framework (grey block in the middle), and the execution framework is part of the Databricks Runtime. Thus, a strict hierarchy applies:

Databricks Runtime > Spark execution framework > Photon task

Databricks Runtime

A Databricks cluster does not run open-source Apache Spark. Databricks clusters are powered by the Databricks Runtime, a highly customized and optimized version of Spark. Perhaps the most impactful optimization is Photon.

Spark execution framework

The Spark execution framework consists of a single driver node and one or more executor nodes. The driver node is responsible for workload orchestration and takes care of centralized tasks such as query planning and task scheduling. The driver assigns data processing tasks to the executors. Multiple tasks run concurrently and the execution framework combines the results of individual tasks. A data processing task that runs on an executor can be either a Photon task or a native Spark task.
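To make this division of labour concrete, here is a minimal PySpark sketch (standard DataFrame API; the row count and bucket column are made up for illustration). The driver builds and plans the query; once an action is triggered, each partition of the input is processed as a task on an executor:

    from pyspark.sql import SparkSession, functions as F

    # On Databricks a SparkSession already exists as `spark`;
    # getOrCreate() keeps this sketch self-contained elsewhere.
    spark = SparkSession.builder.getOrCreate()

    # Driver side: this only builds a logical plan – no data moves yet.
    df = (spark.range(0, 100_000_000)          # 100M synthetic rows
              .withColumn("bucket", F.col("id") % 8)
              .groupBy("bucket")
              .count())

    # The action makes the driver schedule tasks: every input partition
    # becomes a task running on an executor, and the framework combines
    # the per-task results back on the driver.
    df.collect()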

Photon task

Photon is positioned at the lowest level of the Databricks Runtime and executes tasks against partitions of data, each on a single thread. Not all Spark operations are currently supported by Photon; for those that are not, the Spark execution framework falls back to Spark's native engine. Tasks are assigned to Photon where possible and to the native engine otherwise – Photon and Spark's native engine co-exist.
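On a Photon-enabled cluster you can observe this co-existence in the physical query plan. Here is a sketch of how to inspect it (Photon operators typically carry a Photon prefix in the plan, though exact operator names may vary across Databricks Runtime versions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # predefined as `spark` on Databricks

    df = spark.range(1_000).withColumn("doubled", F.col("id") * 2)
    # Operators handled by Photon typically appear with a "Photon"
    # prefix (e.g. PhotonProject) in the physical plan.
    df.explain()

    # A Python UDF is a typical example of an operation Photon does not
    # execute; that part of the plan falls back to the native engine.
    add_one = F.udf(lambda x: x + 1, "long")
    df.withColumn("plus_one", add_one(F.col("id"))).explain()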

Comparing Spark's native engine with Photon

Differences

Here are the key differences between Spark's native execution engine and Photon.

Open versus closed

The native Spark execution engine is part of open-source Apache Spark. Photon is a closed-source, proprietary engine embedded in the Databricks Runtime and available exclusively to Databricks customers. Spark was originally built for big data workloads on unstructured data lakes – it does not provide best-in-class performance for more traditional data warehousing workloads. Photon was built to address this shortcoming. It specifically aims to improve performance on structured data, which is essential for Databricks to compete with other cloud data warehouses and to make the Lakehouse architecture a zero-compromise solution.

JVM versus C++

Spark's native engine runs in the Java Virtual Machine (JVM), while Photon is implemented in C++. The authors of the Photon paper cite performance limitations in the JVM as the main driver for the switch; an example is degraded garbage collection performance on heaps larger than 64 GB. Benefits of C++ include explicit control over memory management and access to Single Instruction, Multiple Data (SIMD) instructions, which are used extensively to optimize code execution.

Row-oriented versus columnar memory

Spark's native engine uses a row-oriented format for in-memory data. Photon did not adopt this format and instead uses a columnar representation. Again, performance was the driver for this design decision. One benefit of a columnar format is that it is more amenable to SIMD instructions, which enable parallel, vectorized execution. Another is that Parquet files – Photon's primary interface – also have a columnar layout: where Spark's native engine requires an expensive column-to-row pivot after reading them, Photon does not.
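As a conceptual illustration only – this is NumPy, not Photon's C++ implementation – the sketch below shows why a columnar layout is amenable to vectorized execution: summing a contiguous column is a single bulk operation, while the row-oriented variant touches one record at a time:

    import numpy as np

    n = 1_000_000

    # Row-oriented: a list of (id, amount) records processed one by one.
    rows = [(i, float(i % 100)) for i in range(n)]
    total_rows = sum(amount * 1.1 for _, amount in rows)

    # Columnar: the amount field is one contiguous array; a single
    # vectorized expression processes the whole column, which maps
    # naturally onto SIMD hardware.
    amounts = np.fromiter((float(i % 100) for i in range(n)), dtype=np.float64)
    total_cols = float((amounts * 1.1).sum())

    assert np.isclose(total_rows, total_cols)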

Code generation versus interpreted vectorization

Spark's native engine uses a code-generation design: a compiler produces specialized, query-specific code at runtime, and multiple operators are typically collapsed into a single pipeline function to optimize performance, at the cost of observability. Photon uses an interpreted-vectorization design: data is processed in batches, and a pre-compiled code path is dynamically assigned to each batch at runtime. The latter approach can exploit batch-specific statistics to choose the optimal code path, a process called batch-level adaptivity.
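A toy sketch of batch-level adaptivity (hypothetical Python, not Photon's actual kernels): the engine keeps several pre-compiled variants of one operator and, per batch, dispatches to the variant whose fast-path assumption holds – here, the absence of NULLs:

    import numpy as np

    def add_one_no_nulls(values, _validity):
        # Fast path: every value is valid, so no per-element branching.
        return values + 1

    def add_one_nullable(values, validity):
        # General path: only update values whose validity bit is set.
        out = values.copy()
        out[validity] += 1
        return out

    def add_one(values, validity):
        # Batch-level adaptivity: inspect cheap per-batch statistics at
        # runtime and pick the most specialized pre-compiled code path.
        if validity.all():
            return add_one_no_nulls(values, validity)
        return add_one_nullable(values, validity)

    batch = np.array([1, 2, 3, 4], dtype=np.int64)
    valid = np.array([True, True, False, True])
    print(add_one(batch, valid))  # [2 3 3 5] – the "null" slot is untouched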

Overlap

Spark's native engine and Photon serve the same purpose: executing data processing tasks within the Spark execution framework. They also share the same user-facing APIs and have consistent semantics – application code is therefore portable between the two engines and guaranteed to produce the same results.
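For example, this ordinary DataFrame query (the path and column names are placeholders) runs unchanged on a Photon-enabled and a non-Photon cluster and returns the same result; only the physical execution underneath differs:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    result = (spark.read.parquet("/path/to/sales")   # placeholder path
                  .where(F.col("amount") > 100)
                  .groupBy("customer_id")
                  .agg(F.sum("amount").alias("total")))
    result.show()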

A comparison of Spark's native execution engine and Photon

Conclusion

Photon is a modern execution engine optimized for the Lakehouse. It does not replace Apache Spark, and it does not affect the Spark execution framework – it partially replaces Spark's native execution engine.


Harry V.

8 months ago

Good article. What I hoped to read was the answer to 'how does Photon consume DBUs?'. It seems to be different, given the note on the Azure Databricks pricing page that "Enabling Photon will increase DBU count". Does it consume 100% on top of regular compute?
