Photon: Revolutionizing Query Performance in Lakehouse Systems

Photon, Databricks' fast query engine for Lakehouse systems:



Figure 1: Databricks’ execution layer. Photon runs as part of the Databricks Runtime, which executes queries on a distributed cluster of public cloud VMs. Within these clusters, Photon executes tasks on partitions of data on a single thread.

Photon: The High-Speed Engine Powering Lakehouse Analytics

As organizations move to Lakehouse architectures, which combine the scalability of data lakes with the governance and performance of data warehouses, they face new challenges in query execution and data processing. Photon is a query engine developed by Databricks to address them: it targets Lakehouse workloads and relies on vectorized execution and native code for performance.


What is Photon?

Photon is a vectorized query engine built from the ground up in C++. It integrates seamlessly with the Databricks Runtime, accelerating SQL and Apache Spark workloads with adaptive, high-speed processing. Unlike traditional JVM-based engines, Photon capitalizes on columnar data layouts and advanced in-memory techniques to achieve superior performance on massive datasets stored in cloud environments like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.
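
To make "vectorized" and "columnar" concrete, here is a minimal Python sketch contrasting row-at-a-time evaluation with a batch-at-a-time kernel over columns. It is not Photon's code (Photon is written in C++); the names ColumnBatch, multiply_kernel, and the price/qty columns are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ColumnBatch:
    """Hypothetical columnar batch: one contiguous list per column."""
    price: List[float]
    qty: List[int]


def revenue_row_at_a_time(rows: List[Dict[str, float]]) -> List[float]:
    # Row-oriented: every row pays for its own field lookups and dispatch.
    return [row["price"] * row["qty"] for row in rows]


def multiply_kernel(left: List[float], right: List[int]) -> List[float]:
    # Batch kernel: a single call walks two whole columns in one tight loop.
    return [a * b for a, b in zip(left, right)]


def revenue_vectorized(batch: ColumnBatch) -> List[float]:
    # The expression price * qty is evaluated once per batch, not once per row.
    return multiply_kernel(batch.price, batch.qty)


if __name__ == "__main__":
    rows = [{"price": 1.5, "qty": 2}, {"price": 2.0, "qty": 3}]
    batch = ColumnBatch(price=[1.5, 2.0], qty=[2, 3])
    print(revenue_row_at_a_time(rows))
    print(revenue_vectorized(batch))
```

In pure Python the two versions run at similar speed. The point is the shape of the loop: in native code, a kernel over contiguous columns is the kind of loop a compiler can turn into SIMD instructions.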


Key Design Innovations

Photon's architecture incorporates several design choices that set it apart:

  1. Vectorized Execution: Photon processes data in batches rather than rows, improving CPU utilization through techniques like SIMD (Single Instruction, Multiple Data). This reduces interpretation overhead and allows efficient pipeline parallelism. Common operations like hash joins, aggregations, and expression evaluations benefit from specialized vectorized kernels, which optimize memory access patterns.
  2. Native C++ Implementation: Moving away from JVM-based execution, Photon gives developers explicit control over memory management and low-level optimizations. This eliminates performance bottlenecks inherent in Java-based engines, such as garbage collection issues for large heaps and limits imposed by just-in-time (JIT) compilation.
  3. Column-Oriented Design: Photon uses a columnar in-memory data representation, which aligns well with columnar file formats like Apache Parquet, the format Delta Lake uses for its data files. This design simplifies operations like serialization and enables tight loops for better cache performance and prefetching.
  4. Adaptive Execution: Photon dynamically adapts its execution based on data properties such as nullability, ASCII encoding, or sparsity. For instance:
     • Sparse Batch Compaction: Compacting sparse batches during hash joins to maximize memory parallelism.
     • Optimized String Encoding: Detecting UUID patterns and reducing shuffle data volume by encoding them as compact 128-bit integers.
  5. Seamless Integration with Databricks Runtime: Photon is compatible with Apache Spark’s SQL and DataFrame APIs, allowing incremental adoption. It supports fallback to the legacy Spark SQL engine for unsupported operations, ensuring uninterrupted workflows.
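
The two adaptive techniques named in item 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Photon's implementation: the function names, the dict-of-lists batch layout, the list of active row positions, and the 0.5 sparsity threshold are all hypothetical.

```python
import uuid
from typing import Dict, List, Tuple


def encode_uuid_column(values: List[str]) -> Tuple[str, list]:
    """Encode a string column as 16-byte (128-bit) values when every entry
    is a canonical lowercase UUID; otherwise fall back to plain strings."""
    encoded = []
    for v in values:
        try:
            u = uuid.UUID(v)
        except ValueError:
            return ("string", values)
        if str(u) != v:  # accept only forms that round-trip exactly
            return ("string", values)
        encoded.append(u.bytes)
    return ("uuid128", encoded)


def decode_uuid_column(encoded: List[bytes]) -> List[str]:
    # Inverse of the compact encoding: 16 bytes back to the 36-character text form.
    return [str(uuid.UUID(bytes=b)) for b in encoded]


def compact_if_sparse(
    columns: Dict[str, list], active: List[int], threshold: float = 0.5
) -> Tuple[Dict[str, list], List[int]]:
    """Gather active rows into a dense batch when few rows are still active.

    `active` lists the row positions that survived earlier filters.
    The threshold is an assumed tuning knob, not a documented value.
    """
    num_rows = len(next(iter(columns.values())))
    if num_rows == 0 or len(active) / num_rows >= threshold:
        return columns, active  # dense enough: leave the batch untouched
    dense = {name: [col[i] for i in active] for name, col in columns.items()}
    return dense, list(range(len(active)))


if __name__ == "__main__":
    kind, data = encode_uuid_column(["123e4567-e89b-12d3-a456-426614174000"])
    print(kind, len(data[0]), "bytes instead of 36 characters")
    print(compact_if_sparse({"k": [10, 20, 30, 40, 50, 60, 70, 80]}, [1, 6]))
```

Both functions decide per batch, at run time, which representation to use. That is the sense in which execution is "adaptive": a column that contains a single non-UUID string simply stays a string column.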


Unparalleled Performance

Photon delivers remarkable speedups across various workloads, as evidenced by benchmarks and real-world scenarios:

  1. Micro-Benchmark Results:
     • Hash Joins: Photon’s vectorized hash table achieves up to 3.5× faster performance compared to Spark's sort-merge joins by parallelizing memory accesses.
     • Aggregations: Grouping and aggregation tasks experience up to 5.7× speedups, benefiting from optimized memory pooling and hash table operations.
     • Expression Evaluation: Optimized kernels for common expressions like upper() result in 3× faster performance than Java-based implementations.
  2. TPC-H and TPC-DS Benchmarks: On TPC-H, Photon shows 4× average speedup across 22 queries, with some queries, like Q1, accelerating by 23× due to vectorized arithmetic. For the TPC-DS 100TB benchmark, Photon enabled Databricks to set a world record by delivering industry-leading SQL performance.
  3. Real-World Workloads: Photon has executed tens of millions of queries in production, offering consistent improvements for ETL pipelines, machine learning workflows, and real-time analytics.
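
One way to picture the "parallelizing memory accesses" idea behind the hash join result is a batched probe: hash every key first, then load all candidate slots, then compare. The Python sketch below illustrates that pattern on a toy open-addressing table. The class TinyHashTable, its linear probing scheme, and its capacity are assumptions for illustration and do not reflect Photon's actual hash table.

```python
from typing import List, Optional


class TinyHashTable:
    """Hypothetical open-addressing hash table with linear probing."""

    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self.size = 0
        self.keys: List[Optional[int]] = [None] * capacity
        self.values: List[Optional[str]] = [None] * capacity

    def insert(self, key: int, value: str) -> None:
        slot = hash(key) % self.capacity
        while self.keys[slot] is not None and self.keys[slot] != key:
            slot = (slot + 1) % self.capacity
        if self.keys[slot] is None:
            # Keep at least one slot empty so that probes always terminate.
            if self.size >= self.capacity - 1:
                raise OverflowError("table full")
            self.size += 1
        self.keys[slot] = key
        self.values[slot] = value

    def probe_batch(self, batch: List[int]) -> List[Optional[str]]:
        # Step 1: hash every key in the batch up front.
        slots = [hash(k) % self.capacity for k in batch]
        results: List[Optional[str]] = [None] * len(batch)
        pending = list(range(len(batch)))  # batch positions not yet resolved
        while pending:
            # Step 2: load the candidate entry for every pending key.
            # These loads are independent of one another, which is what lets
            # native code keep many memory requests in flight at once.
            loaded = [self.keys[slots[i]] for i in pending]
            # Step 3: compare; unresolved collisions move to the next slot.
            unresolved = []
            for i, found in zip(pending, loaded):
                if found is None:
                    continue  # empty slot: the key is not in the table
                if found == batch[i]:
                    results[i] = self.values[slots[i]]
                else:
                    slots[i] = (slots[i] + 1) % self.capacity
                    unresolved.append(i)
            pending = unresolved
        return results


if __name__ == "__main__":
    table = TinyHashTable(capacity=8)
    table.insert(1, "a")
    table.insert(9, "b")  # collides with key 1 in a table of 8 slots
    table.insert(3, "c")
    print(table.probe_batch([9, 2, 3, 1]))
```

A row-at-a-time probe would finish one key, including any collision chain, before starting the next. Splitting the work into per-batch phases is what exposes the independent loads.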


Real-World Advantages

  1. Simplified Data Architecture: By accelerating queries directly on raw and curated data, Photon removes the need for duplicating datasets across data lakes and warehouses, reducing operational complexity.
  2. Cost Efficiency: Faster query execution translates to lower compute costs, especially for long-running analytical jobs and large-scale data transformations.
  3. Future-Proof Compatibility: Photon’s support for open formats like Apache Parquet ensures that organizations avoid vendor lock-in while maintaining high performance.


How Photon Delivers on the Lakehouse Promise

Photon isn't just an engine; it’s a technological leap that aligns perfectly with the Lakehouse vision:

  • It bridges the gap between structured and unstructured data by providing consistent performance across SQL, Spark DataFrames, and raw datasets.
  • It supports advanced Lakehouse features like ACID transactions and time travel through its tight integration with Delta Lake.
  • Photon accelerates data exploration, enabling faster insights from exabytes of data stored in elastic, cloud-native environments.


Summary

Photon exemplifies Databricks’ commitment to innovation in data engineering and analytics. With its high-performance vectorized engine, Photon redefines what’s possible in the Lakehouse paradigm. Whether you're running SQL queries, building machine learning models, or transforming large datasets, Photon ensures that your data systems are not just faster, but also smarter and more adaptive.


Explore

Explore Photon today on the Databricks website (https://www.databricks.com/product/photon) and experience the future of data processing. Unlock speed, efficiency, and adaptability for your Lakehouse workloads!

#Databricks

You can also find the white paper at the link below:

https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf

Like the content? Please share and follow my articles. You can also find my blogs on Medium: https://medium.com/@manoj.panicker.blog

More blogs are in the pipeline.



