Photon: Revolutionizing Query Performance in Lakehouse Systems

Manoj Panicker

Data Engineer | Databricks | PySpark | Spark SQL | Azure Synapse | Azure Data Factory | SAFe® 6.0

Published: December 4, 2024

Photon, Databricks' fast query engine for Lakehouse systems:

Figure 1: Databricks’ execution layer. Photon runs as part of the Databricks Runtime, which executes queries on a distributed cluster of public cloud VMs. Within these clusters, Photon executes tasks on partitions of data on a single thread.

Photon: The High-Speed Engine Powering Lakehouse Analytics

As organizations move to Lakehouse architectures, which combine the scalability of data lakes with the governance and performance of data warehouses, they face new challenges in query execution and data processing. Photon is a query engine developed by Databricks to address them: it optimizes Lakehouse workloads through vectorized execution and native code.

What is Photon?

Photon is a vectorized query engine built from the ground up in C++. It integrates seamlessly with the Databricks Runtime, accelerating SQL and Apache Spark workloads with adaptive, high-speed processing. Unlike traditional JVM-based engines, Photon capitalizes on columnar data layouts and advanced in-memory techniques to achieve superior performance on massive datasets stored in cloud environments like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.
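
To make the contrast between row-at-a-time and vectorized, columnar processing concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not Photon's code (Photon is written in C++); the column names, the sample values, and the functions `revenue_row_at_a_time`, `multiply_kernel`, and `revenue_vectorized` are all made up for this example.

```python
from array import array

# Row-oriented layout: one record object per row (hypothetical sample data).
rows = [
    {"price": 10.0, "qty": 2},
    {"price": 4.5, "qty": 4},
    {"price": 7.25, "qty": 1},
]


def revenue_row_at_a_time(rows):
    """Evaluate price * qty by looking up both fields again for every row."""
    return [r["price"] * r["qty"] for r in rows]


# Column-oriented layout: one contiguous, typed array per column.
batch = {
    "price": array("d", [10.0, 4.5, 7.25]),  # 64-bit floats
    "qty": array("q", [2, 4, 1]),            # 64-bit signed integers
}


def multiply_kernel(left, right):
    """A batch 'kernel': one call processes two whole column vectors in a tight loop."""
    out = array("d", [0.0]) * len(left)  # pre-sized output vector
    for i in range(len(left)):
        out[i] = left[i] * right[i]
    return out


def revenue_vectorized(batch):
    """Evaluate price * qty with a single kernel call for the whole batch."""
    return multiply_kernel(batch["price"], batch["qty"])
```

Both functions return the same values; the difference is in how the work is organized. In a native engine, the loop inside a kernel like `multiply_kernel` is where techniques such as SIMD and cache-friendly access come into play, which plain Python cannot show.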

Key Design Innovations

Photon's architecture incorporates several design choices that set it apart:

Vectorized Execution: Photon processes data in batches rather than rows, improving CPU utilization through techniques like SIMD (Single Instruction, Multiple Data). This reduces interpretation overhead and allows efficient pipeline parallelism. Common operations like hash joins, aggregations, and expression evaluations benefit from specialized vectorized kernels, which optimize memory access patterns.
Native C++ Implementation: Moving away from JVM-based execution, Photon gives developers explicit control over memory management and low-level optimizations. This eliminates performance bottlenecks inherent in Java-based engines, such as garbage collection issues for large heaps and limits imposed by just-in-time (JIT) compilation.
Column-Oriented Design: Photon uses columnar in-memory data representation, which aligns well with columnar storage formats like Apache Parquet and Delta Lake. This design simplifies operations like serialization and enables tight loops for better cache performance and prefetching.
Adaptive Execution: Photon adapts its execution at runtime to properties of the data, such as nullability, ASCII-only strings, or sparsity. Two examples: sparse batch compaction, which compacts sparse batches during hash joins to maximize memory parallelism; and optimized string encoding, which detects UUID patterns and reduces shuffle data volume by encoding them as compact 128-bit integers.
Seamless Integration with Databricks Runtime: Photon is compatible with Apache Spark’s SQL and DataFrame APIs, allowing incremental adoption. It supports fallback to the legacy Spark SQL engine for unsupported operations, ensuring uninterrupted workflows.
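
The two adaptive techniques named above, sparse batch compaction and UUID encoding, can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not Photon's implementation: the function names, the list-of-active-indices representation of a batch, and the `min_density` threshold of 0.5 are all hypothetical.

```python
import uuid


def compact_if_sparse(values, active, min_density=0.5):
    """Squeeze out inactive rows when too few rows of a batch are still active.

    values: one column of a batch, as a list.
    active: sorted indices of the rows that survived earlier filters.
    min_density: hypothetical threshold below which the batch counts as sparse.
    """
    if not values or len(active) / len(values) >= min_density:
        return values, active  # dense enough: leave the batch untouched
    dense = [values[i] for i in active]
    return dense, list(range(len(dense)))


def encode_uuid_column(strings):
    """Encode a string column as 128-bit integers, or return None to fall back.

    The encoding is used only if every value is a canonical lowercase UUID,
    so that decoding reproduces the original strings exactly.
    """
    encoded = []
    for s in strings:
        try:
            u = uuid.UUID(s)
        except ValueError:
            return None  # not a UUID: keep the column as plain strings
        if str(u) != s:
            return None  # parses as a UUID but is not in canonical form
        encoded.append(u.int)
    return encoded


def decode_uuid_column(ints):
    """Turn 128-bit integers back into canonical UUID strings."""
    return [str(uuid.UUID(int=i)) for i in ints]
```

A canonical UUID string is 36 characters long, while its integer form fits in 16 bytes, which is where the reduction in shuffle volume comes from. Note that both functions decide per batch or per column, based on the data they actually see, whether the optimization applies.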

Unparalleled Performance

Photon delivers remarkable speedups across various workloads, as evidenced by benchmarks and real-world scenarios:

Micro-Benchmark Results: For hash joins, Photon’s vectorized hash table is up to 3.5× faster than Spark's sort-merge joins because it parallelizes memory accesses. Grouping and aggregation tasks see speedups of up to 5.7×, benefiting from optimized memory pooling and hash table operations. For expression evaluation, optimized kernels for common expressions like upper() run 3× faster than Java-based implementations.
TPC-H and TPC-DS Benchmarks: On TPC-H, Photon shows a 4× average speedup across the 22 queries, with some queries, like Q1, accelerating by 23× due to vectorized arithmetic. On the TPC-DS 100TB benchmark, Photon enabled Databricks to set a world record for SQL performance.
Real-World Workloads: Photon processes tens of millions of queries daily, offering consistent improvements for ETL pipelines, machine learning workflows, and real-time analytics.

Real-World Advantages

Simplified Data Architecture: By accelerating queries directly on raw and curated data, Photon removes the need for duplicating datasets across data lakes and warehouses, reducing operational complexity.
Cost Efficiency: Faster query execution translates to lower compute costs, especially for long-running analytical jobs and large-scale data transformations.
Future-Proof Compatibility: Photon’s support for open formats like Apache Parquet ensures that organizations avoid vendor lock-in while maintaining high performance.
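
The Cost Efficiency point can be checked with simple arithmetic. The sketch below assumes cost is runtime multiplied by an hourly rate, and it allows for the accelerated cluster being billed at a different rate; the function names and the `rate_multiplier` parameter are hypothetical, and any numbers you plug in should come from your own workloads and pricing.

```python
def job_cost(runtime_hours, hourly_rate):
    """Cost of one job: runtime multiplied by the cluster's hourly rate."""
    return runtime_hours * hourly_rate


def photon_savings(baseline_hours, baseline_rate, speedup, rate_multiplier=1.0):
    """Fraction of the baseline cost saved by the faster engine.

    speedup: how many times faster the job runs (for example 4.0).
    rate_multiplier: hypothetical ratio of the accelerated cluster's hourly
        rate to the baseline rate (1.0 means both cost the same per hour).
    A negative result means the accelerated job is more expensive.
    """
    baseline = job_cost(baseline_hours, baseline_rate)
    accelerated = job_cost(baseline_hours / speedup, baseline_rate * rate_multiplier)
    return 1.0 - accelerated / baseline
```

For example, with the 4× average TPC-H speedup quoted earlier and equal hourly rates, `photon_savings(10, 20, 4.0)` gives a 75% saving. If the accelerated cluster cost twice as much per hour (an illustrative figure, not a quoted price), the saving would drop to 50%, and the break-even point is reached when the speedup equals the rate multiplier.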

How Photon Delivers on the Lakehouse Promise

Photon isn't just an engine; it’s a technological leap that aligns perfectly with the Lakehouse vision:

It bridges the gap between structured and unstructured data by providing consistent performance across SQL, Spark DataFrames, and raw datasets.
It supports advanced Lakehouse features like ACID transactions and time travel through its tight integration with Delta Lake.
Photon accelerates data exploration, enabling faster insights from exabytes of data stored in elastic, cloud-native environments.

Summary

Photon exemplifies Databricks’ commitment to innovation in data engineering and analytics. With its high-performance vectorized engine, Photon redefines what’s possible in the Lakehouse paradigm. Whether you're running SQL queries, building machine learning models, or transforming large datasets, Photon ensures that your data systems are not just faster, but also smarter and more adaptive.

Explore

Explore Photon today on the Databricks website (https://www.databricks.com/product/photon) and see what it can do for the speed, efficiency, and adaptability of your Lakehouse workloads.

#Databricks

You can also find the white paper at the link below:

https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf

Like the content? Please share it and follow my articles. You can also find my blogs on Medium: https://medium.com/@manoj.panicker.blog

More blogs are in the pipeline.

