Databricks Detective #1 - The problem of slow Photon
This is the 1st in a series of articles I plan to write, where I share some customer mysteries I solved. I'll take you along on these journeys to show you how I worked with my colleagues at Databricks to get to the root of the issues and eventually resolve them. So sit tight and follow along.
The mystery
The Databricks Photon engine is now GA on all three cloud platforms, and quite a few customer case studies have shown that it consistently saves time and reduces Total Cost of Ownership (TCO) for Databricks customers, e.g. DuPont and T-Mobile.
In October of this year (2022), one of my customers wanted to find out if Photon could reduce the TCO of their Databricks compute.
They decided to run a Photon POC to test it out. There are a few relevant factors about this customer that are worth mentioning:
I showed the customer team how to enable Photon on a cluster (check the Use Photon Acceleration box on the cluster creation screen), and shared some sample notebooks and cluster configurations with them (see the Resources section below).
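If your team scripts its infrastructure rather than using the UI, the same checkbox can be set through the Clusters REST API. The snippet below is a minimal sketch of my own, not the customer's configuration: the workspace URL, token, and node type are placeholders, and the runtime_engine field is my reading of the public Clusters API, so verify it against the API docs for your workspace.

```python
# Minimal sketch: creating a Photon-enabled cluster through the Databricks
# Clusters REST API instead of the UI checkbox. Host, token, and node type
# are placeholders for your own workspace.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "photon-poc",
    "spark_version": "11.3.x-scala2.12",   # DBR version used later in this POC
    "node_type_id": "i3.xlarge",           # example AWS instance type
    "num_workers": 2,
    "runtime_engine": "PHOTON",            # equivalent of the "Use Photon Acceleration" checkbox
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])  # create returns the new cluster_id
```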
The customer decided to use their existing notebook for this POC. When their test results came back, they showed a few unexpected results:
Here is a chart showing the data above:
The customer couldn't make sense of the test results, since Databricks customers have shown many times that Photon accelerates job runs by 2-5x for most data engineering use cases. They shot me an email saying they had expected Photon to be faster.
I was scratching my head as well, and decided to put on my detective hat. I asked the customer for all the relevant files, then went to the Photon SMEs looking for answers.
Solving the mystery
The SMEs I worked with, Jesse and Udai, noticed a couple of things immediately:
Based on that, my first recommendation to the customer was to put both the instance pool and the cluster on DBR version 11.3, and to enable Photon on the instance pool.
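To make that recommendation concrete, here is a rough sketch (again my own, not the customer's actual configuration) of what keeping the pool and the cluster on the same DBR 11.3 runtime looks like in API terms. The field names follow my reading of the public Instance Pools and Clusters APIs, the runtime strings are examples, and the IDs are placeholders.

```python
# Rough sketch (placeholders, not the customer's config): keep the instance
# pool and the cluster on the same DBR 11.3 runtime, with Photon enabled on
# the cluster that draws from the pool.

pool_spec = {
    "instance_pool_name": "photon-poc-pool",
    "node_type_id": "i3.xlarge",                       # example AWS node type
    "preloaded_spark_versions": ["11.3.x-scala2.12"],  # preload the same runtime the cluster uses
    # If the pool should preload the Photon image instead, use the matching
    # Photon runtime string (e.g. "11.3.x-photon-scala2.12"); check the
    # runtime versions list for your workspace.
    "min_idle_instances": 2,
}

cluster_spec = {
    "cluster_name": "photon-poc",
    "spark_version": "11.3.x-scala2.12",               # must match the pool's preloaded version
    "instance_pool_id": "<pool-id-returned-by-the-pools-API>",
    "num_workers": 2,
    "runtime_engine": "PHOTON",                        # enable Photon on the cluster
}
```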
The results came back a few days later. With the changes above, the Photon-enabled cluster with the custom timezone ran the job in 54 seconds, a 1-second improvement over the 55 seconds from the previous run, though still not showing the full effect of Photon.
I decided to dig deeper into the customer's code and, working with the SMEs, looked at the Spark log, which had this section at the bottom:
== Photon Explanation ==
Photon does not fully support the query because:
Session time zone: 'PST' is not supported by Photon.
Photon does not support offset based time zones like 'GMT+8' or abbreviated time zones like 'IST'. Please use a region based time zone from the 'TZ database name' column on this page:
https://en.wikipedia.org/wiki/List_of_tz_database_time_zones. For example, 'Etc/GMT' is a time zone accepted by Photon.
We also confirmed this by looking at the Spark UI DAG, which showed that Photon wasn't involved in executing the query even though the job was running on a Photon-enabled cluster. Photon-executed query steps should show in orange, but instead all the steps were in blue, which indicates the standard execution engine was used instead of the Photon engine.
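Besides the DAG colors, you can also check the query plan from the notebook itself. Here is a minimal sketch (my own example, not the customer's query), assuming a Photon runtime and an existing table named events:

```python
# Minimal sketch: checking from a notebook whether Photon picked up a query.
# Assumes a Photon-enabled cluster and an existing table called `events`
# (both are placeholders, not the customer's objects).

plan = spark.sql(
    "EXPLAIN FORMATTED SELECT date, count(*) FROM events GROUP BY date"
).collect()[0][0]

# On Photon runtimes the output ends with a "== Photon Explanation ==" section
# like the one above, and operators executed by Photon appear with a "Photon"
# prefix (e.g. PhotonGroupingAgg, PhotonScan) in the physical plan.
print(plan)
```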
Non-Photon SQL query DAG (my sample, not from the customer):
The customer notebook used this SQL command:
SET TIME ZONE 'PST';
Based on the "Photon Explanation" message and the wiki link it provided, I asked the customer to change the SQL command to:
SET TIME ZONE 'America/Los_Angeles';
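The same change can be made and verified from Python, since SET TIME ZONE maps to the spark.sql.session.timeZone configuration. A minimal sketch (not from the customer's notebook):

```python
# Small sketch: apply the region-based (IANA) time zone and sanity-check that
# the session is no longer using an abbreviation like 'PST'.

spark.sql("SET TIME ZONE 'America/Los_Angeles'")

# SET TIME ZONE updates the spark.sql.session.timeZone configuration:
print(spark.conf.get("spark.sql.session.timeZone"))  # -> America/Los_Angeles
```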
The customer came back with new test results using the new timezone format a week later, and lo and behold, the Photon-enabled cluster with the custom timezone ran the job in 25 seconds, compared to 38 seconds on a non-Photon cluster with the exact same configuration running the same notebook, which comes to about 34% time savings ((38 - 25) / 38 ≈ 34%).
The customer was very happy to see the time savings from the Photon-enabled cluster, and plans to use custom tags to build a TCO comparison between Photon-enabled and non-Photon clusters.
A SQL query DAG executed by the Photon engine should look like this; the orange portions were executed by the Photon engine:
There was still one more mystery to resolve: why would merely having a Python UDF in the notebook that is never called by any command increase the execution time? I talked to a few other folks at Databricks, but still couldn't figure this one out. A few days later, when I spoke to the customer again to learn more about this issue, they told me they no longer observed it after making the timezone change, so it was just a red herring; it had no impact on the time it takes to run the notebook.
Learnings
Here is what I learned from this case:
Resources:
If your team is interested in testing Photon performance improvements and/or the timezone setting, here are some resources to get you started:
Notebooks: setup table; photon run (the non-Photon run notebook is exactly the same; only the compute/cluster used differs). A rough sketch of what such a run looks like follows below.
Cluster settings (AWS): Photon cluster config JSON; non-Photon cluster config JSON
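If you can't open the linked notebooks, the sketch below is only my guess at the general shape of such a benchmark; the table name, row count, and query are placeholders and not the contents of the actual notebooks. Run the same notebook once on a Photon cluster and once on a non-Photon cluster and compare the wall-clock times.

```python
# Rough sketch of a Photon benchmark notebook (placeholders only; this is not
# the content of the linked notebooks). Run it on a Photon cluster and on a
# non-Photon cluster with otherwise identical configuration.
import time

spark.sql("SET TIME ZONE 'America/Los_Angeles'")  # region-based tz so Photon stays enabled

# Setup: a synthetic table to aggregate over.
spark.range(100_000_000).selectExpr(
    "id",
    "id % 1000 AS key",
    "cast(id AS double) / 7 AS value",
).write.mode("overwrite").saveAsTable("photon_poc_events")

start = time.time()
spark.sql("""
    SELECT key, count(*) AS cnt, avg(value) AS avg_value
    FROM photon_poc_events
    GROUP BY key
    ORDER BY cnt DESC
""").collect()
print(f"Aggregation took {time.time() - start:.1f} seconds")
```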
Here are the screenshots of my own job runs using my notebooks (my runs show much greater time savings because a larger proportion of the commands in my notebooks were executed by the Photon engine):
Non Photon run:
Photon run with incorrect timezone setting:
Photon run with correct timezone setting:
If you have any mysteries that you recently solved or need solving, please leave a comment below.
That's all for today. I'll share my next mystery-solving adventure soon.