Databricks Detective #1 - The problem of slow Photon
This is the 1st in a series of articles I plan to write, where I share some customer mysteries I solved. I'll take you along on these journeys to show you how I worked with my colleagues at Databricks to get to the root of the issues and eventually resolve them. So sit tight and follow along.
The mystery
The Databricks Photon engine is now GA on all three cloud platforms, and quite a few customer case studies have shown that it consistently saves time and reduces Total Cost of Ownership (TCO) for Databricks customers, e.g. DuPont and T-Mobile.
In October of this year (2022), one of my customers wanted to find out if Photon could reduce the TCO of their Databricks compute.
They decided to run a Photon POC to test it out. There are a few relevant factors about this customer that are worth mentioning:
I showed the customer team how to enable Photon on a cluster (check the Use Photon Acceleration box on the cluster creation screen), and shared some sample notebooks and cluster configurations with them (see the Resources section below).
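If your team scripts its infrastructure rather than using the UI, the same checkbox can be set through the Clusters REST API. The snippet below is a minimal sketch of my own, not the customer's configuration: the workspace URL, token, and node type are placeholders, and the runtime_engine field is my reading of the public Clusters API, so verify it against the API docs for your workspace.

```python
# Minimal sketch: creating a Photon-enabled cluster through the Databricks
# Clusters REST API instead of the UI checkbox. Host, token, and node type
# are placeholders for your own workspace.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "photon-poc",
    "spark_version": "11.3.x-scala2.12",   # DBR version used later in this POC
    "node_type_id": "i3.xlarge",           # example AWS instance type
    "num_workers": 2,
    "runtime_engine": "PHOTON",            # equivalent of the "Use Photon Acceleration" checkbox
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])  # create returns the new cluster_id
```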
The customer decided to use their existing notebook for this POC. When their test results came back, they showed a few unexpected results:
Here is a chart showing the data above:
The customer couldn't make sense of the test results, since Databricks customers have shown many times that Photon accelerates job runs by 2-5x for most data engineering use cases. They shot me an email saying they had expected Photon to be faster.
I was scratching my head as well, and decided to put on my detective hat. I asked the customer for all the relevant files, then went to the Photon SMEs looking for answers.
Solving the mystery
The SMEs I worked with, Jesse and Udai, noticed a couple of things immediately:
Based on that, my first recommendation to the customer was to put both the instance pool and the cluster on DBR version 11.3, and to enable Photon on the instance pool.
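To make that recommendation concrete, here is a rough sketch (again my own, not the customer's actual configuration) of what keeping the pool and the cluster on the same DBR 11.3 runtime looks like in API terms. The field names follow my reading of the public Instance Pools and Clusters APIs, the runtime strings are examples, and the IDs are placeholders.

```python
# Rough sketch (placeholders, not the customer's config): keep the instance
# pool and the cluster on the same DBR 11.3 runtime, with Photon enabled on
# the cluster that draws from the pool.

pool_spec = {
    "instance_pool_name": "photon-poc-pool",
    "node_type_id": "i3.xlarge",                       # example AWS node type
    "preloaded_spark_versions": ["11.3.x-scala2.12"],  # preload the same runtime the cluster uses
    # If the pool should preload the Photon image instead, use the matching
    # Photon runtime string (e.g. "11.3.x-photon-scala2.12"); check the
    # runtime versions list for your workspace.
    "min_idle_instances": 2,
}

cluster_spec = {
    "cluster_name": "photon-poc",
    "spark_version": "11.3.x-scala2.12",               # must match the pool's preloaded version
    "instance_pool_id": "<pool-id-returned-by-the-pools-API>",
    "num_workers": 2,
    "runtime_engine": "PHOTON",                        # enable Photon on the cluster
}
```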
The results came back a few days later. With the changes above, the Photon-enabled cluster with the custom timezone ran the job in 54 seconds, a 1-second improvement over the 55 seconds from the previous run, though still not showing the full effect of Photon.
I decided to dig deeper into the customer's code and, working with the SMEs, looked at the Spark log, which had this section at the bottom:
== Photon Explanation ==
Photon does not fully support the query because:
Session time zone: 'PST' is not supported by Photon.
Photon does not support offset based time zones like 'GMT+8' or abbreviated time zones like 'IST'. Please use a region based time zone from the 'TZ database name' column on this page:
https://en.wikipedia.org/wiki/List_of_tz_database_time_zones. For example, 'Etc/GMT' is a time zone accepted by Photon.
We also confirmed this by looking at the Spark UI DAG, which showed that Photon wasn't involved in executing the query even though the job was running on a Photon-enabled cluster. Photon-executed query steps should show in orange, but instead all the steps were in blue, which indicates the standard execution engine was used instead of the Photon engine.
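Besides the DAG colors, you can also check the query plan from the notebook itself. Here is a minimal sketch (my own example, not the customer's query), assuming a Photon runtime and an existing table named events:

```python
# Minimal sketch: checking from a notebook whether Photon picked up a query.
# Assumes a Photon-enabled cluster and an existing table called `events`
# (both are placeholders, not the customer's objects).

plan = spark.sql(
    "EXPLAIN FORMATTED SELECT date, count(*) FROM events GROUP BY date"
).collect()[0][0]

# On Photon runtimes the output ends with a "== Photon Explanation ==" section
# like the one above, and operators executed by Photon appear with a "Photon"
# prefix (e.g. PhotonGroupingAgg, PhotonScan) in the physical plan.
print(plan)
```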
Non-Photon SQL query DAG (my sample, not from the customer):
The customer notebook used this SQL command:
SET TIME ZONE 'PST';
Based on the "Photon Explanation" message and the wiki link it provided, I asked the customer to change the SQL command to:
SET TIME ZONE 'America/Los_Angeles';
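The same change can be made and verified from Python, since SET TIME ZONE maps to the spark.sql.session.timeZone configuration. A minimal sketch (not from the customer's notebook):

```python
# Small sketch: apply the region-based (IANA) time zone and sanity-check that
# the session is no longer using an abbreviation like 'PST'.

spark.sql("SET TIME ZONE 'America/Los_Angeles'")

# SET TIME ZONE updates the spark.sql.session.timeZone configuration:
print(spark.conf.get("spark.sql.session.timeZone"))  # -> America/Los_Angeles
```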
The customer came back with new test results using the new timezone format a week later, and lo and behold, the Photon-enabled cluster with the custom timezone ran the job in 25 seconds, compared to 38 seconds on a non-Photon cluster with the exact same configuration running the same notebook, which comes to about 34% time savings ((38 - 25) / 38 ≈ 34%).
The customer was very happy to see the time savings from the Photon-enabled cluster, and plans to use custom tags to build a TCO comparison between Photon-enabled and non-Photon clusters.
A SQL query DAG executed by the Photon engine should look like this; the orange portions were executed by the Photon engine:
There was still one more mystery to resolve: why would merely having a Python UDF in the notebook that is never called by any command increase the execution time? I talked to a few other folks at Databricks, but still couldn't figure this one out. A few days later, when I spoke to the customer again to learn more about this issue, they told me they no longer observed it after making the timezone change, so it was just a red herring; it had no impact on the time it takes to run the notebook.
Learnings
Here is what I learned from this case:
Resources:
If your team is interested in testing Photon performance improvements and/or the timezone setting, here are some resources to get you started:
Notebooks: setup table; photon run (the non-Photon run notebook is exactly the same; only the compute/cluster used differs). A rough sketch of what such a run looks like follows below.
Cluster settings (AWS): Photon cluster config JSON; non-Photon cluster config JSON
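If you can't open the linked notebooks, the sketch below is only my guess at the general shape of such a benchmark; the table name, row count, and query are placeholders and not the contents of the actual notebooks. Run the same notebook once on a Photon cluster and once on a non-Photon cluster and compare the wall-clock times.

```python
# Rough sketch of a Photon benchmark notebook (placeholders only; this is not
# the content of the linked notebooks). Run it on a Photon cluster and on a
# non-Photon cluster with otherwise identical configuration.
import time

spark.sql("SET TIME ZONE 'America/Los_Angeles'")  # region-based tz so Photon stays enabled

# Setup: a synthetic table to aggregate over.
spark.range(100_000_000).selectExpr(
    "id",
    "id % 1000 AS key",
    "cast(id AS double) / 7 AS value",
).write.mode("overwrite").saveAsTable("photon_poc_events")

start = time.time()
spark.sql("""
    SELECT key, count(*) AS cnt, avg(value) AS avg_value
    FROM photon_poc_events
    GROUP BY key
    ORDER BY cnt DESC
""").collect()
print(f"Aggregation took {time.time() - start:.1f} seconds")
```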
Here are the screenshots of my own job runs using my notebooks (my runs show much greater time savings because a larger proportion of the commands in my notebooks were executed by the Photon engine):
Non Photon run:
Photon run with incorrect timezone setting:
Photon run with correct timezone setting:
If you have any mysteries that you recently solved or need solving, please leave a comment below.
That's all for today. I'll share my next mystery-solving adventure soon.