Running Python Workloads on Scalable Snowflake Compute Clusters

What do you do if you have an old & slow notebook, a 160-million-row dataset of customer reviews on which you have to perform sentiment analysis, less than 10 minutes to do it, and no active nodes/servers to utilize?

Prior to last week, your answer might have been, "Ohhh, I dunno... either change the requirements or say it can't be done."

Well, say no more. As of last week, Snowflake started its Private Preview of the Snowpark Python library to solve this exact problem. You say Snow... what... who... when?

So what is the Snowpark Python library?

In the world of Python-based data engineering and data science workloads, most users run into a few common issues when dealing with production-size datasets.

  • Limited resources on the machine or cluster running the Python code. It either can't handle the amount of data or is slow to process it, because all the hard work is being done on those machine(s).
  • People using Python and dataframes (in-memory tables) for regular columnar data manipulation & enrichment (as in ELT/ETL) prefer to stick with Python & dataframes because it is much easier for programmers to do complex work there, and it can be easily debugged when something doesn't work. Asking them to use powerful SQL warehouse compute resources would require them to use SQL (duuuh), which they will politely refuse, since debugging a complex SQL statement is a major pain in the neck.
  • Often Python & dataframes are a must for data science workloads, because Python can do many things that are simply impossible in SQL: performing sentiment analysis (is this review positive or negative?), processing & extracting data from images, audio & video files, or running machine learning models against a ton of data. These things require a programming language like Python & its many 3rd-party libraries.

So how does the Snowpark Python Library help? It helps by doing two very important things.

  1. During code execution, it seamlessly translates the Python code that accesses & manipulates dataframes (virtual in-memory tables) into ANSI SQL statements and executes them on Snowflake compute clusters at blazing speeds, without the need for any compute power on the user's machine. English translation: Joe has a crappy laptop. Joe writes 100% native Python code using the Snowpark library & dataframes. Snowpark automatically translates the Python dataframe code into regular SQL statements & sends them to Snowflake, and Snowflake does all the compute-intensive work instead of Joe's old laptop. Joe can run his code against dataframes with billions of rows in minutes, using Snowflake compute as the calculation & execution engine. Joe is a very happy man.
  2. That is all great for manipulating dataframes: filtering, grouping, sorting, calculating, concatenating, cleaning, trimming, etc. SQL can easily replicate those things and do them much faster. But what if this is data science work that is doing something SQL can never do, like sentiment analysis? For example, we have a dataframe (table) of product reviews like this:

[Screenshot in the original post: a sample table of freeform product review text.]

There is no way in hell you can figure out the sentiment of these freeform text reviews using SQL. You must use Python, Java, or Scala and dataframes, along with a 3rd-party NLP library like NLTK, to score the sentiment from -1 to 1 (terrible to awesome). And that takes some compute horsepower.

So a programmer could write something like this:

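The original post showed this code as a screenshot, which is not reproduced here. A minimal sketch of that client-side approach, assuming pandas, the Snowflake Python connector, and NLTK's VADER analyzer (table, connection, and variable names are my own placeholders, not the author's exact code), would look roughly like this:

    import pandas as pd
    import snowflake.connector
    from nltk.sentiment import SentimentIntensityAnalyzer
    # One-time setup may be needed: import nltk; nltk.download("vader_lexicon")

    # 1. Connect to the database (connection details are placeholders)
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
    )

    # 2. Download the product reviews into an in-memory dataframe on this machine
    reviews = pd.read_sql("SELECT * FROM PRODUCT_REVIEWS", conn)

    # 3. Custom function: a single line of NLTK code returns a score from -1 to 1
    analyzer = SentimentIntensityAnalyzer()

    def GetSentiment(review_text):
        return analyzer.polarity_scores(review_text)["compound"]

    # 4. Apply the function to REVIEW_TEXT and write the result to a new SCORE column
    reviews["SCORE"] = reviews["REVIEW_TEXT"].apply(GetSentiment)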

The code above:

  1. Connects to a database.
  2. Downloads the product reviews into a dataframe called "pd" that resides in the memory of the Python machine itself.
  3. Defines a custom function called "GetSentiment", which uses a single line of code from the NLTK library to get a sentiment score for any given input value.
  4. Applies this function to the REVIEW_TEXT column of that dataframe and writes the resulting score to a new column on that dataframe called "SCORE".

This will work with a few thousand rows, but if you are dealing with larger datasets, things will happen that will suck the life out of you:

  1. The source dataset (millions of rows of reviews) will be fetched from the database and downloaded into the memory of the machine running the Python code. This takes a long time: the database can only produce results so fast, and downloading those results to the machine is limited by the network.
  2. The custom function and all other code will execute on the machine where the Python code is running. If you are on a laptop, it will crash & burn. If this is running on a commercial cluster with multiple nodes, you first have to figure out how many compute nodes you need and how big each node should be (there is no exact science here; you guesstimate and hope they won't run out of memory and fail the job). You then have to start those nodes (takes time), or keep nodes running all the time if you don't have time (takes money). And if you didn't size it right and it fails or runs slower than expected, you have to terminate the entire cluster, create a bigger one from scratch with more nodes, and try again, since you can't simply scale the one you already have (takes time & money).

In the end, the data has to physically move to where the compute is running, and you are not sure how much compute you need, how long it will take to start, or whether it will be enough to handle the amount of data without crashing. This assumes you even have access to an environment where you are allowed to spin up clusters of various sizes freely, given the cost of having to procure those resources ahead of time. In most cases, you have a fixed pool of resources that may or may not be available for your specific job.

SNOWFLAKE SNOWPARK PYTHON TO THE RESCUE

This is where the magic happens. If our data scientist decides to use Snowflake & Snowpark Python, the code stays similar & does not change much at all.

A few import lines for the Snowpark library, the exact same function, and some additional GroupBy & Count on the dataframe to eliminate duplicate reviews.

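Again, the original screenshot is not reproduced here; the sketch below shows roughly what the Snowpark version could look like, assuming the snowflake-snowpark-python package. Connection parameters, the UDF name, and warehouse names are placeholders, and in practice the NLTK lexicon data would also need to be made available to the UDF (for example via the UDF's imports):

    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import udf, col, call_udf
    from snowflake.snowpark.types import FloatType, StringType

    # 1. Open a Snowpark session (connection parameters are placeholders)
    session = Session.builder.configs({
        "account": "my_account", "user": "my_user", "password": "...",
        "warehouse": "MY_WH", "database": "MY_DB", "schema": "PUBLIC",
    }).create()

    # 2. Point a dataframe at the reviews table; no data is downloaded here
    df_reviews = session.table("AMAZON_REVIEWS_160M")

    # 3. The same sentiment function, registered as a Python UDF that runs inside
    #    Snowflake next to the data (Snowpark zips & uploads the code for us).
    #    Note: the vader_lexicon data files would also have to be uploaded/imported
    #    so NLTK can find them inside the UDF environment.
    @udf(name="GetSentiment", packages=["nltk"], replace=True,
         return_type=FloatType(), input_types=[StringType()])
    def get_sentiment(review_text):
        from nltk.sentiment import SentimentIntensityAnalyzer
        return SentimentIntensityAnalyzer().polarity_scores(review_text)["compound"]

    # 4. Dedupe the reviews, then score each one by calling the UDF.
    #    Nothing heavy runs locally; this is translated to SQL at execution time.
    result_df = (
        df_reviews
        .group_by(col("REVIEW_TEXT"))
        .count()
        .select(col("REVIEW_TEXT"),
                call_udf("GetSentiment", col("REVIEW_TEXT")).alias("SCORE"))
    )
    result_df.show()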

However, when he/she executes this code on a crappy old laptop, it does things in a completely different manner and eliminates all of the life-sucking problems I mentioned above.

  • The initial dataframe (df_reviews) will NOT load/move any data from Snowflake to the Python machine. Snowpark automatically translates that line of code to 'SELECT * FROM AMAZON_REVIEWS_160M' and executes it using Snowflake compute. The query & its results are associated with the dataframe virtually by the library, without downloading the data.
  • The UDF portion, where we define the custom function that uses a 3rd-party library to get a sentiment score: Snowpark treats this function as something that can't be done using plain old SQL, automatically packages the code in a ZIP file, uploads it to an internal Snowflake storage location, and registers it as a Python user-defined SQL function (get it? actual Python code that can be executed on Snowflake compute as a SQL user-defined function/UDF). All of this happens automatically when the user executes the code. No need to learn or do anything new. Just plain old Python.


  • The last dataframe, result_df, has some additional dataframe-y things like Select, GroupBy, call_udf, and alias. The SQL-compatible stuff (selecting columns, GroupBy, count, and alias) is again translated to ANSI SQL by Snowpark, and the call_udf part tells it to wrap the previously uploaded UDF into that SQL. When this line executes on the notebook, nothing really runs on the notebook itself. Instead, Snowpark translates the dataframe code into a SQL statement roughly like the one sketched below and executes it on Snowflake clusters:

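The original screenshot of that generated SQL is not shown here, but assuming a recent version of the Snowpark library, you can ask the dataframe itself what it is about to send. The commented query below is only an approximation of the shape of that SQL, not the exact text Snowpark produces:

    # Print the SQL Snowpark has generated for result_df (names as in the sketch above)
    for query in result_df.queries["queries"]:
        print(query)

    # The generated statement is roughly of this shape (approximation only):
    # SELECT "REVIEW_TEXT", GetSentiment("REVIEW_TEXT") AS "SCORE"
    # FROM (SELECT "REVIEW_TEXT", COUNT(*) AS "COUNT"
    #       FROM AMAZON_REVIEWS_160M GROUP BY "REVIEW_TEXT")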

So what you get is SQL running on Snowflake for all of the regular transformations, calling a Python UDF whose custom code uses the NLTK library to perform sentiment analysis.

Here comes the best part. The Python function, which is now a Python UDF in Snowflake callable from a SQL statement, will partition the data and execute in parallel using instantly scalable Snowflake compute warehouses of any size to improve performance.

So if I had 160 million reviews and less than 10 minutes to complete the work using a crappy laptop, Snowpark would work beautifully. Just add one extra line of code to resize (not recreate, as with Spark nodes) the warehouse to 3XL (64 nodes) right before calling the UDF scoring function, let all 64 servers split the full dataset, and process all the data partitions in parallel using many Python instances simultaneously. Once the scoring process is done, simply send another statement to Snowflake to scale the cluster down to XSmall and/or suspend it to stop incurring charges, as sketched below.
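For example (the warehouse name is a placeholder, and this assumes the session from the earlier sketch), those extra lines could be as simple as:

    # Scale the warehouse up to 3XL (64 nodes) right before the scoring query
    session.sql("ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'XXXLARGE'").collect()

    # ... run the UDF scoring query here, e.g. result_df.collect() ...

    # Scale back down and suspend once scoring is done to stop incurring charges
    session.sql("ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'XSMALL'").collect()
    session.sql("ALTER WAREHOUSE MY_WH SUSPEND").collect()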

I admit I am no Python expert, but even I was able to finish scoring a 160M-row dataset in about 6 minutes using a 3XL warehouse with 64 servers, for a total cost of $13.15, on my first try at coding in Python & Snowpark. And that ain't so bad at all. In this whole process, all my laptop did was translate the dataframes to SQL, upload my function as a UDF using the Snowpark library, and send the queries to Snowflake for execution.

  • No data from the 160M-row table was moved between my machine and Snowflake.
  • The Python code was uploaded to where the data lives in Snowflake.
  • The Python code executed alongside the data in the same compute cluster (warehouse).
  • None of the computational heavy lifting ran on my machine. All of it was pushed down to Snowflake compute as either SQL statements or Python UDFs.
  • I was able to scale the compute up to 64 nodes in less than a second and watch the progress of the execution in the Snowflake History view, where I could see the number of rows & the amount of data (in GBs) processed being updated every few seconds. This is huge: you can actually tell how well the cluster size is performing against the dataset you are running, instead of flying blind and just hoping it will succeed at some point and not fail due to memory issues or otherwise.

This is all possible because you have the ability to summon massive amounts of computing power within seconds, use it for short periods only when you need it, and pay for actual compute time down to the second, without having to pre-purchase and maintain a bunch of VMs just in case you have large workloads.

In the end, Snowpark Python will give you these benefits:

  • No need to use or learn SQL. Code in Python as you have been doing it.
  • Increased performance, lower execution times & lower hardware costs associated with running your Python code. Snowpark will automatically translate dataframes to SQL and push-down compute to Snowflake where you pay per second of compute usage.
  • Lower TCO: increased performance with each size up & much less network traffic. Snowpark automatically uploads Python functions & 3rd-party libraries to Snowflake, where the Python code executes alongside the data on instantly scalable clusters.
  • Better performance & data security by eliminating much of the data being sent back & forth between where the data resides and the Python platform.
  • ZERO maintenance. A Snowflake account & the free Snowpark Python library are all you need. No need to run & manage a separate platform or purchase & maintain a bunch of VMs. It is as easy as it gets.

It is truly amazing to be able to start from scratch with nothing but a Snowflake account, then do amazing things with huge data sets by instantly summoning 64 servers within seconds, running them for only 6 minutes & 10 seconds, and paying $13 for it. This is the real meaning of cloud scale & agility, and Snowflake makes it all real.

I work with dozens of companies that pay millions of dollars a year just to have fewer than half the number of servers used in this experiment running on-prem 24x7, just to provide acceptable levels of performance to a few select users, because compute is limited and shared across many workloads. When they have to run things at scale, it takes hours & sometimes days, and they often have to wait for off-business hours.

For them, rolling new ideas or projects into production takes months, and most of that time is wasted on getting the scalable infrastructure to work properly, which provides literally zero product differentiation & is a total time/resource suck for the entire team, while all the smart brains are focused on architecture and scalability and NOT on the product itself.

Imagine the impact of running a successful test like this in 6 minutes for any business, and the level of agility & innovation it can inject into your organization.

Now imagine the experiment was a total failure, as not many get it right the first time! You might have to repeat it dozens more times before you get it right. That is the difference between weeks of delays in pushing new products or projects to production vs. completing all your tests before lunch and moving on to bigger and better ideas.

Being able to gain access to huge amounts of compute resources for just the tiny fraction of time it takes to run that workload, in a very economical way, is key for #agile #development for any business, & Snowflake truly makes it all possible.

If you haven't been convinced by now that this is the best Python news ever for data engineers and data scientists, then I don't know what else could be more exciting. If you agree with me, run to your closest Snowflake architect and ask to be part of the Python Private Preview. If not, you just wasted 10 minutes of your time, but you were probably waiting for your clusters to start, or for them to finish something very complex, and decided to do some free-range reading in the meantime.

As always, if you find this article helpful, feel free to click on the LIKE button so others in your network can benefit as well.

Ivan Mazza Diez (IT Consultant | Journey to Cloud | Datacenter | Networking | IoT), 1 year ago:
Very good article man!!! Quick question, do we have something similar inside Azure? Have you tested Synapse? Regards

Sean Zinsmeister (Director of Outbound Product Management, Google Cloud), 2 years ago:
Adding our Search Engine to the power of Snowpark for Sentiment analysis has been a heck of a lot of fun :-)

Bertand Batenburg (Data Architect/Engineer), 2 years ago:
Bart van der Hulst, Wilma Timmerman-Hupkes. This could, potentially, simplify the analytics landscape. Less data movement, storing data once, using one MPP compute platform (with Python support).

Abdullah Siddique (Senior Technical Lead at MResult | ex Technical Lead at Wipro Limited | Oracle PL/SQL Developer | Snowflake Developer | Snowflake, SQL and Core Python trainer | Ex Wipro employee | Freelance trainer), 2 years ago:
I had been waiting for Python Snowpark, thanks for this wonderful and insightful write-up

John Park (Solution Architect, Software Engineer, Technologist), 2 years ago:
Great article, Nick!
