SnowPark Python — Aamir P

Hello readers!

Thank you for supporting all my articles. I am not as confident about this article on SnowPark Python, because Python support was added to SnowPark less than a year ago, so I find fewer resources to learn SnowPark Python.

I will share whatever I know and feel confident about in this article. I request learners to browse the internet further to learn more about this tool.

SnowPark provides an API for querying and processing data in a pipeline. It does not move the data anywhere; instead, you build the application that processes the data where it lives. Stored procedures are used to automate data transformation and processing.

Sessions are used to interact with the database. For automation purposes, use Python in Snowsight. DataFrames are the primary way to query and process data. We convert lambdas and functions into user-defined functions, and that is how we call the data for processing.

You can also use SnowPark Python in ML, for example Credit Card Approval Prediction, thanks to the stored procedure feature. You can troubleshoot your code with logging statements; recording log messages and trace events is a notable feature of this tool. Supported Python versions range from 3.8 to 3.11. Create a virtual environment using tools like Anaconda, Miniconda, or virtualenv. I have taught you about Pandas before; here, Pandas is mainly used because we are working with DataFrames. PyArrow will be installed by default. If you installed it manually, please uninstall it before proceeding.

Install the Snowpark Python package into the Python virtual environment by using:-

pip install snowflake-snowpark-python

or

Are you comfortable with a Jupyter notebook?

Go ahead with these commands:-

pip install notebook ~ install Jupyter Notebook

jupyter notebook ~ start the Jupyter Notebook server

You can also use SnowPark in VS Code.

First Step

Import the Session class

from snowflake.snowpark import Session

Authenticate using the Snowflake connector. A Python dictionary of connection parameters is mandatory here; pass it to the builder object and then call create on it.
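For example, here is a minimal sketch of creating a session; the account, user, and other values below are placeholders you must replace with your own credentials:

from snowflake.snowpark import Session

# Connection parameters go in a Python dictionary
connection_parameters = {
    "account": "<your_account_identifier>",
    "user": "<your_username>",
    "password": "<your_password>",
    "role": "<your_role>",            # optional
    "warehouse": "<your_warehouse>",  # optional
    "database": "<your_database>",    # optional
    "schema": "<your_schema>",        # optional
}

# Pass the dictionary to the builder object and create the session
new_session = Session.builder.configs(connection_parameters).create()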

If you want to cancel the running session use the below:-

new_session.close()

Python worksheets are well-suited for transformations and data manipulations. Third-party packages from Anaconda can be used, or you can import your own scripts from stages.

Python worksheets run inside Snowflake, so you can use session.add_import or other session methods. An X-Small warehouse is the recommended size for SnowPark Python worksheets.

To import a package in SnowPark, use statements like the below:-

from example_package import example_function
import other_package

We can use the scikit-learn library in SnowPark. Just import it and use it.

def main is the handler function here. The return type differs depending on what the handler returns. The notable ones are:-

For a DataFrame, the return type is Table.

For a list of Row objects, such as from the collect method, the return type is Variant.

For a string, the return type is String.

Review these results in the result panel.

To write to the console use print(). If you want to print a DataFrame use show(). To read files from a stage you have to mention the database and schema in the code. To retrieve and manipulate data we use a DataFrame. A DataFrame is a relational dataset that is evaluated lazily: it only executes when a specific action is triggered. A DataFrame can be constructed over an external CSV file or a SQL query, and you can specify the column names, filter or sort, and then upload the result. Some things worth remembering:-

  • The collect method is used to retrieve the data.
  • Use join to join DataFrames.
  • Python’s built-in copy() method is used to create a clone of the DataFrame object.
  • col is a function to create column references.
  • select is used to specify the columns.
  • filter is a method to specify a filter condition.
  • Use the DataFrame.col method to refer to a column in a specific DataFrame.
  • Column.as_ is a method to change the name of a newly created column.
  • Use double quotes around object identifiers, and a backslash (\) to escape a double quote inside a double-quoted string literal. Simply specifying a column name does not require double quotes.
  • lit means literal.
  • To retrieve the definition of the columns, read the schema property.
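Here is a small sketch tying these methods together, assuming a table named sample_product_data with id, name, and amount columns exists in your current database and schema:

from snowflake.snowpark.functions import col, lit

# Build a DataFrame over an existing table (nothing runs yet)
df = session.table("sample_product_data")

# Filter rows, select and rename columns, and add a literal column
df_filtered = (
    df.filter(col("amount") > 100)
      .select(col("id"), col("name"), lit(1).as_("flag"))
)

# Join with a DataFrame built from local values
regions = session.create_dataframe([[1, "North"], [2, "South"]], schema=["id", "region"])
df_joined = df_filtered.join(regions, "id")

# Actions trigger execution
rows = df_joined.collect()   # returns a list of Row objects
df_joined.show()             # prints up to 10 rows by default

Nothing is sent to the warehouse until collect() or show() runs; that is what lazy evaluation means here.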

Perform an action to evaluate a DataFrame:-

  • collect:- Evaluates the DataFrame and returns the resulting dataset as a list of Row objects.
  • count:- Evaluates the DataFrame and returns the number of rows.
  • show:- Evaluates the DataFrame and prints the rows to the console. This method limits the number of rows to 10 (by default).
  • save_as_table:- Saves the data in the DataFrame to the specified table.

How to save Data in a Table?

  1. Use the write property to get a DataFrameWriter object.
  2. In the mode method, specify the mode (for example "overwrite" or "append").
  3. Call save_as_table to save the data.

df.write.mode("overwrite").save_as_table("table1")

Use create_or_replace_view to create a view from a DataFrame.
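For instance, a one-line sketch (the view name here is just my own example):

# Create or replace a view backed by the DataFrame's query
df.create_or_replace_view("my_database.my_schema.product_view")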

How to work with files in a stage?

Use the read method on the session to get a DataFrameReader object. If the files are in CSV, define their structure as a StructType schema and pass it to the reader by calling the schema method.
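Here is a minimal sketch, assuming a stage named my_stage holds a data.csv file with id, name, and amount columns:

from snowflake.snowpark.types import StructType, StructField, IntegerType, StringType, FloatType

# Describe the structure of the CSV file
csv_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("amount", FloatType()),
])

# Pass the schema to the reader, set a file format option, and read from the stage
df_csv = (
    session.read
           .schema(csv_schema)
           .option("skip_header", 1)
           .csv("@my_stage/data.csv")
)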

import snowflake.snowpark as snowpark

def main(session: snowpark.Session):
    # your code goes here
    ...

The handler function, main by default, is where the worksheet starts executing.

Session is used to access the data in Snowflake.

Import, as you all know, is used to bring in the packages we need.

  • If your handler function returns a DataFrame, use the default return type of Table().
  • If your handler function returns a list of Row objects, such as with the Collect method, change the return type to Variant.
  • If your handler function returns a string, such as return "Hello Readers", or a value that you want to cast as a string, change the return type to String.
  • If your handler function returns an integer, such as with the count method, use a return type of Variant or String.
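Putting that together, here is a sketch of a worksheet handler that returns a DataFrame, so the default Table() return type applies; the table name is only my own example:

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col

def main(session: snowpark.Session):
    # Build and transform a DataFrame, then return it as the worksheet result
    df = session.table("sample_product_data").filter(col("amount") > 0)
    return df  # rendered as a table because the default return type is Table()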

To test a function, pass additional arguments to it. Then set the single-argument function as the handler function, run the worksheet, and validate that your code runs with the argument values.

Function to output to the console? Use print().

Function to print a DataFrame? Use show().

To retrieve and manipulate data, you use the DataFrame class. A DataFrame represents a relational dataset. To retrieve the data from the DataFrame, use the collect() method. Use the read property on the session to get a DataFrameReader object. The col function returns a Column object, or you can use an expression that evaluates to a column. Use join to join DataFrames. Column objects can be filtered using the filter method. The cast method is used to convert a column to a specific data type. To retrieve the definition of the columns in the DataFrame, call the schema property.
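As a quick sketch of cast and the schema property, assuming a DataFrame df built from a table as shown earlier:

from snowflake.snowpark.functions import col
from snowflake.snowpark.types import StringType

# Cast the amount column to a string and rename it
df_cast = df.select(col("amount").cast(StringType()).as_("amount_text"))

# Inspect the column definitions without running the query
print(df_cast.schema)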

To save data in a table follow the format mentioned below:-

df.write.mode("overwrite").save_as_table("table1")

Creating views works the same way as the corresponding SQL commands. The option method takes the name and the value of the option that you want to set.

You can create user-defined functions (UDFs) with the help of SnowPark. The Snowpark library uploads the code for your function to an internal stage and executes your function on the server, so data doesn’t need to be transferred to the client in order for the function to process it.

Creating an anonymous UDF and assigning it to a variable is possible here. If you create a named UDF, it is easier to call.

Behind the scenes, a UDF is defined by the CREATE FUNCTION command. Session.add_import() is a Snowpark method that uploads the specified files to an internal stage so your UDF can use them. Vectorized Python UDFs let you define Python functions that receive batches of input rows as Pandas DataFrames and return batches of results.
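Here is a hedged sketch of an anonymous UDF and a named, permanent UDF, assuming an open session and a stage named my_stage to hold the uploaded code:

from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import IntegerType

# Anonymous UDF assigned to a variable
add_one = udf(lambda x: x + 1, return_type=IntegerType(), input_types=[IntegerType()])

# Named, permanent UDF that other sessions can also call
@udf(name="double_it", is_permanent=True, stage_location="@my_stage",
     replace=True, return_type=IntegerType(), input_types=[IntegerType()])
def double_it(x: int) -> int:
    return x * 2

# Use both in a DataFrame query
df = session.table("sample_product_data")
df.select(add_one(col("id")), double_it(col("id"))).show()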

Creating User-Defined Table Functions (UDTFs) using the Snowpark API in Python involves implementing a UDTF handler, registering the UDTF and its handler in the Snowflake database, and defining input types and output schema. This functionality allows for custom tabular data processing within Snowflake using Python.

Implementing a UDTF Handler:

  1. Initialization Method (__init__): Optional method used to initialize stateful processing of input partitions.
  2. Process Method (process): Required method invoked for each input row. Returns tabular values as tuples.
  3. End Partition Method (end_partition): Optional method invoked to finalize the processing of input partitions.

Handlers follow the UDTF logic and must be implemented as Python classes. Detailed handler implementation is explained in the Snowflake documentation.
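As a simple illustration, here is a sketch of a handler class that emits the numbers 0 through n-1 for each input row; the class and column names are my own examples:

class GenerateNumbers:
    def process(self, n: int):
        # Called once per input row; yield one tuple per output row
        for i in range(n):
            yield (i,)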

Registering a UDTF:

  1. Use of Snowpark API: Register the UDTF using Snowpark API functions like udtf or register.
  2. Registration Parameters: Parameters include the UDTF handler class, output schema, input types, and other optional parameters.
  3. Output Schema: Specifies the schema of the UDTF’s tabular output, either as a list of column names or a StructType representing column names and types.

Defining Input Types and Output Schema:

  1. Input Types (input_types): Specify types of input parameters as a list based on snowflake.snowpark.types.DataType.
  2. Output Schema (output_schema): Specify the schema of the tabular output, either as a list of column names or a StructType with column names and types.

Examples of UDTF registration include using the udtf function, the register function, and the register_from_file function. These examples demonstrate registering UDTFs for tasks like generating numbers and performing word count operations. Registration involves specifying parameters such as UDTF handler class, output schema, input types, and additional options.
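For illustration, here is a sketch of registering the handler above with session.udtf.register and then calling it; the UDTF name is my own example:

from snowflake.snowpark.functions import lit
from snowflake.snowpark.types import IntegerType, StructType, StructField

session.udtf.register(
    GenerateNumbers,
    output_schema=StructType([StructField("number", IntegerType())]),
    input_types=[IntegerType()],
    name="generate_numbers",
    replace=True,
)

# Call the registered UDTF as a table function
session.table_function("generate_numbers", lit(5)).show()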

The Snowpark API allows developers to create stored procedures for DataFrames in Python, providing flexibility in processing data. Stored procedures can be either temporary, existing within the current session, or permanent, available for use in other sessions.

One notable feature is the ability to use third-party packages from Anaconda in a stored procedure. Developers can specify Anaconda packages to install, seamlessly integrating them into Snowflake warehouses. The session.add_packages method facilitates the addition of packages at the session level, ensuring the required dependencies are available for execution.

Creating anonymous stored procedures involves using the sproc function from the snowflake.snowpark.functions module, defining the logic within a lambda function. These procedures are suitable for one-time use within the current session.

For named stored procedures, developers can create either temporary or permanent procedures. The register method or sproc function with the is_permanent argument set to True is used for permanent stored procedures. Permanent procedures are registered only once, resolving dependencies during the registration process, and ensuring consistent dependency versions during production workloads.
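Here is a sketch of a named, permanent stored procedure, assuming a stage named my_stage exists to hold the uploaded code and that a table named sample_product_data exists:

from snowflake.snowpark import Session
from snowflake.snowpark.types import IntegerType

def add_rows(session: Session, n: int) -> int:
    # Nested queries run inside the warehouse; the session is passed in automatically
    return session.table("sample_product_data").count() + n

session.sproc.register(
    add_rows,
    name="add_rows_proc",
    return_type=IntegerType(),
    input_types=[IntegerType()],
    packages=["snowflake-snowpark-python"],
    is_permanent=True,
    stage_location="@my_stage",
    replace=True,
)

The packages argument is one way to pull in Anaconda dependencies; session.add_packages does the same thing at the session level.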

To read files within a stored procedure, developers have two options. Statically specified files can be imported as dependencies, allowing their usage within the procedure. Dynamically specified files can be read using the SnowflakeFile class from the snowflake.snowpark.files module, providing dynamic file access for streaming files of any size.
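For the dynamic case, here is a minimal sketch of a handler that uses SnowflakeFile; the scoped file URL is assumed to be passed in when the procedure is called:

from snowflake.snowpark import Session
from snowflake.snowpark.files import SnowflakeFile

def read_file(session: Session, file_url: str) -> str:
    # Open the staged file via its scoped URL and stream its contents
    with SnowflakeFile.open(file_url, "r") as f:
        return f.read()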

In Snowpark Python, you can process DataFrame data by calling system-defined SQL functions, user-defined functions (UDFs), and stored procedures. System-defined SQL functions are accessible through the snowflake.snowpark.functions module. UDFs are invoked using the call_udf function, while user-defined table functions (UDTFs) are called with table_function or join_table_function. Stored procedures can be executed using the call method of the Session class. The provided examples illustrate how to use these methods, including calling system-defined functions, UDFs, and UDTFs, showcasing the flexibility of Snowpark in data processing operations.
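To illustrate, here is a brief sketch that calls each kind of object, assuming the UDF, UDTF, and stored procedure registered in the sketches above:

from snowflake.snowpark.functions import call_udf, col, lit, upper

df = session.table("sample_product_data")

# System-defined SQL function
df.select(upper(col("name"))).show()

# User-defined function, called by name
df.select(call_udf("double_it", col("id"))).show()

# User-defined table function
session.table_function("generate_numbers", lit(3)).show()

# Stored procedure
result = session.call("add_rows_proc", 10)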

Training machine learning (ML) models with Snowpark Python involves leveraging Snowpark-optimized warehouses for resource-intensive workloads. Snowpark-optimized warehouses, designed for large-memory and compute-intensive tasks, are used to train ML models efficiently. These warehouses, configured with a single Snowpark-optimized node, support the execution of custom code for ML training directly within Snowflake.

Snowpark Python stored procedures play a key role in running custom ML code using Snowflake warehouses. These procedures can execute nested queries, utilizing the Snowpark API for Python, to load, transform, and preprocess datasets within the stored procedure’s memory. ML models are then trained and can be uploaded to a Snowflake stage, enabling the creation of User-Defined Functions (UDFs) for inference.

Guidelines for performing single-node ML training workloads include setting warehouse parameters, maximizing memory and compute resources, and potentially using separate warehouses for better performance. An example demonstrates the creation of a Snowpark-optimized warehouse and a stored procedure that trains a linear regression model using scikit-learn and Snowpark Python.
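As a hedged sketch along those lines (the warehouse name, table name, and feature columns are my own assumptions, not Snowflake's example):

from snowflake.snowpark import Session
from snowflake.snowpark.types import StringType, VariantType

# Create a Snowpark-optimized warehouse for the training workload
session.sql(
    "CREATE OR REPLACE WAREHOUSE snowpark_opt_wh "
    "WITH WAREHOUSE_SIZE = 'MEDIUM' WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'"
).collect()
session.use_warehouse("snowpark_opt_wh")

def train_model(session: Session, training_table: str) -> dict:
    # scikit-learn is resolved from the Anaconda channel inside the warehouse
    from sklearn.linear_model import LinearRegression
    # Load the dataset into the stored procedure's memory
    pdf = session.table(training_table).to_pandas()
    features, target = pdf[["FEATURE_1", "FEATURE_2"]], pdf["TARGET"]
    model = LinearRegression().fit(features, target)
    return {"r_squared": model.score(features, target)}

session.sproc.register(
    train_model,
    name="train_linear_model",
    return_type=VariantType(),
    input_types=[StringType()],
    packages=["snowflake-snowpark-python", "scikit-learn", "pandas"],
    replace=True,
)

# Run the training inside Snowflake
session.call("train_linear_model", "TRAINING_DATA")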

In summary, Snowpark Python, coupled with Snowpark-optimized warehouses, facilitates the seamless integration of ML training into Snowflake, providing a powerful solution for scalable and efficient machine learning workflows.

Testing Snowpark Python code, especially when connected to Snowflake, is crucial for ensuring functionality and preventing unintended errors. Utilizing standard testing utilities like PyTest facilitates comprehensive testing, covering both unit and integration aspects.

PyTest fixtures are employed to provide data or connections to tests. A fixture is created to establish a Snowpark Session object, enabling seamless testing within the Snowflake environment.
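Here is a minimal sketch of such a fixture; the connection values are placeholders you must supply yourself:

import pytest
from snowflake.snowpark import Session

@pytest.fixture(scope="module")
def session() -> Session:
    connection_parameters = {
        "account": "<your_account_identifier>",
        "user": "<your_username>",
        "password": "<your_password>",
        "warehouse": "<your_warehouse>",
        "database": "<your_database>",
        "schema": "<your_schema>",
    }
    session = Session.builder.configs(connection_parameters).create()
    yield session          # tests receive the live session
    session.close()        # tear down after the module's tests finish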

Unit tests focus on individual components such as Python UDFs and DataFrame transformations. For UDFs, testing involves verifying functionality against expected outputs, while DataFrame transformations are tested by encapsulating logic into testable Python methods and comparing input and output DataFrames.
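For example, here is a sketch of a unit test for a DataFrame transformation, reusing the session fixture above; add_flag_column is a hypothetical helper that adds a constant column:

from snowflake.snowpark import DataFrame
from snowflake.snowpark.functions import lit

def add_flag_column(df: DataFrame) -> DataFrame:
    # The testable transformation under test (hypothetical helper)
    return df.with_column("FLAG", lit(1))

def test_add_flag_column(session):
    input_df = session.create_dataframe([[1, "a"], [2, "b"]], schema=["ID", "NAME"])
    result = add_flag_column(input_df).collect()
    assert all(row["FLAG"] == 1 for row in result)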

Integration tests ensure that stored procedures work seamlessly with Snowflake tables. These tests involve creating input tables, executing the stored procedure, and comparing expected and actual output tables.

By setting up PyTest and creating appropriate fixtures, developers can rigorously test Snowpark Python code, ensuring reliability and stability in Snowflake environments. Thorough testing reduces the risk of breaking changes and enhances overall code quality.

Please browse further to learn more. I have written about the things I am confident in.

Check out this link to know more about me

Let’s get to know each other! https://lnkd.in/gdBxZC5j

Get my books, podcasts, placement preparation, etc. https://linktr.ee/aamirp

Get my Podcasts on Spotify https://lnkd.in/gG7km8G5

Catch me on Medium https://lnkd.in/gi-mAPxH

Follow me on Instagram https://lnkd.in/gkf3KPDQ

Udemy (Python Course) https://lnkd.in/grkbfz_N

YouTube https://www.youtube.com/@knowledge_engine_from_AamirP

Subscribe to my Channel for more useful content.
