This series shows you the various ways you can use Python within Snowflake.
Snowpark for Python is the name of the new Python integration that Snowflake has recently developed. At the Snowflake Summit in June 2022, Snowpark for Python was officially released into Public Preview, which means anybody can get started using Python in their own Snowflake environment.
I would understand if you wanted to skip this post and cut directly to Snowflake’s monolithic documentation page; however, I would strongly recommend reading through this post first, as it will give you an introduction to both the possibilities and the limitations of this new functionality. We will also walk through the steps you can take to create your own Python User Defined Function, or Python UDF for short, from within Snowflake itself. There’s no need to set up anything on your local machine for this; all you need is a Snowflake account and the relevant privileges, most notably the CREATE FUNCTION privilege at the schema level.
If you are interested in finding out about Python stored procedures instead of UDFs, be sure to check out my matching “Definitive Guide to Creating Python Stored Procedures Directly with the Snowflake User Interface,” which is upcoming.
Why Use a Python UDF Instead of a SQL UDF?
Most Snowflake users are far more comfortable with SQL than they are with Python or other advanced programming languages, simply because SQL is the core language required to leverage the Snowflake platform. With that in mind, it is easy to default to SQL-based UDFs instead of other languages. However, there are several strong reasons to leverage a non-SQL language such as Python instead:
- Python is a high-level language capable of far more than standard SQL, including the ability to import and leverage functionality from a wide range of modules
- SQL UDFs can only leverage a single SQL statement. This is stated within Snowflake’s documentation as follows:

> You can include only one query expression. The expression can include UNION [ALL].
To summarise, Python is simply more versatile than SQL and unlocks a wider range of functionality. Here are a few examples of powerful functionality that is possible with Python and not with SQL:
- Chain multiple commands together instead of being restricted to a single SQL statement
- Access supporting files in cloud storage and read their contents to contribute to the overall Python script
- Perform powerful data transformations built out of multiple components using Pandas
- Leverage the map() function or list comprehension to apply a function to each value of a list; a simple way to support looping and iteration
- Apply machine learning models to generate new forecasts using libraries such as PyTorch or scikit-learn
Why NOT Use a Python UDF?
Python UDFs have no way of interacting with Snowflake objects directly. In other words, a Python UDF is incapable of executing a SQL statement. This is because UDFs do not have access to the “outside world” and a Snowflake Snowpark Session object would be required to interact with Snowflake using Python. Unlike a Stored Procedure, a UDF is not passed a Snowflake Snowpark Session as an argument and thus cannot query Snowflake objects.
Warning Regarding Staged Files and Libraries
As mentioned above, it is important to note at this time that UDFs do not have access to the “outside world.” This is a security restriction put in place intentionally by Snowflake. If you wish to create a UDF that accesses the outside world, for example to hit an API using the `requests` library, then it is recommended to use external functions instead.
How to Create Your Own Python UDF from a Snowflake Worksheet
Snowflake have now integrated the ability to create Python UDFs directly into the standard commands that can be executed from within a Snowflake worksheet. Whether you are using the classic UI or the new Snowsight UI, you can simply open a worksheet and use this code template to get started.
This template includes lines regarding optional packages and imports that will be discussed after our simple examples.
```sql
CREATE OR REPLACE FUNCTION <UDF name> (<arguments>)
returns <data type> << optional: not null >>
language python
runtime_version = '3.8'
<< optional: packages=(<list of additional packages>) >>
<< optional: imports=(<list of files and directories to import from defined stages>) >>
handler = '<name of main Python function>'
as
$$
def <name of main Python function>(<arguments>):
  # Python code to determine the main
  # functionality of the UDF.
  # This ends with a return clause:
  return <function output>
$$
;
```
The main elements to understand here are:
- Everything within the set of `$$`s on lines 9 and 15 must be Python code and forms the function itself. Meanwhile, everything outside of those `$$`s is Snowflake’s flavour of SQL and contains metadata for the function.
- On row 7, we define what is known as a handler. This is the name of the main function within our Python code that the Snowflake UDF will execute. This handler must match a function within the Python code or the UDF will fail.
- On row 2, we define the data type that the UDF will return. I intend to explain this through the simple examples below.
- On rows 5 and 6 are lines regarding optional packages and imports for the function. These will be discussed after our simple examples.
- The arguments passed to the UDF on row 1 are the same arguments that are passed to the handler function in Python on row 10; however, data types may need to be changed from Snowflake’s flavour of SQL into Python equivalents. This is discussed more in the note below.
A Note on UDF and Function Arguments
An important thing to keep in mind when creating Python UDFs is the data types that are permitted.
When defining the metadata of your UDF, all data types are viewed from the perspective of Snowflake’s flavour of SQL. Specifically, I am referring to the arguments on line 1 and the returned data type on line 2.
When defining variables within your Python code, any input arguments in the handler function (line 10) or returned values (line 14) must be one of the data types specified in the “Python Data Type” column of Snowflake’s SQL-Python Type Mappings table.
It is also important to note that the number of arguments passed to the UDF on row 1 must match the number of arguments passed to the handler Python function on row 10, and they must be in the same intended order. Though I would recommend naming the variables similarly for ease of understanding, the names of these variables do not strictly need to align as long as they are in the same order and leverage compatible data types.
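As a quick illustration of this mapping, here is a minimal sketch (the function name and logic are my own hypothetical additions, not part of the examples below): a SQL `variant` argument arrives in the Python handler as a native Python object, so the handler can inspect it directly.

```sql
CREATE OR REPLACE FUNCTION describe_variant(INPUT_VAL variant)
returns string not null
language python
runtime_version = '3.8'
handler = 'describe_variant_py'
as
$$
# A SQL VARIANT arrives in Python as a native object (dict, list,
# str, int, etc.), so we simply report its Python type name
def describe_variant_py(input_val_py):
  return type(input_val_py).__name__
$$
;
```

Calling this with `PARSE_JSON('{"a": 1}')` as the argument would return `dict`.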
Simple Examples
Let’s break this down and provide a few simple examples to explain what is happening more clearly.
Please note that these examples are deliberately minimalistic in their design, and thus do not include any error handling or similar concepts. Best practices would be to prepare functions with error handling capabilities to prevent unnecessary execution and provide useful error outputs to the end user.
Multiply Input Integer by Three
Our first example is a very simple function that takes an integer and multiplies it by three. First, let’s see the code:
```sql
CREATE OR REPLACE FUNCTION multiply_integer_by_three(INPUT_INT int)
returns int not null
language python
runtime_version = '3.8'
handler = 'multiply_by_three_py'
as
$$
def multiply_by_three_py(input_int_py: int):
  return input_int_py*3
$$
;
```
There are a few things to break down here to confirm our understanding:
- On rows 8 and 9, we have defined a very simple function in Python which multiplies an input number by three.
- The name of our handler function on row 5 matches that of our Python function defined on row 8.
- On row 2, we can see that our function will return an integer, and that the integer will not be NULL.
- The INPUT_INT argument passed to the UDF on row 1 will be the integer that is passed to the input_int_py argument when executing the handler function in Python on row 8.
If we execute this code in a Snowflake worksheet, we can then call the function, in the same way that we would any other function, to see the result.
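For example, assuming the function has been created in the current database and schema, the following call with an illustrative input should return 15:

```sql
-- Call the UDF like any other function; 5 is an arbitrary test value
SELECT multiply_integer_by_three(5);
```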
Multiply Two Input Integers Together
Our second example is another simple function, this time taking two input integers and multiplying them together. Let’s see the code:
```sql
CREATE OR REPLACE FUNCTION multiply_two_integers_together(
    INPUT_INT_1 int
  , INPUT_INT_2 int
)
returns int not null
language python
runtime_version = '3.8'
handler = 'multiply_together_py'
as
$$
def multiply_together_py(
    input_int_py_1: int
  , input_int_py_2: int
):
  return input_int_py_1*input_int_py_2
$$
;
```
There are a lot of similarities between this and our previous function; however, this time, we have multiple inputs.
Again, if we execute this code in a Snowflake worksheet, we can then call the function in the same way that we would any other function to see the result.
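For instance, a call such as the following (with arbitrary test values) should return 12:

```sql
-- Both arguments are passed through to the Python handler in order
SELECT multiply_two_integers_together(3, 4);
```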
Multiply All Integers in an Input Array by Another Integer
The last of our simple examples takes things one step further, so that we can introduce an additional Python function in our script and demonstrate the power of list comprehension. This UDF will accept an array of integers as an input along with a second input of a single integer. The function will multiply all members of the array by that second integer.
Let’s see the code:
```sql
CREATE OR REPLACE FUNCTION multiply_all_integers_in_array(
    INPUT_ARRAY array
  , INPUT_INT int
)
returns array not null
language python
runtime_version = '3.8'
handler = 'multiply_integers_in_array_py'
as
$$
# First define a function which multiplies two integers together
def multiply_together_py(
    a: int
  , b: int
):
  return a*b

# Define main function which maps multiplication function
# to all members of the input array
def multiply_integers_in_array_py(
    input_list_py: list
  , input_int_py: int
):

  # Use list comprehension to apply the function multiply_together_py
  # to each member of the input list
  return [multiply_together_py(i, input_int_py) for i in input_list_py]
$$
;
```
There are several things to note here:
- On row 5, we can see that we are now expecting an array as an output instead of an integer.
- On rows 12-17, we are defining a second Python function within our script. This function is called `multiply_together_py`. This is not strictly needed, as we could use a lambda function; however, I have included it to demonstrate the functionality.
- On row 27, we use the concept of list comprehension to apply our `multiply_together_py` function to every member of our input array.
Again, if we execute this code in a Snowflake worksheet, we can then call the function in the same way that we would any other function to see the result.
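For example, building the input array with Snowflake’s `ARRAY_CONSTRUCT` function and using illustrative values, the following call should return `[10, 20, 30]`:

```sql
-- Each member of the array is multiplied by the second argument
SELECT multiply_all_integers_in_array(ARRAY_CONSTRUCT(1, 2, 3), 10);
```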
Importing Additional Libraries from Anaconda
Even though our examples so far have all been fairly basic, we can already start to see how powerful Python UDFs can be. Not only are we able to receive inputs and use them to produce outputs, but we can also define our own functionality within the Python script and perform more complex logical steps when executing our function.
What if we wish to use Python libraries that are not part of the standard inbuilt set? For example, what if we wish to leverage Pandas, PyTorch or a wide range of other popular libraries? The good news here is that Snowflake have partnered with Anaconda, and you already have everything you need to leverage any of the libraries listed in Anaconda’s Snowflake channel.
If you wish to leverage any libraries that are not included in Anaconda’s Snowflake channel, including any libraries you have developed in-house, then you will need to import them separately. This will be discussed in the next section.
Accepting the Terms of Usage to Enable Third-Party Packages
To leverage third-party packages from Anaconda within Snowflake, an ORGADMIN must first accept the third-party terms of usage. I walk through the process now; however, more details can be found here for those who desire them. This step only needs to be completed once for the entire organisation.
- Using the ORGADMIN role in the SnowSight UI, navigate to `Admin` > `Billing` to accept the third party terms of usage
- Confirm acknowledgement
- The screen will then update to reflect the accepted terms
Examples Using Supported Third-Party Libraries
Now that we have enabled third-party libraries for our organisation, we can show some more interesting examples.
Generate a Random Name
This function imports the `faker` library to generate fake data. This is a useful tool for creating dummy datasets. Let’s see the code:
```sql
CREATE OR REPLACE FUNCTION generate_fake_name()
returns string not null
language python
runtime_version = '3.8'
packages = ('faker')
handler = 'generate_fake_name_py'
as
$$

# Import the required modules
from faker import Faker

# Define main function which generates a fake name
def generate_fake_name_py():
  fake = Faker()
  return fake.name()
$$
;
```
The core differences between this UDF and our previous examples are:
- On row 5, we have now included a line to include the `faker` library in the metadata for the UDF.
- On row 11, we use standard Python functionality to import the required tools from the `faker` library, which are then leveraged in our Python function.
Again, if we execute this code in a Snowflake worksheet, we can then call the function in the same way that we would any other function to see the result.
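For example, each execution of the following call will return a newly generated fake name:

```sql
-- Returns a different random name on each call
SELECT generate_fake_name();
```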
Generate a Snowflake-Compliant Key Pair
This function imports the `cryptography` library to generate an authentication key pair that is compliant with Snowflake. I have discussed the benefits of this in a previous blog; however, this new Python functionality renders the technical side of that blog redundant!
It is important to note that this specific approach is not best practice as unintended users may be able to access the private key from Snowflake’s history; however, I will resolve this in an upcoming blog post.
Let’s see the code:
```sql
CREATE OR REPLACE FUNCTION generate_key_pair()
returns variant not null
language python
runtime_version = '3.8'
packages = ('cryptography')
handler = 'generate_key_pair_py'
as
$$

# Import the required modules
from cryptography.hazmat.primitives import serialization as crypto_serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# Define main function which generates a Snowflake-compliant key pair
def generate_key_pair_py():
  keySize = 2048

  key = rsa.generate_private_key(public_exponent=65537, key_size=keySize)

  privateKey = key.private_bytes(
      crypto_serialization.Encoding.PEM
    , crypto_serialization.PrivateFormat.PKCS8
    , crypto_serialization.NoEncryption()
  )
  privateKey = privateKey.decode('utf-8')

  publicKey = key.public_key().public_bytes(
      crypto_serialization.Encoding.PEM
    , crypto_serialization.PublicFormat.SubjectPublicKeyInfo
  )
  publicKey = publicKey.decode('utf-8')

  key_pair = {
      "private_key" : privateKey
    , "public_key" : publicKey
  }

  return key_pair
$$
;
```
Similar to our last example, the core differences between this UDF and our basic examples are:
- On row 5, we have now included a line to include the `cryptography` library in the metadata for the UDF.
- On rows 11 and 12, we use standard Python functionality to import the required tools from the `cryptography` library, which are then leveraged in our Python function.
Again, if we execute this code in a Snowflake worksheet, we can then call the function in the same way that we would any other function to see the result.
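For example, the following call returns a variant containing both keys, and the individual values can then be accessed using Snowflake’s colon notation (e.g. `key_pair:private_key` in an outer query):

```sql
-- Returns a variant object with "private_key" and "public_key" members
SELECT generate_key_pair() AS key_pair;
```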
Importing Files and Libraries via a Stage
So far, we have demonstrated quite a lot in our miniature crash course into Python UDFs in Snowflake. Before we wrap up, I’d like to cover one last piece, which is how to import libraries that are not covered by the Anaconda Snowflake channel and how to import other external files.
Warning Regarding Staged Files and Libraries
It is important to note at this time that these external files are fixed in time. UDFs do not have access to the “outside world,” and each file must exist within the stage at the time the UDF is created, as the file is actually copied into Snowflake’s underlying storage location for the UDF. If you update the files in your external stage, your Snowflake UDF will not reflect these changes unless you recreate the UDF as well.
Unfortunately, files must be imported into UDFs individually and specifically. If you wish to import five files within a directory, you must list all five files individually. You cannot simply specify a parent directory or leverage a wildcard.
Examples Using External Files and Libraries
For these examples, we will be uploading all of our required files to the stage “STG_FILES_FOR_UDFS.”
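For reference, here is a minimal sketch of how such a stage might be created and a file uploaded to it via SnowSQL; the local file path is a hypothetical example:

```sql
-- Create the stage that will hold supporting files for our UDFs
CREATE STAGE IF NOT EXISTS STG_FILES_FOR_UDFS;

-- Upload a local file via SnowSQL (PUT cannot be run from a worksheet);
-- disable auto-compression so the UDF can read the file directly
PUT 'file://C:/temp/Dummy Mapping File.xlsx' @STG_FILES_FOR_UDFS AUTO_COMPRESS = FALSE;
```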
Import an External Excel xlsx File
This example reads an Excel xlsx file into a pandas dataframe, then uses information in that file to map the input string to the corresponding output.
The file we will be leveraging is a very simple mapping file that pairs each item with a group. The file is called “Dummy Mapping File.xlsx,” and I have uploaded it to the stage “STG_FILES_FOR_UDFS.”
Here is the code for the UDF itself:
```sql
CREATE OR REPLACE FUNCTION leverage_external_mapping_file(INPUT_ITEM string)
returns string not null
language python
runtime_version = '3.8'
packages = ('pandas', 'openpyxl') -- openpyxl required for pandas to read xlsx
imports = ('@STG_FILES_FOR_UDFS/Dummy Mapping File.xlsx')
handler = 'leverage_external_mapping_file_py'
as
$$

# Import the required modules
import pandas
import sys

# Retrieve the Snowflake import directory
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

# Define main function which leverages the mapping
def leverage_external_mapping_file_py(input_item_py: str):
  df_mapping = pandas.read_excel(import_dir + 'Dummy Mapping File.xlsx', skiprows=5, usecols="C:D")

  df_mapped_group = df_mapping[df_mapping['Item']==input_item_py]

  mapped_group = 'No matching group found'
  if len(df_mapped_group.index) > 0 :
    mapped_group = df_mapped_group.iloc[0]['Group']

  return mapped_group
$$
;
```
The unique differences for this example are:
- Row 6 includes a line to import the file from a stage. This is a critical component for our Python function to be able to access the file during the `pandas.read_excel()` function.
- Mirroring this is the code on lines 16 and 17, which leverages the `sys` library to access the location where Snowflake stores files that have been imported into the UDF.
The rest of the code in this script is specific Python code to read the Excel file into a dataframe, filter it to our specific item and return the matched group value.
Again, if we execute this code in a Snowflake worksheet, we can then call the function in the same way that we would any other function to see the result.
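For example, assuming “Apple” is one of the items in the mapping file (a hypothetical value), the following call would return its corresponding group:

```sql
-- Returns the mapped group, or 'No matching group found'
SELECT leverage_external_mapping_file('Apple');
```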
Import a Non-Standard External Library
This example is designed to leverage the `xldate` module of the `xlrd` library. This library is not included in the standard set of supported libraries, so it must be imported specifically.
Let’s see the code:
```sql
CREATE OR REPLACE FUNCTION leverage_external_library(INPUT_INT integer)
returns string not null
language python
runtime_version = '3.8'
imports = ('@STG_FILES_FOR_UDFS/xlrd/xldate.py')
handler = 'leverage_external_library_py'
as
$$

# Import the required modules
import sys

# Retrieve the Snowflake import directory
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

# Import the required external modules using the importlib.util library
import importlib.util
module_spec = importlib.util.spec_from_file_location('xldate', import_dir + 'xldate.py')
xldate = importlib.util.module_from_spec(module_spec)
module_spec.loader.exec_module(xldate)

# Define main function which leverages the external library
def leverage_external_library_py(input_int_py: int):
  return xldate.xldate_as_datetime(input_int_py, 0)
$$
;
```
The unique differences for this example are:
- Row 5 includes a line to import the
xldate.py
file from a stage. This is critical, and would match the file found in yourenv/Lib/site-packages/xlrd
directory that is created during apip install
in a virtual environment (or your default environment). - As with the previous example, lines 14 and 15 leverage the
sys
library to access the location where Snowflake stores files that have been imported into the UDF. - The block of lines 18-21 leverage the
importlib.util
library to import the specific module from the imported directory.
Again, if we execute this code in a Snowflake worksheet, we can then call the function in the same way that we would any other function to see the result.
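For example, the following call converts the Excel date serial 44500 (an arbitrary test value) into its datetime equivalent:

```sql
-- xldate_as_datetime interprets the integer as an Excel date serial
SELECT leverage_external_library(44500);
```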
Wrap Up
So, there we have it. We have covered a lot of content in this post, and some of these UDFs could warrant a blog post all of their own! My goal is to demonstrate how to create UDFs for different scenarios and provide that in a single post, so that it is easy for readers to find and refer back to if needed.
I hope this helps you get started on your own journey with Python UDFs in Snowflake. Let us know what cool things you come up with; it’s always great to see new functionality lead to productive and useful innovation.
If you wish to see the code for all of this, and other content that we have, you can find it on the InterWorks GitHub.
If you are interested in finding out about Python stored procedures, be sure to check out my matching, upcoming post, “Definitive Guide to Creating Python Stored Procedures Directly with the Snowflake User Interface.”