A Definitive Guide to Creating Python UDFs in Snowflake using Snowpark

This series shows you the various ways you can use Python within Snowflake.

Snowpark for Python is the name of Snowflake's new Python integration. At the Snowflake Summit in June 2022, Snowpark for Python was officially released into Public Preview, which means anybody can get started using Python in their own Snowflake environment.

I would understand if you wanted to skip this post and cut directly to Snowflake’s monolithic documentation page; however, I would strongly recommend reading through this post first, along with my other definitive guides for Python and Snowflake. These posts will give you an introduction to both the possibilities and limitations of this new functionality whilst walking you through how to get going yourself. In this post, we will walk through the steps you can take to create your own Python User Defined Function, or Python UDF for short, by connecting to Snowflake from an external Python script using the Snowpark for Python package.

If you are not interested in connecting to Snowflake and instead wish to simply dive into creating Python UDFs and Stored Procedures from within the Snowflake User Interface, then I would recommend my posts on those topics instead.

If you are interested in finding out about Python Stored Procedures instead of UDFs, be sure to check out my matching “Definitive Guide to Creating Python Stored Procedures in Snowflake using Snowpark,” which is coming soon!

Why Define and Leverage a Python UDF Through Snowpark Instead of the Snowflake UI?

As is made clear in my previous blog posts, you can create Python UDFs directly within the Snowflake UI without connecting through Snowpark. However, there are a few benefits to using Snowpark:

  • Use your IDE of choice to develop your Python functions instead of relying on the Snowflake UI, especially when facilitating access to users who may not be familiar with Snowflake but are familiar with Python
  • More freedom to leverage Git and other tools to maintain and deploy Python functions, especially if the same functions are used elsewhere in your technology stack
  • Directly create and leverage Python functions from within wider Python scripts outside of Snowflake whilst still pushing the compute down to Snowflake when calling functions

Warning Regarding Staged Files and Libraries

It is important to note that, at this time, UDFs do not have access to the “outside world.” This is a security restriction put in place by Snowflake intentionally. If you wish to create a UDF that accesses the outside world, for example to hit an API using the requests library, then it is recommended to use external functions instead.

Requirements

This post focuses on leveraging Snowflake Snowpark Sessions from the Snowpark for Python package to connect to Snowflake and create Python UDFs. If you haven’t already, I would recommend reading my Definitive Guide to Snowflake Sessions with Snowpark for Python, as this will give you a strong introduction to setting up a Snowpark environment to connect to Snowflake and provide you with a few shortcuts in the form of code snippets.

Throughout this blog post, I will always use the variable snowpark_session to refer to a Snowflake Snowpark Session that we have established using the above-mentioned post.
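For convenience, here is a minimal sketch of establishing such a session; every connection parameter value below is a placeholder that you would replace with your own details:

# A minimal sketch of establishing a Snowflake Snowpark Session.
# All connection parameter values are placeholders.
from snowflake.snowpark import Session

connection_parameters = {
    'account': '<account identifier>'
  , 'user': '<username>'
  , 'password': '<password>'
  , 'role': '<role>'
  , 'warehouse': '<warehouse>'
  , 'database': '<database>'
  , 'schema': '<schema>'
}

snowpark_session = Session.builder.configs(connection_parameters).create()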

How to Create Your Own Python UDF from a Snowflake Snowpark Session

Snowflake have integrated the ability to create Python UDFs directly into the standard commands that can be executed for a Snowflake Snowpark Session object. Below is a template in Python code that demonstrates how to create your own UDF using Snowpark.

This could easily be overwhelming at this stage, so we will follow this template’s description with some examples, starting off simple and getting more complex as we go.

This template includes lines regarding optional packages and imports that will be discussed after our simple examples.

# Define Python function locally
def <name of main Python function>(<arguments>):
  # Python code to determine the main 
  # functionality of the UDF. 
  # This ends with a return clause:
  return <function output>

# Import data type for returned value
from snowflake.snowpark.types import <specific Snowpark DataType object>

# Optional: Import additional data types
from snowflake.snowpark.types import <specific Snowpark DataType object>

# Optional: Import additional packages or files
snowpark_session.add_packages('<list of required packages natively available in Snowflake (i.e. included in Anaconda Snowpark channel)>')
snowpark_session.add_import('<path\\to\\local\\directory\\or\\file>')

# Upload UDF to Snowflake
snowpark_session.udf.register(
    func = <name of main Python function>
  , return_type = <specific Snowpark DataType object for returned value>
  , input_types = <list of input DataType() objects for input parameters>
  , is_permanent = True
  , name = '<UDF name>'
  , replace = True
  , stage_location = '@<UDF stage name>'
)

 

The main elements to understand here are:

  • First, we define the main Python function that will be leveraged by our UDF. This function can leverage other functions that you have defined in your script or imported from elsewhere; however, the UDF can only be assigned a single main Python function.
  • Before registering the UDF, we import the specific Snowpark DataType object that will be used for the returned value, along with any additional DataType objects required for the inputs. This is discussed more in a note below.
  • The optional add_packages() and add_import() lines add additional packages and imports to our UDF. These will be discussed after our simple examples.
  • Finally, we leverage the udf.register() method of the Snowflake Snowpark Session object to create a new UDF in Snowflake, passing the following arguments:
  • func determines the Python function that will be leveraged by our UDF, matching the name of the main Python function defined at the top of the script.
  • return_type determines the specific Snowpark DataType object for the value returned by the UDF, matching the object we imported.
  • input_types determines the list of specific Snowpark DataType objects for the UDF’s input parameters. All of these must be included in the imports.
  • is_permanent determines that the UDF we create will not be temporary. A temporary UDF will only exist within our specific Snowflake Snowpark Session object.
  • name determines the name of the UDF within Snowflake. You can pass a fully qualified name here if you prefer; otherwise, the UDF will be created in the same namespace as your Snowflake Snowpark Session object.
  • replace determines whether or not to overwrite an existing UDF with the same name. If this is set to False and a UDF already exists, an error will be returned.
  • stage_location determines the name of the Snowflake stage to which the files for our UDF will be uploaded.

Snowpark DataType Objects

Snowpark must leverage compatible data types for inputs and returned values when creating both UDFs and Stored Procedures. You can find the full list of types in Snowflake’s documentation. The examples provided in this blog leverage several different data types, which should give you a reasonable idea of how to configure your own use cases.
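For reference, every data type leveraged by the examples in this post is imported from the snowflake.snowpark.types module:

# The Snowpark data types leveraged by the examples in this post
from snowflake.snowpark.types import IntegerType
from snowflake.snowpark.types import StringType
from snowflake.snowpark.types import ArrayType
from snowflake.snowpark.types import VariantType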

Simple Examples

Let’s break this down and provide a few simple examples to explain what is happening more clearly.

Please note that these examples are deliberately minimalistic in their design and thus do not include any error handling or similar concepts. Best practice would be to prepare functions with error handling capabilities to prevent unnecessary execution and provide useful error outputs to the end user.
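As a flavour of what that might look like, here is a minimal sketch of a function with basic input validation; the function name and behaviour are purely illustrative:

# A minimal sketch of basic input validation inside a UDF function.
# Returning None surfaces as NULL in Snowflake instead of a query failure.
def multiply_by_three_safely(input_int_py: int):
  if input_int_py is None:
    return None
  return input_int_py * 3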

Multiply Input Integer by Three

Our first example is a very simple function that takes an integer and multiplies it by three. First, let’s see the code:

##################################################################
## Define the function for the UDF

def multiply_by_three(input_int_py: int):
  return input_int_py*3

##################################################################
## Register UDF in Snowflake

### Add packages and data types
from snowflake.snowpark.types import IntegerType

### Upload UDF to Snowflake
snowpark_session.udf.register(
    func = multiply_by_three
  , return_type = IntegerType()
  , input_types = [IntegerType()]
  , is_permanent = True
  , name = 'SNOWPARK_MULTIPLY_INTEGER_BY_THREE'
  , replace = True
  , stage_location = '@UDF_STAGE'
)

 

There are a few things to break down here to confirm our understanding:

  • At the top of the script, we have defined a very simple Python function which multiplies an input number by three.
  • The func argument passed to udf.register() matches the name of that Python function, multiply_by_three.
  • The return_type of IntegerType() shows that our function will return an integer, and this object has been imported from snowflake.snowpark.types.
  • The input_types list shows that the single input parameter for our function is also an integer, leveraging the same imported IntegerType().

If we execute this code as part of a Python script, we can then call the function by executing a SQL statement through the snowpark_session.sql(...).show() method to see the result.
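For example, a minimal test of our new UDF might look like the following, where 5 is simply an arbitrary input value:

# Execute a SQL statement through the session to test the UDF
snowpark_session.sql('SELECT SNOWPARK_MULTIPLY_INTEGER_BY_THREE(5)').show()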

We can also call the function from within a Snowflake worksheet in the same way that we would any other function to see the result.

Multiply Two Input Integers Together

Our second example is another simple function, this time taking two input integers and multiplying them together. Let’s see the code:

##################################################################
## Define the function for the UDF

def multiply_together(
    input_int_py_1: int
  , input_int_py_2: int
  ):
  return input_int_py_1*input_int_py_2

##################################################################
## Register UDF in Snowflake

### Add packages and data types
from snowflake.snowpark.types import IntegerType

### Upload UDF to Snowflake
snowpark_session.udf.register(
    func = multiply_together
  , return_type = IntegerType()
  , input_types = [IntegerType(), IntegerType()]
  , is_permanent = True
  , name = 'SNOWPARK_MULTIPLY_TWO_INTEGERS_TOGETHER'
  , replace = True
  , stage_location = '@UDF_STAGE'
)

 

There are a lot of similarities between this and our previous function; however, this time we have multiple inputs, so the input_types list contains a matching pair of IntegerType() objects.

Again, if we execute this code as part of a Python script, we can then call the function using the snowpark_session.sql().show() method to see the result.

Again, we can also call the function from within a Snowflake worksheet in the same way that we would any other function to see the result.

Multiply All Integers in an Input Array by Another Integer

The last of our simple examples takes things one step further so that we can introduce an additional Python function in our script and demonstrate the power of list comprehension. This UDF will accept an array of integers as an input along with a second input of a single integer. The function will multiply all members of the array by that second integer.

Let’s see the code:

##################################################################
## Define the function for the UDF

### First define a function which multiplies two integers together
def multiply_together(
    a: int
  , b: int
  ):
  return a*b

### Define main function which maps multiplication function
### to all members of the input array
def multiply_integers_in_array(
    input_list_py: list
  , input_int_py: int
  ):
  # Use list comprehension to apply the function multiply_together
  # to each member of the input list
  return [multiply_together(i, input_int_py) for i in input_list_py]

##################################################################
## Register UDF in Snowflake

### Add packages and data types
from snowflake.snowpark.types import ArrayType
from snowflake.snowpark.types import IntegerType

### Upload UDF to Snowflake
snowpark_session.udf.register(
    func = multiply_integers_in_array
  , return_type = ArrayType()
  , input_types = [ArrayType(), IntegerType()]
  , is_permanent = True
  , name = 'SNOWPARK_MULTIPLY_ALL_INTEGERS_IN_ARRAY'
  , replace = True
  , stage_location = '@UDF_STAGE'
)

 

There are several things to note here:

  • The return_type of ArrayType() shows that we are now expecting an array as an output instead of an integer, and this object is imported alongside IntegerType().
  • We must still import IntegerType(), as this is leveraged in our list of input_types.
  • At the top of the script, we define a second Python function called multiply_together, which is leveraged by our main function multiply_integers_in_array. This is not strictly needed, as we could use a lambda function; however, I have included it to demonstrate the functionality.
  • Inside the main function, we use the concept of list comprehension to apply our multiply_together function to every member of our input array.

Again, if we execute this code as part of a Python script, we can then call the function using the snowpark_session.sql().show() method to see the result.
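For example, a minimal test might build the input array with Snowflake’s ARRAY_CONSTRUCT function, using arbitrary values:

# Execute a SQL statement through the session to test the UDF,
# building the input array with ARRAY_CONSTRUCT
snowpark_session.sql('''
  SELECT SNOWPARK_MULTIPLY_ALL_INTEGERS_IN_ARRAY(
    ARRAY_CONSTRUCT(1, 2, 3), 4
  )
''').show()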

Again, we can also call the function from within a Snowflake worksheet, in the same way that we would any other function, to see the result.

Importing Additional Libraries from Anaconda

Even though our examples so far have all been fairly basic, we can already start to see how powerful Python UDFs could be. Not only are we able to receive inputs and use them to produce outputs, we can also define our own functionality within the Python script and perform more complex logical steps when executing our function.

What about if we wish to use Python libraries that are not part of the standard inbuilt set? For example, what if we wish to leverage pandas, PyTorch or a wide range of other popular libraries? The good news here is that Snowflake have partnered with Anaconda, and you already have everything you need to leverage any of the libraries listed in Anaconda’s Snowflake channel.

If you wish to leverage any libraries that are not included in Anaconda’s Snowflake channel, including any libraries you have developed in-house, then you will need to import them separately. This will be discussed in the next section.

Accepting the Terms of Usage to Enable Third-Party Packages

To leverage third-party packages from Anaconda within Snowflake, an ORGADMIN must first accept the third-party terms of usage. I’ll walk through the process now; however, more details can be found here for those who desire it. This step must only be completed once for the entire organisation.

  1. Using the ORGADMIN role in the Snowsight UI, navigate to Admin » Billing to accept the third-party terms of usage
  2. Confirm acknowledgement
  3. The screen will then update to reflect the accepted terms

Examples Using Supported Third-Party Libraries

Now that we have enabled third-party libraries for our organisation, we can show some more interesting examples.

Generate a Random Name

This function imports the faker library to generate fake data. This is a useful tool for creating dummy datasets. Let’s see the code:

##################################################################
## Define the function for the UDF

### Import the required modules 
from faker import Faker

### Define main function which generates a fake name
def generate_fake_name():
  fake = Faker()
  return fake.name()

##################################################################
## Register UDF in Snowflake

### Add packages and data types
from snowflake.snowpark.types import StringType
snowpark_session.add_packages('faker')

### Upload UDF to Snowflake
snowpark_session.udf.register(
    func = generate_fake_name
  , return_type = StringType()
  , input_types = []
  , is_permanent = True
  , name = 'SNOWPARK_GENERATE_FAKE_NAME'
  , replace = True
  , stage_location = '@UDF_STAGE'
)

 

The core differences between this UDF and our previous examples are:

  • At the top of the script, we use standard Python functionality to import the required Faker tool from the faker library, which is then leveraged in our Python function.
  • Before registering the UDF, we instruct our Snowflake Snowpark Session to leverage the faker library through the add_packages() method. As this is a supported third-party package, this is all we need to do.
  • The return_type demonstrates an example of using StringType(), whilst input_types is an empty list as the function accepts no input parameters.

Again, if we execute this code as part of a Python script, we can then call the function using the snowpark_session.sql().show() method to see the result.
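As the UDF takes no inputs, one handy way to test it is to pair it with Snowflake’s GENERATOR table function and produce several fake names in a single query; a minimal sketch:

# Generate ten fake names in one query using the GENERATOR table function
snowpark_session.sql('''
  SELECT SNOWPARK_GENERATE_FAKE_NAME() AS FAKE_NAME
  FROM TABLE(GENERATOR(ROWCOUNT => 10))
''').show()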

Again, we can also call the function from within a Snowflake worksheet in the same way that we would any other function to see the result.

Generate a Snowflake-Compliant Key Pair

This function imports the cryptography library to generate an authentication key pair that is compliant with Snowflake. I have discussed the benefits of this in a previous blog; however, this new Python functionality renders the technical side of that post redundant!

It is important to note that this specific approach is not best practice as unintended users may be able to access the private key from Snowflake’s history; however, I will resolve this in an upcoming blog post.

Let’s see the code:

##################################################################
## Define the function for the UDF

### Import the required modules 
from cryptography.hazmat.primitives import serialization as crypto_serialization
from cryptography.hazmat.primitives.asymmetric import rsa

### Define main function which generates a Snowflake-compliant key pair
def generate_key_pair():
  keySize = 2048
  
  key = rsa.generate_private_key(public_exponent=65537, key_size=keySize)
  
  privateKey = key.private_bytes(
      crypto_serialization.Encoding.PEM,
      crypto_serialization.PrivateFormat.PKCS8,
      crypto_serialization.NoEncryption()
  )
  privateKey = privateKey.decode('utf-8')
  
  publicKey = key.public_key().public_bytes(
      crypto_serialization.Encoding.PEM,
      crypto_serialization.PublicFormat.SubjectPublicKeyInfo
  )
  publicKey = publicKey.decode('utf-8')
  
  key_pair = {
      "private_key" : privateKey
    , "public_key" : publicKey
  }

  return key_pair

##################################################################
## Register UDF in Snowflake

### Add packages and data types
from snowflake.snowpark.types import VariantType
snowpark_session.add_packages('cryptography')

### Upload UDF to Snowflake
snowpark_session.udf.register(
    func = generate_key_pair
  , return_type = VariantType()
  , input_types = []
  , is_permanent = True
  , name = 'SNOWPARK_GENERATE_KEY_PAIR'
  , replace = True
  , stage_location = '@UDF_STAGE'
)

 

Similar to our last example, the core differences between this UDF and our basic examples are:

  • At the top of the script, we use standard Python functionality to import the required tools from the cryptography library, which are then leveraged in our Python function.
  • Before registering the UDF, we instruct our Snowflake Snowpark Session to leverage the cryptography library through the add_packages() method. As this is a supported third-party package, this is all we need to do.
  • The return_type demonstrates an example of using VariantType(), which allows the UDF to return the key pair as a semi-structured object, whilst input_types is again an empty list.

Again, if we execute this code as part of a Python script, we can then call the function using the snowpark_session.sql().show() method to see the result.
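As the UDF returns a variant, we can leverage Snowflake’s standard semi-structured notation to extract an individual key from the result; a minimal sketch:

# Extract just the public key from the returned variant
snowpark_session.sql('''
  SELECT KEY_PAIR['public_key']::STRING AS PUBLIC_KEY
  FROM (SELECT SNOWPARK_GENERATE_KEY_PAIR() AS KEY_PAIR)
''').show()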

Again, we can also call the function from within a Snowflake worksheet in the same way that we would any other function to see the result.

Importing Files and Libraries via a Stage

So far, we have demonstrated quite a lot in our miniature crash course into Python UDFs in Snowflake. Before we wrap up, I’d like to cover one last piece, which is how to import libraries that are not covered by the Anaconda Snowflake channel and how to import other external files.

For these examples, we will be uploading all of our required files with the snowpark_session.add_import() method.

Import an External Excel xlsx File

This example reads an Excel xlsx file into a pandas dataframe, then uses information in that file to map the input string to the corresponding output.

The file we will be leveraging is a very simple mapping file, containing an “Item” column and a corresponding “Group” column.

The file is called “Dummy Mapping File.xlsx” and sits on our local machine in the following relative directory path:

User Defined Functions\Supporting Files\

Here is the code for the UDF itself:

##################################################################
## Define the function for the UDF

### Import the required modules 
import pandas
import openpyxl # openpyxl required for pandas to read xlsx
import sys

### Define main function which leverages the mapping
def leverage_external_mapping_file(input_item_py: str):

  # Retrieve the Snowflake import directory
  IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
  import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

  # Read mapping into Pandas
  df_mapping = pandas.read_excel(import_dir + 'Dummy Mapping File.xlsx', skiprows=5, usecols="C:D")

  # Map input value
  df_mapped_group = df_mapping[df_mapping['Item']==input_item_py]
  mapped_group = 'No matching group found'
  
  if len(df_mapped_group.index) > 0 :
    mapped_group = df_mapped_group.iloc[0]['Group']

  return mapped_group

##################################################################
## Register UDF in Snowflake

### Add packages and data types
from snowflake.snowpark.types import StringType
snowpark_session.add_packages(['pandas', 'openpyxl'])
snowpark_session.add_import('User Defined Functions/Supporting Files/Dummy Mapping File.xlsx')

### Upload UDF to Snowflake
snowpark_session.udf.register(
    func = leverage_external_mapping_file
  , return_type = StringType()
  , input_types = [StringType()]
  , is_permanent = True
  , name = 'SNOWPARK_LEVERAGE_EXTERNAL_MAPPING_FILE'
  , replace = True
  , stage_location = '@UDF_STAGE'
)

 

The unique differences for this example are:

  • Within our Python function, we leverage the sys library to access the location where Snowflake stores files that have been imported into the UDF. This is potentially the most useful snippet of code in this blog, as I found it to be the hardest thing to find in the documentation!
  • Before registering the UDF, the add_import() method imports the file from our local directory into our Snowflake Snowpark Session. This is a critical component for our Python function to be able to access the file during the pandas.read_excel() function.
  • We also demonstrate adding multiple packages in a single add_packages() call by passing a list.

The rest of the code in this script is standard Python code to read the Excel file into a dataframe, filter it to our specific item and return the matched group value.

Again, if we execute this code as part of a Python script, we can then call the function using the snowpark_session.sql().show() method to see the result.
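For example, a minimal test might look like the following; note that “Example Item” is a hypothetical value that would need to match an Item present in the mapping file:

# Test the mapping UDF; the input value is hypothetical and must
# match an Item in the mapping file to return a group
snowpark_session.sql('''
  SELECT SNOWPARK_LEVERAGE_EXTERNAL_MAPPING_FILE('Example Item')
''').show()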

Again, we can also call the function from within a Snowflake worksheet in the same way that we would any other function to see the result.

Import a Non-Standard External Library

This example is designed to leverage the xldate_as_datetime function of the xlrd library. This library is not included in the standard set of supported libraries, so it must be imported separately.

Let’s see the code:

##################################################################
## Define the function for the UDF

### Import the required modules 
import xlrd

# Define main function which leverages the mapping
def leverage_external_library(input_int_py: int):

  return xlrd.xldate.xldate_as_datetime(input_int_py, 0)

##################################################################
## Register UDF in Snowflake

### Add packages and data types
from snowflake.snowpark.types import StringType
from snowflake.snowpark.types import IntegerType
snowpark_session.add_import('User Defined Functions/Supporting Files/xlrd')

### Upload UDF to Snowflake
snowpark_session.udf.register(
    func = leverage_external_library
  , return_type = StringType()
  , input_types = [IntegerType()]
  , is_permanent = True
  , name = 'SNOWPARK_LEVERAGE_EXTERNAL_LIBRARY'
  , replace = True
  , stage_location = '@UDF_STAGE'
)

 

The unique difference for this example is:

  • Before registering the UDF, the add_import() method imports the xlrd package from our local directory into our Snowflake Snowpark Session. This is a critical component for our Python function to be able to access the library when calling the xlrd.xldate.xldate_as_datetime() function.

Again, if we execute this code as part of a Python script, we can then call the function using the snowpark_session.sql().show() method to see the result.
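For example, a minimal test using an arbitrary Excel date serial number, which xlrd converts into a datetime in the 1900 date system:

# Convert an arbitrary Excel date serial number into a datetime
snowpark_session.sql('SELECT SNOWPARK_LEVERAGE_EXTERNAL_LIBRARY(44561)').show()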

Again, we can also call the function from within a Snowflake worksheet in the same way that we would any other function to see the result.

Wrap Up

So, there we have it. We have covered a lot of content in this post, and some of these UDFs could warrant a blog post of their own! My goal was to demonstrate how to create UDFs for different scenarios, all in a single post that is easy for readers to find and refer back to if needed.

I hope this helps you get started on your own journey with Python UDFs in Snowflake via Snowpark. Let us know what cool things you come up with. It’s always great to see new functionality lead to productive and useful innovation.

If you wish to see the code for all of this and other content that we have, you can find it on the InterWorks GitHub.

If you are interested in finding out about Python stored procedures, be sure to check out my matching “Definitive Guide to Creating Python Stored Procedures in Snowflake using Snowpark,” which is coming soon!
