Snowflake External Tables: Connect Efficiently to Your Data Lake


In this blog post, we will dive into the realm of Snowflake External Tables, a feature redefining data management and analysis. If you find yourself grappling with massive datasets, complex data structures or ever-increasing data volumes, you’re not alone. In this age of information abundance, businesses face unprecedented challenges in efficiently storing, processing and accessing data. Fortunately, Snowflake’s approach to External Tables offers a solution that empowers organisations to effortlessly integrate, query and analyse data from external sources without compromising on performance or scalability. This allows you to follow a data lake approach to your data files whilst still leveraging them through Snowflake and exposing transformed data to end users. We will be discussing the following:

  • Creating a File Format
  • Creating an External Stage
  • Creating an External Table
  • Performance optimisation, including partitioning and materialized views
  • Refreshing External Table metadata and refresh history

Creating a File Format

Snowflake File Format Objects are configurations that define how data is organised and stored in files within Snowflake. They specify the structure and encoding of data files, enabling Snowflake to efficiently read, write and process data in various formats like CSV, JSON, Parquet, Avro and more. File Format Objects allow users to customise settings such as field delimiters, compression options, character encoding and handling of null values, ensuring compatibility with different data sources and optimising storage and query performance. We will be creating a file format to handle our News Articles JSON data. Since our files are structured to each contain the full file contents as a list, we leverage the STRIP_OUTER_ARRAY option to break each list member into an individual record.

CREATE OR REPLACE FILE FORMAT MY_JSON_FILE_FORMAT 
    TYPE = 'JSON' 
    STRIP_OUTER_ARRAY = TRUE 
;
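The options available depend on the file type. As a hedged illustration of the customisation settings mentioned above (delimiters, compression, null handling), here is what a CSV file format might look like; the name and option values are purely illustrative, not part of this example's pipeline:

```sql
-- Illustrative CSV file format (hypothetical name) showing some of the
-- customisation options mentioned above.
CREATE OR REPLACE FILE FORMAT MY_CSV_FILE_FORMAT
    TYPE = 'CSV'
    FIELD_DELIMITER = ','     -- field separator
    SKIP_HEADER = 1           -- ignore the header row
    COMPRESSION = 'GZIP'      -- files are gzip-compressed
    NULL_IF = ('', 'NULL')    -- treat these values as NULL
;
```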

Creating an External Stage

The next step involves creating an external stage (using CREATE STAGE) that establishes a connection to an external location housing your data files, such as an S3 Bucket. Since we know our files in the stage are JSON, we can specify the file format that we created above when creating the stage.

CREATE STAGE MY_STAGE
    URL = 's3://ext-table-example-wp/'
    STORAGE_INTEGRATION = MY_STORAGE_INTEGRATION
    FILE_FORMAT = MY_JSON_FILE_FORMAT
;

In this case, we employed a storage integration to facilitate authentication with AWS. For further details about storage integrations and their setup procedures, refer to Snowflake's documentation on CREATE STORAGE INTEGRATION and configuring access to Amazon S3.
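For context, a storage integration for S3 is created along the following lines. This is a sketch only: the integration name matches the stage definition above, but the role ARN is a placeholder you would replace with the IAM role configured in your AWS account, and creating integrations typically requires the ACCOUNTADMIN role or the CREATE INTEGRATION privilege:

```sql
-- Sketch of an AWS storage integration; the ARN below is a placeholder.
CREATE STORAGE INTEGRATION MY_STORAGE_INTEGRATION
    TYPE = EXTERNAL_STAGE
    STORAGE_PROVIDER = 'S3'
    ENABLED = TRUE
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/<role-name>'
    STORAGE_ALLOWED_LOCATIONS = ('s3://ext-table-example-wp/')
;
```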

Creating an External Table

Once the file format and the stage are created, we can create our external table. For this example, we will start by creating a very simple external table that does nothing more than directly access the data in the files.

CREATE OR REPLACE EXTERNAL TABLE NEWS_ARTICLES 
WITH LOCATION = @MY_STAGE/ 
FILE_FORMAT = MY_JSON_FILE_FORMAT 
;


In the example above, your external table will be created with the contents of your JSON file. This means your external table will have one column called VALUE and a row for each object that your JSON file contains. We can query specific attributes by using the $1 notation, for example:

SELECT
    $1:authors AS AUTHORS
    ,$1:category AS CATEGORY
    ,$1:date AS PUBLISHED_DATE
    ,$1:headline AS HEADLINE
    ,$1:link AS WEBSITE_URL
    ,$1:short_description AS SHORT_DESC
FROM NEWS_ARTICLES
;


This can be an extremely useful way to validate the data of your external table. You can also create external tables specifying the fields that you’d like to use with the $1 notation on your create external table statement. This method of creating external tables is useful when you know what your schema looks like. In this example, we’ll be re-creating the NEWS_ARTICLES table:

CREATE OR REPLACE EXTERNAL TABLE NEWS_ARTICLES ( 
    AUTHOR STRING AS ($1:authors::STRING) 
    ,CATEGORY STRING AS ($1:category::STRING) 
    ,PUBLISHED_DATE DATE AS ($1:date::DATE) 
    ,HEADLINE STRING AS ($1:headline::STRING) 
    ,LINK STRING AS ($1:link::STRING) 
    ,SHORT_DESC STRING AS ($1:short_description::STRING) 
) 
WITH LOCATION = @MY_STAGE/ 
FILE_FORMAT = MY_JSON_FILE_FORMAT 
;
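External tables also expose metadata pseudocolumns, such as METADATA$FILENAME and METADATA$FILE_ROW_NUMBER, which can help trace each record back to its source file. A quick sketch of how they might be queried alongside the parsed attributes:

```sql
-- Trace each record back to its source file and row position.
SELECT
    METADATA$FILENAME AS SOURCE_FILE
    ,METADATA$FILE_ROW_NUMBER AS ROW_IN_FILE
    ,$1:headline::STRING AS HEADLINE
FROM NEWS_ARTICLES
LIMIT 10;
```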

Another great feature of external tables is the ability to load data using a PATTERN. Imagine your S3 bucket contained data from sales and customers in a single folder called sales-and-customers. For sales, your files are named sales_001 to sales_009 and your customer files are named customer_001 to customer_009. In this case, if you want to create an external table with only customer data, you can use the PATTERN property in your CREATE EXTERNAL TABLE statement, for example:

CREATE OR REPLACE EXTERNAL TABLE PATTERN_TESTING 
WITH LOCATION = @MY_STAGE/sales-and-customers 
PATTERN = '.*customer_.*[.]json' 
FILE_FORMAT = MY_JSON_FILE_FORMAT 
;
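Before creating the table, it can be worth sanity-checking which files the pattern actually matches. Assuming the files are already present in the stage, the LIST command accepts the same PATTERN property:

```sql
-- List only the files matching the pattern before creating the table.
LIST @MY_STAGE/sales-and-customers PATTERN = '.*customer_.*[.]json';
```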

Performance Optimisation

As the data resides outside Snowflake, querying an external table may result in slower performance compared to querying an internal table stored within Snowflake. However, there are two ways to enhance the query performance for external tables. Firstly, when creating your External Table, you can improve performance by adding partitions. Alternatively, you can also create a Materialised View (Enterprise Edition Feature) based on the external table. Both approaches offer optimisations to expedite data retrieval and analysis from external tables.

Partitioning Your External Table

Partitioning your external tables is highly advisable, and to implement this, ensure that your underlying data is structured with logical paths incorporating elements like date, time, country or similar dimensions in the path. For this example, we will be partitioning news articles based on their published year. In this S3 bucket, I created three different folders, one for each year that will be used for partitioning. Each folder has a JSON file inside with multiple news articles for each year.


The JSON files inside the folder are structured in the following way:

[ 
    { 
       "link": "<url>", 
       "headline": "<headline>", 
       "category": "<category>", 
       "short_description": "<short description>", 
       "authors": "<authors>", 
       "date": "2020-01-01" 
    }, 
    {...} 
]

As mentioned above, ensure that your underlying data is structured with logical paths. We can verify the logical path by listing the contents of the stage:

LIST @MY_STAGE;


Given we structured the file path with a folder for each year, we can use SPLIT_PART(METADATA$FILENAME,'/', 1) to generate our partitions. To confirm what our partitions look like, we can use the following SELECT statement:

SELECT DISTINCT 
    SPLIT_PART(METADATA$FILENAME,'/', 1) 
FROM @MY_STAGE 
;


Now we can create our external table with the partition on the year:

CREATE OR REPLACE EXTERNAL TABLE NEWS_ARTICLES_WITH_PARTITION ( 
    AUTHOR STRING AS ($1:authors::STRING) 
    ,CATEGORY STRING AS ($1:category::STRING) 
    ,PUBLISHED_DATE DATE AS ($1:date::DATE) 
    ,HEADLINE STRING AS ($1:headline::STRING) 
    ,LINK STRING AS ($1:link::STRING) 
    ,SHORT_DESC STRING AS ($1:short_description::STRING) 
    ,FILE_PARTITION STRING AS (SPLIT_PART(METADATA$FILENAME,'/', 1)) 
) 
PARTITION BY (FILE_PARTITION) 
WITH LOCATION = @MY_STAGE/ 
FILE_FORMAT = MY_JSON_FILE_FORMAT 
;


Note that by default, the VALUE column containing the JSON object will be the first column of your table. We can then run a simple SELECT statement and check the query profile to understand what impact the partitions had on query performance. When analysing the query profile, we can see that we scanned 0.93 MB and only one of the three partitions that exist.

SELECT * FROM NEWS_ARTICLES_WITH_PARTITION WHERE FILE_PARTITION = '2020';


To compare the performance of a partitioned table versus a non-partitioned table, we can query the non-partitioned table from the earlier section (here named NEWS_ARTICLES_WITHOUT_PARTITION), using the YEAR function on the published date for filtering. This time, we scanned 2.5 MB and all of the partitions (the whole dataset):

SELECT * FROM NEWS_ARTICLES_WITHOUT_PARTITION WHERE YEAR(PUBLISHED_DATE) = 2020;


Partitioning in this specific example will not make a difference since the dataset is small; however, it can yield massive improvements when querying larger datasets.

Materialized Views (Enterprise Edition Feature)

A materialized view is an optimised and pre-computed data set that stems from a query specification (the SELECT statement in the view definition) and is stored for future use. By pre-computing the data, querying a materialized view is considerably faster compared to executing the same query against the base table from which the view is derived. This performance advantage becomes especially significant when dealing with complex and frequently executed queries.

Materialized views excel at speeding up resource-intensive operations such as aggregation, projection and selection, particularly when these operations are performed regularly on vast datasets. Their ability to store and present pre-processed data allows for quicker access and minimises the need for repeated, computationally expensive calculations.

The example above works well for simple JSON structures; however, it can become cumbersome when your JSON objects have nested elements. In cases like this, using a materialized view to flatten and parse the semi-structured data into a structured, tabular format for users to query can be extremely beneficial. For example, imagine a file with the randomly generated JSON below:

[
  {
    "id": "64cb14c1037b347e044ffc52",
    "email": "sonya_lowe@zensor.ovh",
    "username": "sonya90",
    "profile": {
      "name": "Sonya Lowe",
      "company": "Zensor",
      "dob": "1990-07-04",
      "address": "12 Girard Street, Wacissa, Colorado",
      "location": {
        "lat": -70.917534,
        "long": 60.715719
      },
      "about": "Ex commodo tempor et laboris aute et do eiusmod deserunt est adipisicing ipsum. Esse nostrud labore veniam nisi sit aute in Lorem non est."
    },
    "apiKey": "cffbe3b8-d9f9-46d1-92cc-4e1ac51b32b5",
    "roles": [
      "member",
      "admin"
    ],
    "createdAt": "2014-10-05T00:11:09.421Z",
    "updatedAt": "2014-10-06T00:11:09.421Z"
  },
  {...}
]

In this case, we can use the stage we created in the previous sections called MY_STAGE. One of the advantages of external tables lies in the ability to leverage subdirectories. This allows us to create distinct tables from various files located in different folders within your external storage system. By utilising subdirectories, we can efficiently organise and manage data, facilitating seamless access and analysis across multiple datasets. In my S3 bucket, I created a folder called users containing the JSON file we want to use to create our table. The logical path with the JSON file now looks like this: s3://ext-table-example-wp/users/random-generated-json. When creating our external table, we just need to reference the folder name when specifying the location, as follows:

CREATE OR REPLACE EXTERNAL TABLE RANDOMLY_GENERATED_JSON 
WITH LOCATION = @MY_STAGE/users 
FILE_FORMAT = MY_JSON_FILE_FORMAT 
;


Once we have this data loaded in our external table, we can create the Materialized View to flatten and parse the data:

CREATE OR REPLACE MATERIALIZED VIEW USERS_JSON AS 
SELECT 
    $1:profile.name AS NAME 
    ,$1:email AS EMAIL 
    ,$1:username AS USERNAME 
    ,$1:profile.company AS COMPANY 
    ,$1:profile.dob AS DATE_OF_BIRTH 
    ,$1:profile.address AS ADDRESS 
    ,$1:profile.location.lat AS LATITUDE 
    ,$1:profile.location.long AS LONGITUDE 
    ,$1:roles AS ROLES 
FROM RANDOMLY_GENERATED_JSON 
;
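Note that the ROLES column in the view above is still an array. If users need one row per role, the array can be exploded with LATERAL FLATTEN. The query below is a hedged sketch of that technique as a standalone SELECT (Snowflake places restrictions on what a materialized view definition may contain, so verify against the documentation before baking FLATTEN into a view):

```sql
-- Explode the roles array: one output row per (user, role) pair.
SELECT
    $1:username::STRING AS USERNAME
    ,R.VALUE::STRING AS ROLE
FROM RANDOMLY_GENERATED_JSON
    ,LATERAL FLATTEN(INPUT => $1:roles) R
;
```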

This example serves as a simplified representation of semi-structured data handling. In reality, real-world semi-structured datasets can be remarkably more intricate and deeply nested.

Refreshing Your External Table Metadata

When creating your external table, you can set the parameter AUTO_REFRESH to TRUE. This automatically refreshes the metadata when new or updated data files are available in the stage you used when creating your external table. For example:

CREATE OR REPLACE EXTERNAL TABLE RANDOMLY_GENERATED_JSON 
WITH LOCATION = @JSON_STAGE/ 
FILE_FORMAT = MY_JSON_FILE_FORMAT 
AUTO_REFRESH = TRUE 
;

It’s important to note that setting AUTO_REFRESH to TRUE is not supported if you created your external table with manually added partitions (for example, when PARTITION_TYPE = USER_SPECIFIED). When an external table is created, its metadata is refreshed automatically once unless REFRESH_ON_CREATE = FALSE. To ensure that Snowflake is notified when new or updated data becomes available for reading into the external table metadata, it is essential to set up an event notification for your storage location. You can learn more about how to set up these event notifications in Snowflake’s documentation on refreshing external tables automatically for your cloud storage provider.

Please be aware that when working with Azure and GCP, AUTO_REFRESH = TRUE will only work if the INTEGRATION parameter in your CREATE EXTERNAL TABLE statement is specified, for example:

CREATE EXTERNAL TABLE YOUR_EXTERNAL_TABLE 
WITH LOCATION = @YOUR_EXTERNAL_STAGE 
FILE_FORMAT = YOUR_FILE_FORMAT 
INTEGRATION = YOUR_INTEGRATION_NAME 
;

In addition to modifying the AUTO_REFRESH attribute, you have the option to manually trigger the refresh of external tables. This can be achieved by executing the command ALTER EXTERNAL TABLE <table_name> REFRESH on the specific table for which you wish to update the data.
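A manual refresh might look like the following. The first statement synchronises the metadata for the whole table; the second, assuming your stage is organised into subfolders as in the partitioning example above, restricts the refresh to a relative path:

```sql
-- Synchronise the external table metadata with the files in the stage.
ALTER EXTERNAL TABLE NEWS_ARTICLES REFRESH;

-- Optionally, refresh only the files under a relative path.
ALTER EXTERNAL TABLE NEWS_ARTICLES REFRESH '2020/';
```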

Metadata Refresh History

EXTERNAL_TABLE_FILE_REGISTRATION_HISTORY

The EXTERNAL_TABLE_FILE_REGISTRATION_HISTORY table function is a great feature to use when you need to retrieve information about the metadata of an external table. It returns results for the external table owner (i.e. the role with the OWNERSHIP privilege on the external table), for a higher role, or for a role that has the USAGE privilege on the database and schema containing the external table plus any privilege on the external table itself. The information returned includes files that were added or removed as part of a metadata refresh, along with any errors found when refreshing the metadata:

SELECT * FROM 
TABLE(INFORMATION_SCHEMA.EXTERNAL_TABLE_FILE_REGISTRATION_HISTORY(TABLE_NAME=>'MYTABLE'));

Be aware that this table function cannot retrieve information until the external table is refreshed.

AUTO_REFRESH_REGISTRATION_HISTORY

Users with the ACCOUNTADMIN role or a role featuring the global MONITOR USAGE privilege can leverage the AUTO_REFRESH_REGISTRATION_HISTORY table function. This function allows you to retrieve the history of data file registrations in metadata for designated objects, along with the corresponding credits billed for these operations. Note that this function returns billing activity within the last 14 days. The syntax is as follows (all arguments are optional):

SELECT * FROM TABLE(INFORMATION_SCHEMA.AUTO_REFRESH_REGISTRATION_HISTORY( 
  DATE_RANGE_START=>TO_TIMESTAMP_TZ('2023-08-05 12:00:00.000 -0700'), 
  DATE_RANGE_END=>TO_TIMESTAMP_TZ('2023-08-05 12:30:00.000 -0700'), 
  OBJECT_TYPE=>'EXTERNAL_TABLE') 
  ) 
;

Wrap Up

Thank you for taking the time to learn more about Snowflake’s External Tables! With the ever-growing complexity and volume of data, Snowflake steps in as the ultimate problem solver, seamlessly integrating and analysing data from various sources while maintaining top-notch performance. From setting up the basics like file formats and external stages, to the more advanced creation and optimisation of external tables, we’ve covered it all. We even explored performance enhancement tactics like partitioning and materialized views, and delved into the mechanics of refreshing external table metadata – a way of ensuring your data stays up to date. As we navigate the dynamic landscape of data management, Snowflake’s External Tables shine as a versatile tool, bridging the gap between efficient storage and comprehensive data analysis.

More About the Author

Guilherme Rampani

Data Engineer