Zero to Snowflake: Loading and Querying Semi-Structured JSON Data

Data

Zero to Snowflake: Loading and Querying Semi-Structured JSON Data

This series takes you from zero to hero with the latest and greatest cloud data warehousing platform, Snowflake.

Data can be stored in many different formats. In the previous post in this series, we discussed semi-structured file formats, focusing on a particular approach called JSON, which is formed of arrays, objects and key value pairs. This is one of many semi-structured data formats and is supported by Snowflake, along with other formats like Avro, ORC, Parquet and XML.

Up-to-date documentation of supported semi-structured file formats can be found in Snowflake’s Introduction to Semi-Structured Data.

For the purposes of this post, we will be using the following JSON structure as our example:

[
  {
    "Category" : "Transport, Logistics and Vehicles",
    "Regions" : [
      {
        "Region" : "North",
        "Sub-Categories" : [
          {
            "Sub-Category" : "Transport",
            "EmployeeSales" : [
              {
                "Employee" : "Employee 1",
                "Sales" : 1100,
                "Comment" : "Above expectation | Expected 1000"
              },
              {
                "Employee" : "Employee 2",
                "Sales" : 900,
                "Comment" : "Below expectation | Expected 1000"
              }
            ]
          },
          {
            "Sub-Category" : "Logistics",
            "EmployeeSales" : [
              {
                "Employee" : "Employee 1",
                "Sales" : 5000
              },
              {
                "Employee" : "Employee 2",
                "Sales" : 6000
              }
            ]
          },
          {
            "Sub-Category" : "Vehicles",
            "EmployeeSales" : [
              {
                "Employee" : "Employee 1",
                "Sales" : 7000
              },
              {
                "Employee" : "Employee 2",
                "Sales" : 8000
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "Category" : "Hardware",
    "Regions" : [
      {
        "Region" : "North",
        "Sub-Categories" : [
          {
            "EmployeeSales" : [
              {
                "Employee" : "Employee 1",
                "Sales" : 10000
              },
              {
                "Employee" : "Employee 2",
                "Sales" : 15000
              }
            ]
          }
        ]      
      },
      {
        "Region" : "East",
        "Sub-Categories" : [
          {
            "EmployeeSales" : [
              {
                "Employee" : "Employee 3",
                "Sales" : 23000
              },
              {
                "Employee" : "Employee 4",
                "Sales" : 27000
              }
            ]
          }
        ]
      },
      {
        "Region" : "South",
        "Sub-Categories" : [
          {
            "EmployeeSales" : [
              {
                "Employee" : "Employee 5",
                "Sales" : 20000
              },
              {
                "Employee" : "Employee 6",
                "Sales" : 30000
              }
            ]
          }
        ]
      },
      {
        "Region" : "West",
        "Sub-Categories" : [
          {
            "EmployeeSales" : [
              {
                "Employee" : "Employee 7",
                "Sales" : 50000,
                "Comment" : "Won the \"Hardware Salesman of the Year\" award"
              },
              {
                "Employee" : "Employee 8",
                "Sales" : 25000
              }
            ]
          }
        ]
      }  
    ]
  }
]

Goal in Snowflake

Our goal is to load this data into Snowflake and flatten it to achieve the following table:

Loading the Data into a Snowflake Table

This post assumes you are comfortable with staging data in Snowflake. It is recommended to look at earlier posts in this series if you are unfamiliar with this.

Before we can load any data into Snowflake, we first must create a database and schema where we will stage the data, as well as create the table into which the data will be loaded. For this example, we will be using the DEMO_DB.DEMO_SCHEMA namespace and the stage DEMO_STAGE. As for the table itself, we will need to consider the data that we are loading. Snowflake stores semi-structured data using the VARIANT field type, so we create a table with this field type:

CREATE OR REPLACE TABLE DEMO_DB.DEMO_SCHEMA.DEMO_INPUT_TABLE (
    JSON_DATA VARIANT
);

It is worth noting that Snowflake is capable of loading structured files with semi-structured fields as well. In this case, each field would have the appropriate data type, and the semi-structured field(s) would use the VARIANT data type. In this example, we only have a single set of semi-structured data—hence, a single VARIANT field.

Now that we have a destination table and the data is staged, the COPY INTO command is used to load the data. For current and complete information on the specific file format options when loading JSON data, take a look at Snowflake’s JSON File Format Documentation.

As our example dataset is relatively simple and clean, it can be loaded with the following script in Snowflake:

COPY INTO DEMO_DB.DEMO_SCHEMA.DEMO_INPUT_TABLE
FROM @DEMO_DB.DEMO_SCHEMA.DEMO_STAGE
FILE_FORMAT = (
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE
)
;

We have used an option called STRIP_OUTER_ARRAY for this load. This removes the outer set of square brackets [ ] when loading the data, separating the initial array into multiple lines. If we did not strip the outer array, our entire dataset would be loaded into a single row in the destination table. With our low data volumes, this is not a problem; however, it would be an issue in larger datasets. As we have stripped the outer array, the data is loaded into two rows in Snowflake. The file format and STRIP_OUTER_ARRAY option are explained in more detail in the previous blog post.

Once loaded, we can view the data with a simple SELECT * FROM statement:

Above: Data loaded into two rows in Snowflake

Querying the Data in Snowflake

Snowflake is extremely powerful when it comes to querying semi-structured data. To begin, use standard : notation to retrieve the category for each row. We can achieve this as Category is at the highest level in our JSON object for each line:

SELECT JSON_DATA, JSON_DATA:Category
FROM DEMO_INPUT_TABLE;

Above: Demonstrating category using : notation

By using :: notation, it is possible to define the end data type of the values being retrieved. Notice how in this example, the outer quotes are removed:

SELECT
    JSON_DATA
  , JSON_DATA:Category::string
FROM DEMO_INPUT_TABLE;

Above: Demonstrating conversion of category into a string

Standard : notation can fall short when tackling an array, depending on your intentions. It is possible to use : notation to target specific entries of an array; however, it is possible to target specific entries in an array using [ ] notation:

SELECT
    JSON_DATA
  , JSON_DATA:Category::string
  , JSON_DATA:Regions[0]
  , JSON_DATA:Regions[1]
  , JSON_DATA:Regions[2]:Region::String
  , JSON_DATA:Regions[3]:Region::String
FROM DEMO_INPUT_TABLE;

Above: Demonstrating specific array query

This approach receives each entry as its own column, which is not ideal.

Flattening Arrays

Unfortunately, it is not possible to dynamically retrieve all objects in an array with : and [ ] notation alone. This is where flattening comes in. Flattening is a term for unpackaging the semi-structured data into a columnar format by converting arrays into different rows of data. The most up-to-date information on flattening can be found in Snowflake’s Flatten Documentation.

In our case, we execute the flatten as part of our FROM clause using the function LATERAL FLATTEN. The input for the function is the array in the JSON structure that we want to flatten. It is crucial that the flatten is given an alias; in our case, the alias is Regions. There are several fields output by the LATERAL FLATTEN function; however, we are only interested in the Value for now. This contains the object within the array, and each object is provided its own row in the output view:

SELECT
    JSON_DATA
  , JSON_DATA:Category::string
  , Regions.Value
FROM DEMO_INPUT_TABLE
    ,   LATERAL FLATTEN (JSON_DATA:Regions) as Regions;

Above: Demonstrating regions’ value

Our specific semi-structured data in fact has three arrays: Regions, Sub-Categories and EmployeeSales. To fully convert this into a tabular format, it is first necessary to flatten each of these arrays in turn:

SELECT
    JSON_DATA
  , JSON_DATA:Category::string
  , Regions.value:Region::string
  , "Sub-Categories".value:"Sub-Category"::string
  , EmployeeSales.value
FROM DEMO_INPUT_TABLE
    ,   LATERAL FLATTEN (JSON_DATA:Regions) as Regions
    ,   LATERAL FLATTEN (Regions.value:"Sub-Categories") as "Sub-Categories"
    ,   LATERAL FLATTEN ("Sub-Categories".value:EmployeeSales) as EmployeeSales;

Above: Demonstrating multiple lateral flattens

Notice how “Sub-Categories”.value:”Sub-Category”::string returns NULL when the category is Hardware. This is because the “Sub-Category” key does not exist for that category.

Full Structured Query

Combining what we have learned, we can convert the entire JSON dataset into a tabular format:

SELECT
    JSON_DATA:Category::string as Category
  , Regions.value:Region::string as Region
  , "Sub-Categories".value:"Sub-Category"::string as "Sub-Category"
  , EmployeeSales.value:Employee::string as "Employee"
  , EmployeeSales.value:Sales::number as "Sales"
  , EmployeeSales.value:Comment::string as "Comment"
FROM DEMO_INPUT_TABLE
    ,   LATERAL FLATTEN (JSON_DATA:Regions) as Regions
    ,   LATERAL FLATTEN (Regions.value:"Sub-Categories") as "Sub-Categories"
    ,   LATERAL FLATTEN ("Sub-Categories".value:EmployeeSales) as EmployeeSales;

Above: Final output in Snowflake

Summary

Using the tools above, Snowflake can be leveraged to easily load and query semi-structured JSON data. If your data is in another semi-structured format such as Avro, ORC, Parquet and XML (or any other formats listed among Snowflake’s supported file formats), Snowflake can handle it in a similar fashion. This is extremely helpful when dealing with modern data sources, and I encourage you to load some JSON data into Snowflake and practice this functionality yourself.

More About the Author

Chris Hastie

Data Engineer
Zero to Snowflake: Automated Clustering in Snowflake One of Snowflake’s key selling points is automated clustering. However, it’s not immediately clear what this actually ...
Zero to Snowflake: The Nuances of Cloning One of the fantastic features in Snowflake is the support of zero-copy cloning. This is the capability to duplicate an object while ...

See more from this author →

Subscribe to our newsletter

  • I understand that InterWorks will use the data provided for the purpose of communication and the administration my request. InterWorks will never disclose or sell any personal data except where required to do so by law. Finally, I understand that future communications related topics and events may be sent from InterWorks, but I can opt-out at any time.
  • This field is for validation purposes and should be left unchanged.

InterWorks uses cookies to allow us to better understand how the site is used. By continuing to use this site, you consent to this policy. Review Policy OK

×

Interworks GmbH
Ratinger Straße 9
40213 Düsseldorf
Germany
Geschäftsführer: Mel Stephenson

Kontaktaufnahme: markus@interworks.eu
Telefon: +49 (0)211 5408 5301

Amtsgericht Düsseldorf HRB 79752
UstldNr: DE 313 353 072

black. lives. matter.