This series highlights how Snowflake excels across different data workloads through a new cloud data platform that organizations can trust and rely on as they move into the future.
When talking about the “doing” part of data engineering, folks usually want to say something about ETL like, “How are your ETLs running today?” This is a horrible term and even worse in plural form. The new hotness is to flip the “T” and the “L” to get ELT, which is unfortunately not a delicious sandwich but fortunately is a delicious pattern of data movement. However, I think the more generic, less confusing term for my hunger pangs is what’s called a data pipeline. Because if data is the new oil, it should move through a pipeline!
Unlike oil pipelines, however, Snowflake data pipelines are easy to build and maintain. No leaks here! Snowflake offers this process out of the box; no need to call in that roughneck developer to drill a hole in your applications and yank out that sweet, sweet byte gold.
Let’s take a look at each piece of the integrated data engineering experience Snowflake offers.
Snowpipe
First things first—you have to get your data into Snowflake. Typically, that’s done with a COPY INTO command. With your files already available in cloud object storage, you just tell Snowflake, “Hey Snowflake, copy those files into this table.” It’s not quite that simple but pretty close. The one problem here is you have to tell Snowflake to do it. Sort of like my wife telling me to take out the garbage.
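That manual nudge is just a COPY INTO statement. As a bare-bones sketch (the table name, stage name, and file format settings here are all placeholders), it looks something like this:

COPY INTO my_table
  FROM @my_stage/daily/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);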
This is where Snowpipe can help. Snowpipe is a service that automatically loads data into Snowflake from cloud storage systems. All you have to do is create a pipe whose body is a COPY INTO statement; Snowpipe uses that statement as a template and does the loading work for you.
The interesting part is how Snowflake leverages event notifications from cloud object storage to tell Snowpipe what to load. For example, an event notification message can be sent to AWS’ Simple Queue Service (SQS) from AWS’ Simple Storage Service (S3) when a new file shows up. Snowpipe watches that message queue and kicks off a load whenever a new notification arrives. Of course, this isn’t just for AWS. You can leverage similar services available in both Google Cloud Platform and Microsoft Azure.
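As a rough sketch with placeholder names (a pipe called my_pipe wrapping the same COPY INTO from above), the auto-ingest setup looks something like this, and DESC PIPE tells you which queue your storage events should point at:

CREATE OR REPLACE PIPE my_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO my_table
  FROM @my_stage/daily/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- The notification_channel column in the output holds the queue identifier
-- to configure your cloud storage event notifications against.
DESC PIPE my_pipe;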
Next, we’ll want to capture those changes in the table data and do something with them.
Streams
Streams are how Snowflake does change data capture. A stream keeps track of what data has changed (been inserted, updated, or deleted) in a table. Streams function like any other table object in Snowflake. For example, you can query a stream just like you query a table. The interesting thing here is that a stream returns not only the changed rows from the table but also some metadata columns describing what happened to each of those records.
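Creating a stream is a one-liner. Assuming a source table called my_table (a placeholder name), it looks something like this:

CREATE OR REPLACE STREAM my_table_stream ON TABLE my_table;

-- Querying the stream returns the changed rows along with metadata columns
-- such as METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID.
SELECT * FROM my_table_stream;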
Since we can query the stream like we can query a table, and we have these metadata columns that tell which rows have been changed, we can write a query that asks for only the kind of changes we’re interested in. For example, if we’re only interested in new records, we can write a query to do just that:
SELECT * FROM my_table_stream WHERE metadata$action = 'INSERT';
With the results of the query, we can take only our new data and, for example, insert that data into another table, as in the sketch below.
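Assuming a target table named my_target_table with matching columns col1, col2, and col3 (all placeholders), that move could look like this:

-- Consuming the stream in a DML statement also advances its offset,
-- so the same changes won't be picked up again on the next run.
INSERT INTO my_target_table
  SELECT col1, col2, col3
  FROM my_table_stream
  WHERE metadata$action = 'INSERT';

The problem is we need a way to run that statement on some sort of schedule.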
Lastly, we’ll see how we can run SQL actions on a schedule.
Tasks
So we have our stream with the changed data, and we have a query that selects only the new records in our table. With a Snowflake task, we can run that query on a schedule so that as new data is loaded into our initial table, we can capture it and move it.
Tasks are a way to run actions in Snowflake. Typically, this might be a SQL statement or a stored procedure. Tasks provide two options for scheduling: interval-based (in minutes) and cron-based. Cron is the scheduling utility you’ll find in most Linux-based systems, and a cron schedule is the right choice when you want your task to run on a specific day or at a specific time; for example, inserting your new data into a target table every day at 1 AM. An interval-based schedule can be used if you don’t have a specific time preference and would just like the task to run every certain number of minutes. For example, if you wanted to continuously insert your new data into a target table as new data gets loaded, you might set the interval schedule to every five minutes.
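For illustration, here is a sketch of both flavors, assuming a warehouse named my_wh and reusing the placeholder stream and target table from above (you would pick one flavor, not both):

-- Cron-based: run every day at 1:00 AM UTC
CREATE OR REPLACE TASK load_daily_task
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 1 * * * UTC'
AS
  INSERT INTO my_target_table
    SELECT col1, col2, col3
    FROM my_table_stream
    WHERE metadata$action = 'INSERT';

-- Interval-based: run every five minutes
CREATE OR REPLACE TASK load_continuous_task
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
AS
  INSERT INTO my_target_table
    SELECT col1, col2, col3
    FROM my_table_stream
    WHERE metadata$action = 'INSERT';

-- Tasks are created in a suspended state; resume one to start its schedule.
ALTER TASK load_daily_task RESUME;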
You might be wondering, “Won’t these tasks eat up a bunch of my Snowflake credits if they are just running constantly?” And the answer is YES! But if you’re using a stream, you can avoid blowing out your credit budget by setting the task’s WHEN parameter with the SYSTEM$STREAM_HAS_DATA function. This way, the task only executes its action if the stream contains data, and the task doesn’t consume any credits checking the stream for data. This ensures your task can run and run without needlessly consuming credits.
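A sketch of that pattern, again using the placeholder names from above:

CREATE OR REPLACE TASK load_continuous_task
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
  -- Skip this run entirely (and consume no credits) if the stream is empty.
  WHEN SYSTEM$STREAM_HAS_DATA('MY_TABLE_STREAM')
AS
  INSERT INTO my_target_table
    SELECT col1, col2, col3
    FROM my_table_stream
    WHERE metadata$action = 'INSERT';

ALTER TASK load_continuous_task RESUME;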
Stay Tuned for More
If you’re following along with this series, rest assured that there’s more to come! Check back on the blog next week for another installment and see another facet of Snowflake’s power as a cloud data platform.