This series takes you from zero to hero with the latest and greatest cloud data warehousing platform, Snowflake.
A critical part of any data warehouse is how new data is brought in and prepared for consumption. Depending on the order of events, this process is known either as ETL or ELT. So, what do these acronyms mean?
ETL is the process of extracting data from source system(s), transforming it and loading it into the data warehouse. Meanwhile, ELT swaps the latter two, loading the data into the data warehouse before performing any transformations.
Extract the data from the source system. This could be anything from turning an Excel file into a CSV file to a lengthy process extracting all data from a legacy system into a series of compressed files.
Transform the data into the desired structure for the data warehouse. This includes cleansing data, joining data sources, pivoting and any other modifications.
Load the data into the data warehouse.
A Basic Use Case
Let’s consider a very basic example. I have a spreadsheet with 1 million rows of data spanning 200 columns, and I want to load it into a data warehouse. However, I am only interested in 20 columns and 100,000 specific rows of data.
In an ETL approach, I would remove the undesired columns from my Excel file and filter down to my desired 100,00 rows before saving the file as a CSV and loading it into a data warehouse.
In an ELT approach, I would save the file as a CSV and load it into a data warehouse first. I would then create another table in my data warehouse and insert only the columns and rows I desire.
Most legacy systems opt for the ETL approach, and there are many advantages to doing so. Before I continue, this is not an exhaustive list of advantages, and people will always have their own preferences and experiences in either direction. The following is a list of three advantages that I have experienced personally with ETL:
- Transforming the data before loading it allows for optimisation of the data load by cleansing the data and transforming it into a more accessible format.
- The person performing the ETL may not be familiar with the new data warehouse and can apply their expertise by transforming the data before touching the new system, including the use of any other tools with which they are already familiar.
- Resources in the new data warehouse are not spent on unrequired storage or strenuous transformations, which can be critical if the data warehouse has limited resources, or the resources are needed for another function, such as business intelligence.
With the recent advances in technology and the shift to cloud-based data warehousing, ELT approaches have become far more common and advantageous. This is mainly due to the ability of cloud-based data warehouses to dynamically scale the available resources and separate them completely from other intended purposes. This vastly diminishes the resource concerns for both storing the raw data and transforming it into a preferred structure.
Consider how Snowflake separates Storage from Compute and allows unlimited amounts of each. There is very little downside to storing raw data within Snowflake and leveraging Snowflake’s compute power to transform that raw data into a preferred structure. The storage costs are low, and storage does not compete with other purposes. Meanwhile, the compute resource can turn on/off and scale up/down automatically to execute transformations efficiently and keep costs low. This is especially cost effective when you consider the amount of resources required to transform the data in a legacy system before loading it in.
So, Which Order Should You Go in?
This question will always lean toward the side of user preference. Many people feel strongly one way or the other, and both can be effective tools to ingest data into a data warehouse. Personally, I would recommend ELT if using cloud-based data warehouses such as Snowflake. For me, the benefits far outweigh the drawbacks, and cloud-based transformation tools like FiveTran and Matillion make the process even easier to automate and explore.