Disclaimer: The author is solely responsible for views and opinions in this article.
Snowflake Summit 2022 was a success. Enough has already been written about what came out of it and the exciting new announcements, so I will not be repeating any of that. Instead, this is just my musing about how I perceive the future of data processing on Snowflake.
In the summit keynote, Benoit Dageville laid out the seven pillars of the Data Cloud foundation, and two of them kept ringing in my mind … all data, all workloads.
Fuller Data Lake Support
Snowflake already supported semi-structured data like JSON and XML natively; in fact, that has been one of its USPs for quite some time. The launch of support for unstructured data like images and blobs in September 2021 was exciting but still limited in functionality: only referencing and sharing links to the unstructured files was possible. Such data wasn't really processable directly in Snowflake.
The case with external tables was similar. Support for external tables goes back to 2019, and various file formats are supported, including popular ones like Apache Parquet. But external tables are inherently limited: they are read-only, and DML operations are not supported. Besides, with regular external tables the table metadata is stored in Snowflake, fragmenting the view of the data; the governance and insight gained on those tables stays locked inside Snowflake.
Snowflake's commitment to data lake workloads is serious, though. That first showed in the announcement of support for the Delta Lake table format in February 2022[1], and they topped it off with the announcement of support for the Apache Iceberg format[2]. I will not go into the battle of Delta Lake vs. Iceberg, but supporting two of the three important open table formats (the third being Apache Hudi) for data lake workloads opens some big doors. The possibilities are countless: proper ACID transactions, DML operations, schema evolution and time travel, all while keeping the data and metadata entirely external to Snowflake. Vendor neutrality will no longer be only a promise, and data mesh will no longer be only an enterprise architecture pattern.
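To make the contrast concrete, here is a minimal sketch using Snowpark for Python of the pre-Iceberg pattern: a read-only external table over Parquet files in cloud storage. The stage, integration and table names are hypothetical, and the connection parameters are placeholders.

```python
# Minimal sketch, assuming hypothetical names: a read-only external table
# over Parquet files in a cloud bucket, queried through Snowpark for Python.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# External stage pointing at the data lake bucket (names are illustrative).
session.sql("""
    CREATE STAGE IF NOT EXISTS lake_stage
      URL = 's3://my-data-lake/events/'
      STORAGE_INTEGRATION = my_s3_integration
""").collect()

# External table over the Parquet files: queries work, DML does not.
session.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_ext
      WITH LOCATION = @lake_stage
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = TRUE
""").collect()

# Read-only access; writes, schema evolution and time travel on lake data
# are what the new Iceberg/Delta table support is meant to unlock.
session.table("events_ext").limit(10).show()
```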
Native Unstructured Data Processing and Enhanced Data Streaming
The excitement doesn't stop there. With Snowpark support for Java and now Python, one can not only store unstructured data but also process it natively inside Snowflake. I can already imagine machine learning workflows built on the data lake and processed purely with Snowflake virtual warehouses. And there is no need to worry about productionizing your models: you can deploy them behind any cloud API gateway of your choice and use external functions to call their results and predictions right from within your SQL queries. This means your machine learning applications require only one familiar backend and processing engine: Snowflake.
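As a small illustration of what processing natively inside Snowflake can look like, here is a sketch of a Snowpark Python UDF that scores rows inside a virtual warehouse. The table, column and function names are hypothetical, and the toy formula stands in for a real model that you would normally load from a stage.

```python
# Minimal sketch, with hypothetical names: a Python UDF registered via
# Snowpark that runs inside a Snowflake virtual warehouse and can be
# called from DataFrames or plain SQL.
import math

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import FloatType

connection_parameters = {"account": "<account_identifier>", "user": "<user>",
                         "password": "<password>", "warehouse": "<warehouse>",
                         "database": "<database>", "schema": "<schema>"}
session = Session.builder.configs(connection_parameters).create()

def churn_score(monthly_spend: float, tenure_months: float) -> float:
    # Stand-in for a real model; in practice you would load a trained
    # artifact (e.g. from a stage) and call its predict method.
    return 1.0 / (1.0 + math.exp(0.05 * tenure_months - 0.01 * monthly_spend))

# Register the function as a temporary UDF for this session.
churn_score_udf = session.udf.register(
    churn_score,
    name="churn_score",
    return_type=FloatType(),
    input_types=[FloatType(), FloatType()],
    replace=True,
)

# Use it from a Snowpark DataFrame; it is equally callable from SQL:
#   SELECT churn_score(monthly_spend, tenure_months) FROM customers;
scores = session.table("customers").select(
    churn_score_udf(col("monthly_spend"), col("tenure_months")).alias("score")
)
scores.show()
```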
Discussion of "all data" cannot be complete without streaming data, and Snowflake knows it. Streaming capabilities in Snowflake are even better now: the improved serverless streaming framework reduces latency by a factor of 10 and makes data readily available not only to Snowpipe but also to connectors built on top of it, like the Snowflake Kafka connector.
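On the Kafka side the change is mostly configuration. Below is a sketch of submitting a Snowflake sink connector to a Kafka Connect cluster with the streaming ingestion path enabled; the property names and values are my assumptions based on the Snowflake Kafka connector documentation, so verify them against the docs before relying on them.

```python
# Sketch only: registering a Snowflake sink connector with a Kafka Connect
# cluster. Property names/values are assumptions drawn from the Snowflake
# Kafka connector documentation; verify before use.
import json
import urllib.request

connector = {
    "name": "snowflake-events-sink",
    "config": {
        "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
        "topics": "events",
        "snowflake.url.name": "<account>.snowflakecomputing.com:443",
        "snowflake.user.name": "<user>",
        "snowflake.private.key": "<private-key>",
        "snowflake.database.name": "RAW",
        "snowflake.schema.name": "KAFKA",
        "snowflake.topic2table.map": "events:EVENTS_RAW",
        # Opt into the lower-latency streaming ingestion path instead of
        # classic file-based Snowpipe loads (assumed setting).
        "snowflake.ingestion.method": "SNOWPIPE_STREAMING",
    },
}

request = urllib.request.Request(
    "http://localhost:8083/connectors",  # Kafka Connect REST API endpoint
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.read().decode())
```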
Support for All Data Workloads with Unistore
Now, we've really talked about "all data," and we've talked about almost all workloads. Why almost all? Is it not enough to support data analytics, data science, data lake, data sharing and the marketplace all on a single platform? Well, we all know the core workload of any data platform: online transactional workloads, data backends supporting hundreds of thousands of simultaneous transactions every second. Transactions with write (DML) operations, not just queries. And the holy grail of data platforms has always been to support all these workloads together as one system: one homogeneous system with no patchwork of modules and a single set of operational semantics.
Until now, no system has been able to support transactional, analytical and data lake workloads together, let alone machine learning (which is, anyway, late to the party). There have been numerous combinations (for example, OLTP and OLAP together, or analytical and data lake together), but no platform has tried to solve all three use cases by bringing all the facets of data and workloads together, and certainly no cloud-based data platform has tried. Until now.
And this was the pinnacle of the announcements at Snowflake Summit 2022. On 14 June 2022, Snowflake announced Unistore, a hybrid workload that supports transactional and analytical use cases together. Snowflake is taking the daring approach of a row-based storage engine, intelligently mated with its traditional columnar store, to implement Unistore. All of this with no strings attached! You won't have to keep a copy of your table to support online transactions while using the same table as the backend for your analytical dashboards. You won't have to keep converting data from 3NF to star-schema formats. And you won't have to archive historical data into some other table just to keep the performance of your transactional data app palatable. Have you missed enforced constraints (primary and foreign keys) on Snowflake? No longer! Combine that with the SQL API and the shiny new native application framework, and you have something very powerful. If that's not "all workloads," then what is?
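Since Unistore is still in preview, the exact syntax may change; the sketch below (hypothetical names, syntax as shown around the announcement) only illustrates the idea: one hybrid table that takes transactional writes with an enforced primary key and serves analytical reads, with no copy in between.

```python
# Illustrative sketch of the Unistore idea via Snowpark for Python.
# CREATE HYBRID TABLE reflects the syntax shown around the announcement and
# may differ in your account; table and column names are hypothetical.
from snowflake.snowpark import Session

connection_parameters = {"account": "<account_identifier>", "user": "<user>",
                         "password": "<password>", "warehouse": "<warehouse>",
                         "database": "<database>", "schema": "<schema>"}
session = Session.builder.configs(connection_parameters).create()

# Row-oriented hybrid table with an *enforced* primary key.
session.sql("""
    CREATE HYBRID TABLE orders (
        order_id     INT PRIMARY KEY,
        customer_id  INT NOT NULL,
        amount       NUMBER(10, 2),
        created_at   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
    )
""").collect()

# Transactional, single-row writes ...
session.sql(
    "INSERT INTO orders (order_id, customer_id, amount) VALUES (1, 42, 99.90)"
).collect()

# ... and analytical reads against the very same table: no copies,
# no 3NF-to-star-schema conversion, no archival table.
session.sql("""
    SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer_id
""").show()
```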
More Than a “Cloud Data Warehouse”
Even after many advanced capabilities like data sharing, governance and the recent data lake advancements, some people still refer to Snowflake as a "cloud data warehouse." That is unfair, because the words "data cloud" are no longer mere jargon; they have become a reality: a place where you can do anything and everything with your data, right from gathering it to monetizing it.
The future of the “Data Cloud” is bright, very bright.
[1] Some would argue that support for the Hive metastore was, in fact, the first step towards wider adoption of data lake architecture, but in the age of cloud data platforms Hive serves a very limited, legacy purpose.
[2] They announced it during the summit, but private previews had started long before that.