This blog post is Human-Centered Content: Written by humans for humans.
The nature of our work involves spending time with analytics leaders and technology buyers across a diverse range of organisations with varying degrees of data and analytics maturity.
This gives us a nuanced view of how data architecture patterns are evolving in the real world – often in stark contrast to vendor-led messaging and LinkedIn “thought leadership” pieces we’re all bombarded with on a daily basis. So, what do we think’s happening out there?
It’s definitely the case that some of the classic assumptions about the modern data stack are quietly fading into obscurity. As an example, we’ve spent years telling customers to centralise their data, but now the same vendors are happy to sell you tools (or at least provide new features) to query external data that never needs to move. Here’s a breakdown of some evolutionary patterns we’ve seen in the field.
The Importance of Meaning
Often misunderstood, complex to position and hard to implement. Technically, it’s easy to define what a semantic layer is: A shared, governed definition of business terms (metrics, dimensions, hierarchies) that sits between raw data and end users. What we tend to ignore is the fact that we’ve been dealing with semantics since the dawn of analytics – SQL aliases to lend meaning to column names, dashboard titles that explain what’s being visualised – they’re all semantic functions: Ways to bridge the gap between how data is stored and how it should be understood.
So what’s changing? Vendors have jumped into the world of semantic layers to provide tools and features that centralise semantic definitions in one place – solving for the problem of semantic inconsistency, where the same underlying data means different things to different people, in different reports, at different times. Vendors like Looker, dbt, AtScale, Cube, Omni and Sigma all have products or features in this space, each with a different philosophy about how and where the semantic layer should be defined and stored – and AI is making semantic layers much faster to build.
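To make the “one governed definition, many consumers” idea concrete, here’s a minimal sketch. The `Metric` structure, table and column names are all invented for illustration – real semantic layers such as Looker’s LookML or dbt’s semantic layer express the same concept in their own syntax.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str            # the business term, e.g. "net_revenue"
    expression: str      # the single agreed-upon calculation
    table: str           # governed source table
    filters: tuple = ()  # baked-in business rules

    def to_sql(self, group_by: str = "") -> str:
        """Render the metric as SQL so every consumer computes it identically."""
        sql = f"SELECT {self.expression} AS {self.name} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if group_by:
            sql = sql.replace("SELECT ", f"SELECT {group_by}, ", 1)
            sql += f" GROUP BY {group_by}"
        return sql

# One governed definition...
net_revenue = Metric(
    name="net_revenue",
    expression="SUM(amount - refunds)",
    table="fct_orders",
    filters=("status = 'complete'",),
)

# ...rendered consistently for two different consumers: no drift.
print(net_revenue.to_sql())
print(net_revenue.to_sql(group_by="region"))
```

The point isn’t the (trivial) SQL generation – it’s that the calculation and its business rules live in exactly one governed place, so a dashboard and an ad-hoc query can’t silently disagree.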
LLMs (Large Language Models) can now scan table schemas, column names, sample values and query history to propose descriptions, tags and relationships automatically, which can all feed into the curation of a semantic layer. As organisations move towards data mesh and data product thinking, AI-generated metadata is what makes those products findable without a team of data stewards manually cataloguing everything.
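A toy stand-in for that AI-driven metadata pass: scan a column name plus sample values and propose a description and tags. In a real pipeline this context would be handed to an LLM; simple heuristics play that role here, and every column name below is invented for illustration.

```python
def propose_metadata(column: str, samples: list) -> dict:
    """Propose a description and tags for a column from its name and sample values."""
    tags, description = [], f"Column '{column}' (no pattern matched)"
    if column.endswith("_id"):
        tags.append("identifier")
        description = f"Identifier for {column[:-3]} records"
    elif column.endswith("_at") or column.endswith("_date"):
        tags.append("timestamp")
        description = f"Event time for {column.rsplit('_', 1)[0]}"
    if any("@" in str(s) for s in samples):
        tags.append("pii")  # flag for a human steward to review
    return {"column": column, "description": description, "tags": tags}

print(propose_metadata("customer_id", [101, 102]))
print(propose_metadata("contact_email", ["a@example.com"]))
```

Even in this toy form, the shape of the workflow holds: machine-generated proposals do the bulk cataloguing, and human stewards review the flagged cases rather than describing every column by hand.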
Nonetheless, the familiar challenges with semantic layers still exist. Definitions still need to be signed off, and ongoing effort is required to manage the human component: managing “drift” of semantic definitions, keeping teams invested in maintenance, and socialising the downstream tools that consume the definitions you’ve strived to create.
Querying data where it lives
“Load it into the warehouse first” has always been a pragmatic default for many data engineering projects, but these days it’s an assumption that’s worth challenging. Every step of a traditional pipeline costs time and money in some way, and although it made sense to follow this pattern when object stores were hard to query and cloud data warehouses (CDWs) were fast, that trade-off is now much less clear-cut.
So what’s changing? Open table formats have turned object stores into queryable databases, removing the need to load the data into a CDW in the first place. For non-technical folks, this means that users (and applications) can use standard SQL to query data held in AWS S3, GCS or Azure storage. It might be a little slower, but you don’t have to bear the cost and effort of bringing that data into your CDW.
As a practical example, imagine you have five years of very granular event-log data in S3. Rather than loading billions of rows into Snowflake (and paying the compute cost), you could define an Iceberg table “on top” of it and query it directly.
This can reposition the purpose of your CDW to an extent, making it the right tool for hot, frequently joined, performance-sensitive data, with the object store handling the cold, larger-volume data that doesn’t need to be queried as often.
However, it’s not all straightforward – governance and access control become harder to federate as data access moves outside your CDW environment, and it becomes more important to make sound decisions about what belongs in the warehouse and what doesn’t.
Compute Cost Discipline and Flying Blind
Consumption-based pricing is here to stay, and CDW costs are real, visible and frequently out of control. They can also be wildly unpredictable without active management, and the growth of AI use cases will not make things easier.
So what’s changing? More than ever before, CDW vendors have a vested interest in pushing your compute costs up. Default configurations for things like warehouse sizing and auto-suspend are rarely optimal. Your query patterns (from users and downstream applications) are likely to be un-optimised and potentially unmonitored. In addition, general best practices for data management (data retention, management of ingress and egress processes, etc.) still apply to SaaS CDWs in the same way they do for legacy on-premises environments – a fact which is often ignored when organisations make the jump to “fully managed” CDW platforms.
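A back-of-envelope sketch of how sizing and auto-suspend defaults drive spend, assuming Snowflake-style pricing where an XS warehouse burns one credit per hour and each size up doubles it. The credit price and workload numbers below are invented for illustration, not real rates.

```python
SIZES = ["XS", "S", "M", "L", "XL"]

def credits_per_hour(size: str) -> int:
    # XS = 1, S = 2, M = 4, ... (doubling per size)
    return 2 ** SIZES.index(size)

def daily_cost(size: str, busy_hours: float, resumes_per_day: int,
               auto_suspend_minutes: float, credit_price: float = 3.0) -> float:
    """Busy time plus the idle tail billed after each burst of activity,
    before auto-suspend kicks in."""
    idle_hours = resumes_per_day * auto_suspend_minutes / 60
    return (busy_hours + idle_hours) * credits_per_hour(size) * credit_price

# Same two hours of real work per day; only the suspend setting differs.
lazy = daily_cost("M", busy_hours=2, resumes_per_day=24, auto_suspend_minutes=10)
tight = daily_cost("M", busy_hours=2, resumes_per_day=24, auto_suspend_minutes=1)
print(f"10-minute suspend: ${lazy:.2f}/day, 1-minute suspend: ${tight:.2f}/day")
```

With bursty usage, the idle tail after each resume can dwarf the cost of the actual work – which is exactly why a default-ish suspend setting left untouched is such a common source of surprise bills.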
In addition, many CDW platforms like Snowflake are “repatriating” workloads with new platform features that fulfil the purpose of other tools in your stack. This can be a great pattern for consolidation – being able to use Snowflake Openflow to ingest files into the CDW rather than Fivetran, for example – but it pushes your compute cost up even more.
In all cases, what this means is that architectural decisions must be “cost aware.” For example, the choice between loading data into a CDW versus querying it in an object store is a cost question as much as it is an architectural question. Teams must build cost visibility into their data platforms and must be able to justify investment in the core platforms and features that drive compute cost.
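The load-versus-query-in-place choice can be framed as a simple break-even calculation. Every rate below is an invented placeholder, not real vendor pricing – the point is the shape of the trade-off, so plug in your own numbers.

```python
def monthly_cost_loaded(tb: float, queries: int,
                        cdw_storage_per_tb: float = 40.0,
                        pipeline_cost: float = 200.0,
                        query_cost: float = 1.0) -> float:
    # Pay warehouse storage plus the ingestion pipeline; queries are cheap once loaded.
    return tb * cdw_storage_per_tb + pipeline_cost + queries * query_cost

def monthly_cost_in_place(tb: float, queries: int,
                          object_storage_per_tb: float = 23.0,
                          query_cost: float = 4.0) -> float:
    # Keep cheap object storage; pay more compute per external-table query.
    return tb * object_storage_per_tb + queries * query_cost

tb = 50  # e.g. 50 TB of cold event logs
for queries in (10, 500):
    cheaper = ("load it" if monthly_cost_loaded(tb, queries) < monthly_cost_in_place(tb, queries)
               else "query in place")
    print(f"{queries} queries/month -> {cheaper}")
```

The intuition falls out immediately: rarely touched data favours the object store, heavily queried data earns its place in the warehouse, and the crossover point is something you can actually calculate rather than guess.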
Conclusion
There is no single “correct” data architecture pattern or stack, and your organisation’s maturity, capability and appetite for impactful analytics will all inform the right mix of tools and technology. However, the evolutionary changes we’ve covered mean there are several new principles to consider:
- Treat storage and compute as separate functions when making architectural decisions. The object-store-query pattern we’ve talked about makes this possible now, but most teams still tend to conflate them.
- Semantic layers, if you need them, should be considered as infrastructure, not a “project.” They need ownership, versioning, testing and buy-in from the entire organisation to deliver real value.
- Comprehensive cost visibility is an absolute prerequisite when implementing any new platform, especially in conjunction with point 1.
- Don’t mistake complexity for maturity. The team with a well-governed, cost-efficient two-layer architecture (object store and CDW, connected by a solid semantic layer, for example) will outperform the team with a six-layer stack and no clear data ownership.
This space has never evolved faster. Whilst traditional usage of CDWs continues to offer performance and ease-of-use, our considered opinion is that the value of centralised CDWs like Snowflake is shifting from “we store your data for analytics” to “we provide the governance layer, AI tooling and compute engine for your data.”
The decisions you make about these new patterns will shape the architecture that’s right for your organisation – and that’s where we can help.
