Data Forum: What’s Next for Cloud Data Platforms?

Transcript
Welcome, everybody. Hopefully, the audio is still here, and everybody who's going to show up is able to hear us during the conversation. I'm James Wright. I run strategy here at InterWorks, and I'll be your host today for our second installment of what we're calling the Data Forum, which is a chance to have some of our experts around the world here at InterWorks, some friends from the industry, and many of our clients join in a conversation about the evolving landscape of data and analytics. The QR code on the screen right now will take you to the landing page for our events so you can go back and listen to previous episodes, including the last episode, where we discussed analytics platform development, specifically in that case, developments from Microsoft's conference as well as Tableau and ThoughtSpot. Today, we're having a conversation just a few weeks after major conferences for Snowflake and Databricks, held at the same time, and we'll be mostly focusing on that content as a bellwether for the broader data platform landscape. Our next episode is going to be August 3rd, where we're going to have a conversation specifically about the retail, restaurant, and consumer goods space, and we'll be bringing in some friends from the industry to talk about what's happening there—sort of a practical application, if you will, of modern analytics. This will be our agenda for today. We'll broadly try to move through these topics going from most general—what is the modern data landscape?—all the way down into maybe the most predictive, no pun intended, as we look at the space of generative AI and LLMs in the broader ecosystem. As ground rules for today's conversation, we're going to have roughly thirty-five to forty minutes of conversation. We will have a few polls where we'd like to engage you and answer some questions, and we'll talk about the results and look at whether this audience here, we think, is indicative of the broader audience out there in the world. 
Please, if you have questions along the way, put them in the Q&A function of Zoom. It'll give us a chance to make sure we don't miss them as the chat scrolls through during the conversation. And we will have a reserved Q&A component in the last section of this call. We'll then move on to a brief survey. We'd love you to give us feedback on this format—did it work? Did you enjoy it? What would you like to see in future versions of this conversation? And largely, there are some slides, but they're going to be very simple, really just helping us walk through the narrative we want to stick to. In terms of the folks here today, I'm going to introduce the panel and then we'll get right into the conversation, just to give us the most time to have an open discussion. Joining me today from just outside London in the UK, Chris Hastie is a data lead here in the InterWorks practice and a superhero in the Snowflake space; certainly, I think we'd say, one of the foremost experts there. Matt Woods is also a lead in our data practice and manages a big chunk of our services business here in the States. Matt has been doing data engineering and architecture for well over a decade. He had a chance to attend the Databricks Summit and spend quite a lot of time with them and in that ecosystem, where he does quite a lot of work, and so I think he'll represent some of the current news from there. And then there's Ben Bausili joining us. He was actually on our last panel. I think there are some really interesting connections between our analytics conversations and these data platform-focused conversations. And specifically, Ben runs product here, so he thinks a lot about generative AI, LLMs, and how that interacts with what we might call the traditional version of modern analytics—the reporting and BI space. So that's our panel today. And I'd like to start out by posing a question, essentially, to the panel. The Lakehouse Basics is the section we're in here. 
And as we look at the evolution of data warehousing and the data platform, along the way we've had the warehouse, and then we had the lake, right? So we had this NoSQL, large-scale conversation happening maybe five to eight years ago. More recently, we've evolved this concept of a lakehouse. If I were going to give you my simplest definition—we'll certainly ask Chris to give you a more detailed one in a minute—I'd probably tell you that every version, or almost every version, of the modern data stack involves some version of the lakehouse. In other words, you have files that sit in cold storage, likely in an object storage bucket, something like S3 in the Amazon space, for example. Then some, but certainly not all, of that can be staged for warm analytics, so we can look at it in a table-like format, and some of it can be moved all the way into SQL and analyzed in a traditional modern warehouse. It feels to me like that's the pattern these platforms are all converging on, whether we talk about Snowflake, Databricks, Azure, BigQuery, Redshift, or whatever it may be. But I'll ask the experts. I mean, Chris, what do you think of that? And maybe even before you talk about whether I'm right or wrong, talk about the lakehouse and the effect you've seen it have for your clients in terms of how they approach the fundamental architecture of their data space. Yeah, absolutely. So the main way lakehouses have really evolved the space is, as you say, that data warehouses and data lakes have come together into one more fluid platform. 
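To make that layering concrete, here is a toy Python sketch of the lakehouse pattern: a temporary directory stands in for an S3 bucket (the cold file layer), and an in-memory SQLite database stands in for the warehouse engine. All table and field names are invented for illustration.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# --- Cold layer: raw files land in object storage (a temp dir stands in for S3) ---
bucket = Path(tempfile.mkdtemp())
events = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 42.0},
]
for e in events:
    (bucket / f"order_{e['order_id']}.json").write_text(json.dumps(e))

# --- Warm/SQL layer: stage the files into a queryable table ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, region TEXT, amount REAL)")
for path in bucket.glob("*.json"):
    rec = json.loads(path.read_text())
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                 (rec["order_id"], rec["region"], rec["amount"]))

# Traditional warehouse-style SQL over data that still lives as files in the bucket
total_eu = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE region = 'EU'"
).fetchone()[0]
print(total_eu)  # 162.0
```

In a real lakehouse, the staging step is handled by the platform itself (external tables, auto-ingest, and so on) rather than hand-rolled loops, but the layering is the same: files stay cheap at rest, and only what you need is promoted into SQL-queryable form.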
In the old days—and I say old, it's not that long ago—we would mainly see these instances of a lake, loads of files in a place; someone would ingest those into their data warehouse, do everything they wanted within the data warehouse, and then some of them might choose to output the results, or basically copies of those tables, back out to their lake again and use that as their way of organizing each of their different data stores. We have seen a lot of this shift, with everybody coming together and effectively using those files on cloud storage as the underlying platform that supports the databases and data warehouses in the first place. That has really shifted the market because it has enabled a completely different level of decoupling of storage and compute. That's definitely something that Snowflake likes to talk about a lot. They've done the lakehouse approach for a long time, just without really shouting about it in the same way. Most interestingly, all of their micro-partitions, as they call them, are kept in a proprietary format. They're not really exposed so that a user can go and access the bucket directly, but they are there. And that decoupling, we've seen that, as you say, across pretty much all of the major players in this space these days, and it allows them to, firstly, save on costs and, secondly, remove that performance restriction of having to move data and form it and shape it a different way. They can act on it directly where it lands. And that also unlocks this whole other world of being able to work with different data formats. 
The biggest one these days is this new push for unstructured data: this ability to say, okay, well, my data could be, for example, a list of pictures of different medical scans. The data is just loads of different pictures, but we can directly refer to those particular points of data, and we can leverage them without having to figure out a way to ingest them into a structured format that they're not really suited for. That's a really great point, Chris, particularly when you look at, again, the combination of well-structured data with the unstructured. Jenny, as the host, can you trigger our first poll, by the way? Because I'm really curious to understand, for the folks here, how practically relevant this conversation is. Right? I mean, are you in a lakehouse setup today, or have you thought about it, or are you still unsure if you should be? On that last point, are we unsure if we should be? I mean, Matt, again, you've worked in this space for a long time. Is there a compelling argument for some of the folks who might say, look, we're in SQL Server still. We have been, and it's worked for us for ten years. Should we be thinking about moving into what we call this more modern architecture? I mean, what's your thought there? Is there a compelling answer? Yeah, so we actually get that question frequently from our clients. And I often revert to a comparison between a Formula One car and a city bus. They're both moving people from point A to point B and back, but you aren't going to win a Formula One race with a city bus. Conversely, if you want to move fifty people from point A to point B, it's going to take you a while with the Formula One car, and the city bus will win that race. That's basically the same distinction between transactional and analytics platforms. SQL Server is a very capable transactional platform and very good at moving small amounts of data very, very quickly. 
And the analytics platforms are going to be much better at moving large volumes of data in a reasonable timeframe, particularly with wider data sets. So in a real use case, your SQL Server is going to be great at handling small transactions, like credit card transactions. You don't want your analytics platform doing that; it's a small amount of information moving very quickly. An analytics use case, by contrast, would be reporting and doing the aggregates, like year-over-year sales changes. You're not going to want to use SQL Server for that. It's not well-suited for the larger bulk transactions and aggregations that we use for most of analytics. So that's really where we end up with the analogy, for the most part. Yeah, I think that's really helpful. And certainly, it's felt to me over the years that if we look at this and simply say, the warehouse is a place I can send a SQL query to and get a result, then sure, I think we look at these two things being at parity in that simple comparison. But looking at extensibility, it really feels to me like there is a huge amount more we can do with the lakehouse approach in the cloud that wouldn't even have been in the realm of possibility in the previous landscape. You described a few, going back to Chris's point: combining, you know, image scans with transactional records. That, to me, is the interesting piece of why we're having this conversation at all. I think another thing that I'd add to that actually, James, is that with the approach of a lakehouse and your data effectively being stored as files in blob storage, you also remove a lot of your reliance on a particular platform, because if at some point you do wish to leverage a different platform or a different solution, all of the data is already there as files. You don't need to go through a whole migration phase. You just point a different product towards those files. 
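Matt's Formula One versus city bus analogy shows up directly in query shape. A small sketch using SQLite, purely to illustrate the two shapes of workload rather than to benchmark any platform (the table and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, year INT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(i, 2022 + (i % 2), float(i)) for i in range(1, 1001)])

# OLTP shape: a narrow point lookup touching one row (the "Formula One" workload)
row = conn.execute("SELECT amount FROM sales WHERE id = 42").fetchone()

# OLAP shape: a scan-and-aggregate touching every row (the "city bus" workload)
yoy = conn.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()
print(row, yoy)
```

Transactional engines are tuned for millions of the first kind of query; analytics platforms are tuned for the second, where the cost is dominated by how much data must be scanned rather than how fast a single row can be found.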
Yeah, Chris, I mean, I know you do, and I certainly know most of us at InterWorks have clients who are on a multi-cloud strategy, right? I mean, they may have their data lakes or lakehouse spread out across Azure and AWS, for argument's sake. And so they're even looking at—we don't have an appliance, certainly, but we don't even have lock-in in terms of a single place our compute happens around the world. And so your point does give a huge amount of flexibility there. I'm just going to share the results of the poll back out to the group in terms of what everyone here answered, and I think this is not altogether surprising. It's a bit surprising that most folks here are not running analytics in the cloud yet. And certainly, I hope the rest of this conversation is very compelling in terms of the argument of should I and where should I. In terms of the distribution of these toolsets, this is what we see almost every time we ask this question: Snowflake has the lion's share of, let's say, the new cloud workloads, and SQL Server isn't going anywhere. And so, you know, I don't think this is an either-or scenario, right? I mean, I very much suspect that a lot of you who answered "I have Snowflake" maybe also answered "I have SQL Server," because we're certainly seeing this isn't necessarily a zero-sum game. At least it doesn't start out that way. And to Chris's point, migration can be gradual, which is nice. I'll move on because I know we have a limited amount of time today. And I'd like to set up the conversation around our conference attendees in the last few weeks by talking a little bit about what feel like the two most significant cloud data platforms today. 
But before we get into details of Snowflake and Databricks, it feels like we should address the market as a whole to an extent, because we're not going to have time to talk in detail about what's happening with BigQuery, what's happening with Redshift, or where Microsoft is going. But I suppose I'll open it up to the team here. Those don't feel like completely different stories than what we're going to hear from someone like Snowflake. And it feels to me like for most of you out there who don't have any of these today, you have a fundamental choice. A, what language does most of my team speak? And if the answer is not SQL, then Databricks feels like it stands out. B, how much do I need to completely control every aspect of how performance and cost work in my database? And the more you say, look, I don't want that, let the platform manage it, the more you gravitate toward the easier choices like Snowflake. And maybe the third question there is how much does my spend have to go to the hyperscaler I already have in-house, right? I buy everything else from Microsoft, so I'm going to buy this from Microsoft. I buy everything else from Amazon, so I'm going to buy this from Amazon. It really feels to me like, to an extent, we're asking those questions more about institutional behavior than about feature comparison of the toolsets. But maybe, Matt, can you comment on that? I mean, am I wrong there? And would you say that as you look at advising our clients, you see massive differences across the spectrum, and this isn't as simple as breaking it down by language or by desirability of control? So I think the language definitely plays a part in some instances. I think, you know, my take on why we're talking about Snowflake and Databricks today, as opposed to BigQuery or Redshift or Synapse, is that all of those players are capable choices in the cloud data platform world, and that part of the workload has become somewhat commoditized. 
So, you know, there aren't going to be massive differences in what you're back-ending your cloud data warehouse with. What we're seeing with Snowflake and Databricks is really a feature parity war, where they are each trying to compete and cover all the bases the other one covers at this point, and also trying to create new functionality that's going to really make them the better choice, or the shinier option, for a prospective client. And because of that battle that's going on, it's driving innovation at a wicked pace for those two in particular. So we're seeing them come out with just amazing new features that are very exciting to talk about. It doesn't mean that the others in the space are not valid options. And really, it's going to depend on your use case and, to some extent, where you are currently. Yeah. Chris, I know you're very close to the Snowflake ecosystem and work a bit in Databricks as well. I mean, it certainly felt like this was originally a Spark versus SQL conversation. Is that what it is today, or have we evolved past that? Yeah, I think we have. We've moved past it, but that's certainly where it started. There used to be this very easy distinction: if you want to do Python and Spark, you're probably best off with Databricks. If you are really focused on streaming specifically, then Databricks again would probably be a good fit for you. Whereas if you have a load of people who are much more comfortable in the SQL world, or maybe moving from, you know, the old world of data warehousing, then Snowflake used to be a very easy choice. And now, as Matt was alluding to a minute ago, the two have slowly crept towards each other, and at some point they've effectively just crossed over, and now the feature parity between the two is very close. 
And in a way, I think that's quite exciting because, as I was talking about earlier, with the whole lakehouse approach—being able to point multiple technologies at the same set of files—if you also know that those technologies are becoming quite similar in their actual functionality, then it removes some of the stress of picking a specific tool, because you know that they are both strong players and they are both working to improve. And usually, if one tool has something that the other one doesn't, that's not going to last for particularly long. It also forces all of the other tools out there, all of the third-party ingestion tools, transformation tools, governance and catalog tools, to learn how to work with both, because they recognize that both are such major players. So again, you've got this parity in terms of the surrounding technology you wish to leverage. Well, let's understand then, if we look at that feature parity and that shared foundation: Snowflake started at SQL and created things like Snowpark, right, to bring in these Python workloads, and then we see Databricks starting at Spark and now building Photon and Delta Lake to serve SQL workloads. Let's talk about what we heard at these two conferences in terms of where the points of differentiation are, either in language or in practicality, and what we think about them. So maybe let's start with Databricks. Matt, tell us about what you found, what you heard that was new and interesting at Databricks. And meanwhile, just for our hosts, you might bring me up the second poll. It'll give us a chance to be prepared for the next conversation, which is going to be platform extensibility. But Matt, tell us about Databricks. What did you hear there? Yeah. Number one, it was a great conference. Ten thousand people in person, three hundred breakout sessions. There is a lot of information coming out of that and a lot to process. 
But some of the things that were really exciting in my mind: they announced what they're calling Lakehouse Federation, which is basically their new governance, lineage, and tracking based on Unity Catalog. Some interesting features are coming out of this. Number one, the ability to query external data sources and include those sources in end-to-end column-level lineage when you're using Databricks workflows. Those sources include almost every major data platform, including Snowflake, MySQL, Azure SQL, Postgres, BigQuery, and Redshift. So almost anything, they're able to actually pull data from those through Databricks itself. That provides some very interesting possibilities from a data mesh architecture standpoint, where you aren't necessarily locked into any single platform. You can query from a variety of sources, and if you've got use cases in certain areas of your business where one tool makes sense over the other, you don't have to choose one or the other and sacrifice functionality in that space for the sake of your overall decision. What they're really saying, though, is that they want you to unify your data and AI governance within Databricks itself. They are basically planning to be able to not only query data from other platforms, but also to push policy down to those platforms as well. So you truly could have your unified governance in one place for both data and AI and enforce that throughout your organization, even if parts of that organization are using other data platforms. Pretty interesting approach there and exciting developments all around. Of course, end-to-end column-level lineage is kind of a holy grail from a data perspective. But this opens up a lot of possibilities that are pretty interesting. The inclusion of AI models there is also a big deal, and that takes me to my second big one. 
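As a rough mental model of federation, the sketch below uses SQLite's ATTACH to run one query across two separate databases standing in for two external platforms: the engine federates the query instead of copying the data first. The source names and schemas here are invented purely for illustration.

```python
import sqlite3

# Two "platforms": in a real setup these might be, say, Postgres and Snowflake;
# here, two separate in-memory SQLite databases stand in for them.
conn = sqlite3.connect(":memory:")  # the federating engine
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INT, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INT, total REAL)")
conn.execute("INSERT INTO crm.customers VALUES (1, 'Acme'), (2, 'Globex')")
conn.execute("INSERT INTO billing.invoices VALUES (1, 500.0), (1, 250.0), (2, 99.0)")

# One query spanning both sources, without landing the data in either one first
rows = conn.execute("""
    SELECT c.name, SUM(i.total)
    FROM crm.customers c
    JOIN billing.invoices i ON i.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 750.0), ('Globex', 99.0)]
```

The federated platforms add the pieces this toy leaves out: connectors to remote systems, pushdown of filters to the source, and, as Matt describes, eventually pushdown of governance policy as well.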
There was a lot of talk about the Databricks Marketplace, which of course is going to be relatively similar to the Snowflake Marketplace. That includes data sharing. It includes AI sharing, so you can actually do model sharing through this. Lots of interesting features, including Lakehouse Apps, which is going to be similar to Snowflake Native Apps. Lots of interesting possibilities coming out of that. And again, it's another area where they're really working to stay on pace with Snowflake. So both sides are pushing each other. Another really interesting area that they announced was LakehouseIQ—some of this is connected to the lakehouse. But LakehouseIQ is basically the integration of a natural language model on the front end. So people in your organization can actually query knowledge, whether that's related to AI models or related to your data, in natural language. They don't have to know SQL, they don't have to know Python. They can just ask the question like they would ask you. An important feature here is that this adheres to internal security and governance as well. So, you know, you're not going to have a field tech be able to just ask LakehouseIQ what their coworker makes. It's not going to work. So that was an important focus there, and it was an important focus for the whole conference: security and governance around AI. But it's a pretty cool integration on the front end with the natural language model there. Yeah. Matt, question for you. One of the challenges we saw when we looked at the roadmap coming out of some conferences on the analytics roundup—and Salesforce and Tableau are guilty of this—we saw a lot of talk about where AI was, and then we also heard things like "private preview in 2024." How close to reality did what you heard at Databricks feel? Is this aspirational or is this practical? That was one of the surprising things. Everything I'm talking about is either in private or public preview now. 
Well, with one exception: the policy pushdown feature within Lakehouse Federation is not there yet, but they expect it soon, which will be interesting to see. The other features of Federation are, I believe, in public or private preview now. But all of the ones I'm talking about are in some form of preview right now, which is great, because, you know, one of the challenges with all the conferences is understanding what's real and what's a dream they have of making someday. Yeah, that's super helpful. And Chris, I guess, I know you—I saw you in Las Vegas, speaking of who knows what's real or not. Tell us about the Snowflake conference. What did you hear at Snowflake Summit that stood out as either doubling down or continuing down this track? I mean, for example, Snowpark, right? I think that was a big piece of what I heard them talking about, and again, moving closer to that world Matt's been spending a lot of time talking about in terms of Databricks. What stood out to you there? Yeah, absolutely. And I will say I find it funny how a lot of these conferences follow a pattern where one year, something is announced and it's the big thing, and then the following year, it's effectively announced a second time. But the second time it's announced, it's now released, or it's now available, or it's now something we can use. The main example of that for me was the whole Snowpark and Snowflake apps piece. Last year, there was all this buzz about Snowflake purchasing Streamlit and having all these partnerships, and the whole of Snowpark being a thing, but this year the main focus was, okay, we've got it. We've shown you we have it, but now we can actually use it, and we can demonstrate how it can unlock all these new functionalities. And I think that apps marketplace is definitely the biggest thing, to be honest, at the Summit this year. 
There were so many demos, examples, customers, and products all talking about either what they already have or what they are planning to deploy into the apps marketplace. The real core attraction is that Snowflake are monetizing it so that you can use your existing Snowflake compute to effectively purchase and leverage these apps. And when you use these apps, they are installed onto your ecosystem. So it gets around a lot of contractual issues with the standard procurement process for a new product because, contractually, it all sits within that Snowflake agreement you've already put in place. That wasn't something you could do as easily before, and that, quite frankly, is what attracts a lot of people to the big marketplaces from Azure, AWS, and GCP: you have that exact contractual procurement piece, and it's there. But beyond the apps, and I really could talk about the apps for a full hour, I don't want to fill the full hour with that. The other two really cool things I think were announced were these. Firstly, there are the new dynamic tables that are now in public preview. They, again, were announced last year in private preview, but now they have been brought to the masses. This is really bridging that gap for getting fresh data into your platform as early as possible, including a whole load of transformations, and Snowflake have put a lot of effort under the hood into effectively streamlining the process to insert new records, merge existing records, or delete, all in a metadata-driven way that can push everything through more efficiently and more cost-effectively. And realistically, I think this is one of those pushes where they have identified that Databricks, quite frankly, were exceeding them on the streaming aspect. Dynamic tables, combined with all their recent Kafka and Snowpipe announcements, have in my opinion closed that gap, and there's this really great new way of bringing in your data and keeping it fresh. 
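Conceptually, a dynamic table maintains a target incrementally from a feed of changes rather than rebuilding it from scratch on every refresh. Here is a minimal Python sketch of that merge logic, with invented change-record shapes; Snowflake's actual implementation is metadata-driven and internal to the platform, so this is only the idea, not the mechanism.

```python
# Target table keyed by id; a change feed carries inserts, updates, and deletes.
target = {1: {"id": 1, "status": "open"}, 2: {"id": 2, "status": "open"}}

changes = [
    {"op": "insert", "row": {"id": 3, "status": "open"}},
    {"op": "update", "row": {"id": 1, "status": "closed"}},
    {"op": "delete", "row": {"id": 2}},
]

def apply_changes(table, feed):
    """Merge a batch of change records into the keyed target table in place."""
    for change in feed:
        key = change["row"]["id"]
        if change["op"] == "delete":
            table.pop(key, None)
        else:  # insert and update are both an upsert on the key
            table[key] = change["row"]
    return table

apply_changes(target, changes)
print(sorted(target))  # [1, 3]
```

The win is that only the changed rows are touched per refresh, which is why this style of pipeline can keep data fresh far more cheaply than periodically re-running a full transformation.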
And the third and final really exciting thing that I heard actually isn't yet available. There are always a few things. This one was announced as coming—I think they said from the start of the next calendar year—and it's a big push on improved observability within the platform: improved ways of monitoring your pipelines, monitoring your warehouses, monitoring exactly what's going on. That one particularly interested me because one of the big things I saw when I walked around the show floor during the conference was three or four different products that did exactly that. And it makes me wonder at what point Snowflake's own inbuilt capability may overtake them, or whether Snowflake may just partner with them, or whether those vendors might need to adapt their offering to fit more into the Snowflake apps marketplace. And if a lot of people decide to deploy their technologies and their products through the app marketplace, I think it's going to be really exciting to see how the market shifts, because you could get some big names that don't move into the marketplace and a brand new player that does come onto the marketplace with something that might seem relatively simple. But if they actually beat the rush and get there first, they might end up being the one that most people use. And I'm just really excited to see how that evolves in the next year, both for that and on the Databricks equivalent. Yeah. I'll double down on that. I mean, it really felt to me like one of the big innovations of the last ten, fifteen years, right, is the ability for me to make an incremental purchase from my phone in Amazon or in the iTunes Store, whatever it may be, and it just removes this need for me to get in my car and go to a shop and give someone else my three dollars or whatever it may be. 
It really felt like Snowflake is putting a lot of effort into that same sort of conversation: look, you can buy it here, you can run it here, and it will be that much easier for you to get your next job done. It certainly doesn't seem to me like we're necessarily saying running apps in Python on Snowflake is the way you would design your perfect, you know, enterprise architecture, but it certainly seems like the most pragmatic approach to the next scenario you run into in a lot of ways. Chris, I'm curious, just to keep Snowflake a bit honest. You know, last year, I feel like there were two conversations that they talked a lot about at their meeting. I feel like they have different stories, but I'd be curious on yours. Unistore, their OLTP conversation—have you ever seen anyone use this? That's question number one. And question number two is data sharing, right? Which I think is tangential to this marketplace, and certainly, to me, it's always been a very compelling, but historically maybe underused, feature of Snowflake that I've really been impressed with as an idea. Yeah, so on the Unistore one, I will say, it's unfortunately a rather quick answer. To my knowledge, Snowflake have not taken this past the private preview stage, and I'm not aware of many people, at least in my space, that have access to that private preview. The goal was clearly to enable more transactional-level processing as opposed to the typical OLAP-style approach they have. I'm assuming that it's still in the works and that there are just some very big players with a lot of feedback to give before it can graduate to public preview. But I didn't really see it talked about much at the Summit, which is interesting when it was such a big announcement last year, and it's gone quite quiet. So I'm not sure how Unistore's going to work out, or whether it might just be overtaken by some form of Snowpark-and-apps approach in something else. So we'll see. 
As for data sharing, that one I do think has been a real win for Snowflake. And yes, these days a lot of the other providers also have their own sharing capabilities. But Snowflake sharing, whilst underutilized across the general market, really enabled some pretty big names for Snowflake and a lot of core customers whose whole business is providing data. It enabled them to market their information, market that data globally, without actually having to go out and do it themselves. They could just put it up in the marketplace. They can outline exactly what it is, and customers can come right to it, click download, maybe pay if it's a payment model, and have that data available. And that has been a kickstarter, I think, for the apps marketplace as well, because if you think about it, what is an app? It's a load of code that's usually combined with some underlying data to feed it, or maybe to store user inputs or something like that. And that data sharing allows them to deploy that app. It's effectively a built framework that they were using to deploy apps before they even decided they were going to follow the apps route. So I thought that was really interesting, and it's definitely used. And we do have some customers out there that are using it for pretty interesting concepts, especially the idea of having one central organization that has lots of children, or lots of reportees, if you will. The ability to have one dataset that's processed for everybody, but then shared with security baked in—it's really, really neat, in my opinion. Yeah, and going back to the very beginning of this conversation, I think I said there's so much more you can do in the modern cloud space than you could with your SQL Server from fifteen years ago. I think this is a great example. Let's just say you're right. 
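That "one dataset, shared with security baked in" idea boils down to each consumer seeing a filtered view of a single governed table. A tiny Python sketch of the row-level pattern, with invented org names; Snowflake implements this with secure views and shares rather than application code, so this is only the shape of the idea.

```python
# The provider's full dataset, with a column identifying which child org owns each row
shared_table = [
    {"org": "north", "store": "N1", "sales": 100},
    {"org": "north", "store": "N2", "sales": 250},
    {"org": "south", "store": "S1", "sales": 400},
]

def secure_view(table, consumer_org):
    """Each consumer sees only the rows tagged with their own org; no export, no copy."""
    return [row for row in table if row["org"] == consumer_org]

north_view = secure_view(shared_table, "north")
print([r["store"] for r in north_view])  # ['N1', 'N2']
```

The provider processes the dataset once; every child organization reads live from the same table through its own filtered view, which is what replaces the weekly CSV export-and-ship routine.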
You have a central organization that wants to federate data out to affiliated entities, whether they be purchasers or customers, whatever it may be. How many times have we seen the world where we export this thing into a CSV, ship it somewhere, and ingest it—the same thing coming back, but the format changes, something breaks, and everyone's circling around frustrated. The fact that this is now secure, automated, seamless, and easy in the modern data space can't be overstated in terms of the impact to, frankly, the person whose job it was to do that every Monday morning. I feel like this has delivered a lot of big wins for organizations in that way. But okay, I'm mindful of time. I'm going to move on to the next segment. And by the way, in terms of the poll results, we are sharing them back out. I think a lot of what we're seeing in this quick poll is validating the need for tools like Snowflake to have good Python support. We're seeing Python become more and more essential next to SQL as the language of data, and I think we're only going to see that trend continue, particularly as we see more and more metadata as code. We'll cover that in the next section. Okay. So when I went to the Snowflake conference, I walked around and saw a lot of vendors. I asked a lot of people what their pitch was. And I saw a few things that I'll comment on really quickly and then see if the panel can weigh in. Number one, there's a very big ecosystem, with a lot of folks who are very specifically Snowflake vendors. It really feels like Snowflake Ventures has been very successful in creating a de facto stack of its own, which in my mind is very clearly competing for decision-making from our clients against the AWS stack, Microsoft, etcetera.
Number two, code versus no-code, particularly in the ELT space, seems to be a huge conversation these days, and I think it ties directly into this narrative we have around cataloging, observability, and sort of next-gen semantics. And frankly, when we talk about AI and generative AI particularly, the notion of being able to identify what data matters, and the relationships between data points, in code certainly seems necessary to get the most value out of training models in the generative AI space. Chris, you were at the conference with me as well. I know we spent a lot of time talking about this landscape. What were your takeaways in terms of what you saw in the ecosystem around Snowflake that was notable? And if you're somebody on this call who says, look, I run SSIS and SQL Server, what do those folks need to be thinking about in terms of moving not just their warehouse, but their whole data staging and interaction layer, into the modern space? I think the main thing I took from the conference is that there's huge saturation in the market for pretty much any of your different options. If you just take, let's say, the cataloging space, off the top of my head I can think of four, five, six different possible solutions, and they're all very similar. They have their differences, but they're very similar in terms of the actual features that they deliver. And the same can be said for the governance tools, the observability and data quality tools, the ingestion tools, the transformation tools. Yes, they all have their USPs, which are always useful and often stop being USPs a couple of years down the line when the other products catch up and they have to find a new one. But they're all just very similar really. So I think one of the main things I would say to anyone trying to explore any of these particular areas is, I wouldn't worry too much about getting the exact right tool.
I would focus on which one just feels the best for you at the time. Which one do you get on best with when speaking with their services team or their partner managers or their account managers? Which one matches the specific things that you are trying to achieve in a way that makes you say, yeah, that works for me? But there is huge saturation. We have our own preferred partners in a lot of these spaces, and I will very happily talk about all the pros of those particular technologies. But at the end of the day, if a customer turns to me and says, I'd rather use this one for this reason, that is more than fair—they can all lift and shift the data, they can all catalog, they can all govern. They just have their own smaller nuances. Yeah. It certainly seems like in a lot of cases, the best tool is the one you're going to use, particularly when we look at cataloging. I really felt like there were a few standouts, technologically, and I'll give a shout-out to the people at data.world, who had a really interesting argument: we can leverage a lot of automation to generate the catalog and keep it dynamic. What it feels like for me in the governance, observability, and cataloging space is that the problem is you need somebody to go in there and populate it, and then maintain it. Otherwise, it's immediately useless. And it really feels like what we saw from those folks, and in that whole new space in general, is that automation and the application of AI tech is going to give us a much more usable product. Since we're getting a little close on time and I do want to get to generative AI, Jenny, can you please run the last poll? I'd be curious to understand where our folks on the call today are in this space. Ben, I feel like we haven't heard from you yet, right? And I think this is the connection to the last conversation we had.
Semantics in terms of a BI layer have historically been very siloed inside that layer. If we go back to the OG days: if you had Business Objects, you built a universe for consumption in Business Objects. Well, now that we're thinking about this world where code interprets the objects, it feels like what we need to see is the evolution of a semantic layer that feeds not just BI tools—and I do mean tools, plural, because it certainly seems like most of our clients are moving to a world where they have more than one BI tool and would like a shared semantic layer—but also feeds a generative AI or at least an ML space. Can you comment on what you're seeing in what I think is a major coming battleground, and how you think about working with our clients in terms of what they should do now in that landscape? Yeah, so I think one of the threads in this whole conversation is consolidation of workloads, right? Whether it's SQL and Spark together and being able to look at the same data, because we're seeing the results of everything being split and fractured, and there's a cost to that in the integration. I think when we think about semantic layers or metric layers or what you'll hear the industry call headless BI, it's solving a similar issue just closer to the analytics layer, where we have data—like you said, we have things in Tableau or Power BI or a data science notebook or in some embedded analytics—and the cost of trying to get all those metrics to agree is high. We have separate teams, and it causes a lot of inefficiencies. And I'd also say a catalog is one possible solution to this. We can look at the landscape, see what's out there, and try to reconcile it, but that really just tells you, hey, what problems do I have? And then I can go manually fix them and talk to those teams. Versus saying, hey, we have a foundation and everyone should build off of that.
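As a rough illustration of what that shared foundation buys you, here is a toy Python sketch of the headless BI idea: metrics defined once, with the descriptions a machine or an LLM would need, and every consumer computing from the same definition. All names here are invented for illustration; real tools like Cube.js or dbt's semantic layer express this in their own configuration formats.

```python
# Toy sketch of a shared semantic layer ("headless BI"): metrics are defined
# once, centrally, and every consumer -- a dashboard, a notebook, an LLM
# prompt -- resolves them through that single definition instead of
# re-implementing the math. All names and data are hypothetical.

orders = [
    {"amount": 100.0, "refunded": False},
    {"amount": 40.0, "refunded": True},
    {"amount": 60.0, "refunded": False},
]

# Metric name -> (human/LLM-readable description, computation).
# The description is exactly the business context a generative AI model
# cannot infer from a bare column name.
METRICS = {
    "net_revenue": (
        "Sum of order amounts, excluding refunded orders.",
        lambda rows: sum(r["amount"] for r in rows if not r["refunded"]),
    ),
}

def evaluate(metric_name, rows):
    """Resolve a metric through the shared layer; every tool calls this."""
    _description, compute = METRICS[metric_name]
    return compute(rows)

# A BI dashboard and a data science notebook agree by construction:
print(evaluate("net_revenue", orders))  # 160.0
```

The point is not the ten lines of Python—it is that the definition and its description live in one place, so two teams cannot silently diverge on what "net revenue" means.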
So it's kind of coming at the problem in different ways. But the industry was already talking about this before generative AI. I think what happens with generative AI is it can do analytics pretty well. We're already seeing lots of interesting projects there and lots of vendors exploring it, but it needs to know about our business. You can't just give it a table and hope that it understands from the column names. It needs to understand the calculations and metrics we care about. It needs to understand the context. It needs descriptions. So that needs to live someplace, and traditional data warehouses or data lakes aren't really designed for that, or for that collaboration between teams. And I think that's where you see things like Cube.js or Transform—who was bought by dbt, and dbt now has their own semantic layer. That's where you're seeing all these things happen. I'd also say the data platforms themselves are trying to develop these things. We've talked a lot about Databricks and Snowflake, but Microsoft recently announced Fabric, and they're doing their OneLake where everything comes in—even S3—and you can run Spark and SQL and all the things we're talking about. But I think kind of hidden in there is that they're using Synapse Data Warehouse not to transform your data into a normal SQL Server, but to add semantics—what is the business logic, what are the metrics—that can be used by Power BI and notebooks and everything else you want. And so they are kind of silently trying to come in and capture all that, so you have one obvious place to connect all your data and run all the great dbt models that, you know, they also offer. It feels to me like the single biggest problem, or the single most frequent challenge, I hear from our client base around the world is some version of: we deployed a better system. So let's just talk about Tableau, for example.
A self-service BI or analytics toolset, and we don't know what the business is doing with it. And because we don't know, we don't really believe anybody knows, and we want to have an understanding of: what is the business doing? And if we need to shift directions, how would we even shift? Because we need to know where we are in order to know how to get where we need to go. This feels like the most prolific challenge in our business. If I look at this poll result, eighty percent of the folks responding are either looking for a solution to this or have a failed solution for it. And that doesn't surprise me in the least. It certainly feels like it will be worth having a deeper discussion just on this conversation of observability, catalog, and governance as we move forward. Because, again looking at the poll, seventy percent of the folks here said that AI, ML, and data science are either very important or a top priority for them in the next year. And in terms of actionability right now, it feels to me like there are two things everybody here can do: focus on data quality, and focus on data interpretability. How would a machine know what's important in my data? We need to establish this notion of semantics for either a machine or individuals inside your business to be able to leverage it effectively. So certainly, look for that as one of our next conversations. In terms of Q&A, we have just a few minutes left on the call, but I have been monitoring, and there isn't a whole lot in the Q&A. We do have a few InterWorks alumni on our attendee list today. One of them, from Cube, has asked a question around vibes-based vendor selection. Look, I'll say it, and I'll say it again: the thing that matters most with any of these tools is whether you or your team will actually engage with using the tool. So you need to ask the question: does it speak the right language?
The best GUI in the world—I guarantee you Matt Woods and Chris Hastie aren't going to open up the best GUI in the world and use it. They're going to ask, is there a console, right? We need the right tool for the right team, and I think that alignment matters a lot more than feature parity in selection here. Also, shout-out to another one of our alumni, from Greenhouse, on the call, so really happy to see you all. Thanks to everyone for joining. Really appreciate you tuning into this series. We're definitely going to have a follow-up coming on governance and cataloging and that entire space. Our next conversation on August 3rd is going to look at practical application. We'll bring in Rouse Dogs, CIO for Chicken Salad Chick, and hear a lot about a practical application of this in the RCG space. So with that, I know we've just about overstayed our welcome. I'll say thank you. Hope everyone's enjoying their summer, and we'll talk to you again soon. And thanks, by the way, to our panelists. I really appreciate you all joining. And if anyone has any questions or follow-up, don't hesitate to be in touch. We'd love to have the conversation.

In this InterWorks Data Forum episode, James Wright hosted a panel discussion with Chris Hastie, Matt Woods, and Ben Bausili, exploring the modern data landscape following major conferences from Snowflake and Databricks. The panel covered lakehouse architecture fundamentals, the evolution from traditional data warehouses to cloud-native platforms, and the converging feature sets of leading data platforms. They discussed key announcements including Databricks’ Lakehouse Federation, Snowflake’s native apps marketplace, and the growing importance of semantic layers for both traditional analytics and generative AI applications. The conversation emphasized practical considerations for platform selection, the ecosystem of supporting tools, and the critical role of data governance and observability in enabling successful AI initiatives.
