Data Prep for Really Good Analytics

Transcript
I was muted all along. So alright. We are five minutes past the hour, so let's get started. Welcome to today's webinar on data prep for really good analytics. I'm Carol, and I'm your analytics consultant based out of Melbourne. And I'll be your MC for today. Again, I'm joined by Ryan from our analytics team in the US region, and we will introduce him in just a minute. So next slide, please, Ryan.

Before we get into today's content, I wanted to take a few minutes to introduce InterWorks. Maybe this is your first webinar with us, or maybe you are returning. Either way, we welcome all of you, and we're super glad to see you here. You might be wondering who InterWorks is. We do a lot, and sometimes it's a lot to explain. If I put it simply, we specialize in data strategy. So if you work in analytics, you know the challenges of an ever-changing tech landscape and the pressure of keeping up with the high demand for insights needed to drive change within any organization. So that's where InterWorks, that's us, comes in. Our specialty is in building the best data strategies alongside you and in being your trusted advisor when you need it. Further, everything that we do here is backed by our people. So we're constantly learning too, and we want to share everything with you during this learning journey. Next slide, please.

And beyond our mission and our people, we can also help you navigate the right tools that align with your goals. So you can see some of our partners that are on screen. And if you're looking for more resources on data analytics or any of the technologies that are going to be discussed today, be sure to visit the InterWorks blog. It's world famous, and it's a great knowledge base for anyone in your organization who might be working with data.

And some quick reminders: we hold webinars like this every single month, and we value feedback from our different customer communities and audiences, so we can curate content based on your biggest challenges. And as mentioned before, today's webinar will be recorded. In a few days, we'll send you an email with a replay. So if you don't get an email from us, you can find this recording on our blog. And if you want to catch up on previous webinar replays, you can find the catalog on our website. Finally, just one request for today's presentation: we will take questions towards the end of the session. But just to help us out, please use the Q&A function, which is at the bottom of your Zoom controls. It's right next to chat. Kindly refrain from putting questions into chat because sometimes they can be missed. So use the Q&A button and put your questions there, and then we will take those questions at the end of the session.

Okay, so let's meet our presenter and get into the main event for today. As I said previously, I'm Carol, and I will be your emcee for today. And we have Ryan Callihan, who is the analytics lead from the US region. I will let Ryan introduce himself. So Ryan, take us away.

Hello, everyone. As Carol said, I'm Ryan Callihan, and I'm on the US side. I'm in the States, currently living in Oregon. If anybody doesn't know where that is, that is the state that's above California. So it's a little hat on top of California. And if you ever get a chance to come to this region, the Pacific Northwest, it's beautiful here. There are mountains. There's the ocean. There are beautiful lush forests, waterfalls, and rivers. It's a great place. I'm trying to get all my friends to move out here after I moved out here. I'm an analytics lead at InterWorks.
I've been here for about four years, and in that time and before that, honestly, I've really gravitated towards data prep. Kind of a strange thing to gravitate towards, but I'll show you why. It's a pretty interesting topic when you pop the hood. There's a lot there. It's obviously a really important topic for us because we spend a lot of our time doing that. So let's get right into it. This is data prep for really good analytics. So who is this session for? This is for the analyst. This is for the dashboard developer. If you're doing those things, you may be doing both of these things, especially in a self-service environment, which is probably if you're using Tableau, you're using similar tools like that, you're probably in roughly a self-service environment. A lot of environments are like that now. And also, you build either dashboards or you build analytic tools or products, and maybe you even put those things into production. So this is geared towards you if that's you. Also, if you just simply do data preparation, this is also geared towards you. What we'll cover: we'll cover what data prep looks like in the modern analytics stack, in a modern BI environment, and what the landscape looks like there. And then we have a few suggestions for setting up a good, solid data prep environment or some tips that you might want to employ when you're doing data preparation, and that'll be in the last half of the presentation. So before getting into everything, I just want to acknowledge, if you're an analyst, a data analyst, a business analyst, you have a lot of responsibilities. You wear a lot of hats. There's a data visualization piece. You're doing analysis. You're having to think about data quality. There's governance, which is a big part of everything. Sometimes you're building dashboards. Sometimes you're interpreting data. Sometimes you're doing storytelling around that with those dashboards and data visualizations. And then also in there, you're doing data preparation. Again, this is a big part of what we do, but it's just one of those many things that we have to worry about. But when I'm talking about data preparation, what do I mean? What are the activities that someone doing data prep would do? When you get a new dataset, and this is generally what I do, and maybe this is what you do too. You spend some time exploring that dataset. You check out the distributions of the columns. You look for any patterns. You might do some summary statistics. So you take a column, you do an average of that, or you sum, or you group by another dimension and then sum, you know, sales across that. You look for anomalies. Right? You want to see if there's something weird about your data. For example, maybe there's some pretty strong outliers. Maybe there's a bunch of nulls in there that shouldn't be there, or maybe you have integers in string fields. And once you identify that, then you kind of go into the cleaning step and figure out what to do with those anomalies. Right? Things like split. If you're taking a column, maybe an address column, you want to pull out the city and the state, you'd do something like a split. You'd rename the column if the column name was funky, which it often is when it comes from some other source system. Yeah, this isn't totally comprehensive of everything that you would do in the cleaning step, but there's a lot of functions here. Hopefully, a lot of these are familiar to you. There's sort of an enriching step. 
You want to take the data that you have and then join in some other information, maybe more dimensions that will help you in the storytelling, that will help you in the analysis. So, yeah, you might be taking one table and then joining in another table, or you might be doing a union where you have multiple years of data, but they're all split off into different tables, and your analysis requires you to have a continuous view of that data. So you union all those years together. You stack all those tables on top of each other for your analysis. And then, you know, maybe towards the end, you go into the shaping phase. This is where you really transform your data, and you transform it in a way that is useful for analysis, which is useful for building dashboards. You might pivot it. You might use something like a group by to aggregate your data. You might change the field types: if a field was a string type before, but it was just full of integers, maybe you would change it to an integer type or convert something to a date, etcetera. So this is roughly, not totally comprehensive because data preparation means a lot of different things, but these are some of the things that you might be doing for data preparation.

I'm curious here how much time you all spend preparing your data, so I'm going to add a poll here. There's a common stat that gets thrown around a lot, and I'm just curious how that aligns with this audience. So if you could go ahead and fill that out. I'm not going to give it too long, maybe ten more seconds. But I appreciate all of the responses so far. Okay. Awesome. I'm going to end the poll and share the results. So hopefully you see that.

Looks like people gravitated towards somewhere between fifty and seventy-five percent, or over seventy-five percent. If there were a mean or median, it would probably be in there somewhere, around sixty percent. There's nobody that said their data is always perfect, which is no surprise. But even if we spend fifty to seventy-five percent of our time preparing data, that is an enormous amount of time that we spend preparing our data for dashboarding or analysis or whatnot. And it really creates the need for us to spend more time thinking about how we actually prepare data and whether we're doing that correctly. Right? So clearly, we do spend a lot of time preparing data. Another stat you see get thrown around is that we spend, like, eighty percent of our time preparing data. So, you know, maybe people in this audience have clean data coming in, or they're just really efficient with their time, which is great.

Speaking of being efficient with your time, there are a lot of modern data prep tools, and some of this presentation will gravitate around actual dedicated data preparation tools. The value that these data prep tools provide is pretty great; the screenshot on this slide is from Tableau Prep. I've been using something like Alteryx for four or five years, and I've been using Tableau Prep for a few years. I love using these tools because they've changed the way I think about data, and also, they've saved a lot of hours of my life. So there's faster time to delivery when you're cleaning data. These tools are often very visual. You can do things at the speed of thought. You're not writing a bunch of code and then having to compile it and see the output of that. You're just sort of iteratively going along, making changes, adding functions, seeing the output of that. And that's a really powerful tool that we have.
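If it helps to see those activities written down as code rather than as clicks in a visual flow, here's a minimal pandas sketch of the same explore, clean, enrich, and shape steps just described. Everything here is hypothetical (the table and column names are made up), so treat it as an illustration of the pattern rather than anything from the webinar demo.

```python
import pandas as pd

# Hypothetical inputs: two yearly sales extracts plus a store dimension table.
sales_2023 = pd.DataFrame({"store_id": [1, 2], "order_date": ["2023-03-01", "2023-07-14"],
                           "sales_amount": [120.0, 80.0]})
sales_2024 = pd.DataFrame({"store_id": [1, 2], "order_date": ["2024-02-11", "2024-05-30"],
                           "sales_amount": [150.0, None]})
stores = pd.DataFrame({"store_id": [1, 2], "region": ["APAC", "APAC"],
                       "STORE_ADDR_TXT": ["Melbourne, VIC", "Sydney, NSW"]})

# Explore: summary statistics and anomalies (nulls, outliers, odd types).
print(sales_2024.describe(include="all"))
print("null sales values:", sales_2024["sales_amount"].isna().sum())

# Clean: rename a funky source-system column and split out city and state.
stores = stores.rename(columns={"STORE_ADDR_TXT": "address"})
stores[["city", "state"]] = stores["address"].str.split(",", n=1, expand=True)
stores["state"] = stores["state"].str.strip()

# Enrich: union the years together, then join in the store dimensions.
sales = pd.concat([sales_2023, sales_2024], ignore_index=True)
sales = sales.merge(stores[["store_id", "region", "city", "state"]],
                    on="store_id", how="left")

# Shape: convert types and aggregate to the grain the analysis needs.
sales["order_date"] = pd.to_datetime(sales["order_date"])
by_store = sales.groupby(["region", "state", "store_id"],
                         as_index=False)["sales_amount"].sum()
print(by_store)
```

A visual prep tool performs essentially the same operations; the difference is how quickly you can iterate on them and how easily the result can be scheduled and rerun later.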
These tools can be used by non-engineers. You don't have to be super technically savvy from a data engineering perspective to be able to use them. They're used by analysts, used by business analysts. Data scientists sometimes use these to prepare their data for that. And also, they are used by data engineers because they are intuitive and easy to use and for another reason that they're often used as a prototyping tool, which you would hand over to a data engineer. I'll talk about that in a bit. They're really great for detecting data quality issues because you can quickly poke around, and this is similar to Tableau Desktop if you use that. You can just quickly poke into the data, interrogate it, find issues with it, and then what's great about data prep tools is you can build out a workflow to deal with it. And then my favorite part is that once you build out a data prep workflow, you can actually operationalize it. You can publish that somewhere, and then you have a repeatable workflow that you don't have to do over and over again. So if anybody's done a lot of data preparation in Excel, that's an environment where it's really hard to automate what you've just done. And with this kind of tool, the automation's automatically built in. So to give you an idea of what data prep tools are out there in the landscape, when I say data prep tool, I'm often talking about something maybe in this box here: Tableau Prep, Alteryx, Dataiku. But I do want to mention that you can do data prep in basically the entire modern data pipeline. Upfront here is an ETL or ELT phase. That stands for Extract, Transform, Load or Extract, Load, Transform. These tools are often in the realm of the data engineer. They're closer to source systems. They're dealing with lots of transactional data. They're just trying to figure out how to organize and deal with data and get it to a place where then eventually, you can plug into it maybe with something like a data prep tool, and a business user can get in and then start developing reports or analytics and that kind of thing. But you can do essentially data prep in all of these steps. So as I said, further to the left is more in the realm of engineers, and this is very rough. And then further to the right, you might have more analysts using that kind of thing. And with that, there's a sort of technical nature to it. The tools on the left might require more code, maybe more SQL, maybe more Python. And as you go to the right, the requirement for code is way less. You can still use code in a lot of these. I think every single one of these data prep tools, you could probably use Python in. And then finally, this one to me, I feel is very important, but there's a path to production element to these. If you did your data prep in the tools closer to the left, it is much easier if you've built out your data prep there to put that into production and feel comfortable with putting it into production. And we'll talk a little bit about this later, but it's not always the case. You don't always want to take whatever you've done in Tableau Prep for cleaning and put that thing into production. For example, if you only have Tableau Prep locally, you might not want to use that to prep data for production dashboards. So there's a path to production part of this that I think is also important when thinking about all of these tools. All these tools here, by the way, are not totally comprehensive of all the data tools out there. There's a million of them. 
These are the ones that we find best in class and that we partner with. So that was the data prep landscape piece of this. I'm going to move on and just talk about a few tips that we have as InterWorks for doing data prep. I'm a consultant. I see a lot of different environments, and I see a lot of good practices, and I see a lot of not-so-great practices. So with that, I've gleaned a few insights, and I just want to share them with everyone here. These are ten-thousand-foot overview tips. I'm not going to necessarily demo anything or show you the specific function that's going to change your life. But hopefully, these are some broader thoughts that will help you along in doing data prep at your organization.

First one here is to create good data structures for analysis. That might seem obvious because, in the end, you're doing data prep. You're cleaning your data for analysis, and you want to do a good job doing that. But when you use a framework or have an actual process down for how you want to clean your data, it'll help you immensely in the long run because you'll have expected outcomes after you've cleaned your data. You'll recognize the data to be a certain way, and then you can quickly do analysis based off of that. That sounds a little heady, but I'll show you what I mean here.

When I was first getting into data preparation, back when I was using Alteryx a lot (I still use it a lot, by the way), I was building out all of these data pipelines, but I wasn't sure if I was doing it right. I was like, how does everyone else prepare their data? What's the best way to prepare your data for analysis? And I came across Tidy Data. If anybody's used R before, Hadley Wickham, who created the tidyverse collection of R packages, published this concept called Tidy Data. It's in the Journal of Statistical Software. If you really want to get nerdy about it, there's the link here, and hopefully someone can post that in the chat. In this paper, he provides a few key principles, or things that you would do to make your data tidy.

At the beginning of this, he has a couple of great quotes, and I really like the way this is framed. He quotes Tolstoy: "Happy families are all alike. Every unhappy family is unhappy in its own way." And then he likens it to data: like families, tidy datasets are all alike, but every messy dataset is messy in its own way. What this is saying is that data comes to us in so many crazy different messy formats, and we have to deal with it. And it's never messy in the same way. However, if we clean the data in a more standardized way and get it to a form that is useful for analysis, those clean datasets, those tidy datasets, tend to look roughly the same. You'll look at a table that's tidy, and you'll be like, that's a good dataset for me to analyze. You'll obviously look at a messy one and be like, that looks like a nightmare.

So what are those things that we can do to create these tidy or happy datasets? In tidy data, there are three very basic rules. Each variable must have its own column. Each observation must have its own row. And each value must have its own cell. Let me dig into that a little bit here. This is just a very basic table that has individual classrooms and then some attributes or dimensions about each one: the teacher, the room square footage, and then the count of students. When we're thinking about a value, a value is just what's in a cell. Right?
This is, in this case, count of students. This value here of twenty is twenty students for that classroom. Obviously, per the rule here, you wouldn't want to have multiple values in the same cell, though I've seen it. But hopefully that one's a little bit more intuitive for everyone here. Each variable must have its own column. So a variable here would be teacher, room square foot, count of students. And the values going down are consistent to that variable. So for count of students, I also don't have room square feet also in the same column. And then finally, each observation must have its own row. So our observation here is a room, and we have some attributes about that room going across. So room one hundred one, Ms. Frizzle, five hundred square feet, twenty students. What we don't have is also in the same row the values and variables for room two fifteen. So this is a very tidy dataset. You can do some analysis with this. To kind of hit this home further, this is straight from the paper, but I like these examples. This example here, A, think of this as maybe results from a clinical trial. We have Treatment A and Treatment B as columns. The problem here is that treatment itself is a variable, and variables must have their own columns. So what we really need is a column that says treatment, and then in that column as values, we would have A and B. So this is a messy dataset. You don't want to analyze this. And then the same goes for this next one, this is B. This is also messy. It's basically the same thing, just flipped. We have variables, name, which are across columns. They're taking up three columns instead of one column. So the correct way to format this down below is we have name, which is a variable, has its own column. The treatment, which is a variable, has its own column, and the results also have their own column. If you want to dig into this more, again, read that paper or at least read part of the paper. It was pretty compelling. Anyway, yeah, tidy dataset, happy family. Another concept you'll hear, especially if you're preparing data for Tableau, you see this mentioned quite a bit, just narrow versus wide. This is essentially the same idea, but generally, your datasets will be a little bit more narrow going into Tableau, meaning that you're making your dataset longer by the transformation you're doing to clean the data. Data often looks like this first piece here. This is something that you would see coming from an Excel spreadsheet. It's human readable, but not necessarily machine readable. And then thinking back to tidy data, the problem here is that for each column here, there's actually three different variables. We have Biology, which is a course, and that should have its own column. There's January, which is a time of some sort. That should also have its own column. And then there's students or count of students. That also should have its own column as well. So we would do something like a pivot. So in your data prep tool, you use the pivot. You probably actually use two pivots to work this one out. And you can see down here, we have Course and Month and the number of students all in their own column. Again, another way to think about this is this up above, this is human readable. It's easier for our eyes to track, but to do analysis with, for a machine to actually ingest this information, this top one is not great. The bottom one is much better. Second tip or thing to think about here with data prep is keep performance in mind. 
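Before moving on to performance, here's roughly what that wide-to-narrow reshape can look like as code. This is a minimal pandas sketch with made-up enrollment numbers; the layout on the slide needed two pivots, but this simplified starting shape only needs one melt, and the idea is the same.

```python
import pandas as pd

# Wide, human-readable layout: one row per course, one column per month.
wide = pd.DataFrame({
    "Course":   ["Biology", "Chemistry", "Physics"],
    "January":  [30, 25, 20],
    "February": [28, 24, 22],
    "March":    [27, 26, 21],
})

# Narrow, analysis-ready layout: Course, Month, and Students each get their own column.
narrow = wide.melt(id_vars="Course", var_name="Month", value_name="Students")
print(narrow)
```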
With this thought, I don't necessarily want to get away from one of the big value propositions for data prep tools, and that you can iterate quickly, you can fail fast, you can quickly come up with things. I don't want you to obsess over performance that entire time you're coming up with, you're cleaning your data. Though if the thing that you're building ends up in production, that's when you need to start obsessing over performance here. And what I'm going to talk about here is how you can use data prep, especially specifically for dashboards, and this will be in Tableau's case, how you can use data prep to improve the performance of those dashboards. So when it comes to performance, especially with Tableau dashboards, if you're a Tableau developer and haven't read this yet, I highly, highly recommend it. This came out from one of my colleagues last year, two of my colleagues. Very smart colleagues of mine, and I feel very fortunate to be coworkers with them. They spent a lot of time obsessing about performance in Tableau and how to just squeak out as much performance as they possibly can from a dashboard. This is still basically the latest and greatest on this. I keep a version printed off near my desk because I use it so often. I reference it so often, so I highly recommend looking at this. But it gravitates towards four central areas that we, as developers, can control when it comes to performance in your dashboard. Those four areas are the data that we feed into the dashboard, the calculations that we use within Tableau, the visual controls and layout of that dashboard, and then the design and user experience of that dashboard. And can someone send that link out that was on here to the white paper? So those are the four things that we can actually control with dashboard performance. When it comes to data prep, two of those would be in the purview of data prep. One is the data that we feed into the dashboard. And then two are the calculations that we use within Tableau, and I'll explain what both of those mean. So focusing in on the data part of this, these are just some of the summary ideas from the white paper on the data side of things that we can control, how you would improve performance. One is limit data size. Limit the number of columns. If you have a bunch of extra columns that you don't need in the dataset, don't have them there. Do aggregation. So aggregation would limit the row size. So say if your dataset was at the transactional sales level, you wouldn't want that in your dashboard if the most granular thing that you were looking at was sales per store. It'd be unnecessary to have all that data in there, so you do an aggregation. Or you filter your data. Say you have a global dataset, but your analysis is only in Australia, so why would you have the entire globe in there? Just filter out for Australia. You're limiting the data size. So that's something that we can control during the data prep. You can remove the columns. You can clean the data that way. You can do aggregations, and you can do some sort of filtering. Use well-modeled data sources. For example, if your data source is messy, if it's not tidy, that will cause you to do a bunch of wacky calculations on the Tableau side to make it work, which then causes performance issues. Materialize calculations in the data source. This really is a data prep topic, and I'll explain what this means. 
But essentially, if you have calculations happening in the Tableau workbook, when you publish that dashboard, say if you're just dividing two columns, when the user is interacting with that dashboard, that math is happening on the fly. If you materialize the calculation, that calculation is already done in the data source. So when the user is messing with the dashboard, it's not having to perform that calculation. Therefore, the user is just having an overall better experience. So when we're talking about materializing, I know it's kind of a, if you're not familiar with that word, it's kind of a strange one. But essentially, all this is is in one of those upstream data sources, do those calculations in advance. That way, the dashboard isn't doing those calculations. Some calculations that you really want to consider moving upstream into the data prep phase. Heavy string manipulation. These calculations, I'm sure you're all familiar, can get very hairy. The more complex it looks, it's probably the case that that's the more complex and taxing it is from a compute perspective. So, you know, this is just an example of a regex expression, but it's probably taxing enough that you would want to move this upstream. And sometimes these types of calculations, it doesn't necessarily make sense for the dashboard to be doing this anyway. It just makes sense for this to be closer to the data source. Date conversions are equally taxing. They can also equally get gnarly. Here's an example of one. And this is honestly just a basic date conversion. So these can also get gnarly, something you'd want upstream. Even just casting fields. So if you have a string field and that should be a date, casting that to a date is compute intensive. But also, it doesn't really make sense for the dashboard to be doing that anyway. In a better governed environment, you would be dictating what those columns are, just more upstream, closer to the data source anyway. So another example of something to materialize. Grouping. If you're reorganizing dimensions using, like, case logic or if-then-else or using grouping in Tableau, also something else to move upstream. But then heavy usage of LODs. You can't really move LOD calculations upstream, but if you're heavily reliant on level of detail expressions, there's probably something wrong with the granularity of your data or your data model, so you might want to rethink how you've set up your data model or the granularity of the data going into the dashboard. When I'm talking about moving things upstream, I just want to have a little visual for that. This is an example of roughly how the pipeline of data might work at your organization. But again, this is something that really is organization, environment dependent. But you might have some data sources. Maybe these are maybe ones in Snowflake and governed by some data engineers. You use your data prep tools to pull some of these tables out, pull some tables from other data sources, join them together, enrich your data. And then what often happens, and this is in a Tableau environment, is you will use a data prep tool to then push an extract to Tableau Server, and that has the data that you need for the dashboard that's pulling from it to create data visualizations. So where are all the places that logic can happen and calculations can happen? Where can it live in this data pipeline? Well, basically, everywhere. You can have your calculations up here in the source. 
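To make "materializing" a little more concrete, here's a minimal pandas sketch of the kinds of row-level calculations you might push upstream into a prep step so the published dashboard never has to compute them on the fly. The field names, regex patterns, and groupings are hypothetical.

```python
import pandas as pd

# Hypothetical extract with raw, string-typed fields straight from a source system.
df = pd.DataFrame({
    "order_id_raw":   ["ORD-000123/US", "ORD-000456/AU"],
    "order_date_str": ["2024-01-15", "2024-02-03"],
    "revenue_str":    ["1999.50", "245.00"],
    "channel":        ["Paid Search", "Organic Search"],
})

# Heavy string manipulation: regex extraction done once, upstream.
df["order_number"] = df["order_id_raw"].str.extract(r"ORD-(\d+)/", expand=False).astype(int)
df["country"] = df["order_id_raw"].str.extract(r"/([A-Z]{2})$", expand=False)

# Date conversion and casting: dictate the column types close to the source.
df["order_date"] = pd.to_datetime(df["order_date_str"], format="%Y-%m-%d")
df["revenue"] = df["revenue_str"].astype(float)

# Grouping logic (instead of CASE / IF-THEN-ELSE calculations in the dashboard).
df["channel_group"] = df["channel"].map(
    {"Paid Search": "Paid", "Organic Search": "Organic"}
).fillna("Other")

# The dashboard now reads precomputed columns; no on-the-fly regex, casts, or grouping.
```

So, again, that logic can live up in the source.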
You can have it doing it in the data prep tool, or you can have it in the dashboard side of things. And so the whole point of this is that maybe we should start moving some of these things further upstream. So you can move them to source. It would probably require a conversation with some data engineers, but that might be a good place for it, especially if it's critical logic. You don't want to have highly critical calculations that may be happening on the dashboard side of things, but maybe in a more governed environment like up here. You could also do it in the data prep tool. So if you did it here, then those calculations would be materialized in the extract that the workbooks are looking at or the dashboards are looking at. So the calculations are materialized, and then, therefore, that would basically effectively improve performance because of that. And I'm just acknowledging another setup here is you might not have a data prep tool. Tableau Desktop itself is a very good data prep tool, so you might do that data modeling here, pull an extract from a bunch of data sources. And there's a way, and I apologize, I'm not showing that here, but there's a way when you're creating an extract to materialize calculations. It's not going to materialize all the calculations. Like LODs, it can't materialize, but some of the row-level calculations, it will materialize upon creation of the extract. That way, those calculations are already done, and it's not happening when the user's interacting with the dashboard. The third tip here, and this seems obvious, is avoid technical debt. These data prep tools that we have are very powerful because we, as analysts, are building out data pipelines that then often feed production analytics, they feed production dashboards. And because of the ease of these tools, we can just spin them up. We can create a bunch of them. You can create five or six data pipelines a day if you wanted to. And that's a very dangerous thing from a technical debt perspective. And so what I mean by technical debt, there are basically long-term costs to short-term decisions, especially if those decisions aren't the best decisions, but you're just trying to get something done. And I know that we've all been in this situation. Some examples of that might be, and that's on the right side here. But, you know, just something as basic as inconsistent naming conventions for fields. Maybe it's not aligned with the way that the rest of your organization has those naming conventions. You rush something into production. You didn't have time to create documentation because you're on to the next project. And then I just mentioned this, but rushing something into production is another one. It's very easy for us to create prototypes with data prep tools, and so often things get rushed into production. It feels like things are working, but there's probably a lot of technical debt built into these data pipelines that you've built with these data prep tools. And again, it's the nature of these tools. They're very powerful, but it comes with some possible technical debt if we use them the wrong way. I just want to give a quick example of what this might look like, and this is certainly a scenario that I've fallen into before. But say you're in marketing and Susie, your very well-intentioned boss, has a dashboard request for you. For some reason, that dashboard request is in Papyrus font, but we'll just ignore that. And you know where that data is. You feel good about it. So you find that data source. 
You create a data prep flow for it because you have to do some manipulation of it, and you build that dashboard. Great. You're good to go. She passes that dashboard around, and it turns out Frank from sales really loves that dashboard. He wants to see some of that same data but with a couple additional columns that help him out in his sales department. He's asking for three dashboards, but you can probably build that out in two. And because you've already built this in production and it's working, you don't want to touch it. You don't want to mess with it. So you just build two other flows that go along with this. So you're feeling okay. But Susie comes back. She has another dashboard request. This time, it's in Comic Sans, but you don't ask any questions. And it's just a couple additional views that are similar, but not different enough that you feel like you have to—I'm sorry, not similar enough that you feel like you can use the same data. Maybe you build up two other pipelines here. You can see where I'm going with this, but because we can have logic here in the data source, because we can have logic here in the data prep phase, and then we can also have logic in the workbooks, you've essentially created a ton of technical debt here, and it's a big house of cards. What if you leave? What if one of the metrics that gets reported across all of these dashboards, if that gets the calculation for that changes. You have to go into a bunch of data prep tools and flows and change all of that and hope it doesn't break anything, or maybe that logic is sitting here. And there's a lot of examples of how this can break down. But in general, you should maybe think about when you're pushing things into production, what's already there and if you can start to reuse these things. It might be the case that you, because of overlapping logic and overlapping data, you didn't necessarily need two of those data prep pipelines. You could have combined some of them. And so that flow then was sourcing three different dashboards. And that removes the need for five different flows. And perhaps maybe there was one of those flows that were required because the reporting requirements were just different enough that you needed a whole other prep flow, which is fine. But the nature of these tools, it's very easy to rush these things into production. It's very easy to spin things off, so I just want to caution you to be careful about these things. Finally, data prep tools don't always need to be the thing that you push into production and use. They also make really great prototyping tools. I've seen a few really good examples from my clients where they actually have started to use data prep not as the thing where they're pushing this flow into production and using that for the production dashboards, especially the really important ones where you have to do a lot of refreshes and the data needs to be up to date and there's a lot of data involved. No. They use this as a prototyping tool. And because you, as the person closest to the business, the data analyst, the business analyst, you know your data really well, and you know the logic you need to prep your data, how to enrich your data. It makes a lot of sense for you to build out these pipelines. But because, and if we think back to that previous slide I had about how some of those more data engineering tools are better to productionalize, it might not always make sense for you to productionalize a Tableau Prep flow or maybe even an Alteryx flow. 
And so this makes a great prototyping tool that because it's visual, it's very clear what's happening. It's linear. And so you can hand this off to a data engineer, they can pretty quickly know what's going on here and then convert this to SQL or some other code. So I want to put that out there. These data prep tools are really powerful, but they're also just really decent prototyping tools. And with that, that's it. So just kind of quick summary of those tips there. Think about performance. Don't create technical debt for yourself. And then try to create and have some standards around the data structure that you're creating for analysis. With that, do we have any questions? We do have one question in the Q&A. It says, how do we regularly run a Tableau Prep flow? Great question. So if you have Tableau Prep Conductor, which is part of the Data Management add-on for Tableau and Tableau Server, you can take a flow that you've created in Tableau Prep, the desktop version of it, and then you can publish that up to Tableau Server. When it's up there, then you can schedule it so you can have this run on whatever cadence you want. Oftentimes, it's nightly. And in later versions, you can actually have that flow contingent on other flows succeeding, which is interesting. So say if you have two data prep processes and one is on the tail end, you can have that tail-end one run once that first one succeeds. So, yeah, basically, to answer that question, you would need the Data Management add-on and Tableau Prep Conductor, but you can absolutely publish and schedule flows to Tableau Server. I see a couple, I see in the chat a question. Materializing calculations, when that would be helpful. I had that slide that showed the types of calculations in which you want to think about materializing. It's hard, you can use the Performance Recorder tool in Tableau to get a sense of which calculations are especially taxing. If you want to go down that route, you can, and then you can decide to materialize. On the Tableau side of things, Tableau Desktop side of things, you don't really have control over which fields are materialized. You just materialize all of them that Tableau can. So I would say it's mostly helpful, especially if you have row-level calculations. It's helpful then. But, I mean, materializing the calculations will simply just improve performance, though. You might run into some analytical issues, say, if there are some columns that shouldn't be computed in advance and that need to be a part of the aggregation or the slicing and dicing that you do in Tableau. It might not make sense to materialize them. And to answer your, there's another question about Data Management being a must. I think that is true. I think you need the Data Management add-on to have Tableau Prep Conductor, but I might double-check on that. Any other questions? Awesome. Well, thank you, everyone. Just a reminder that there will be a replay of this that will get sent out in an email after this, maybe one or two business days. There also will be a blog post that will have this posted in it. So if you missed everything or wanted to watch this again, this will absolutely be out there for you, a recorded version. And then also there'll be a survey, a short survey. We really want to hear your feedback on this. If you liked it, something you didn't like about it, something you think we can talk about more or elucidate more on, we'd love to hear it. So please fill out that survey, and thank you for joining.

In this webinar, Ryan Callihan presented strategies for effective data preparation in modern analytics environments. He explored the data prep landscape across the analytics stack, from engineering-oriented ETL and ELT tools through dedicated data prep platforms such as Tableau Prep, Alteryx, and Dataiku, to BI and dashboarding tools. Ryan shared key principles, including creating tidy data structures based on Hadley Wickham's framework, optimizing dashboard performance by materializing calculations and moving preparation upstream, and avoiding technical debt when operationalizing workflows. He emphasized that data prep tools serve a dual purpose: they can feed production, and they also work well as prototyping tools whose flows can be handed off to data engineers. Carol Prins hosted the session, which included audience polling suggesting that most attendees spend fifty to seventy-five percent of their time on data preparation activities.
