Go! Dataiku

Transcript
Welcome everyone to today's InterWorks webinar on Go Dataiku. Before we dive directly into the content, let me take a few minutes to introduce InterWorks just in case this is your first webinar with us. Either way, whether it's your first or whether you've been here before, we're glad to see you! If you're wondering who InterWorks is, we do quite a lot and sometimes it's a lot to explain. If I put things simply, we specialize in data and analytics strategy. If you work in analytics, you know that the tools and the landscape always keep changing, and the pressure of keeping up with the high demand for insights to really drive change in your organization can be tough. This is where we come in. Our specialty is really building out analytics platforms and data strategies alongside you and being a trusted advisor when you need it. Everything we do is backed by our incredible people that I'm lucky to work alongside, and as we continue to learn, we love sharing our insights with you through these webinars. These run every month or so. Beyond our mission and our people, we can help you navigate the right tools as well. Some of our partners you'll see on the screen here. If you're looking for more resources surrounding these partners or data and analytics in general, do check out the InterWorks blog. It is world famous at this point, and it is a great knowledge base for anyone working with data. For some housekeeping items for this webinar: By default, your lines are all muted. We will take questions near the end of the session. In your Zoom, there are two different interaction options at the bottom for Q&A and for chat. Please use the Q&A for any questions that you may have. The chat can be for general discussion, but the Q&A holds the questions that we'll tackle towards the end of the session, and that way it'll be easier not to lose them along the way. We'll also put some links and things in our chat as we go. This webinar will be recorded. The replay will be sent to your email within a few business days. So if at any point you have to jump, don't worry, we'll send this one out to you. Now, I hadn't introduced myself yet, and I've brought along my lovely coworker as well. My name is Robin. I'm an analytics consultant for InterWorks based in Amsterdam, the Netherlands. I've been doing that for about three years, and I've really been enjoying digging into different industries and building out analytics products across various departments. Rowan, would you introduce yourself?
Yeah. Thanks for that nice intro, Robin. So I'm Rowan Bradnam. You can see we don't have very original initials for our names here, both RB. But I've been at InterWorks for a few years now and have two roles, as an analytics consultant and also as a public sector lead, helping with various different areas. Included in that is working with Dataiku in the public sector. We've got quite a few NHS clients who are starting their journey with Dataiku. And so we've been part of making that happen and setting that up, which has been really exciting. We are quite excited about what can happen with Dataiku. So we're looking forward to talking to you more about that over this time. But I'll hand back to you, Robin, and I'll see you guys in a minute.
Thanks, Rowan. Yeah, Dataiku is my latest tech tool obsession. What we'll cover today is a bit of background on advanced analytics before we look at how advanced analytics fits within the landscape and then how Dataiku can help us do advanced analytics.
So that's our intro, but the majority of this session is a hands-on workshop to go through all the Dataiku essentials, from connecting to data and exploring that data through preparing data, training a model, and finally deploying that model. To take part in that, there are two ways. One is to join along with us. And for that, we'll drop some links into the chat. Rowan's already posted the Get Started with Dataiku product link. This is where you can set up an online trial or download and set up the free edition. The online trial is quickest there. And then secondly, there's going to be a link for the dataset. We'll be working with a Scooby-Doo dataset we found on Kaggle. We've collected that on our InterWorks Box, so it's easy for you to download as well. So I'll send that to everyone in the chat here. And I'll open this up myself. If you go through the Box link, you'll get over here and you can click this and download in the top right. You shouldn't need to make an account on Box for this, and this is a CSV that will go to your machine. Now you'll have a little moment to get that set up as we talk about the landscape. The links are in the chat. The other way of following along is simply watching, staying a little more hands-off, but watching this all in progress, and you can always try this out yourself with the recording that you'll be sent in a couple of business days. With that, Rowan, I'll hand over back to you for you to take us through a bit of the landscape. Do you prefer to share your own screen?
Yeah. Perfect. Just taking a while to find all the buttons. Thanks, Robin. So yeah, I'm going to take you through a little bit of an introduction to Dataiku and just have a look at the landscape of it. Hopefully, it won't take too long, about fifteen minutes or so, and then we'll get into the actual demo itself. But I just want to set the scene and show you what we're going to go through and tell you a bit more about Dataiku, because I think we'll have people from a variety of backgrounds here. So first of all, let's talk about a few of the buzzwords that we're going to be using today. Words you've probably heard quite a lot, words like advanced analytics, machine learning, AI, those sorts of things. So this is the Gartner definition here. Advanced analytics is the autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, typically beyond those of traditional business intelligence, to discover deeper insights, make predictions, or generate recommendations. Now, it's maybe helpful to talk a bit about what advanced analytics can do for you rather than to really hash out where it starts and stops. And I think as you go through today's example and stay with us, you'll probably get a better picture of what's going on there anyway. So really, what advanced analytics is about is helping and enhancing decision making and boosting competitive advantage.
So some examples of this would be where we've worked with Nike to use predictive analysis to forecast consumer demand on a hyper-local level, which helps them to optimize their inventory and develop more targeted campaigns; or using regression analysis with car dealers to forecast the price of a car given its mileage, brand, and other variable conditions; or using clustering in retail to help create opportunities for upselling or cross-channel marketing; or predictive maintenance and condition monitoring, which you can use in manufacturing; or finally something like airlines using time series forecasting to identify peak travel times, which can then help them anticipate flight volumes and schedule flights accordingly. So really it's just about taking the data that you have at your disposal and using advanced techniques, which come packaged in with AutoML software like Dataiku, to be able to give you the insight that you need automatically back from that data. So it's not just poring through that data yourself as an analyst and trying to find those insights yourself, but it's allowing the power of the software and techniques and statistical tools at our disposal to be able to generate that back for us automatically and easily. So let's also talk about artificial intelligence and machine learning. AI, or artificial intelligence, is really about a whole constellation of theories, technologies, and research which act like a human or simulate human intelligence using computer algorithms. So that's things like natural language processing, which you've heard quite a lot about: sentiment analysis and so on from customer feedback forms, taking that through and pulling out overall trends. Computer vision, or machine learning itself, which is a subcategory of AI. So machine learning specifically is about improvement and iteration. It's the science of getting computers to learn from experience and perform tasks automatically. There are so many different things that you can do within machine learning. There are a few examples up on the screen for you to read through, such as within medical research, market research, transactions. Machine learning can be used anywhere and everywhere, really. So moving forward, let's look at a little bit of the data landscape. We've come up with this diagram within InterWorks. It's quite a complicated and comprehensive diagram. Don't be too frightened by it. We're not going to go through the whole thing today, and there's a whole lot to it. If you work specifically as a data developer or data engineer or in some sort of technical data role, you're probably familiar with a lot of these things. If you are at a lighter level, more management or business analyst or something like that, this is probably not what you have to concern yourself with too much, but you probably have some ideas around some of these processes. So really, we're just looking from left to right: data sources all the way to the right, becoming an opportunity for insights in the interaction layer, where consumers or managers or staff or whoever needs the data, the data users, can then go and get that insight out. So as you can see, I've highlighted two areas. One within local data prep and the other one down the bottom, machine learning. And that's where Dataiku sits within that landscape.
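(To make the car-price regression example above concrete: here is a minimal sketch of that kind of model in Python with scikit-learn. The data and column names are made up purely for illustration; this is not a Dataiku screen, just the underlying idea.)

```python
# Minimal sketch of the car-price regression example, with made-up data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cars = pd.DataFrame({
    "mileage": [12000, 60000, 90000, 30000, 150000],
    "brand":   ["audi", "ford", "ford", "bmw", "audi"],
    "price":   [28000, 9000, 6500, 24000, 8000],
})

# One-hot encode the categorical brand column; pass mileage through as-is.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("brand", OneHotEncoder(handle_unknown="ignore"), ["brand"])],
        remainder="passthrough",
    )),
    ("reg", LinearRegression()),
])
model.fit(cars[["mileage", "brand"]], cars["price"])

# Forecast the price of an unseen car from its mileage and brand.
print(model.predict(pd.DataFrame({"mileage": [45000], "brand": ["ford"]})))
```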
So we're not going to talk about all the other technologies that we work with or the other processes that are involved in the data landscape from start to finish. We're just going to talk about those two. So you can see machine learning sits across quite a lot of it, across the ELT layer and the analytics layer as well. And it's really about taking that data and enriching the data. So it takes the data in, it does some machine learning with it or something else, some AI, and then it'll need to put that somewhere again for the business insight tools to be able to draw that out of. So whether that's Tableau or Power BI or ThoughtSpot or Qlik, anything that you're using. So it kind of sits across taking the data and giving data back, but giving that enriched data, which is giving you the insight that you really need. So we'll talk quite a lot about machine learning as we go through our example. And the other one to bear in mind, which is a sort of underrated and slightly hidden gem of Dataiku, is its last mile data prep or ETL capabilities. In reality, especially in a world of self-service analytics, we don't want users to have to push requests back to other departments and other parts of the business to get the data they need. We want them to be able to work with the data and put it into shape themselves and get exactly what they need. Last mile ETL is a really big part of a lot of companies' data landscape. So what that really means is it's just taking the data that's been prepared and making the little tweaks that you need to make. So not major, big, enterprise-level data preparation, which is slightly different, but just those little different things. So the analogy that one of our colleagues uses within this, which I think is quite good, is it's not really about preparing the food in the kitchen. It's about, you know, adding some salt or pepper or adding some sauce. If you were sat in a restaurant, you wouldn't sit there and send your food back and say, can you put some ketchup on my plate, please? Or can you add some salt to my chips? You would do that yourself. If you had to send it back to the kitchen, it'd be quite a frustrating kitchen to work in or restaurant to be at if you wanted to make those little changes. But that's the reality quite often in businesses. If they just want a little tweak or a little change, people have to ask the data engineering departments to make those little things, whereas a tool like Dataiku actually enables you to make those little tweaks and changes yourself. And I'll take you through some of that data prep a bit later on. So Dataiku does a lot of things and it does a lot of them well, which is really great. It's not completely an end-to-end tool. We wouldn't say to our customers, just buy Dataiku, you don't need anything else. Dataiku itself is humble enough to admit that it needs other supporting programs, like Snowflake for data storage or Tableau for data visualization, which help it create the best overall landscape. It doesn't try to be all singing, all dancing, but what it does, it does pretty well. So we're going to look at all these comprehensive features that it has and what we like best within that: data preparation, visualization, machine learning, DataOps, MLOps, and analytics apps. Those last three you might not recognize, but we'll touch on them as we go through a little bit later. So starting off with data preparation.
This allows us to connect, cleanse, and prepare data for analytics and machine learning projects. One of the great features of Dataiku, which we'll talk about as we go through, is that it's for coders and non-coders alike: for people who have very good technical knowledge and are very comfortable with Python, R, SQL, all those more technical ways to interact with data, and also for people who want a lighter touch but still want to be able to gain insight from it. The words that Dataiku often uses are clickers and coders. A clicker is someone who doesn't necessarily want to be typing proper code into their computer all day to get what they need; they want to click around to see what they need. And there's a lot that you can do in Dataiku just by clicking, without typing at all on your keyboard, which is one of its great features. And this is true for data preparation as well. There are over twenty-five leading data sources that it connects to, including Amazon S3, Azure Blob Storage, Google Cloud Storage, Snowflake, SQL databases, NoSQL databases, HDFS, and it can be on-premise and in the cloud as well. So within the data prep part of it, after the data connections, there are over ninety built-in data transformers, as it says there. And for the things that aren't already built in, you can create your own with formulas, in similar ways to what you might do in Excel. So there's a lot that you can do within the data prep, and you'll see a little picture of some of that later on. Visualization. What we really like about Dataiku is it's not a black box. You don't just stick your data in at one end, bring out your machine learning at the other end, and hope that some good things have happened in between, where only once you run and generate, you know, your million rows of data can you actually figure out what you did, whether it worked, how many rows you lost, and what happened. Dataiku has really good visualization. As you go through it, right at the start with your data prep, you can run a quick analysis of your columns to look at distribution, top values, outliers, overall statistics, null values, all those sorts of things. You can build charts and graphs as you go and create dashboards within Dataiku. They're not amazing, but they're decent, and they're very good for ad hoc analysis as well. You can do statistical analysis, and it's pretty good within Dataiku: things like univariate and bivariate analysis, and there's a whole raft of things within that that you can do. And then of course, it also integrates out of the box with things like Power BI and Tableau and other BI tools. On the machine learning side, this is the key feature of Dataiku and what most people get it in for, which it does really well. It has some great features in and around it beyond the pure machine learning itself that you would expect. First of all, it is very customizable. It supports a variety of notebooks for code-based experimentation and model development from scratch using Python, R, and Scala, based on Jupyter. But it also has pre-built notebooks and a lot of AutoML features, which we'll come on to in a minute, delivering those models automatically.
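(For the coders following along: inside a Dataiku Python notebook, pulling a project dataset into pandas for that kind of experimentation is only a couple of lines. A minimal sketch, assuming a dataset named SCOOBY_DOO exists in the current project — the name is just our example:)

```python
# Minimal sketch of code-based exploration in a Dataiku Python notebook.
# Assumes a dataset named "SCOOBY_DOO" exists in the current project.
import dataiku

ds = dataiku.Dataset("SCOOBY_DOO")
df = ds.get_dataframe()  # load the dataset into a pandas DataFrame

print(df.shape)   # rows x columns
print(df.dtypes)  # how pandas sees each column
```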
So with that, you can have your best practice techniques with built-in guardrails, which allow business analysts and people who are not data scientists themselves to build and compare multiple production-ready models. And it uses leading algorithms within that and frameworks like scikit-learn and XGBoost to find the best modeling results. So that's a bit about its machine learning, and you'll see that in action in a little while. DataOps. What we really like about the DataOps is you don't just build your machine learning within Dataiku, but you can find ways to productionalize it and to keep a real key eye on what's happening with your data as it goes. So this includes scenarios and triggers for automation. What this means is automatic assessment of flow elements, with checks so that you can see whether the timeframes and expected results are coming in exactly as you want. And warning messages, errors, and fail checks allow you to investigate and resolve things as you need to. It's also very visual. It has a central flow, which shows you the pipeline of your datasets and recipes as well as your machine learning, which you'll see in action in a little while. It integrates with Git, which is really handy as well. And, as I say, it's a central hub for everyone. One of my favorite features of Dataiku is that it can enable really excellent, high-level conversations between multiple business functions. I find that business analysts and senior leaders within a company find it really difficult to have meaningful conversations with data scientists and data engineers, because they just can't understand each other's worlds and they don't really understand where things start and stop. But with Dataiku and bringing things into a central flow, you can really keep an eye on how things are working and where things are going, and people can bring their different expertise and insight into the conversation. On to MLOps, or machine learning ops. I won't go into too much detail about this, but you can do batch scoring with automation nodes. You can monitor, update, and retrain models based on schedules or triggers. You can do real-time scoring with API nodes. So there are a lot of different things you can do within this world of experimentation and then deployment, productionalization, and automation. Dataiku has you covered there. And then finally at the front end, analytics apps, including things like what-if analysis, and Dataiku supports various leading frameworks including Dash, Bokeh, R Shiny, JavaScript, and more. What I really like about Dataiku analytics apps is this. I work a lot in the NHS, which is a very siloed landscape. So what can happen is you can build one set of pipelines and machine learning and a whole raft of things from common data, which is often labeled very similarly in different places. And then you can take that, turn it into an app, and add it into the Dataiku framework. And then someone else can come along from a completely different organization across the other side of the country, maybe studying something similar or slightly different, and they can take that out of the box and plug it into their data, and they've got everything they need already, and then they can amend it as they need to make it bespoke for what they need.
So it just means that not everyone has to be reinventing the wheel, which happens a lot in the NHS and in other similar large organizations. And it just means things can be neatly packaged and put together within that, which is really helpful. So that's it for the landscape that we've gone through. If you have any questions, please do put them down and we can get to those later, or during the presentation one of us can type an answer out if it's appropriate. And just to talk you through what we're going to be doing today, just five simple steps. So we're going to be connecting to data. We're going to be exploring, doing exploratory data analysis. So before we do our actual proper data prep and machine learning, we're just going to have a little look and see around our data so we can get an understanding of it. Then we're going to do our data prep. Once we've prepared our data and it's ready, we're going to create a model using Dataiku. And finally, we're going to score our data at the end of that. So I'm going to hand back over to you, Robin.
Just finding my unmute there. Thank you, Rowan. And let's start with our data connections. So back to the beginning, connecting to data. Let's get our data into Dataiku. And as Rowan mentioned, there are a bunch of connectors possible within Dataiku, including just files on your machine, but also all your cloud storage solutions and databases. So the first exercise that we'll go into is to look at those connection options within Dataiku and then connect to our Scooby-Doo dataset. So let me jump over to our Dataiku instance. Now, if I open up our Dataiku instance, yours might look a little more empty if you're working with a trial, but now's the time to get that open. And you'll see personal items that you've viewed or worked on. You'll see projects, workspaces, applications that were created, project folders, dashboards, wikis. So there's a good organization of the different types of things to do. What we'll go for is this New Project button in the top right. We'll create a blank project. And we can name that Scooby-Doo. And I'm just going to use an underscore and type it in all caps. You see that there's a one behind it, so a Scooby-Doo project already exists somewhere. So I might as well type that one in myself. Once you have clicked New Project and given it a useful name, go ahead and click Create. And here we see the interface for our projects. Now I'm going to go ahead and draw on my screen a little. At the top we have an option to go back to home by clicking the little bird if we would want to. We'll be working a lot with this black bar at the top, especially the little icon here, because that's how we go back to our flow or jump to different parts. We'll be working with that a lot, but what you're seeing on your page is some summaries of your project and who created it. You'll have statistics of what's happening in this project: how many datasets do we have? How many recipes are we using? How many models are working in this workflow? Any notebooks and analyses we've created in our lab, related dashboards, and then things surrounding documentation. We can add a description here so everyone knows what's going on. And we can also have a to-do list specifically on this project. So what still needs to happen, what changes we want to make, and we can tick them off as we go along. On the right, you see a little timeline of what has happened to the project. And we have different tabs for Summary, Activity, and Metrics.
So Summary is the page we're looking at, and if I click over to Activity, you'll see some information on who has contributed when, and even a big punch card for when this was running or being worked on. And finally, Metrics: we can later use this to monitor our project. For now, I'm going to go back to Summary. And what I'll do is use that top bar to go over to our flow. And as you can see, there are keyboard shortcuts behind most of these, so whatever your preferred way of working is, use that. But I'm going to go ahead and click over to our Flow. So once we've clicked over to our flow, we can start by looking at adding a dataset. Up in the top right of our panel, we have the options to add datasets, and you'll see that we can upload our files, which is what we'll be using for our Scooby-Doo dataset. But you'll also see network connections, SQL databases, NoSQL databases, cloud storage, and you can expand the options with plugins. Now some of them may be grayed out depending on whether they have been configured on your instance, but at the very least you should be able to upload your files here. So I'll go ahead and click that. Now you can drag and drop or add a file, and it looks like I already clicked it, which will just browse. And this is where I'll select that Scooby-Doo file and it will start to load. Go ahead and do that. You can also drag it within your browser, whichever you prefer. Now you'll see a storage type here, and something incredible about Dataiku that I really like is we can push things to our database. So we can store things in our database by uploading here, and often when we compute things, we can push the computation to our database as well, so that our database takes care of it and has all the resources, and Dataiku simply orchestrates. Now, I've gone ahead and uploaded my file. I could preview this if I want to, and I'll see a whole lot of information on my different columns. If I drop back to Files and I click Test, we'll simply see that it's accepted; I was looking for a different screen. But you'll see that Dataiku has automatically used the column headers within our file as our column names here as well. I see my data come through correctly here. And at the bottom here, I can see what it has used as separators for our CSV and how quoting was done, and I can customize this if it didn't come through quite correctly. Now if I go ahead to the top right, I can give this a useful name and click Create. I'm simply going to name this Scooby-Doo to shorten it. And then go ahead and create your dataset. That is us, step one complete, connected to our file. Which means we can start exploring. And in exploring our data, of course, we want to understand our data and check if it came in correctly, but also identify what kind of problem we're trying to solve, what kind of target variable we're going to be looking at, which fields we might want to use for that, and any data preparation that we may need to do before we can train the model. So we'll be looking at our different fields, our target variable, and any data prep that we'll want to be doing. So let's take a peek there. In our Dataiku, the first column is an index, simply counting down here. Underneath the name of each column, you'll see two sections. So you'll see string, integer for this first one. Now it's a string for all of these currently. That's because we've uploaded a CSV file, and all of them will be strings.
If we're working with a database, the database will tell Dataiku what kind of column storage was being used. So this is the column storage type of your database. And then below it you'll find Dataiku's interpretation of what data it's seeing. So for example, we see Date Aired, so when a Scooby-Doo episode was on TV, and you'll see this is an unparsed date, meaning we're not quite sure: is it days and then months? Or is it months and then days and then years? So we haven't quite defined this date, but it is a date. And if we look further down the line here, we have Monster Gender on the right, and this comes through as a gender. Now if I scroll further right, you'll see we have seventy-five columns and six hundred fifteen rows, so that's in our top left. And you'll see we have information for each episode. What kind of monster was the mystery gang dealing with? What kind of species was it? How many monsters? So Monster Amount. Which of the team got captured, as true-false fields, and whether any of them got unmasked. We see whether they've eaten a snack or not in the episode. And then we see some more about location, culprits, and different words that were used in the episode. So lots of information, props to the person who collected all of this. And if I go back to—
Robin, can I interrupt you? Sure. We didn't talk about this beforehand, but something just occurred to me for the first time here: it's possible some of our audience may not have watched Scooby-Doo before. Maybe you should contextualize that it's just a cartoon program. It's like four or five. How many friends are there? Do you know, Robin? I don't know. I want to say five or six. A couple. Yeah. Something like that. And there's a dog who's called Scooby-Doo himself, and they go around solving mysteries. And spoiler alert, virtually every episode ends with a mask being pulled off the monster's head to find a small business owner. I think that's usually the formula within that. So if you've not watched it, it's a bit of a classic, but you're probably not missing out on too much. But yeah, this is the dataset built for that TV cartoon program. Great. Thanks, Rowan. Sure.
So what we are looking at in each of these rows is one episode of this TV series, which aired a long time ago, or started airing a long time ago. The reruns are endless. What we'll see here is an IMDb score, so the Internet Movie Database ratings that people have given these episodes, as well as an engagement score. Now, you'll see this little red bar at the top of my column, and let's explore what's happening there. So if I click on these columns, depending on the column type, I'll have different actions that I can do, but for all of them I can analyze. And Dataiku will grab a sample of my data and it will check out the values in this column. You'll see a distribution of the rating of the episodes. And I see here that there are eleven invalid values in here. And if I go Categorical, then I see here those are eleven nulls. So we have episodes without a rating. Now I can tell you that those episodes do have all the information on the monsters, on the mystery gang who are trying to solve it, on Scooby-Doo, on the words used in the episode. So using that information, maybe we can predict this score, this IMDb score. So that's what we'll be doing in our example here. We'll try to fill in the missing IMDb scores based on the attributes of the episode. So if that's what we're looking at, then what are we going to feed into our model?
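(A quick aside before we answer that: if you'd rather sanity-check columns in code, the same Analyze-style look at the data could be sketched in a notebook with pandas. The file name and column names below are our guesses at how the Kaggle CSV labels things, so treat this as illustrative only.)

```python
# Hedged pandas sketch of the column analysis shown above.
# "ScoobyDoo.csv", "imdb", and "monster_amount" are illustrative guesses;
# the file encodes missing values as the literal string "NULL".
import pandas as pd

df = pd.read_csv("ScoobyDoo.csv", na_values="NULL")

print(df.shape)                  # expect roughly 615 rows x 75 columns
print(df["imdb"].isna().sum())   # the eleven episodes without a rating
print(df["imdb"].describe())     # distribution of the ratings
```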
Text is often difficult, so we might want to get rid of some text columns, like the monster name, and the monster type and subtype are quite text heavy. If I scroll right, we also have a culprit name. I believe we also might have a monster name. So those are very text heavy; I don't think those are quite it. Monster Gender could be interesting, but we see that there's something going on with this column as well. So let's go ahead and analyze, and we'll see that here for our invalid data points we have nulls, but we also have kind of arrays in this column, where we have repeated values like female, male. So if there are multiple monsters in an episode, they're still all on the same line. So in this case, we get several genders for the several different monsters. Now that's a bit complex to deal with, so we might want to exclude rows with more than one monster. But let's check out Monster Amount and see how many rows we would exclude. So on the right, I'll go a little more right, you'll see I have a column with Monster Amount. And indeed, when there's more than one monster, we have repeating values in several of these columns. So let's analyze this one. And of course, we could also run this on the whole data. Let me just save and compute. It doesn't allow it. It's stored as a string. My bad. But what we can see is that one and zero have about fifty-eight and fourteen percent of our data, and then we have this long tail of multiple monsters. Cumulatively, we keep seventy-two percent of our rows if we exclude the ones with more than one monster. I think that's good enough. So that's another data prep step that we'll put in. We'll exclude rows that have more than one monster, so only the ones with zero or one monster. So to me, this is really cool. We can scroll through our data, see what the distribution is of our columns, see if all the data that we're looking for is there. For example, if I was expecting five different categories, well, we have five different categories here as well. So I know that my data has come through correctly. This is running on a sample, but depending on your data source, the whole data might be appropriate as well. So I love just pulling a dataset in, exploring what's going on, and seeing what we want to do to it. So we've identified a couple of things. Now Rowan, you might identify some more as you take us through the prep here. But I'll hand it over to you for that, so we can go ahead and prepare our dataset before we predict on it.
Great. Thanks, Robin. That sounds good. Let me steal the share again. Okay, there we go. That should be fine. Okay, so yep, we're going to be looking now. We've identified what we need to do, talked a bit about what our data is like and what shape it's in, and let's now look at how we can prepare our datasets so that we can do that work on it. So we talked a little bit about data prep already. And what's really nice, which we're going to emphasize here, is we can do all of this data work collaboratively. We can do it very visually as well. So Robin's built that flow so far in that project, and I'm going to go and look into that and take it on from there. There are a few things we're not going to look at today. I just wanted to mention those here, and included in that is geospatial data preparation. So Dataiku provides built-in geospatial transformation functions for working with geospatial data. So that's just, you know, longitude and latitude data and shapefiles and things like that. And so we can extract that from geopoint data and vice versa.
We can use geo-IP location to resolve location data like country, region, state, city, and postal code from an IP address. And we can also connect datasets using geographical coordinates. So there are lots of really cool things we can do with geographies, which is slightly outside the scope of this dataset, but I just wanted to call that out. So let's look at what we're going to try and get done today within our data preparation, our next exercise. So we looked at Date Aired, and we realized that we're not a hundred percent sure how that's come through, because it's not been parsed yet. So we're going to look at how we can parse dates to ensure that they're coming through exactly the way we want and control that. We're going to look at making sure that we just keep the rows with monster amounts of zero or one. We're going to remove rows where the monster type is null. We're going to remove any unwanted columns that we don't need, just to slim down our dataset a little bit. We're going to standardize some of the monster attributes, and then we're going to change a couple of the data types as well. So let's go across to our project then. So here we have it. This is the project we're looking at already. We're in the flow now. So if we go back to here, we can see our whole list of projects. You can see here Scooby-Doo. Once we go into that, you can go here to the Summary page as well. I've put a bit of a heading on here, but you can see there's a dataset in already that Robin has put in. So the real heartbeat of our work here is the flow. This little icon here looks a bit like the share icon, but it's the flow, and it doesn't look too pretty at the moment. It's just our Scooby-Doo thing. But what we're going to do is click on our dataset and have a look at some of the options we have within that. So within our flow, there are a lot of different things that we can do. There are plugin recipes. There are code recipes we can build in. We can look at exporting and publishing and sharing and all these different options at the top. Lots and lots of different visual recipes here: joins, splits, grouping, filtering, pivoting, sorting, stacking, lots and lots of different options. But we're just going to be looking at a Prepare recipe within that. So I've just clicked on my dataset. My little Zoom button's in the way there. This little arrow, just clicking that out so we can see the side panel that's going on, and we're going to click on Prepare. So what we're going to do here is just decide the name. I'm happy with Scooby-Doo Prepared. Happy with all these default options. Let me create the recipe. And it's going to take us into a separate page. It's going to show me my dataset again, as you can see. And there are lots of different areas and ways in which we can do data preparation. First of all, we can just click straight here on the left and click Add a New Step, or we can do a whole bunch of different things actually just by playing around in the columns over here. So the first one we're going to do is find our Date Aired. So let's go and find that somewhere down here. It's quite a wide dataset, so just stop me if I go past it, Robin.
I feel like I may have gone past it. I feel like you may have gone past it. Thank you. By the time I get there. No. It's right at the beginning. That's why.
So here it is, Date Aired, and it's telling me it's unparsed. It's giving me an idea of what to do.
So what I can do here is just click straight onto the column itself, and it gives me some options of what I might do with that, and included in that is to parse the date. So I'm going to go through on this way, and it's going to bring up this option and make some suggestions. So it's asking me, when it says 04-28-2013, which is an example from my dataset, what do I mean? And it's saying maybe you mean month month, day day, year year year year. Or perhaps you mean day day, month month, and then the years after that. So it's probably something you've come across in your work with data before, because it's a very common thing. I'm sitting here in the UK, in the Peak District in a village called Tideswell, but a lot of my colleagues are in America and a lot of the work I do is with them. And, you know, Americans always have this weird and funny way of doing their dates the wrong way around. Right? So you never know what you're going to get. But it tells me here on the right-hand side how many of my dates will parse successfully. So it's pretty clear it's the first one, but maybe I know better. Maybe I know the data well and actually it's the second one, and the rest are all just incorrectly entered or something like that. But I'm going to take the first one here and click Use This Format. Now, if I change my output column here to make it exactly the same as the input column, so I just say Date Aired, then what's going to happen is that it's going to actually overwrite the old one, which can be quite handy. So that's a little hint for you. If you rename the output column to the same as the input column, it automatically replaces the input column, if that's what you want to do. Otherwise, if you name it something else, you can just go back and delete the old column. But there we go. We've got our Date Aired and it's all parsed through now. So let's go and look at our next step, which I think was to do with keeping and removing rows. We're going to do this using the Add New Step button in this menu here. So we can add a new step, and we're going to do a filter step. So if I click here on Filter, it takes me through a whole bunch of different options. This is what I'm looking for. In this case, I'm going to filter rows on a value. And I think this is the, let me just check my notes here. I think this has to do with the monster rows with just zero and one. So if we look for our column, I think this is Monster Number. Believe it may be Amount. Monster Amount. That's it. Thank you. So Monster Amount. And we want to only keep the rows that have the value of zero or the value of one. So just click the little Add Value button there. And there I have my action, and the match mode and normalization we can leave as they are. So you can see here it's also telling you what's happening. I'm losing five hundred and twenty-eight rows where it's not zero or one. So that's my next step there. And you can see what's happening within that. If I find Monster Amount, let's close that off so I can scroll across. You can see here one, one, one, one, everything's fine there. And then there's a three. So it's just deleting off those rows that we are not interested in, because we're just interested in the rows where we have monster amounts of zero or one. Okay. And then let's add another step. In this one, we want to look at removing any rows where the monster type is null.
So let's go in here to Monster Type. And I think we can do it straight from here. Yeah. We can filter. Actually, no. I don't know if I want that step that way. Let's leave that. Let's do it from the new step here. So again, Filter. This time, we're going to filter rows on a value again, and we're going for Monster Type. Start typing it in, and it gives me the list there, Monster Type. And then I want to say remove the matching rows this time that have a value of null in there. So that is getting rid of all the rows where the monster type is null. But it doesn't seem to be removing any of them. So what if I've mistyped null? Do I have to do that in all caps? Potentially. Or maybe they've already been removed because they're in there with the zero-one on the monster type. I think if I'd done that the other way around, we'd be losing some of those and then we'd lose some more from that. So I think that's a bit of a redundant step. Let's just go check. So if we go here to Monster Type and look. It does look like it's removed somehow. Now that it's—It doesn't. Oh, there we go. I just hadn't saved it. So yeah. So I've removed eighty-seven, and you can see here that it's not bringing up the nulls anymore. Okay. So I've got through that. That's good. So that's the Monster Amount and Monster Type. And then I think the next thing we wanted to do was to remove some of our unwanted columns. So there are a few columns that we don't want to use in the model, such as names. They're not standardized. So let's get rid of them. So let's go in here, Add a New Step, and let's say we want to filter. We want to get rid of columns in this case. I'm going to delete, or keep, some of our columns. So I think these are things like the number of snacks, I think was one of them. So we can get rid of—trying to remember. Robin, can you remember which columns we got rid of? I haven't actually got it down in my notes here. Definitely the name columns. I think a lot of the text columns are—Oh, yes. Thank you. So it's a name, something like Series Name, Monster Name, Culprit Name. Yeah. Yeah. There we go. So let's get rid of that. Remove that. And then what we can do is go through the whole step, the whole thing, again. So you add a new step to get rid of more columns. Or, what we can do as well, which is quite neat within Dataiku, is just duplicate a step. So if we duplicate this step where we're getting rid of Series Name, we can just duplicate that and get rid of Monster Name, and then duplicate that as well and get rid of the Culprit Name. So there we go, getting rid of a few of those columns. I'm just looking at the time here. We can also group those into a single step. I think I'm just going to skip that for the sake of time at the moment, but you can put multiple steps within a recipe into a single grouping, which you can then expand and minimize, which can be quite a helpful feature. But let's just do the last bit of the data prep, which is to standardize some of the monster attributes and change one of our data types to a double, which I think was our, was it our IMDb score we want to change to a double? Yep. As well as our engagement. And our engagement score. So here we have it set as a string, but we can change it here to a double, which I think if we look at it now should actually—I was curious about this. I wonder if this allows us to do the analysis. What were you trying to do earlier, Robin, to look at another sample? The full dataset.
Not the sample, but the full dataset. But I believe you are currently preparing data. Okay. Okay, that's good. Okay. But I think if we saved it now as a double, went back into that old menu, and then went and looked through it, it probably would allow us to do that. And then the engagement one, doing the same again, changing that to a double as well. So that is all the steps that we need to do for our data prep. We can save it up here, or we can run down here. I just want to show it to you in two separate steps. So what I'm going to do is save. And if we go back to our flow now, you can see this icon here with the dotted lines, which means that we've built the Prepare recipe. We've got Scooby-Doo Prepared, but it's kind of empty because we haven't run it yet. So what we can do is go back to our Prepare recipe and run it from here, or we can go back into our Prepare recipe and run the recipe. So everything we've been doing, all the transformation changes you've seen, has just been on the sample. But if we run it, it'll go through our whole dataset. Now with this particular Scooby-Doo CSV, we don't have that many rows. I think it was six hundred and fifteen or so when we started. It wasn't a huge amount at all. But when you work in the real world, often you're looking at millions or hundreds of millions of rows. And so this sampling that Dataiku does automatically, which allows quick dev work and visual interaction with what's happening, separate from the actual running of it, is really, really useful. So if we go back to our flow now, you can see a different icon here. Our Scooby Prepared dataset is looking a little bit different now. So let's jump back to our PowerPoint to talk about our next little piece, which is to split our dataset up. So sometimes in machine learning you might have two separate datasets already prepared for you by whatever team you're working with. Quite often, you just have a single dataset, and you then have to split it: run a model on one as a training set to train the model, and then use a second one to test it or to score it, to see how you've been doing. In this case, we can neatly divide between the ones that have IMDb scores and the ones that don't, which is what we're going to do. But it's very common to split things. Sometimes you just split them randomly into two different sets, for training and for scoring. So that's what we're going to do now. So let's jump back across to our dataset. We're going to click on it here, and we're going to choose the Split recipe, which we're going to choose over here. So we're just going to be using our single dataset Scooby-Doo Prepared, and we're going to change it into two different datasets: a Scooby-Doo Labeled and a Scooby-Doo that is not labeled. So let's put the first one down here, Scooby-Doo. And we're going to call this—we should probably keep the Prepared suffix as well, just to make things simpler. Prepared. Labeled. Okay. And let's create the dataset. And let's add another one, which we're going to call our Scooby-Doo Prepared Unlabeled. And create that. Okay. And let's go ahead and create a recipe. So we've got a few different choices as to how we're going to split. In this case, we're just going to look at a single column, and we're going to split on that column. So the column we're going to choose here is our IMDb score. Okay. And we want to put everything that is from zero through to ten into our Labeled set.
And then I believe you need to set it to Range, Robin. Oh, sorry. Thank you.
There we go. Ranges. Yep. So from zero, thank you for that, Robin, through to ten, and that will go into our Labeled set, and then all the remaining values will go into our Unlabeled set. Another way we could have done that was to use null, to say those that are null go into our Unlabeled set and then the remaining go into our Labeled set; either way, whichever way we go around with this. So that should be fine, and we can run this now. And it just tells us at the bottom that the job is running, letting us know down here. It shouldn't take very long to do this. And once it's complete, it tells us that our job has succeeded, and we can go back and look at our flow and you can see what's happened. It's split my data into a Labeled set and an Unlabeled set. If I go in and look at my Labeled set, find my IMDb score, and click to analyze it, I shouldn't be finding any nulls, and you can see that here. Conversely, if I go into my Unlabeled set — there aren't that many that were unlabeled, I think it's just a few — you can see it for yourself right here: it's all nulls throughout my IMDb score. So I'm all complete on preparing my data. I've done all the preparation I need, cleaned up all the columns and rows that weren't in the structure that I wanted. And I have prepared my two datasets, one that's ready for training a machine learning model and one that is ready for some scoring and testing afterwards. So we're still doing okay for time. We're a little bit over, but we're still doing alright. I'm going to hand back to Robin to talk us through a bit of machine learning for our last section.
Excellent. Now that we have the two datasets, let's go and create a model. It sounds easy, and Dataiku gives us a whole range of options, from clicking auto machine learning and clicking run to coding your own complex model. When we're looking at creating a model, Dataiku has a couple of useful functionalities I want you to know about. So some automatic feature engineering takes place: missing values are filled in, non-numeric data is encoded. We have options for AutoML that use leading algorithms and frameworks like scikit-learn and XGBoost to find the best modeling results. I do believe Rowan mentioned this before as well. And that allows you to kind of let Dataiku take the wheel and build something for you. However, there's also a variety of notebooks supported for code-based experimentation and model development. You can still use Python, R, or Scala based on Jupyter. So whether you're a coder or a clicker, or however we want to say it, you can build your models with Dataiku or on Dataiku, depending on your preferences. Which is what we'll do, and of course we'll go for the simple version in this workshop. We'll create an analysis on our labeled dataset, and we'll look at our model, some features of our model, and how to interpret this as well. So if I jump over and make sure that my flow is all up to date with what Rowan was doing, I have my Labeled and Unlabeled datasets. I'm going to go ahead and click on my Labeled dataset. And this is where we go to the Lab. And in the Lab, in the top right, the big button, we can start doing analyses. And you'll see different options here. The very generic option is to create a new analysis, which we're going to name. Now I'm not sure what we're calling this, to be honest, but I might just go with Analyze Scooby-Doo. And we can go ahead and create this here, where we'll see our data again.
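(A quick recap for anyone following along in code rather than clicks: the Prepare and Split recipes we just built visually amount to something like the following pandas sketch. The file and column names are illustrative guesses at the CSV's labels, not Dataiku's generated code.)

```python
# Rough pandas equivalent of the visual Prepare and Split steps above.
# File and column names are illustrative guesses at the CSV's labels.
import pandas as pd

df = pd.read_csv("ScoobyDoo.csv", na_values="NULL")

# Parse the date, month-day-year as chosen in the Prepare recipe.
df["date_aired"] = pd.to_datetime(df["date_aired"], format="%m-%d-%Y")

# Keep rows with zero or one monster; drop rows with no monster type.
df = df[df["monster_amount"].isin([0, 1])]
df = df[df["monster_type"].notna()]

# Remove the free-text name columns and cast the scores to doubles.
df = df.drop(columns=["series_name", "monster_name", "culprit_name"])
df[["imdb", "engagement"]] = df[["imdb", "engagement"]].astype(float)

# Split on the IMDb score: rated episodes vs. the ones we want to predict.
labeled = df[df["imdb"].notna()]
unlabeled = df[df["imdb"].isna()]
```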
And in the top, we can start adding a model. So in the top right, Models, we can create our model here. Are we clustering? Are we predicting? In this case, we are predicting. What are we predicting? So you'll see that Dataiku guides us through what we need to build our model. And we're trying to predict that IMDb score. So I'm going to select IMDb. And you'll see that within AutoML, you can get things like decision trees, which are easy to interpret, or models that train a little longer but tend to have more accurate results. We're going for some quick prototypes that we're running in memory. Again, we have options to push this to our database. And for more control, we have options below for deep learning, selecting algorithms, and really refining what we want to do here. So I'll click Quick Prototypes and Create. Click Train, and what it'll do is run two different models here based on the type of data we're working with, which we're comparing on the goodness-of-fit score here, R2 in this case. I get this as a fit measure. And on this dataset we see that our Random Forest model has performed best. Now if I go ahead and click that model, I get some information on the model itself that has run. And this is where we can go through and look at the importance of different variables, for example. So if we go to Variables Importance, you see that our most important predictor here is that the format is TV series. So we also had a Scooby-Doo movie in there, and I think some other one-off formats in there as well. But TV series is a good predictor. Then the engagement score, when it was aired in year cycles, and the network being Cartoon Network. So there's some explanation here of what is useful to predict that IMDb score. I'm going to go ahead and go to our Subpopulation Analysis. Now, we're working with TV episodes. It's not all that sensitive. But when we are using machine learning algorithms on real-world data that impacts people's lives, we want to make sure that our datasets are representative and that our models perform similarly for different groups, for example. So Dataiku helps us analyze the subpopulations in our dataset. Did we have representative data? For example, if I go for Gender, go for our Monster Gender, and click Compute, it's going to check our data source and how it performs for the different monster genders. And you'll see that it does not perform the same. So we have a big split in how many rows we have. And our model doesn't fit very well when we have female monsters, which isn't a surprise, because we didn't have that many rows with female monsters. But this is somewhere you'd need to adjust your model. Really keep in mind that you're still responsible for and in charge of this model: keep things fair, and review things like this. So I wanted you to know that this is over here as well. If I go down towards the model and look at the different features that were used, you'll see that Monster Amount was rejected. Our zero-one column — our model said that wasn't very useful. Our index, also not useful in our case, but it's taken everything else. There's a page two. It's also rejected the title of episodes. So that was a piece of text that either was difficult to work with or not valuable to our model, so it was dropped as a feature as well. So this will give you some information on what the model actually worked with. And then the last section that I want to show you is the Training Information here. And here we get some sanity checks from Dataiku already.
Hey, your training set and test set are both pretty small. We know our dataset itself is pretty small. We get some information on the partitions for training and validation. So we had an eighty percent train ratio, split randomly. So we can see our training set had about two hundred eighty rows and our test set had about seventy. And in the top right, you'll see information on how long the different steps took for training our model. With our dataset, that's not super interesting. But as soon as you train bigger models, this may be where you do your performance tuning, or at least look at the statistics that you're working with. So that's a quick peek at what goes on with these models. But as you can see, there's a lot more information on the interpretation, the model, and the performance present here. I'm going to go ahead and click Deploy for us. Predict IMDb Regression — that's fine by me. I'm going to go ahead and create this. And it jumps me back to our flow, and you see we now have a little training bubble and a diamond for our model. So this dataset is training our model. Which brings us to our last step and last exercise, which is scoring those rows that didn't have an IMDb score. We can do that directly in our pipeline; we see it within our flow. But of course, we could also deploy this to production and look into automating batch scoring, or even real-time scoring with an API, so that we can set it up and kind of forget about it, monitor it from a distance, and it runs automatically. That's beyond the scope of this introduction webinar, but it is possible. For now we're going to put this right into our flow. So we're deploying it to our pipeline and we'll score our dataset. The way we can do that is by clicking on our diamond, and then on the right you can see Score. So I can apply the model on data to predict. And as an input, the one that I want to score is this Unlabeled data source here. The model is that Predict IMDb model we just trained. And then I'm simply going to name this Scooby-Doo Scored. So I'm going to take out the Prepared Unlabeled here. Now you'll see that we've secretly been storing this in someone's S3 bucket. But you can always choose where to store this among your connected servers and what format this is stored as. So at any point in time, you can move your dataset to your database or from your database and push the compute down, which means your Dataiku instance can be a lot lighter, and you can really utilize the power of your databases. Now, simply clicking the diamond, then Score, and then putting in that Unlabeled set should do the trick. Go ahead and create that recipe. I'm going to turn on Explanations as well, but you don't have to. And we can go ahead and run all this. So in the bottom left, we have our Run. Now give this a moment to complete. And we can go straight into exploring our dataset at the bottom here. We could also jump back to our flow, but for now, I'll click the dataset right away. Don't be scared when IMDb and Engagement are empty here. Your predictions will go all the way at the end. So if I scroll all the way over, you'll see here we have our predictions for the IMDb score, and we get a column of explanations as well. So the top three most important features that made this row a seven point eight are over here, and it's the Date Aired, the Suspects Amount, and the Format, in order of importance. And what really brought us to this prediction varies for the different rows. So there's some information on why the model thought this was the number as well.
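(For comparison, the kind of model AutoML landed on here could be sketched by hand with scikit-learn along these lines. This is a simplified illustration, not Dataiku's exact pipeline: it assumes the labeled and unlabeled frames from the earlier sketch and only uses the numeric columns, with naive missing-value handling.)

```python
# Hand-rolled sketch of roughly what the AutoML Random Forest is doing.
# Assumes the `labeled` / `unlabeled` DataFrames from the earlier sketch.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Keep only numeric feature columns; everything except the target.
numeric = labeled.select_dtypes("number").columns
features = [c for c in numeric if c != "imdb"]

X = labeled[features].fillna(0)  # naive missing-value handling for the sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, labeled["imdb"], train_size=0.8, random_state=42
)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# R2 on the held-out twenty percent, as on Dataiku's result screen.
print(r2_score(y_test, model.predict(X_test)))

# Roughly what the Variables Importance tab shows.
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda pair: -pair[1])[:5]:
    print(name, round(imp, 3))

# Score the unrated episodes, as the Score recipe does.
predictions = model.predict(unlabeled[features].fillna(0))
```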
So we have our prediction, and we could join our data sources back together, but for now this is where we'll leave it. If I jump back to our flow, you'll see our final scored set over here, and a little champion's trophy because we made it. So end to end, we've uploaded our file, cleaned it up, and split it into Labeled and Unlabeled. We used our Labeled set to train our model, and then we used that model to score our Unlabeled set. We have now predicted the IMDb scores of those last couple of episodes that weren't rated. I'm curious to check that against the actual ratings; our fit wasn't amazing, so I'm curious what other factors could influence it. But this is where we'll keep it for now, which means we have a good couple of minutes left for Q&A. Please feel free to drop your questions in the Q&A or in the chat; at this point, we can definitely see both.

I see one question regarding the grouping of steps in the Prepare recipe, so I'll go ahead and show you that. In our Prepare recipe, we had removed a couple of columns, and we can group those steps together. We could have removed multiple columns in one step in this case, but also when you have various steps of the same category, you can select them, color them if you want to, or group them together; I'll call this group Removing Columns. Now they're neatly summed up here in this group: three steps removing columns, with a count of the rows that got adjusted. When we click it, it unfolds into the different steps. So this really helps with documentation and keeping things clean in your Prepare recipe. We can use the little View Impact to go back to different steps; I love that as well. And there are options to leave comments and write about anything, so you can really build in the documentation, coloring, and grouping. They give you lots of tools to keep things organized.

Alright, Rowan, do you have anything popping up? No, nothing popping up on my side. I think we can probably wrap it up there. And I see one more come in: if you wanted to export this dataset to use it in a visualization tool, how would you do that? I'll answer that live here. There are different ways, but generally, you won't be exporting this dataset. In our case, we saved it straight into a data warehouse, and we would connect our visualization tool to that data warehouse. So Dataiku has pushed it back; for example, you can push your table back to Snowflake and go from there. Now, I do believe there are options to export this data directly, but that's generally just changing the file type and downloading it. Yeah, I'd just say that typically, within most companies, if you think about it, you've pulled the data from somewhere, so wherever that data has come from is probably an adequate place to put it back, more or less. It depends on the infrastructure of your particular company. If you remember that diagram we had at the beginning of the overall data landscape, the machine learning part sits sort of at the bottom, across all of it. So data comes out, gets transformed through the machine learning, and then gets put back into where it was before, and your BI tool would point to that. Now, you could download that as an Excel file from here, upload it, store it on your machine, and connect Power BI or Tableau directly to that storage spot. But more likely, you're going to be working with some sort of database.
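To make that warehouse push-back concrete outside of Dataiku, here is a minimal sketch using the Snowflake Python connector. In Dataiku itself, a managed Snowflake connection handles this for you; all connection details and names below are placeholders.

```python
# Minimal sketch of pushing the scored table to Snowflake for the BI
# hand-off discussed above. All connection details are placeholders;
# in Dataiku, a managed connection does this without hand-written code.
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

scored = pd.read_csv("scoobydoo_scored.csv")

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
)

# Create the table if needed and load the rows; Tableau or Power BI
# can then point straight at SCOOBYDOO_SCORED in the warehouse.
write_pandas(conn, scored, "SCOOBYDOO_SCORED", auto_create_table=True)
conn.close()
```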
So wherever you would normally connect to, you just push the data back into that same spot, and then you connect to it there. I don't know if that answers the question; I see a thank you there, so thanks, Arati. All right, well, don't hesitate to reach out to us if there is anything else. We're happy to answer more questions and go deeper into lots of these topics; reach out to our team for any of that. The replay of this webinar will be sent to you within a couple of days, and I believe there's already a version available on our blog as well. And I believe that's it. Anything from you, Rowan? No, that's it. Thanks very much for listening to us and giving us your time; we hope it was useful. And if you need anything from us in terms of Dataiku or any of the other tools or software we talked about, please do reach out. We're happy to talk to you. Thank you so much, guys. Have a good one.

In this webinar, Robin Bergmans and Rowan Bradnam guided attendees through Dataiku’s core capabilities for advanced analytics and machine learning. The session explored Dataiku’s comprehensive features including data preparation, visualization, AutoML, DataOps, MLOps, and analytics apps. Using a hands-on approach with a Scooby-Doo dataset from Kaggle, the presenters demonstrated end-to-end workflows from data connection through exploratory analysis, data preparation with visual recipes, model training, and deployment. The workshop showcased Dataiku’s accessibility for both coders and non-coders, emphasized its collaborative flow interface, and highlighted model interpretability features including subpopulation analysis. Robin and Rowan positioned Dataiku as a versatile platform for last-mile data prep and machine learning within modern analytics landscapes.
