Go! Dataiku

Transcript
Welcome to our webinar, Go! Dataiku. We will be starting in about five minutes. In the meantime, if you need a coffee or a tea, there's time. Hurry up. We have some of our products showcased on the right-hand side. We have Assist, Curator, and ServerCare. Go ahead and check out those links to get more information, but we will start shortly, in about five minutes. Thank you. Thank you all for joining. We will start five minutes past the hour. I've popped links into the chat. We have three amazing products that InterWorks brings to you. We have Assist, which is an on-demand support platform where you can access our experts. We have Curator, bringing everything into one place: dashboards, data, all of that. Check out Curator by InterWorks. And lastly, do you need someone to help you manage your server? ServerCare covers server management, server infrastructure, and everything related to your Tableau server. Check out ServerCare by InterWorks as well. We will start shortly, in about two minutes. Thank you all for joining today. Again, grab a tea or coffee. This would be the perfect time. We still have two more minutes to go. Thank you. I can see a raised hand already. If you have any questions, just pop them into the chat. Chat is at the bottom of your Zoom controls. Just use the chat function if you have any questions or queries at all, and we'll be able to take it from there. I also want to take this opportunity to distinguish between two important functions in Zoom. You have the chat option, and you'll also have another option called Q&A. If you have questions related to Dataiku or anything specific and technical, pop those into the Q&A so we can actually answer them. I'm able to help our presenter catch those questions and give you the answers. But if you are trying to share some general commentary about how you're feeling today or anything very generic, pop that into the chat, because that is for your general commentary. Two options: you have Q&A for questions and chat for your general comments and exclamations. I think we are at time now, so let's get started. Perfect. Awesome. Next slide, please. Perfect. Before we jump into today's content, I want to take a few minutes to introduce InterWorks. Maybe this is your first webinar with us, or maybe you are returning. Either way, welcome. I am super glad to see you all here. You might be wondering who InterWorks is. We do a lot of things, and sometimes it's really hard to explain, but to put it simply, we specialize in data strategy. If you work in analytics, you know the challenges of an ever-changing tech landscape and the pressure of keeping up with the high demand for insights needed to drive change within an organization. This is where we come in. Our specialty is building the best data strategies alongside you and being your trusted advisor when you need it. Further, everything we do is backed by our people. We're constantly learning, and we always want to share our learnings with all of you. Next slide, please. Beyond our mission and our people, we can also help you navigate the right tools to align with your goals. Some of our partners you'll see on the screen right now. If you're looking for more resources on data analytics or any of the technologies we discuss today, be sure to visit the InterWorks blog. It's world famous, and it's a great knowledge base for anyone in your organization who might be working with data. Some reminders for today: we hold these webinars every single month. 
We value your feedback because we are constantly trying to make sure that we are putting out the best content and delivery based on the needs of our user and customer communities. As mentioned before, today's webinar will be recorded, and in a few days we'll send out an email with access to the replay. This will only be available to the people who have registered. If you want to access previous monthly InterWorks webinars, you will find that catalog on our website. Finally, one request throughout today's presentation, and I mentioned this at the beginning as well: we will take questions using that Q&A function, which is at the bottom of your Zoom controls. We'll mostly take these questions towards the end of the session because there's a lot to cover, and you can make use of the chat function for any sort of general commentary. Again, I repeat: use the chat function for general commentary, use the Q&A if you have questions. We will be recording this session, and the replay will be sent to all the people who have registered. Next slide, please. I think it's time to meet our presenter for today. My name is Carol, and I am an analytics consultant based out of Melbourne. I will be your emcee for today. Our presenter is Azucena Coronel. She's a data architect with InterWorks. I will let Azucena introduce herself. Go for it. Hi everyone. Thanks for joining today. I'm Azucena, and I'm based in Sydney. I am a data architect at InterWorks. I have been here almost three years, a lot of fun, good stuff, good stories, and it's great that you are joining us for this webinar today. Awesome. Let's get into what we will be covering today. There will be a little introduction into what advanced analytics is, to set the landscape for you, and what exactly Dataiku is. Then we get into the fun stuff, where we look into data connections, explore, do a little data analysis and prep, and train the ML model. It'll be really intense as well, so it's really important that we are all tuned in. Lastly, we'll score our data as well. When we go to the next slide, we'll see what we require for today's session. Even if you get the replay, we do need to make sure that everyone has the trial version of Dataiku. I will pop that link into the chat right now. You can go get a trial version. The other thing is that we will be making use of a dataset called the Scooby-Doo dataset, which is available on Kaggle. If you haven't downloaded that material yet, it will be available, and I will just pop that link into the chat as well. Two things you need today: you'll need an online trial version of Dataiku, or the free edition up and running on your machine. And the last one is the dataset we'll be using today to go through the Dataiku essentials: the Scooby-Doo dataset available on Kaggle. I've popped in both links. Use them if you'd like to follow along with today's session. Over to you, Azu, without any further delay. Thanks very much, Carol. Let's get started. For that, first of all, let's set the landscape a little, and let's talk about what advanced analytics is and why it is important. Advanced analytics is the autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, typically beyond those of traditional business intelligence, to discover deeper insights, make predictions, or generate recommendations. Why is it important? It helps decision making and boosts competitive advantage. 
Advanced analytics can help companies enhance both, and using advanced analytics can be the difference between keeping up with the competition or falling further behind, so it is really important. Here we can see on the screen a few logos, and I will explain examples of how these companies are using advanced analytics. For example, Nike is using predictive analysis to forecast consumer demand on a hyper-local level. This helps them optimize their inventory and develop more targeted campaigns. Car dealers, for example, use regression analysis to forecast the price of a used car, given its mileage, brand, or other variables and conditions. That's another very typical regression example of advanced analytics. The airlines, for example, have a lot of data flowing through their systems, and they use time series forecasting models to identify peak travel times, anticipate flight volumes, and be able to schedule flights accordingly. Very important. No one likes to get stuck in the airport, and there are a lot of variables at play. Time series forecasting is a very good advanced analytics method for this. Another good example is retail. Retail uses a lot of clustering to drive upselling and cross-selling marketing. You may have been browsing, for example, The Iconic or maybe even Amazon, and you can see those suggestions made to you. Very typical example: cross-selling, upselling. I like that dress. Yeah, actually it goes well with these shoes. Okay, let's get them. Why not? Last but not least, manufacturing. Manufacturing, for example, uses advanced analytics for predictive maintenance. We all know that it is more expensive to have the line stopped and do maintenance as an emergency than to predict it, plan for it, and keep our manufacturing running correctly. That's predictive maintenance and condition monitoring. Let's make a little distinction here between artificial intelligence and machine learning. Artificial intelligence refers to the constellation of theories, technologies, and research that surrounds the simulation of human intelligence processes by computer algorithms. For example, computer vision. Nowadays, we are hearing a lot about computer vision being used in, for example, medical scans to detect tumors or anomalies. That is a very good use of computer vision. Natural language processing, for example: a good example could be chatbots, or even spam filtering, which detects words that mark an email as spam or not. Finally, we have machine learning, and this is the science of getting computers to learn from experience and perform tasks automatically. How or why would we want to do that? Well, to uncover patterns in market research. Remember, we have a lot of data, and what is better than computers to go through all of that in a more efficient manner? We can also use it to flag errors in different transactions. We can use it to personalize shopping experiences based on browsing history, and also to signal anomalies in medical research, as we just discussed with computer vision. In other words, machine learning typically works by combining large amounts of data with fast, iterative processing and intelligent algorithms, allowing the software to learn automatically from patterns or features in the data. That's where we get our power for machine learning. 
I want to take a moment here to briefly talk about the current tool landscape that we use here at InterWorks. This is our reference architecture and the tools that we generally use in each of the layers. We have partnerships with best-in-breed technology partners. On the left-hand side here, you will see all the different sources of data that you can imagine. We can have APIs, different types of databases, file-based structures, unstructured data, sensor data, Internet of Things data. All the data is lying here. Then, for what we do with that, we have here in the middle all our data warehousing and our ELT tools. We want to put it in Snowflake, for example, for the data warehouse, to have it all in one single application, in one single place, so that we can use it in a clean and organized manner. We have our partners here for extraction: Fivetran and Matillion. We have here the integration layer, which can be in any of the clouds: AWS, Azure, or GCP. Loading is the same as extraction: Fivetran and Matillion. Finally, here we have Snowflake, where we build the data warehouse. Once our data is clean and ready to use, we have on the right-hand side the analytics platform. We have a bunch of data, it's ready to use, and we have some technology partners to go and make the best use of it and communicate within the organization. We have Tableau and ThoughtSpot here, and today we are going to talk about Dataiku. On top of being very good with machine learning, it's also very good for local prep and analysis. For your data prep, that can be cleansing. It can help with adding formulas and business rules, new measures, derived fields, renaming columns, all the small finishing touches at the end of data preparation. It can also help you with enriching with sources not stored in your data warehouse. For example, if you have all your sales data ready but you want to add a forecast that is very specific to you, why not? You can use Dataiku to consume that, join it as appropriate, and then use it in a dashboard or in your prediction models. It also helps, for example, with light reshaping of data: joining, pivoting, filtering, removing unrequired columns and rows, etcetera. We have Dataiku here as local prep and analysis. Finally, we have here the area of machine learning and artificial intelligence. Again, best in class: Dataiku can build not just the AutoML models that we are going to talk about later in this presentation, but it also has great capability if you have data scientists who are really into coding and have already developed their models in Python or R. You can use those in Dataiku as well, in a single platform, to have them all readily available. Let's get rid of these drawings and talk a little more. We set out in that reference architecture where Dataiku fits; let's talk a little more deeply about all the capabilities it has. Dataiku is a one-stop solution for design, deployment, and management of all your artificial intelligence applications. Here we can see the six top capabilities. First, data preparation. We talked a little about that. It has over ninety different built-in data processors to help you with tasks such as binning, concatenation, date conversions, etcetera. It's ready there, drag and drop; you connect your data and use them. When a processor is not available, you also have a formula language that is very similar to Excel. 
In that way, it's accessible at any level of knowledge; it's very easy to just use Dataiku for data preparation. We also have the capability of visualization. Dataiku's visualization capabilities help you accelerate your exploratory data analysis, where you can create quick visual analyses of columns, including the distribution of values, top values, outliers, invalid values, and overall statistics. It's also very important to be fast in an exploratory phase, and Dataiku helps us with this. What about showing your work and explaining your data well? Dataiku also has charts and graphs that make it easy to use visualizations to accomplish this. Out-of-the-box dashboarding helps you here. On top of this, Dataiku also provides statistical analysis, for example, univariate and bivariate analysis. Everything is there in the platform, readily available for you to use. Let's talk about the machine learning capabilities. Dataiku provides an AutoML capability to get you started. It also helps you with feature engineering to automatically fill missing values and encode categorical data. No more hand-coding every specific iteration in Python; you can do it easily in Dataiku. On top of that, Dataiku provides notebooks for code-based experimentation. As I was commenting before, if you have data scientists who have developed a lot of models and who like to work in notebooks, Dataiku provides the environment to develop notebooks using Python, R, and Scala. This is based on Jupyter, so the data scientists out there will be very familiar with Jupyter notebooks. It really doesn't stop at creating something. For a product to be successful in production, across the whole lifecycle, you need DataOps and MLOps. This is the second part of the capabilities that I have here. With DataOps, Dataiku provides data quality checks to automatically assess that your flows run within expected timeframes and with expected results. For example, operating artificial intelligence projects requires repetitive tasks like loading and processing your data, making sure it's clean, making sure it's ready. Dataiku has scenarios and triggers to allow you to automate all this by scheduling periodic executions or triggering based on conditions. You have all that ready for your data to be processed. On the side of machine learning operations, with Dataiku's Unified Deployer, deploying projects to production for batch and real-time scoring is easy. Again, in one single platform, you have all the ability to deploy, to manage between environments, and to monitor, because monitoring in machine learning is very important: that your model is not drifting, that it is still accurate, that it's providing good predictions. You have the MLOps abilities there. Finally, we have towards the right the analytics apps. Within Dataiku, it's possible to create different analytics apps. For example, Dataiku has a what-if analysis scenario that allows data scientists and analysts to check different input scenarios and publish the what-if analysis for business users. I will show you later, but basically, if you want to play with the different variables, it is very easy to see how they affect the output with that what-if scenario. With Dataiku apps, data scientists and business analysts can easily create apps with a few clicks and publish a project, including the app, to production. 
With this, business users can easily interact, because it's very important not to keep all the knowledge of what's happening with the data science and the datasets within the team. It's important to show the business how this can be used. Finally, Dataiku supports various leading web app frameworks, for example, Dash, Bokeh, R Shiny, JavaScript, and more. This allows for more ways to share your data and applications. These are all the Dataiku capabilities, all in one single platform. Let's keep going. Okay. We are ready to jump hands-on into the tool. I hope that by this time, you have opened your local instance of Dataiku. Make sure that it's open. If you have created your Dataiku free trial, make sure that you are connected and ready to go, because we are going to get started. Today we are going to explore the data connections. I'm going to show you several options that you have, and we are actually going to connect to the Scooby-Doo CSV. We are going to do some exploratory data analysis. Before we even start preparing the data, we need to understand what we have available. We also need to understand what we need for our model to work and which steps will get us there. Dataiku offers visual tools to make this step very, very easy, as we will see. We are also going to do the data preparation itself. After we have identified the different steps that we need to take to use our data and build our ML model, we are going to do all this and prepare the data in the form that we require. Once the data is prepared, we are going to go and actually create our model. That's going to be a very interesting part. It doesn't end just with creating a model, right? We want to see how it is applied, how it is used, and what happens to the data when we push it through. We are going to score some data that we have there as well in our Scooby-Doo dataset. Without further ado, let's start chatting about data connections. Dataiku provides connectors to over twenty-five leading data sources, on-premise and in the cloud. For example, Amazon S3, Azure Blob Storage, Google Cloud Storage, Snowflake, SQL databases, NoSQL databases, HDFS, and more. We have different possibilities to connect to the different data sources. Let's go to the first exercise. In this first exercise, first of all, we are going to explore the data connection options inside Dataiku, and then we are going to connect to our Scooby-Doo dataset, which you have downloaded with the links that Carol provided and that we sent by email earlier. First of all, go to your Dataiku instance. Again, if you are using a local instance, you will be able to access it at localhost, and the port is 11200. The user and password to access it initially are admin and admin. First of all, you can see here that this is the very first screen that you will see in Dataiku. This is the DSS homepage, and you can see several sections here: Projects, Workspaces, Applications, Project Folders, Dashboards, and Wikis. You will be able to access all your objects from this initial homepage. Okay? We also have here this button for New Project. That's what we are going to do first. We're going to click here, New Project. We are going to select Blank Project. 
This is interesting because Dataiku offers a very useful Dataiku Academy for people who want to get started, and there are several Dataiku tutorials already there in the instance. If you fall in love with Dataiku in this session and you want to know more, you can always start that academy and get your hands dirty in the tool. For now, let's go and create a blank project, and we are going to call it Scooby-Doo. In my case, I'm calling it Scooby-Doo 2, because I already did a little project here with the name Scooby-Doo. I put the name here, it generates the project key, and I click Create. Okay. This is our project screen. In here, we have, again, different sections as well. We have, first of all, the Summary, where you can write a little more about what your project is about so that people can understand what you are achieving with it. You also have here the summary of the different objects that you have: datasets, recipes, and models. You have here a summary of the notebooks and analyses that you have in the lab, and you also have your objects: dashboards, wiki, and tasks. This is a capability I quite like: you can also put your list of to-dos here for the project. Let's say that you are collaborating with another two people on your team. You can always put here the list of tasks that you need to do, assign names, and then start working collaboratively and see what everybody's up to within the project. Okay. Okay, the next step is the top black ribbon here, at the very top. It gives you access to the different types of functionality. Here we have access to the Flow, the Datasets, Recipes, etcetera. We have access to the different types of analysis, the notebooks for those people who want to experiment a little with coding, web apps, libraries, and jobs and scenarios for all that automation. You also have the wiki here, and dashboards and insights. You can access all that different functionality from this black top ribbon. Okay. For this demo, we are going to connect to one single CSV, the Scooby-Doo dataset that you have downloaded. Let's get started. First of all, go again here in the top black ribbon to Flow. Once you are in Flow, everything is empty because we haven't created anything yet, but you will be able to see here the different objects that you can use. For now, let's go and click on Dataset. Let's take a moment to explore the different connections that we were chatting about before. We have here the ability to upload our files, which is the one that we are actually going to use for the Scooby-Doo dataset because we downloaded it as a CSV, but we have other options as well. Because here I'm on my local instance, I don't have the full set of options available, but you will be able to see in your free trial that you have more options. For example, under the network section, you have the possibility to connect to FTP, SFTP, HTTP, and so on. You can also connect to HDFS and Hive. You can connect to a whole big range of SQL databases: Snowflake, PostgreSQL, MySQL, Amazon Redshift, Google BigQuery. There are really a lot of options there. You can also connect to cloud storage and social sources: for example, Amazon S3, Azure Blob Storage, and Google Cloud Storage, and on the social side it's Twitter for now. We also have the possibility to connect to NoSQL: MongoDB, Cassandra, Elasticsearch. You can see all the different connectors that are available there. Cool. 
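As a side note for the coders following along: everything we just did through the UI can also be scripted with Dataiku's public Python client, dataikuapi. Here is a minimal sketch, assuming a local DSS at http://localhost:11200 and a personal API key you have generated in the instance; the project key and owner below are example values, not ones from the demo.

```python
# Minimal sketch using Dataiku's public API client (dataikuapi).
# Assumes a local DSS instance and an API key you have created;
# the values below are placeholders.
import dataikuapi

client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")

# Create a blank project, like clicking New Project > Blank Project.
project = client.create_project("SCOOBYDOO", "Scooby-Doo", owner="admin")

# List the connections configured on the instance (Snowflake, S3, etc.).
for name in client.list_connections():
    print(name)
```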
Once we have explored a little the different connectors available, let's go ahead and upload our files. Once you are here, the next step is to go and click on that Upload Your Files. We click there, and you have the option to either select the file or to drag and drop it. I love drag and drop, so I'm just going to do that and drag my CSV in here. I wait a little for it to upload, and I am going to rename my dataset to something short and easy. I'm just going to call it Scooby-Doo. This is quite nice because you can see here a little preview of what the data looks like. You will see that it took the top row of your dataset and used it for the column names. That is quite good. Okay. We just scroll here. Everything looks good. Everything is as expected. Then we are going to go and Create. Perfect. We have created our dataset, and now we're going back to our flow screen, and we will be able to see it on the canvas. Again, remember, the top black ribbon is your access to pretty much all the functionality. Go here, top black ribbon first, Flow, click, and you can see here your dataset. Once you explore more, you will see that Dataiku is very, very visual. You will have different icons for the different types of datasets. This is a file upload, very visual here. Snowflake has a little snowflake, and the Azure databases have some other icon. When you have a more complicated flow, you can easily use those icons to understand where your data is coming from. Again, remember that Dataiku is all about being very, very visual. Cool. Let's go back to our exercise and just make sure that we achieved all the objectives. We explored the data connection options that we have available in Dataiku, and then we connected to our Scooby-Doo dataset. Check and check, we are ready to go to the next section. This is a fun one: exploratory data analysis. Once we are connected to our data, we want to understand it a little more and understand what we can do with it and what we are working with. Exploratory data analysis is done to understand your data. In this step, you identify the kind of problem that you are trying to solve. Is it a prediction problem? Does it need a supervised or unsupervised approach? Do you actually have a target variable that you need to predict? If you have a target variable, are you trying to classify it into some categories, or are you trying to predict a continuous number? You can start thinking about all these questions. Once you identify your target variable, if you have one, you can also start identifying the relevant fields. You can start thinking: okay, maybe I am predicting sales, or maybe I am predicting the probability that someone is going to convert or not in an e-commerce pipeline, something like that. Then I start thinking about what could impact that decision. Maybe the number of pages that they browsed. Take The Iconic again: the different types of pages that they browse, the dresses, or whether they put something in the checkout or not, different things. You can start identifying all the relevant fields that are going to be useful for your prediction. Once you identify this, you also identify all the data preparation that you need to do. Maybe the data is not one hundred percent clean. Maybe you need to separate some fields, or maybe you want to get rid of some rows. 
You can start thinking as well about all those steps that you are going to require. Let's go now to our exercise number two, and this is going to be the exploratory data analysis. Main objectives here: identify the target variable, identify our relevant fields, and identify our data prep opportunities. Okay? Let's go back to the tool. Let's look, first of all, at what we have in our dataset. I'm going to double-click our Scooby-Doo dataset, and we can see here that each of the rows contains information about one of the Scooby-Doo episodes. We have all these episodes that were aired on TV. We have episode information such as the title: we have Title here, we have Date Aired, we have Runtime, we have Format. We have all the information about the episode itself. We also have some information about the monsters that appear in that episode. We have Monster Name, Gender, Monster Type, Subtype, Species, whether it was real or not, and also the Motive. Moving on, we have some information about who caught the monster, if they caught it at all. We can see here that for each of the characters of the Scooby-Doo gang, we have whether they caught it, whether they captured it, and whether they unmasked it. We have those three alternatives of interaction with the monster. We have some extra information, for example, whether they ate a snack or not, and whether someone other than the main characters unmasked or caught the monster, or whether the monster was not caught at all. It also has some information about the landscape or the setting, the terrain, the country or state, and it has some information about the culprit itself: Culprit Name, Culprit Gender, Motive, whether some other character was included in that episode, etcetera. While we look at this, we can start thinking about what we want to do with this data. We have all this information, and we can explore it here. In here, Dataiku offers us two very important things. First of all, it offers us the data type that was detected from the data source. In this case, it is detecting everything as a string because we uploaded a CSV, so it doesn't really have data types. If we were connecting to a SQL database, for example, it would give us the data type of the source: integer, string, Boolean, etcetera. It also has this blue type, which is the Meaning. Dataiku has the ability to identify what kind of information we have in the field. For example, here, Natural Language: it is basically just text. We have here as well Natural Language text. We have a date here. Even though it is a string, because again it's coming from a CSV, Dataiku has identified that it is a date, but it's not parsed. We are going to do something with that. Let's actually go and show a functionality of the natural language. Dataiku has some capabilities here to analyze directly from this screen. Again, because everything is very visual, you can do the exploratory data analysis quite fast here. Since we have natural language here, let's try and understand which word appears most in the title. For that, I click Analyze, and I'm going to use the Natural Language Processing option. I click here on Natural Language Processing. I am going to keep the defaults here, so I'm normalizing the different words, and I am clearing out stop words. I am not interested in stop words such as "the" or "to," articles and things like that. I'm not interested in those words, I just want the real nouns. 
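If you are curious what that Analyze step computes, here is a rough Python equivalent, a sketch only and not Dataiku's actual implementation. The column name title is assumed from the Kaggle CSV, and the stop-word list is a tiny sample rather than a real one.

```python
# Rough equivalent of Analyze > Natural Language Processing on the title
# column: normalize case, tokenize, drop stop words, count terms.
from collections import Counter

import pandas as pd

df = pd.read_csv("scoobydoo.csv")  # the Kaggle file; name assumed
stop_words = {"the", "a", "an", "and", "to", "of", "in", "on"}  # sample list

tokens = (
    df["title"]
    .str.lower()               # normalize the different words
    .str.findall(r"[a-z']+")   # naive tokenizer
    .explode()
    .dropna()
)
counts = Counter(t for t in tokens if t not in stop_words)
print(counts.most_common(5))   # "scooby" should come out on top
```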
I click Compute, and we can see here that the word that appears most in these titles is Scooby-Doo, and really no surprise, because after all, he is the star of his own show, right? Scooby is the most used word there. That was good. Okay, so we have that natural language processing. Next, we also want to explore parsing the dates. Dataiku has a very good ability to parse dates. We know that managing dates is sometimes difficult because we don't always know exactly which format they are in. For that, we can go here and see the Analyze Schema. We're actually going to parse it in the next step. That is fine. Okay. Here we have identified IMDb. That is what we are going to try and predict. We have some null values there. Again, if I go and Analyze, we have Invalid Values, and I want to see what those invalid values are: null. We are going to use IMDb as our target variable, so that we can prepare this data and predict the IMDb score. Okay? Let's go back to the flow and start our data preparation. Okay, so in our exploratory data analysis, we have identified our target variable, which is IMDb, and we have identified some of the information that we are going to use in the model that we are going to build. Okay? Let's go and start our data preparation. We were chatting about the different capabilities of Dataiku in data preparation. The Dataiku visual flow allows coders and non-coders alike to easily build data pipelines with datasets, recipes to join and transform datasets, and the ability to build predictive models. It has a very good and easy-to-use visual flow that we will see in a moment. It also provides an easy-to-use visual interface that speeds up data preparation. Dataiku offers ninety-plus built-in data transformations to easily aggregate, clean, normalize, deduplicate records, etcetera. Common data-cleaning tasks can all be found among those data transformers. It also offers some geospatial data preparation functions when working with geospatial data. For example, these include the ability to extract latitude and longitude from geopoint data and vice versa, and geo IP location to resolve location data like country, region or state, city, postal code, and more from an IP address. It has that capability embedded. Cool? This is our exercise number three, and we are going to start preparing the data. Let's review very quickly what we need to do. Okay. First of all, we identified that we have a date, but we need to parse it to be able to use it. We are going to parse the Date Aired. We are going to keep the rows where Monster Amount is zero or one so that we are able to predict that IMDb variable. We are going to remove the rows where the Monster Type is null, because we are going to use the monster characteristics to see which episode was the most interesting one, that is, which has a better IMDb review. We're going to remove those rows. We're going to remove the unwanted columns. There were a lot of natural language columns there that might not add anything to the model, so for now, we are going to remove them. We are going to standardize the monster attributes, and we are going to set IMDb and Engagement to the double data type. We are going to do those six steps. Let's go back to the tool itself. Again, I'm going to click on my dataset, and remember that Dataiku is very visual, so we are going to use a Prepare recipe for this. 
Once I click here, I'm going to go to the right side, to the visual recipes, and I'm going to click Prepare. Okay? Click Prepare. This already gives me the option to have an output dataset that would be called Scooby-Doo Prepared. I'm going to leave that one for now. I want to store it in CSV format in the managed file system. Pretty standard steps. Okay. The first transformation that we want to do is to parse the Date Aired. How we are going to do that: we go here. Here it is, the Date Aired. This is what I was trying to show before, but now in the Prepare recipe. We have here the option to click, and we have Parse Date. Again, sometimes it's difficult to work with different dates because we are not sure if the format is MM/DD/YYYY, or day first, etcetera. I really like this ability of Dataiku, where it can explore the whole dataset and suggest the format that I should use. Here it is telling me that day/month/year actually doesn't fit in more than half of the cases, so that's probably not a good format to take, but month/day/year fits well. Let's go and use month, day, year. We leave the default here, and we use that date format. As you can see, this action built the very first step of the data preparation here on the left-hand side. As we keep building, you will be able to see all the preparation steps here, and this is how we are going to stay organized. Okay? We are happy with this data preparation step. We are going to go to the next one, and we can see here as well the new column with the date already parsed. Okay. Let me look at what is next. Next, we are going to keep and remove rows. We are going to add a new step here. Okay. We want to remove the rows that we are not going to use; we are going to get rid of the rows where the monster amount is zero. Okay. Remove rows. I'm going to use here Filter Rows/Cells on Value. Okay. I am going to keep the ones that have just one monster. Only keep matching rows; the column is Monster Amount. Okay. We are going to use only the episodes that have one monster. Okay? We could get creative and do a little more transformation here to use the other ones, but at the moment, we are going to keep it simple for this first workshop and just use the ones with only one monster. Okay? I click here and add a new step. Before we move on, another nice functionality Dataiku has is this little eye, so that you can see the effect of each of these steps. Right? If I click that little eye, it shows the data with just this step applied; I am not deleting or getting rid of any rows here. If I click this eye, now I see what that step is doing. Okay? We have gotten rid of the rows that have several monsters. Let's go and review: keep rows where Monster Amount is zero or one; remove rows where Monster Type is null. Okay. Next step: we want to use the monster information to predict our target variable. If we don't have monster information in a row, we just want to get rid of it. Again, I will click Add New Step, Filter Rows/Cells on Value, and I'm going to use Monster Type. I'm going to remove the rows where Monster Type is null. Monster Type. Okay, perfect. Done. Okay, so I got rid of the rows that don't have a Monster Type, and I am just going to check that that is true. I am going to go here, Analyze, and see that everything has a monster. 
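For anyone following along in code, here is roughly what those first Prepare steps amount to, continuing the pandas sketch from above; this is illustrative only, not what Dataiku runs internally. Column names are assumed from the Kaggle file, and the date pattern is the month/day/year format the recipe suggested.

```python
# Pandas sketch of the Prepare steps so far (illustrative only).
import pandas as pd

df = pd.read_csv("scoobydoo.csv")

# Step 1: parse Date Aired, using the month/day/year format the recipe
# suggested; unparseable values become NaT rather than raising.
df["date_aired"] = pd.to_datetime(df["date_aired"], format="%m/%d/%Y",
                                  errors="coerce")

# Step 2: only keep matching rows -- episodes with exactly one monster.
df = df[df["monster_amount"] == 1]

# Step 3: remove rows where the monster type is missing.
df = df[df["monster_type"].notna()]
```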
But okay, we are seeing here that we need to do some extra cleaning, and we are also going to do that. Okay? We're going to standardize that. The next step is removing the unwanted columns. We were chatting about having different natural language columns here that we are not going to use for our model. We have a processor here to delete or keep columns by name, so we are now going to use that one. Clicking there, that's a new step, and we want to remove several columns. I click there on Multiple, and we are going to get rid of the Title. We don't want the title there. We don't want the Series Name either. What else do we have there? Monster Name we wanted, so let's keep it. There were some other names there. Culprit Name, let's remove that as well. Okay. There was "If It Wasn't For," another phrase that they said, like, "If it wasn't for those kids," etcetera. We don't want that natural language text getting into our model either. That's fine. We're going to get rid of that. Perfect. That's it. Okay. The next step is to standardize the monster attributes. The monster attributes are Monster Gender, Monster Type, Monster Subtype, and Monster Species. We are going to do some data standardization, because remember that before we saw that some of them have commas, and it looks like an array. We are going to go and get rid of that. Okay. For that, we are going to add a new step, and we are going to do it first with the Monster Type. I'm going to use a formula for that. Okay. Actually, I'm going to open the editor panel here. I was talking at the beginning about how, if you cannot find a processor and you want to do something very specific, we have this formula capability, and the formula language is very similar to what you would use in Excel. In this case, we are going to utilize split, and the editor also gives you some advice there as to how to use those functions. That split tells me that the first part of the formula is the string that we actually want to split. For now, I'm going to take first the Monster Type. I type Monster Type there, and it suggests the Monster Type column for me. The second part is how I want to divide it, and we saw earlier on that we have commas there. I'm going to use a comma to divide it. Okay. That's it. It is returning an array with the different values here. Because we just want one of the values, we are going to use square brackets with index zero so that it gives me the very first value. Okay. I am going to select this, and I'm just going to copy it to the output column so that it gets replaced there. Apply. If we go and check our Monster Type here, we will see that we got rid of those weird arrays that we had there. Now we have just one single word here. Okay? Because we want to do exactly the same transformation for the Monster Subtype, Monster Species, and Monster Gender, and we don't want to type the same thing several times, let's go and use this function: Duplicate Step. I went here to the three dots, Duplicate Step, and I'm going to use the same for the other fields. This is Monster Subtype. Same idea, copy, paste. Cool. Duplicate once more. I'm going to do it for the Monster Species. You can start seeing here why it's easier to just come and use this rather than exploring and doing everything by code; that just takes a little more time. 
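To recap the formula itself: in Dataiku's Excel-like formula language the step is essentially split(monster_type, ",")[0], split on the comma and keep the first element. Continuing the pandas sketch from above, with column names assumed, the same standardization would look like this:

```python
# Same standardization expressed in pandas: split each monster attribute
# on the comma and keep only the first value, which is what the Dataiku
# formula split(monster_type, ",")[0] does, duplicated per column.
for col in ["monster_type", "monster_subtype",
            "monster_species", "monster_gender"]:
    df[col] = df[col].str.split(",").str[0]
```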
I'm going to apply here again, Monster Species, and I'm going to do it one last time for the Monster Gender: duplicate. Cool. Okay. Again, let's just analyze here and make sure that everything is looking great. Okay. Male, female. Perfect. Single values. Monster Type, we already checked it. Analyze. Subtype is looking good. And the Species. Perfect. Okay. That's the standardization of our monster features done. That is what we are going to use in our model. Finally, we want to go and parse the Monster Amount, IMDb, and Engagement; we are going to cast them to double. We go here once more, IMDb, and I'm going to change this to double. Engagement as well, I'm going to change it to double. The Monster Amount is already an integer, so we're good with that. Okay, perfect. We are ready to get going. We have finalized our data preparation. We just save it. Remember, always return to the black top ribbon. Let's go back to the Flow, and we are going to run it from here. As you can see, in this flow we have our input here, our Prepare recipe, and our output dataset here. To run this, I'm going to use the Flow Actions at the bottom right side: Flow Actions, Build All. Yes, we want to build required dependencies, Build. It is building. Job started, job finished. I'm going to do a quick refresh here, and I can see that I now have a solid figure here. It means that my dataset is ready to go. I can double-click and see all that I did. I have the date parsed, I don't have the natural language columns that we don't want to use in the model, and everything is looking standardized. Let's go back and just review that we accomplished all the objectives of this section of the exercise. Number one, we parsed Date Aired, so now we have a proper date in the correct data type. We kept the rows where the Monster Amount was one. We removed the rows where the Monster Type is null, and we removed the unwanted columns: all those natural language columns that we didn't want, we removed. We standardized the monster attributes, so Subtype, Type, Species, and Gender; we got rid of those arrays that were showing in some of them. And we set IMDb and Engagement to the double data type. With this, our data preparation part one is done. Let's go to data preparation part two, because it is well known, right? In data science, eighty percent of the work is data preparation, but this makes it easier. Let's go to data preparation part two. For this, we already have all our data ready; we just need to split it into the labeled and the unlabeled datasets, so that we can train our model on the labeled dataset, and then score our unlabeled dataset. Black ribbon, top of the page, Flow, and we are going to use a Split recipe here. Again, I find the Split here, and I'm going to call the first dataset Scooby-Doo Labeled and create it, and the second one Scooby-Doo Unlabeled. Okay. We will have our labeled and our unlabeled datasets. Let's go and create the recipe. What we want to do with this step is to get into our labeled dataset all those rows that already have an IMDb score, which is what we are going to use to train our model, and we are going to keep separate the ones that don't have an IMDb score, so that we can score those. I'm going to use here Map Values of a Single Column. I click here. I want to split on the IMDb field based on discrete values, and we know that an IMDb score can go from zero to ten. 
Actually, I'm going to not use discrete values; I'm going to use a range there. Everything that has an IMDb score from zero to ten is going to go to the labeled dataset. Everything else is going to go to the unlabeled dataset, okay? I am going to show here another way of running this. If you look at the bottom left, you have a big green Run button there. I will just go there and run. Everything is running. Job succeeded, and we go back to our flow. Once more, top black ribbon, click on the first icon, and Flow. We can see here now that we have our two different datasets. Okay. Exactly the same structure, because that's exactly what you want for your model: the labeled and the unlabeled datasets need to have exactly the same structure. Cool. We achieved that exercise number four step, and we split our dataset into labeled and unlabeled. Let's go. We are ready to build the machine learning model. Let's go and do it. For creating a machine learning model, Dataiku has different functionalities such as feature engineering. It also has AutoML, the capacity to have notebooks in Python and R so that you can do more research-style work, and time series visualization. Lots of capabilities here. Let's go to our exercise number five. In here, we are going to create an analysis on our labeled dataset, and then we are going to explore a little of the model interpretability sections that Dataiku offers. Let's go back to our flow, and the way we're going to start here is we select our labeled dataset, and we are going to go to Analysis, to the Lab link here, and New Analysis, okay? I'm going to just keep the name, Analysis Scooby-Doo Labeled, Create Analysis. In here, we have different options for the models. I'm going to go here to the top right, Models, Create My First Model. Here, again, Dataiku offers different types of models. We are going to use AutoML Prediction for this workshop, but there are also capabilities for deep learning prediction, image classification, object detection, time series forecasting, and AutoML clustering. It has many more capabilities than the ones we are showing here. I clicked on the AutoML one, and I am going to use IMDb; I want to predict IMDb as the target variable. There, I am just going to select IMDb, and once more, even within AutoML, we are going to use a Quick Prototype for now, but there are other options that are either more interpretable, so that business analysts can understand better what's going on and it's not a black box, or higher performance models. Let's go and select Quick Prototypes, click, and create. Okay, and here, very quickly, we can see in this Design tab the different options that we have. We are not going to change the defaults for now, but we have the possibility to explore a little more and experiment with different things in the design of the AutoML: different percentages for the train and test datasets, different metrics to optimize. For now, we are going to optimize on that R2 score. We have debugging here; it gives us some diagnostics and suggestions as to what to do with the model. We can see here Include or Exclude. Remember that we didn't want any natural language or text fields, but we left the Monster Name in. Okay, I left it there; say I really don't want to use it in the model. It's as easy as going here and saying I don't want to use it anymore. 
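To make those design choices concrete, here is an illustrative scikit-learn sketch of roughly what a quick prototype like this boils down to; it is not Dataiku's internal implementation. It mirrors the split recipe (a valid IMDb score between zero and ten goes to the labeled set), holds out a test fraction, and compares the two default algorithms on the R2 metric. The feature subset is assumed, and it continues the pandas sketches above.

```python
# Illustrative sketch of the AutoML quick prototype (not Dataiku internals):
# split off the labeled rows, encode features, train the two default
# algorithms, and compare them on the R2 score.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Cast IMDb to a number ("double"), as in the Prepare recipe.
df["imdb"] = pd.to_numeric(df["imdb"], errors="coerce")

# Mirror of the split recipe: a valid IMDb score (0-10) means "labeled".
labeled = df[df["imdb"].between(0, 10)]
unlabeled = df[~df["imdb"].between(0, 10)]  # missing scores land here

features = ["monster_type", "monster_subtype", "monster_species",
            "monster_gender", "format"]        # assumed feature subset
X = pd.get_dummies(labeled[features].fillna("missing"))
y = labeled["imdb"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for model in (RandomForestRegressor(random_state=0), Ridge(alpha=1.0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, r2_score(y_test, model.predict(X_test)))
```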
I have the possibility there as well to tell the model what to use for the modeling. Okay? Feature generation, feature reduction, there are several options there. The good thing is also, if you are starting out in data science, this is very interactive, and there is a lot of advice and a lot of documentation on how best to use it, so you can explore a lot. Cool. For time's sake, I am going to go to Result here and start the training. We have twenty more minutes including Q&A, so we'll get going. I didn't show it, but at the moment it is just going to give us random forest and ridge (L2) regression. You have more algorithms available there if you want to play a little with those; you can select them to be included in the AutoML. Cool. Here, our model indicated that the random forest is the best one per the optimization of that R2 score, and we actually want to go deploy it. Okay, I want to save it first, of course. To deploy it, again, everything is managed from the flow, so it is as easy as selecting that random forest model, the one that I want to deploy, and going here to the top right, where I have my Deploy button. I am going to give it the default name here, Predict IMDb Regression, Create, and voila, our flow all of a sudden has another two objects: a training object and the prediction model here, ready for use. Okay, so we have created our model. We created an analysis on the labeled dataset, and we reviewed a little of the design of the model. Okay. I was going to show you the interpretability, but I might leave it to the end if we have a little more time. I'm going to go to the final step, which is the exciting one, right? We have created our machine learning model; we actually want to see what it does. In this step, we are going to use that unlabeled dataset that we prepared just before, and we are going to apply the model to it. Dataiku has two options to deploy models. You can have batch scoring with automation nodes, which means that you will be scoring in batches, in packages of several rows, and outputting predictions; that is what we are going to do at the moment with this flow. But you also have real-time scoring with API nodes. Let's say you have a use case like the car dealers, where you want to give a prediction of the price of a car. That really needs to happen in real time: the person is entering several features and selecting different options; it needs an API to go, apply the model, score it, and come back immediately. We cannot wait until one hundred rows are ready for batch scoring. Dataiku has that possibility with the API nodes. Okay. As we said before, we have our model there in the pipeline. We are ready to use it. Final exercise: we are going to score the data. Let's go back here to our pipeline. I'm just dragging here to make a little more space, but the way I'm going to use the model is I click here, and I am going to use the recipe for scoring. Here it is, Score. Okay. Once I click here, it appears on this side. Click Score, and I'm just going to give it the dataset that I want to score, which is the Scooby-Doo Unlabeled. Okay. Yes, I'm going to leave that name here, Unlabeled Scored, Create Recipe. Okay. Here, I am just going to turn on Compute Individual Explanations. It's a little slower, but our dataset is small, so we can still do that. That's fine. I am going to run again from here. 
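And here is the batch-scoring step in the same sketch: apply a trained model to the unlabeled rows and attach the predictions, which is conceptually what the Score recipe does with the deployed model (minus the individual explanations, which Dataiku computes with its own methods). It continues the variables from the sketch above; refitting on all labeled rows is a simplification, not the deployment workflow itself.

```python
# Batch-scoring sketch: run the unlabeled rows through a trained model.
# The new rows must be encoded the same way as at training time, so we
# reindex the dummy-encoded frame onto the training columns.
best_model = RandomForestRegressor(random_state=0).fit(X, y)

X_new = pd.get_dummies(unlabeled[features].fillna("missing"))
X_new = X_new.reindex(columns=X.columns, fill_value=0)

scored = unlabeled.copy()
scored["prediction"] = best_model.predict(X_new)
print(scored[["monster_type", "prediction"]].head())
```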
I'm going to go back to the flow, and we can see that a new icon appeared there, which is the scoring icon, and we finally have the scored dataset here. I'm going to refresh so that I can get the full scored dataset. Done. If we scroll all the way to the right, we have our prediction here for each of those rows, and there is a little explanation of how the different features played into that prediction. That's our dataset scored. We have fifteen more minutes. I am going to take just two more minutes to show you the interpretability section of the model. To go back there, I go on top here to Visual Analysis, Analysis Scooby-Doo Labeled, the one that I created before, Random Forest, and this is the section where, once the model is created, we are able to understand it a little more. We have some decision trees to look at: for example, the first step the algorithm took was to check whether the format was series, yes or no, and then it keeps dividing the tree like that. A very important feature is subpopulation analysis. We as data scientists, data engineers, people in data always need to be very aware not to be biased, and that our algorithms are not biased. This subpopulation analysis helps with that. Let's take Monster Gender; it doesn't make a lot of real-world sense for this dataset, but it shows the functionality, right? I can select my variable Monster Gender there and compute, and the idea here, to make sure the model is not biased, is that the metrics should behave similarly in all of the populations. Here we have very few female monsters, and our metrics differ between the populations. If this were a real case scenario, I would be worried about bias against females. I would go and investigate that a little further, run a few more tests, and see what's happening with the data. This is very, very important: subpopulation analysis. We have some other things like scatter plots there for interpretability, error distribution, metrics and assertions. We have here which features were actually used, etcetera. You can research this a little more as well. But yeah, just to show that: in general, data science is not just about creating the model and who knows what's happening inside. We also need to be very conscious and aware of what is actually going on and be able to interpret it. Cool. Any questions so far? I talked a lot. I hope that we have some questions there. We don't really have any questions in the Q&A. I'm just looking at chat as well. Looks like we don't have a lot of questions. Thanks, Azu, for that really insightful workshop. I hope everyone who attended this webinar got a lot of tips and tricks, especially on how to get started with Dataiku DSS. This was a ninety-minute workshop, and you will get the webinar replay as you've registered. Again, we're going to pause a couple of minutes to see if anyone has any final questions at all. Feel free to pop them into the Q&A section of your Zoom controls. Otherwise, I'm monitoring chat as well at the moment, so you can pop it into chat if you'd like. Again, just a recommendation: if you are interested in starting to explore your machine learning use cases, to go one step beyond, right, and do that predictive analytics part, Dataiku is a very easy way to start, and it really lowers the barrier to entry. Just make sure to grab a free trial account or to download the limited free edition and explore the possibilities. Yes. 
I've also popped in a blog article, which gives you a little recap of this event, because this is the second time we're running the Go Dataiku workshop. It's just a little recap, and you'll also see some of the questions that we had last time, so you can get a little more information. Feel free to reach out to us through our website if you have more specific questions. Next slide, please, Azu. Yes, again, I will just remind everyone that a replay will be sent to you within two to three business days. You will not be able to see this replay on the blog; it'll only be available to the people who have registered. Also, at the end of this webinar, there will be a short survey. Please do give us your thoughts. We are looking to continuously improve our content as well as our delivery, so do let us know your thoughts and give us your feedback. Otherwise, reach out to us if you have any questions or queries on anything that we have discussed today. We have all our links on our website. But otherwise, thank you all for joining today's webinar, and I hope you all have a lovely rest of the day. Thank you. Thanks, Azu. Thank you. Thanks everyone. Thanks, Carol. Bye bye.

In this webinar, Azucena Coronel guided attendees through Dataiku’s essential capabilities for advanced analytics and machine learning. She explored the platform’s comprehensive features including data preparation, visualization, AutoML, DataOps, and MLOps functionality. Using a Scooby-Doo dataset from Kaggle, Azucena demonstrated hands-on workflows covering data connections, exploratory analysis, data preparation with visual recipes, and building predictive models. The session included creating and deploying a random forest regression model to predict IMDb ratings, with emphasis on model interpretability and subpopulation analysis to identify potential biases. Carol Prins hosted the ninety-minute workshop, which showcased Dataiku as an accessible entry point for organizations exploring machine learning use cases.
