Creating Automation Scenarios in Dataiku


Creating Automation Scenarios in Dataiku

The wheel. The lightbulb. The internet. Dataiku. All revolutionary creations that disrupted life itself. Okay, well, maybe Dataiku won’t disrupt life itself. It will, however, continue to push the data industry in a great direction.

So, what is Dataiku? It’s a company that develops collaborative data science software called the Data Science Studio (DSS). DSS is loaded onto a server that allows for maximum collaboration and comes with loads of features that support everything from ETL to scoring and deploying robust machine learning models without having to touch code—unless you want to.

Automation Scenarios

The feature we’re going to explore in this post is automation scenarios. These scenarios are a set of actions you define that can be triggered in a few different ways. You want your data to update at the top of every hour? Create a scenario for that. You want to run some Python code when you upload new data? Create a scenario for that. You want to train a machine learning model every time a button on your web page is clicked? It’s scenario time.

For our demo, we’ll retrieve New Orleans, LA (NOLA) weather data from the Dark Sky API, do some data preparation to get the schema where we want it and add the data to a table in Snowflake. We’ll then set up a scenario to build the dataset and add the new data to the Snowflake table every hour. The flow looks like this:

creating automation scenarios in Dataiku for Snowflake

Let’s build this thing! We chose to use the Python script as opposed to the HTTP input because the Dark Sky API returns more information than we want (current and forecasted weather), and the script allows us to be selective. Quick note: a huge selling point for Dataiku is the many built-in features that allow you to avoid coding altogether, but you always have the ability to drop in Python or R code blocks for any function that might not be included. Here is the code we used for the API call:

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import requests
import json

# REST API call, API Key redacted, Coordinates for New Orleans, LA
response = requests.get("*API-KEY*/29.9433,-90.0641").content

# Getting just current weather
data = json.loads(response)['currently']

# Creating DataFrame to be loaded into Dataiku Dataset
nola = pd.DataFrame(data, index=[0])

# Write recipe outputs
nola_Weather_Data = dataiku.Dataset("NOLA_Weather_Data")

The output is one row with the current weather data for New Orleans:

data output when creating automation scenario with Dataiku

Next, we added a prep step to format the time feature to be more useful.  We did this in two quick built-in functions:

prep step when creating automation scenarios with Dataiku

Now that our data is prepped, all we have to do is add the sync step to insert the new data into our Snowflake table to complete our flow! Notice that we have “Append instead of overwrite” checked:

automation scenarios in Dataiku for Snowflake

With the flow complete, we can create a scenario that will automatically build the prepared dataset and insert it into the Snowflake table.

automation scenarios with Dataiku and Snowflake

Under the menu icon (highlighted above) that can be found at the top of the screen, select Scenarios and then select + NEW SCENARIO on that page.

auto-triggers in automation scenarios with Dataiku and Snowflake

We chose to have auto-triggers activate, so we could hit the scenario with the Dataiku DSS API to get the current weather at that moment. Setting the triggers is very simple. We just choose to add a time-based trigger to go every hour and it’s done!

Now that we have the triggers set up, let’s add some steps for the scenario to execute. The primary thing the scenario will be doing is building the prepared dataset and adding it to the Snowflake table. To do this, we add a build/train step and select the Snowflake dataset with the “Force-build dataset and dependencies” build mode.

build/train step in automation scenarios

At this point, the scenario will get new weather data from Dark Sky, prepare it, and insert it into the Snowflake table at the top of every hour (or when the DSS API call is made). Something to note is that every time this happens, a log is created and stored on the server where your Dataiku instance lives. If you don’t clear these logs, it can fill up your available storage, so we added the Clear Logs macro and the Clear Old Exports macro to the steps in our scenario to prevent this. If you’re triggering the scenario at a higher frequency than every hour, make sure your Dataiku admins are monitoring disk space and doing appropriate server maintenance that goes beyond these macros:

managing Dataiku storage space

The scenario is complete! We took this a bit further and created a Tableau dashboard using the Snowflake table as the data source, added that dashboard to the InterWorks Portals for Tableau demo and used Wordsmith to add narrative to accompany the dashboard. That demo can be found here.

You can do many, many different things with scenarios. It’s a huge benefit to choose Dataiku DSS as your data science tool. Our demo here just scratches the surface. If you have any questions about whether Dataiku is the right tool for your business or how it can take your business further, reach out to us. We’d love to help! You can also do a 14-day free trial of Dataiku on your own if that’s more your speed.

More About the Author

Justin Lyons

Curator Architect
Portals for Tableau 101: Page Power Improvements Do you like tools that are difficult to use? We don’t either. A big focus our team has for Portals moving forward is making them easier ...
Using the Extract Refresh Command in Portals for Tableau Everyone hates an old data source extract. Well, maybe not everyone, but data analysts sure do! If you’re an analyst or someone who ...

See more from this author →

InterWorks uses cookies to allow us to better understand how the site is used. By continuing to use this site, you consent to this policy. Review Policy OK


Interworks GmbH
Ratinger Straße 9
40213 Düsseldorf
Geschäftsführer: Mel Stephenson

Telefon: +49 (0)211 5408 5301

Amtsgericht Düsseldorf HRB 79752
UstldNr: DE 313 353 072


Love our blog? You should see our emails. Sign up for our newsletter!