Speed Up Your Analysis in Alteryx using R and SQLite

Data

Speed Up Your Analysis in Alteryx using R and SQLite

by Michael Treadwell
//

When dealing with large data sets speed can be crucial. Fortunately, Alteryx Predictive Tools utilizes R to provide a seamless framework to extend the toolset that comes with Alteryx Designer. This means that Alteryx users benefit from all of the powerful packages developed by the R community to add functionality and increase the speed of their modules.

If you have Alteryx Predictive Tools installed, you can use the R Tool to execute R code in your Alteryx module. While it is by far the simplest way to use R with Alteryx, it may not always be the quickest way to analyze your data when dealing with larger data sets.

For Alteryx and R to be used in tandem, there must be data passed between the Alteryx Engine and R. This data handoff can be complicated, so Alteryx has developed a custom package called AlteryxRDataX containing functions that take care of this transfer for you. Many of these functions can be auto-populated from a drop-down menu in the R Tool configuration:

Alteryx: Read input

These functions work reliably and are easy to understand. Unfortunately, I have seen some performance issues with larger data sets. Let’s explore the benefits substituting the Run Command Tool for the R Tool to take advantage of some speedy functions in a couple of popular R packages.

In order to follow along, you will need to have Alteryx Predictive Tools. Visit the Alteryx Download page to get Predictive Tools and see the list of R packages that come standard with it.

Installing the Necessary Packages

To get started, there are three R packages we need: dplyr, data.table and RSQLite. Because CRAN only keeps source versions of older packages and not binaries, I would recommend using the mirror at Revolution Analytics to download the packages.

The primer on how to install new packages into your Alteryx R library can be found here.

However, since we are using a different R repository and installing multiple packages, I have provided the modified code below. Open Alteryx, drag an R Tool into the canvas and paste the code block below into the tool configuration. Running the module should install the packages, but be aware: The directory of your R library may be different than mine. It will be located wherever you have Alteryx installed inside the directory R-library. If you have the latest version installed in the default location: C:Program FilesAlteryxR-3.1.3library.

pkgs <- c('dplyr', 'data.table', 'RSQLite')
# your library path may be different
alteryx.lib <- 'C:/Program Files/Alteryx/R-3.1.3/library'
revo.cran <- 'http://cran.revolutionanalytics.com'
for (i in 1:length(pkgs)) {
  if (pkgs[i] %in% rownames(installed.packages(lib.loc = alteryx.lib)) == FALSE) {
    install.packages(pkgs[i], lib = alteryx.lib, repos = revo.cran)
  }
}

If you receive a message that the directory is not writeable, then you probably have the administrator-required version of Alteryx installed. Running Alteryx as an admin will solve this issue.

Testing the Alternatives

I have provided three sample modules so that you can follow along with the testing. Download those and other relevant files here. The first module (r-tool) is simply an Input Tool, an R Tool and a Browse. The code run by the R tool is two simple lines:

df <- read.Alteryx("#1", mode="data.frame")
write.Alteryx(df, 1)

R tool code

This module is the absolute minimum required to use the R Tool. Pass data from Alteryx to R for processing and then back to Alteryx. Run it and you will see that it takes a considerable amount of time simply to pass data from Alteryx to R and back again. That is without any other calculations performed inside of R. Repeated testing gave me an average run time for this model of about 62 seconds.

Let’s look at a couple of ways to speed this process up using the Run Command Tool to run a .R file with Rscript.exe.

The second module (command-tool-csv) is an Input Tool, a Run Command Tool and a Browse. To run this module successfully, you may need to change Command from C:Program FilesAlteryxR-3.1.3binx64Rscript.exe to the location of Rscript.exe in your Alteryx installation just as we did above with the R library.

Run Command Tool

With this module, instead of passing data from Alteryx to R using the commands provided by Alteryx in the R Tool, we are writing the data stream to a CSV (Write Source in the Run Command configuration is write-input.csv) and then reading that CSV inside of the write-csv.R file run using RScript.exe. The write-csv.R file is simply a text file with the following R code:

library(methods)

#read data from Write Source in Alteryx into R
t <- data.table::fread('./write-input.csv', sep = '|')

#write csv to be read in Alteryx by Read Results
write.table(t, './write-output.csv', quote = FALSE, sep = '|', row.names = FALSE)

Again, this is the absolute minimum required to run R in tandem with Alteryx. Pass data from Alteryx into R for processing and back again. Except this time, instead of using the functions provided by Alteryx, we are writing a CSV of the Alteryx data stream and then reading it into R with the speed of fread() – a data.table function. Once the data is inside of R, we write it back to a CSV that will be ingested by Alteryx (Read Results in the Run Command configuration is write-output.csv). Unfortunately, there isn’t currently an fwrite() function, so we will use the R standard write.table(). Testing on this module shows an average of 33 seconds which is a considerable performance increase.

The third module is able to shave another 14 seconds off of the average run-time by utilizing Alteryx’s ability to read from SQLite. This module looks similar to the second, but it has two key differences:

Third module

Instead of reading write-output.csv into the Alteryx data stream from R, ‘Read Results’ in the Run Command configuration now points to a table called “output” in a SQLite database, write.sqlite. Inside of the Rscript, instead of using write.table() to create write-output.csv, we use copy_to() from the dplyr package to write the output to SQLite:

library(methods)

#read data from Write Source in Alteryx into R
t <- data.table::fread('./write-input.csv', sep = '|')

#establish connection to sqlite database
sl <- dplyr::src_sqlite(path = './write.sqlite')

#drop existing output table and recreate with the new data to be read by Alteryx
if(dplyr::db_has_table(sl$con, 'output')) {DBI::dbGetQuery(sl$con, 'DROP TABLE "output"')}
invisible(dplyr::copy_to(sl, t, 'output', temporary = FALSE))

Testing on this module shows an average of 19 seconds runtime. Below you can see the distribution of the runtime for all three modules:

Runtime distribution

Moving Data Takes Time

In a real-world scenario, we would be doing more than passing data back and forth between Alteryx and R. Between the read and write steps there would be several lines of data manipulation logic. The examples here are a base-case experiment for speeding up data transfer between Alteryx and R. I’ve seen this implementation save hours on processes for clients.

Aside from a few tricks you’ve picked up along the way, the major takeaway is, in any use case, always understand where data lives, which applications are moving data around and keep your calculations as close to the data as possible.

More About the Author

Michael Treadwell

Data Lead
Introducing the Snowflake Data Cloud: Data Science When you think of data science (for the purposes of this blog, this will encompass all machine learning and AI activities), you may ...
The Migratory Patterns of the Common Alteryx Workflow Prior to Alteryx Server version 2018.4, migrating workflows was a three-step process: Deny the problem exists Procrastinate Acquiesce ...

See more from this author →

InterWorks uses cookies to allow us to better understand how the site is used. By continuing to use this site, you consent to this policy. Review Policy OK

×

Interworks GmbH
Ratinger Straße 9
40213 Düsseldorf
Germany
Geschäftsführer: Mel Stephenson

Kontaktaufnahme: markus@interworks.eu
Telefon: +49 (0)211 5408 5301

Amtsgericht Düsseldorf HRB 79752
UstldNr: DE 313 353 072

×

Love our blog? You should see our emails. Sign up for our newsletter!