This is the third part of a short series called “Natural Language Processing with Dataiku,” where we are analysing airline reviews. In the first part, we did very basic data preparation, used the Dataiku sentiment analysis plugin and evaluated the model with the help of a confusion matrix. In the second part, we applied some NLP pre-processing tasks and ran the Dataiku sentiment analysis again; evaluating once more showed how the performance improved. In this third part, we will use the native Dataiku machine learning (ML) algorithms to predict polarity based on text.
Data Preparation
To start, we will divide our prepared dataset into two datasets for training and testing. If you have been following the previous two blog posts, you will know that we applied some text pre-processing techniques to the column “customer_review.” For the purpose of this exercise, I have slightly modified the pipeline to keep the raw data in “customer_review” and created a new field, “customer_review_simplified,” with the pre-processed text. To split the dataset randomly, we will use a split recipe. Because the rows are dispatched randomly, it is very important to select “Set a random seed” so that the split is performed the same way every time and the results are reproducible:
To make sure that the train and test datasets remain balanced (having the same percentage of positive and negative reviews), you can use Dataiku’s “Analyze” capability. To do this, double-click the train dataset, scroll to the “recommended_label” field and click on “Analyze” like you see here:
Select “Whole Data” and click “Compute.” We can see that the datasets are not too imbalanced:
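For readers who prefer to see the idea in code, here is a minimal scikit-learn sketch of the same two steps: a seeded random split followed by a class-balance check. The column names match our dataset, but the toy data and the 80/20 ratio are illustrative choices, not Dataiku’s internals.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the airline reviews dataset
df = pd.DataFrame({
    "customer_review": ["great flight", "lost my bag", "friendly crew", "delayed again"] * 25,
    "recommended_label": ["yes", "no", "yes", "no"] * 25,
})

# A fixed random_state plays the role of "Set a random seed":
# the same split is produced on every run.
train, test = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)

# Equivalent of the "Analyze" check: class proportions per split
print(train["recommended_label"].value_counts(normalize=True))
print(test["recommended_label"].value_counts(normalize=True))
```

Re-running this script always yields the identical split, which is exactly what the random seed buys us in the split recipe.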
Creating a Baseline Model
The first model we are going to create will use the “customer_review” field (without any pre-processing) to predict “recommended_label.” Dataiku has powerful algorithms within its AutoML functionality. To access them, open the train dataset, go to “Lab,” and select “AutoML Prediction” to create a prediction model on “recommended_label” like you see here:
Dataiku rejects text features by default, so we need to go into the Design tab and manually reject all features except “customer_review.” As this is the first experiment, we will leave all the defaults, including:
“Text handling = Term hashing + SVD”
“Algorithms to test = Random Forest and Logistic Regression”
“Hyperparameters = default”
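To give an intuition for what “Term hashing + SVD” means, here is a rough scikit-learn sketch of such a pipeline: terms are hashed into a fixed-size sparse vector space, then reduced to dense components with truncated SVD before the classifier. The feature and component counts below are arbitrary illustrative values, not Dataiku’s defaults.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Hash terms into 4096 buckets, compress to 20 dense components,
# then fit a logistic regression on top.
model = make_pipeline(
    HashingVectorizer(n_features=2**12, alternate_sign=False),
    TruncatedSVD(n_components=20, random_state=42),
    LogisticRegression(max_iter=1000),
)

reviews = ["great flight", "lost my bag", "friendly crew", "delayed again"] * 10
labels = ["yes", "no", "yes", "no"] * 10
model.fit(reviews, labels)
print(model.predict(["the crew was friendly"]))
```

Hashing keeps the memory footprint fixed no matter how large the vocabulary is, and the SVD step gives tree- and linear-based algorithms a small, dense input to work with.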
Click “Save,” then click “Train.” We will name this training session “baseline,” like you see here:
In this case, optimising for the metric “ROC AUC,” the Logistic Regression performed better than the Random Forest:
In the previous blogs of this NLP with Dataiku series, we tested the predictions and produced a confusion matrix to visualise the percentages of correct predictions. In this exercise, we will use the Evaluation Store and focus on the metric we are optimising for, “ROC AUC,” as well as the “Cost Matrix Gain.” We can still see each confusion matrix if we need to by double-clicking on each of the individual evaluations.
Deploying the Model and Testing
The next step is to deploy our Logistic Regression model to the pipeline so we can evaluate it. To achieve this, click on the Logistic Regression model and hit “Deploy” in the top right:
This will deploy two objects to our Dataiku flow: the train recipe and the model, visualized here:
The next step is to use that model to score our test dataset. To do this, click on the “airlines_review_test” dataset and then on the “Predict” recipe. Give this scored dataset a descriptive name; in my case, I used “airlines_review_test_scored_baseline_logreg.” Click “Run” to predict:
Once the predictions are done, we will use an “Evaluate” recipe and keep the results in an Evaluation Store so we can easily compare results:
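If you wanted to reproduce this evaluation outside Dataiku, both ROC AUC and a confusion matrix can be computed from the scored dataset’s true labels and predicted probabilities. A small scikit-learn sketch with made-up scores (the numbers below are invented for illustration, not our model’s output):

```python
from sklearn.metrics import roc_auc_score, confusion_matrix

# Hypothetical scored test set: true labels (1 = recommended)
# plus the model's predicted probability of the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
proba  = [0.9, 0.2, 0.35, 0.6, 0.55, 0.1, 0.8, 0.3]

# The metric we are optimising for in the Evaluation Store
auc = roc_auc_score(y_true, proba)
print(f"ROC AUC: {auc:.4f}")

# Threshold at 0.5 to get hard predictions for a confusion matrix
y_pred = [1 if p >= 0.5 else 0 for p in proba]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Note that ROC AUC is threshold-free (it ranks the probabilities), while the confusion matrix depends on the 0.5 cut-off chosen above.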
Second Experiment With Pre-Processed Text
Our next experiment will use the same AutoML configuration as the baseline model, but with our “customer_review_simplified” feature. As a reminder, we used Dataiku’s “Simplify Text” processor to stem the words and remove the stop words. If you want to read more about why we do this, please read the second part of this series.
Above: A view of the Simplify Text feature from the second blog post.
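To give a feel for what this kind of simplification does, here is a deliberately crude, hand-rolled approximation of stop-word removal and stemming. Dataiku’s “Simplify Text” uses proper stemmers; the tiny stop-word list and suffix rules below are illustrative only.

```python
import re

# Tiny stop-word list standing in for a real one
STOP_WORDS = {"the", "a", "an", "was", "is", "were", "and", "to", "of"}

def crude_stem(word: str) -> str:
    # Very rough stemming: strip a few common English suffixes
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def simplify(text: str) -> str:
    # Lowercase, tokenise, drop stop words, stem what remains
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(crude_stem(t) for t in tokens if t not in STOP_WORDS)

print(simplify("The flight was delayed and the crew seemed annoyed"))
```

Collapsing “delayed” and “delaying” to the same stem, and dropping words like “the” and “was,” shrinks the vocabulary the model has to learn, which is why this pre-processing can help the classifier.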
After retraining the model with the “customer_review_simplified” feature, we deployed it to the flow (updating the current model) and ran the full pipeline to compare the results in the Evaluation Store. As we can see from the previous optimisation screenshot, the metric we are optimising for, ROC AUC, improved slightly. We are on the right track to optimise the performance of this polarity prediction.
You’ll notice there are two evaluations with this second model. This is because we ran it with “customer_review_simplified” ordered both alphabetically and not alphabetically, which didn’t make a significant difference in performance: