This is the third part of a short series called “Natural Language Processing with Dataiku,” where we are analysing airline reviews. In the first part, we did some very basic data preparation, used the Dataiku sentiment analysis plugin and evaluated the model with the help of a confusion matrix. In the second part, we applied some NLP pre-processing tasks and ran the Dataiku sentiment analysis again. Upon evaluating once more, we saw how the performance increased. In this third part, we will use Dataiku’s native machine learning (ML) algorithms to predict polarity based on text.
Data Preparation
To start, we will divide our prepared dataset into two datasets for training and testing. If you have been following the previous two blog posts, you will know that we applied some text pre-processing techniques to the column “customer_review.” For the purpose of this exercise, I have slightly modified the pipeline to keep “customer_review” with the raw data and created a new field, “customer_review_simplified,” with the pre-processed text. To split the dataset, we will use a split recipe with random dispatch. Because the split is random, it is very important to select “Set a random seed” so that the split is always done the same way and the results are reproducible:
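Outside Dataiku, the same reproducible random split can be sketched with scikit-learn (the tiny dataset below is made up for illustration; the column names follow this series):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical reviews dataset with the columns used in this series
reviews = pd.DataFrame({
    "customer_review": ["Great flight, friendly crew", "Lost my luggage, never again"],
    "customer_review_simplified": ["great flight friend crew", "lost luggag never"],
    "recommended_label": ["yes", "no"],
})

# Fixing random_state plays the role of "Set a random seed" in the split recipe:
# the same rows land in train/test on every run, so results are reproducible.
train, test = train_test_split(reviews, test_size=0.2, random_state=42)
```

Without a fixed seed, every run would shuffle the rows differently, and metrics from one run could not be compared to the next.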
To make sure that the train and test datasets remain balanced (having similar percentages of positive and negative reviews), you can use the “Analyze” capability of Dataiku. To do this, double-click the train dataset, scroll to the “recommended_label” field and click on “Analyze” like you see here:
Select “Whole data” and click “Compute.” We can see that the datasets are not too imbalanced:
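The same balance check can be sketched in pandas (the data below is made up; in Dataiku, “Analyze” computes these proportions for you):

```python
import pandas as pd

# Hypothetical train split with the label column used in this series
train = pd.DataFrame({"recommended_label": ["yes", "no", "yes", "no", "yes", "no", "no"]})

# Class proportions, analogous to the "Whole data" computation in the Analyze window
proportions = train["recommended_label"].value_counts(normalize=True)
print(proportions)
```

If one class dominated heavily, a stratified split (e.g. `train_test_split(..., stratify=labels)`) would be the usual fix.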
Creating a Baseline Model
The first model that we are going to create will use the “customer_review” field (without any pre-processing) to predict “recommended_label.” Dataiku has powerful algorithms within its AutoML functionality. To access this, select the train dataset, open the Lab and choose “AutoML Prediction” to create a prediction model on “recommended_label” like you see here:
Dataiku rejects text features by default, so we need to go to the Design tab and manually reject all features except “customer_review.” As this is the first experiment, we will leave all the defaults, including:
“Text handling = Term hashing + SVD”
“Algorithms to test = Random Forest, Logistic Regression”
“Hyperparameters = default”
Click “Save,” then click “Train.” We will name this training session “baseline,” like you see here:
In this case, optimising for the metric “ROC AUC,” the Logistic Regression performed better than the Random Forest:
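Conceptually, “Term hashing + SVD” text handling followed by the two candidate algorithms is similar to the scikit-learn sketch below. This is a rough approximation for intuition, not Dataiku’s actual implementation, and the toy corpus and feature sizes are assumptions:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Toy corpus standing in for "customer_review" / "recommended_label"
X_train = ["great crew and food", "lost luggage twice",
           "comfortable seats", "delayed and rude staff"]
y_train = [1, 0, 1, 0]
X_test = ["friendly crew", "terrible delay"]
y_test = [1, 0]

for clf in (LogisticRegression(), RandomForestClassifier(random_state=42)):
    model = make_pipeline(
        HashingVectorizer(n_features=2**10),            # term hashing
        TruncatedSVD(n_components=2, random_state=42),  # dimensionality reduction (SVD)
        clf,
    )
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(type(clf).__name__, round(auc, 3))
```

Hashing avoids storing a vocabulary, and the SVD step compresses the hashed features into a small dense matrix that both algorithms handle comfortably.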
In the previous blogs of this NLP with Dataiku series, we tested the predictions and produced a confusion matrix to visualise the percentage of correct predictions. In this exercise, we will use the Evaluation Store and focus on the metric we are optimising for, ROC AUC, as well as the cost matrix gain. We can still see each of the confusion matrices if we need to by double-clicking each of the individual evaluations.
Deploying the Model and Testing
The next step is to deploy our Logistic Regression model in the pipeline to be able to evaluate it. To achieve this, click on the Logistic Regression model, and hit “Deploy” in the top right:
This will deploy two objects to our Dataiku flow: the train recipe and the model, visualized here:
The next step is to use that model to score our test dataset. For this, click on the “airlines_review_test” dataset and then on the “Predict” recipe. Give this scored dataset a descriptive name; in my case, I used “airlines_review_test_scored_baseline_logreg.” Click “Run” to predict:
Once this is predicted, we will use an “Evaluate” recipe and keep the results in an Evaluation Store to be able to easily compare the results:
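The two metrics the Evaluation Store tracks for us can be sketched as follows. The labels, scores and the ±1 cost weights below are illustrative assumptions, not Dataiku’s defaults:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical scored test set: true labels, predicted labels, positive-class scores
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]

# ROC AUC is computed from the scores, independently of any decision threshold
auc = roc_auc_score(y_true, y_score)

# Cost matrix gain: weight each cell of the confusion matrix by a gain/cost
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
gain = 1 * tp + 1 * tn - 1 * fp - 1 * fn  # example weights: +1 correct, -1 wrong
print(f"ROC AUC: {auc:.3f}, cost matrix gain: {gain}")
```

The point of a cost matrix is that false positives and false negatives rarely hurt equally; adjusting the weights lets the evaluation reflect the real business trade-off.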
Second Experiment With Pre-Processed Text
Our next experiment will use the same AutoML configuration as the baseline model, but with our “customer_review_simplified” feature. As a reminder, we used Dataiku’s “Simplify Text” processor to get the stem of the words and to remove stop words. If you want to read more about why we do this, please read the second part of this series.
Above: A view of the Simplify Text feature from the second blog post.
After retraining the model with this “customer_review_simplified” feature, we deployed to the flow (updating the current model) and ran the full pipeline to be able to compare the results in the Evaluation Store. As we can see from the previous optimisation screenshot, the metric we are optimising for ROC AUC got slightly better. We are on a good track to optimise the performance of this polarity prediction.
You’ll notice there are two evaluations with this second model. This is because we ran it with “customer_review_simplified” ordered both alphabetically and not alphabetically, which didn’t make a significant difference in performance:
Third Experiment: Adding New Features
In the spirit of science (hence the “science” in data science), let’s do another experiment. This time, we will add a few new features based on “customer_review.”
- First, we will add two simple features: “length_raw” and “length_simplified,” which will count the number of terms in each of the fields:
- Then, we will add the ratio between these two lengths and call the new feature “length_ratio,” like you see here:
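In pandas, these three features could be computed like this (the column names follow this series; the two example reviews are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_review": ["The crew was very friendly and helpful", "Bad food"],
    "customer_review_simplified": ["crew friend help", "bad food"],
})

# Term counts for the raw and simplified text
df["length_raw"] = df["customer_review"].str.split().str.len()
df["length_simplified"] = df["customer_review_simplified"].str.split().str.len()

# Ratio between the two lengths: how much text the simplification kept
df["length_ratio"] = df["length_simplified"] / df["length_raw"]
print(df[["length_raw", "length_simplified", "length_ratio"]])
```

A low ratio suggests a review made up largely of stop words, which may itself be a weak signal about the review.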
Once these features are added in the initial data prep step, we’ll run the split recipe again so that these new features propagate to the “airlines_review_train” and “_test” datasets and can be used in the next model training session. We’ll call this training session “extra_features,” like so:
The Logistic Regression algorithm again gets the higher ROC AUC, so we deploy this new model to the flow, use it to score our test dataset, and evaluate it. These new features didn’t increase the ROC AUC, but they slightly increased the cost matrix gain:
So, we’ll take the increase and move on to our next and final experiment.
Fourth Experiment: Handling Text Features in the ML Model Design
We added a few extra features in the previous steps. Now, we will experiment with different techniques for handling text features in the ML model. For our baseline, we left the default value, “Tokenization, hashing and SVD.” This time, we will select “Count vectorization” in the Design tab:
Then we do our workflow one more time:
- Train (name the session “Count_vectorization”; do not drop and recreate the train and test datasets, as the features haven’t changed; accept the warning about sparse features)
- Select the model with highest performance
- Deploy to our flow
- Score the test dataset
- Evaluate the results
It is possible that Random Forest can’t run with this text handling technique due to memory issues. Count vectorization essentially creates a term-occurrence matrix, which will be very large and sparse. Random Forest algorithms are not great with large, sparse matrices: each tree can be really deep and have thousands of nodes, which makes the memory consumption of the Random Forest grow very fast:
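The sparsity is easy to see in a small sketch: the occurrence matrix gets one column per distinct term, and most cells are zero. The toy corpus is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "great flight great crew",
    "lost luggage and rude staff",
    "comfortable seats but delayed flight",
]

# Count vectorization: one column per term, each cell = number of occurrences
vec = CountVectorizer()
X = vec.fit_transform(corpus)  # returns a scipy sparse matrix

density = X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, f"{density:.0%} of cells are non-zero")
```

Even on three short sentences the matrix is mostly zeros; on a real review corpus the vocabulary runs into tens of thousands of terms, so trees that split on individual columns become deep and memory-hungry.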
On the bright side, we can see that with this new text handling technique our ROC AUC increased. Once we deploy, score the test dataset and evaluate the performance, we can see the full details compared to the previous experiments in the Evaluation Store:
With a few experiments, we were able to increase the performance of our ML model. We experimented with new features and with different text handling techniques, but it doesn’t need to stop there. Dataiku offers varied algorithms natively within the platform, with several hyperparameters to tune within each of them, so you can try different approaches. If this is not enough, Dataiku also allows you to import your custom Python models into the platform and test them through the same workflow we demonstrated here. With Dataiku, we can do as much experimenting as we need, and it provides a framework to easily compare experiments and go back to previous versions when we need to. Dataiku makes data science easier!
Are you excited to try out Dataiku? Contact us to chat about how we can empower you to start your data science journey today.