This is the first in a blog series aimed at highlighting the use of natural language processing in Dataiku. Stay tuned to the blog for more installments coming soon!
The Trip
I recently travelled from Sydney to Mexico to visit my family, and my last flight leg (Dallas -> Mexico) was cancelled due to bad weather. As the frustration of everyone affected boiled over and people complained loudly to the customer service staff, I wondered what kind of reviews this airline was going to get with so many angry customers. That wondering evolved into actually wanting to analyse the data and see what I could find.
One of my favourite data science areas is natural language processing (NLP). To me, making a computer understand natural language is hugely important given all the applications down the line: sentiment analysis, topic analysis, chatbots and more, with uses ranging from business to social good. I’ve heard of a few chatbots for social good trialled in community engagement, public services FAQs and even personalised education. Recently, I read about a project applying topic analysis to medical research papers to help categorise them in a more efficient and reliable way. NLP helps add useful numeric structure to text data, resolve ambiguity in language, and generally make better use of the text data around the world.
With the inspiration that natural language processing brings, and the idea of analysing the reviews of different airlines around the world, I set out to find a dataset that could give me something to try, and of course Kaggle didn’t disappoint. The Skytrax Airline Reviews dataset is a great one to get started with some NLP analysis.
The pipeline for this project ended up looking like this:
Data Ingestion
Dataiku pleasantly surprised me. I had initially investigated the dataset with external Python scripts and found that half of the records were empty. With Python, cleaning them was easy enough. But when I grabbed the .xlsx file and uploaded it to Dataiku, I literally didn’t have to do anything: it cleaned these empty records automatically!
Above: Some of the cleaned records.
Initial Data Prep
Before using the evaluation recipe, there was some data clean-up I needed to do, specifically on the recommended label and the review date:
I created a feature recommended_label and mapped yes = positive and no = negative:
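In a Prepare recipe this is a simple value mapping; outside of Dataiku, the same step could be sketched in pandas (the column name recommended is from this dataset, and the toy rows here are made up):

```python
import pandas as pd

# A few toy rows standing in for the Skytrax reviews
df = pd.DataFrame({"recommended": ["yes", "no", "yes"]})

# Map yes/no onto the sentiment labels used later in the evaluation
df["recommended_label"] = df["recommended"].map({"yes": "positive", "no": "negative"})
print(df["recommended_label"].tolist())  # ['positive', 'negative', 'positive']
```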
I might want to do some analysis with the customer review date later, so I first cleaned it using a regex (to get rid of the letters from the day, e.g. 3rd, 4th, 21st). While analysing the results, I discovered a few typos with “Augu” (instead of “August”), so I cleaned those as well. Then, I used the Parse Date recipe to get the date parsed:
Above: Before.
Above: And after.
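The date clean-up described above can be sketched in plain Python. Dataiku’s Prepare recipe handles this with its own processors and the exact regex isn’t shown, so this is just one possible version of the same two steps:

```python
import re

def clean_review_date(raw: str) -> str:
    """Strip ordinal suffixes from the day (3rd -> 3) and fix the 'Augu' typo."""
    # Remove st/nd/rd/th immediately after the day number
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", raw)
    # Repair the truncated month name found in some records
    cleaned = re.sub(r"\bAugu\b", "August", cleaned)
    return cleaned

print(clean_review_date("21st Augu 2015"))  # -> "21 August 2015"
```

Note that the `\b` word boundary keeps “August” itself untouched, so rerunning the step is safe.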
Using the Sentiment Analysis Plugin and Preparing to Evaluate Its Results
One of the advantages of Dataiku is the ability to extend its power with plugins. A plugin is a package of components that can be reused and even shared with others. Dataiku has a growing number of plugins that can get you started with different types of analysis, such as geospatial analysis, NLP, deep learning, and computer vision, amongst many more. You can find the current list of available plugins here.
In this case I wanted to do a sentiment analysis on the customer reviews, and I found a great plugin to try out called, fittingly, “Sentiment Analysis.”
I ran the recipe on the customer reviews with a binary output (positive/negative) and prediction confidence scores. Skimming through some records, the predictions looked doubtful: I got records labelled as positive whose overall recommendation was “no.” I wanted to investigate further and put some statistics around this. I was interested in visualising a confusion matrix using the recommended feature as a proxy ground truth, while acknowledging that a more accurate approach would be a human-labelled dataset with “positive/negative” noted per record.
Above: What I settled on for this section.
In order to perform the evaluation, I needed to do a few more data prep steps after the prediction. There were a few records without a value in the “recommended” feature, so I filtered those out, as we will not be using them for our model evaluation:
The evaluation recipe uses the NumPy function np.isnan() to detect empty values in the records being evaluated. Since our current classes have the string values “positive” and “negative,” I added an extra prep step to map these to 1 and 0 so that the evaluation step works correctly. I also made sure the storage type of both columns was integer, as I got a Python error when I accidentally left it as string:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced
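The fix boils down to giving np.isnan numeric input. A sketch of the same mapping in pandas/NumPy (the _n column names follow the configuration described below; the toy rows are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "recommended_label": ["positive", "negative", "positive"],
    "predicted_sentiment": ["positive", "positive", "negative"],
})

# Calling np.isnan on these string (object) columns would raise the
# TypeError above, so map the classes to integers first
label_map = {"positive": 1, "negative": 0}
df["recommended_label_n"] = df["recommended_label"].map(label_map).astype(int)
df["predicted_sentiment_n"] = df["predicted_sentiment"].map(label_map).astype(int)

# Now the empty-value check the evaluation performs works fine
assert not np.isnan(df["recommended_label_n"].to_numpy(dtype=float)).any()
```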
The evaluation recipe in Dataiku takes an evaluation dataset and a model. But, because a plugin is considered an “External Model,” I needed to use a SER (Standalone Evaluation Recipe). This recipe uses the evaluation dataset as input, and its output is an Evaluation Store. Every time the evaluation runs, a new Model Evaluation is added to the Evaluation Store. The evaluation dataset needs to contain the predicted value. Most of the time, a field with the “ground truth” will be necessary to compare against and get model performance metrics, but even without a “ground truth” field it is still possible to use the SER component, as some of the analysis (like drift and prediction drift) remains possible. I left the sampling and evaluation naming at their defaults.
I configured the model part as follows:
- prediction type = two-class classification as we are predicting positive/negative
- prediction column = empty; as this is a classification problem, we use the Probabilities section to specify the classes being predicted
- labels column = recommended_label_n
- Probabilities = 1 and 0; both in the column predicted_sentiment_n
Above: The configured parts
For the cost matrix, I assigned the same weights to true and false outcomes, as I’m giving the same priority to both kinds of prediction error. In other use cases, it is possible to play with these values if the business penalises false negatives or false positives more heavily. Speaking of false negatives and false positives, this is one of my favourite stats and data science memes of all time, as it clearly explains these concepts:
For example: in a customer churn model, I could assign a higher cost to predicting that a customer will not churn when they actually do (a Type II error, or false negative). I could set this “predicted false, but actually true” cell to a gain of -5, considering that once a client has churned it is very difficult to win them back.
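To make the weighting concrete, here is a toy calculation of how a cost matrix turns confusion-matrix counts into a single gain figure. The counts and weights are made up, with the -5 false-negative gain taken from the churn example above:

```python
# Hypothetical confusion-matrix counts for a churn model
tp, tn, fp, fn = 80, 90, 10, 20

# Gain per outcome; missed churners (false negatives) are penalised at -5
gain = {"tp": 1, "tn": 1, "fp": 0, "fn": -5}

total_gain = (tp * gain["tp"] + tn * gain["tn"]
              + fp * gain["fp"] + fn * gain["fn"])
avg_gain = total_gain / (tp + tn + fp + fn)
print(total_gain, avg_gain)  # 70 0.35
```

With equal weights everywhere, this average gain would just track accuracy; the asymmetric -5 is what steers the model toward catching churners.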
Analysing the Model Evaluation
On the first run, the threshold of 0 was found by optimising the F1-score, which is the harmonic mean of precision and recall.
As a side note:
- Precision: the proportion of correct predictions among all records predicted “positive.”
- Recall: the proportion of “positive” records that were correctly predicted as “positive.”
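These definitions translate directly into a few lines of code. The counts below are illustrative only, not the actual values from this run:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and their harmonic mean (the F1-score)."""
    precision = tp / (tp + fp)   # correct positives among predicted positives
    recall = tp / (tp + fn)      # correct positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy confusion-matrix counts
p, r, f1 = precision_recall_f1(tp=30, fp=70, fn=70)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.3 0.3 0.3
```

Being a harmonic mean, F1 sits close to the smaller of precision and recall, so a model cannot hide a terrible recall behind a great precision (or vice versa).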
For those who want to dive a bit deeper into the stats behind this, here is a fantastic resource that discusses the evaluation metrics in Dataiku.
Looking at the confusion matrix for this first run, it didn’t look overly encouraging: the F1-score, precision, and recall all came in at roughly 19%, and the accuracy at only 15%.
Our dataset is fairly balanced, with ~51% positive reviews and ~49% negative ones, so in this case we can count on accuracy as a fair metric for evaluating this model. In highly unbalanced datasets, however, accuracy is not a great metric. For example, in a dataset where 90% of reviews are positive, a model that always predicts positive would hit 90% accuracy or more, but it would not automatically be a great model because it would almost never predict the negative class.
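The 90%-positive example above is quick to verify with a made-up dataset:

```python
# 90% positive reviews: always predicting "positive" still scores 90% accuracy
labels = ["positive"] * 90 + ["negative"] * 10
predictions = ["positive"] * 100  # a "model" that never predicts the negative class

accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
negatives_caught = sum(p == "negative" for p in predictions)
print(accuracy, negatives_caught)  # 0.9 0
```

High accuracy, yet zero negative reviews caught, which is why class balance has to be checked before trusting the metric.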
Above: The Confusion Matrix
Although the predictions didn’t really hit the mark on this first go, the exercise inspired several ideas that I want to investigate. On one hand, I’d like to look into what the plugin is doing and see whether there are a few parameters I could tune to increase the accuracy. On the other, I really want to explore the customer reviews with a few other techniques. One NLP technique that I have applied before is called “bag of words.” The bottom line is, you start by extracting features, or words, from the text, dismissing all grammatical structure and word order; the idea is just to obtain the words and their counts within the text. These are then compared with two baseline word lists (or bags of words), one containing only words associated with positive sentiment and the other containing words associated with negative sentiment. This gives each customer review a score based on how many positive or negative occurrences the text contains.
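The bag-of-words scoring described above can be sketched in a few lines. The word lists here are tiny, made-up stand-ins; a real analysis would use established sentiment lexicons:

```python
# Toy lexicons standing in for real positive/negative word lists
POSITIVE = {"great", "comfortable", "friendly", "delicious"}
NEGATIVE = {"delayed", "rude", "cramped", "lost"}

def bag_of_words_score(review: str) -> int:
    """Score = positive-word count minus negative-word count.

    Grammar and word order are deliberately ignored; only
    word occurrences in the text matter.
    """
    words = review.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(bag_of_words_score("friendly crew but delayed and cramped seats"))  # -1
```

Even this crude version makes the tradeoff clear: it is fast and interpretable, but phrases like “not great” would score as positive because negation is thrown away with the word order.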
Natural language processing is quite a difficult task, but also a very exciting one with lots of possibilities. Part of data science is experimenting with the data and the algorithms, searching for ways to get better predictions. Dataiku offers a great framework and tool for making this experimentation easier! Once I’m ready to start another experiment, or change features or parameters in my pipeline, I can run it with the new configuration, evaluate it, and easily come back to compare it with this first attempt.
Are you excited to try out Dataiku? Contact us to chat about how we can empower you to start your data science journey today.