This is the first in a blog series aimed at highlighting the use of natural language processing in Dataiku. Stay tuned to the blog for more installments coming soon!
I recently travelled from Sydney to Mexico to visit my family, and my last flight leg (Dallas -> Mexico) was cancelled due to bad weather. As the frustration of everyone affected overflowed and people complained loudly to the customer service staff, I wondered what kind of reviews this airline was going to get with so many angry customers. That wondering evolved into actually wanting to analyse the data and see what I could find.
One of my favourite data science areas is natural language processing (NLP). To me, making a computer understand natural language matters enormously because of everything it enables downstream: sentiment analysis, topic analysis, chatbots, and more, with applications ranging from business to social good. I’ve heard of chatbots for social good being trialled in community engagement, public services FAQs, and even personalised education. Recently, I read about a project applying topic analysis to medical research papers to help categorise them more efficiently and reliably. NLP adds useful numeric structure to text data, helps resolve ambiguity in language, and opens up many more ways to make better use of the text data around the world.
With the inspiration that natural language processing brings, and the idea of analysing reviews of different airlines around the world, I set out to find a dataset to experiment with, and of course Kaggle didn’t disappoint: the Skytrax Airline Reviews dataset is a great one for getting started with some NLP analysis.
The pipeline for this project ended up looking like this:
Dataiku pleasantly surprised me. I had initially investigated the dataset with external Python scripts and found that half of the records were empty. With Python, cleaning them was easy enough. But when I uploaded the .xlsx file to Dataiku, I literally didn’t have to do anything: it cleaned out these empty records automatically!
Above: Some of the cleaned records.
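For comparison, the external Python cleanup I mention above boils down to something like the following. This is a minimal sketch, assuming pandas and made-up column names; the real dataset has many more fields:

```python
import pandas as pd

# A tiny stand-in for the raw Skytrax export, with some fully empty rows
df = pd.DataFrame({
    "airline_name": ["qantas", None, "aeromexico", None],
    "content": ["Great flight.", None, "Lost my bag.", None],
})

# Drop rows that are entirely empty -- roughly what Dataiku did for me on upload
clean = df.dropna(how="all").reset_index(drop=True)
```

With `how="all"`, only rows where every column is empty are dropped, so partially filled reviews survive.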
Initial Data Prep
Before using the evaluation recipe, there was some data cleanup to do: I had to clean up the recommended label and the review date.
I created a feature recommended_label, mapping yes = positive and no = negative:
I might want to do some analysis with the customer review date later, so I first cleaned it using a regex (to strip the letters from the day, e.g. 3rd, 4th, 21st). While analysing, I discovered a few typos, Augu instead of August, and cleaned those as well. Then I used the Parse Date recipe to get the date parsed:
Above: And after.
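In plain Python, these prep steps would look roughly like this. A sketch only: the column names and date layout are assumptions, and Dataiku's Prepare and Parse Date recipes do the real work in the pipeline:

```python
import pandas as pd

df = pd.DataFrame({
    "recommended": ["yes", "no", "yes"],
    "review_date": ["3rd August 2015", "21st Augu 2015", "2nd July 2015"],
})

# Map the recommendation flag to a sentiment-style label
df["recommended_label"] = df["recommended"].map({"yes": "positive", "no": "negative"})

# Strip ordinal suffixes (3rd -> 3, 21st -> 21) and fix the "Augu" typo
df["review_date"] = (
    df["review_date"]
    .str.replace(r"(\d+)(st|nd|rd|th)", r"\1", regex=True)
    .str.replace(r"\bAugu\b", "August", regex=True)
)

# Parse into proper dates (the equivalent of the Parse Date recipe)
df["review_date"] = pd.to_datetime(df["review_date"], format="%d %B %Y")
```

The `\b` word boundaries keep the typo fix from touching correctly spelled occurrences of "August".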
Using the Sentiment Analysis Plugin and Preparing to Evaluate Its Results
One of the advantages of Dataiku is the ability to extend its power with plugins. A plugin is a package of components that can be reused and even shared with others. Dataiku has a growing number of plugins to get you started with different types of analysis, such as geospatial, NLP, deep learning, and computer vision, amongst many more. You can find the current list of available plugins here.
In this case I wanted to do a sentiment analysis on the customer reviews, and I found a great plugin to try out called, fittingly, “Sentiment Analysis.”
I used the recipe on the customer reviews with a binary output (positive/negative) and prediction confidence scores. Skimming through some records, the predictions looked doubtful: I saw records labelled positive where the overall recommendation was “no.” I wanted to investigate further and put some statistics around this. I was interested in visualising a confusion matrix, using the recommended feature as a proxy ground truth for the evaluation, while acknowledging that a more accurate approach would be a human-labelled dataset with positive/negative noted per record.
Above: What I settled on for this section.
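Outside Dataiku, the proxy evaluation amounts to comparing the plugin's prediction against the recommended flag. A sketch with scikit-learn and invented toy labels:

```python
from sklearn.metrics import confusion_matrix

# Proxy ground truth from the "recommended" flag vs. the plugin's prediction
y_true = ["positive", "negative", "negative", "positive", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative"]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
print(cm)
```

Here the off-diagonal cells are exactly the doubtful cases: reviews predicted positive whose overall recommendation was "no", and vice versa.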
In order to perform the evaluation, I needed a few more data prep steps after the prediction. A few records did not have a value in the “recommended” feature, so I filtered those out, as we will not be using them for our model evaluation:
The evaluation recipe uses the NumPy function np.isnan() to detect empty values in the records being evaluated. As our current classes have the string values “positive” and “negative,” I added an extra prep step to map these to 1 and 0 so the evaluation step would work correctly. I also made sure the storage type of both columns was integer, as I got a Python error when I accidentally left one as string:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced
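A minimal illustration of that mapping (the column names match my project; the toy values are invented). np.isnan() only accepts numeric input, which is why the string labels have to become integers first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "recommended_label": ["positive", "negative", "positive"],
    "predicted_sentiment": ["positive", "positive", "negative"],
})

# np.isnan() rejects string (object) arrays with the ufunc TypeError above,
# so map both classes to integers and store them with an int type
mapping = {"negative": 0, "positive": 1}
df["recommended_label_n"] = df["recommended_label"].map(mapping).astype(int)
df["predicted_sentiment_n"] = df["predicted_sentiment"].map(mapping).astype(int)

# Now the emptiness check the recipe performs works as expected
assert not np.isnan(df["recommended_label_n"]).any()
```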
The evaluation recipe in Dataiku takes an evaluation dataset and a model. But because a plugin is considered an “External Model,” I needed to use a Standalone Evaluation Recipe (SER). This recipe takes the evaluation dataset as input, and its output is an Evaluation Store. Every time the evaluation runs, a new Model Evaluation is added to the Evaluation Store. The evaluation dataset needs to contain the predicted value. Most of the time, a “ground truth” field will also be necessary to compare against and compute model performance metrics, but even without one it is still possible to use the SER component, as some of the analysis (like drift and prediction drift) remains possible. I left the sampling and evaluation naming at their defaults.
I configured the model part as follows:
- prediction type = two-class classification, as we are predicting positive/negative
- prediction column = empty; as this is a classification problem, we use the Probabilities section to specify the classes being predicted
- labels column = recommended_label_n
- Probabilities = 1 and 0, both in the column predicted_sentiment_n