Challenges of NLP and Solutions with Dataiku


In the previous post (which you can read here), we started analysing airline reviews. We did some very basic data preparation, used the Dataiku Sentiment Analysis plugin and evaluated the model with a confusion matrix. In this blog post, we will go one step further: we will apply some Natural Language Processing (NLP) pre-processing tasks, run the Dataiku Sentiment Analysis plugin again and compare the results with the first experiment.

Challenges of NLP

Human language is unstructured and messy, while machine learning relies on finding patterns in the training data. The challenge of NLP is to turn raw text into features that an ML algorithm can process and search for patterns. The most basic way of turning this messy, unstructured data into features is to treat natural language as a collection of categorical features, in which each word is a category of its own. So, for example, these three sentences…

  1. “The airline lost my luggage.”
  2. “This is what happens when traveling with such low quality airlines.”
  3. “Airlines like this should be banned”

…will have 22 features:
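As a rough illustration of where the 22 comes from, here is a minimal Python sketch that treats every whitespace-separated token as its own category. The function name `raw_features` is made up for this example, and this is not how Dataiku builds features internally; it only mirrors the "each word is a category" idea described above.

```python
# The three example sentences from the post.
sentences = [
    "The airline lost my luggage.",
    "This is what happens when traveling with such low quality airlines.",
    "Airlines like this should be banned",
]

def raw_features(texts):
    """Split on whitespace only, keeping case and punctuation untouched."""
    return {token for text in texts for token in text.split()}

features = raw_features(sentences)
print(len(features))     # 22
print(sorted(features))  # note "Airlines", "airline" and "airlines." are all distinct
```

Because nothing is cleaned yet, "This" and "this" count as two features, and so do "airline", "airlines." and "Airlines".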

Because human language is unstructured and messy, we face a few challenges:

  1. Sparse features:
    You will notice that the features of these three sentences are very sparse: each word, as written, appears in only one sentence. This sparsity makes it difficult for the algorithm to find similarities between sentences and detect patterns.
  2. Redundant features:
    All three sentences talk about an airline or airlines, but the word appears once as “airline”, once as the plural “airlines” and once as the capitalized plural “Airlines”. Without some form of pre-processing, these are treated as three separate features.
  3. High dimensionality:
    These three short sentences generated 22 features. If we were to analyse a full paragraph, a full article or even a full book, you can imagine how many thousands or even millions of features we would end up with. This high dimensionality is a problem because the more features there are, the more storage and memory you need to process them. This is why, for text analysis, we would ideally apply some techniques to reduce the dimensionality.
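To make the sparsity point concrete, here is a small Python sketch that builds a binary sentence-by-word matrix for the three sentences and counts how many cells are empty. It is only an illustration under the same "split on whitespace" assumption; the variable names are hypothetical.

```python
sentences = [
    "The airline lost my luggage.",
    "This is what happens when traveling with such low quality airlines.",
    "Airlines like this should be banned",
]

# One column per raw token, one row per sentence.
vocabulary = sorted({token for s in sentences for token in s.split()})
matrix = [
    [1 if word in s.split() else 0 for word in vocabulary]
    for s in sentences
]

total_cells = len(matrix) * len(vocabulary)   # 3 sentences x 22 features
nonzero_cells = sum(sum(row) for row in matrix)
zero_cells = total_cells - nonzero_cells

print(total_cells, nonzero_cells)             # 66 22
print(f"{zero_cells / total_cells:.0%} of the cells are zero")  # 67%
```

Every feature is used by exactly one sentence, so no two rows share a single column, which is exactly what makes it hard to find similarities.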

Pre-Processing Steps To Deal With These Challenges

To deal with these three problems, there are three basic and very valuable data cleaning techniques for NLP:

1. Normalizing text

The objective of this technique is to transform the text into a standard format, which includes:

  • converting all characters to the same case
  • removing punctuation and special characters
  • removing diacritical accent marks

So, for our three sentences, this step takes us from 22 features to 20, because “This” and “this” become one feature, as do “airlines.” and “Airlines”.
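A minimal sketch of the three normalization steps in plain Python could look like the following. The `normalize` helper is an assumption made for this post, using only the standard library; it only strips ASCII punctuation and is not the Dataiku processor itself.

```python
import string
import unicodedata

sentences = [
    "The airline lost my luggage.",
    "This is what happens when traveling with such low quality airlines.",
    "Airlines like this should be banned",
]

def normalize(text):
    """Lowercase, strip diacritical marks, and remove ASCII punctuation."""
    text = text.lower()
    # Decompose accented characters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove punctuation characters such as "." and ",".
    return text.translate(str.maketrans("", "", string.punctuation))

features = {token for s in sentences for token in normalize(s).split()}
print(len(features))           # 20
print(normalize("Café, Olé!")) # cafe ole
```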

2. Removing stop words

Stop words are a set of commonly used words in any language. In this step, we remove them to be able to focus on the important words that convey more meaning.

In our example, this step takes us from 20 features to 10.
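Here is a small Python sketch of this step. The `STOP_WORDS` set below is a hand-picked, illustrative subset that happens to cover our three sentences; a real project would use a much longer list from an NLP library or from the tool itself.

```python
import string

sentences = [
    "The airline lost my luggage.",
    "This is what happens when traveling with such low quality airlines.",
    "Airlines like this should be banned",
]

# Illustrative subset only (an assumption for this example).
STOP_WORDS = {"the", "my", "this", "is", "what", "when", "with", "such", "should", "be"}

def tokenize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

features = {t for s in sentences for t in remove_stop_words(tokenize(s))}
print(len(features))     # 10
print(sorted(features))
```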

3. Stemming

This step reduces each word to its “stem.” So, for example, “airlines” and “airline” both become “airlin,” which merges them into a single feature. This takes us down to nine features.
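To show the idea, here is a deliberately naive suffix-stripping stemmer in Python. It is a toy written for this post, not the Porter algorithm or whatever stemmer Dataiku applies, so its stems for other words will differ from those of a real stemmer. It is just enough to merge “airline” and “airlines.”

```python
# The ten features left after removing stop words.
words = ["airline", "lost", "luggage", "happens", "traveling",
         "low", "quality", "airlines", "like", "banned"]

def toy_stem(word):
    """Strip one common suffix, keeping at least three characters (toy rule)."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {toy_stem(word) for word in words}
print(toy_stem("airline"), toy_stem("airlines"))  # airlin airlin
print(len(stems))                                 # 9
```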

Cleaning Our Airline Reviews

Let’s apply some of these techniques to our airline reviews, and then process them again with the Sentiment Analysis plugin to compare the results with the previous run.

1. The first step is to analyse the data. We can use values clustering on the “customer_review” column. Dataiku identifies a few clusters with very similar reviews:

2. From the previous analysis, we can see that in a few of the records the review starts with a note saying whether the trip is verified or not. This adds no value to the sentiment analysis, so the first pre-processing step is to get rid of that part of the first sentence. For this, we use a “Find and Replace” data processor to replace the following strings:

✅ Trip Verified | → No Value
✅ Verified Review | → No Value
Not Verified | → No Value
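For readers who want to try the same clean-up outside the visual recipe, a rough Python equivalent is sketched below. The function name and the sample review are made up for illustration; in the flow itself this is done with the “Find and Replace” processor, not with code.

```python
# The three strings replaced with no value in the recipe.
VERIFICATION_MARKERS = (
    "✅ Trip Verified |",
    "✅ Verified Review |",
    "Not Verified |",
)

def strip_verification(review):
    """Remove the verification marker and surrounding whitespace from a review."""
    for marker in VERIFICATION_MARKERS:
        review = review.replace(marker, "")
    return review.strip()

# Hypothetical review text, only to show the effect.
sample = "✅ Trip Verified | The flight was delayed by two hours."
print(strip_verification(sample))  # The flight was delayed by two hours.
```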

Just to review the recipe, I’m using an output column, but before running and saving I’ll delete it so that the replacement is done in place. The rest of the flow uses the feature “customer_review.”