Challenges of NLP and Solutions with Dataiku


In the previous post (which you can read here), we started analysing airline reviews: we did some very basic data preparation, used the Dataiku Sentiment Analysis plugin and evaluated the model with the help of a confusion matrix. In this blog post, we will go one step further and apply some Natural Language Processing (NLP) pre-processing tasks, then use the Dataiku Sentiment Analysis plugin again and compare the results with the first experiment.

Challenges of NLP

Human language is unstructured and messy, while machine learning is based on finding patterns in training data. The challenge of NLP is to turn raw text into features that an ML algorithm can process and search for patterns in. The most basic approach to turning this messy, unstructured data into features is to treat natural language as a collection of categorical features, in which each word is a category of its own. So, for example, these three sentences…

  1. “The airline lost my luggage.”
  2. “This is what happens when traveling with such low quality airlines.”
  3. “Airlines like this should be banned”

…will have 22 features between them, one per distinct token.

Because of the unstructured and messy nature of human natural language, we face a few challenges, such as:

  1. Sparse features:
    You will notice that the features of these three sentences are very sparse: almost every word appears in only a single sentence. This sparsity makes it difficult for the algorithm to find similarities between sentences and identify patterns.
  2. Redundant features:
    The three sentences all talk about an airline, but given that two of those words are plural and one is capitalized, without some form of pre-processing they are treated as three separate features.
  3. High dimensionality:
    These three short sentences generated 22 features. If we were to analyse a full paragraph, a full article or even a full book, you can imagine how many thousands or even millions of features we would end up with. This high dimensionality is a problem because the more features there are, the more storage and memory you need to process them. This is why, for text analysis, we would ideally apply some techniques to reduce the dimensionality.
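To make these numbers concrete, here is a minimal sketch (plain Python, no NLP libraries; the helper names are mine) that treats every distinct token as its own feature, reproducing the 22 features and their sparsity:

```python
def tokenize(sentence):
    # Split on whitespace and strip surrounding punctuation, but keep the
    # original casing -- "Airlines" and "airlines" remain separate features.
    return [word.strip('.,!?') for word in sentence.split()]

sentences = [
    "The airline lost my luggage.",
    "This is what happens when traveling with such low quality airlines.",
    "Airlines like this should be banned",
]

vocabulary = sorted({token for s in sentences for token in tokenize(s)})
print(len(vocabulary))  # 22 distinct features for just three short sentences

# Each sentence becomes a mostly-zero count vector over those 22 features:
vectors = [[tokenize(s).count(feature) for feature in vocabulary]
           for s in sentences]
```

Note how sparse the result is: the first sentence's vector has only five non-zero entries out of 22.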

Pre-Processing Steps To Deal With These Challenges

To deal with these three problems, there are three basic and very valuable data cleaning techniques for NLP:

1. Normalizing text

The objective of this technique is to transform the text into a standard format, which includes:

  • converting all characters to the same case
  • removing punctuation and special characters
  • removing diacritical accent marks

So, in the case of our three sentences, this step takes us from 22 features to 20.
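A sketch of this normalization using only the Python standard library (the function name is mine, not Dataiku's): lowercasing and accent removal merge “This”/“this” and “Airlines”/“airlines,” which is exactly the drop from 22 to 20 features.

```python
import string
import unicodedata

def normalize(text):
    # Lowercase, strip diacritics via Unicode NFD decomposition
    # (e.g. "Café" -> "cafe"), then drop punctuation characters.
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text.translate(str.maketrans("", "", string.punctuation))

sentences = [
    "The airline lost my luggage.",
    "This is what happens when traveling with such low quality airlines.",
    "Airlines like this should be banned",
]

features = sorted({w for s in sentences for w in normalize(s).split()})
print(len(features))  # 20, down from 22
```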

2. Removing stop words

Stop words are a set of commonly used words in any language. In this step, we remove them to be able to focus on the important words that convey more meaning.

In our example, this step takes us from 20 features to 10.

3. Stemming

This step transforms each word into its “stem” word. So, for example, “airlines” and “airline” will both become “airlin.” Now, we will go down to nine features.
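Steps 2 and 3 can be sketched together as follows. The stop-word list and the suffix rules below are deliberately tiny illustrations chosen to match the counts above; real stop-word lists (e.g. NLTK's) and real stemmers (e.g. Porter's) are far more thorough.

```python
# A small illustrative stop-word list; production lists are much longer.
STOP_WORDS = {"the", "my", "this", "is", "what", "when",
              "with", "such", "should", "be"}

def toy_stem(word):
    # Crude suffix stripping, only to show the idea:
    # "airlines" -> "airline" -> "airlin", matching "airline" -> "airlin",
    # so the two features merge into one.
    if word.endswith("s"):
        word = word[:-1]
    if word.endswith("e"):
        word = word[:-1]
    return word

# The 20 normalized features from the previous step:
features = ["the", "airline", "lost", "my", "luggage", "this", "is",
            "what", "happens", "when", "traveling", "with", "such",
            "low", "quality", "airlines", "like", "should", "be", "banned"]

content_words = [w for w in features if w not in STOP_WORDS]
print(len(content_words))   # 10 features left after removing stop words

stems = sorted({toy_stem(w) for w in content_words})
print(len(stems))           # 9: "airline" and "airlines" merge into "airlin"
```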

Cleaning Our Airline Reviews

Let’s apply some of these techniques to our airline reviews, and then process them again with the Sentiment Analysis plugin to compare the results with the previous run.

1. The first step is to analyse the data. We can use values clustering on the “customer_review” column. Dataiku identifies a few clusters with very similar reviews.

2. From the previous analysis, we can see that a few of the records start the review with a sentence indicating whether the trip is verified. This adds no value to the sentiment analysis, so the first pre-processing step is to get rid of that part of the first sentence. For this, we use a “Find and Replace” data processor to replace the following strings:

  • “✅ Trip Verified |” → no value
  • “✅ Verified Review |” → no value
  • “Not Verified |” → no value

Just for reviewing the recipe, I’m using an output column, but before running and saving, I’ll delete it so that the replacement is done in place. The rest of the flow uses the “customer_review” feature.
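Outside of Dataiku, the same prefix stripping could be done with a regular expression. The pattern and function name below are my own approximation of the three strings above, not part of the plugin:

```python
import re

# Hypothetical stand-in for the "Find and Replace" processor: strip the
# verification prefix from the start of each review.
VERIFIED_PREFIX = re.compile(
    r"^(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)

def strip_verification(review):
    return VERIFIED_PREFIX.sub("", review)

print(strip_verification("✅ Trip Verified | The airline lost my luggage."))
# -> The airline lost my luggage.
```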

3. The next step is to apply the Simplify text processor, which contains four different processes for simplification of text columns:

  • Normalize text: Transform to lowercase, remove punctuation and accents and perform Unicode NFD normalization (like Café -> cafe).
  • Stem words: Transform each word into its “stem,” i.e. its grammatical root. For example, “grammatical” is transformed to “grammat.” This transformation is language-specific.
  • Clear stop words: Remove so-called “stop words” (the, I, a, of, …). This transformation is language-specific.
  • Sort words alphabetically: Sorts all words of the text. For example, “the small dog” is transformed to “dog small the,” allowing strings containing the same words in different order to be matched.

Once more, I used “customer_review_simplified” as the output column for reviewing, but before running the recipe, I will delete it so that the simplification is done in place on the “customer_review” column.
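The fourth option is worth a quick sketch of its own, since it is what lets reviews with the same words in a different order match (the function name is mine):

```python
def sort_words(text):
    # "the small dog" and "dog the small" both become "dog small the", so
    # strings containing the same words in a different order compare equal.
    return " ".join(sorted(text.split()))

print(sort_words("the small dog"))  # -> dog small the
```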

Run the ML Model With the Clean and Simplified Reviews

Once we have finished our pre-processing, let’s run the model again using the Dataiku Sentiment Analysis plugin and compare the results. One of my favourite components of Dataiku is the Model Evaluation Store, which saves the evaluation metrics every time it runs. This makes it very easy to review previous runs and compare performance. Let’s look at the confusion matrices from the previous blog post and from the current run, after we cleaned and simplified the text:

Previous run (2023-03-11 00:44:53) vs. current run with cleaned and simplified customer reviews (2023-03-22 19:30:57)

As you can see above, all our metrics increased, and the average gain per record moved into positive territory. This pre-processing for natural language processing was well worth it. A big part of data science is experimenting, and while we did experiment with pre-processing, there are still more ideas to explore! Have a look inside the Dataiku Sentiment Analysis plugin, use other algorithms and apply other pre-processing techniques. With Dataiku, we can do as much experimenting as we need, and it provides us with the framework to easily compare experiments and go back to earlier versions when we need to. Dataiku makes data science easier!

Are you excited to try out Dataiku? Contact us to chat about how we can empower you to start your data science journey today.

More About the Author

Azucena Coronel

Services Lead
