Intro to Natural Language for Business Intelligence

Transcript
Good afternoon or good morning, depending on where in the world you're joining us from today. Welcome to today's webinar, Intro to Natural Language for Business Intelligence. We're delighted to have you on this call. Before we start, I wanted to run through a little housekeeping. Throughout the webinar, everyone's lines will remain muted. You can ask questions in the chat or, alternatively, in the Q&A box, which you'll find at the bottom of the Zoom portal. Please feel free to ask any questions; we'll either answer them while the webinar is running or, if we can't get to them during the webinar, we'll answer them at the end. Thank you for bearing with us.

The screen you're seeing in front of you at the moment is a quick rundown of InterWorks' own solutions. We have InterWorks Assist, which gives you remote access to consultants: you can jump online and book a one-to-one slot with one of our skilled consultants, who can support you with your specific Tableau or data-related query. We have Curator by InterWorks, our fully branded, customizable data analytics portal. And we have ServiceCare, our managed server solution for Tableau software.

Moving on, here's a high-level overview of what you need to know about InterWorks. Hopefully there are a lot of returning faces as well as some new ones. Our main role in the world of data analytics is to do the best work with the best people and the best clients. We work across the entire data landscape for our clients and partner with some of the best-of-breed solutions out there, from the likes of Snowflake, Matillion, Dataiku, and many more. We were the first-ever Tableau Gold Partner, and we operate globally, with offices in the US, the UK, the DACH region, Singapore, and Australia, so there truly is a lot of support available to you. We also have a world-famous blog, with new content released daily, so if you haven't already, or you're looking for help on any topic, please do head over to the InterWorks blog. A reminder as well that this webinar is being recorded; we'll send a follow-up email in the next couple of days with a link to the recording, so you can rewatch or share this webinar with colleagues if you choose.

Next slide, please, Jack. Wonderful. Here is the array of solutions we offer. As I said, we cover the entire data landscape, from IT to custom solutions to analytics, and we are one hundred percent focused on enabling our clients to walk away from relying on another party and to manage their analytics in-house. So if any of the boxes you're seeing in front of you tick any of your boxes, please do get in contact with us. We're more than happy to help.

And finally, it gives me great pleasure to introduce Jack. Jack Hulbert will be leading today's session. He's one of our analytics consultants and is based in the US. And the person who's been wittering on to you for the last couple of minutes is myself, Vicky Lockett, the marketing manager for Europe. So with that, I'm going to pass you over to Jack. If you're happy to start, let's get started.

Yes, thank you, Vicky, and thank you, everybody, for joining today. I'm Jack Hulbert, an analytics consultant here at InterWorks, calling in from Portland, Oregon.
Today we're going to talk about natural language for business intelligence, including a little about why I care about this topic and why you should, too. By way of background, my interest in natural language processing actually predates my career in analytics. I was once an aspiring journalist, primarily interested in building tools that could help other journalists translate and identify trustworthy sources of information. What really led me down the rabbit hole of natural language processing was wanting to help others understand complex text-based data sources, get insight from them, and make the analysis of those sources more efficient. Lo and behold, I've now been in the business intelligence space for about four years. Prior to InterWorks, I worked as a consultant for a company that built natural language processing and text analytics tools, particularly for business intelligence teams. In that role, I was often tasked with helping BI teams enrich their reporting datasets with NLP outputs, bringing new capabilities and filters not only to their analytics teams but also to their other end users, making their lives easier as consumers of the data and users of the tools. So that's a little background on why this topic has come to the forefront for me.

Today we're going to dive into the world of natural language processing and some popular industry use cases for NLP. We'll break it down by sector and by problem space, name a few data sources, and look at how BI teams can leverage NLP in what I'd consider a standard BI workflow, using tools like Tableau, data preparation tools, data warehouses, and ETL-type platforms. We'll talk about where NLP can fit into each of those steps of the BI process.

A little about NLP itself. It's an essential layer powering the technologies that help us communicate, retrieve, and classify the information we encounter in our everyday lives. In recent years, the number of use cases for NLP has grown as government entities, companies, and researchers have begun to explore the vast quantities of text data we generate every day. And as those use cases have grown, so has the demand to make this information accessible and analyzable for people like data analysts and business intelligence teams. There's now a growing demand, and a growing supply, of tools on the market that make these tasks easier, and we're going to go through some of them.

A few things I want to accomplish today: I want to leave you with a basic understanding of the various natural language processing tasks; we'll cover a few main ones that are particularly relevant to business intelligence and structured text data. We'll also identify a couple of common data sources that can be augmented with NLP quite easily, and look at where NLP fits in the modern BI technology stack.

At a glance, natural language processing is an umbrella term for a suite of data processing tasks and algorithms designed to help computers derive and understand the patterns of our complicated language.
Think about teaching a toddler a language: you teach them what nouns, verbs, and qualifying words are, what possessives are, all the idiosyncrasies and rules of our language. Natural language processing does the same thing, except we're building those rules into software and telling a computer how to interpret them. It's the systematic breaking down of language into smaller chunks so that machines can interpret them and help with modeling and the other tasks we'll use them for. The goal of NLP is to break down these complexities and idiosyncrasies of our language to help us better understand, summarize, and classify data that contain free text and lack structure. Most formal applications of NLP are done on unstructured, disparate datasets that we typically don't come across in the business intelligence realm, where we live in a world of much more structured data. But there are still plenty of opportunities and methods we can use to extract value from the text data we do have.

This all matters because, according to IBM research published a few years ago, eighty percent of the data generated is text data. In the BI realm, we typically work with spreadsheets, data warehouses, and database tables, but that world accounts for only a small subset of all data collected. The vast majority, which is often left untouched, is text data. The reason much of this data goes untouched from an analytics perspective is that it has often been difficult to determine the strategic value of looking into it, and then to justify the cost of exploring it. But like I said, within this eighty percent there are plenty of opportunities for individuals with business intelligence skills to start extracting relevant, valuable information that can drive value for the business and its stakeholders and feed into your analytics program.

Customer feedback and product reviews are one example of what sits in this bottom eighty percent of the iceberg. Surveys and social media interactions are a massive amount of data that usually arrives in a structured format with free-text fields, and highly subjective free text at that. There are also phone calls, emails, and chat interactions in a customer service function; all of those capture the relationship between companies and customers in a more candid, nuanced way, and all of that can come in as free text. And lastly, the more difficult sources in this cohort, the ones that usually arrive in a less structured format, are documents and things like change logs and generic notes, the files that just sit on your computer and that you don't really know what to do with. That's technically in this text data cohort too. So there are a lot of areas within this eighty percent where we could apply NLP in a BI context.

Natural language processing is also all around us. It powers the technologies we've become heavily reliant on in our personal and work lives. One example: many Tableau users have probably come across an issue where you're trying to build a calculated field or write a piece of code, and you're not quite sure what the bug is or what's wrong with it.
Then you type a convoluted search into Google using some random keywords, and somehow the Stack Overflow article that answers your question exactly pops to the top, even though you only provided two or three keywords. It's almost like it's reading your mind. That's NLP at work: it's matching the keywords you type into the search bar against those searched by other users and bringing the most popular content to the top. That's a very simplified overview of how that works using NLP.

Another example is Amazon Alexa, Apple Siri, and similar smart assistant and smartwatch services. That's also NLP. What's happening there is essentially taking your voice, the language you're speaking, transcribing it into text data, and then pulling apart that piece of text to find the most important pieces of information. For instance, if I were to ask my Apple Watch about the weather in my city, say, "weather Portland, Oregon," it would take the context of those two keywords used together to produce a response or a query that gives me the answer. That's how that typically works.

One growing use case, especially relevant if you've applied for a job in the last five or so years, is resume parsing software. If you've ever uploaded your resume to apply for a job and noticed that it parses out all of the relevant information, that's natural language processing at work. It's extracting the themes or keywords that have been identified as important and putting them into a data structure for the end user to view and sift through. I know those don't always work perfectly, but it's another example of NLP in the wild.

While NLP has traditionally lived within software engineering and data science, it has become more accessible, and there's an increasing demand for business-facing analysts and developers to have a working knowledge of NLP: knowing how to leverage its outputs to make the lives of customers and non-technical users easier, and helping to innovate and add capabilities to the data tools and data products we build. In fact, due to the highly subjective nature of text analysis and its outputs, and the quality control measures needed, analytics teams are ideal candidates to serve as the liaison between technical teams and business-facing, non-technical stakeholders. They can help define and control what is measured, which NLP tasks are performed, which outputs will give end users the most insight while staying efficient from an engineering perspective, and where to place all of that in the overall flow of data. The analytics professional sits at the middle of this Venn diagram, if you will.

In the wild, there's a plethora of domains and sub-domains where NLP can be applied, different problem spaces. Ultimately, the problem NLP helps solve is one of data volume and high subjectivity. Here are a few areas where NLP is applied across different functions of the business. Marketing is probably the most popular, with NLP used for social media and brand monitoring analysis.
There are plenty of tools out there that use NLP to analyze incoming Twitter, Instagram, and Facebook comments, doing things like sentiment analysis and keyword mining and helping build reporting capabilities on top. Survey analysis is another very popular one. Surveys often have numeric scores to rate an experience or answer a question on a one-to-ten scale, and alongside them a free-text field that lets the respondent provide more context and a more candid response. Those are prime candidates for NLP tasks, because they can surface the additional insight that isn't told by a numeric score alone. Product management is a growing space, especially digital product management, where a lot of consumer and user feedback is collected. Maybe you have a help desk with tickets at different priorities; there's often free text collected there that describes issues in more detail, and that's also prime, well-structured data for natural language processing. And probably one of the oldest examples in the business realm is customer service. A lot of the more legacy enterprise NLP tools were designed for customer service and call centers, so that companies could understand what kinds of problems people were calling about by listening to the voice of the customer and transcribing it to text, and could also monitor the compliance of their employees. Really, NLP exists in every sub-domain; these are just some of the more popular ones.

Now let's break down the most important types of NLP methods for the business intelligence realm. There are plenty out there in the wild, but these are the three you'll come across the most and that are most applicable to the BI workflow.

The first is data labeling. Essentially, data labeling takes your dataset and increases its dimensionality, widening it with natural language outputs. For instance, say you have a survey dataset with a free-text response. You run it through a few NLP tasks, and they append their outputs to each record: sentiment, for example, classifying a piece of text as negative, positive, or neutral, or extracting relevant keywords. That augments your dataset with additional fields you can bring into the reporting layer and expose for things like semantic filtering, giving a little more functionality to end users as well as to those procuring the data. (There's a minimal sketch of this idea just after this section.)

The second type of task is classification. Classification takes free-text records and assigns structure, a higher-level grouping, based on the kinds of words used or the topic or theme of the text. For instance, say you wanted to write a program that extracts all instances of product names or certain cities being mentioned; classification would let you label your records by those things and, similar to data labeling, expose that as a filter, giving you more flexibility in how your data is structured.
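To make the data-labeling idea concrete, here is a minimal sketch in Python of appending a sentiment label to a survey dataset. The library choice (pandas plus TextBlob) and the column names and thresholds are illustrative assumptions, not the specific tooling discussed in this webinar.

    # Data-labeling sketch: widen a survey dataset with an NLP output column.
    # Assumes pandas and textblob are installed; column names are hypothetical.
    import pandas as pd
    from textblob import TextBlob

    surveys = pd.DataFrame({
        "respondent_id": [1, 2, 3],
        "score": [9, 3, 7],
        "free_text": [
            "I love this product and would recommend it to anyone.",
            "Setup was painstakingly slow and I was confused the whole time.",
            "It works fine for what I need.",
        ],
    })

    def sentiment_label(text: str) -> str:
        # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive);
        # the 0.1 cutoffs below are arbitrary choices for this example.
        polarity = TextBlob(text).sentiment.polarity
        if polarity > 0.1:
            return "positive"
        if polarity < -0.1:
            return "negative"
        return "neutral"

    # The new column can then be exposed as a filter in the reporting layer.
    surveys["sentiment"] = surveys["free_text"].apply(sentiment_label)
    print(surveys[["respondent_id", "score", "sentiment"]])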
Classification can be especially helpful when we have highly subjective text records: it puts higher-level classifications on nuanced data points and gives them a grouping that helps cut through larger quantities of data.

The last one is summarization, a growing use case in business intelligence. While classification and data labeling typically live in the data processing layer, summarization exists more in the consumption layer. It takes a dataset or a data tool and writes a natural language summary or report using trends or general themes in the data, without an analyst needing to write it themselves. It's an automation of basic report writing, of writing insights, and there are a lot of existing technologies that can help you do this without needing to learn how to code. We'll talk through that a little later.

So those are the three main areas. The first task I want to talk about is sentiment analysis. I'm starting here because it's one of the oldest methods in natural language processing as a whole, and it's one of the more popular places people start when getting into NLP; it's where I started personally. Sentiment analysis is the process of analyzing the emotion within a piece of text and classifying a record as positive, negative, or neutral. By running sentiment analysis on things like social media posts, product reviews, surveys, or customer feedback, businesses can gain valuable insight into brand perception, and, combined with other dimensions in the data, it adds a more subjective filter. In the example on the slide, the two interactions would be classified as positive and negative, respectively. Keywords like "love" and "recommend" skew toward positive sentiment, so identifying them helps the sentiment model label the first record as positive. Similarly, words like "painstakingly" and "confused," which skew toward negativity in our lexicon, would classify the second record as negative. Now imagine you could classify records like this across a dataset of twenty thousand survey responses. You can see how powerful this could be at scale for cutting through large quantities of text data.

Next I want to talk about themes, or topics. There are a few different ways to approach theme and topic analysis in NLP. Some topic models take a more abstract approach, classifying text by how often certain kinds of words appear in conjunction with each other and clustering those into topics. Similar to how our Google search example works, the model compares how often words are used together and treats those clusters as topics, in a more abstract way. This is great because it can cluster records together when you don't know exactly what you're looking for in a text analysis. The only downside of this more abstract approach is that you're letting the model do all the work, and it may not always cluster the data in a way that's meaningful or makes intuitive sense to humans. But at least it's a starting point.
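As a hedged illustration of that abstract approach: a few lines of scikit-learn can cluster free-text records into topics based on word co-occurrence. The sample documents and parameter choices below are invented for demonstration; this is not the tooling used in the talk.

    # Abstract topic-modeling sketch using scikit-learn's LDA.
    # Assumes scikit-learn is installed; docs and n_components are illustrative.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "battery died after two years, dealership replaced the battery",
        "love the fuel economy, great mileage on the highway",
        "infotainment screen froze, a software update fixed the screen",
        "highway mileage is excellent and fuel costs are low",
        "dealer quoted a high price for the battery replacement",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    # Print the top words per discovered topic. As noted above, a human still
    # has to judge whether these clusters are actually meaningful.
    words = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-5:][::-1]]
        print(f"topic {i}: {top}")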
The alternative, which is popular for many business applications, is a custom classification model. This means deliberately defining and controlling what type of language is measured and captured in your text data sources, and then rolling it up into a higher-level classification. In our example, say you're a digital product manager and you want to understand the keywords people use when they talk about pricing or the price perception of your product. The custom approach is to define all the keywords, and permutations of keywords, that you think a consumer would use when talking about that topic, and then start measuring that way. As you learn more and train the model, that list will grow, and you have much more control over it. It's a great alternative if you don't want the abstract approach, and it usually results in higher accuracy. The only downside is that it requires more effort upfront and more maintenance. But that's usually the trade-off with things like theme and topic analysis: accuracy versus time spent.

And the last one is keyword analysis: the automated extraction of the most important words or concepts within a piece of text. Think of named entities, things like cities, states, proper nouns, locations, or company names. There are models that will automate the extraction of all of those for you, which is especially helpful when you're looking for mentions of hyper-specific things across a large number of text records. Oftentimes this feeds into the topic analysis we just went over; it's the underlying layer beneath topic analysis, where you're looking at individual keywords. In this case, things like "software," "videos," and "value," the more objective terms, would be considered keywords in a text analysis.
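For the keyword and named-entity side, here is a minimal sketch using spaCy. Again, this is purely illustrative: the model name and example sentence are assumptions, not the exact tooling from the talk.

    # Named-entity / keyword extraction sketch with spaCy.
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The software adds great value, but the videos buffered "
              "constantly when I watched them in Portland last May.")

    # Named entities: cities, dates, organizations, and so on.
    for ent in doc.ents:
        print(ent.text, ent.label_)   # e.g. "Portland" GPE, "last May" DATE

    # Simple keyword candidates: noun chunks whose head word isn't a stop word.
    keywords = [chunk.text for chunk in doc.noun_chunks
                if not chunk.root.is_stop]
    print(keywords)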
In the BI ecosystem, NLP fits into three different areas, and we're going to talk through the benefits of using NLP in each of them. We'll start in the ETL layer, then move to the data preparation layer and discuss when you'd decide to use NLP there, and then the interaction layer, which has its own narrower use cases.

Starting with the ETL layer. This layer is ideal for natural language processing for a few reasons. For larger data sources, it gives you more ample resources for processing: if you have a large dataset and you're running a large number of NLP tasks, this is a better place to do that at scale than in, say, a data preparation tool running everything locally. Doing it in the cloud, you also have access to many more services and tools, so you don't have to build a model from scratch; you can use a service or a pre-trained model and, with a couple of lines of code, get the outputs you want right in the ETL step. Another benefit of doing it here is that, upon ingestion, you can run the data through an NLP task and have the output of the model or service added to your data storage. Other consumers of the data can then use it for data science, data modeling, or data analysis; whether you're building dashboards or a predictive model, you have access to all those dimensions.

It's not ideal to do NLP in this layer when, A, you don't know exactly which tasks you want to perform. If you're in the exploratory phase of NLP, where you don't know exactly what you want to measure, it's probably not best to start here, because you'll want more control, you'll want to experiment, and you don't necessarily need those resources. And B, when you need to perform a really specific NLP task that isn't generic enough to be placed in the ETL layer.

Some tools that fit in this layer: Python, for one. There are a lot of open-source models, or you can build your own model in Python or PySpark. Google Cloud, Amazon, and Salesforce all have NLP services that you can run your data through to generate natural language outputs, and you can add those into Snowflake tables or wherever your data lives.
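As one hedged example of that managed-service route, Amazon Comprehend can return a sentiment label for each record as data is ingested, before it lands in the warehouse. The snippet below is a sketch, assuming AWS credentials are already configured; the records and the downstream load step are placeholders, not the pipeline described in the talk.

    # ETL-layer sketch: enrich records with a managed NLP service
    # (Amazon Comprehend via boto3) on ingestion, then load to the warehouse.
    # Assumes AWS credentials are configured; records are placeholders.
    import boto3

    comprehend = boto3.client("comprehend", region_name="us-east-1")

    records = [
        {"id": 1, "text": "The checkout flow was confusing and slow."},
        {"id": 2, "text": "Support resolved my issue in minutes, great service."},
    ]

    for record in records:
        resp = comprehend.detect_sentiment(Text=record["text"],
                                           LanguageCode="en")
        # Sentiment is one of POSITIVE / NEGATIVE / NEUTRAL / MIXED.
        record["sentiment"] = resp["Sentiment"]
        record["positive_score"] = resp["SentimentScore"]["Positive"]

    # From here the widened records would be written to Snowflake (or any
    # warehouse) so analysts and data scientists can use the new columns.
    print(records)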
The data preparation layer is also ideal for NLP in a few cases. The benefit of doing it here is that users have more control over the NLP outputs and can set more hyper-specific goals for the analysis. Like I said, if you're performing a really specific NLP task or doing a more ad hoc analysis, you'd probably want to do it in the data preparation layer. For instance, if you had a sample of two hundred surveys, you'd want to bring it into a tool here, because you're only looking at a sample. If it's a survey that's going to change, or it's not a standardized data structure, a more ad hoc approach is probably best. You want to understand the idiosyncrasies of that text and be able to compare the output against the original, and this is a great environment for experimentation and building proofs of concept.

You'd also want to work in the data prep layer if you're unsure which NLP tasks to use on a data source. Again, for experimentation, if you want to do a more exploratory analysis and start running different NLP tasks on your data, this is a great place to start, because you can work with smaller subsets of the data, keep more control, and see how the sausage is made, as I like to say. It also lets you work with end users more easily, giving them samples of text and outputs to see whether they'll meet their needs, say, if you're building a dashboard that's going to have a natural language output or a semantic filter. The other benefit of the data preparation layer is that a lot of tools, such as Alteryx, Dataiku, and even Tableau Prep, have built-in text cleaning functions. We'll talk about text preprocessing and text cleaning shortly, but these tools have built-in functions so you don't have to write any code to do the cleaning and preparation for these NLP tasks, which is extremely helpful if you're on a tight timeline or your team doesn't want to write and maintain all that code.

A less common but growing use case is in the analytics interaction layer. In recent years in the BI community, there has been a push for no-code tools that let dashboard builders augment their dashboards with a service that writes natural language summaries from the data in the dashboard and distributes them to users through channels such as Slack or email. There are suites of tools and extensions for platforms like Tableau that generate these summaries without you needing to write anything yourself. This is a growing subset of BI, and I'm really excited to see where it's going. One example is actually built into Tableau Server: Ask Data lets you use natural language to interact with a data source, query it in a natural language fashion, and get back a curated, customizable response. That's one example of NLP at work in the interaction layer, and there are other tools and add-ons here that help with automated report writing and natural language querying as well.

Next, we're going to talk about NLP in practice: some tactical, real-world examples of how this could be implemented in a BI project. First, let's outline the process of starting an NLP pipeline or project. There are four stages. The first is setting a goal: assessing what the goal of my analysis is. Will it be exploratory, or do I know exactly what I'm looking for in the text data source? Do I want to classify by topics or themes, extract certain keywords, or just measure things like sentiment and brand perception? Outlining some hypotheses about what you'll want to know ahead of time is usually a good first step. The second is identifying your data source. Oftentimes you'll know your data source before you start, but it's worth prodding a little further: is the data in the right structure, and what will be the level of effort to get the text into a format usable in a BI tool? Will it be structured as a table, semi-structured like JSON, or completely unstructured text? That will determine how much effort you'll need to allocate and lead you to an evaluation point: do you pursue the project or not? But let's say your evaluation goes smoothly, your data source is good and valid, and you've sized the effort to do NLP on your text data. Then you want to understand what resources you have. Do you have the budget or capability to build or use an API service or a software offering that can automate these NLP tasks? Do you have a specific timeline you're adhering to, or is it an open-ended project?
Sometimes NLP projects take a little longer simply because the data is highly subjective and requires manual review and constant calibration; our lexicon and our language are always changing, so you definitely want to budget extra time there. Technical capability matters too: whether you have a team of coders or non-coders will shape how an NLP project is rolled out and which features get used. And the last step is visualization and analysis: understanding which outputs will be most valuable to the end user. Do you want to set up something like a free-text search, which is simpler, or look at the most relevant or most frequent keywords? Do you want trends and frequencies? Or do you want semantic filtering, giving the end user ways to filter and explore the data that aren't just dimensions in a dataset, but more nuanced, subjective attributes like sentiment or a topic that can be traced back to the text records?

In the Tableau ecosystem, as an example, NLP can live in three different areas using the Tableau Python and R analytics extensions. I really like these; I've written a couple of blogs, and we have a whole suite of blogs on InterWorks about getting started with these services. Basically, a lot of NLP models that work out of the box on structured data are accessible in Python, and using either the Python (TabPy) or R server, you can connect them to a Tableau workbook or a Tableau Prep flow (there's a tiny example of this below). And for what I spoke about earlier in terms of curation, exposing NLP in the interaction layer, Tableau Server has Ask Data and Explain Data available, which technically are natural language processing too. So those are the three areas where we could use this in a Tableau context.

Now let's go through a sample project applying these concepts, one I worked on personally: running text analysis on Reddit data. Reddit, for those who don't know, is a social media platform; really, it's a forum for different hobbies and topics where people post comments and posts about their interests, have conversations, upvote, and add pictures and videos. What I wanted to do was look at this data through the lens of someone monitoring brand perception or brand sentiment on social media. So I put on my hat as if I worked for Toyota (I'm a Toyota owner myself, which is why I chose it), and I looked at the subreddit communities made up of Toyota owners. I extracted that data, put it into a Postgres database, and brought it into the Tableau ecosystem from there. Then I went through the process of setting the goal of the analysis and the four steps we outlined earlier: first I set the hypothesis, then I took a small sample, determined which features I wanted to include in the analysis, and validated that they were going to work.
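As a tiny illustration of the analytics-extension route mentioned above: once a TabPy server is connected to Tableau, a calculated field can hand a text column to Python and get a value back per row. This is a sketch, assuming TabPy is running with TextBlob installed on it; the field name [Comment Text] is hypothetical.

    // Tableau calculated field; requires a connected TabPy server.
    // _arg1 receives the values of [Comment Text]; one score returns per row.
    SCRIPT_REAL("
    from textblob import TextBlob
    return [TextBlob(t).sentiment.polarity for t in _arg1]
    ", ATTR([Comment Text]))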
In building the hypothesis, what I wanted to understand, with my Toyota hat on, was: A, are consumers who post on social media happy with their vehicles? That ties directly to sentiment. I wanted each post to come in labeled as a positive, neutral, or negative interaction, and to use that as a filter exposed in my dashboard or analysis. B, which years, makes, and models of car come up most frequently? I wanted to use keyword and entity extraction for that. And C, which components and features of the car are mentioned most often? Someone trying to understand brand perception or consumer sentiment would want to know why people lean toward posting something positive or negative about their car or product, so I wanted to get more specific there.

The next step was sampling the data. With a platform like Reddit, you can very easily vet whether the data source will give you enough detail for NLP to make sense. In this case, I wanted to validate that there was enough text for it to be worth writing code to do NLP, that an end user would be able to cut through it and have it make logical sense, and that it was accessible in a format that would work with the tools at my disposal. Since Reddit has an API, you can easily pull data directly from the platform into a structured dataset and bring it into a BI workflow with Tableau Prep, Tableau Desktop, Alteryx, whatever you use (a rough sketch of this follows below).

Third, I wanted to determine what's important here. Again, I wanted to capture the make, model, and year of the cars mentioned in this data source, the sentiment of each post, and the different car parts and features users mention, labeling my dataset with all of those and bringing it into a format usable in reporting. And the last step was validating that the tooling at my disposal would work. I had Python, Tableau Prep, and Tableau Desktop, and because of the integration between Tableau and Python, I was able to use Python inside a Tableau Prep environment to do NLP.

A little about how I got started there. With Python, I typically start in something like a Jupyter Notebook before plugging my code directly into a tool like Tableau Prep. Alteryx and Dataiku both have this capability too, where you write custom Python code and plug it into a workflow. I like to start in a notebook because it lets me write cleaner documentation, run samples, and do more exploratory analysis before the code goes into the pipeline or flow. So that's where I started testing.
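For the sampling step, here is a rough sketch of pulling posts from a Toyota-related subreddit through Reddit's API with the praw package into a flat table. The credentials and subreddit name are placeholders, and this is not necessarily the extraction code used in the actual project.

    # Sketch: sample Reddit posts into a structured dataset via the API.
    # Assumes: pip install praw pandas; credentials below are placeholders.
    import praw
    import pandas as pd

    reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                         client_secret="YOUR_CLIENT_SECRET",
                         user_agent="nlp-bi-demo by u/your_username")

    rows = []
    for post in reddit.subreddit("Toyota").new(limit=200):
        rows.append({
            "post_id": post.id,
            "subreddit": post.subreddit.display_name,
            "title": post.title,
            "body": post.selftext,
            "score": post.score,
        })

    # A flat table like this can be loaded into Postgres, Tableau Prep,
    # Alteryx, or any other step in the BI workflow.
    sample = pd.DataFrame(rows)
    print(sample.head())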
Before I did my NLP analysis, I started with text preprocessing. The idea is that, to do NLP well, it's best practice to strip your text down to a form that minimizes the waste, the filler words that aren't relevant to the models or packages you'll be using. Removing stop words, and, if you know regex (regular expressions), removing punctuation and bad characters that aren't text: these are the steps you'd take before any NLP analysis. I definitely recommend always retaining the original record, though, so you can show end users the text in its original context. But whatever you're going to run through your NLP model, you want it cleaned of all these things first.

A few examples. One is lemmatization. I always include lemmatization when preprocessing because it strips words down to their base form, their lexical root. In this case, variants like "exercising" and "exercises" are all reduced to the one base word, "exercise." That lets us group records that use the word in different contexts under a higher-level classification, and it makes things easier for the model. Next is stop word removal, as I mentioned: removing filler words like "the," "was," and "from" from the text, so it's stripped down to only the most important, meaningful words. That usually results in higher accuracy for, say, a sentiment model evaluating those words alone. And then tokenization, which a lot of models do inherently. Tokenization breaks a chunk of text or a sentence into smaller, more meaningful units, so that rather than evaluating one large piece of text, the model evaluates each individual word, its meaning, and its context on its own. That's used extensively in sentiment modeling to get the connotation of each word as negative or positive, and then aggregate that across a sentence into a classification. These are the steps I typically go through; they're standard best practice, and a small sketch of them follows.
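Here is what those three preprocessing steps can look like in Python with NLTK. A minimal sketch, assuming the relevant NLTK data files have been downloaded; the sample sentence is invented.

    # Preprocessing sketch: tokenize, remove stop words, lemmatize.
    # Assumes: pip install nltk, plus the one-time downloads below.
    import nltk
    nltk.download("punkt")       # newer NLTK versions may need "punkt_tab"
    nltk.download("stopwords")
    nltk.download("wordnet")
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    text = "I was exercising daily, but the exercises from the app felt repetitive."

    # 1. Tokenization: break the sentence into individual words.
    tokens = word_tokenize(text.lower())

    # 2. Stop word removal: drop filler words like "the", "was", "from".
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stops]

    # 3. Lemmatization: reduce words to a base form ("exercising" -> "exercise").
    lemmatizer = WordNetLemmatizer()
    cleaned = [lemmatizer.lemmatize(t, pos="v") for t in tokens]

    # Keep the original `text` alongside `cleaned` so end users can still
    # see each record in its original context.
    print(cleaned)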
Then, starting on the pipeline itself: sentiment and classification. My goal was to classify the different Reddit comments and posts as negative, positive, or neutral. I used the Natural Language Toolkit (NLTK), a Python package with some out-of-the-box models that are very easy to set up. With essentially one line of code, you can run your text column through it, and it will produce a label you can add to your dataset. I did this and eventually plugged that piece of code into Tableau Prep. For each original text record, the model initially assigns a score indicating whether it skews negative, positive, or neutral. From there, we can start to classify and put this into a context end users understand: while the raw output is a numeric score, we can bin the scores in a way that makes sense for the end user, so they can simply filter on, say, negative or positive posts in the dashboard or dataset.

Next is keyword extraction. My goal here was to extract the relevant entities and keywords. The first thing I wanted was the years people mentioned, the year their car was made. In a line of code, I could search all of the text records, retrieve every record mentioning a year between 1940 and 2022 (the years I know Toyotas have been produced), and append that information to my dataset. The model of the car was derived from the name of the Reddit community, which made it pretty easy, though you could also set this up to extract that information from the text itself. And the final step was extracting the names of car parts and other attributes of the vehicle. Here, going back to abstract topics versus custom classification, I used custom classification, because I was looking for very specific pieces of data. I pulled together keywords related to car parts and did what's called a natural language join: looking for all the instances of those keywords in my dataset and appending a label and topic to the matching records.

So, picturing our Tableau Prep flow: using that extraction step, we have the original text body, the source data for the social media post, and the subject or topic as a record, plus the model year and the name of the Toyota model as extracted fields. These are all now dimensions in our dataset, giving end users easier filtering and the ability to get down to the records they really care about by these meaningful dimensions. Similarly, with sentiment analysis we have the original text body and its labels, all now part of the dataset. From here, we can bring this into a dashboard.

Without going too deep into how I built the dashboard, what's being shown here is a few thousand Reddit interactions across various communities, where we're extracting keywords related to Toyota vehicles and measuring the sentiment associated with those keywords. We can click in and filter the dashboard by certain keywords, see whether most of those interactions are positive, negative, or neutral, and see which model is being talked about and the year of that model. You can see how this starts to become pretty powerful with nuanced text data sources that would be laborious to filter through manually. One insight I found, and since I did this a number of years ago it may not be an insight anymore, is that 2007 Toyota Priuses had a problem with catalytic converter theft, which I know has become a more common issue, at least in larger cities. That was one of the first things that popped to the top when I visualized the data, and it helped validate my findings. You can begin to see how useful this could be if you had data for your business in a similar format.
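Pulling those pieces together, here is a rough sketch of the kind of enrichment described above: binned VADER sentiment from NLTK, a regex for model years, and a keyword "join" for car parts. Tableau Prep's Python (TabPy) integration can call a function of this general shape, a pandas DataFrame in and a pandas DataFrame out; the keyword list, thresholds, and column names are all assumptions for illustration, not the project's actual code.

    # Sketch of an enrichment step shaped like a Tableau Prep Python script:
    # one function that takes and returns a pandas DataFrame.
    # Assumes: pip install nltk pandas, and nltk.download("vader_lexicon").
    import re
    import pandas as pd
    from nltk.sentiment import SentimentIntensityAnalyzer

    CAR_PART_KEYWORDS = {"catalytic converter", "brakes", "battery",
                         "transmission", "headlight"}  # illustrative list

    def enrich(df: pd.DataFrame) -> pd.DataFrame:
        sia = SentimentIntensityAnalyzer()

        def label(text):
            # VADER's compound score runs from -1 to 1; bin it for end users.
            score = sia.polarity_scores(text)["compound"]
            if score >= 0.05:
                return "positive"
            if score <= -0.05:
                return "negative"
            return "neutral"

        def model_year(text):
            # First plausible production year mentioned, 1940 through 2022.
            m = re.search(r"\b(19[4-9]\d|20[0-1]\d|202[0-2])\b", text)
            return int(m.group()) if m else None

        def parts(text):
            # "Natural language join": keyword list matched against the text.
            lowered = text.lower()
            return ", ".join(k for k in CAR_PART_KEYWORDS if k in lowered)

        df["sentiment"] = df["body"].apply(label)
        df["model_year"] = df["body"].apply(model_year)
        df["car_parts"] = df["body"].apply(parts)
        return df

In a notebook you could simply call enrich() on the sampled Reddit table from earlier. When a script step changes the schema like this, Tableau Prep also expects a get_output_schema() helper describing the new columns; check the current Tableau Prep documentation for the exact contract.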
So that's just an example project. I wanted to wrap up by leaving you with some features in the BI tools we like at InterWorks that have NLP functionality built in. In Dataiku, a data preparation and data science tool, there are a number of places where you can plug in NLP in a no-code or low-code environment: you can use PySpark or Python in a notebook, or use plugins and recipes that are no-code. In Tableau, you can use the Python or R integrations to do NLP in either Tableau Desktop or Tableau Prep. And finally, in Alteryx, certain steps and tools let you do NLP through the Python and R integrations and add-ons, and the regular expression (Regex) tool and the Formula tool have these capabilities as well. And some resources to get started: a few Python packages, plus the O'Reilly book shown here, which is what actually got me into NLP personally, so I highly recommend it. We also have content on the InterWorks blog, and I'm continuing to write more on the topic, so you can expect more in the coming months.

Please feel free to start asking questions, because I know we're coming up on time, but I want to leave you with a final thought. NLP can unlock answers to questions that the numbers often don't tell. It can help you further unpack the why behind difficult problems by looking at data where individuals express what they want, what they intend, and how they feel. All of which is to say: NLP can help us explain what the numbers can't. That's why it's a critical source of insight, and why BI professionals should look for opportunities internally to leverage NLP and text analytics to augment their current analytics programs.

All right, thank you so much, Jack. As he mentioned, if you do have a question, please feel free to pop it in the chat or in the Q&A. As it stands, I think you've covered everything absolutely beautifully, because we have no questions at the moment. We'll leave the line open for a couple more minutes in case somebody is frantically typing away, so they have the opportunity to ask. And as Jack said, we do have a lot of blog content that you're more than welcome to jump on and have a look at; Jack is quite a keen writer of said blogs, and I'm sure he'll be doing some more following on from this presentation.

Okay, we do have a question: "I'm an R user. Should I change to Python?" I would say, if you're comfortable in R, stick with it. R has a ton of packages and models akin to those built in Python. I default to Python because, in the data engineering world, or at least where I've worked, it's where I'm more comfortable, but I'd say R is equally good for NLP; they're just different in implementation. To be honest, I'm not personally as strong an R user, so I've never been able to compare them one-to-one. A lot of the teams I've worked with have used Python, so those are the models I'm familiar with, but R probably has a very similar core community of people doing NLP in RStudio. So stick with what you're comfortable with; there are just a few Python packages where certain tasks might be a little easier, with fewer lines of code. But really, it's all about what you're comfortable with.

Thank you. We've also had a request: could you please show the Tableau dashboard once again? Yes, there you go. For anybody interested in whether or not they should buy a Toyota, here is your answer.
Yeah, and this resonated particularly with our Portland, Oregon colleagues; catalytic converter theft has been a problem here for the last few years, so it was sort of funny when I presented this to some colleagues of mine.

Then I think we have one more. We do: "What ML model would you recommend for NL2SQL?" NL2SQL... I'm not entirely sure. I don't know if that's a particular program or if it just means natural language to SQL. It would depend on the underlying task; there are different models for different types of tasks that could produce output usable in SQL. I'm not entirely sure, so I might need to do a little offline research and get back to you on that.

That's not a problem at all; we'll pop an email through. We are at time. Thank you so much, everybody, for joining us. Just a reminder: please go ahead and visit the blog, and we've got more webinars listed on the schedule in the events section. And finally, a copy of the recording will be with you shortly. Jack, thank you so much, and thank you, everybody, for joining us. Thanks, everybody.

In this webinar, Jack Hulbert introduced natural language processing for business intelligence applications. He explored how NLP techniques including sentiment analysis, theme classification, and keyword extraction can unlock insights from unstructured text data like customer feedback, surveys, and social media. Jack explained where NLP fits within the modern BI stack across ETL, data preparation, and interaction layers, discussing tools like Tableau, Alteryx, and Dataiku. He demonstrated a practical implementation using Reddit data to analyze Toyota brand sentiment, extracting vehicle models, years, and car components while classifying posts as positive, negative, or neutral. Vicky Lockett hosted the session, which emphasized NLP’s ability to answer questions that numeric data cannot address.
