The basics of NLP and real time sentiment analysis with open source tools
Özgür (Ozzie) Genc
Senior Tech Leadership | Digital Transformation (ex-P&G, ex-Bain) | CIO
There are 500 million tweets per day and 800 million monthly active users on Instagram, 90 percent of whom are younger than 35. Users post 2.8 million Reddit comments per day, and 68% of Americans use Facebook. A staggering amount of data is generated at every moment, and it is getting extremely difficult to pull the relevant insights out of all that clutter. Is there a way to get a grasp of that for your niche in real time? I will show you one way if you read the rest of this article :) I also deployed a simple real life example at my social listening website (not active anymore) for you to try out…
Image credit: Domo
What is NLP and why is it important?
Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. The goal is for computers to process or “understand” natural language in order to perform various human-like tasks such as language translation or answering questions.
With the rise of voice interfaces and chatbots, NLP is one of the most important technologies of the 4th Industrial Revolution and has become a popular area of AI. There’s a fast-growing collection of useful applications derived from the NLP field, ranging from simple to complex. Below are a few of them:
- Search, spell checking, keyword search, finding synonyms, complex question answering
- Extracting information from websites such as: products, price, dates, locations, people or names
- Machine translation (i.e. Google translate), speech recognition, personal assistants (think about Amazon Alexa, Apple Siri, Facebook M, Google Assistant or Microsoft Cortana)
- Chat bots/dialog agents for customer support, controlling devices, ordering goods
- Matching online advertisements, sentiment analysis for marketing or finance/trading
- Identifying financial risks or fraud
How are words/sentences represented by NLP?
The genius behind NLP is a concept called word embedding. Word embeddings are representations of words as vectors, learned by exploiting vast amounts of text. Each word is mapped to one vector, and the vector values are typically learned by training an artificial neural network.
Each word is represented by a real-valued vector, often with tens or hundreds of dimensions. Here a word vector is a row of real-valued numbers where each number captures a dimension of the word’s meaning and where semantically similar words have similar vectors, e.g. Queen and Princess would have closer vectors.
If we label 4 words (King, Queen, Woman, Princess) with some made up dimensions in a hypothetical word vector, it might look a bit like below:
Source: the morning paper
The numbers in the word vector represent the word’s distributed weight across dimensions. The semantics of the word are embedded across these dimensions of the vector. Another simplified example across 4 dimensions is as below:
Source: medium / Jayesh Bapu Ahire
These hypothetical vector values represent the abstract ‘meaning’ of a word. The beauty of representing words as vectors is that they lend themselves to mathematical operations, so we can compute with them! They can then be used as inputs to an artificial neural network!
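To make this concrete, here is a minimal sketch of that vector arithmetic using gensim’s pre-trained GloVe vectors. Note that gensim is not part of the app built later in this article; it is used here purely for illustration.

```python
import gensim.downloader as api

# small pre-trained GloVe model, downloaded on first use
vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" should land near "queen" in the vector space
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```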
We can visualize the learned vectors by projecting them down to two dimensions as below, and it becomes apparent that the vectors capture useful semantic information about words and their relationships to one another.
Source: Google TensorFlow blog
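If you want to reproduce a simplified version of such a plot yourself, a rough sketch (assuming the same gensim GloVe vectors as above, plus scikit-learn and matplotlib, and an arbitrary choice of words) could look like this:

```python
import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

vectors = api.load("glove-wiki-gigaword-50")
words = ["king", "queen", "man", "woman", "cat", "dog", "paris", "london"]

# project the 50-dimensional vectors down to 2-D for plotting
coords = PCA(n_components=2).fit_transform([vectors[w] for w in words])

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```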
These are distributional vectors, based on the assumption that words appearing in similar contexts possess similar meanings. For example, in the figure below, all the big cats (i.e. cheetah, jaguar, panther, tiger and leopard) are really close in the vector space.
Source: medium / Jose Camacho Collados
A word embedding algorithm takes a large corpus of text as its input and produces a vector space, typically of several hundred dimensions. A neural language model is trained on the corpus (body of text), and the output of the network is used to assign each unique word to a corresponding vector. The most popular word embedding algorithms are Google’s Word2Vec, Stanford’s GloVe and Facebook’s fastText.
Word embeddings represent one of the most successful AI applications of unsupervised learning.
Potential shortcomings
There are shortcomings as well, such as conflation deficiency: the inability to discriminate among the different meanings of a word. For example, the word “bat” has at least two distinct meanings: a flying animal and a piece of sporting equipment. Another challenge is that a text may contain multiple sentiments at once. For instance (source):
“The intent behind the movie was great, but it could have been better”.
The above sentence contains both positive and negative polarity. So how do we conclude whether the review was positive or negative?
The good news is that Artificial Intelligence (AI) now delivers a good enough understanding of complex human language and its nuances at scale and in (almost) real time. Thanks to pre-trained, deep learning powered algorithms, we have started seeing NLP use cases as part of our daily lives.
Latest and greatest popular news on NLP
Pre-trained NLP models can perform human-like language tasks and can be deployed much faster, using reasonable computing resources. And the race is on!
source: the Verge
A recent piece of popular NLP news is the controversy around OpenAI’s new GPT-2 language model: the company refused to open source the full model due to its potential dark uses! Trained on 8 million web pages, GPT-2 can generate long paragraphs of human-like, coherent text and has the potential to create fake news or spoof online identities. It was basically found too dangerous to make public. This is just the beginning; we will see a lot more discussion about the dangers of unregulated AI approaches in the Natural Language Generation field.
Recently there was also news that Google has open sourced its natural language processing (NLP) pre-training model called Bidirectional Encoder Representations from Transformers (BERT). Then Baidu (kind of the “Google of China”) announced its own pre-trained NLP model called “ERNIE”. Lastly, large tech companies and publishers including Facebook and Google’s Jigsaw are trying to find ways to detoxify the abundant abuse and harassment on the Internet, though thousands of human moderators are still needed to avoid scandals until AI and NLP catch up. Stay tuned for more progress & news on NLP!
Source: Vice — Chris Kindred - The Impossible Job: Inside Facebook’s Struggle to Moderate Two Billion People
Social media sentiment analysis
How much can one read, or how many people can one follow, to get the crux of a matter? Maybe you are watching the Super Bowl and curious about what all the other people think about the latest ad during the breaks. Maybe you would like to spot a possible social media crisis, reach out to unhappy customers or help run a marketing/political campaign. Maybe you want to avoid (online) crises or identify the top influencers…
source: https://unamo.com/blog/social/sentiment-analysis-social-media-monitoring
Sentiment Analysis (also known as opinion mining) is a sub-field of NLP that tries to identify and extract opinions within a given text across blogs, reviews, social media, forums, news etc. Sentiment Analysis can help turn all this exponentially growing unstructured text into structured data using NLP and open source tools. For example, Twitter is a treasure trove of sentiment: users share their reactions and opinions on every topic under the sun.
Source: talkwalker
The good news is that in the new world of ML-driven AI, it is possible (and getting better every day) to analyze these text snippets in seconds. There are a lot of similar off-the-shelf commercial tools available, though you can build your own do-it-yourself app just for fun!
Streaming tweets is a fun exercise in data mining. Enthusiasts typically use a powerful Python library called tweepy for real time access to (public) tweets. The simplified idea is that we first (1) generate Twitter API credentials online and then (2) use tweepy together with our credentials to stream tweets based on our filter settings. We can then (3) save these streaming tweets in a database so that we can perform our own search queries, NLP operations and online analytics. That is about it!
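A bare-bones sketch of those three steps, assuming the tweepy 3.x streaming API that was current at the time of writing, with placeholder credentials and keywords, might look like this:

```python
import sqlite3
import tweepy

# (1) credentials generated from your Twitter developer account (placeholders here)
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

# (3) a small SQLite table to store incoming tweets for later queries/analytics
conn = sqlite3.connect("tweets.db")
conn.execute("CREATE TABLE IF NOT EXISTS tweets (created_at TEXT, text TEXT)")

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        conn.execute("INSERT INTO tweets VALUES (?, ?)",
                     (str(status.created_at), status.text))
        conn.commit()

# (2) stream public tweets matching a keyword filter, English only
stream = tweepy.Stream(auth, TweetListener())
stream.filter(track=["game of thrones"], languages=["en"])  # blocks while streaming
```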
What is VADER?
source: Jodie Burchell at https://t-redactyl.io/
The good news is you do not need to be a deep learning or NLP expert to start coding your ideas. One of the readily available pre-trained algorithms is called VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon (a dictionary of sentiments in this case) and a simple rule-based model for general sentiment analysis. Its algorithms are optimized for sentiments expressed in social media like Twitter, online news, movie/product reviews etc. VADER gives us positivity and negativity scores that can be standardized into a range of -1 to 1. VADER is able to include sentiments from emoticons (e.g. :-)), sentiment-related acronyms (e.g. LoL) and slang (e.g. meh), where algorithms typically struggle. Thus VADER is an awesome tool for fresh online text.
Source: Medium — David Oti
While VADER has advantages on social-media-type text, it also doesn’t require any training data, as it is based on a valence-based, human-curated sentiment lexicon. What was also important for me is that it is fast enough to be used online with real time streaming data. The developers of VADER used Amazon’s Mechanical Turk to get most of their ratings, and the model is described fully in an academic paper entitled “VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text”.
The incoming sentences are first split up into individual words via a process called “tokenization”. It is then much easier to look up the sentiment value of each word by comparing it against the sentiment lexicon. There is actually no machine learning going on here: the library parses every tokenized word, compares it with its lexicon and returns the polarity scores, which roll up into an overall sentiment score for the tweet. VADER also has an open sourced Python library that can be installed with a regular pip install. It does not require any training data and can work fast enough to be used with almost REAL TIME streaming data, thus it was an easy choice for my hands-on example.
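A minimal usage sketch of the pip-installable vaderSentiment library (the example tweet text is made up):

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

tweet = "The new episode was AMAZING :-) but the ending was meh"
print(analyzer.polarity_scores(tweet))
# -> a dict with 'neg', 'neu', 'pos' and a normalized 'compound' score between -1 and 1
```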
Basic Data Clean up
Any NLP code would need to do some real time clean-up to remove stop words and punctuation marks, lowercase the text and filter tweets based on a language of interest. The Twitter API (tweepy) has an auto-detect feature for common languages, which I used to filter for English only. There are also some other popular NLP techniques you can further apply, including lemmatisation (converting words to their dictionary form) or stemming (reducing words to their root form), to further improve the results.
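A small clean-up sketch with NLTK, assuming English stop words and the standard punkt tokenizer (the sample sentence is the movie review quoted earlier):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # stop word lists

STOP_WORDS = set(stopwords.words("english"))

def clean(text):
    # lowercase, tokenize, then drop stop words and punctuation
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]

print(clean("The intent behind the movie was GREAT, but it could have been better!"))
```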
Source: hands on example — Vader sentiment for the “Game of Thrones”
Hands on MVP example using live Twitter data:
Finally, I deployed an example model at my demo website to show the power of pre-trained NLP models using real time Twitter data with English tweets only. This minimum viable product is built with only open source tools. The inspiration and the original code come from the Python programming YouTuber Sentdex at this link. I added extra functionalities like a Google-like search experience, a US-state sentiment map that captures tweets with users’ location metadata, a word cloud for the searched terms, and error handling to avoid breakdowns. I found that Twitter users do not maintain their “location” field much, thus the US map includes fewer tweets. You can download the modified code from my GitHub repository and follow these instructions for deployment on a cloud. The code is messy as I edited it in limited time, and I am open to any help to make it look better.
A word-cloud example from the real time Twitter feed for “cobra kai”
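For reference, a stripped-down sketch of the word-cloud piece using the wordcloud package; in the demo app the input text would come from the streamed tweets, here it is just a placeholder string:

```python
import matplotlib.pyplot as plt
from wordcloud import STOPWORDS, WordCloud

# placeholder standing in for the concatenated tweets matching the search term
sample_text = "cobra kai karate season binge nostalgia amazing fight scenes"

cloud = WordCloud(stopwords=STOPWORDS, background_color="white",
                  width=800, height=400).generate(sample_text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```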
Dependencies: Open source tech & cloud
A significant part of the work is getting all these components installed and working together, cleaning up the data and integrating the open source analytics libraries, while the VADER model itself is only a few lines of basic code.
Snapshot: Location based sentiments using the real time tweets that include the word “twitter” at a random time in mid-April 2019.
Open Source Tech: I used Python 3.7 together with various open source libraries. The main ones are:
- Tweepy: Twitter API library to stream public tweets in JSON format
- SQLite3: widely used light-weight relational database
- Pandas: great for reading and manipulating numerical tables and Twitter time series
- NLTK: Natural Language Toolkit
- wordcloud: obvious, huh!
- Flask: micro web framework for web deployment. Love it!
- Dash: enables you to build awesome dashboards using pure Python
- Plotly: popular Python graphing library for interactive and online graphs such as line plots, scatter plots, area charts and bar charts
As mentioned, you first need to register for the Twitter API, install the dependencies, write your code and deploy it to your laptop or on a cloud.
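As a rough illustration of how a few of these pieces fit together, the sketch below reads stored tweets from SQLite with Pandas, scores them with VADER and plots a rolling average with Plotly. The table and column names match the earlier streaming sketch and are assumptions, not the exact code of the demo app:

```python
import sqlite3

import pandas as pd
import plotly.express as px
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# read the tweets stored by the streaming sketch above
conn = sqlite3.connect("tweets.db")
df = pd.read_sql("SELECT created_at, text FROM tweets", conn)

# score each tweet with VADER and smooth the signal with a rolling average
analyzer = SentimentIntensityAnalyzer()
df["sentiment"] = df["text"].apply(lambda t: analyzer.polarity_scores(t)["compound"])
df["smoothed"] = df["sentiment"].rolling(window=100, min_periods=1).mean()

px.line(df, x="created_at", y="smoothed", title="Rolling tweet sentiment").show()
```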
Cloud Deployment: I used Digital Ocean (disclaimer: this is a referral link) as my Virtual Private Server (VPS) provider. I used a low-tier, frugal server, thus the speed/performance of my demo is not super fast, but it works and is relatively secure with SSL. Initially I deployed with AWS Elastic Beanstalk, though I had to pivot as the available Amazon Linux distribution was not compatible with the latest SQLite3 version I wanted to integrate with my code. These are typical challenges when you deal with open source tools, and they require data scientists to be flexible, with a LOT of patience. Interestingly, Digital Ocean gave me more control in this case and it is easy to set up. Another good option is Heroku. If you use AWS, I recommend leveraging a managed database service such as MySQL or MongoDB (vs. SQLite3).
Conclusions
The VADER model demonstrated that it is not perfect but quite indicative. There are some false negatives or positives, as with any algorithm, though more advanced and accurate ML algorithms are coming our way.
Source: Michael used to override KITT’s AI whenever needed :)
These pre-trained NLP capabilities could easily be reapplied to emails, Facebook, Twitter, Instagram, YouTube, Reddit, IMDB, e-retailer reviews, news blogs and the public web. The insights could be parsed by location, demographics, popularity, impact… It has never been easier to measure the pulse of the net or generate human-like content! The scary thing is that these could easily be used for computational propaganda by social media bots or others… beware!
AI/Machine Learning democratizes and enables real time access to critical insights for your niche. Though tracking itself may not be worth it if you’re not going to act on the insights.
Future survivors will need to transform their processes & resources to adopt and adapt to this new age of abundant data and algorithms.
Happy learning! And…
Do not forget to stay in the moment!
Source: John Blanding/The Boston Globe via Getty Image
Credits and Resources:
Most of the Python code is re-applied or adapted from Sentdex (YouTube link below)
https://datameetsmedia.com/vader-sentiment-analysis-explained/